Models Overview

RCLI runs a complete voice AI pipeline entirely on your Mac, using Metal GPU acceleration on Apple Silicon. All models are stored and run locally, with no cloud dependency.

Model Architecture

RCLI uses 5 model types working together in a multi-threaded pipeline:

STT (Speech-to-Text)

Converts voice input to text using Zipformer (streaming) or Whisper/Parakeet (offline)

LLM (Language Model)

Processes requests and generates responses using Qwen3 or Liquid LFM2 models

TTS (Text-to-Speech)

Synthesizes natural speech output using Piper, Kokoro, or KittenTTS voices

VAD (Voice Activity Detection)

Detects speech vs. silence in real-time using Silero VAD (0.6 MB)

Embeddings

Generates text embeddings for RAG using Snowflake Arctic Embed S (34 MB)
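As a rough illustration of how these models can cooperate in a multi-threaded pipeline, the Python sketch below runs each stage on its own thread and passes work between stages through queues. It is not RCLI's implementation: the stage functions (`fake_vad`, `fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stand-ins that transform strings, the VAD → STT → LLM → TTS ordering is inferred from the roles described above, and the embeddings model (used for RAG lookups) is left out.

```python
import queue
import threading

# Sentinel that tells each stage to shut down and pass the signal on.
STOP = object()

# Hypothetical stand-ins for the real models; each maps one item to the next.
def fake_vad(chunk):
    # Keep only chunks that "contain speech" (here: non-blank strings).
    return chunk if chunk.strip() else None

def fake_stt(audio):
    return f"text({audio})"

def fake_llm(text):
    return f"reply({text})"

def fake_tts(reply):
    return f"audio({reply})"

def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox, forwarding non-None results."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        result = fn(item)
        if result is not None:
            outbox.put(result)

def run_pipeline(chunks):
    fns = [fake_vad, fake_stt, fake_llm, fake_tts]
    # One queue feeding each stage, plus a final output queue.
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(STOP)
    outputs = []
    while (item := queues[-1].get()) is not STOP:
        outputs.append(item)
    for t in threads:
        t.join()
    return outputs

print(run_pipeline(["hello", "   ", "what time is it"]))
# ['audio(reply(text(hello)))', 'audio(reply(text(what time is it)))']
```

Because each stage blocks only on its own input queue, earlier stages can keep accepting input while a later stage is still busy.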

Storage Location

All models are stored in a single directory on your Mac:

~/Library/RCLI/models/
├── lfm2-1.2b-tool-q4_k_m.gguf          # LLM (731 MB)
├── zipformer/                           # Streaming STT (~50 MB)
├── whisper-base.en/                     # Offline STT (~140 MB)
├── piper-voice/                         # TTS voice (~60 MB)
├── silero_vad.onnx                      # VAD (0.6 MB)
└── snowflake-arctic-embed-s-q8_0.gguf  # Embeddings (34 MB)

Total default install size: ~1 GB. Models are downloaded once during rcli setup and persist across sessions.
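The ~1 GB figure is the sum of the approximate component sizes in the tree above. The sketch below reproduces that arithmetic and adds a helper for measuring what is actually installed under ~/Library/RCLI/models/. The helper name `dir_size_bytes` is hypothetical, and the per-model sizes are this page's rounded figures rather than exact file sizes, so the measured number will differ somewhat.

```python
import os
from pathlib import Path

# Approximate sizes (MB) of the default models, as listed in the tree above.
DEFAULT_SIZES_MB = {
    "lfm2-1.2b-tool-q4_k_m.gguf": 731,
    "zipformer/": 50,
    "whisper-base.en/": 140,
    "piper-voice/": 60,
    "silero_vad.onnx": 0.6,
    "snowflake-arctic-embed-s-q8_0.gguf": 34,
}

MODELS_DIR = Path.home() / "Library" / "RCLI" / "models"

def dir_size_bytes(path):
    """Total size in bytes of regular files under path (0 if it is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):  # don't count symlink targets twice
                total += os.path.getsize(fp)
    return total

expected_mb = sum(DEFAULT_SIZES_MB.values())
print(f"documented total: {expected_mb:.1f} MB")  # documented total: 1015.6 MB
print(f"on disk: {dir_size_bytes(MODELS_DIR) / 1e6:.1f} MB")
```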

Default Models

The default model set installed by rcli setup is optimized for speed, quality, and disk efficiency:

Component	Default Model	Size	Key Feature
LLM	Liquid LFM2 1.2B Tool	731 MB	Excellent tool calling (~180 tokens/s)
STT (Streaming)	Zipformer	50 MB	Real-time streaming for live mic
STT (Offline)	Whisper base.en	140 MB	~5% WER, fast batch transcription
TTS	Piper Lessac	60 MB	Fast synthesis, clear English voice
VAD	Silero VAD	0.6 MB	Real-time speech detection
Embeddings	Snowflake Arctic S	34 MB	High-quality text embeddings for RAG
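WER (word error rate), quoted for Whisper base.en above, is the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words: WER = (S + D + I) / N. A minimal sketch of that standard metric follows. It is not part of RCLI, and published benchmarks normally apply fuller text normalization (punctuation, numbers) than the lowercasing done here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # One row of the edit-distance table: prev[j] is the distance between
    # the reference words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)

# One substitution ("four" -> "for") in a 5-word reference: 1/5
print(word_error_rate("set a timer for four", "set a timer for for"))  # 0.2
```

On this scale, ~5% WER means roughly one word error per 20 reference words.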

Model Inference Engines

RCLI uses optimized inference engines for each model type:

llama.cpp — LLM + embedding inference with Metal GPU, KV caching, Flash Attention
sherpa-onnx — STT, TTS, and VAD via ONNX Runtime with Metal acceleration
USearch — HNSW vector index for fast hybrid retrieval (~4ms)
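An HNSW index such as USearch's is an approximate nearest-neighbor structure: it aims to return the same top matches as an exhaustive similarity scan while visiting only a small fraction of the stored vectors. For intuition, the sketch below shows the exhaustive cosine-similarity search it approximates. The document names and toy 3-dimensional vectors are made up (real embeddings are far higher-dimensional), and only the vector-similarity half of hybrid retrieval is shown; whatever other signal RCLI combines it with is not covered here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, docs, k=2):
    """Exhaustive search: score every stored vector, keep the k most similar."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(reverse=True)
    return [doc_id for _score, doc_id in scored[:k]]

# Toy 3-d "embeddings"; real ones would come from the embedding model.
docs = {
    "meeting-notes": [0.9, 0.1, 0.0],
    "grocery-list": [0.0, 0.2, 0.9],
    "project-plan": [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.2, 0.0], docs))  # ['meeting-notes', 'project-plan']
```

The exhaustive scan costs one comparison per stored vector per query, which is the cost an HNSW graph is designed to avoid as the index grows.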

Context Windows

Supported context sizes vary by LLM family:

Model Family	Default Context	Max Context
Qwen3 / Qwen3.5	4,096 tokens	32K to 262K tokens
Liquid LFM2	4,096 tokens	128K tokens

Context size can be adjusted with the --ctx-size <n> flag.
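The context size bounds the total number of tokens the model can attend to at once: the prompt (typically a system prompt, conversation history, and any retrieved RAG passages) plus the reply being generated. The sketch below illustrates one common client-side strategy for staying within a window: keep the newest turns that fit after reserving room for the system prompt and the reply. The function `fit_history`, its default token figures, and the policy itself are hypothetical; this page does not describe how RCLI trims history.

```python
def fit_history(turn_tokens, ctx_size=4096, system_tokens=200, reply_reserve=512):
    """Return indices of the most recent turns that fit in the context window.

    turn_tokens: token count of each past turn, oldest first (hypothetical
    numbers; a real client would count them with the model's tokenizer).
    """
    budget = ctx_size - system_tokens - reply_reserve
    kept = []
    used = 0
    # Walk backwards from the newest turn, keeping turns while they fit.
    for i in range(len(turn_tokens) - 1, -1, -1):
        if used + turn_tokens[i] > budget:
            break
        used += turn_tokens[i]
        kept.append(i)
    return sorted(kept)

history = [900, 1200, 800, 700, 400]
print(fit_history(history))                  # [1, 2, 3, 4]
print(fit_history(history, ctx_size=32768))  # [0, 1, 2, 3, 4]
```

With the 4,096-token default, the oldest 900-token turn no longer fits; raising --ctx-size lets more history fit, generally at the cost of more memory for the model's KV cache.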

Model Selection Persistence

Active model choices persist across sessions in:

~/Library/RCLI/config

Example config file:

model=qwen3.5-4b
tts_model=kokoro-en
stt_model=parakeet-tdt
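The example config is a plain key=value file. If you want to read the active selections from a script, a minimal parser could look like the sketch below. It assumes one key=value pair per line and treats blank lines and lines starting with # as ignorable, which this page does not confirm; `parse_config` and `read_rcli_config` are hypothetical helpers, not part of RCLI.

```python
from pathlib import Path

CONFIG_PATH = Path.home() / "Library" / "RCLI" / "config"

def parse_config(text):
    """Parse key=value lines into a dict (assumed format, see above)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and # comments are ignorable
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no "="
            settings[key.strip()] = value.strip()
    return settings

def read_rcli_config(path=CONFIG_PATH):
    """Return the parsed config, or {} when no config file exists yet."""
    return parse_config(path.read_text()) if path.exists() else {}

example = "model=qwen3.5-4b\ntts_model=kokoro-en\nstt_model=parakeet-tdt\n"
print(parse_config(example)["model"])  # qwen3.5-4b
```

Reading the file is safe while RCLI is running; writing to it directly is not covered here, so the rcli commands below remain the supported way to change models.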

You can switch models at any time using:

rcli models          # Interactive model browser
rcli upgrade-llm     # Guided LLM upgrade
rcli upgrade-stt     # Upgrade to Parakeet TDT
rcli voices          # Switch TTS voices

Next Steps

LLM Models

Compare all 9 language models with specs and benchmarks

STT Models

Explore speech-to-text options (Zipformer, Whisper, Parakeet)

TTS Models

Browse 6 TTS voices with quality ratings

Switching Models

Learn how to hot-swap models without restarting
