RCLI is organized into distinct modules for voice processing, RAG, actions, and the CLI. This guide explains the directory structure and how components interact.

High-Level Architecture

Mic → VAD → STT (Zipformer) → [RAG Retrieval] → LLM (Qwen3) → TTS (Piper) → Speaker
                                                 │
                                                 └→ Tool Calling → macOS Actions
Core components:
  • Engines — ML inference wrappers (STT, LLM, TTS, VAD, embeddings)
  • Pipeline — Orchestrator coordinates data flow between engines
  • RAG — Hybrid retrieval (vector + BM25) over local documents
  • Actions — 43 macOS integrations via AppleScript and shell
  • CLI — Interactive TUI and command-line interface

Directory Structure

RCLI/
├── src/                    # C++ source code
│   ├── engines/            # ML engine wrappers
│   ├── pipeline/           # Orchestrator and sentence detection
│   ├── rag/                # RAG retrieval system
│   ├── core/               # Core types and utilities
│   ├── audio/              # CoreAudio I/O
│   ├── tools/              # Tool calling engine
│   ├── bench/              # Benchmark harness
│   ├── actions/            # macOS action implementations
│   ├── api/                # Public C API
│   ├── cli/                # TUI and CLI commands
│   ├── models/             # Model registries
│   └── test/               # Test harness
├── deps/                   # Dependencies (gitignored)
│   ├── llama.cpp/          # Cloned by scripts/setup.sh
│   └── sherpa-onnx/        # Cloned by scripts/setup.sh
├── scripts/                # Build and setup scripts
├── Formula/                # Homebrew formula
├── CMakeLists.txt          # CMake build configuration
└── README.md

src/ Modules

engines/

ML inference wrappers for each modality:
  • stt_engine.cpp/.h — Speech-to-text via sherpa-onnx (Zipformer, Whisper, Parakeet)
  • llm_engine.cpp/.h — LLM inference via llama.cpp with Metal GPU
  • tts_engine.cpp/.h — Text-to-speech via sherpa-onnx (Piper, Kokoro, KittenTTS)
  • vad_engine.cpp/.h — Voice activity detection (Silero VAD)
  • embedding_engine.cpp/.h — Text embeddings for RAG (Snowflake Arctic Embed)
  • model_profile.cpp/.h — Model metadata, chat templates, tool call parsing
Design:
  • Each engine wraps a C API (llama.cpp, sherpa-onnx)
  • Engines are initialized once and reused across queries
  • Metal GPU acceleration for LLM and embeddings
  • ONNX Runtime for STT/TTS/VAD

pipeline/

Orchestrates data flow between engines:
  • orchestrator.cpp/.h — Central class that owns all engines and coordinates the pipeline
  • sentence_detector.cpp/.h — Accumulates LLM tokens and flushes complete sentences to TTS
  • text_sanitizer.h — Removes non-speech text (markdown, XML tags) before TTS
Orchestrator responsibilities:
  • Manages pipeline state (IDLE → LISTENING → PROCESSING → SPEAKING)
  • Runs STT/LLM/TTS threads
  • Dispatches tool calls to ActionRegistry
  • Maintains conversation history with token-budget trimming
  • Caches the system-prompt KV state for fast responses
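The token-budget trimming mentioned above could look roughly like the following sketch. `Turn` and `trim_history` are hypothetical names for illustration, not RCLI's actual API; the real implementation presumably also preserves the system prompt and counts tokens via the tokenizer.

```cpp
#include <deque>
#include <string>

// Hypothetical conversation turn; in RCLI the token count would come
// from the LLM tokenizer rather than being stored up front.
struct Turn {
    std::string text;
    int tokens;
};

// Drop the oldest turns until the history fits within the token budget.
void trim_history(std::deque<Turn>& history, int budget) {
    int total = 0;
    for (const auto& t : history) total += t.tokens;
    while (total > budget && !history.empty()) {
        total -= history.front().tokens;  // evict oldest turn first
        history.pop_front();
    }
}
```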

rag/

Hybrid retrieval system for local documents:
  • vector_index.cpp/.h — HNSW vector search via USearch
  • bm25_index.cpp/.h — Full-text search with BM25 ranking
  • hybrid_retriever.cpp/.h — Combines vector + BM25 via Reciprocal Rank Fusion
  • document_processor.cpp/.h — Chunks documents (PDF, DOCX, TXT) into 512-token segments
  • index_builder.cpp/.h — Builds and persists indices
Retrieval flow:
  1. Query is embedded via embedding_engine
  2. Vector search (HNSW) finds nearest chunks
  3. BM25 search finds keyword-matching chunks
  4. Results fused via RRF (Reciprocal Rank Fusion)
  5. Top-k chunks injected into LLM context
Performance: ~4ms retrieval over 5K+ chunks (M3 Max)
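The RRF fusion step above can be sketched as follows. This is an illustrative implementation, not RCLI's actual code; the constant k = 60 is the conventional RRF default, not a confirmed RCLI setting.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Reciprocal Rank Fusion: each chunk's score is the sum of 1/(k + rank)
// over every ranked list it appears in, so chunks found by both the
// vector and BM25 searches rise to the top.
std::vector<std::string> rrf_fuse(const std::vector<std::string>& vector_hits,
                                  const std::vector<std::string>& bm25_hits,
                                  int k = 60) {
    std::unordered_map<std::string, double> score;
    for (size_t i = 0; i < vector_hits.size(); ++i)
        score[vector_hits[i]] += 1.0 / (k + i + 1);
    for (size_t i = 0; i < bm25_hits.size(); ++i)
        score[bm25_hits[i]] += 1.0 / (k + i + 1);

    std::vector<std::string> fused;
    for (const auto& [id, s] : score) fused.push_back(id);
    std::sort(fused.begin(), fused.end(),
              [&](const std::string& a, const std::string& b) {
                  return score[a] > score[b];
              });
    return fused;  // take the top-k of this list for the LLM context
}
```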

core/

Core types and utilities:
  • types.h — Shared types (ToolCall, ToolResult, PipelineState, etc.)
  • ring_buffer.h — Lock-free ring buffer for zero-copy audio transfer
  • memory_pool.h — Pre-allocated 64 MB arena (no runtime malloc)
  • hardware_profile.h — Detects P-cores, E-cores, Metal GPU, RAM
  • log.h — Logging macros (LOG_INFO, LOG_ERROR)
  • base64.h — Base64 encoding/decoding
  • string_utils.h — String manipulation utilities
  • file_utils.h — File I/O helpers
Key design patterns:
  • Lock-free ring buffer — zero-copy audio passing between threads
  • Pre-allocated memory pool — 64 MB arena allocated at init
  • Hardware profiling — adapts thread count and GPU layers to hardware

audio/

CoreAudio microphone and speaker I/O:
  • audio_io.cpp/.h — CoreAudio input/output streams
  • mic_permission.h/.mm — Microphone permission request (Objective-C)
Features:
  • 16 kHz mono capture for STT
  • 24 kHz mono playback for TTS
  • Buffer size: 512 samples (32ms at 16 kHz)
  • Minimal latency configuration

tools/

Tool calling engine:
  • tool_engine.cpp/.h — Parses LLM tool calls and dispatches to ActionRegistry
Tool calling flow:
  1. LLM generates tool call in model-native format (e.g., Qwen3’s <tool_call>)
  2. ToolEngine parses via ModelProfile::parse_tool_calls()
  3. Dispatches to ActionRegistry::execute()
  4. Returns result to LLM
Supported formats:
  • Qwen3: <tool_call>{...}</tool_call>
  • LFM2: <|tool_call_start|>{...}<|tool_call_end|>
  • Generic JSON: {"name": "...", "arguments": {...}}
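A minimal sketch of extracting the Qwen3-style format above. The real parsing lives in ModelProfile::parse_tool_calls(); the function below is hypothetical and returns the raw JSON payloads without validating them.

```cpp
#include <regex>
#include <string>
#include <vector>

// Pull every <tool_call>…</tool_call> payload out of the LLM output.
// Lazy quantifier ([\s\S]*?) matches across newlines and stops at the
// nearest closing tag, so multiple calls in one response are handled.
std::vector<std::string> extract_tool_calls(const std::string& output) {
    static const std::regex re("<tool_call>([\\s\\S]*?)</tool_call>");
    std::vector<std::string> calls;
    for (auto it = std::sregex_iterator(output.begin(), output.end(), re);
         it != std::sregex_iterator(); ++it) {
        calls.push_back((*it)[1].str());  // the JSON between the tags
    }
    return calls;
}
```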

bench/

Benchmark harness:
  • benchmark.cpp/.h — Runs STT, LLM, TTS, E2E, RAG, tools, memory benchmarks
Suites:
  • stt — Transcription latency and accuracy
  • llm — Time to first token, throughput (tok/s)
  • tts — Synthesis latency
  • e2e — Voice-in to audio-out latency
  • rag — Retrieval latency (vector + BM25)
  • tools — Tool calling accuracy and latency
  • memory — Peak memory usage
  • all — All suites
Usage:
rcli bench --suite llm
rcli bench --all-llm --suite llm    # Compare all LLMs
rcli bench --output results.json

actions/

macOS action implementations:
  • action_registry.cpp/.h — Registers actions and dispatches execution
  • action_helpers.h — JSON parsing, string escaping utilities
  • applescript_executor.cpp/.h — Executes AppleScript and shell commands
  • register_all.cpp — Calls all registration functions
Category files:
  • notes_actions.cpp/.h — Apple Notes integration
  • reminders_actions.cpp/.h — Reminders integration
  • messages_actions.cpp/.h — Messages/iMessage
  • app_control_actions.cpp/.h — Open/quit apps
  • window_actions.cpp/.h — Window management
  • system_actions.cpp/.h — System settings (volume, dark mode, lock)
  • media_actions.cpp/.h — Spotify/Apple Music
  • web_actions.cpp/.h — Web search
  • browser_actions.cpp/.h — Safari/Chrome control
  • clipboard_actions.cpp/.h — Clipboard read/write
  • files_actions.cpp/.h — File search
  • navigation_actions.cpp/.h — Maps integration
  • communication_actions.cpp/.h — FaceTime
43 actions total — see Adding Actions

api/

Public C API:
  • rcli_api.h — Public C API header (all engine functionality)
  • rcli_api.cpp — API implementation
Exported functions:
  • rcli_init() — Initialize pipeline
  • rcli_query() — One-shot text query
  • rcli_start_listen() — Start continuous voice mode
  • rcli_stop_listen() — Stop listening
  • rcli_cleanup() — Shutdown pipeline
Use case: Embed RCLI in other applications

cli/

CLI and TUI:
  • main.cpp — Entry point, argument parsing, command dispatch
  • tui_dashboard.h — Interactive TUI dashboard (FTXUI)
  • tui_app.h — TUI event loop
  • actions_cli.h — Actions panel (browse, enable/disable, execute)
  • model_pickers.h — Model management (LLM, STT, TTS)
  • help.h — CLI help text
  • setup_cmds.h — rcli setup, rcli cleanup commands
  • visualizer.h — Waveform visualizer
  • cli_common.h — Shared CLI utilities
TUI features:
  • Push-to-talk (SPACE bar)
  • Models panel (M) — browse, download, hot-swap
  • Actions panel (A) — enable/disable actions
  • Benchmarks panel (B) — run performance tests
  • RAG panel (R) — ingest documents
  • Cleanup panel (D) — remove unused models
  • Tool call trace (T) — debug LLM tool calls

models/

Model registries:
  • model_registry.h — LLM model definitions (id, URL, size, speed, tool calling)
  • tts_model_registry.h — TTS voice definitions
  • stt_model_registry.h — STT model definitions
Model metadata:
  • Download URL (Hugging Face)
  • Size (MB)
  • Speed estimate (tokens/sec)
  • Tool calling capability
  • Default/recommended flags
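A registry entry carrying the metadata above might look like the following; the struct and field names are illustrative assumptions, not the actual contents of model_registry.h.

```cpp
#include <string>

// Hypothetical registry entry mirroring the metadata fields listed above.
struct ModelEntry {
    std::string id;            // e.g. a short model identifier
    std::string url;           // Hugging Face download URL
    int         size_mb;       // download size in MB
    int         tok_per_sec;   // rough speed estimate
    bool        tool_calling;  // supports tool-call formats
    bool        recommended;   // default/recommended flag
};
```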
Usage:
rcli models              # Interactive picker
rcli upgrade-llm         # Guided LLM upgrade
rcli voices              # TTS voice picker

test/

Test harness:
  • test_pipeline.cpp — Pipeline integration tests
Test modes:
  • --actions-only — Fast, no models needed
  • --llm-only — LLM inference tests
  • --stt-only — STT transcription tests
  • --tts-only — TTS synthesis tests
  • --api-only — C API tests
Usage:
./rcli_test ~/Library/RCLI/models
./rcli_test ~/Library/RCLI/models --actions-only

Key Design Patterns

Orchestrator Pattern

The Orchestrator class owns all engines and coordinates data flow:
src/pipeline/orchestrator.h
class Orchestrator {
    STTEngine       stt_;
    LLMEngine       llm_;
    TTSEngine       tts_;
    VADEngine       vad_;
    EmbeddingEngine embedding_;
    ToolEngine      tool_engine_;
    ActionRegistry  action_registry_;
    HybridRetriever rag_retriever_;

    std::atomic<PipelineState> state_;
    // ...
};
Benefits:
  • Single source of truth for pipeline state
  • Simplified thread coordination
  • Easy to add new engines

Lock-Free Ring Buffer

Zero-copy audio transfer between threads:
src/core/ring_buffer.h
template<typename T>
class RingBuffer {
    std::atomic<size_t> read_pos_;
    std::atomic<size_t> write_pos_;
    std::vector<T> buffer_;
    // ...
};
Benefits:
  • No mutex contention
  • Zero-copy (pointers only)
  • Fixed allocation (no runtime malloc)
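Filling in the elided methods, a single-producer/single-consumer push/pop protocol over those members might look like this. It is a sketch of the pattern under acquire/release atomics, not RCLI's actual implementation, and it keeps one slot empty to distinguish full from empty.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

template<typename T>
class RingBuffer {
public:
    explicit RingBuffer(size_t capacity) : buffer_(capacity) {}

    // Producer thread only: publish the write position with release
    // semantics so the consumer sees the item before the new index.
    bool push(const T& item) {
        size_t w = write_pos_.load(std::memory_order_relaxed);
        size_t next = (w + 1) % buffer_.size();
        if (next == read_pos_.load(std::memory_order_acquire))
            return false;                      // buffer full
        buffer_[w] = item;
        write_pos_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer thread only: symmetric acquire on the producer's index.
    bool pop(T& item) {
        size_t r = read_pos_.load(std::memory_order_relaxed);
        if (r == write_pos_.load(std::memory_order_acquire))
            return false;                      // buffer empty
        item = buffer_[r];
        read_pos_.store((r + 1) % buffer_.size(), std::memory_order_release);
        return true;
    }

private:
    std::atomic<size_t> read_pos_{0};
    std::atomic<size_t> write_pos_{0};
    std::vector<T> buffer_;
};
```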

Pre-Allocated Memory Pool

64 MB arena allocated at init:
src/core/memory_pool.h
class MemoryPool {
    std::vector<uint8_t> pool_;
    size_t offset_ = 0;

public:
    explicit MemoryPool(size_t bytes) : pool_(bytes) {}

    // Bump allocation: hand out the next slice of the arena;
    // returns nullptr once the arena is exhausted.
    void* allocate(size_t size) {
        if (offset_ + size > pool_.size()) return nullptr;
        void* ptr = &pool_[offset_];
        offset_ += size;
        return ptr;
    }
};
Benefits:
  • No runtime malloc during inference
  • Predictable latency
  • Cache-friendly (contiguous memory)

System Prompt KV Caching

Reuses llama.cpp KV cache across queries:
src/engines/llm_engine.cpp
void LLMEngine::generate(const std::string& prompt) {
    // First query: process system prompt + user input
    if (!kv_cache_initialized_) {
        eval_tokens(system_prompt_tokens_);
        save_kv_cache();
        kv_cache_initialized_ = true;
    } else {
        // Subsequent queries: restore system prompt cache
        restore_kv_cache();
    }
    eval_tokens(user_input_tokens_);
    // ...
}
Benefits:
  • Avoids reprocessing system prompt (saves ~20-30ms)
  • Lower latency for multi-turn conversations

Sentence-Level TTS Scheduling

TTS synthesizes complete sentences, not token-by-token:
src/pipeline/sentence_detector.cpp
void SentenceDetector::add_token(const std::string& token) {
    buffer_ += token;
    if (is_sentence_boundary(buffer_)) {
        flush_sentence(buffer_);
        buffer_.clear();
    }
}
Benefits:
  • Natural prosody (TTS sees full sentences)
  • Double-buffered playback (next sentence synthesizes while current plays)
  • Lower latency than waiting for full LLM response
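A minimal notion of what is_sentence_boundary() might check is sketched below; the real detector presumably also handles abbreviations, decimals, and markdown, so this is an assumption for illustration only.

```cpp
#include <string>

// Treat terminal punctuation or a newline as a sentence boundary,
// so a complete sentence can be flushed to TTS as soon as it ends.
bool is_sentence_boundary(const std::string& buf) {
    if (buf.empty()) return false;
    char last = buf.back();
    return last == '.' || last == '!' || last == '?' || last == '\n';
}
```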

Threading Model

Three threads run concurrently in live mode:
1. STT Thread
  • Captures mic audio via CoreAudio
  • Runs Silero VAD to filter silence
  • Detects speech endpoints
  • Transcribes via Zipformer (streaming) or Whisper (batch)
  • Signals the LLM thread when a transcription is ready

2. LLM Thread
  • Waits for STT output (std::condition_variable)
  • Generates tokens via llama.cpp with Metal GPU
  • Parses tool calls and dispatches to ActionRegistry
  • Feeds sentences to TTS via SentenceDetector
  • Maintains conversation history with token-budget trimming

3. TTS Thread
  • Queues sentences from the LLM
  • Synthesizes audio via sherpa-onnx (Piper/Kokoro)
  • Double-buffered playback (synthesizes next while playing current)
  • Outputs to the CoreAudio speaker
Synchronization:
  • std::condition_variable for thread wakeup
  • std::atomic<PipelineState> for state transitions
  • Lock-free ring buffers for audio transfer

Dependencies

Vendored (deps/)

Cloned by scripts/setup.sh:
  • llama.cpp — LLM + embedding inference with Metal GPU
  • sherpa-onnx — STT/TTS/VAD via ONNX Runtime

Fetched by CMake

Automatic via FetchContent:
  • USearch v2.16.5 — HNSW vector index (header-only)
  • FTXUI v5.0.0 — Terminal UI library

macOS System Frameworks

  • CoreAudio, AudioToolbox, AudioUnit
  • Foundation, AVFoundation
  • IOKit (hardware monitoring)
  • Metal, MetalKit (GPU acceleration)

Build Outputs

build/
├── rcli                # Main CLI executable
├── rcli_test           # Test executable
├── librcli.a           # Static library (engine + actions)
└── lib/
    ├── libllama.dylib
    ├── libggml.dylib
    └── libsherpa-onnx-c-api.dylib

Configuration Files

Runtime configuration stored in ~/Library/RCLI/:
~/Library/RCLI/
├── models/             # Downloaded models
│   ├── llm/
│   ├── stt/
│   ├── tts/
│   ├── vad/
│   └── embeddings/
├── index/              # RAG indices
│   ├── vector.index
│   ├── bm25.index
│   └── metadata.json
└── config/
    ├── actions.json    # Enabled/disabled actions
    ├── active_models.json  # Active model selection
    └── settings.json   # User preferences

Next Steps

Building from Source

Build and install RCLI locally

Adding Actions

Extend RCLI with custom macOS actions

Contributing

Submit changes and improvements