Architecture

Pipeline Overview

RCLI implements a complete voice AI pipeline running on Apple Silicon with Metal GPU acceleration. The architecture prioritizes minimal latency through pre-allocated memory, zero-copy audio transfer, and intelligent caching.

Mic → VAD → STT → [RAG] → LLM → TTS → Speaker
                             |
                      Tool Calling → 43 macOS Actions
                             |
                      [Tool Trace] → TUI (optional)

Pipeline States

The orchestrator maintains an atomic state machine with five states:

enum class PipelineState : uint8_t {
    IDLE,         // Ready for input
    LISTENING,    // Capturing audio
    PROCESSING,   // LLM generation
    SPEAKING,     // TTS playback
    INTERRUPTED   // User cancelled
};

State transitions are atomic and thread-safe using std::atomic<PipelineState>. See src/core/types.h:17-23.
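
A guarded transition can be layered on top of that atomic. The sketch below is illustrative only: `try_transition` and `state_name` are hypothetical helpers, not functions from src/core/types.h. It uses `compare_exchange_strong` so that a transition succeeds only if the state still holds the expected value.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

enum class PipelineState : uint8_t {
    IDLE, LISTENING, PROCESSING, SPEAKING, INTERRUPTED
};

// Hypothetical helper: move from `expected` to `next` only if the state
// still equals `expected`. Returns false if another thread changed it first.
inline bool try_transition(std::atomic<PipelineState>& state,
                           PipelineState expected, PipelineState next) {
    return state.compare_exchange_strong(expected, next,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire);
}

// Hypothetical helper: readable name for logging or a status line.
inline const char* state_name(PipelineState s) {
    switch (s) {
        case PipelineState::IDLE:        return "IDLE";
        case PipelineState::LISTENING:   return "LISTENING";
        case PipelineState::PROCESSING:  return "PROCESSING";
        case PipelineState::SPEAKING:    return "SPEAKING";
        case PipelineState::INTERRUPTED: return "INTERRUPTED";
    }
    return "UNKNOWN";
}
```

A caller that loses the race simply gets `false` back and can re-read the state before deciding what to do next.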

Threading Model

RCLI uses three dedicated threads in live mode, synchronized via condition variables:

STT Thread

Role: Audio capture, VAD filtering, speech detection

Reads 100ms chunks (1600 samples @ 16kHz) from lock-free ring buffer
Computes RMS energy to filter noise (threshold: 0.005 RMS)
Feeds speech segments to Zipformer streaming STT
Emits final transcripts to LLM thread via condition variable
Thread name: rastack.stt

void Orchestrator::stt_thread_fn() {
    pthread_setname_np("rastack.stt");
    
    while (live_running_.load(std::memory_order_relaxed)) {
        // Read audio chunks from ring buffer
        capture_rb_->read(chunk_buf.data(), to_read);
        
        // Energy-based filtering prevents phantom transcripts
        float rms = compute_rms(chunk_buf);
        if (rms > ENERGY_FLOOR || vad_.is_speech()) {
            stt_.feed_audio(chunk_buf.data(), to_read);
        }
        
        // Emit final transcripts
        if (result.is_final) {
            std::lock_guard lock(text_mutex_);
            pending_text_ = result.text;
            text_ready_ = true;
            text_cv_.notify_one();
        }
    }
}

See src/pipeline/orchestrator.cpp:772-841 for implementation.
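
The energy gate relies on an RMS computation that the excerpt does not show. A minimal sketch follows, assuming float samples in [-1, 1]. The pointer-plus-count signature and the `above_energy_floor` helper are assumptions for illustration, and `ENERGY_FLOOR` is set to the 0.005 threshold quoted above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed constant, taken from the 0.005 RMS threshold quoted in the text.
constexpr float ENERGY_FLOOR = 0.005f;

// Root-mean-square energy of a block of float samples.
inline float compute_rms(const float* samples, std::size_t count) {
    if (count == 0) return 0.0f;
    double sum_sq = 0.0;  // accumulate in double to limit rounding error
    for (std::size_t i = 0; i < count; ++i) {
        sum_sq += static_cast<double>(samples[i]) * samples[i];
    }
    return static_cast<float>(std::sqrt(sum_sq / static_cast<double>(count)));
}

// Hypothetical convenience wrapper: true if the chunk is louder than the floor.
inline bool above_energy_floor(const std::vector<float>& chunk) {
    return compute_rms(chunk.data(), chunk.size()) > ENERGY_FLOOR;
}
```

A 100ms chunk at 16kHz is 1600 samples, so this loop is cheap compared with the STT decode it guards.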

LLM Thread

Role: Token generation, tool calling, conversation history management

Waits on text_cv_ for transcribed text from STT thread
Spawns async TTS worker thread for parallel synthesis
Speculative tool detection: Buffers first 15 tokens to detect tool calls before streaming
Adaptive buffering extends window if partial tool tag detected (e.g., <tool)
Trims conversation history to fit context window (token-budget trimming)
Thread name: rastack.llm

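void Orchestrator::llm_thread_fn() {
    pthread_setname_np("rastack.llm");
    
    while (live_running_) {
        // Wait for STT result
        std::unique_lock lock(text_mutex_);
        text_cv_.wait(lock, [this]() { 
            return text_ready_ || !live_running_; 
        });
        if (!live_running_) break;  // Woken for shutdown, not for text
        
        // Take the transcript and release the mutex before generating
        std::string user_text = std::move(pending_text_);
        text_ready_ = false;
        lock.unlock();
        
        // Spawn TTS worker (double-buffered)
        std::thread tts_worker([&]() {
            // Synthesizes sentences while LLM generates next ones
        });
        
        // Speculative first-token detection (simplified)
        auto callback = [&](const TokenOutput& tok) {
            token_buffer += tok.text;
            if (token_buffer.find(tool_call_start) != std::string::npos) {
                detected_tool_call = true;
            } else {
                detector.feed(tok.text);  // Stream to TTS
            }
        };
        
        std::string prompt = llm_.profile().build_user_turn(user_text);
        llm_.generate_with_cached_prompt(prompt, callback);
        
        tts_worker.join();  // A joinable std::thread must not be destroyed
    }
}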

See src/pipeline/orchestrator.cpp:843-1021.
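
Token-budget trimming can be sketched as dropping the oldest turns until the history fits. The `Turn` struct and `trim_history` function below are hypothetical, and the per-turn token counts are assumed to come from the tokenizer. The actual trimming policy in orchestrator.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical record of one conversation turn.
struct Turn {
    std::string role;
    std::string text;
    std::size_t tokens;  // token count, assumed precomputed by the tokenizer
};

// Drops the oldest turns until the total fits within `budget` tokens.
// The newest turn is always kept, even if it alone exceeds the budget.
// Returns the number of turns removed.
inline std::size_t trim_history(std::deque<Turn>& history, std::size_t budget) {
    std::size_t total = 0;
    for (const Turn& t : history) total += t.tokens;

    std::size_t dropped = 0;
    while (history.size() > 1 && total > budget) {
        total -= history.front().tokens;
        history.pop_front();
        ++dropped;
    }
    return dropped;
}
```

The budget here would be the context window minus whatever the cached system prompt and the expected reply need.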

TTS Thread

Role: Sentence-level synthesis, double-buffered playback

Consumes sentences from queue populated by SentenceDetector
Synthesizes audio while previous sentence plays (double-buffering)
Writes directly to playback ring buffer (zero-copy)
Thread name: rastack.tts.live

The SentenceDetector accumulates LLM tokens and flushes complete sentences based on:

Primary breaks (., !, ?, \n) with min 3 words
Secondary breaks (;, :) with min 25 words (prevents long wait)
Word-level flush at 7 words if no punctuation (early TTS start)

See src/pipeline/sentence_detector.cpp:6-84.
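
The primary-break rule can be sketched as a small accumulator. This is a simplified illustration, not the code in sentence_detector.cpp: `SentenceSplitter` is a hypothetical class that implements only the primary breaks with a minimum word count, and it omits the secondary-break and word-level flush rules.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

class SentenceSplitter {
public:
    explicit SentenceSplitter(std::size_t min_words = 3) : min_words_(min_words) {}

    // Appends one token and returns every sentence that token completed.
    std::vector<std::string> feed(const std::string& token) {
        std::vector<std::string> out;
        for (char c : token) {
            buffer_ += c;
            if (is_primary_break(c) && word_count(buffer_) >= min_words_) {
                out.push_back(trimmed(buffer_));
                buffer_.clear();
            }
        }
        return out;
    }

    // Returns whatever text is left once generation ends.
    std::string flush() {
        std::string rest = trimmed(buffer_);
        buffer_.clear();
        return rest;
    }

private:
    static bool is_primary_break(char c) {
        return c == '.' || c == '!' || c == '?' || c == '\n';
    }

    // Counts whitespace-separated words.
    static std::size_t word_count(const std::string& s) {
        std::size_t n = 0;
        bool in_word = false;
        for (unsigned char c : s) {
            if (std::isspace(c)) {
                in_word = false;
            } else if (!in_word) {
                in_word = true;
                ++n;
            }
        }
        return n;
    }

    static std::string trimmed(const std::string& s) {
        std::size_t b = 0, e = s.size();
        while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
        while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
        return s.substr(b, e - b);
    }

    std::size_t min_words_;
    std::string buffer_;
};
```

With the default minimum of 3 words, a short fragment such as "Hi." stays in the buffer and is emitted together with the sentence that follows it.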

Synchronization Primitives

Lock-Free Ring Buffers

Single-Producer Single-Consumer (SPSC) ring buffers with atomic head/tail pointers.

Capture buffer: Mic → STT (16384 samples, ~1 sec @ 16kHz)
Playback buffer: TTS → Speaker (44032 samples, ~2 sec @ 22kHz)
Zero-copy handoff: samples are memcpy'd straight into and out of the ring, with no intermediate staging buffers
Power-of-2 sizing enables fast modulo via bitwise AND
Cache-line aligned to prevent false sharing

template<typename T>
class RingBuffer {
    alignas(64) std::atomic<size_t> write_pos_{0};
    alignas(64) std::atomic<size_t> read_pos_{0};
    size_t capacity_;  // Power of 2
    size_t mask_;      // capacity - 1
    T* data_;          // Pre-allocated from pool
};

See src/core/ring_buffer.h:23-146.
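
The read and write paths are not shown in the excerpt above. The following is a minimal sketch of how an SPSC ring with these members could work, assuming monotonically increasing positions that are masked on access. `SpscRing` is a hypothetical stand-in that owns its storage through `std::unique_ptr` rather than drawing from the pool.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

template <typename T>
class SpscRing {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy requires a trivially copyable T");
public:
    explicit SpscRing(std::size_t capacity)
        : capacity_(capacity), mask_(capacity - 1), data_(new T[capacity]) {
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);  // power of 2
    }

    // Producer side: copies in up to `count` items, returns how many fit.
    std::size_t write(const T* src, std::size_t count) {
        const std::size_t w = write_pos_.load(std::memory_order_relaxed);
        const std::size_t r = read_pos_.load(std::memory_order_acquire);
        const std::size_t n = std::min(count, capacity_ - (w - r));
        const std::size_t idx = w & mask_;
        const std::size_t first = std::min(n, capacity_ - idx);
        std::memcpy(data_.get() + idx, src, first * sizeof(T));
        std::memcpy(data_.get(), src + first, (n - first) * sizeof(T));  // wrap
        write_pos_.store(w + n, std::memory_order_release);  // publish to consumer
        return n;
    }

    // Consumer side: copies out up to `count` items, returns how many were read.
    std::size_t read(T* dst, std::size_t count) {
        const std::size_t r = read_pos_.load(std::memory_order_relaxed);
        const std::size_t w = write_pos_.load(std::memory_order_acquire);
        const std::size_t n = std::min(count, w - r);
        const std::size_t idx = r & mask_;
        const std::size_t first = std::min(n, capacity_ - idx);
        std::memcpy(dst, data_.get() + idx, first * sizeof(T));
        std::memcpy(dst + first, data_.get(), (n - first) * sizeof(T));  // wrap
        read_pos_.store(r + n, std::memory_order_release);  // free space for producer
        return n;
    }

private:
    alignas(64) std::atomic<std::size_t> write_pos_{0};
    alignas(64) std::atomic<std::size_t> read_pos_{0};
    std::size_t capacity_;  // power of 2
    std::size_t mask_;      // capacity - 1
    std::unique_ptr<T[]> data_;
};
```

Each position is written by exactly one thread, so a release store paired with an acquire load on the other side is enough. No compare-and-swap is needed in the SPSC case.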

Condition Variables

Used for STT → LLM communication:

std::mutex text_mutex_;
std::condition_variable text_cv_;
std::string pending_text_;
bool text_ready_ = false;

STT thread signals when final transcript ready:

{
    std::lock_guard lock(text_mutex_);
    pending_text_ = result.text;
    text_ready_ = true;
}
text_cv_.notify_one();

LLM thread waits:

std::unique_lock lock(text_mutex_);
// Also wake on shutdown so the thread can exit
text_cv_.wait(lock, [this]() { return text_ready_ || !live_running_; });

Atomic State

Pipeline state is atomic for lock-free reads:

std::atomic<PipelineState> state_{PipelineState::IDLE};

void set_state(PipelineState new_state) {
    PipelineState old = state_.exchange(new_state, 
        std::memory_order_release);
    if (state_cb_) state_cb_(old, new_state);
}

This lets the TUI poll the state without blocking the pipeline threads.

Memory Management

Pre-Allocated Memory Pool

RCLI allocates a fixed-size memory pool at startup (64-256 MB depending on available RAM). Zero runtime malloc during inference.

class MemoryPool {
    void* base_;                    // Pre-allocated arena
    size_t total_size_;             // Total bytes
    std::atomic<size_t> used_;      // Current offset
    std::atomic<size_t> high_water_; // Peak usage
    
public:
    template<typename T>
    T* alloc(size_t count) {
        // Lock-free bump allocation via CAS loop
        // (bounds check and high_water_ update omitted in this excerpt)
        size_t current = used_.load(std::memory_order_relaxed);
        size_t aligned, new_used;
        do {
            // Recompute on every attempt: a failed CAS reloads `current`
            aligned = (current + CACHE_LINE_SIZE - 1) & ~(CACHE_LINE_SIZE - 1);
            new_used = aligned + count * sizeof(T);
        } while (!used_.compare_exchange_weak(current, new_used));
        // Pointer arithmetic needs a byte pointer, not void*
        return reinterpret_cast<T*>(static_cast<char*>(base_) + aligned);
    }
};

See src/core/memory_pool.h:26-179.

Allocation Strategy

Huge pages (2MB superpages) for pools ≥4MB
- macOS: vm_allocate() with VM_FLAGS_SUPERPAGE_SIZE_2MB
- Linux: mmap() + madvise(MADV_HUGEPAGE)
- Reduces TLB misses by 10-15%
Zero-fill at init to pre-fault pages (avoid runtime page faults)
Cache-line alignment (64 bytes) prevents false sharing

Scratch regions for temporary allocations:

size_t mark = pool.mark();
float* temp = pool.alloc<float>(4096);
// Use temp...
pool.reset_to_mark(mark);  // O(1) reset
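
The mark/reset pattern works because a bump allocator only needs to remember one offset. Below is a single-threaded sketch under stated assumptions: `ScratchArena` is a hypothetical class, not the MemoryPool from memory_pool.h. It aligns to `alignof(T)` rather than to cache lines and returns `nullptr` when space runs out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

class ScratchArena {
public:
    explicit ScratchArena(std::size_t bytes)
        : storage_(new unsigned char[bytes]), size_(bytes) {}

    // Bump-allocates `count` objects of T, or returns nullptr if out of space.
    template <typename T>
    T* alloc(std::size_t count) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "this sketch handles plain data types only");
        const std::size_t aligned = (used_ + alignof(T) - 1) & ~(alignof(T) - 1);
        const std::size_t bytes = count * sizeof(T);
        if (aligned > size_ || bytes > size_ - aligned) return nullptr;
        used_ = aligned + bytes;
        return reinterpret_cast<T*>(storage_.get() + aligned);
    }

    // Remembers the current offset so later allocations can be discarded.
    std::size_t mark() const { return used_; }

    // O(1): rewinds the offset; everything allocated after the mark is freed.
    void reset_to_mark(std::size_t m) {
        assert(m <= used_);
        used_ = m;
    }

    std::size_t used() const { return used_; }

private:
    std::unique_ptr<unsigned char[]> storage_;
    std::size_t size_;
    std::size_t used_ = 0;
};
```

Pointers obtained after a mark must not be used once the arena is reset to that mark, because the next allocation reuses the same bytes.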

Pool Sizing by Hardware

RAM Tier	Pool Size	LLM Batch	Use Case
64+ GB	256 MB	4096	Mac Studio / Pro
32-48 GB	128 MB	2048	M3 Max
16-24 GB	64 MB	1024	M3 / M2 / M1
<16 GB	64 MB	512	iOS / Android

See src/core/hardware_profile.h:131-146 for RAM-based pool sizing.
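
The table can be read as a simple threshold lookup. The sketch below is an illustration, not the logic in hardware_profile.h: `PoolTier` and `tier_for_ram` are hypothetical names. RAM sizes that fall between the listed tiers (for example 28 GB or 56 GB) are assumed to take the next lower tier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type: pool size and LLM batch size for a RAM tier.
struct PoolTier {
    std::size_t pool_bytes;
    int llm_batch;
};

// Maps installed RAM (in GB) to the tiers listed in the table.
// Assumption: sizes between listed tiers fall into the next lower tier.
inline PoolTier tier_for_ram(std::uint64_t ram_gb) {
    constexpr std::size_t MB = 1024 * 1024;
    if (ram_gb >= 64) return {256 * MB, 4096};  // Mac Studio / Pro
    if (ram_gb >= 32) return {128 * MB, 2048};  // M3 Max
    if (ram_gb >= 16) return {64 * MB, 1024};   // M3 / M2 / M1
    return {64 * MB, 512};                      // iOS / Android
}
```

For example, a 36 GB machine lands in the 128 MB tier with a batch size of 2048.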

Design Patterns

Orchestrator Pattern

Central class owns all engines and coordinates data flow. Single point of initialization and state management.

class Orchestrator {
    SttEngine stt_;              // Streaming STT (Zipformer)
    SttEngine offline_stt_;      // Offline STT (Whisper/Parakeet)
    LlmEngine llm_;              // LLM with KV cache
    TtsEngine tts_;              // TTS synthesis
    VadEngine vad_;              // Voice activity detection
    ToolEngine tools_;           // Tool calling
    AudioIO audio_;              // CoreAudio I/O
    
    std::unique_ptr<MemoryPool> pool_;
    std::unique_ptr<RingBuffer<float>> capture_rb_;
    std::unique_ptr<RingBuffer<float>> playback_rb_;
};

See src/pipeline/orchestrator.cpp:12-88 for initialization.

System Prompt KV Caching

The llama.cpp KV cache is reused across queries. The system prompt (including tool definitions) is cached once at init:

bool Orchestrator::init(const PipelineConfig& config) {
    // ... engine setup elided ...

    // Cache tool-aware system prompt in KV cache
    std::string tool_system = llm_.profile().build_tool_system_prompt(
        config.system_prompt, tools_.get_tool_definitions_json());
    llm_.cache_system_prompt(tool_system);
    return true;
}

Subsequent queries only send user turn:

std::string prompt = llm_.profile().build_user_turn(user_text);
llm_.generate_with_cached_prompt(prompt, callback);

This reduces time-to-first-token by 50-70%; in the quoted measurement it drops from ~45ms to ~22ms, roughly half. See src/pipeline/orchestrator.cpp:75-79 and src/pipeline/orchestrator.cpp:232-239.

Double-Buffered TTS

SentenceDetector queues complete sentences. TTS worker synthesizes next sentence while current one plays.

LLM tokens: "Hello" "there" "." "How" "are" "you" "?" 
                    ↓
SentenceDetector:  ["Hello there."]  ["How are you?"]
                         ↓                    ↓
TTS worker:      Synthesize #1          Synthesize #2
                         ↓                    ↓
Playback:           Play #1  ────────────  Play #2

This overlaps TTS synthesis with playback, reducing perceived latency. See src/pipeline/orchestrator.cpp:177-196 for implementation.
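
The benefit of the overlap can be estimated with a small timeline model. The sketch below is a back-of-the-envelope illustration: `Sentence`, `sequential_ms`, and `overlapped_ms` are hypothetical names, and the durations in the usage note are made-up inputs, not measurements. It assumes the worker starts the next sentence as soon as it finishes the current one, and that playback of a sentence starts once both its synthesis and the previous playback are done.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-sentence timings in milliseconds.
struct Sentence {
    double synth_ms;  // time to synthesize the audio
    double play_ms;   // time to play it back
};

// Total time if each sentence is synthesized and then played, one by one.
inline double sequential_ms(const std::vector<Sentence>& sentences) {
    double total = 0.0;
    for (const Sentence& s : sentences) total += s.synth_ms + s.play_ms;
    return total;
}

// Total time when synthesis of the next sentence overlaps current playback.
inline double overlapped_ms(const std::vector<Sentence>& sentences) {
    double synth_done = 0.0;  // when the worker finishes the current sentence
    double play_done = 0.0;   // when the speaker finishes the current sentence
    for (const Sentence& s : sentences) {
        synth_done += s.synth_ms;  // worker runs back to back
        play_done = std::max(synth_done, play_done) + s.play_ms;
    }
    return play_done;
}
```

With illustrative inputs of {100, 400} and {120, 500}, the sequential total is 1120 ms and the overlapped total is 1000 ms. The second synthesis is hidden entirely behind the first playback.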

Hardware Profiling at Startup

Detects CPU topology (P/E cores), RAM, Metal GPU at runtime. Configures optimal llama.cpp params.

const auto& hw = rastack::global_hw();  // Cached singleton

// Detected params:
hw.perf_cores        // P-cores (M3 Max: 10)
hw.effi_cores        // E-cores (M3 Max: 4)
hw.llm_gpu_layers    // 99 (all layers to Metal)
hw.llm_n_threads     // 1 (GPU-bound decode)
hw.llm_flash_attn    // true (Metal supports Flash Attention)
hw.pool_bytes        // 128 MB (36 GB RAM)

See src/core/hardware_profile.h:79-309 for detection logic.

Hot-Swappable Components

LLM Model Swap

Switch LLM at runtime without restarting pipeline:

bool Orchestrator::reload_llm(const LlmConfig& new_config) {
    if (state_ != PipelineState::IDLE) return false;
    
    llm_.shutdown();
    llm_.init(new_config);
    
    tools_.set_model_profile(&llm_.profile());
    recache_system_prompt();  // Re-cache with new model
    
    return true;
}

In TUI: Press M → Select new model → Hot-swap in <1 sec. See src/pipeline/orchestrator.cpp:1030-1051.

Tool Calling Architecture

Hybrid Two-Tier System

Tier 1: Keyword Match (fast path, <1ms)

Matches user query against action keywords
Scores by relevance, returns top-k actions
Filters tool definitions sent to LLM (reduces context)
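
A keyword pre-filter of this kind can be sketched as a substring score followed by a top-k cut. The code below is an assumption-laden illustration: the `Action` struct, `top_k_actions` function, and the action names in the usage note are hypothetical, and the real scoring in the tool engine may weigh matches differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical action descriptor.
struct Action {
    std::string name;
    std::vector<std::string> keywords;  // assumed to be lowercase
};

inline std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns up to k action names whose keywords occur in the query, best first.
// Score = number of distinct keywords found as substrings of the query.
inline std::vector<std::string> top_k_actions(const std::string& query,
                                              const std::vector<Action>& actions,
                                              std::size_t k) {
    const std::string q = to_lower(query);
    std::vector<std::pair<int, std::string>> scored;
    for (const Action& a : actions) {
        int score = 0;
        for (const std::string& kw : a.keywords) {
            if (q.find(kw) != std::string::npos) ++score;
        }
        if (score > 0) scored.emplace_back(score, a.name);
    }
    // Stable sort keeps registration order among equal scores.
    std::stable_sort(scored.begin(), scored.end(),
                     [](const std::pair<int, std::string>& l,
                        const std::pair<int, std::string>& r) {
                         return l.first > r.first;
                     });
    std::vector<std::string> out;
    for (std::size_t i = 0; i < scored.size() && i < k; ++i) {
        out.push_back(scored[i].second);
    }
    return out;
}
```

Only the definitions of the returned actions would then be included in the prompt, which keeps the tool section of the context short.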

Tier 2: LLM Extraction (model-native format)

Qwen3: <tool_call>{"name": "...", "arguments": {...}}</tool_call>
LFM2: <|tool_call_start|>{...}<|tool_call_end|>
Model-specific parsing via ModelProfile
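
Because the two formats differ only in their delimiters, payload extraction can be parameterized by a tag pair. The sketch below is illustrative: `TagPair` and `extract_payload` are hypothetical names, not the ModelProfile API. JSON parsing of the returned payload is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical description of a model's tool-call delimiters.
struct TagPair {
    std::string open;
    std::string close;
};

// Returns the text between the first open/close pair, or "" if the output
// contains no complete tool call.
inline std::string extract_payload(const std::string& output, const TagPair& tags) {
    const std::size_t start = output.find(tags.open);
    if (start == std::string::npos) return "";
    const std::size_t body = start + tags.open.size();
    const std::size_t end = output.find(tags.close, body);
    if (end == std::string::npos) return "";  // call was cut off mid-stream
    return output.substr(body, end - body);
}
```

A Qwen3 profile would pass `{"<tool_call>", "</tool_call>"}` and an LFM2 profile `{"<|tool_call_start|>", "<|tool_call_end|>"}`.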

Speculative Tool Detection

Buffers first 15 tokens to detect tool calls before streaming to TTS:

auto callback = [&](const TokenOutput& tok) {
    token_buffer += tok.text;
    tokens_buffered++;
    
    if (tokens_buffered <= 15) {
        if (token_buffer.find("<tool_call>") != std::string::npos) {
            detected_tool_call = true;  // Stop streaming
        }
    } else if (!detected_tool_call) {
        detector.feed(token_buffer);  // Stream to TTS
        token_buffer.clear();
    }
};

If a tool call is detected, the orchestrator parses it and executes the actions, then runs a second LLM pass with the tool results to generate a natural-language response. See src/pipeline/orchestrator.cpp:215-311.
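
The adaptive part of the buffering, extending the window when the buffer ends in a partial tag such as `<tool`, comes down to a suffix/prefix check. The sketch below shows one way to do it, under assumptions: `TagMatch` and `match_tool_tag` are hypothetical names, and the orchestrator's actual detection logic may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

enum class TagMatch { None, Partial, Full };

// Classifies the buffered text against a tool-call start tag:
//   Full    - the complete tag appears somewhere in the buffer
//   Partial - the buffer ends with a proper prefix of the tag (keep buffering)
//   None    - safe to stream the buffer to TTS
inline TagMatch match_tool_tag(const std::string& buffer, const std::string& tag) {
    if (tag.empty()) return TagMatch::None;
    if (buffer.find(tag) != std::string::npos) return TagMatch::Full;

    const std::size_t max_len = std::min(buffer.size(), tag.size() - 1);
    for (std::size_t len = max_len; len > 0; --len) {
        // Compare the last `len` chars of buffer with the first `len` of tag.
        if (buffer.compare(buffer.size() - len, len, tag, 0, len) == 0) {
            return TagMatch::Partial;
        }
    }
    return TagMatch::None;
}
```

A `Partial` result is the signal to hold back streaming for a few more tokens, even if the fixed token window has already been reached.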

Project Structure

src/
├── engines/        # ML engine wrappers
│   ├── stt_engine.cpp       # Zipformer/Whisper STT
│   ├── llm_engine.cpp       # llama.cpp wrapper
│   ├── tts_engine.cpp       # Piper/Kokoro TTS
│   ├── vad_engine.cpp       # Silero VAD
│   ├── embedding_engine.cpp # Arctic Embed for RAG
│   └── model_profile.cpp    # Model-specific formats
├── pipeline/       # Orchestration
│   ├── orchestrator.cpp     # Main pipeline coordinator
│   └── sentence_detector.cpp # Sentence boundary detection
├── rag/            # Retrieval system
│   ├── vector_index.cpp     # USearch HNSW index
│   ├── bm25_index.cpp       # Full-text search
│   └── hybrid_retriever.cpp # RRF fusion
├── core/           # Low-level primitives
│   ├── types.h              # Pipeline types
│   ├── ring_buffer.h        # Lock-free SPSC queue
│   ├── memory_pool.h        # Pre-allocated arena
│   └── hardware_profile.h   # Runtime detection
├── audio/          # I/O
│   └── audio_io.cpp         # CoreAudio capture/playback
├── tools/          # Tool calling
│   └── tool_engine.cpp      # Parsing + execution
├── actions/        # macOS actions
│   └── 43 action implementations
├── bench/          # Benchmarks
│   └── benchmark.cpp        # STT/LLM/TTS/E2E/RAG/tools
├── api/            # Public interface
│   └── rcli_api.cpp         # C API
└── cli/            # UI
    └── main.cpp             # TUI + CLI commands
