Pipeline Overview
RCLI implements a complete voice AI pipeline running on Apple Silicon with Metal GPU acceleration. The architecture prioritizes minimal latency through pre-allocated memory, zero-copy audio transfer, and intelligent caching.
Mic → VAD → STT → [RAG] → LLM → TTS → Speaker
                           |
                     Tool Calling → 43 macOS Actions
                           |
                     [Tool Trace] → TUI (optional)
Pipeline States
The orchestrator maintains an atomic state machine with five states:
enum class PipelineState : uint8_t {
    IDLE,        // Ready for input
    LISTENING,   // Capturing audio
    PROCESSING,  // LLM generation
    SPEAKING,    // TTS playback
    INTERRUPTED  // User cancelled
};
State transitions are atomic and thread-safe using std::atomic<PipelineState>. See src/core/types.h:17-23.
Threading Model
RCLI uses three dedicated threads in live mode, synchronized via condition variables:
STT Thread
Role: Audio capture, VAD filtering, speech detection
Reads 100ms chunks (1600 samples @ 16kHz) from lock-free ring buffer
Computes RMS energy to filter noise (threshold: 0.005 RMS)
Feeds speech segments to Zipformer streaming STT
Emits final transcripts to LLM thread via condition variable
Thread name: rastack.stt
void Orchestrator::stt_thread_fn() {
    pthread_setname_np("rastack.stt");
    while (live_running_.load(std::memory_order_relaxed)) {
        // Read audio chunks from ring buffer
        capture_rb_->read(chunk_buf.data(), to_read);
        // Energy-based filtering prevents phantom transcripts
        float rms = compute_rms(chunk_buf);
        if (rms > ENERGY_FLOOR || vad_.is_speech()) {
            stt_.feed_audio(chunk_buf.data(), to_read);
        }
        // Emit final transcripts
        if (result.is_final) {
            std::lock_guard lock(text_mutex_);
            pending_text_ = result.text;
            text_ready_ = true;
            text_cv_.notify_one();
        }
    }
}
See src/pipeline/orchestrator.cpp:772-841 for implementation.
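The energy gate in the excerpt above can be sketched as follows. The excerpt calls `compute_rms` but does not show its body, so the version here is an assumption, as are the `kEnergyFloor` constant (set to the 0.005 threshold quoted above) and the `passes_energy_gate` helper.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed value, taken from the 0.005 RMS threshold quoted above.
constexpr float kEnergyFloor = 0.005f;

// Root-mean-square energy of one chunk of mono float samples.
// Returns 0 for an empty chunk. (Hypothetical body; the real one is not shown.)
inline float compute_rms(const std::vector<float>& chunk) {
    if (chunk.empty()) return 0.0f;
    double sum_sq = 0.0;
    for (float s : chunk) sum_sq += static_cast<double>(s) * s;
    return static_cast<float>(std::sqrt(sum_sq / chunk.size()));
}

// Hypothetical helper: true when the chunk is loud enough to forward to STT.
inline bool passes_energy_gate(const std::vector<float>& chunk) {
    return compute_rms(chunk) > kEnergyFloor;
}
```

In the excerpt the gate is combined with the VAD result, so a quiet chunk can still reach STT when the VAD reports speech.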
LLM Thread
Role: Token generation, tool calling, conversation history management
Waits on text_cv_ for transcribed text from STT thread
Spawns async TTS worker thread for parallel synthesis
Speculative tool detection: Buffers first 15 tokens to detect tool calls before streaming
Adaptive buffering extends window if partial tool tag detected (e.g., <tool)
Trims conversation history to fit context window (token-budget trimming)
Thread name: rastack.llm
void Orchestrator::llm_thread_fn() {
    pthread_setname_np("rastack.llm");
    while (live_running_) {
        // Wait for STT result
        std::unique_lock lock(text_mutex_);
        text_cv_.wait(lock, [this]() {
            return text_ready_ || !live_running_;
        });
        if (!live_running_) break;  // Woken for shutdown, not for text
        text_ready_ = false;        // Consume the signal
        // Spawn TTS worker (double-buffered)
        std::thread tts_worker([&]() {
            // Synthesizes sentences while LLM generates next ones
        });
        // Speculative first-token detection
        auto callback = [&](const TokenOutput& tok) {
            token_buffer += tok.text;
            if (token_buffer.find(tool_call_start) != std::string::npos) {
                detected_tool_call = true;
            } else {
                detector.feed(tok.text);  // Stream to TTS
            }
        };
        llm_.generate_with_cached_prompt(prompt, callback);
        tts_worker.join();  // A joinable std::thread must be joined before it is destroyed
    }
}
See src/pipeline/orchestrator.cpp:843-1021.
TTS Thread
Role: Sentence-level synthesis, double-buffered playback
Consumes sentences from queue populated by SentenceDetector
Synthesizes audio while previous sentence plays (double-buffering)
Writes directly to playback ring buffer (zero-copy)
Thread name: rastack.tts.live
The SentenceDetector accumulates LLM tokens and flushes complete sentences based on:
Primary breaks (., !, ?, \n) with min 3 words
Secondary breaks (;, :) with min 25 words (prevents long wait)
Word-level flush at 7 words if no punctuation (early TTS start)
See src/pipeline/sentence_detector.cpp:6-84.
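The three flush rules above can be sketched as a small state machine. This is not the code in `sentence_detector.cpp`: the class name, the constants, and in particular the choice to apply the 7-word flush only before the first flush of a response (so that the 25-word secondary rule can still fire later) are assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of the flush rules; not the real SentenceDetector.
class SentenceDetectorSketch {
public:
    // Feed one LLM token; returns every chunk that became ready for TTS.
    std::vector<std::string> feed(const std::string& token) {
        std::vector<std::string> ready;
        for (char c : token) {
            buf_ += c;
            const bool is_space = std::isspace(static_cast<unsigned char>(c)) != 0;
            if (!is_space && !in_word_) ++words_;  // a new word starts here
            in_word_ = !is_space;

            const bool primary = c == '.' || c == '!' || c == '?' || c == '\n';
            const bool secondary = c == ';' || c == ':';
            // Assumption: the word-level flush only applies to the first chunk.
            const bool early = !flushed_once_ && is_space && words_ >= kEarlyWords;

            if ((primary && words_ >= kPrimaryMinWords) ||
                (secondary && words_ >= kSecondaryMinWords) || early) {
                ready.push_back(buf_);
                buf_.clear();
                words_ = 0;
                in_word_ = false;
                flushed_once_ = true;
            }
        }
        return ready;
    }

private:
    static constexpr int kPrimaryMinWords = 3;     // ., !, ?, \n
    static constexpr int kSecondaryMinWords = 25;  // ;, :
    static constexpr int kEarlyWords = 7;          // no punctuation yet
    std::string buf_;
    int words_ = 0;
    bool in_word_ = false;
    bool flushed_once_ = false;
};
```

Feeding the tokens `"One"`, `" two"`, `" three"`, `"."` yields nothing until the final token, which returns the single chunk `"One two three."`. A short reply such as `"Hi."` stays buffered because it has fewer than three words.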
Synchronization Primitives
Single-Producer Single-Consumer (SPSC) ring buffers with atomic head/tail pointers.
Capture buffer: Mic → STT (16384 samples, ~1 sec @ 16kHz)
Playback buffer: TTS → Speaker (44032 samples, ~2 sec @ 22kHz)
Zero-copy hand-off: a single memcpy between producer and consumer, with no intermediate staging buffers
Power-of-2 sizing enables fast modulo via bitwise AND
Cache-line aligned to prevent false sharing
template <typename T>
class RingBuffer {
    alignas(64) std::atomic<size_t> write_pos_{0};
    alignas(64) std::atomic<size_t> read_pos_{0};
    size_t capacity_;  // Power of 2
    size_t mask_;      // capacity - 1
    T* data_;          // Pre-allocated from pool
};
See src/core/ring_buffer.h:23-146.
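To make the masking and the acquire/release pairing concrete, here is a minimal SPSC ring sketch. It is not the code in `ring_buffer.h`: the class name `SpscRingSketch`, the `write`/`read` signatures, and the use of `std::vector` for storage (the real buffer draws from the memory pool) are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-producer single-consumer ring; capacity must be a power of 2.
template <typename T>
class SpscRingSketch {
public:
    explicit SpscRingSketch(std::size_t capacity)
        : capacity_(capacity), mask_(capacity - 1), data_(capacity) {
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    }

    // Producer side: copies in as many items as fit, returns that count.
    std::size_t write(const T* src, std::size_t n) {
        const std::size_t w = write_pos_.load(std::memory_order_relaxed);
        const std::size_t r = read_pos_.load(std::memory_order_acquire);
        const std::size_t free_slots = capacity_ - (w - r);
        if (n > free_slots) n = free_slots;
        for (std::size_t i = 0; i < n; ++i) data_[(w + i) & mask_] = src[i];
        write_pos_.store(w + n, std::memory_order_release);  // publish the data
        return n;
    }

    // Consumer side: copies out as many items as are available, returns that count.
    std::size_t read(T* dst, std::size_t n) {
        const std::size_t r = read_pos_.load(std::memory_order_relaxed);
        const std::size_t w = write_pos_.load(std::memory_order_acquire);
        const std::size_t avail = w - r;
        if (n > avail) n = avail;
        for (std::size_t i = 0; i < n; ++i) dst[i] = data_[(r + i) & mask_];
        read_pos_.store(r + n, std::memory_order_release);  // free the slots
        return n;
    }

private:
    alignas(64) std::atomic<std::size_t> write_pos_{0};
    alignas(64) std::atomic<std::size_t> read_pos_{0};
    std::size_t capacity_;
    std::size_t mask_;  // capacity - 1, so (pos & mask_) == pos % capacity
    std::vector<T> data_;
};
```

The positions only ever grow and are masked on access, so `w - r` is always the number of unread items. Each side writes only its own position, which is why no lock or CAS is needed with exactly one producer and one consumer.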
Used for STT → LLM communication:
std::mutex text_mutex_;
std::condition_variable text_cv_;
std::string pending_text_;
bool text_ready_ = false;
STT thread signals when final transcript ready:
{
    std::lock_guard lock(text_mutex_);
    pending_text_ = result.text;
    text_ready_ = true;
}
text_cv_.notify_one();
LLM thread waits:
std::unique_lock lock(text_mutex_);
text_cv_.wait(lock, [this]() { return text_ready_; });
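The hand-off above can be wrapped into one small helper, which also shows the step the excerpts leave implicit: the consumer clears the ready flag before returning, so the next wait blocks until a new transcript arrives. The class and method names are hypothetical and do not appear in the RCLI source.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical wrapper around the mutex + condition variable + flag pattern.
class TranscriptHandoff {
public:
    // Producer (STT side): store the text, then wake the waiting consumer.
    void publish(std::string text) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_ = std::move(text);
            ready_ = true;
        }
        cv_.notify_one();  // notify after unlocking so the waiter can proceed at once
    }

    // Consumer (LLM side): block until text is ready, then take ownership of it.
    std::string wait_and_take() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return ready_; });  // predicate guards against spurious wakeups
        ready_ = false;
        return std::move(pending_);
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::string pending_;
    bool ready_ = false;
};
```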
Pipeline state is atomic for lock-free reads:
std::atomic<PipelineState> state_{PipelineState::IDLE};
void set_state(PipelineState new_state) {
    PipelineState old = state_.exchange(new_state,
                                        std::memory_order_release);
    if (state_cb_) state_cb_(old, new_state);
}
Allows TUI to poll state without blocking threads.
Memory Management
Pre-Allocated Memory Pool
RCLI allocates a fixed-size memory pool at startup (64-256 MB depending on available RAM). Zero runtime malloc during inference.
class MemoryPool {
    void* base_;                      // Pre-allocated arena
    size_t total_size_;               // Total bytes
    std::atomic<size_t> used_;        // Current offset
    std::atomic<size_t> high_water_;  // Peak usage
public:
    template <typename T>
    T* alloc(size_t count) {
        // Lock-free allocation via CAS loop
        size_t current = used_.load(std::memory_order_relaxed);
        size_t aligned, new_used;
        do {
            // Recompute on every retry: a failed CAS reloads `current`
            aligned = (current + CACHE_LINE_SIZE - 1) & ~(CACHE_LINE_SIZE - 1);
            new_used = aligned + count * sizeof(T);
        } while (!used_.compare_exchange_weak(current, new_used));
        // Pointer arithmetic is not defined on void*, so go through char*
        return reinterpret_cast<T*>(static_cast<char*>(base_) + aligned);
    }
};
See src/core/memory_pool.h:26-179.
Allocation Strategy
Huge pages (2MB superpages) for pools ≥4MB
macOS: vm_allocate() with VM_FLAGS_SUPERPAGE_SIZE_2MB
Linux: mmap() + madvise(MADV_HUGEPAGE)
Reduces TLB misses by 10-15%
Zero-fill at init to pre-fault pages (avoid runtime page faults)
Cache-line alignment (64 bytes) prevents false sharing
Scratch regions for temporary allocations:
size_t mark = pool.mark();
float* temp = pool.alloc<float>(4096);
// Use temp...
pool.reset_to_mark(mark);  // O(1) reset
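The mark/reset pattern is O(1) because a bump allocator's whole state is one offset. A minimal single-threaded sketch follows; the class `ScratchArenaSketch`, its `used()` accessor, and the `std::vector` backing store are assumptions, and unlike the real pool it is not thread-safe.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical single-threaded bump arena illustrating mark() / reset_to_mark().
class ScratchArenaSketch {
public:
    explicit ScratchArenaSketch(std::size_t bytes) : storage_(bytes), used_(0) {}

    // A mark is just the current offset.
    std::size_t mark() const { return used_; }

    // Rewinding the offset releases everything allocated after the mark.
    void reset_to_mark(std::size_t m) {
        if (m <= used_) used_ = m;
    }

    // Bump-allocate `count` objects of trivially copyable type T.
    // Offsets are rounded up to 64 bytes relative to the arena base.
    template <typename T>
    T* alloc(std::size_t count) {
        const std::size_t aligned = (used_ + kAlign - 1) & ~(kAlign - 1);
        const std::size_t end = aligned + count * sizeof(T);
        if (end > storage_.size()) return nullptr;  // arena exhausted
        used_ = end;
        return reinterpret_cast<T*>(storage_.data() + aligned);
    }

    std::size_t used() const { return used_; }

private:
    static constexpr std::size_t kAlign = 64;
    std::vector<unsigned char> storage_;
    std::size_t used_;
};
```

Pointers handed out after a mark must not be used once `reset_to_mark` has run, since the next allocation reuses the same bytes.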
Pool Sizing by Hardware
| RAM Tier | Pool Size | LLM Batch | Use Case |
|----------|-----------|-----------|----------|
| 64+ GB | 256 MB | 4096 | Mac Studio / Pro |
| 32-48 GB | 128 MB | 2048 | M3 Max |
| 16-24 GB | 64 MB | 1024 | M3 / M2 / M1 |
| <16 GB | 64 MB | 512 | iOS / Android |
See src/core/hardware_profile.h:131-146 for RAM-based pool sizing.
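The tier selection can be sketched as a simple threshold lookup. The tiers listed above leave gaps (for example between 24 and 32 GB), so this sketch assumes each tier starts at its lower bound; the struct and function names are hypothetical and not taken from `hardware_profile.h`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: pool size in bytes plus LLM batch size.
struct PoolTier {
    std::size_t pool_bytes;
    int llm_batch;
};

// Assumption: a tier applies from its lower RAM bound upward.
inline PoolTier select_pool_tier(std::uint64_t ram_gb) {
    constexpr std::size_t MB = 1024 * 1024;
    if (ram_gb >= 64) return {256 * MB, 4096};
    if (ram_gb >= 32) return {128 * MB, 2048};
    if (ram_gb >= 16) return {64 * MB, 1024};
    return {64 * MB, 512};
}
```

Under these assumptions a 36 GB machine lands in the 128 MB tier, which matches the `pool_bytes` example given for 36 GB in the hardware profiling section.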
Design Patterns
Orchestrator Pattern
Central class owns all engines and coordinates data flow. Single point of initialization and state management.
class Orchestrator {
    SttEngine stt_;          // Streaming STT (Zipformer)
    SttEngine offline_stt_;  // Offline STT (Whisper/Parakeet)
    LlmEngine llm_;          // LLM with KV cache
    TtsEngine tts_;          // TTS synthesis
    VadEngine vad_;          // Voice activity detection
    ToolEngine tools_;       // Tool calling
    AudioIO audio_;          // CoreAudio I/O
    std::unique_ptr<MemoryPool> pool_;
    std::unique_ptr<RingBuffer<float>> capture_rb_;
    std::unique_ptr<RingBuffer<float>> playback_rb_;
};
See src/pipeline/orchestrator.cpp:12-88 for initialization.
System Prompt KV Caching
The llama.cpp KV cache is reused across queries. The system prompt (including tool definitions) is cached once at init:
bool Orchestrator::init(const PipelineConfig& config) {
    // ... engine setup elided ...
    // Cache tool-aware system prompt in KV cache
    std::string tool_system = llm_.profile().build_tool_system_prompt(
        config.system_prompt, tools_.get_tool_definitions_json());
    llm_.cache_system_prompt(tool_system);
    return true;
}
Subsequent queries only send user turn:
std::string prompt = llm_.profile().build_user_turn(user_text);
llm_.generate_with_cached_prompt(prompt, callback);
This reduces time-to-first-token by 50-70% (from ~45ms to ~22ms).
See src/pipeline/orchestrator.cpp:75-79 and src/pipeline/orchestrator.cpp:232-239.
Double-Buffered TTS
SentenceDetector queues complete sentences. TTS worker synthesizes next sentence while current one plays.
LLM tokens:         "Hello" "there" "."      "How" "are" "you" "?"
                                      ↓
SentenceDetector:   ["Hello there."]         ["How are you?"]
                          ↓                        ↓
TTS worker:         Synthesize #1            Synthesize #2
                          ↓                        ↓
Playback:           Play #1 ──────────────── Play #2
This overlaps TTS synthesis with playback, reducing perceived latency.
See src/pipeline/orchestrator.cpp:177-196 for implementation.
Hardware Profiling at Startup
Detects CPU topology (P/E cores), RAM, Metal GPU at runtime. Configures optimal llama.cpp params.
const auto& hw = rastack::global_hw();  // Cached singleton
// Detected params:
hw.perf_cores      // P-cores (M3 Max: 10)
hw.effi_cores      // E-cores (M3 Max: 4)
hw.llm_gpu_layers  // 99 (all layers to Metal)
hw.llm_n_threads   // 1 (GPU-bound decode)
hw.llm_flash_attn  // true (Metal supports Flash Attention)
hw.pool_bytes      // 128 MB (36 GB RAM)
See src/core/hardware_profile.h:79-309 for detection logic.
Hot-Swappable Components
LLM Model Swap
Switch LLM at runtime without restarting pipeline:
bool Orchestrator::reload_llm(const LlmConfig& new_config) {
    if (state_ != PipelineState::IDLE) return false;
    llm_.shutdown();
    llm_.init(new_config);
    tools_.set_model_profile(&llm_.profile());
    recache_system_prompt();  // Re-cache with new model
    return true;
}
In TUI: Press M → Select new model → Hot-swap in <1 sec.
See src/pipeline/orchestrator.cpp:1030-1051.
Hybrid Two-Tier System
Tier 1: Keyword Match (fast path, <1ms)
Matches user query against action keywords
Scores by relevance, returns top-k actions
Filters tool definitions sent to LLM (reduces context)
Tier 2: LLM Extraction (model-native format)
Qwen3: <tool_call>{"name": "...", "arguments": {...}}</tool_call>
LFM2: <|tool_call_start|>{...}<|tool_call_end|>
Model-specific parsing via ModelProfile
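Tier 1 can be sketched as a keyword-overlap score followed by a top-k cut. The source does not say how relevance is actually computed, so the scoring rule here (one point per matching keyword) is an assumption, as are the `ActionSketch` struct, the `top_k_actions` function, and any action names used with it.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical action descriptor: a name plus lowercase trigger keywords.
struct ActionSketch {
    std::string name;
    std::vector<std::string> keywords;
};

// Returns the names of up to k actions with at least one keyword in the query,
// best score first. Ties keep their original order.
inline std::vector<std::string> top_k_actions(const std::string& query,
                                              const std::vector<ActionSketch>& actions,
                                              std::size_t k) {
    // Lowercase the query and split it on whitespace.
    std::string lowered = query;
    std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::vector<std::string> words;
    std::istringstream in(lowered);
    for (std::string w; in >> w;) words.push_back(w);

    // Score = number of the action's keywords that appear in the query.
    std::vector<std::pair<int, std::string>> scored;
    for (const ActionSketch& a : actions) {
        int score = 0;
        for (const std::string& kw : a.keywords) {
            if (std::find(words.begin(), words.end(), kw) != words.end()) ++score;
        }
        if (score > 0) scored.emplace_back(score, a.name);
    }
    std::stable_sort(scored.begin(), scored.end(),
                     [](const std::pair<int, std::string>& x,
                        const std::pair<int, std::string>& y) { return x.first > y.first; });

    std::vector<std::string> out;
    for (std::size_t i = 0; i < scored.size() && i < k; ++i) out.push_back(scored[i].second);
    return out;
}
```

Only the definitions of the returned actions would then be included in the prompt, which is how the fast path reduces the context sent to the LLM.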
Buffers first 15 tokens to detect tool calls before streaming to TTS:
auto callback = [&](const TokenOutput& tok) {
    token_buffer += tok.text;
    tokens_buffered++;
    if (tokens_buffered <= 15) {
        if (token_buffer.find("<tool_call>") != std::string::npos) {
            detected_tool_call = true;  // Stop streaming
        }
    } else if (!detected_tool_call) {
        detector.feed(token_buffer);  // Stream to TTS
        token_buffer.clear();
    }
};
If a tool call is detected, the orchestrator parses and executes the actions, then runs a second LLM pass with the tool results to generate a natural-language response.
See src/pipeline/orchestrator.cpp:215-311.
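The adaptive part of the buffering, extending the window when the buffer ends in a partial tag such as `<tool`, is not shown in the excerpt. One way to sketch it is a suffix check against the tag's prefixes. The function names, the `BufferDecision` enum, and the exact window semantics below are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// True if the buffer ends with a non-empty proper prefix of `tag`,
// e.g. "Sure, <tool" for the tag "<tool_call>".
inline bool ends_with_partial_tag(const std::string& buffer, const std::string& tag) {
    if (tag.empty()) return false;
    const std::size_t max_len = std::min(buffer.size(), tag.size() - 1);
    for (std::size_t len = max_len; len > 0; --len) {
        if (buffer.compare(buffer.size() - len, len, tag, 0, len) == 0) return true;
    }
    return false;
}

enum class BufferDecision { KeepBuffering, ToolCall, StreamToTts };

// Hypothetical decision step, run after each token is appended to the buffer.
inline BufferDecision decide(const std::string& buffer, std::size_t tokens_buffered,
                             const std::string& tag, std::size_t window = 15) {
    if (buffer.find(tag) != std::string::npos) return BufferDecision::ToolCall;
    if (tokens_buffered < window) return BufferDecision::KeepBuffering;
    // Window exhausted: extend it only while a tag might still be forming.
    if (ends_with_partial_tag(buffer, tag)) return BufferDecision::KeepBuffering;
    return BufferDecision::StreamToTts;
}
```

The tag string would come from the active model's profile, since Qwen3 and LFM2 use different markers.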
Project Structure
src/
├── engines/ # ML engine wrappers
│ ├── stt_engine.cpp # Zipformer/Whisper STT
│ ├── llm_engine.cpp # llama.cpp wrapper
│ ├── tts_engine.cpp # Piper/Kokoro TTS
│ ├── vad_engine.cpp # Silero VAD
│ ├── embedding_engine.cpp # Arctic Embed for RAG
│ └── model_profile.cpp # Model-specific formats
├── pipeline/ # Orchestration
│ ├── orchestrator.cpp # Main pipeline coordinator
│ └── sentence_detector.cpp # Sentence boundary detection
├── rag/ # Retrieval system
│ ├── vector_index.cpp # USearch HNSW index
│ ├── bm25_index.cpp # Full-text search
│ └── hybrid_retriever.cpp # RRF fusion
├── core/ # Low-level primitives
│ ├── types.h # Pipeline types
│ ├── ring_buffer.h # Lock-free SPSC queue
│ ├── memory_pool.h # Pre-allocated arena
│ └── hardware_profile.h # Runtime detection
├── audio/ # I/O
│ └── audio_io.cpp # CoreAudio capture/playback
├── tools/ # Tool calling
│ └── tool_engine.cpp # Parsing + execution
├── actions/ # macOS actions
│ └── 43 action implementations
├── bench/ # Benchmarks
│ └── benchmark.cpp # STT/LLM/TTS/E2E/RAG/tools
├── api/ # Public interface
│ └── rcli_api.cpp # C API
└── cli/ # UI
└── main.cpp # TUI + CLI commands
Next Steps
Performance: Benchmark results and optimization techniques
Configuration: Config files, environment variables, tuning