Voice AI Pipeline
A voice AI pipeline chains three components: speech-to-text (STT) converts spoken audio to text, an LLM generates a response, and text-to-speech (TTS) converts the response back to audio. The result feels like talking to an AI assistant — but building one that feels natural requires understanding latency budgets, streaming, and the architectural trade-offs between cascading and real-time (speech-to-speech) approaches.
The Three-Stage Pipeline
Standard cascade pipeline:
```
Microphone audio (PCM stream)
  → [VAD] Voice Activity Detection — detect speech start/end
  → [STT] Speech-to-Text — audio → transcript text
  → [LLM] Language Model — transcript → response text
  → [TTS] Text-to-Speech — response text → audio stream
Speaker audio output
```
Total latency target: < 800ms end-to-end for natural conversation feel
The Latency Budget
Natural conversation requires a response within roughly 800ms of the user finishing speaking. Each stage must stay within a budget to hit this target. A 12 GB GPU running the full local stack — Whisper Turbo + 8B chat model + Piper TTS — achieves approximately one second of total latency: slightly above the target, but still close enough for a comfortable conversation.
| Stage | Local (GPU) | Cloud API | Optimised target |
|---|---|---|---|
| VAD | 5–20ms | n/a (runs client-side) | < 20ms |
| STT (Whisper Turbo) | 80–200ms | 100–300ms | < 200ms |
| LLM (8B, streaming) | 200–500ms TTFT | 100–400ms TTFT | < 300ms to first token |
| TTS (Piper / fast) | 50–200ms (CPU) | 100–300ms | < 200ms |
| Total | ~600ms–1s | ~400ms–900ms | < 800ms |
Latency tip:
Start TTS before the LLM finishes. Stream LLM tokens to TTS as they arrive — the audio generation for the first sentence starts while the model is still generating the rest. This hides most of the LLM generation latency.
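This tip amounts to a sentence chunker: buffer streamed tokens and hand each complete sentence to TTS as soon as it closes. A minimal sketch is below; `stream_sentences` and the example token list are hypothetical, and a production chunker would also handle abbreviations and decimal numbers that contain periods.

```python
import re

# A sentence ends at . ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"([.!?])\s")

def stream_sentences(tokens):
    """Group streamed LLM tokens into sentences, yielding each one
    as soon as it is complete so TTS can start synthesising early."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            end = match.end(1)
            yield buffer[:end].strip()   # complete sentence: send to TTS now
            buffer = buffer[end:]
    if buffer.strip():                   # flush any trailing partial sentence
        yield buffer.strip()

# Tokens as an LLM might stream them:
tokens = ["Hel", "lo the", "re. ", "How can", " I help", " you today?"]
print(list(stream_sentences(tokens)))
# → ['Hello there.', 'How can I help you today?']
```

TTS can begin synthesising "Hello there." while the remaining tokens are still being generated, which is what hides most of the LLM latency.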
Speech-to-Text (STT)
STT converts microphone audio to transcript text. The choice of STT model determines accuracy, language support, latency, and whether the system can run locally or requires a cloud API.
| Model | Runs locally | Speed | Notes |
|---|---|---|---|
| Whisper Turbo (OpenAI) | Yes | 6–8× faster than large-v3 | Same accuracy as large-v3; best local option for English+multilingual |
| Whisper large-v3 | Yes | Slower | Highest accuracy, multilingual; use when speed is not critical |
| OpenAI Realtime API (GPT-4o) | No | Very fast (streaming) | Native speech-to-speech; replaces the cascade architecture entirely |
| Deepgram Nova-2 | No | Real-time streaming | Low latency, good English accuracy, streaming transcript |
| Picovoice Cheetah Fast | Yes | Ultra-low latency | On-device, optimised for conversational AI; minimal accuracy trade-off |
Voice Activity Detection (VAD)
VAD detects when a user starts and stops speaking. Without it, the pipeline either sends audio continuously (wasting compute) or uses silence duration thresholds (which feel unnatural and cut off slow speakers). Good VAD is invisible — bad VAD creates the frustrating experience of a voice assistant that cuts you off mid-sentence.
VAD options:
- Silero VAD: fast, accurate, runs locally on CPU — the de facto standard for Python voice pipelines
- WebRTC VAD: lightweight, built into many browser-based pipelines, less accurate than Silero
- Turn detection in Realtime APIs: OpenAI Realtime API and Gemini Live include built-in server-side turn detection, removing VAD from the client entirely
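The end-of-speech "hangover" logic that good VAD enables can be sketched over per-frame speech probabilities, such as those Silero VAD produces. The function and parameter names below are illustrative, not any library's API:

```python
def detect_turn_end(speech_probs, frame_ms=30, threshold=0.5, silence_ms=600):
    """Return the frame index where the user's turn ends, or None.

    speech_probs: per-frame speech probabilities (0.0-1.0), e.g. from
    Silero VAD. A turn ends only after silence_ms of consecutive silence
    following speech; this hangover stops the assistant from cutting off
    slow speakers who pause mid-sentence.
    """
    needed = silence_ms // frame_ms      # silent frames required to end the turn
    silent = 0
    heard_speech = False
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            heard_speech = True
            silent = 0                   # any speech resets the hangover timer
        elif heard_speech:
            silent += 1
            if silent >= needed:
                return i                 # end of turn detected at this frame
    return None                          # user still speaking (or never spoke)

# 10 speech frames (300ms) then silence: the turn ends 600ms into the silence.
probs = [0.9] * 10 + [0.05] * 40
print(detect_turn_end(probs))  # → 29
```

Tuning `silence_ms` is the VAD-cutoff mitigation discussed later: too short and slow speakers get interrupted, too long and every response feels laggy.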
Text-to-Speech (TTS)
TTS converts the LLM's text response to audio. Quality ranges from robotic (fast, local) to indistinguishable from human (cloud, slower). For most local deployments, Piper provides the best speed-quality trade-off.
| Engine | Runs locally | Quality | Notes |
|---|---|---|---|
| Piper (rhasspy) | Yes (CPU) | Good | Fastest local option; CPU-based (frees GPU for Whisper + LLM); 100+ voices |
| Kokoro | Yes | Very good | Open-source, high quality, growing in popularity in 2025 |
| ElevenLabs | No | Excellent | Near-human quality, voice cloning, streaming — adds cost and latency |
| OpenAI TTS | Cloud | Excellent | 6 voices, consistent quality, streaming via Realtime API |
| Coqui / XTTS | Yes | Very good | Voice cloning locally; higher GPU requirement than Piper |
Cascade vs Real-Time Architecture
Two fundamentally different approaches exist for voice AI. Choose based on your latency requirements, privacy constraints, and whether you need custom LLM control.
Cascade (STT → LLM → TTS)
- Modular — swap any component independently
- Full control over the LLM (your own model, prompts, RAG, tools)
- Can run fully locally — no data leaves your network
- Works with any LLM API or local model
- Total latency: 600ms–1.5s depending on components
- Best for: enterprise, privacy-sensitive, custom agent integrations
Real-Time / Speech-to-Speech
- Audio goes directly to a model that processes and responds in audio (OpenAI Realtime API, Gemini Live)
- Lowest latency — avoids transcription and TTS stages
- Natural interruption handling built in
- Limited LLM control — you use their model, not yours
- Data must go to the cloud provider
- Best for: consumer products, where low latency and interruption handling matter most
Handling Interruptions
Natural conversation includes interruptions — the user speaks while the AI is still talking. Cascade pipelines require explicit interruption handling; real-time APIs handle this server-side.
Cascade interruption pattern:
- VAD detects user speech while TTS audio is playing
- Stop TTS playback immediately (do not finish the sentence)
- Cancel the in-flight LLM streaming request
- Record new user utterance, run through STT
- Send new transcript to LLM with context that previous response was interrupted
Without interruption handling, the AI finishes its response even after the user has already spoken again — this is the most common reason voice demos feel unnatural.
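The cancellation steps of this pattern map naturally onto asyncio tasks. The sketch below is a minimal illustration, with sleeps standing in for audio playback; a real pipeline would cancel the in-flight LLM request at the same point it cancels playback.

```python
import asyncio

async def speak_response(queue):
    """Play queued TTS sentences until the task is cancelled."""
    spoken = []
    try:
        while not queue.empty():
            sentence = queue.get_nowait()
            await asyncio.sleep(0.2)     # stand-in for audio playback
            spoken.append(sentence)
    except asyncio.CancelledError:
        pass                             # interrupted: stop mid-response
    return spoken

async def conversation():
    queue = asyncio.Queue()
    for s in ["First sentence.", "Second sentence.", "Third sentence."]:
        queue.put_nowait(s)

    playback = asyncio.create_task(speak_response(queue))
    await asyncio.sleep(0.3)   # VAD detects user speech mid-playback...
    playback.cancel()          # ...stop TTS now; also cancel the LLM stream here
    spoken = await playback
    print("Spoken before interruption:", spoken)
    return spoken

asyncio.run(conversation())
```

Cancellation lands mid-way through the second sentence, so only the first is ever spoken; the rest of the queued response is discarded rather than played to completion.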
Full Local Stack Example
Local voice pipeline (12 GB GPU):
- VAD: Silero VAD (CPU, <5ms)
- STT: Whisper Turbo INT8 quantised (GPU, ~100–200ms)
- LLM: Llama 3.1 8B via Ollama or vLLM (GPU, ~200–400ms TTFT)
- TTS: Piper (CPU — frees GPU for Whisper + LLM, ~100ms)
- Orchestration: Python async pipeline with streaming between stages
- Total latency: ~600–900ms in the best case; closer to 1s on typical hardware
Piper runs on CPU intentionally, leaving the GPU for Whisper and the LLM, where it has more impact on latency.
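A minimal orchestration skeleton for one conversational turn might look like the following. Every stage function here is a stub standing in for the real components (Silero VAD, Whisper Turbo, an Ollama/vLLM client, Piper), and all names are illustrative; the only real point is the streaming hand-off between LLM and TTS.

```python
import asyncio

# Stub stages: in the stack above these would wrap Whisper Turbo,
# an Ollama/vLLM client, and Piper respectively.
async def stt(audio: bytes) -> str:
    return "what's the weather like"

async def llm_stream(transcript: str):
    for token in ["It's ", "sunny ", "today. ", "Enjoy ", "it!"]:
        yield token

async def tts_play(sentence: str) -> None:
    print("TTS:", sentence)             # a real stage streams PCM to the speaker

async def handle_utterance(audio: bytes):
    """One turn: STT, then stream LLM tokens into TTS sentence by sentence."""
    transcript = await stt(audio)
    spoken, buffer = [], ""
    async for token in llm_stream(transcript):
        buffer += token
        # Flush at sentence boundaries so playback starts before the
        # LLM has finished generating.
        if buffer.rstrip().endswith((".", "!", "?")):
            await tts_play(buffer.strip())
            spoken.append(buffer.strip())
            buffer = ""
    if buffer.strip():                  # trailing partial sentence
        await tts_play(buffer.strip())
        spoken.append(buffer.strip())
    return spoken

asyncio.run(handle_utterance(b"...pcm..."))
```

The first sentence reaches the speaker while the LLM is still emitting tokens for the second, which is how the stack stays inside its latency budget despite a multi-hundred-millisecond TTFT.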
Voice Pipeline Failure Modes
Common failure modes
- VAD cutoff: assistant interrupts user because VAD triggers on short pause mid-sentence
- STT hallucination: Whisper transcribes background noise as words; sends garbled text to LLM
- TTS latency spike: the LLM's first token arrives fast, but TTS startup adds ~300ms; the user hears a silent gap and perceives no response
- No interruption handling: AI finishes sentence after user has already started talking
- Microphone bleed: TTS audio is picked up by microphone, creating a feedback loop
Mitigations
- Tune VAD end-of-speech silence threshold for your use case (typically 500–800ms)
- Filter short (< 0.5s) low-confidence transcripts before sending to LLM
- Pre-buffer TTS: generate the first sentence during LLM generation before playback starts
- Implement explicit interruption cancellation (stop TTS + cancel LLM request)
- Use AEC (Acoustic Echo Cancellation) to remove speaker output from mic input
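The transcript filter from the second mitigation takes only a few lines. The thresholds are the ones suggested above and the function name is hypothetical; "Thank you." is used in the example because it is a classic Whisper hallucination on background noise.

```python
def should_forward(transcript: str, duration_s: float, confidence: float,
                   min_duration_s: float = 0.5,
                   min_confidence: float = 0.6) -> bool:
    """Drop utterances likely to be STT hallucinations on background noise:
    empty text, or clips that are both very short and low-confidence."""
    text = transcript.strip()
    if not text:
        return False                     # nothing was actually said
    if duration_s < min_duration_s and confidence < min_confidence:
        return False                     # short + low confidence: likely noise
    return True

print(should_forward("Thank you.", 0.3, 0.2))                    # → False
print(should_forward("What's the weather tomorrow?", 1.8, 0.9))  # → True
```

Gating on both duration and confidence keeps genuine short utterances ("yes", "stop") flowing through as long as the STT model is confident in them.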
Checklist: Do You Understand This?
- What are the three main stages of a cascade voice AI pipeline, and what does each convert?
- What is the total latency target for a natural-feeling voice conversation, and which stage is hardest to optimise?
- Why is Piper run on CPU rather than GPU in the local stack, and what does the GPU handle instead?
- What is the difference between a cascade pipeline and a real-time (speech-to-speech) architecture? When would you choose each?
- Describe the five-step interruption handling pattern for a cascade pipeline.
- What is "microphone bleed" and how is it prevented?