Voice AI Pipeline
A voice AI pipeline chains three components: speech-to-text (STT) converts spoken audio to text, an LLM generates a response, and text-to-speech (TTS) converts the response back to audio. The result feels like talking to an AI assistant — but building one that feels natural requires understanding latency budgets, streaming, and the architectural trade-offs between cascading and real-time (speech-to-speech) approaches.
The Three-Stage Pipeline
Standard cascade pipeline:
```
Microphone audio (PCM stream)
  → [VAD] Voice Activity Detection — detect speech start/end
  → [STT] Speech-to-Text — audio → transcript text
  → [LLM] Language Model — transcript → response text
  → [TTS] Text-to-Speech — response text → audio stream
Speaker audio output
```
Total latency target: < 800ms end-to-end for natural conversation feel
The Latency Budget
Natural conversation requires a response within roughly 800ms of the user finishing speaking. Each stage must stay within a budget to hit this target. A 12 GB GPU running the full local stack — Whisper Turbo + 8B chat model + Piper TTS — achieves approximately one second of total latency: slightly above the target, but still close enough for a comfortable conversation.
| Stage | Local (GPU) | Cloud API | Optimised target |
|---|---|---|---|
| VAD | 5–20ms | n/a (runs client-side) | < 20ms |
| STT (Whisper Turbo) | 80–200ms | 100–300ms | < 200ms |
| LLM (8B, streaming) | 200–500ms TTFT | 100–400ms TTFT | < 300ms to first token |
| TTS (Piper / fast) | 50–200ms (CPU) | 100–300ms | < 200ms |
| Total | ~600ms–1s | ~400ms–900ms | < 800ms |
Latency tip:
Start TTS before the LLM finishes. Stream LLM tokens to TTS as they arrive — the audio generation for the first sentence starts while the model is still generating the rest. This hides most of the LLM generation latency.
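This tip amounts to a sentence chunker: buffer streamed tokens and hand each complete sentence to TTS as soon as it closes. A minimal sketch is below; `stream_sentences` and the example token list are hypothetical, and a production chunker would also handle abbreviations and decimal numbers that contain periods.

```python
import re

# A sentence ends at . ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"([.!?])\s")

def stream_sentences(tokens):
    """Group streamed LLM tokens into sentences, yielding each one
    as soon as it is complete so TTS can start synthesising early."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            end = match.end(1)
            yield buffer[:end].strip()   # complete sentence: send to TTS now
            buffer = buffer[end:]
    if buffer.strip():                   # flush any trailing partial sentence
        yield buffer.strip()

# Tokens as an LLM might stream them:
tokens = ["Hel", "lo the", "re. ", "How can", " I help", " you today?"]
print(list(stream_sentences(tokens)))
# → ['Hello there.', 'How can I help you today?']
```

TTS can begin synthesising "Hello there." while the remaining tokens are still being generated, which is what hides most of the LLM latency.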
Speech-to-Text (STT)
STT converts microphone audio to transcript text. The choice of STT model determines accuracy, language support, latency, and whether the system can run locally or requires a cloud API.
| Model | Runs locally | Speed | Notes |
|---|---|---|---|
| Whisper Turbo (OpenAI) | Yes | 6–8× faster than large-v3 | Same accuracy as large-v3; best local option for English+multilingual |
| Whisper large-v3 | Yes | Slower | Highest accuracy, multilingual; use when speed is not critical |
| OpenAI Realtime API (GPT-4o) | No | Very fast (streaming) | Native speech-to-speech; replaces the cascade architecture entirely |
| Deepgram Nova-2 | No | Real-time streaming | Low latency, good English accuracy, streaming transcript |
| Picovoice Cheetah Fast | Yes | Ultra-low latency | On-device, optimised for conversational AI; minimal accuracy trade-off |
Voice Activity Detection (VAD)
VAD detects when a user starts and stops speaking. Without it, the pipeline either sends audio continuously (wasting compute) or uses silence duration thresholds (which feel unnatural and cut off slow speakers). Good VAD is invisible — bad VAD creates the frustrating experience of a voice assistant that cuts you off mid-sentence.
VAD options:
- Silero VAD: fast, accurate, runs locally on CPU — the de facto standard for Python voice pipelines
- WebRTC VAD: lightweight, built into many browser-based pipelines, less accurate than Silero
- Turn detection in Realtime APIs: OpenAI Realtime API and Gemini Live include built-in server-side turn detection, removing VAD from the client entirely
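The end-of-speech "hangover" logic that good VAD enables can be sketched over per-frame speech probabilities, such as those Silero VAD produces. The function and parameter names below are illustrative, not any library's API:

```python
def detect_turn_end(speech_probs, frame_ms=30, threshold=0.5, silence_ms=600):
    """Return the frame index where the user's turn ends, or None.

    speech_probs: per-frame speech probabilities (0.0-1.0), e.g. from
    Silero VAD. A turn ends only after silence_ms of consecutive silence
    following speech; this hangover stops the assistant from cutting off
    slow speakers who pause mid-sentence.
    """
    needed = silence_ms // frame_ms      # silent frames required to end the turn
    silent = 0
    heard_speech = False
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            heard_speech = True
            silent = 0                   # any speech resets the hangover timer
        elif heard_speech:
            silent += 1
            if silent >= needed:
                return i                 # end of turn detected at this frame
    return None                          # user still speaking (or never spoke)

# 10 speech frames (300ms) then silence: the turn ends 600ms into the silence.
probs = [0.9] * 10 + [0.05] * 40
print(detect_turn_end(probs))  # → 29
```

Tuning `silence_ms` is the VAD-cutoff mitigation discussed later: too short and slow speakers get interrupted, too long and every response feels laggy.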
Text-to-Speech (TTS)
TTS converts the LLM's text response to audio. Quality ranges from robotic (fast, local) to indistinguishable from human (cloud, slower). For most local deployments, Piper provides the best speed-quality trade-off.
| Engine | Runs locally | Quality | Notes |
|---|---|---|---|
| Piper (rhasspy) | Yes (CPU) | Good | Fastest local option; CPU-based (frees GPU for Whisper + LLM); 100+ voices |
| Kokoro | Yes | Very good | Open-source, high quality, growing in popularity in 2025 |
| ElevenLabs | No | Excellent | Near-human quality, voice cloning, streaming — adds cost and latency |
| OpenAI TTS | Cloud | Excellent | 6 voices, consistent quality, streaming via Realtime API |
| Coqui / XTTS | Yes | Very good | Voice cloning locally; higher GPU requirement than Piper |
Cascade vs Real-Time Architecture
Two fundamentally different approaches exist for voice AI. Choose based on your latency requirements, privacy constraints, and whether you need custom LLM control.
Cascade (STT → LLM → TTS)
- Modular — swap any component independently
- Full control over the LLM (your own model, prompts, RAG, tools)
- Can run fully locally — no data leaves your network
- Works with any LLM API or local model
- Total latency: 600ms–1.5s depending on components
- Best for: enterprise, privacy-sensitive, custom agent integrations
Real-Time / Speech-to-Speech
- Audio goes directly to a model that processes and responds in audio (OpenAI Realtime API, Gemini Live)
- Lowest latency — avoids transcription and TTS stages
- Natural interruption handling built in
- Limited LLM control — you use their model, not yours
- Data must go to the cloud provider
- Best for: consumer products, where low latency and interruption handling matter most
Handling Interruptions
Natural conversation includes interruptions — the user speaks while the AI is still talking. Cascade pipelines require explicit interruption handling; real-time APIs handle this server-side.
Cascade interruption pattern:
- VAD detects user speech while TTS audio is playing
- Stop TTS playback immediately (do not finish the sentence)
- Cancel the in-flight LLM streaming request
- Record new user utterance, run through STT
- Send new transcript to LLM with context that previous response was interrupted
Without interruption handling, the AI finishes its response even after the user has already spoken again — this is the most common reason voice demos feel unnatural.
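The cancellation steps of this pattern map naturally onto asyncio tasks. The sketch below is a minimal illustration, with sleeps standing in for audio playback; a real pipeline would cancel the in-flight LLM request at the same point it cancels playback.

```python
import asyncio

async def speak_response(queue):
    """Play queued TTS sentences until the task is cancelled."""
    spoken = []
    try:
        while not queue.empty():
            sentence = queue.get_nowait()
            await asyncio.sleep(0.2)     # stand-in for audio playback
            spoken.append(sentence)
    except asyncio.CancelledError:
        pass                             # interrupted: stop mid-response
    return spoken

async def conversation():
    queue = asyncio.Queue()
    for s in ["First sentence.", "Second sentence.", "Third sentence."]:
        queue.put_nowait(s)

    playback = asyncio.create_task(speak_response(queue))
    await asyncio.sleep(0.3)   # VAD detects user speech mid-playback...
    playback.cancel()          # ...stop TTS now; also cancel the LLM stream here
    spoken = await playback
    print("Spoken before interruption:", spoken)
    return spoken

asyncio.run(conversation())
```

Cancellation lands mid-way through the second sentence, so only the first is ever spoken; the rest of the queued response is discarded rather than played to completion.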
Full Local Stack Example
Local voice pipeline (12 GB GPU):
- VAD: Silero VAD (CPU, <5ms)
- STT: Whisper Turbo INT8 quantised (GPU, ~100–200ms)
- LLM: Llama 3.1 8B via Ollama or vLLM (GPU, ~200–400ms TTFT)
- TTS: Piper (CPU — frees GPU for Whisper + LLM, ~100ms)
- Orchestration: Python async pipeline with streaming between stages
- Total latency: ~600–900ms in the best case; closer to 1s on typical hardware
Piper runs on CPU intentionally, leaving the GPU for Whisper and the LLM, where it has more impact on latency.
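A minimal orchestration skeleton for one conversational turn might look like the following. Every stage function here is a stub standing in for the real components (Silero VAD, Whisper Turbo, an Ollama/vLLM client, Piper), and all names are illustrative; the only real point is the streaming hand-off between LLM and TTS.

```python
import asyncio

# Stub stages: in the stack above these would wrap Whisper Turbo,
# an Ollama/vLLM client, and Piper respectively.
async def stt(audio: bytes) -> str:
    return "what's the weather like"

async def llm_stream(transcript: str):
    for token in ["It's ", "sunny ", "today. ", "Enjoy ", "it!"]:
        yield token

async def tts_play(sentence: str) -> None:
    print("TTS:", sentence)             # a real stage streams PCM to the speaker

async def handle_utterance(audio: bytes):
    """One turn: STT, then stream LLM tokens into TTS sentence by sentence."""
    transcript = await stt(audio)
    spoken, buffer = [], ""
    async for token in llm_stream(transcript):
        buffer += token
        # Flush at sentence boundaries so playback starts before the
        # LLM has finished generating.
        if buffer.rstrip().endswith((".", "!", "?")):
            await tts_play(buffer.strip())
            spoken.append(buffer.strip())
            buffer = ""
    if buffer.strip():                  # trailing partial sentence
        await tts_play(buffer.strip())
        spoken.append(buffer.strip())
    return spoken

asyncio.run(handle_utterance(b"...pcm..."))
```

The first sentence reaches the speaker while the LLM is still emitting tokens for the second, which is how the stack stays inside its latency budget despite a multi-hundred-millisecond TTFT.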
Voice Pipeline Failure Modes
Common failure modes
- VAD cutoff: assistant interrupts user because VAD triggers on short pause mid-sentence
- STT hallucination: Whisper transcribes background noise as words; sends garbled text to LLM
- TTS latency spike: the LLM's first token arrives fast, but TTS startup adds ~300ms; the user hears a silent gap and perceives no response
- No interruption handling: AI finishes sentence after user has already started talking
- Microphone bleed: TTS audio is picked up by microphone, creating a feedback loop
Mitigations
- Tune VAD end-of-speech silence threshold for your use case (typically 500–800ms)
- Filter short (< 0.5s) low-confidence transcripts before sending to LLM
- Pre-buffer TTS: generate the first sentence during LLM generation before playback starts
- Implement explicit interruption cancellation (stop TTS + cancel LLM request)
- Use AEC (Acoustic Echo Cancellation) to remove speaker output from mic input
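The transcript filter from the second mitigation takes only a few lines. The thresholds are the ones suggested above and the function name is hypothetical; "Thank you." is used in the example because it is a classic Whisper hallucination on background noise.

```python
def should_forward(transcript: str, duration_s: float, confidence: float,
                   min_duration_s: float = 0.5,
                   min_confidence: float = 0.6) -> bool:
    """Drop utterances likely to be STT hallucinations on background noise:
    empty text, or clips that are both very short and low-confidence."""
    text = transcript.strip()
    if not text:
        return False                     # nothing was actually said
    if duration_s < min_duration_s and confidence < min_confidence:
        return False                     # short + low confidence: likely noise
    return True

print(should_forward("Thank you.", 0.3, 0.2))                    # → False
print(should_forward("What's the weather tomorrow?", 1.8, 0.9))  # → True
```

Gating on both duration and confidence keeps genuine short utterances ("yes", "stop") flowing through as long as the STT model is confident in them.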
Checklist: Do You Understand This?
- What are the three main stages of a cascade voice AI pipeline, and what does each convert?
- What is the total latency target for a natural-feeling voice conversation, and which stage is hardest to optimise?
- Why is Piper run on CPU rather than GPU in the local stack, and what does the GPU handle instead?
- What is the difference between a cascade pipeline and a real-time (speech-to-speech) architecture? When would you choose each?
- Describe the five-step interruption handling pattern for a cascade pipeline.
- What is "microphone bleed" and how is it prevented?