Voice Latency Design
Latency is the make-or-break metric for voice AI. A response under 400ms feels human-like. A response over 1.5 seconds feels broken. Most naive voice pipeline implementations sit at 2–5 seconds — not because the individual components are slow, but because they are wired sequentially instead of in parallel, and because no streaming is used between stages. This page covers the specific optimisations that take a voice pipeline from 2 seconds to under 400ms.
Why 400ms Is the Target
Human conversation has an average turn-taking gap of 200–300ms. A voice AI response under 400ms falls within the range users perceive as natural. At 600–800ms, users notice a pause but tolerate it for complex questions. Above 1.5 seconds, the interaction feels broken and users stop talking naturally, resorting to command-style speech. The 400ms target is not arbitrary — it is the perceptual threshold for naturalness.
| Latency range | User perception |
|---|---|
| < 400ms | Natural conversation — feels human-like |
| 400–800ms | Slight pause — acceptable for complex queries, noticeable |
| 800ms–1.5s | Clearly waiting — tolerable for one-off queries, poor for conversation |
| > 1.5s | Broken — users lose context, resort to command-style speech |
Where Time Goes in a Naive Pipeline
A naive sequential pipeline waits for each stage to fully complete before starting the next. This is the primary source of excess latency — not slow components, but sequential wiring.
Naive pipeline (sequential — ~2–4s total):
1. Wait for user to finish speaking
2. Send full audio clip to STT → wait for full transcript
3. Send full transcript to LLM → wait for full response text
4. Send full response text to TTS → wait for full audio file
5. Play audio
Each stage adds its full duration to the total. No parallelism, no streaming.
Optimised pipeline (streaming + overlap — ~300–500ms to first audio):
1. VAD detects end of speech
2. Stream audio to STT → tokens arrive in chunks
3. Final transcript arrives → immediately sent to LLM (streaming enabled)
4. LLM tokens stream → sentence boundary detector fires at first complete sentence
5. First sentence sent to TTS → audio starts playing while LLM generates sentence 2+
User hears audio ~300ms after speaking. LLM and TTS overlap; user hears sentence 1 while sentence 2 is being generated.
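The overlap between stages can be sketched as a toy asyncio simulation: each stage is a coroutine connected by a queue, so TTS starts on sentence 1 while the LLM is still generating. The sentence texts and stage delays below are invented for illustration, not measurements of real engines.

```python
import asyncio
import time

async def llm(out_q):
    """Simulated LLM: emits one sentence at a time into the queue."""
    for sentence in ["First sentence.", "Second sentence.", "Third sentence."]:
        await asyncio.sleep(0.10)          # simulated time to generate one sentence
        await out_q.put(sentence)
    await out_q.put(None)                  # end-of-stream marker

async def tts(in_q, out_q):
    """Simulated TTS: synthesises each sentence as soon as it arrives."""
    while (sentence := await in_q.get()) is not None:
        await asyncio.sleep(0.05)          # simulated synthesis time
        await out_q.put(f"audio({sentence})")
    await out_q.put(None)

async def run():
    t0 = time.monotonic()
    text_q, audio_q = asyncio.Queue(), asyncio.Queue()
    llm_task = asyncio.create_task(llm(text_q))
    tts_task = asyncio.create_task(tts(text_q, audio_q))
    first_audio = await audio_q.get()      # playback would start here
    ttfa = time.monotonic() - t0           # time-to-first-audio
    await asyncio.gather(llm_task, tts_task)  # drain the rest of the turn
    return ttfa, first_audio

ttfa, first_audio = asyncio.run(run())
print(f"first audio after {ttfa * 1000:.0f}ms: {first_audio}")
```

With these numbers, first audio arrives at roughly 150ms (one sentence generated plus one synthesised), versus ~350ms for the sequential version of the same stages, even though total generation time is unchanged.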
Sentence-Aware Streaming
The single highest-impact optimisation is sentence-aware streaming between the LLM and TTS. Instead of waiting for the complete LLM response, a sentence boundary detector monitors the streaming token output. As soon as a complete, semantically coherent sentence is detected (typically at a period, question mark, or exclamation point followed by a space), that sentence is immediately dispatched to TTS — while the LLM continues generating the rest of the response.
How to implement
- Buffer incoming LLM tokens
- Detect sentence end: `.`, `?`, or `!` followed by whitespace, or end of stream
- On sentence boundary: dispatch buffered text to TTS, clear buffer
- Queue TTS audio chunks for seamless playback
- Flush any remaining buffer at LLM stream end
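The steps above can be sketched as a minimal buffer class, assuming Python 3.8+ and treating `.`, `?`, or `!` followed by whitespace as the boundary. A production detector would also handle abbreviations, decimal numbers, and quoted speech.

```python
import re

# Sentence boundary: ./?/! followed by whitespace.
SENTENCE_END = re.compile(r"([.!?])\s")

class SentenceBuffer:
    """Buffers streaming LLM tokens and yields complete sentences."""

    def __init__(self):
        self.buf = ""

    def feed(self, token):
        """Add one token; return any completed sentences (may be empty)."""
        self.buf += token
        sentences = []
        while (m := SENTENCE_END.search(self.buf)):
            end = m.end(1)                        # include the punctuation
            sentences.append(self.buf[:end].strip())
            self.buf = self.buf[end:].lstrip()    # keep the remainder
        return sentences

    def flush(self):
        """Call at end of stream: return any trailing partial sentence."""
        rest, self.buf = self.buf.strip(), ""
        return rest or None

buf = SentenceBuffer()
dispatched = []                                   # each entry -> send to TTS
for token in ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]:
    dispatched += buf.feed(token)
if (tail := buf.flush()):
    dispatched.append(tail)
print(dispatched)  # ['Hello there.', 'How are you?', 'Fine']
```

Note that "Hello there." is only dispatched once the following token supplies the whitespace after the period; the end-of-stream flush covers the case where the response ends without trailing punctuation.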
Impact
- Streaming implementations achieve 800–1500ms latency reduction vs batch
- Streaming TTS systems can achieve time-to-first-audio under 50ms after the first sentence is dispatched
- VoXtream (2025) achieves first-packet latency as low as 102ms on GPU
- User perceives faster response even if total generation time is the same
LLM Optimisations
After STT, the LLM is typically the largest latency contributor. TTFT (time-to-first-token) is the metric that matters for voice — the user's first audio starts from the first complete sentence, which starts from the first tokens.
Model size selection
For voice, choose the smallest model that meets quality requirements. An 8B model has 3–5× lower TTFT than a 70B model on the same hardware. Quality for conversational voice is often acceptable at 8B — 70B adds latency without proportional quality gain for short conversational responses.
Quantisation (4-bit)
4-bit quantisation with GGUF (llama.cpp) or AWQ achieves up to 40% latency reduction while preserving over 95% of generation quality. Token generation is memory-bandwidth bound, and shrinking weights from 16-bit to 4-bit roughly quarters the data moved per token, which is where most of the speedup comes from. This is one of the highest-ROI optimisations for local voice stacks.
Keep context short
TTFT scales with prompt length. Trim the system prompt to voice-essential content only. For multi-turn conversations, summarise history rather than appending all turns verbatim. A 500-token prompt has measurably lower TTFT than a 3000-token one.
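One way to bound the prompt is to keep the system prompt plus only as many recent turns as fit a token budget. The sketch below uses a crude whitespace count as a stand-in for a real tokenizer; the budget value and the keep-newest-turns policy are illustrative assumptions.

```python
def count_tokens(text):
    """Crude proxy; swap in the model's real tokenizer for accuracy."""
    return len(text.split())

def trim_history(system_prompt, turns, budget=500):
    """Keep the system prompt and as many of the most recent turns as fit."""
    used = count_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):            # newest first
        cost = count_tokens(turn["content"])
        if used + cost > budget:
            break                           # older turns are dropped
        kept.append(turn)
        used += cost
    return [{"role": "system", "content": system_prompt}] + list(reversed(kept))

turns = [
    {"role": "user", "content": "word " * 200},       # old, long turn
    {"role": "assistant", "content": "word " * 200},  # recent, long turn
    {"role": "user", "content": "latest question"},
]
msgs = trim_history("Answer briefly, in a spoken register.", turns, budget=250)
# Oldest long turn is dropped; system prompt and latest turns survive.
```

A refinement in the same spirit is to replace the dropped turns with a one-sentence summary rather than discarding them outright, as suggested above.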
TTS Optimisations
Pre-buffer the first chunk
Generate the first TTS audio chunk while the LLM is still completing the sentence. Some TTS engines support partial input. Even a 50ms pre-buffer eliminates the perceptual gap between LLM response and audio start.
Choose the right TTS for your constraint
- Piper (CPU): ~50–200ms, frees GPU — best choice when GPU is needed for STT + LLM
- Cloud TTS (ElevenLabs, OpenAI): adds network round-trip (~100–300ms) but highest quality
- Streaming cloud TTS: audio starts before full text is received — reduces cloud TTS latency by 50%+
Warm up models before the conversation starts
Cold TTS startup (loading model weights) adds 200–800ms to the first response only. Pre-warm the TTS model at server startup or session start so the first utterance has the same latency as subsequent ones.
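The warm-up itself is just a throwaway synthesis at startup so that weight loading and any lazy initialisation happen before the first real utterance. The engine below is a dummy stand-in with a simulated load cost; substitute whatever synthesis call your TTS engine actually exposes.

```python
import time

class DummyTTS:
    """Stand-in TTS engine whose first call pays a simulated load cost."""

    def __init__(self):
        self.loaded = False

    def synthesize(self, text):
        if not self.loaded:
            time.sleep(0.3)                # simulated cold start: weight loading
            self.loaded = True
        return b"\x00" * len(text)         # fake audio bytes

def warm_up(tts):
    """Run one throwaway synthesis at server or session start."""
    tts.synthesize("warm up")              # output is discarded

tts = DummyTTS()
warm_up(tts)                               # cold-start cost paid here, off the critical path

t0 = time.monotonic()
tts.synthesize("Hello, how can I help?")   # first real utterance: warm path
first_utterance_ms = (time.monotonic() - t0) * 1000
```

Without the `warm_up` call, the 300ms simulated load would land on the first user-facing response instead of on startup.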
Network Optimisations
For cloud-hosted pipelines, network latency is a floor that cannot be eliminated, only minimised. The following optimisations have the most impact.
- Co-locate STT, LLM, and TTS inference: inter-service calls within the same data centre add 1–5ms vs 50–200ms cross-region
- Reuse connections: keep HTTP/2 or WebSocket connections open for the session — avoid TCP + TLS handshake on every request
- Use WebRTC for audio transport: lower latency than HTTP for audio streaming; globally distributed WebRTC infrastructure recommended for production
- Avoid DNS on the critical path: pre-resolve and cache DNS for inference endpoints
- Deploy inference close to users: 100ms round-trip from US to EU adds directly to TTFT — use region-aware routing
Measuring Latency Correctly
Metrics to track
- STT latency: audio end → transcript available
- LLM TTFT: transcript sent → first token received
- Sentence dispatch time: first token → first sentence boundary detected
- TTS start latency: first sentence dispatched → first audio chunk ready
- Time-to-first-audio (TTFA): audio end → first audio plays (the user-perceived metric)
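One way to instrument these is to capture a single monotonic timestamp at each pipeline event and derive the stage metrics from the deltas. This is a sketch; the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TurnTimestamps:
    """Monotonic timestamps (seconds) captured once per conversational turn."""
    audio_end: float        # VAD detected end of user speech
    transcript: float       # STT transcript available
    first_token: float      # first LLM token received
    first_sentence: float   # sentence boundary detector fired
    first_audio: float      # first TTS audio chunk ready to play

    def metrics_ms(self):
        ms = lambda a, b: (b - a) * 1000
        return {
            "stt": ms(self.audio_end, self.transcript),
            "llm_ttft": ms(self.transcript, self.first_token),
            "sentence_dispatch": ms(self.first_token, self.first_sentence),
            "tts_start": ms(self.first_sentence, self.first_audio),
            "ttfa": ms(self.audio_end, self.first_audio),  # user-perceived
        }

# Example turn: 120ms STT, 130ms TTFT, 60ms to sentence, 50ms TTS start.
m = TurnTimestamps(0.0, 0.12, 0.25, 0.31, 0.36).metrics_ms()
```

Deriving everything from one set of timestamps guarantees the stage metrics sum to TTFA, which makes it easy to see which stage regressed when the user-perceived number drifts.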
Common measurement mistakes
- Measuring P50 only — voice latency spikes at P95/P99 are what users remember
- Measuring on localhost — add network latency simulation for realistic numbers
- Timing from request send, not from audio end — misses VAD and buffering time
- Not separating cold start from warm path — cold starts skew P50 upward
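The P50-versus-tail point is easy to see with a toy dataset: a pure-Python nearest-rank percentile over recorded TTFA samples, where a handful of slow turns leaves the median untouched but dominates P99.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; fine for monitoring, no interpolation."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(s)))   # 1-based nearest rank
    return s[k - 1]

# 98 fast turns plus 2 slow outliers out of 100:
ttfa_ms = [350] * 98 + [2400] * 2

p50 = percentile(ttfa_ms, 50)   # 350 — looks healthy
p95 = percentile(ttfa_ms, 95)   # 350 — still looks healthy
p99 = percentile(ttfa_ms, 99)   # 2400 — the turns users remember
```

A dashboard showing only the 350ms median would report this pipeline as natural-feeling while one turn in fifty lands well past the broken threshold.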
Checklist: Do You Understand This?
- What is the perceptual latency threshold for natural-feeling voice conversation, and what happens above 1.5 seconds?
- Why does a naive sequential pipeline have 2–4× higher latency than a streaming pipeline with the same components?
- Explain sentence-aware streaming: what triggers a dispatch to TTS, and what happens while TTS processes the first sentence?
- Why is 4-bit quantisation so impactful for voice latency, and what quality trade-off does it make?
- What is TTFA and why is it the user-perceived metric rather than individual stage latencies?
- Why should TTS models be pre-warmed, and what latency penalty does cold startup add?