Voice Latency Design
Latency is the make-or-break metric for voice AI. A response under 400ms feels human-like. A response over 1.5 seconds feels broken. Most naive voice pipeline implementations sit at 2–5 seconds — not because the individual components are slow, but because they are wired sequentially instead of in parallel, and because no streaming is used between stages. This page covers the specific optimisations that take a voice pipeline from 2 seconds to under 400ms.
Why 400ms Is the Target
Human conversation has an average turn-taking gap of 200–300ms. A voice AI response under 400ms falls within the range users perceive as natural. At 600–800ms, users notice a pause but tolerate it for complex questions. Above 1.5 seconds, the interaction feels broken and users stop talking naturally, resorting to command-style speech. The 400ms target is not arbitrary — it is the perceptual threshold for naturalness.
| Latency range | User perception |
|---|---|
| < 400ms | Natural conversation — feels human-like |
| 400–800ms | Slight pause — acceptable for complex queries, noticeable |
| 800ms–1.5s | Clearly waiting — tolerable for one-off queries, poor for conversation |
| > 1.5s | Broken — users lose context, resort to command-style speech |
Where Time Goes in a Naive Pipeline
A naive sequential pipeline waits for each stage to fully complete before starting the next. This is the primary source of excess latency — not slow components, but sequential wiring.
Naive pipeline (sequential — ~2–4s total):
1. Wait for user to finish speaking
2. Send full audio clip to STT → wait for full transcript
3. Send full transcript to LLM → wait for full response text
4. Send full response text to TTS → wait for full audio file
5. Play audio
Each stage adds its full duration to the total. No parallelism, no streaming.
Optimised pipeline (streaming + overlap — ~300–500ms to first audio):
1. VAD detects end of speech
2. Stream audio to STT → tokens arrive in chunks
3. Final transcript arrives → immediately sent to LLM (streaming enabled)
4. LLM tokens stream → sentence boundary detector fires at first complete sentence
5. First sentence sent to TTS → audio starts playing while LLM generates sentence 2+
User hears audio ~300ms after speaking. LLM and TTS overlap; user hears sentence 1 while sentence 2 is being generated.
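The overlap between stages can be sketched as a toy asyncio simulation: each stage is a coroutine connected by a queue, so TTS starts on sentence 1 while the LLM is still generating. The sentence texts and stage delays below are invented for illustration, not measurements of real engines.

```python
import asyncio
import time

async def llm(out_q):
    """Simulated LLM: emits one sentence at a time into the queue."""
    for sentence in ["First sentence.", "Second sentence.", "Third sentence."]:
        await asyncio.sleep(0.10)          # simulated time to generate one sentence
        await out_q.put(sentence)
    await out_q.put(None)                  # end-of-stream marker

async def tts(in_q, out_q):
    """Simulated TTS: synthesises each sentence as soon as it arrives."""
    while (sentence := await in_q.get()) is not None:
        await asyncio.sleep(0.05)          # simulated synthesis time
        await out_q.put(f"audio({sentence})")
    await out_q.put(None)

async def run():
    t0 = time.monotonic()
    text_q, audio_q = asyncio.Queue(), asyncio.Queue()
    llm_task = asyncio.create_task(llm(text_q))
    tts_task = asyncio.create_task(tts(text_q, audio_q))
    first_audio = await audio_q.get()      # playback would start here
    ttfa = time.monotonic() - t0           # time-to-first-audio
    await asyncio.gather(llm_task, tts_task)  # drain the rest of the turn
    return ttfa, first_audio

ttfa, first_audio = asyncio.run(run())
print(f"first audio after {ttfa * 1000:.0f}ms: {first_audio}")
```

With these numbers, first audio arrives at roughly 150ms (one sentence generated plus one synthesised), versus ~350ms for the sequential version of the same stages, even though total generation time is unchanged.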
Sentence-Aware Streaming
The single highest-impact optimisation is sentence-aware streaming between the LLM and TTS. Instead of waiting for the complete LLM response, a sentence boundary detector monitors the streaming token output. As soon as a complete, semantically coherent sentence is detected (typically at a period, question mark, or exclamation point followed by a space), that sentence is immediately dispatched to TTS — while the LLM continues generating the rest of the response.
How to implement
- Buffer incoming LLM tokens
- Detect sentence end: `.`, `?`, or `!` followed by whitespace, or end of stream
- On sentence boundary: dispatch buffered text to TTS, clear buffer
- Queue TTS audio chunks for seamless playback
- Flush any remaining buffer at LLM stream end
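The steps above can be sketched as a minimal buffer class, assuming Python 3.8+ and treating `.`, `?`, or `!` followed by whitespace as the boundary. A production detector would also handle abbreviations, decimal numbers, and quoted speech.

```python
import re

# Sentence boundary: ./?/! followed by whitespace.
SENTENCE_END = re.compile(r"([.!?])\s")

class SentenceBuffer:
    """Buffers streaming LLM tokens and yields complete sentences."""

    def __init__(self):
        self.buf = ""

    def feed(self, token):
        """Add one token; return any completed sentences (may be empty)."""
        self.buf += token
        sentences = []
        while (m := SENTENCE_END.search(self.buf)):
            end = m.end(1)                        # include the punctuation
            sentences.append(self.buf[:end].strip())
            self.buf = self.buf[end:].lstrip()    # keep the remainder
        return sentences

    def flush(self):
        """Call at end of stream: return any trailing partial sentence."""
        rest, self.buf = self.buf.strip(), ""
        return rest or None

buf = SentenceBuffer()
dispatched = []                                   # each entry -> send to TTS
for token in ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]:
    dispatched += buf.feed(token)
if (tail := buf.flush()):
    dispatched.append(tail)
print(dispatched)  # ['Hello there.', 'How are you?', 'Fine']
```

Note that "Hello there." is only dispatched once the following token supplies the whitespace after the period; the end-of-stream flush covers the case where the response ends without trailing punctuation.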
Impact
- Streaming implementations achieve 800–1500ms latency reduction vs batch
- Streaming TTS systems can achieve time-to-first-audio under 50ms after the first sentence is dispatched
- VoXtream (2025) achieves first-packet latency as low as 102ms on GPU
- User perceives faster response even if total generation time is the same
LLM Optimisations
After STT, the LLM is typically the largest latency contributor. TTFT (time-to-first-token) is the metric that matters for voice — the user's first audio starts from the first complete sentence, which starts from the first tokens.
Model size selection
For voice, choose the smallest model that meets quality requirements. An 8B model has 3–5× lower TTFT than a 70B model on the same hardware. Quality for conversational voice is often acceptable at 8B — 70B adds latency without proportional quality gain for short conversational responses.
Quantisation (4-bit)
4-bit quantisation with GGUF (llama.cpp) or AWQ achieves up to 40% latency reduction while preserving over 95% of generation quality. Token generation is memory-bandwidth bound, and shrinking weights from 16-bit to 4-bit roughly quarters the data moved per token, which is where most of the speedup comes from. This is one of the highest-ROI optimisations for local voice stacks.
Keep context short
TTFT scales with prompt length. Trim the system prompt to voice-essential content only. For multi-turn conversations, summarise history rather than appending all turns verbatim. A 500-token prompt has measurably lower TTFT than a 3000-token one.
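One way to bound the prompt is to keep the system prompt plus only as many recent turns as fit a token budget. The sketch below uses a crude whitespace count as a stand-in for a real tokenizer; the budget value and the keep-newest-turns policy are illustrative assumptions.

```python
def count_tokens(text):
    """Crude proxy; swap in the model's real tokenizer for accuracy."""
    return len(text.split())

def trim_history(system_prompt, turns, budget=500):
    """Keep the system prompt and as many of the most recent turns as fit."""
    used = count_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):            # newest first
        cost = count_tokens(turn["content"])
        if used + cost > budget:
            break                           # older turns are dropped
        kept.append(turn)
        used += cost
    return [{"role": "system", "content": system_prompt}] + list(reversed(kept))

turns = [
    {"role": "user", "content": "word " * 200},       # old, long turn
    {"role": "assistant", "content": "word " * 200},  # recent, long turn
    {"role": "user", "content": "latest question"},
]
msgs = trim_history("Answer briefly, in a spoken register.", turns, budget=250)
# Oldest long turn is dropped; system prompt and latest turns survive.
```

A refinement in the same spirit is to replace the dropped turns with a one-sentence summary rather than discarding them outright, as suggested above.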
TTS Optimisations
Pre-buffer the first chunk
Generate the first TTS audio chunk while the LLM is still completing the sentence. Some TTS engines support partial input. Even a 50ms pre-buffer eliminates the perceptual gap between LLM response and audio start.
Choose the right TTS for your constraint
- Piper (CPU): ~50–200ms, frees GPU — best choice when GPU is needed for STT + LLM
- Cloud TTS (ElevenLabs, OpenAI): adds network round-trip (~100–300ms) but highest quality
- Streaming cloud TTS: audio starts before full text is received — reduces cloud TTS latency by 50%+
Warm up models before the conversation starts
Cold TTS startup (loading model weights) adds 200–800ms to the first response only. Pre-warm the TTS model at server startup or session start so the first utterance has the same latency as subsequent ones.
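The warm-up itself is just a throwaway synthesis at startup so that weight loading and any lazy initialisation happen before the first real utterance. The engine below is a dummy stand-in with a simulated load cost; substitute whatever synthesis call your TTS engine actually exposes.

```python
import time

class DummyTTS:
    """Stand-in TTS engine whose first call pays a simulated load cost."""

    def __init__(self):
        self.loaded = False

    def synthesize(self, text):
        if not self.loaded:
            time.sleep(0.3)                # simulated cold start: weight loading
            self.loaded = True
        return b"\x00" * len(text)         # fake audio bytes

def warm_up(tts):
    """Run one throwaway synthesis at server or session start."""
    tts.synthesize("warm up")              # output is discarded

tts = DummyTTS()
warm_up(tts)                               # cold-start cost paid here, off the critical path

t0 = time.monotonic()
tts.synthesize("Hello, how can I help?")   # first real utterance: warm path
first_utterance_ms = (time.monotonic() - t0) * 1000
```

Without the `warm_up` call, the 300ms simulated load would land on the first user-facing response instead of on startup.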
Network Optimisations
For cloud-hosted pipelines, network latency is a floor that cannot be eliminated, only minimised. The following optimisations have the most impact.
- Co-locate STT, LLM, and TTS inference: inter-service calls within the same data centre add 1–5ms vs 50–200ms cross-region
- Reuse connections: keep HTTP/2 or WebSocket connections open for the session — avoid TCP + TLS handshake on every request
- Use WebRTC for audio transport: lower latency than HTTP for audio streaming; globally distributed WebRTC infrastructure recommended for production
- Avoid DNS on the critical path: pre-resolve and cache DNS for inference endpoints
- Deploy inference close to users: 100ms round-trip from US to EU adds directly to TTFT — use region-aware routing
Measuring Latency Correctly
Metrics to track
- STT latency: audio end → transcript available
- LLM TTFT: transcript sent → first token received
- Sentence dispatch time: first token → first sentence boundary detected
- TTS start latency: first sentence dispatched → first audio chunk ready
- Time-to-first-audio (TTFA): audio end → first audio plays (the user-perceived metric)
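One way to instrument these is to capture a single monotonic timestamp at each pipeline event and derive the stage metrics from the deltas. This is a sketch; the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TurnTimestamps:
    """Monotonic timestamps (seconds) captured once per conversational turn."""
    audio_end: float        # VAD detected end of user speech
    transcript: float       # STT transcript available
    first_token: float      # first LLM token received
    first_sentence: float   # sentence boundary detector fired
    first_audio: float      # first TTS audio chunk ready to play

    def metrics_ms(self):
        ms = lambda a, b: (b - a) * 1000
        return {
            "stt": ms(self.audio_end, self.transcript),
            "llm_ttft": ms(self.transcript, self.first_token),
            "sentence_dispatch": ms(self.first_token, self.first_sentence),
            "tts_start": ms(self.first_sentence, self.first_audio),
            "ttfa": ms(self.audio_end, self.first_audio),  # user-perceived
        }

# Example turn: 120ms STT, 130ms TTFT, 60ms to sentence, 50ms TTS start.
m = TurnTimestamps(0.0, 0.12, 0.25, 0.31, 0.36).metrics_ms()
```

Deriving everything from one set of timestamps guarantees the stage metrics sum to TTFA, which makes it easy to see which stage regressed when the user-perceived number drifts.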
Common measurement mistakes
- Measuring P50 only — voice latency spikes at P95/P99 are what users remember
- Measuring on localhost — add network latency simulation for realistic numbers
- Timing from request send, not from audio end — misses VAD and buffering time
- Not separating cold start from warm path — cold starts skew P50 upward
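The P50-versus-tail point is easy to see with a toy dataset: a pure-Python nearest-rank percentile over recorded TTFA samples, where a handful of slow turns leaves the median untouched but dominates P99.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; fine for monitoring, no interpolation."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(s)))   # 1-based nearest rank
    return s[k - 1]

# 98 fast turns plus 2 slow outliers out of 100:
ttfa_ms = [350] * 98 + [2400] * 2

p50 = percentile(ttfa_ms, 50)   # 350 — looks healthy
p95 = percentile(ttfa_ms, 95)   # 350 — still looks healthy
p99 = percentile(ttfa_ms, 99)   # 2400 — the turns users remember
```

A dashboard showing only the 350ms median would report this pipeline as natural-feeling while one turn in fifty lands well past the broken threshold.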
Checklist: Do You Understand This?
- What is the perceptual latency threshold for natural-feeling voice conversation, and what happens above 1.5 seconds?
- Why does a naive sequential pipeline have 2–4× higher latency than a streaming pipeline with the same components?
- Explain sentence-aware streaming: what triggers a dispatch to TTS, and what happens while TTS processes the first sentence?
- Why is 4-bit quantisation so impactful for voice latency, and what quality trade-off does it make?
- What is TTFA and why is it the user-perceived metric rather than individual stage latencies?
- Why should TTS models be pre-warmed, and what latency penalty does cold startup add?