🧠 All Things AI
Intermediate

Voice Pipeline Architecture

A voice AI pipeline converts spoken input into a text response and speaks it back — seamlessly enough that the interaction feels conversational. Building one requires careful latency management at every stage, robust interruption handling, and deliberate choices about which components run locally vs in the cloud.

Full Pipeline Diagram

The classic cascade pipeline processes audio sequentially through four stages:

  • Microphone: audio buffer (streaming PCM)
  • VAD: detect speech start/end → emit utterance
  • STT: audio → text transcript
  • LLM: history + transcript → streaming tokens
  • TTS: sentence chunks → audio stream
  • Speaker: audio playback → loop back to VAD

Sentence-boundary streaming dispatches each complete sentence to TTS while LLM continues generating — hides 800–1,500ms of generation latency

Latency Budget

The target for conversational voice is 800ms time-to-first-word — the delay from when the user stops speaking to when they hear the first word of the response. Beyond ~1,200ms, the interaction starts to feel like a phone system, not a conversation.

| Stage | Local GPU target | Cloud API target | Primary levers |
|---|---|---|---|
| VAD (speech end detection) | 10–50ms | 10–50ms | VAD threshold, end-of-speech padding |
| STT transcription | 100–300ms (Whisper Turbo INT8) | 200–400ms (Deepgram Nova-2) | Model size, streaming vs batch, hardware |
| LLM TTFT (time-to-first-token) | 300–600ms (8B model) | 400–800ms (GPT-4o / Claude Sonnet) | Model tier, prompt length, quantisation |
| TTS start (first audio chunk) | 50–150ms (Piper) | 100–300ms (ElevenLabs streaming) | Sentence-boundary dispatch, pre-warming |
| Total TTFW | 460–1,100ms | 710–1,550ms | Overlap LLM generation with TTS |

The key latency trick: sentence-boundary streaming

Do not wait for the full LLM response before starting TTS. Detect sentence boundaries in the token stream (period, question mark, exclamation mark followed by whitespace or end-of-stream). Dispatch each complete sentence to TTS immediately while the LLM continues generating the next sentence. This hides most of the LLM generation latency behind TTS audio playback, reducing perceived wait by 800–1,500ms on typical responses.
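A minimal sketch of the boundary detector, written as a generator over the token stream (the regex is deliberately naive and would mis-fire on abbreviations like "Dr."; production code needs more care):

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ? or ! followed by whitespace. End-of-stream
# is handled by the final flush below.
_BOUNDARY = re.compile(r"([.?!])(\s)")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed LLM tokens and yield each complete sentence
    as soon as its boundary appears, so TTS can start immediately."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if not match:
                break
            end = match.end(1)           # include the terminator itself
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():                   # flush the trailing fragment
        yield buffer.strip()
```

Each yielded sentence is handed straight to the TTS engine (e.g. `for sentence in sentences_from_tokens(stream): tts.speak(sentence)`, where `tts.speak` stands in for your engine's streaming call) while the LLM keeps generating.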

Stage 1: Voice Activity Detection (VAD)

VAD detects when the user starts and stops speaking. Without it, you have no clean utterance boundary to feed to STT.

| VAD option | Where it runs | Notes |
|---|---|---|
| Silero VAD | Local CPU/GPU | De facto standard; Apache 2.0; excellent accuracy; Python & JS SDKs |
| WebRTC VAD | Browser / local | Built into browsers; lightweight; lower accuracy than Silero |
| OpenAI Realtime API | Cloud (server-side) | Turn detection built in; removes need for local VAD; cloud-only |
| Picovoice Cobra | Local (edge-optimised) | Commercial; runs on microcontrollers; very low power draw |

Key VAD tuning parameters: end-of-speech padding (how long to wait after the last speech frame before declaring the utterance complete — 300–600ms is typical; too short causes cutoffs, too long adds latency) and detection threshold (higher = fewer false triggers from background noise).
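The end-of-speech padding logic is a small state machine over per-frame speech/non-speech decisions. This is a sketch; the frame-level classification is assumed to come from a VAD such as Silero:

```python
from dataclasses import dataclass

@dataclass
class EndOfSpeechDetector:
    """Declares an utterance complete once `padding_ms` of continuous
    silence follows the last speech frame."""
    padding_ms: int = 400   # 300-600ms typical; too short clips words
    frame_ms: int = 30      # duration of one audio frame
    _silence_ms: int = 0
    _in_speech: bool = False

    def update(self, frame_is_speech: bool) -> bool:
        """Feed one frame's VAD decision; returns True when the utterance ends."""
        if frame_is_speech:
            self._in_speech = True
            self._silence_ms = 0         # any speech resets the padding timer
            return False
        if not self._in_speech:
            return False                 # leading silence: nothing to end yet
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.padding_ms:
            self._in_speech = False
            self._silence_ms = 0
            return True
        return False
```

Note how a short mid-sentence pause resets nothing permanent: only `padding_ms` of uninterrupted silence ends the utterance, which is exactly the cutoff-vs-latency trade-off described above.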

Stage 2: Speech-to-Text (STT)

| STT option | Where | Latency | Best for |
|---|---|---|---|
| Whisper Turbo INT8 (GPU) | Local | 100–250ms | Privacy, offline, cost-sensitive at scale |
| Whisper large-v3 (GPU) | Local | 500–1,000ms | Highest accuracy when latency allows |
| Deepgram Nova-2 | Cloud API | 200–400ms | Production quality; streaming; speaker diarisation |
| OpenAI Realtime API | Cloud | Built-in | Integrated STT+LLM+TTS; lowest total latency if cloud-only |
| Picovoice Cheetah Fast | Local (edge) | <100ms | Extremely constrained hardware; offline IoT |

For cascade pipelines, Whisper Turbo INT8 on GPU is the recommended local option (fast, accurate, free). For cloud-only or thin clients, Deepgram Nova-2 offers excellent accuracy with streaming support and speaker diarisation.

Stage 3: LLM

The LLM receives the STT transcript and conversation history. Key design decisions:

  • System prompt for voice persona — keep it short (voice personas do not need the same long instructions as text chatbots); specify speaking style (concise, conversational, no bullet points or markdown)
  • Model tier selection — for low-latency voice, a faster/smaller model is often better than a more capable one: GPT-4o mini, Claude Haiku, or Gemini Flash at 400–600ms TTFT vs 800ms+ for top-tier models
  • Keep context short — TTFT scales with prompt length; summarise or prune conversation history aggressively for voice (last 3–5 turns)
  • Local option — Llama 3.1 8B via Ollama on a GPU machine gives 300–500ms TTFT and costs nothing per token at scale
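The "keep context short" rule can be sketched as a small pruning helper, assuming OpenAI-style `{"role", "content"}` message dicts (the function name and defaults are illustrative):

```python
def prune_history(history: list[dict], max_turns: int = 4) -> list[dict]:
    """Keep the system prompt plus only the last `max_turns`
    user/assistant exchanges, so TTFT stays low for voice."""
    system = [m for m in history if m["role"] == "system"]
    turns = [m for m in history if m["role"] != "system"]
    # One turn = one user message + one assistant reply (2 messages).
    return system + turns[-2 * max_turns:]
```

Summarising the dropped turns into a single synthetic message is a refinement on the same idea; hard truncation is the simplest version.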

Stage 4: Text-to-Speech (TTS)

| TTS option | Where | Quality | First chunk latency |
|---|---|---|---|
| Piper | Local CPU | Good (slightly robotic) | 50–100ms |
| Kokoro | Local GPU/CPU | Very good (open-source) | 80–200ms |
| ElevenLabs v3 (streaming) | Cloud | Near-human, emotional | 100–250ms |
| OpenAI TTS (streaming) | Cloud | Very good | 150–300ms |
| Azure Neural TTS | Cloud | Very good, 400+ voices | 150–350ms |

Piper is deliberately run on the CPU even on GPU machines, keeping the GPU free for the LLM and STT — at 50–100ms it is fast enough for voice and does not compete for GPU memory. Pre-warm TTS models at application startup to eliminate the 200–800ms cold-start cost on the first request.
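Pre-warming can be as simple as one throwaway synthesis call at startup; `synthesize` below stands in for whatever call your TTS engine exposes (a hypothetical interface, not a specific library's API):

```python
import time

def prewarm_tts(synthesize, phrase: str = "warm up") -> float:
    """Run one discarded synthesis at startup so model weights and
    caches load before the first real request; returns elapsed seconds
    so the cold-start cost can be logged."""
    start = time.perf_counter()
    synthesize(phrase)   # audio is thrown away; only the side effect matters
    return time.perf_counter() - start
```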

Interruption Handling (Barge-In)

A conversational voice agent must handle the user speaking while the agent is talking — called barge-in or interruption. Without it, the user must wait for the agent to finish, which feels unnatural.

1. VAD detects speech while TTS is playing
2. Stop TTS: discard buffered audio immediately
3. Cancel LLM request: abort the in-flight streaming generation
4. Log the interrupted turn: add it to conversation history for context
5. Resume the pipeline: new utterance → STT → LLM → TTS

Barge-in must cancel TTS and LLM together — cancelling only TTS leaves a stale LLM generation consuming tokens
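One way to implement the joint cancellation, sketched with asyncio (assuming each stage runs as a cancellable task wrapping a streaming call):

```python
import asyncio

async def cancel_turn(llm_task: asyncio.Task, tts_task: asyncio.Task) -> None:
    """Cancel LLM generation and TTS playback together on barge-in.

    Cancelling only TTS would leave a stale LLM generation streaming
    tokens nobody will hear."""
    llm_task.cancel()
    tts_task.cancel()
    # Wait for both tasks to finish unwinding before starting the new
    # turn; return_exceptions=True swallows the CancelledError each raises.
    await asyncio.gather(llm_task, tts_task, return_exceptions=True)
```

The VAD monitor calls this the moment speech is detected during playback, then feeds the new utterance into STT as step 5 of the sequence above.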

Key implementation note: acoustic echo cancellation (AEC) is required on devices without headphones. Without AEC, the microphone picks up the speaker audio, which triggers false VAD detections and creates feedback loops. Most audio SDKs (WebRTC, pyaudio with AEC filter, Apple AVFoundation) include AEC.

On-Device vs Cloud Architecture

On-Device Stack (privacy-first, offline capable)

  • openWakeWord (wake word)
  • Silero VAD (speech detection)
  • Whisper Turbo INT8 (STT, on GPU)
  • Llama 3.1 8B (LLM, via Ollama)
  • Piper (TTS, on CPU)

Cloud Stack (best quality, thin client)

  • Deepgram Nova-2 (streaming STT)
  • GPT-4o mini / Claude Haiku (LLM, low-latency tier)
  • ElevenLabs / OpenAI TTS (near-human TTS)
On-device: requires 16 GB RAM + 8 GB VRAM minimum. Cloud: any device with internet.

| Component | On-device stack | Cloud stack |
|---|---|---|
| Wake word | openWakeWord / Porcupine | N/A (cloud always listening is a privacy risk) |
| VAD | Silero VAD | Server-side in Realtime API |
| STT | Whisper Turbo INT8 (GPU) | Deepgram Nova-2, OpenAI Realtime API |
| LLM | Llama 3.1 8B / 70B (Ollama) | GPT-4o mini, Claude Haiku, Gemini Flash |
| TTS | Piper (CPU), Kokoro (GPU) | ElevenLabs, OpenAI TTS, Azure Neural |
| Min hardware | 16 GB RAM, 8 GB VRAM (M2 Pro or RTX 3080) | Any device with internet connection |
| Latency (TTFW) | 500–900ms | 600–1,400ms (network dependent) |
| Cost at scale | Hardware + electricity only | $0.006–$0.03 per minute of conversation |
| Privacy | Audio never leaves device | Audio transmitted to cloud |

Wake Word Integration

Always-on voice assistants need a wake word engine to avoid continuously processing audio:

  • openWakeWord — Apache 2.0, custom wake words from audio clips, runs 15–20 models simultaneously on a Raspberry Pi 3 single core; native Home Assistant integration; recommended for privacy-sensitive deployments
  • Porcupine (Picovoice) — commercial, >97% accuracy, <1 false alarm per 10 hours, trains new wake words in seconds via web UI; runs on ARM Cortex-M4 microcontrollers; best for commercial products

Wake word engines run continuously on CPU at very low power. Only when the wake word is detected does the full STT → LLM → TTS pipeline activate. The wake phrase itself is stripped from the STT transcript (the user says “Hey Assistant, book a meeting”; the LLM sees “book a meeting”).
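Stripping the wake phrase before the transcript reaches the LLM is a small normalisation step; the phrase and function name below are illustrative (use whatever phrase your engine was trained on):

```python
import re

def strip_wake_phrase(transcript: str,
                      wake_phrase: str = "hey assistant") -> str:
    """Remove a leading wake phrase, plus any trailing comma or
    whitespace, from an STT transcript before it reaches the LLM."""
    pattern = re.compile(r"^\s*" + re.escape(wake_phrase) + r"[\s,.!]*",
                         re.IGNORECASE)
    return pattern.sub("", transcript, count=1).strip()
```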

Pipeline Failure Modes

| Failure | Symptom | Mitigation |
|---|---|---|
| VAD cutoff | Agent cuts off mid-sentence; user sounds "clipped" | Increase end-of-speech padding (400–600ms); tune VAD threshold |
| STT hallucination | Short/silent audio produces garbage transcript ("Thanks.") | Discard transcripts under 2 words or under 0.3s audio duration |
| TTS latency spike | Long pause before agent speaks; conversation breaks | Pre-warm TTS at startup; pre-buffer first chunk; use streaming TTS API |
| No interruption handling | User must wait for full agent response before speaking | Implement VAD monitoring during TTS playback; cancel on speech detection |
| Microphone bleed (echo) | Agent hears itself and triggers on its own speech | Enable acoustic echo cancellation (AEC) in audio SDK |
| Context overflow | Long conversations increase TTFT until system fails | Summarise and prune conversation history aggressively (<5 turns for voice) |
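The STT-hallucination mitigation (discard transcripts under 2 words or under 0.3s of audio) is a one-line filter; this sketch assumes the audio duration is available alongside the transcript:

```python
def is_valid_transcript(text: str, audio_seconds: float,
                        min_words: int = 2, min_seconds: float = 0.3) -> bool:
    """Reject likely STT hallucinations: very short or near-silent
    clips often produce garbage like "Thanks."."""
    return len(text.split()) >= min_words and audio_seconds >= min_seconds
```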

Checklist: Do You Understand This?

  • What is the time-to-first-word (TTFW) target for a conversational voice agent, and why?
  • How does sentence-boundary streaming reduce perceived latency without changing actual computation time?
  • What does VAD do, and what happens if end-of-speech padding is set too short?
  • Why is Piper TTS run on CPU even when a GPU is available?
  • What are the five steps in a barge-in (interruption) handling sequence?
  • What is acoustic echo cancellation (AEC) and when is it required?
  • For a privacy-sensitive voice application that must work offline, what local stack would you choose?