Voice Pipeline Architecture
A voice AI pipeline converts spoken input into a text response and speaks it back — seamlessly enough that the interaction feels conversational. Building one requires careful latency management at every stage, robust interruption handling, and deliberate choices about which components run locally vs in the cloud.
Full Pipeline Diagram
The classic cascade pipeline processes audio sequentially through four stages: VAD → STT → LLM → TTS.
Sentence-boundary streaming dispatches each complete sentence to TTS while the LLM continues generating, hiding 800–1,500ms of generation latency.
Latency Budget
The target for conversational voice is 800ms time-to-first-word — the delay from when the user stops speaking to when they hear the first word of the response. Beyond ~1,200ms, the interaction starts to feel like a phone system, not a conversation.
| Stage | Local GPU target | Cloud API target | Primary levers |
|---|---|---|---|
| VAD (speech end detection) | 10–50ms | 10–50ms | VAD threshold, end-of-speech padding |
| STT transcription | 100–300ms (Whisper Turbo INT8) | 200–400ms (Deepgram Nova-2) | Model size, streaming vs batch, hardware |
| LLM TTFT (time-to-first-token) | 300–600ms (8B model) | 400–800ms (GPT-4o / Claude Sonnet) | Model tier, prompt length, quantisation |
| TTS start (first audio chunk) | 50–150ms (Piper) | 100–300ms (ElevenLabs streaming) | Sentence-boundary dispatch, pre-warming |
| Total TTFW | 460–1,100ms | 710–1,550ms | Overlap LLM generation with TTS |
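The totals in the last row are straight sums of the per-stage bounds; a quick sanity check (values copied from the local-GPU column, dictionary and function names are illustrative):

```python
# Sum per-stage latency ranges (ms) into a total time-to-first-word range.
# Values are the local-GPU column of the budget table above.
LOCAL_GPU_BUDGET_MS = {
    "vad": (10, 50),
    "stt": (100, 300),        # Whisper Turbo INT8
    "llm_ttft": (300, 600),   # 8B model
    "tts_start": (50, 150),   # Piper
}

def total_ttfw(budget):
    """Return (best_case, worst_case) time-to-first-word in ms."""
    lo = sum(low for low, _ in budget.values())
    hi = sum(high for _, high in budget.values())
    return lo, hi

print(total_ttfw(LOCAL_GPU_BUDGET_MS))  # → (460, 1100)
```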
The key latency trick: sentence-boundary streaming
Do not wait for the full LLM response before starting TTS. Detect sentence boundaries in the token stream (period, question mark, exclamation mark followed by whitespace or end-of-stream). Dispatch each complete sentence to TTS immediately while the LLM continues generating the next sentence. This hides most of the LLM generation latency behind TTS audio playback, reducing perceived wait by 800–1,500ms on typical responses.
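A minimal sketch of the boundary detector, assuming the LLM yields plain-text tokens (function name and the exact regex are illustrative; a production version would also handle abbreviations like "e.g."):

```python
import re

# Sentence terminator (. ! ?) followed by whitespace; end-of-stream is
# handled by the flush below.
_BOUNDARY = re.compile(r'([.!?])\s')

def sentences_from_tokens(token_stream):
    """Yield each complete sentence as soon as it appears in an LLM
    token stream, so it can be dispatched to TTS immediately while the
    LLM keeps generating."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while (m := _BOUNDARY.search(buffer)):
            end = m.end(1)  # keep the punctuation, drop the whitespace
            yield buffer[:end].strip()
            buffer = buffer[end:].lstrip()
    if buffer.strip():  # flush whatever remains at end-of-stream
        yield buffer.strip()

tokens = ["Sure", ". I", " can", " help", "!", " What", " time", "?"]
print(list(sentences_from_tokens(tokens)))
# → ['Sure.', 'I can help!', 'What time?']
```

Each yielded sentence would be handed straight to the TTS engine; the first one typically starts playing before the LLM has finished the second.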
Stage 1: Voice Activity Detection (VAD)
VAD detects when the user starts and stops speaking. Without it, you have no clean utterance boundary to feed to STT.
| VAD option | Where it runs | Notes |
|---|---|---|
| Silero VAD | Local CPU/GPU | De facto standard; Apache 2.0; excellent accuracy; Python & JS SDKs |
| WebRTC VAD | Browser / local | Built into browsers; lightweight; lower accuracy than Silero |
| OpenAI Realtime API | Cloud (server-side) | Turn detection built in; removes need for local VAD; cloud-only |
| Picovoice Cobra | Local (edge-optimised) | Commercial; runs on microcontrollers; very low power draw |
Key VAD tuning parameters: end-of-speech padding (how long to wait after the last speech frame before declaring the utterance complete — 300–600ms is typical; too short causes cutoffs, too long adds latency) and detection threshold (higher = fewer false triggers from background noise).
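The end-of-speech padding logic can be sketched as a small state machine over per-frame VAD decisions (frame size, padding value, and function name are illustrative):

```python
def utterance_end(frames, frame_ms=30, padding_ms=450):
    """Given a sequence of per-frame VAD decisions (True = speech),
    return the index of the frame at which the utterance is declared
    complete, or None if it never is.

    padding_ms controls the required trailing silence: too short and
    mid-sentence pauses cause cutoffs, too long adds latency.
    """
    needed = padding_ms // frame_ms   # consecutive silent frames required
    silent = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent = 0                # any speech resets the countdown
        elif heard_speech:
            silent += 1
            if silent >= needed:
                return i
    return None

# 10 speech frames, then silence: end declared after 450ms of padding.
frames = [True] * 10 + [False] * 30
print(utterance_end(frames))  # → 24
```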
Stage 2: Speech-to-Text (STT)
| STT option | Where | Latency | Best for |
|---|---|---|---|
| Whisper Turbo INT8 (GPU) | Local | 100–250ms | Privacy, offline, cost-sensitive at scale |
| Whisper large-v3 (GPU) | Local | 500–1,000ms | Highest accuracy when latency allows |
| Deepgram Nova-2 | Cloud API | 200–400ms | Production quality; streaming; speaker diarisation |
| OpenAI Realtime API | Cloud | Built-in | Integrated STT+LLM+TTS; lowest total latency if cloud-only |
| Picovoice Cheetah Fast | Local (edge) | <100ms | Extremely constrained hardware; offline IoT |
For cascade pipelines, Whisper Turbo INT8 on GPU is the recommended local option (fast, accurate, free). For cloud-only or thin clients, Deepgram Nova-2 offers excellent accuracy with streaming support and speaker diarisation.
Stage 3: LLM
The LLM receives the STT transcript and conversation history. Key design decisions:
- System prompt for voice persona — keep it short (voice personas do not need the same long instructions as text chatbots); specify speaking style (concise, conversational, no bullet points or markdown)
- Model tier selection — for low-latency voice, a faster/smaller model is often better than a more capable one: GPT-4o mini, Claude Haiku, or Gemini Flash at 400–600ms TTFT vs 800ms+ for top-tier models
- Keep context short — TTFT scales with prompt length; summarise or prune conversation history aggressively for voice (last 3–5 turns)
- Local option — Llama 3.1 8B via Ollama on a GPU machine gives 300–500ms TTFT and costs nothing per token at scale
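The "keep context short" advice reduces to a small pruning helper; a sketch, assuming OpenAI-style message dicts (function name and cut-off are illustrative):

```python
def prune_history(messages, max_turns=4):
    """Keep the system prompt plus only the last max_turns
    user/assistant exchanges. TTFT scales with prompt length, so voice
    agents prune far more aggressively than text chatbots."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * max_turns:]  # 2 messages per turn

history = [{"role": "system", "content": "You are a concise voice assistant."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

pruned = prune_history(history, max_turns=4)
print(len(pruned))  # → 9 (system prompt + last 4 exchanges)
```

A production version might summarise the dropped turns into the system prompt instead of discarding them outright.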
Stage 4: Text-to-Speech (TTS)
| TTS option | Where | Quality | First chunk latency |
|---|---|---|---|
| Piper | Local CPU | Good (slightly robotic) | 50–100ms |
| Kokoro | Local GPU/CPU | Very good (open-source) | 80–200ms |
| ElevenLabs v3 (streaming) | Cloud | Near-human, emotional | 100–250ms |
| OpenAI TTS (streaming) | Cloud | Very good | 150–300ms |
| Azure Neural TTS | Cloud | Very good, 400+ voices | 150–350ms |
Piper is intentionally run on CPU even on GPU machines, keeping the GPU free for the LLM and STT: at 50–100ms it is fast enough for voice and does not compete for GPU memory. Pre-warm TTS models at application startup to eliminate the 200–800ms cold-start cost on the first request.
Interruption Handling (Barge-In)
A conversational voice agent must handle the user speaking while the agent is talking — called barge-in or interruption. Without it, the user must wait for the agent to finish, which feels unnatural.
Barge-in must cancel the TTS and the LLM together: cancelling only TTS leaves a stale LLM generation still running and consuming tokens.
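A typical barge-in sequence: (1) run VAD on the microphone during playback, (2) stop TTS audio output, (3) cancel the in-flight LLM generation, (4) truncate the agent's turn in the conversation history to what was actually spoken, and (5) treat the new speech as the next user utterance. The cancellation core can be sketched with asyncio (all names are illustrative):

```python
import asyncio

async def speak_with_barge_in(llm_task, tts_task, user_spoke):
    """Cancel the TTS *and* the LLM together the moment the user starts
    speaking. llm_task/tts_task are running asyncio Tasks; user_spoke
    is an asyncio.Event set by the VAD when it detects speech."""
    interrupt = asyncio.ensure_future(user_spoke.wait())
    done, _ = await asyncio.wait(
        {tts_task, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    if interrupt in done:            # user barged in mid-playback
        tts_task.cancel()            # stop audio output
        llm_task.cancel()            # stop token generation too
        return "interrupted"
    interrupt.cancel()               # playback finished normally
    return "completed"

async def demo():
    user_spoke = asyncio.Event()
    llm = asyncio.create_task(asyncio.sleep(10))  # stand-in for generation
    tts = asyncio.create_task(asyncio.sleep(10))  # stand-in for playback
    asyncio.get_running_loop().call_later(0.01, user_spoke.set)
    return await speak_with_barge_in(llm, tts, user_spoke)

print(asyncio.run(demo()))  # → interrupted
```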
Key implementation note: acoustic echo cancellation (AEC) is required on devices without headphones. Without AEC, the microphone picks up the speaker audio, which triggers false VAD detections and creates feedback loops. Most audio SDKs (WebRTC, pyaudio with AEC filter, Apple AVFoundation) include AEC.
On-Device vs Cloud Architecture
In brief: on-device requires at least 16 GB RAM and 8 GB VRAM; cloud runs on any device with an internet connection.
| Component | On-device stack | Cloud stack |
|---|---|---|
| Wake word | openWakeWord / Porcupine | N/A (an always-listening cloud connection is a privacy risk) |
| VAD | Silero VAD | Server-side in Realtime API |
| STT | Whisper Turbo INT8 (GPU) | Deepgram Nova-2, OpenAI Realtime API |
| LLM | Llama 3.1 8B / 70B (Ollama) | GPT-4o mini, Claude Haiku, Gemini Flash |
| TTS | Piper (CPU), Kokoro (GPU) | ElevenLabs, OpenAI TTS, Azure Neural |
| Min hardware | 16 GB RAM, 8 GB VRAM (M2 Pro or RTX 3080) | Any device with internet connection |
| Latency (TTFW) | 500–900ms | 600–1,400ms (network dependent) |
| Cost at scale | Hardware + electricity only | $0.006–$0.03 per minute of conversation |
| Privacy | Audio never leaves device | Audio transmitted to cloud |
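The cost row invites a simple break-even estimate (the $1,600 hardware figure is illustrative, and electricity is ignored for simplicity):

```python
def break_even_minutes(hardware_cost_usd, cloud_cost_per_min):
    """Minutes of conversation after which on-device hardware pays for
    itself relative to per-minute cloud pricing."""
    return hardware_cost_usd / cloud_cost_per_min

# An illustrative ~$1,600 GPU machine vs the table's $0.006–$0.03/min:
print(round(break_even_minutes(1600, 0.03)))   # → 53333  (~890 hours)
print(round(break_even_minutes(1600, 0.006)))  # → 266667
```

At high usage the break-even arrives quickly; at low usage the cloud stack is usually cheaper.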
Wake Word Integration
Always-on voice assistants need a wake word engine to avoid continuously processing audio:
- openWakeWord — Apache 2.0, custom wake words from audio clips, runs 15–20 models simultaneously on a Raspberry Pi 3 single core; native Home Assistant integration; recommended for privacy-sensitive deployments
- Porcupine (Picovoice) — commercial, >97% accuracy, <1 false alarm per 10 hours, trains new wake words in seconds via web UI; runs on ARM Cortex-M4 microcontrollers; best for commercial products
Wake word engines run continuously on CPU at very low power. Only when the wake word is detected does the full STT → LLM → TTS pipeline activate. The wake phrase itself is stripped from the STT transcript (the user says “Hey Assistant, book a meeting”; the LLM sees “book a meeting”).
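Stripping the wake phrase is a small string operation; a sketch (wake phrases and function name are illustrative):

```python
import re

WAKE_PHRASES = ("hey assistant", "ok assistant")  # illustrative phrases

def strip_wake_phrase(transcript):
    """Remove a leading wake phrase (plus trailing punctuation and
    whitespace) so the LLM sees only the actual request."""
    lowered = transcript.lower()
    for phrase in WAKE_PHRASES:
        if lowered.startswith(phrase):
            rest = transcript[len(phrase):]
            return re.sub(r"^[\s,.!?]+", "", rest)
    return transcript

print(strip_wake_phrase("Hey Assistant, book a meeting"))  # → book a meeting
```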
Pipeline Failure Modes
| Failure | Symptom | Mitigation |
|---|---|---|
| VAD cutoff | Agent cuts off mid-sentence; user sounds "clipped" | Increase end-of-speech padding (400–600ms); tune VAD threshold |
| STT hallucination | Short/silent audio produces garbage transcript (“Thanks.”) | Discard transcripts under 2 words or under 0.3s audio duration |
| TTS latency spike | Long pause before agent speaks; conversation breaks | Pre-warm TTS at startup; pre-buffer first chunk; use streaming TTS API |
| No interruption handling | User must wait for full agent response before speaking | Implement VAD monitoring during TTS playback; cancel on speech detection |
| Microphone bleed (echo) | Agent hears itself and triggers on its own speech | Enable acoustic echo cancellation (AEC) in audio SDK |
| Context overflow | Long conversations increase TTFT until system fails | Summarise and prune conversation history aggressively (<5 turns for voice) |
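Several of the mitigations above are one-line guards; for example, the STT-hallucination row (discard transcripts under 2 words or under 0.3s of audio) can be sketched as:

```python
def transcript_is_valid(text, audio_seconds, min_words=2, min_seconds=0.3):
    """Guard against STT hallucinations: near-silent or very short audio
    often yields garbage like "Thanks."; discard it rather than waking
    the LLM."""
    return len(text.split()) >= min_words and audio_seconds >= min_seconds

print(transcript_is_valid("Thanks.", 0.2))         # → False
print(transcript_is_valid("book a meeting", 1.4))  # → True
```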
Checklist: Do You Understand This?
- What is the time-to-first-word (TTFW) target for a conversational voice agent, and why?
- How does sentence-boundary streaming reduce perceived latency without changing actual computation time?
- What does VAD do, and what happens if end-of-speech padding is set too short?
- Why is Piper TTS run on CPU even when a GPU is available?
- What are the five steps in a barge-in (interruption) handling sequence?
- What is acoustic echo cancellation (AEC) and when is it required?
- For a privacy-sensitive voice application that must work offline, what local stack would you choose?