Voice Pipeline Architecture
A voice AI pipeline converts spoken input into a text response and speaks it back — seamlessly enough that the interaction feels conversational. Building one requires careful latency management at every stage, robust interruption handling, and deliberate choices about which components run locally vs in the cloud.
Full Pipeline Diagram
The classic cascade pipeline processes audio sequentially through four stages: VAD → STT → LLM → TTS.
Sentence-boundary streaming dispatches each complete sentence to TTS while the LLM continues generating, hiding 800–1,500ms of generation latency.
Latency Budget
The target for conversational voice is 800ms time-to-first-word — the delay from when the user stops speaking to when they hear the first word of the response. Beyond ~1,200ms, the interaction starts to feel like a phone system, not a conversation.
| Stage | Local GPU target | Cloud API target | Primary levers |
|---|---|---|---|
| VAD (speech end detection) | 10–50ms | 10–50ms | VAD threshold, end-of-speech padding |
| STT transcription | 100–300ms (Whisper Turbo INT8) | 200–400ms (Deepgram Nova-2) | Model size, streaming vs batch, hardware |
| LLM TTFT (time-to-first-token) | 300–600ms (8B model) | 400–800ms (GPT-4o / Claude Sonnet) | Model tier, prompt length, quantisation |
| TTS start (first audio chunk) | 50–150ms (Piper) | 100–300ms (ElevenLabs streaming) | Sentence-boundary dispatch, pre-warming |
| Total TTFW | 460–1,100ms | 710–1,550ms | Overlap LLM generation with TTS |
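The totals in the last row are straight sums of the per-stage bounds; a quick sanity check (values copied from the local-GPU column, dictionary and function names are illustrative):

```python
# Sum per-stage latency ranges (ms) into a total time-to-first-word range.
# Values are the local-GPU column of the budget table above.
LOCAL_GPU_BUDGET_MS = {
    "vad": (10, 50),
    "stt": (100, 300),        # Whisper Turbo INT8
    "llm_ttft": (300, 600),   # 8B model
    "tts_start": (50, 150),   # Piper
}

def total_ttfw(budget):
    """Return (best_case, worst_case) time-to-first-word in ms."""
    lo = sum(low for low, _ in budget.values())
    hi = sum(high for _, high in budget.values())
    return lo, hi

print(total_ttfw(LOCAL_GPU_BUDGET_MS))  # → (460, 1100)
```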
The key latency trick: sentence-boundary streaming
Do not wait for the full LLM response before starting TTS. Detect sentence boundaries in the token stream (period, question mark, exclamation mark followed by whitespace or end-of-stream). Dispatch each complete sentence to TTS immediately while the LLM continues generating the next sentence. This hides most of the LLM generation latency behind TTS audio playback, reducing perceived wait by 800–1,500ms on typical responses.
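A minimal sketch of the boundary detector, assuming the LLM yields plain-text tokens (function name and the exact regex are illustrative; a production version would also handle abbreviations like "e.g."):

```python
import re

# Sentence terminator (. ! ?) followed by whitespace; end-of-stream is
# handled by the flush below.
_BOUNDARY = re.compile(r'([.!?])\s')

def sentences_from_tokens(token_stream):
    """Yield each complete sentence as soon as it appears in an LLM
    token stream, so it can be dispatched to TTS immediately while the
    LLM keeps generating."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while (m := _BOUNDARY.search(buffer)):
            end = m.end(1)  # keep the punctuation, drop the whitespace
            yield buffer[:end].strip()
            buffer = buffer[end:].lstrip()
    if buffer.strip():  # flush whatever remains at end-of-stream
        yield buffer.strip()

tokens = ["Sure", ". I", " can", " help", "!", " What", " time", "?"]
print(list(sentences_from_tokens(tokens)))
# → ['Sure.', 'I can help!', 'What time?']
```

Each yielded sentence would be handed straight to the TTS engine; the first one typically starts playing before the LLM has finished the second.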
Stage 1: Voice Activity Detection (VAD)
VAD detects when the user starts and stops speaking. Without it, you have no clean utterance boundary to feed to STT.
| VAD option | Where it runs | Notes |
|---|---|---|
| Silero VAD | Local CPU/GPU | De facto standard; Apache 2.0; excellent accuracy; Python & JS SDKs |
| WebRTC VAD | Browser / local | Built into browsers; lightweight; lower accuracy than Silero |
| OpenAI Realtime API | Cloud (server-side) | Turn detection built in; removes need for local VAD; cloud-only |
| Picovoice Cobra | Local (edge-optimised) | Commercial; runs on microcontrollers; very low power draw |
Key VAD tuning parameters: end-of-speech padding (how long to wait after the last speech frame before declaring the utterance complete — 300–600ms is typical; too short causes cutoffs, too long adds latency) and detection threshold (higher = fewer false triggers from background noise).
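The end-of-speech padding logic can be sketched as a small state machine over per-frame VAD decisions (frame size, padding value, and function name are illustrative):

```python
def utterance_end(frames, frame_ms=30, padding_ms=450):
    """Given a sequence of per-frame VAD decisions (True = speech),
    return the index of the frame at which the utterance is declared
    complete, or None if it never is.

    padding_ms controls the required trailing silence: too short and
    mid-sentence pauses cause cutoffs, too long adds latency.
    """
    needed = padding_ms // frame_ms   # consecutive silent frames required
    silent = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent = 0                # any speech resets the countdown
        elif heard_speech:
            silent += 1
            if silent >= needed:
                return i
    return None

# 10 speech frames, then silence: end declared after 450ms of padding.
frames = [True] * 10 + [False] * 30
print(utterance_end(frames))  # → 24
```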
Stage 2: Speech-to-Text (STT)
| STT option | Where | Latency | Best for |
|---|---|---|---|
| Whisper Turbo INT8 (GPU) | Local | 100–250ms | Privacy, offline, cost-sensitive at scale |
| Whisper large-v3 (GPU) | Local | 500–1,000ms | Highest accuracy when latency allows |
| Deepgram Nova-2 | Cloud API | 200–400ms | Production quality; streaming; speaker diarisation |
| OpenAI Realtime API | Cloud | Built-in | Integrated STT+LLM+TTS; lowest total latency if cloud-only |
| Picovoice Cheetah Fast | Local (edge) | <100ms | Extremely constrained hardware; offline IoT |
For cascade pipelines, Whisper Turbo INT8 on GPU is the recommended local option (fast, accurate, free). For cloud-only or thin clients, Deepgram Nova-2 offers excellent accuracy with streaming support and speaker diarisation.
Stage 3: LLM
The LLM receives the STT transcript and conversation history. Key design decisions:
- System prompt for voice persona — keep it short (voice personas do not need the same long instructions as text chatbots); specify speaking style (concise, conversational, no bullet points or markdown)
- Model tier selection — for low-latency voice, a faster/smaller model is often better than a more capable one: GPT-4o mini, Claude Haiku, or Gemini Flash at 400–600ms TTFT vs 800ms+ for top-tier models
- Keep context short — TTFT scales with prompt length; summarise or prune conversation history aggressively for voice (last 3–5 turns)
- Local option — Llama 3.1 8B via Ollama on a GPU machine gives 300–500ms TTFT and costs nothing per token at scale
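The "keep context short" advice reduces to a small pruning helper; a sketch, assuming OpenAI-style message dicts (function name and cut-off are illustrative):

```python
def prune_history(messages, max_turns=4):
    """Keep the system prompt plus only the last max_turns
    user/assistant exchanges. TTFT scales with prompt length, so voice
    agents prune far more aggressively than text chatbots."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * max_turns:]  # 2 messages per turn

history = [{"role": "system", "content": "You are a concise voice assistant."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

pruned = prune_history(history, max_turns=4)
print(len(pruned))  # → 9 (system prompt + last 4 exchanges)
```

A production version might summarise the dropped turns into the system prompt instead of discarding them outright.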
Stage 4: Text-to-Speech (TTS)
| TTS option | Where | Quality | First chunk latency |
|---|---|---|---|
| Piper | Local CPU | Good (slightly robotic) | 50–100ms |
| Kokoro | Local GPU/CPU | Very good (open-source) | 80–200ms |
| ElevenLabs v3 (streaming) | Cloud | Near-human, emotional | 100–250ms |
| OpenAI TTS (streaming) | Cloud | Very good | 150–300ms |
| Azure Neural TTS | Cloud | Very good, 400+ voices | 150–350ms |
Piper is intentionally run on CPU even on GPU machines, keeping the GPU free for the LLM and STT: at 50–100ms it is fast enough for voice and does not compete for GPU memory. Pre-warm TTS models at application startup to eliminate the 200–800ms cold-start cost on the first request.
Interruption Handling (Barge-In)
A conversational voice agent must handle the user speaking while the agent is talking — called barge-in or interruption. Without it, the user must wait for the agent to finish, which feels unnatural.
Barge-in must cancel the TTS and the LLM together: cancelling only TTS leaves a stale LLM generation still running and consuming tokens.
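A typical barge-in sequence: (1) run VAD on the microphone during playback, (2) stop TTS audio output, (3) cancel the in-flight LLM generation, (4) truncate the agent's turn in the conversation history to what was actually spoken, and (5) treat the new speech as the next user utterance. The cancellation core can be sketched with asyncio (all names are illustrative):

```python
import asyncio

async def speak_with_barge_in(llm_task, tts_task, user_spoke):
    """Cancel the TTS *and* the LLM together the moment the user starts
    speaking. llm_task/tts_task are running asyncio Tasks; user_spoke
    is an asyncio.Event set by the VAD when it detects speech."""
    interrupt = asyncio.ensure_future(user_spoke.wait())
    done, _ = await asyncio.wait(
        {tts_task, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    if interrupt in done:            # user barged in mid-playback
        tts_task.cancel()            # stop audio output
        llm_task.cancel()            # stop token generation too
        return "interrupted"
    interrupt.cancel()               # playback finished normally
    return "completed"

async def demo():
    user_spoke = asyncio.Event()
    llm = asyncio.create_task(asyncio.sleep(10))  # stand-in for generation
    tts = asyncio.create_task(asyncio.sleep(10))  # stand-in for playback
    asyncio.get_running_loop().call_later(0.01, user_spoke.set)
    return await speak_with_barge_in(llm, tts, user_spoke)

print(asyncio.run(demo()))  # → interrupted
```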
Key implementation note: acoustic echo cancellation (AEC) is required on devices without headphones. Without AEC, the microphone picks up the speaker audio, which triggers false VAD detections and creates feedback loops. Most audio SDKs (WebRTC, pyaudio with AEC filter, Apple AVFoundation) include AEC.
On-Device vs Cloud Architecture
In brief: on-device requires at least 16 GB RAM and 8 GB VRAM; cloud runs on any device with an internet connection.
| Component | On-device stack | Cloud stack |
|---|---|---|
| Wake word | openWakeWord / Porcupine | N/A (an always-listening cloud connection is a privacy risk) |
| VAD | Silero VAD | Server-side in Realtime API |
| STT | Whisper Turbo INT8 (GPU) | Deepgram Nova-2, OpenAI Realtime API |
| LLM | Llama 3.1 8B / 70B (Ollama) | GPT-4o mini, Claude Haiku, Gemini Flash |
| TTS | Piper (CPU), Kokoro (GPU) | ElevenLabs, OpenAI TTS, Azure Neural |
| Min hardware | 16 GB RAM, 8 GB VRAM (M2 Pro or RTX 3080) | Any device with internet connection |
| Latency (TTFW) | 500–900ms | 600–1,400ms (network dependent) |
| Cost at scale | Hardware + electricity only | $0.006–$0.03 per minute of conversation |
| Privacy | Audio never leaves device | Audio transmitted to cloud |
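The cost row invites a simple break-even estimate (the $1,600 hardware figure is illustrative, and electricity is ignored for simplicity):

```python
def break_even_minutes(hardware_cost_usd, cloud_cost_per_min):
    """Minutes of conversation after which on-device hardware pays for
    itself relative to per-minute cloud pricing."""
    return hardware_cost_usd / cloud_cost_per_min

# An illustrative ~$1,600 GPU machine vs the table's $0.006–$0.03/min:
print(round(break_even_minutes(1600, 0.03)))   # → 53333  (~890 hours)
print(round(break_even_minutes(1600, 0.006)))  # → 266667
```

At high usage the break-even arrives quickly; at low usage the cloud stack is usually cheaper.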
Wake Word Integration
Always-on voice assistants need a wake word engine to avoid continuously processing audio:
- openWakeWord — Apache 2.0, custom wake words from audio clips, runs 15–20 models simultaneously on a Raspberry Pi 3 single core; native Home Assistant integration; recommended for privacy-sensitive deployments
- Porcupine (Picovoice) — commercial, >97% accuracy, <1 false alarm per 10 hours, trains new wake words in seconds via web UI; runs on ARM Cortex-M4 microcontrollers; best for commercial products
Wake word engines run continuously on CPU at very low power. Only when the wake word is detected does the full STT → LLM → TTS pipeline activate. The wake phrase itself is stripped from the STT transcript (the user says “Hey Assistant, book a meeting”; the LLM sees “book a meeting”).
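Stripping the wake phrase is a small string operation; a sketch (wake phrases and function name are illustrative):

```python
import re

WAKE_PHRASES = ("hey assistant", "ok assistant")  # illustrative phrases

def strip_wake_phrase(transcript):
    """Remove a leading wake phrase (plus trailing punctuation and
    whitespace) so the LLM sees only the actual request."""
    lowered = transcript.lower()
    for phrase in WAKE_PHRASES:
        if lowered.startswith(phrase):
            rest = transcript[len(phrase):]
            return re.sub(r"^[\s,.!?]+", "", rest)
    return transcript

print(strip_wake_phrase("Hey Assistant, book a meeting"))  # → book a meeting
```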
Pipeline Failure Modes
| Failure | Symptom | Mitigation |
|---|---|---|
| VAD cutoff | Agent cuts off mid-sentence; user sounds "clipped" | Increase end-of-speech padding (400–600ms); tune VAD threshold |
| STT hallucination | Short/silent audio produces garbage transcript (“Thanks.”) | Discard transcripts under 2 words or under 0.3s audio duration |
| TTS latency spike | Long pause before agent speaks; conversation breaks | Pre-warm TTS at startup; pre-buffer first chunk; use streaming TTS API |
| No interruption handling | User must wait for full agent response before speaking | Implement VAD monitoring during TTS playback; cancel on speech detection |
| Microphone bleed (echo) | Agent hears itself and triggers on its own speech | Enable acoustic echo cancellation (AEC) in audio SDK |
| Context overflow | Long conversations increase TTFT until system fails | Summarise and prune conversation history aggressively (<5 turns for voice) |
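Several of the mitigations above are one-line guards; for example, the STT-hallucination row (discard transcripts under 2 words or under 0.3s of audio) can be sketched as:

```python
def transcript_is_valid(text, audio_seconds, min_words=2, min_seconds=0.3):
    """Guard against STT hallucinations: near-silent or very short audio
    often yields garbage like "Thanks."; discard it rather than waking
    the LLM."""
    return len(text.split()) >= min_words and audio_seconds >= min_seconds

print(transcript_is_valid("Thanks.", 0.2))         # → False
print(transcript_is_valid("book a meeting", 1.4))  # → True
```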
Checklist: Do You Understand This?
- What is the time-to-first-word (TTFW) target for a conversational voice agent, and why?
- How does sentence-boundary streaming reduce perceived latency without changing actual computation time?
- What does VAD do, and what happens if end-of-speech padding is set too short?
- Why is Piper TTS run on CPU even when a GPU is available?
- What are the five steps in a barge-in (interruption) handling sequence?
- What is acoustic echo cancellation (AEC) and when is it required?
- For a privacy-sensitive voice application that must work offline, what local stack would you choose?