Wake Words & Privacy
A wake word engine listens passively to audio 24/7, triggering the full voice pipeline only when a specific phrase is detected — "Hey Jarvis", "Computer", "Alexa". Without a wake word, the voice pipeline would either run continuously (expensive and slow) or require a push-to-talk button (breaks natural interaction). The challenge: wake word detection must be always-on, low-power, and private — meaning it must run entirely on the device, with no audio ever sent to a server.
What Wake Word Engines Do
A wake word engine continuously analyses small audio frames (typically 10–30ms chunks) and classifies each frame as "wake word present" or "not wake word". When it detects the phrase, it signals the full STT pipeline to start recording and processing. The engine runs on the client — a phone, a Raspberry Pi, a microcontroller — consuming minimal CPU/memory so the device can remain responsive while always listening.
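The frame-by-frame loop can be sketched in a few lines. This is a toy illustration, not any engine's real API: the classifier below just normalises the frame's peak sample level, where a real engine would run a small neural model, and the 16 kHz / 20 ms figures are assumed typical values within the range above.

```python
FRAME_MS = 20          # a typical frame size within the 10-30ms range
SAMPLE_RATE = 16000    # 16 kHz mono PCM, the usual input format (assumed)
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per frame

def score_frame(frame):
    """Hypothetical per-frame classifier returning a wake-word
    confidence in [0, 1]. A real engine runs a small neural model
    here; this toy stand-in normalises the frame's peak level."""
    peak = max((abs(s) for s in frame), default=0)
    return min(peak / 32768, 1.0)

def first_detection(frames, threshold=0.5):
    """Return the index of the first frame classified as 'wake word
    present' (confidence >= threshold), or None if nothing triggers."""
    for i, frame in enumerate(frames):
        if score_frame(frame) >= threshold:
            return i
    return None
```

On detection, `first_detection` returning an index is the moment the engine would fire its trigger event to wake the rest of the pipeline.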
Two error types define the core trade-off:
- False positives (false alarms): engine triggers on speech that is not the wake word. Causes accidental activation — the AI starts listening when not intended. Measured as false alarms per hour. Industry target: <1 false alarm per 10 hours of background speech.
- False negatives (missed detections): engine fails to trigger when the wake word is genuinely spoken. Causes the user to repeat themselves. Measured as miss rate. Industry target: <3% miss rate in clean speech conditions.
This trade-off is controlled by a detection threshold: a lower threshold means fewer misses but more false alarms; a higher threshold means fewer false alarms but more misses.
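The effect of moving the threshold can be seen on a small labelled batch. The confidence scores below are invented for illustration; the point is only how the same scores yield different miss / false-alarm counts at different thresholds.

```python
# (confidence, is_wake_word) pairs from a hypothetical engine:
SCORES = [
    (0.92, True), (0.81, True), (0.55, True),    # genuine wake words
    (0.60, False), (0.35, False), (0.10, False), # background speech
]

def error_rates(threshold):
    """Count misses (wake word scored below threshold) and false
    alarms (background speech scored at or above threshold)."""
    misses = sum(1 for s, wake in SCORES if wake and s < threshold)
    false_alarms = sum(1 for s, wake in SCORES if not wake and s >= threshold)
    return misses, false_alarms

print(error_rates(0.5))  # -> (0, 1): no misses, one false alarm
print(error_rates(0.7))  # -> (1, 0): one miss, no false alarms
```

Raising the threshold from 0.5 to 0.7 trades the false alarm for a miss, which is exactly the dial a deployment tunes against the industry targets above.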
openWakeWord (Open-Source)
openWakeWord is an open-source wake word detection framework designed for real-world accuracy on commodity hardware. It was popularised by its adoption in Home Assistant for local voice control.
How it works
- Built on Google's open-source audio embedding model (pre-trained on AudioSet)
- Fine-tuned per wake word using Piper TTS — generates thousands of audio clips with diverse speaker voices, room acoustics, and noise augmentation
- Training a new wake word requires only the text phrase — no recorded audio samples needed
- Runs 15–20 models simultaneously on a single core of a Raspberry Pi 3
- Python-based; integrates with Wyoming protocol (Home Assistant)
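The multi-model behaviour above can be sketched as a per-chunk scoring loop. openWakeWord's actual API differs; this stub mimics only the shape of its output, a per-wake-word confidence dict per audio chunk, and the "models" are plain callables over string stand-ins for audio, not trained networks.

```python
def predict_all(chunk, models):
    """Score one audio chunk against several wake-word models at once,
    returning a {wake_word: confidence} dict -- the shape of result
    openWakeWord produces per chunk. The models here are stubs."""
    return {name: fn(chunk) for name, fn in models.items()}

def stream_detect(chunks, models, threshold=0.5):
    """Feed successive chunks through every model; return the first
    (chunk_index, wake_word) pair crossing the threshold, or None."""
    for i, chunk in enumerate(chunks):
        for name, score in predict_all(chunk, models).items():
            if score >= threshold:
                return i, name
    return None

# Stub models: each maps a chunk to a confidence score.
models = {
    "hey_jarvis": lambda c: 0.9 if "jarvis" in c else 0.0,
    "computer":   lambda c: 0.8 if "computer" in c else 0.0,
}
```

Because each model is scored independently on the same chunk, adding another wake word costs one more entry in the dict, which is why running 15-20 models side by side stays cheap.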
Best for
- Home automation and self-hosted voice assistants
- Projects where custom wake words are needed without recording real speakers
- Raspberry Pi and similar SBC deployments
- Open-source, privacy-first builds where commercial licensing is not acceptable
- Teams with ML knowledge who can tune the model
Limitation:
openWakeWord models are likely too large for microcontrollers and highly constrained embedded hardware (<1MB RAM). Porcupine is better suited for these environments.
Porcupine (Picovoice — Commercial)
Porcupine is a highly accurate, commercial wake word engine from Picovoice, optimised for constrained hardware from microcontrollers to smartphones. It is the most widely deployed on-device wake word engine in commercial products.
Key capabilities
- 97%+ accuracy with <1 false alarm per 10 hours in background speech conditions
- Custom wake words: type in the phrase, receive a trained model within seconds via transfer learning — no audio recording needed
- Runs on ARM Cortex-M4 (microcontroller), Raspberry Pi, Android, iOS, web
- SDKs for Python, iOS, Android, Web, React Native, .NET, Java, Go
- Consistent performance across accents and noise conditions
Best for
- Commercial products requiring reliability guarantees
- Microcontroller and highly constrained embedded deployments
- Mobile apps (iOS/Android) needing always-on wake word
- Enterprise voice products where support and SLAs matter
- Multi-language or multi-accent requirements
openWakeWord vs Porcupine
| Dimension | openWakeWord | Porcupine |
|---|---|---|
| Licence | Apache 2.0 (open-source) | Commercial (free tier available) |
| Custom wake words | Yes (TTS-generated training) | Yes (type phrase → model in seconds) |
| Accuracy | Good (can exceed Porcupine with tuning) | 97%+ with <1 false alarm / 10hrs |
| Minimum hardware | Raspberry Pi 3 (single core) | ARM Cortex-M4 (microcontroller) |
| Languages | English-focused (expanding) | Multi-language |
| Training data needed | None (TTS-generated) | None (transfer learning from text) |
| Home Assistant support | Native (Wyoming protocol) | Via custom integration |
Privacy: Why On-Device Matters
The privacy guarantee of wake word detection depends entirely on where the audio is processed. A cloud-based wake word engine must stream microphone audio to a server 24/7 — the server "decides" when the wake word was said. An on-device engine processes audio locally; no audio leaves the device until the wake word is confirmed.
On-device privacy guarantees
- No audio is transmitted until the wake word is detected locally
- Works offline — no network required for activation
- No vendor can collect ambient audio
- Compliant with data sovereignty laws (GDPR, HIPAA contexts)
- Cannot be affected by server outages or rate limits during wake detection
What to tell users
- Clearly document what audio is sent after wake word detection and where
- Provide a mute/disable button that is hardware-enforced, not software-only
- Log activations so users can audit when the device was triggered
- False positive audio (overheard in a false alarm) should not be stored or processed — discard if the STT returns low-confidence results
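The last two recommendations can be combined in a small audit log. This is a sketch under assumptions: the 0.6 STT-confidence cutoff is invented, and a real implementation would also need to delete the buffered audio, not merely flag it.

```python
import time

class ActivationLog:
    """Audit trail of wake-word activations, so users can review
    when the device was triggered and whether audio was kept."""

    def __init__(self, min_stt_confidence=0.6):  # cutoff is illustrative
        self.min_stt_confidence = min_stt_confidence
        self.entries = []

    def record(self, wake_word, stt_confidence):
        """Log one activation. Low STT confidence marks the activation
        as a likely false alarm, whose audio should be discarded."""
        kept = stt_confidence >= self.min_stt_confidence
        self.entries.append({
            "time": time.time(),
            "wake_word": wake_word,
            "kept": kept,  # False => discard the overheard audio
        })
        return kept
```

Surfacing `entries` in a settings screen gives users the audit trail the bullet above asks for, including the false alarms that were discarded.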
Integrating Wake Words into a Voice Pipeline
Wake word → pipeline trigger flow:
- Wake word engine runs continuously, analysing 10–30ms audio frames (CPU only)
- Detection confidence exceeds threshold → engine fires a trigger event
- Pipeline activates VAD to capture the following utterance
- Audio (post-wake-word) is buffered and sent to STT
- On low-confidence STT result or very short utterance: discard, return to listening
- On valid transcript: continue to LLM → TTS
- After response completes: return to wake word listening mode
The wake word itself is typically excluded from the STT transcript — the utterance that matters starts immediately after the wake phrase.
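The trigger flow above is naturally a small state machine. A minimal sketch, with plain-string events standing in for the real payloads (audio buffers, transcripts) a pipeline would carry:

```python
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()   # wake word engine running, nothing recorded
    CAPTURING = auto()   # VAD capturing the post-wake-word utterance
    RESPONDING = auto()  # LLM -> TTS producing the reply

def next_state(state, event):
    """Transition table for the trigger flow. Unknown events leave
    the state unchanged."""
    table = {
        (State.LISTENING, "wake_detected"): State.CAPTURING,
        (State.CAPTURING, "low_confidence_stt"): State.LISTENING,  # discard
        (State.CAPTURING, "valid_transcript"): State.RESPONDING,
        (State.RESPONDING, "response_done"): State.LISTENING,
    }
    return table.get((state, event), state)
```

Note that both the low-confidence branch and the completed response return to `LISTENING`: the device always ends up back in wake word mode, never left open-mic.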
Checklist: Do You Understand This?
- What is the difference between a false positive and false negative in wake word detection, and what is the industry target for each?
- How does openWakeWord generate training data for a custom wake word without recording real speakers?
- Why can Porcupine run on a microcontroller but openWakeWord cannot?
- What is the privacy difference between on-device and cloud-based wake word detection?
- In the wake word → pipeline trigger flow, what happens when the STT returns a low-confidence result after an activation?
- What is the detection threshold, and what happens when you raise vs lower it?