Text-to-Speech

Google offers two generations of text-to-speech for developers: Chirp 3 HD Voices via the Cloud Text-to-Speech API (traditional SSML-based approach), and the newer Gemini-TTS models (natural language voice control, no SSML required). Both are generally available and serve different use cases along the control-vs-simplicity tradeoff.

Chirp 3 HD Voices

Chirp 3 HD Voices are available through the Cloud Text-to-Speech API and are the most capable option for the traditional, SSML-controlled TTS approach.

Voices: 8 speaker voices
Coverage: 31 locales, globally available across us, eu, and asia-southeast1 regions
Speed control: 0.25× to 2× playback rate
SSML support: Limited tag set — <phoneme>, <p>, <s>, <sub>, <say-as> — for pronunciation and formatting control
Custom pauses: Insert timed pauses at any point in the speech output
Custom pronunciations: Override default phoneme pronunciation for brand names, technical terms, acronyms
Delivery: Both real-time streaming and batch processing supported
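
Because the SSML tag set and the speed range are both restricted, it can help to check input before sending it to the service. The following is a minimal local sketch using only the Python standard library; it does not call the Cloud Text-to-Speech API. The helper names (unsupported_tags, clamp_rate) are hypothetical, and treating <speak> as the root element is an assumption based on standard SSML rather than something stated above.

```python
import xml.etree.ElementTree as ET

# Tag set taken from the list above; <speak> is assumed as the usual SSML root.
ALLOWED_TAGS = {"speak", "phoneme", "p", "s", "sub", "say-as"}
MIN_RATE, MAX_RATE = 0.25, 2.0  # speed range quoted above


def unsupported_tags(ssml: str) -> list[str]:
    """Return the sorted tag names in `ssml` that are outside ALLOWED_TAGS."""
    root = ET.fromstring(ssml)
    return sorted({el.tag for el in root.iter()} - ALLOWED_TAGS)


def clamp_rate(rate: float) -> float:
    """Clamp a requested speaking rate into the supported range."""
    return max(MIN_RATE, min(MAX_RATE, rate))


ssml = (
    "<speak><p><s>The <sub alias='World Wide Web Consortium'>W3C</sub> meets at "
    "<say-as interpret-as='time'>9:30</say-as>.</s>"
    "<s><emphasis>Do not be late.</emphasis></s></p></speak>"
)
print(unsupported_tags(ssml))  # ['emphasis']
print(clamp_rate(3.0))         # 2.0
```

Markup that passes this check is only known to use the listed tags; whether the service accepts every attribute on them is not verified here.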

Instant Custom Voice

A notable capability in the Chirp 3 HD tier is Instant Custom Voice: you provide audio recordings of a speaker, and Google builds a personalised voice model from them. The generated voice retains the speaker's distinctive characteristics — cadence, timbre, accent. Use cases include personalised voice assistants, audiobook narration in the author's own voice, and branded voice identities for enterprise applications.

Gemini-TTS

Gemini-TTS takes a fundamentally different approach to voice control. Instead of XML-based SSML markup, you describe how you want the voice to sound in plain English — as part of your prompt.

Available models:

gemini-2.5-flash-tts — Fast generation, suitable for high-throughput applications
gemini-2.5-pro-tts — Highest quality, more nuanced prosody control

Key characteristics:

Voices: 30 distinct speaker voices across 80+ locales
Natural language control: Prompt phrases like “speak slowly and warmly”, “read this like a news anchor”, or “use an excited tone for the product launch announcement” — no SSML required
Available via: Cloud Text-to-Speech API or Vertex AI API
Native audio: The Gemini 2.5 Native Audio model (the same one that powers Gemini Live) generates speech as a direct model output rather than as a post-processing step
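
To illustrate the prompt-based approach described above, here is a minimal sketch of how a request might be assembled, with the style instruction kept separate from the plain text to be spoken. It runs locally and does not contact any service. The function build_request and the dictionary keys ("model", "prompt", "text") are hypothetical placeholders, not the actual request schema of the Cloud Text-to-Speech or Vertex AI APIs; only the two model IDs come from the list above.

```python
import json

# Model IDs as listed above.
MODELS = {"fast": "gemini-2.5-flash-tts", "quality": "gemini-2.5-pro-tts"}


def build_request(text: str, style_prompt: str, priority: str = "fast") -> dict:
    """Assemble a hypothetical synthesis request: style goes in a prompt, not in SSML."""
    if priority not in MODELS:
        raise ValueError(f"priority must be one of {sorted(MODELS)}")
    return {
        "model": MODELS[priority],  # hypothetical field name
        "prompt": style_prompt,     # natural-language style instruction
        "text": text,               # plain text to speak, no markup
    }


request = build_request(
    text="Our new app launches today.",
    style_prompt="Use an excited tone for the product launch announcement.",
)
print(json.dumps(request, indent=2))
```

Passing priority="quality" selects gemini-2.5-pro-tts instead, which matches the guidance above to prefer the pro model when prosody nuance matters more than throughput.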

Gemini Live — Conversational Audio

For real-time voice applications, the Gemini Live API combines speech-to-text (STT), language model reasoning, and TTS into a single end-to-end model (Gemini 2.5 Native Audio). Audio goes in and audio comes out, without the latency of three separate API calls. The result is more natural-sounding conversational speech, because prosody, emphasis, and pacing are generated by the same model that understands the meaning rather than by a separate TTS stage.

Choosing Between Chirp 3 HD and Gemini-TTS

Use Chirp 3 HD When

You need precise phoneme-level pronunciation control (medical, legal, technical terms)
You require specific SSML formatting features
You need Instant Custom Voice (voice cloning)
You have existing SSML markup you want to reuse

Use Gemini-TTS When

You want to control voice style through natural language prompts
You need a broader range of voices and locales (30 voices across 80+ locales, versus 8 voices across 31 locales for Chirp 3 HD)
You want the highest-quality prosody for conversational applications
You are already using the Gemini API and want a unified SDK
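
The criteria in the two lists above can be condensed into a small decision helper. This is only an illustrative sketch: the Needs class, its field names, and the recommend function are hypothetical, and the logic simply mirrors the bullet points rather than any official selection guidance.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    # Requirements that point to Chirp 3 HD
    phoneme_control: bool = False   # precise phoneme-level pronunciation
    ssml_features: bool = False     # specific SSML features or existing SSML markup
    custom_voice: bool = False      # Instant Custom Voice
    # Requirements that point to Gemini-TTS
    natural_language_style: bool = False
    wide_voice_coverage: bool = False
    unified_gemini_sdk: bool = False


def recommend(needs: Needs) -> str:
    """Map a set of requirements onto the guidance in the lists above."""
    chirp = needs.phoneme_control or needs.ssml_features or needs.custom_voice
    gemini = (
        needs.natural_language_style
        or needs.wide_voice_coverage
        or needs.unified_gemini_sdk
    )
    if chirp and gemini:
        return "mixed: requirements point both ways, so prioritise them"
    if chirp:
        return "Chirp 3 HD"
    if gemini:
        return "Gemini-TTS"
    return "either"


print(recommend(Needs(custom_voice=True)))            # Chirp 3 HD
print(recommend(Needs(natural_language_style=True)))  # Gemini-TTS
```

The "mixed" branch is deliberate: a project that needs both voice cloning and prompt-based style control has conflicting requirements under the criteria above, and the helper flags that instead of picking a side.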

Checklist

How many voices does Chirp 3 HD offer, and how many locales does Gemini-TTS cover?
What is Instant Custom Voice, and what audio input does it require?
How do you control speaking style in Gemini-TTS — what replaces SSML?
What are the two Gemini-TTS model options and when would you choose each?
Why does Gemini Live produce more natural-sounding speech than a traditional TTS pipeline?