Text-to-Speech
Google offers two generations of text-to-speech for developers: Chirp 3 HD Voices via the Cloud Text-to-Speech API (traditional SSML-based approach), and the newer Gemini-TTS models (natural language voice control, no SSML required). Both are generally available and serve different use cases along the control-vs-simplicity tradeoff.
Chirp 3 HD Voices
Chirp 3 HD Voices are available through the Cloud Text-to-Speech API and represent the current state of the art for the traditional (SSML-controlled) TTS approach.
- Voices: 8 speaker voices
- Coverage: 31 locales, globally available across us, eu, and asia-southeast1 regions
- Speed control: 0.25× to 2× playback rate
- SSML support: Limited tag set (<phoneme>, <p>, <s>, <sub>, <say-as>) for pronunciation and formatting control
- Custom pauses: Insert timed pauses at any point in the speech output
- Custom pronunciations: Override default phoneme pronunciation for brand names, technical terms, acronyms
- Delivery: Both real-time streaming and batch processing supported
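As a rough illustration of working within these limits, the sketch below builds a small SSML document from the tags listed above, flags any tag outside that set, and clamps a requested speaking rate to the 0.25× to 2× range. It is plain string handling and does not call the Cloud Text-to-Speech API. The helper names (`sub`, `build_ssml`, `unsupported_tags`, `clamp_rate`) are hypothetical, and treating `<speak>` as an allowed root element is an assumption based on standard SSML rather than something stated here.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Tags named in the list above, plus <speak> as the root element
# (the root is an assumption taken from standard SSML).
ALLOWED_TAGS = {"speak", "phoneme", "p", "s", "sub", "say-as"}
MIN_RATE, MAX_RATE = 0.25, 2.0


def clamp_rate(rate: float) -> float:
    """Keep a requested speaking rate inside the 0.25x to 2x range."""
    return max(MIN_RATE, min(MAX_RATE, rate))


def unsupported_tags(ssml: str) -> set[str]:
    """Return the tag names in `ssml` that are not in ALLOWED_TAGS."""
    found = set(re.findall(r"</?\s*([A-Za-z][\w-]*)", ssml))
    return found - ALLOWED_TAGS


def sub(alias: str, text: str) -> str:
    """Wrap `text` in <sub> so it is spoken as `alias`."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def build_ssml(sentences: list[str]) -> str:
    """Wrap already-escaped sentence fragments in <s>, <p>, and <speak>."""
    body = "".join(f"<s>{sentence}</s>" for sentence in sentences)
    return f"<speak><p>{body}</p></speak>"


ssml = build_ssml(["Welcome to " + sub("World Wide Web Consortium", "W3C") + "."])
print(ssml)
print(unsupported_tags("<speak><emphasis>hi</emphasis></speak>"))
print(clamp_rate(3.0))
```

Checking markup before sending it is one way to catch SSML written for older voices that relies on tags outside this set.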
Instant Custom Voice
A notable capability in the Chirp 3 HD tier is Instant Custom Voice: you provide audio recordings of a speaker, and Google builds a personalised voice model from them. The generated voice retains the speaker's distinctive characteristics — cadence, timbre, accent. Use cases include personalised voice assistants, audiobook narration in the author's own voice, and branded voice identities for enterprise applications.
Gemini-TTS
Gemini-TTS takes a fundamentally different approach to voice control. Instead of XML-based SSML markup, you describe how you want the voice to sound in plain English — as part of your prompt.
Available models:
- gemini-2.5-flash-tts: Fast generation, suitable for high-throughput applications
- gemini-2.5-pro-tts: Highest quality, more nuanced prosody control
Key characteristics:
- 30 distinct speaker voices across 80+ locales
- Natural language control: Prompt phrases like “speak slowly and warmly”, “read this like a news anchor”, or “use an excited tone for the product launch announcement” — no SSML required
- Available via: Cloud Text-to-Speech API or Vertex AI API
- Native audio: The Gemini 2.5 Native Audio model (same one powering Gemini Live) generates speech as a direct model output rather than a post-processing step
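To make the contrast with SSML concrete, the sketch below assembles a request body in which the style instruction is an ordinary sentence sitting next to the text to be spoken. It only builds and prints a dictionary and sends nothing over the network. The JSON field names (`input.prompt`, `input.text`, `voice.modelName`, and so on), the `MP3` encoding value, and the placeholder voice name `EXAMPLE_VOICE` are assumptions for illustration and should be checked against the current API reference. Only the two model IDs come from the list above.

```python
import json

# Model IDs from the list above.
GEMINI_TTS_MODELS = ("gemini-2.5-flash-tts", "gemini-2.5-pro-tts")


def build_gemini_tts_request(
    text: str,
    style_prompt: str,
    model: str = "gemini-2.5-flash-tts",
    voice_name: str = "EXAMPLE_VOICE",  # hypothetical placeholder, not a real voice
    language_code: str = "en-US",
) -> dict:
    """Build a request body; field names are assumed, not confirmed."""
    if model not in GEMINI_TTS_MODELS:
        raise ValueError(f"unknown Gemini-TTS model: {model}")
    return {
        # The style instruction is plain language, with no markup.
        "input": {"prompt": style_prompt, "text": text},
        "voice": {
            "languageCode": language_code,
            "name": voice_name,
            "modelName": model,
        },
        "audioConfig": {"audioEncoding": "MP3"},
    }


request = build_gemini_tts_request(
    text="Our new product is available today.",
    style_prompt="Use an excited tone for the product launch announcement.",
)
print(json.dumps(request, indent=2))
```

Switching to `gemini-2.5-pro-tts` changes only the `model` argument; the style prompt and text stay the same.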
Gemini Live — Conversational Audio
For real-time voice applications, the Gemini Live API combines STT, language model reasoning, and TTS into a single end-to-end model (Gemini 2.5 Native Audio). Audio goes in, audio comes out — without the latency of three separate API calls. This produces more natural-sounding conversational responses because prosody, emphasis, and pacing are generated by the same model that understands the meaning, not by a separate TTS stage.
Choosing Between Chirp 3 HD and Gemini-TTS
Use Chirp 3 HD When
- You need precise phoneme-level pronunciation control (medical, legal, technical terms)
- You require specific SSML formatting features
- You need Instant Custom Voice (voice cloning)
- You have existing SSML markup you want to reuse
Use Gemini-TTS When
- You want to control voice style through natural language prompts
- You need a broader range of voices and locales to choose from
- You want the highest quality prosody for conversational applications
- You are already using the Gemini API and want a unified SDK
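The two lists above can be read as a simple decision rule, sketched below. The `TtsRequirements` class and `recommend` function are hypothetical names. Letting any Chirp 3 HD requirement take priority is an assumption of this sketch: those items describe capabilities that this comparison attributes to Chirp 3 HD only, while the Gemini-TTS items are closer to preferences.

```python
from dataclasses import dataclass


@dataclass
class TtsRequirements:
    # Requirements from the "Use Chirp 3 HD When" list.
    needs_phoneme_control: bool = False
    needs_ssml_features: bool = False
    needs_custom_voice: bool = False
    has_existing_ssml: bool = False
    # Preferences from the "Use Gemini-TTS When" list.
    wants_prompt_style_control: bool = False
    wants_more_voices_and_locales: bool = False
    wants_best_conversational_prosody: bool = False
    already_uses_gemini_api: bool = False


def recommend(req: TtsRequirements) -> str:
    """Return 'Chirp 3 HD', 'Gemini-TTS', or 'either' for the given needs."""
    chirp_reasons = (
        req.needs_phoneme_control,
        req.needs_ssml_features,
        req.needs_custom_voice,
        req.has_existing_ssml,
    )
    gemini_reasons = (
        req.wants_prompt_style_control,
        req.wants_more_voices_and_locales,
        req.wants_best_conversational_prosody,
        req.already_uses_gemini_api,
    )
    if any(chirp_reasons):
        # Assumed priority: hard requirements win over preferences.
        return "Chirp 3 HD"
    if any(gemini_reasons):
        return "Gemini-TTS"
    return "either"


print(recommend(TtsRequirements(needs_custom_voice=True)))
print(recommend(TtsRequirements(wants_prompt_style_control=True)))
```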
Checklist
- How many voices does Chirp 3 HD offer, and how many locales does Gemini-TTS cover?
- What is Instant Custom Voice, and what audio input does it require?
- How do you control speaking style in Gemini-TTS — what replaces SSML?
- What are the two Gemini-TTS model options and when would you choose each?
- Why does Gemini Live produce more natural-sounding speech than a traditional TTS pipeline?