Text-to-Speech

Google offers two generations of text-to-speech for developers: Chirp 3 HD Voices via the Cloud Text-to-Speech API (the traditional SSML-based approach), and the newer Gemini-TTS models (natural language voice control, no SSML required). Both are generally available; they sit at different points on the control-versus-simplicity tradeoff, with Chirp 3 HD favouring explicit control and Gemini-TTS favouring simplicity.

Chirp 3 HD Voices

Chirp 3 HD Voices are available through the Cloud Text-to-Speech API and are Google's most recent voices for the traditional, SSML-controlled TTS approach.

  • Voices: 8 speaker voices
  • Coverage: 31 locales, globally available across us, eu, and asia-southeast1 regions
  • Speed control: 0.25× to 2× playback rate
  • SSML support: Limited tag set — <phoneme>, <p>, <s>, <sub>, <say-as> — for pronunciation and formatting control
  • Custom pauses: Insert timed pauses at any point in the speech output
  • Custom pronunciations: Override default phoneme pronunciation for brand names, technical terms, acronyms
  • Delivery: Both real-time streaming and batch processing supported
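
As a rough illustration of the SSML and speed controls listed above, the sketch below builds a request body for the Cloud Text-to-Speech text:synthesize REST method using only the Python standard library. The helper names (sub, say_as, build_request) are hypothetical, the voice name en-US-Chirp3-HD-Charon is an assumed example, and the enclosing <speak> element is assumed to be accepted alongside the listed tags; check the current API reference before relying on any of them.

```python
import json
from xml.sax.saxutils import escape, quoteattr

# Speaking-rate bounds taken from the feature list above (0.25x to 2x).
MIN_RATE, MAX_RATE = 0.25, 2.0


def sub(alias: str, text: str) -> str:
    """Wrap text in <sub> so that `alias` is spoken in its place."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def say_as(interpret_as: str, text: str) -> str:
    """Wrap text in <say-as>, for example to spell out an acronym."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def build_request(ssml_body: str, voice_name: str, language_code: str,
                  speaking_rate: float = 1.0) -> dict:
    """Return a JSON-serialisable body for a text:synthesize call.

    Nothing is sent over the network here; the dict is only constructed.
    """
    if not MIN_RATE <= speaking_rate <= MAX_RATE:
        raise ValueError(f"speaking_rate must be within [{MIN_RATE}, {MAX_RATE}]")
    return {
        "input": {"ssml": f"<speak>{ssml_body}</speak>"},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3", "speakingRate": speaking_rate},
    }


# Example: pronounce "W3C" as its full name, slightly slower than default.
body = "<p><s>" + sub("World Wide Web Consortium", "W3C") + " publishes the SSML standard.</s></p>"
request = build_request(body, "en-US-Chirp3-HD-Charon", "en-US", speaking_rate=0.9)
print(json.dumps(request, indent=2))
```

Validating the speaking rate locally keeps out-of-range values from reaching the API, and escaping the text content avoids producing malformed SSML when the input contains characters such as & or <.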

Instant Custom Voice

A notable capability in the Chirp 3 HD tier is Instant Custom Voice: you provide audio recordings of a speaker, and Google builds a personalised voice model from them. The generated voice retains the speaker's distinctive characteristics — cadence, timbre, accent. Use cases include personalised voice assistants, audiobook narration in the author's own voice, and branded voice identities for enterprise applications.

Gemini-TTS

Gemini-TTS takes a fundamentally different approach to voice control. Instead of XML-based SSML markup, you describe how the voice should sound in natural language, as part of your prompt.

Available models:

  • gemini-2.5-flash-tts — Fast generation, suitable for high-throughput applications
  • gemini-2.5-pro-tts — Highest quality, more nuanced prosody control

Key characteristics:

  • 30 distinct speaker voices across 80+ locales
  • Natural language control: Prompt phrases like “speak slowly and warmly”, “read this like a news anchor”, or “use an excited tone for the product launch announcement” — no SSML required
  • Available via: Cloud Text-to-Speech API or Vertex AI API
  • Native audio: The Gemini 2.5 Native Audio model (same one powering Gemini Live) generates speech as a direct model output rather than a post-processing step
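
To make the contrast with SSML concrete, here is a minimal sketch of a Gemini-TTS request body for the Cloud Text-to-Speech API, again built with the standard library only. It assumes the style instruction travels in a prompt field next to the text and that the model is selected through a modelName field on the voice; the function name build_gemini_tts_request, the tier labels, and the voice name Kore are illustrative assumptions rather than confirmed details.

```python
import json

# Model IDs as listed above; the "fast"/"quality" labels are hypothetical.
GEMINI_TTS_MODELS = {
    "fast": "gemini-2.5-flash-tts",
    "quality": "gemini-2.5-pro-tts",
}


def build_gemini_tts_request(text: str, style_prompt: str, voice_name: str,
                             language_code: str, tier: str = "fast") -> dict:
    """Return a JSON-serialisable request body; no network call is made."""
    if tier not in GEMINI_TTS_MODELS:
        raise ValueError(f"tier must be one of {sorted(GEMINI_TTS_MODELS)}")
    return {
        # The style is plain language, not markup (field name assumed).
        "input": {"prompt": style_prompt, "text": text},
        "voice": {
            "languageCode": language_code,
            "name": voice_name,
            "modelName": GEMINI_TTS_MODELS[tier],  # field name assumed
        },
        "audioConfig": {"audioEncoding": "LINEAR16"},
    }


gemini_request = build_gemini_tts_request(
    text="Our new release ships today.",
    style_prompt="Use an excited tone for the product launch announcement",
    voice_name="Kore",
    language_code="en-US",
    tier="quality",
)
print(json.dumps(gemini_request, indent=2))
```

Switching between the two models is then a one-argument change, which makes it easy to compare throughput against prosody quality on the same input.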

Gemini Live — Conversational Audio

For real-time voice applications, the Gemini Live API combines STT, language model reasoning, and TTS into a single end-to-end model (Gemini 2.5 Native Audio). Audio goes in, audio comes out — without the latency of three separate API calls. This produces more natural-sounding conversational responses because prosody, emphasis, and pacing are generated by the same model that understands the meaning, not by a separate TTS stage.

Choosing Between Chirp 3 HD and Gemini-TTS

Use Chirp 3 HD When

  • You need precise phoneme-level pronunciation control (medical, legal, technical terms)
  • You require specific SSML formatting features
  • You need Instant Custom Voice (voice cloning)
  • You have existing SSML markup you want to reuse

Use Gemini-TTS When

  • You want to control voice style through natural language prompts
  • You need a broader range of voices and locales to choose from
  • You want the highest quality prosody for conversational applications
  • You are already using the Gemini API and want a unified SDK
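
The two lists above can be encoded as a small decision helper. This is only a sketch: the Requirements fields and the recommend function are hypothetical names, and counting every criterion with equal weight is a simplifying assumption rather than an official recommendation.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Criteria that point towards Chirp 3 HD
    phoneme_control: bool = False
    ssml_features: bool = False
    custom_voice: bool = False
    existing_ssml: bool = False
    # Criteria that point towards Gemini-TTS
    natural_language_style: bool = False
    wide_voice_locale_range: bool = False
    conversational_prosody: bool = False
    uses_gemini_api: bool = False


def recommend(req: Requirements) -> str:
    """Count matching criteria per option and return the stronger match."""
    chirp = sum([req.phoneme_control, req.ssml_features,
                 req.custom_voice, req.existing_ssml])
    gemini = sum([req.natural_language_style, req.wide_voice_locale_range,
                  req.conversational_prosody, req.uses_gemini_api])
    if chirp > gemini:
        return "Chirp 3 HD"
    if gemini > chirp:
        return "Gemini-TTS"
    return "either"


choice = recommend(Requirements(phoneme_control=True, existing_ssml=True,
                                uses_gemini_api=True))
print(choice)
```

In practice some criteria are hard constraints (Instant Custom Voice appears only in the Chirp 3 HD list), so a real selection would likely treat those as overriding rather than as one vote among several.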

Checklist

  • How many voices does Chirp 3 HD offer, and how many locales does Gemini-TTS cover?
  • What is Instant Custom Voice, and what audio input does it require?
  • How do you control speaking style in Gemini-TTS — what replaces SSML?
  • What are the two Gemini-TTS model options and when would you choose each?
  • Why does Gemini Live produce more natural-sounding speech than a traditional TTS pipeline?