🧠 All Things AI
Advanced

Speech-to-Text

Google offers multiple paths for speech-to-text depending on your use case. The Cloud Speech-to-Text API (with Chirp 3) is the dedicated STT service for bulk transcription and real-time audio streams. Gemini's native audio input combines transcription and understanding in a single model call, and is better suited for conversational and agentic applications.

Cloud Speech-to-Text — Chirp 3 Transcription

Chirp 3 Transcription is the latest-generation model in Google Cloud's Speech-to-Text API V2, now generally available. It represents a significant accuracy and speed improvement over the previous Chirp 2 model, with two key additions:

  • Speaker diarization: Automatically identifies and labels different speakers in a recording — “Speaker 1 said X, Speaker 2 said Y.” Critical for meeting transcription, call centre analysis, and interview processing.
  • Automatic language detection: Identifies which language is being spoken in a multilingual audio file without you specifying it upfront. Handles language switching within a single recording.

The model was trained on millions of hours of audio and billions of text sentences, giving it broad language coverage and robustness to accents, background noise, and audio quality variation.
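A minimal sketch of what a Chirp 3 request looks like through the Speech-to-Text API V2 Python client, with both new features switched on. The model identifier (`chirp_3`), the recognizer path, and the exact field names are assumptions based on the V2 API shape; check the current Speech-to-Text documentation before relying on them. `format_diarized_turns` is a hypothetical helper showing how per-word speaker labels collapse into readable "Speaker 1 said X" turns.

```python
def format_diarized_turns(words):
    """Collapse per-word (speaker, text) pairs into labelled speaker turns."""
    turns = []
    for speaker, text in words:
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(text)   # same speaker: extend the current turn
        else:
            turns.append((speaker, [text]))  # new speaker: start a new turn
    return [f"Speaker {s}: {' '.join(ws)}" for s, ws in turns]


def transcribe_with_chirp3(project_id: str, audio_uri: str):
    """Hypothetical helper: transcribe a GCS audio file with speaker
    diarization and automatic language detection enabled."""
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech

    client = SpeechClient()
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["auto"],   # automatic language detection
        model="chirp_3",           # assumed model ID; verify in the docs
        features=cloud_speech.RecognitionFeatures(
            diarization_config=cloud_speech.SpeakerDiarizationConfig(
                min_speaker_count=2,
                max_speaker_count=6,
            ),
        ),
    )
    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{project_id}/locations/global/recognizers/_",
        config=config,
        uri=audio_uri,
    )
    return client.recognize(request=request)
```

Diarized output arrives word by word with a speaker tag attached; something like `format_diarized_turns` turns that stream into the "Speaker 1 / Speaker 2" transcript a meeting or call-centre pipeline actually wants.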

Gemini Native Audio Input

Gemini 2.0 and 2.5 models accept audio files directly in the API — not as a separate transcription step but as part of the model's multimodal input. This enables a different development pattern:

  • Pass audio + a question or instruction in a single API call
  • Gemini transcribes, understands context, and responds — one step instead of two
  • Works well for: “Summarise this meeting recording”, “What action items were mentioned in this audio?”, “Translate and summarise this voice message”

This is more efficient than calling a dedicated STT API followed by a separate LLM call, and it avoids intermediate transcription errors propagating to the reasoning step.
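The single-call pattern looks roughly like this with the google-genai Python SDK: audio bytes and an instruction go into one `generate_content` request, and the answer comes back directly. The model name and the `ask_about_audio` helper are illustrative assumptions, not a pinned recipe; `guess_audio_mime` is a small hypothetical utility for picking a MIME type Gemini accepts.

```python
AUDIO_MIME_TYPES = {
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
    ".flac": "audio/flac",
    ".ogg": "audio/ogg",
}


def guess_audio_mime(path: str) -> str:
    """Map a file extension to a MIME type for Gemini audio input."""
    import os
    ext = os.path.splitext(path)[1].lower()
    return AUDIO_MIME_TYPES.get(ext, "application/octet-stream")


def ask_about_audio(path: str, question: str) -> str:
    """Hypothetical helper: one API call that transcribes *and* reasons --
    no intermediate transcript ever exists."""
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment
    with open(path, "rb") as f:
        audio = f.read()
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # assumed model ID
        contents=[
            question,
            types.Part.from_bytes(data=audio, mime_type=guess_audio_mime(path)),
        ],
    )
    return response.text
```

Usage would be a single line such as `ask_about_audio("meeting.mp3", "What action items were mentioned?")`: one round trip instead of an STT call followed by an LLM call.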

Gemini Live API — Real-Time Voice

The Gemini Live API supports real-time, low-latency bidirectional audio — the same technology powering the Gemini Live voice conversation feature in the Gemini app. Available on Vertex AI as a generally available service, it enables developers to build voice-driven applications with natural interruption handling, turn-taking, and conversational continuity.

Gemini Live uses Gemini 2.5 Flash Native Audio — a variant of the model with audio output capabilities built in (not a post-processing TTS step). This results in more natural voice conversations compared to a pipeline of STT → LLM → TTS.
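A sketch of what a Live API session looks like with the google-genai SDK's async client: raw PCM audio is streamed up in small frames, and the model's spoken reply streams back as audio. The model identifier, config keys, and session methods here are assumptions against an evolving API surface; verify them in the current Vertex AI / Gemini API documentation. `chunk_pcm` is a plain helper for framing the input stream.

```python
FRAME_MS = 20          # send audio in small frames for low latency
SAMPLE_RATE = 16000    # assumed input format: 16 kHz, 16-bit mono PCM
BYTES_PER_FRAME = SAMPLE_RATE * 2 * FRAME_MS // 1000  # 2 bytes per sample


def chunk_pcm(pcm: bytes, frame_bytes: int = BYTES_PER_FRAME):
    """Split raw PCM audio into fixed-size frames for streaming."""
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


async def stream_audio(pcm: bytes):
    """Hypothetical session: stream audio up, yield model speech back down."""
    from google import genai
    from google.genai import types

    client = genai.Client()
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio",  # assumed model ID
        config={"response_modalities": ["AUDIO"]},
    ) as session:
        for frame in chunk_pcm(pcm):
            await session.send_realtime_input(
                audio=types.Blob(data=frame, mime_type="audio/pcm;rate=16000")
            )
        async for message in session.receive():
            if message.data:  # model speech, returned as raw audio bytes
                yield message.data
```

Because the model emits audio natively, the bytes yielded here are its actual voice output rather than a TTS rendering of a text reply, which is what makes interruption handling and turn-taking feel natural.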

On-Device Voice Input

The Gemini app on mobile uses on-device voice processing (Gemini Nano where available) for voice input — meaning your audio is processed locally rather than sent to the cloud for transcription before reaching the model. This reduces latency for voice queries and provides a privacy benefit for sensitive inputs.

Choosing the Right Path

Cloud STT API (Chirp 3)

Best for: bulk batch transcription, call centre recording analysis, meeting transcription where speaker labelling is needed, real-time audio streaming in production applications.

Gemini Native Audio

Best for: conversational use cases where you need transcription + reasoning together — audio Q&A, summarisation, information extraction from voice.

Gemini Live API

Best for: real-time voice assistant applications, customer service voice bots, interactive voice experiences with low latency requirements.

Checklist

  • What two capabilities did Chirp 3 add over Chirp 2?
  • What is the advantage of using Gemini native audio input vs a separate STT API call?
  • What makes the Gemini Live API suitable for real-time voice applications?
  • What model variant powers Gemini Live's voice output, and why does it produce more natural speech?
  • Where is Gemini Live API available for production deployments?