🧠 All Things AI
Beginner

Audio & Voice

OpenAI's audio capabilities span real-time voice conversation in ChatGPT, high-accuracy speech-to-text transcription, and expressive text-to-speech synthesis. For developers, the Realtime API enables low-latency streaming audio in custom applications. Each layer of this stack has seen significant updates in 2025.

Advanced Voice Mode

Advanced Voice Mode is ChatGPT's real-time speech-to-speech conversation feature. Unlike older voice interfaces that transcribed your speech, sent the text to the model, and then read the response aloud, Advanced Voice Mode runs end-to-end on GPT-5 with native audio input and output, meaning the model hears your tone and intonation and responds with natural, expressive speech rather than robotic text-to-speech.

Key characteristics:

  • Handles natural interruptions — you can cut in mid-sentence and the model responds appropriately
  • Understands emotional cues in voice, not just the words
  • Available on Plus plans and above
  • Platforms: iOS, Android, Windows, and Web. Removed from the Mac app on January 15, 2026.

Speech-to-Text (STT)

OpenAI offers three speech-to-text options via the API, each with different capability levels:

Whisper (whisper-1)

The original open-source transcription model, still available via the API as whisper-1. Open-source weights are publicly available on Hugging Face for self-hosting. Strong multilingual transcription, widely used in production. Not real-time: it processes audio files in batch mode.

gpt-4o-transcribe (March 2025)

A higher-accuracy transcription model built on GPT-4o architecture. Significantly better language recognition (especially on accented speech and domain-specific terminology), lower word error rate than whisper-1 on benchmark datasets, and better at capturing context in ambiguous audio. Priced higher than whisper-1.

gpt-4o-mini-transcribe

A cost-efficient transcription model with a lower word error rate than whisper-1 on ASR benchmarks, despite being cheaper than gpt-4o-transcribe. Good choice for high-volume transcription pipelines where gpt-4o-transcribe's cost is prohibitive but whisper-1's accuracy is insufficient.
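The trade-offs across the three STT models can be summed up as a small decision helper. The rules below mirror the descriptions above (self-hosting, accuracy, cost); they are illustrative, not an official recommendation:

```python
def pick_stt_model(self_host: bool, max_accuracy: bool, budget_tight: bool) -> str:
    """Map the STT trade-offs described above to a model name."""
    if self_host:
        return "whisper-1"               # open weights, self-hostable
    if max_accuracy:
        return "gpt-4o-transcribe"       # lowest word error rate, highest price
    if budget_tight:
        return "gpt-4o-mini-transcribe"  # cheaper, still beats whisper-1 on WER
    return "whisper-1"                   # proven default for general use

print(pick_stt_model(self_host=False, max_accuracy=False, budget_tight=True))
```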

Text-to-Speech (TTS)

OpenAI provides three TTS options for generating spoken audio from text:

tts-1

Standard-quality TTS optimised for speed. Six voice options. Good for applications where latency matters more than the highest audio fidelity. Priced at $15 per 1M characters.

tts-1-hd

Higher-quality TTS at the cost of increased latency and price ($30 per 1M characters). Better for audio content where quality is the priority — podcasts, narration, accessible audio descriptions.
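The two listed prices ($15 and $30 per 1M characters) make cost estimation simple arithmetic. A quick sketch, using only the rates stated above:

```python
# USD per 1M input characters, from the prices listed above
TTS_PRICE_PER_M_CHARS = {"tts-1": 15.00, "tts-1-hd": 30.00}

def tts_cost(model: str, characters: int) -> float:
    """Estimated synthesis cost in USD for a given character count."""
    return TTS_PRICE_PER_M_CHARS[model] * characters / 1_000_000

# A 50,000-character audiobook chapter:
print(round(tts_cost("tts-1", 50_000), 2))     # 0.75
print(round(tts_cost("tts-1-hd", 50_000), 2))  # 1.5
```

For most workloads the HD model simply doubles the bill, which is why it is reserved for quality-first content like podcasts and narration.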

gpt-4o-mini-tts (2025)

A steerable TTS model that accepts natural language instructions for how to deliver the speech — not just what to say. You can instruct it: "Speak with nervous excitement, like you're pitching to investors for the first time" or "Read this slowly and clearly, like you're explaining to a child." Better naturalness and expressiveness than tts-1 variants, while remaining cost-efficient.
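The "what to say" versus "how to say it" split shows up as two separate fields in the request. The sketch below builds that JSON body; the field names (`model`, `voice`, `input`, `instructions`) follow OpenAI's public speech endpoint, and the voice name and prompt text are illustrative:

```python
import json

def speech_request(text: str, instructions: str, voice: str = "alloy") -> str:
    """JSON body for a POST to /v1/audio/speech (sketch).

    `input` is what to say; `instructions` is how to say it -- the
    steerability that separates gpt-4o-mini-tts from the tts-1 models,
    which only accept the text itself.
    """
    return json.dumps({
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": instructions,
    })

body = speech_request(
    "Our Q3 numbers are in, and they are remarkable.",
    "Speak with nervous excitement, like you're pitching to investors for the first time.",
)
```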

Realtime API (For Developers)

The Realtime API provides the same audio model that powers Advanced Voice Mode in ChatGPT, exposed as a developer API for building custom voice applications. Key characteristics:

  • Streams audio input and output directly — no transcription-then-completion intermediate step
  • Same model as Advanced Voice Mode: natural interruption handling, emotional awareness
  • Generally available since August 28, 2025
  • Supports MCP server connections — your voice application can call tools via MCP
  • Supports image input — users can point a camera and discuss what they see
  • SIP phone calling support — enables voice AI in traditional phone/call centre stacks

The Realtime API lets developers build voice-first AI applications without managing the complexity of a separate STT → LLM → TTS pipeline. Latency and naturalness are significantly better than chaining separate models.
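Concretely, a Realtime client opens a WebSocket and configures the session with a single JSON event. The sketch below shows that first message; the event and field names (`session.update`, `modalities`, `turn_detection`, `server_vad`) follow OpenAI's Realtime API documentation, but treat the exact schema as an assumption to verify against the current docs:

```python
import json

# Sketch of the first event a client sends over the Realtime API's
# WebSocket connection. Server-side voice activity detection is what
# lets callers interrupt the model mid-sentence.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],           # stream audio directly, no STT/TTS hop
        "instructions": "You are a concise phone support agent.",
        "turn_detection": {"type": "server_vad"},  # server detects speech start/stop
    },
}

wire_message = json.dumps(session_update)
```

After this handshake, audio chunks flow in both directions over the same socket, which is why there is no transcription-then-completion intermediate step.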

Checklist

  • What makes Advanced Voice Mode different from a standard text-to-speech read-out of ChatGPT responses?
  • Which STT model is open-source and self-hostable?
  • What is the key differentiating feature of gpt-4o-mini-tts compared to tts-1?
  • On which platform was Advanced Voice Mode removed in January 2026?
  • What does the Realtime API enable that a standard STT+LLM+TTS pipeline cannot achieve as well?