Audio & Voice
OpenAI's audio capabilities span real-time voice conversation in ChatGPT, high-accuracy speech-to-text transcription, and expressive text-to-speech synthesis. For developers, the Realtime API enables low-latency streaming audio in custom applications. Each layer of this stack has seen significant updates in 2025.
Advanced Voice Mode
Advanced Voice Mode is ChatGPT's real-time speech-to-speech conversation feature. Unlike older voice interfaces that transcribed your speech, sent text to the model, and then read the response aloud, Advanced Voice Mode runs end-to-end on GPT-5 with audio input and audio output natively — meaning the model hears your tone and intonation and responds with natural, expressive speech rather than robotic text-to-speech.
Key characteristics:
- Handles natural interruptions — you can cut in mid-sentence and the model responds appropriately
- Understands emotional cues in voice, not just the words
- Available on Plus and above plans
- Platforms: iOS, Android, Windows, and Web; the feature was removed from the macOS app on January 15, 2026.
Speech-to-Text (STT)
OpenAI offers three speech-to-text options via the API, each with a different accuracy, cost, and deployment profile:
Whisper (whisper-1)
The original transcription model, still available via the API as whisper-1. Its open-source weights are publicly available on Hugging Face for self-hosting. Strong multilingual transcription, widely used in production. Not real-time — it processes audio files in batch mode.
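A minimal batch-transcription sketch using the OpenAI Python SDK. The model name whisper-1 and the audio.transcriptions.create call are real SDK surface; the file name is hypothetical, and the format list reflects the commonly documented upload formats rather than an exhaustive specification.

```python
import os

# Upload formats commonly documented for the transcription endpoint
# (treat this list as an assumption, not an exhaustive spec).
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}

def is_supported(path: str) -> bool:
    """Cheap client-side check before uploading an audio file."""
    return path.rsplit(".", 1)[-1].lower() in SUPPORTED_FORMATS

def transcribe(path: str, model: str = "whisper-1") -> str:
    """Batch transcription: whisper-1 processes whole files, not live streams."""
    from openai import OpenAI  # requires `pip install openai` and OPENAI_API_KEY
    client = OpenAI()
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(model=model, file=f).text

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    if is_supported("meeting.m4a"):  # hypothetical recording
        print(transcribe("meeting.m4a"))
```

Because the endpoint is batch-only, long recordings are typically split client-side before upload; the same transcribe() call works unchanged for gpt-4o-transcribe by swapping the model argument.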
gpt-4o-transcribe (March 2025)
A higher-accuracy transcription model built on GPT-4o architecture. Significantly better language recognition (especially on accented speech and domain-specific terminology), lower word error rate than whisper-1 on benchmark datasets, and better at capturing context in ambiguous audio. Priced higher than whisper-1.
gpt-4o-mini-transcribe
A cost-efficient transcription model with a lower word error rate than whisper-1 on ASR benchmarks, despite being cheaper than gpt-4o-transcribe. Good choice for high-volume transcription pipelines where gpt-4o-transcribe's cost is prohibitive but whisper-1's accuracy is insufficient.
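The trade-offs between the three STT models can be captured in a small selection helper. The model names are real; the decision logic is an illustrative sketch of the guidance above, not an official recommendation.

```python
# Hypothetical helper: picks an OpenAI STT model based on the accuracy/cost
# trade-offs described above.

def pick_stt_model(needs_best_accuracy: bool, cost_sensitive: bool) -> str:
    if needs_best_accuracy and not cost_sensitive:
        return "gpt-4o-transcribe"       # highest accuracy, highest price
    if needs_best_accuracy and cost_sensitive:
        return "gpt-4o-mini-transcribe"  # beats whisper-1 on WER, cheaper than 4o
    return "whisper-1"                   # open-source, batch, widely deployed

print(pick_stt_model(True, True))  # gpt-4o-mini-transcribe
```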
Text-to-Speech (TTS)
OpenAI provides three TTS options for generating spoken audio from text:
tts-1
Standard-quality TTS optimised for speed. Six voice options. Good for applications where latency matters more than the highest audio fidelity. Priced at $15 per 1M characters.
tts-1-hd
Higher-quality TTS at the cost of increased latency and price ($30 per 1M characters). Better for audio content where quality is the priority — podcasts, narration, accessible audio descriptions.
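The per-character pricing above makes TTS costs easy to estimate. A quick calculator using the stated rates ($15 and $30 per 1M characters):

```python
# Cost sketch using the stated TTS pricing per 1M characters.
PRICE_PER_MILLION_CHARS = {"tts-1": 15.00, "tts-1-hd": 30.00}

def tts_cost_usd(text: str, model: str) -> float:
    """Estimated cost of synthesising `text` with the given model."""
    return len(text) * PRICE_PER_MILLION_CHARS[model] / 1_000_000

# A 5,000-character article narration:
print(round(tts_cost_usd("x" * 5000, "tts-1"), 4))     # 0.075
print(round(tts_cost_usd("x" * 5000, "tts-1-hd"), 4))  # 0.15
```

Even at HD rates, narrating a full-length article costs a fraction of a cent per thousand characters, so quality rather than price is usually the deciding factor.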
gpt-4o-mini-tts (2025)
A steerable TTS model that accepts natural language instructions for how to deliver the speech — not just what to say. You can instruct it: "Speak with nervous excitement, like you're pitching to investors for the first time" or "Read this slowly and clearly, like you're explaining to a child." Better naturalness and expressiveness than tts-1 variants, while remaining cost-efficient.
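A sketch of steering delivery with gpt-4o-mini-tts. The instructions parameter on the speech endpoint is what distinguishes this model; the voice name, output file, and helper function here are assumptions for illustration.

```python
import os

def build_speech_request(text: str, style: str) -> dict:
    """Assemble parameters for the /audio/speech endpoint. The `instructions`
    field (delivery steering) is what distinguishes gpt-4o-mini-tts."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": "coral",       # assumed voice name; check the current voice list
        "input": text,
        "instructions": style,  # how to say it, not just what to say
    }

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    request = build_speech_request(
        "Our runway is eighteen months.",
        "Speak with nervous excitement, like you're pitching to investors for the first time.",
    )
    with client.audio.speech.with_streaming_response.create(**request) as resp:
        resp.stream_to_file("pitch.mp3")  # hypothetical output path
```

Swapping only the instructions string changes the delivery, which makes A/B-testing different reading styles against the same script straightforward.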
Realtime API (For Developers)
The Realtime API provides the same audio model that powers Advanced Voice Mode in ChatGPT, exposed as a developer API for building custom voice applications. Key characteristics:
- Streams audio input and output directly — no transcription-then-completion intermediate step
- Same model as Advanced Voice Mode: natural interruption handling, emotional awareness
- Generally available since August 28, 2025
- Supports MCP server connections — your voice application can call tools via MCP
- Supports image input — users can point a camera and discuss what they see
- SIP phone calling support — enables voice AI in traditional phone/call centre stacks
The Realtime API unlocks building voice-first AI applications without managing the complexity of a separate STT → LLM → TTS pipeline. The latency and naturalness are significantly better than chaining separate models.
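After opening a Realtime API WebSocket, a client configures the conversation by sending JSON events. The sketch below builds a session.update event; the event type is real, but treat the exact field names and values as illustrative — the GA schema differs in detail, so consult the API reference for the authoritative shape.

```python
import json

def session_update(instructions: str) -> str:
    """Build a `session.update` event of the kind sent over the Realtime API
    WebSocket after connecting. Field names are illustrative, not normative."""
    event = {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],  # stream audio in and out directly
            "instructions": instructions,
            "voice": "alloy",                 # assumed voice name
            # server-side voice activity detection lets callers interrupt
            # mid-sentence, as in Advanced Voice Mode
            "turn_detection": {"type": "server_vad"},
        },
    }
    return json.dumps(event)

evt = json.loads(session_update("You are a friendly phone agent."))
print(evt["type"])  # session.update
```

The same event-driven session carries tool calls (including MCP-backed tools) and image input, so one connection replaces the entire STT → LLM → TTS chain.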
Checklist
- What makes Advanced Voice Mode different from a standard text-to-speech read-out of ChatGPT responses?
- Which STT model is open-source and self-hostable?
- What is the key differentiating feature of gpt-4o-mini-tts compared to tts-1?
- On which platform was Advanced Voice Mode removed in January 2026?
- What does the Realtime API enable that a standard STT+LLM+TTS pipeline cannot achieve as well?