AWS Transcribe — Speech-to-Text
Amazon Transcribe is AWS's managed automatic speech recognition (ASR) service. It converts audio and video recordings to text at scale — with support for batch and real-time streaming, 100+ languages, speaker diarization, PII redaction, and deep AWS ecosystem integration. It sits alongside Bedrock as part of AWS's broader AI services stack.
What It Does
Transcribe takes audio (MP3, WAV, FLAC, MP4, and more) and returns a structured transcript. You can use it for call centre analytics, meeting notes, subtitles, voice-to-text pipelines, and any workflow where spoken language needs to become structured data.
It runs as a fully managed service — no models to host, no GPUs to provision. You call the API and AWS handles the rest.
Batch vs Real-Time Streaming
Batch Transcription
- Submit audio files stored in S3
- Job runs asynchronously — poll for completion
- Results written to S3 as JSON + text
- Best for: recorded calls, meetings, video files
- Used via boto3
transcribeclient
Streaming Transcription
- WebSocket or HTTP/2 bidirectional stream
- Partial results returned as audio arrives
- Best for: live captions, voice assistants, phone calls
- Requires
amazon-transcribeasync SDK (PyPI) — not standard boto3 - Latency: ~300–500ms for partial results
Key Features
- 100+ languages — automatic language identification available
- Speaker diarization — identifies and labels individual speakers in multi-person audio
- Custom vocabulary — add domain-specific terms (product names, jargon) to improve accuracy
- Custom language models — fine-tune on your own text corpus for domain adaptation
- Automatic punctuation and formatting — produces readable output without manual post-processing
- Word-level confidence scores — flag uncertain transcriptions for human review
- PII redaction — automatically masks SSNs, phone numbers, credit card numbers, and more
- Vocabulary filters — block or replace specific words in output
- Amazon Transcribe Medical — HIPAA-eligible variant with clinical vocabulary
- Amazon Transcribe Call Analytics — adds sentiment analysis, call categories, and Bedrock-powered summaries
boto3 Usage (Batch)
Batch transcription is the most common pattern. Store your audio in S3, submit a job, and poll until complete:
import boto3
import time
transcribe = boto3.client("transcribe", region_name="us-east-1")
# Start a transcription job
transcribe.start_transcription_job(
TranscriptionJobName="my-meeting-2026-03",
Media={"MediaFileUri": "s3://my-bucket/recordings/meeting.mp3"},
MediaFormat="mp3",
LanguageCode="en-US",
OutputBucketName="my-output-bucket",
Settings={
"ShowSpeakerLabels": True,
"MaxSpeakerLabels": 4,
},
)
# Poll until complete
while True:
response = transcribe.get_transcription_job(
TranscriptionJobName="my-meeting-2026-03"
)
status = response["TranscriptionJob"]["TranscriptionJobStatus"]
if status in ["COMPLETED", "FAILED"]:
break
print(f"Status: {status} — waiting...")
time.sleep(15)
if status == "COMPLETED":
uri = response["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
print(f"Transcript available at: {uri}")Accuracy vs Competitors (2025)
| Service | Typical WER | Best For |
|---|---|---|
| OpenAI Whisper v3 / gpt-4o-transcribe | ~8% | Best zero-shot accuracy, batch |
| Amazon Transcribe | ~18–22% | AWS-native, real-time streaming, compliance |
| Google Speech-to-Text | ~17–21% | GCP-native stacks |
WER = Word Error Rate. Lower is better. Transcribe's main edge over Whisper is real-time streaming, HIPAA/compliance features, and deep AWS integration — not raw accuracy.
Pricing (US East, 2025)
| Tier | Price per minute |
|---|---|
| Standard (0–250K min/mo) | $0.024 |
| Volume tier 2 (250K–1M min/mo) | $0.015 |
| Volume tier 3 (1M+ min/mo) | $0.0102 |
| Medical | $0.075/min |
| Free tier | 60 min/month for first 12 months |
Integration with Bedrock
Transcribe and Bedrock are commonly combined in AWS pipelines. The audio layer (Transcribe) and the intelligence layer (Bedrock) complement each other:
- Meeting notes pipeline: Transcribe audio → pass transcript to Claude via Bedrock → generate structured summary, action items, decisions
- Call analytics: Transcribe Call Analytics (with built-in Bedrock LLM) adds sentiment, categories, and AI summaries to call recordings
- Voice chatbot: Real-time Transcribe stream → Bedrock LLM → Polly TTS → spoken response
- Bedrock Data Automation (BDA): Pass audio through BDA for enhanced transcription plus LLM-driven analysis in one managed pipeline
When to Use Transcribe vs Whisper
Choose Transcribe when:
- You need real-time streaming transcription
- HIPAA compliance is required (Medical variant)
- You're already deep in the AWS ecosystem
- Call analytics (sentiment, categories) add value
- Volume pricing makes sense at scale
Choose Whisper/OpenAI when:
- Maximum accuracy is the top priority
- You're working in the OpenAI ecosystem
- Batch-only workflow (no real-time needed)
- Simpler single-vendor setup preferred
Checklist: Do You Understand This?
- Amazon Transcribe converts audio to text — batch (via S3 + boto3) and real-time streaming
- Supports 100+ languages, speaker diarization, PII redaction, custom vocabulary
- Word Error Rate ~18–22% — good for AWS-native stacks; Whisper is more accurate for batch-only
- Standard pricing: $0.024/min; free tier 60 min/month for 12 months
- Pairs naturally with Bedrock LLMs for meeting notes, call analytics, and voice pipelines
- Transcribe Medical adds HIPAA-eligible processing for healthcare use cases