Intermediate

AWS Transcribe — Speech-to-Text

Amazon Transcribe is AWS's managed automatic speech recognition (ASR) service. It converts audio and video recordings to text at scale — with support for batch and real-time streaming, 100+ languages, speaker diarization, PII redaction, and deep AWS ecosystem integration. It sits alongside Bedrock as part of AWS's broader AI services stack.

What It Does

Transcribe takes audio (MP3, WAV, FLAC, MP4, and more) and returns a structured transcript. You can use it for call centre analytics, meeting notes, subtitles, voice-to-text pipelines, and any workflow where spoken language needs to become structured data.

It runs as a fully managed service — no models to host, no GPUs to provision. You call the API and AWS handles the rest.

Batch vs Real-Time Streaming

Batch Transcription

Submit audio files stored in S3
Job runs asynchronously — poll for completion
Results written to S3 as JSON + text
Best for: recorded calls, meetings, video files
Used via boto3 transcribe client

Streaming Transcription

WebSocket or HTTP/2 bidirectional stream
Partial results returned as audio arrives
Best for: live captions, voice assistants, phone calls
Requires amazon-transcribe async SDK (PyPI) — not standard boto3
Latency: ~300–500ms for partial results

Key Features

100+ languages — automatic language identification available
Speaker diarization — identifies and labels individual speakers in multi-person audio
Custom vocabulary — add domain-specific terms (product names, jargon) to improve accuracy
Custom language models — fine-tune on your own text corpus for domain adaptation
Automatic punctuation and formatting — produces readable output without manual post-processing
Word-level confidence scores — flag uncertain transcriptions for human review
PII redaction — automatically masks SSNs, phone numbers, credit card numbers, and more
Vocabulary filters — block or replace specific words in output
Amazon Transcribe Medical — HIPAA-eligible variant with clinical vocabulary
Amazon Transcribe Call Analytics — adds sentiment analysis, call categories, and Bedrock-powered summaries

boto3 Usage (Batch)

Batch transcription is the most common pattern. Store your audio in S3, submit a job, and poll until complete:

import boto3
import time

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Start a transcription job
transcribe.start_transcription_job(
    TranscriptionJobName="my-meeting-2026-03",
    Media={"MediaFileUri": "s3://my-bucket/recordings/meeting.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName="my-output-bucket",
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 4,
    },
)

# Poll until complete
while True:
    response = transcribe.get_transcription_job(
        TranscriptionJobName="my-meeting-2026-03"
    )
    status = response["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ["COMPLETED", "FAILED"]:
        break
    print(f"Status: {status} — waiting...")
    time.sleep(15)

if status == "COMPLETED":
    uri = response["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
    print(f"Transcript available at: {uri}")

Accuracy vs Competitors (2025)

Service	Typical WER	Best For
OpenAI Whisper v3 / gpt-4o-transcribe	~8%	Best zero-shot accuracy, batch
Amazon Transcribe	~18–22%	AWS-native, real-time streaming, compliance
Google Speech-to-Text	~17–21%	GCP-native stacks

WER = Word Error Rate. Lower is better. Transcribe's main edge over Whisper is real-time streaming, HIPAA/compliance features, and deep AWS integration — not raw accuracy.

Pricing (US East, 2025)

Tier	Price per minute
Standard (0–250K min/mo)	$0.024
Volume tier 2 (250K–1M min/mo)	$0.015
Volume tier 3 (1M+ min/mo)	$0.0102
Medical	$0.075/min
Free tier	60 min/month for first 12 months

Integration with Bedrock

Transcribe and Bedrock are commonly combined in AWS pipelines. The audio layer (Transcribe) and the intelligence layer (Bedrock) complement each other:

Meeting notes pipeline: Transcribe audio → pass transcript to Claude via Bedrock → generate structured summary, action items, decisions
Call analytics: Transcribe Call Analytics (with built-in Bedrock LLM) adds sentiment, categories, and AI summaries to call recordings
Voice chatbot: Real-time Transcribe stream → Bedrock LLM → Polly TTS → spoken response
Bedrock Data Automation (BDA): Pass audio through BDA for enhanced transcription plus LLM-driven analysis in one managed pipeline

When to Use Transcribe vs Whisper

Choose Transcribe when:

You need real-time streaming transcription
HIPAA compliance is required (Medical variant)
You're already deep in the AWS ecosystem
Call analytics (sentiment, categories) add value
Volume pricing makes sense at scale

Choose Whisper/OpenAI when:

Maximum accuracy is the top priority
You're working in the OpenAI ecosystem
Batch-only workflow (no real-time needed)
Simpler single-vendor setup preferred

Checklist: Do You Understand This?

Amazon Transcribe converts audio to text — batch (via S3 + boto3) and real-time streaming
Supports 100+ languages, speaker diarization, PII redaction, custom vocabulary
Word Error Rate ~18–22% — good for AWS-native stacks; Whisper is more accurate for batch-only
Standard pricing: $0.024/min; free tier 60 min/month for 12 months
Pairs naturally with Bedrock LLMs for meeting notes, call analytics, and voice pipelines
Transcribe Medical adds HIPAA-eligible processing for healthcare use cases