Intermediate

AWS Transcribe — Speech-to-Text

Amazon Transcribe is AWS's managed automatic speech recognition (ASR) service. It converts audio and video recordings to text at scale — with support for batch and real-time streaming, 100+ languages, speaker diarization, PII redaction, and deep AWS ecosystem integration. It sits alongside Bedrock as part of AWS's broader AI services stack.

What It Does

Transcribe takes audio (MP3, WAV, FLAC, MP4, and more) and returns a structured transcript. You can use it for call centre analytics, meeting notes, subtitles, voice-to-text pipelines, and any workflow where spoken language needs to become structured data.

It runs as a fully managed service — no models to host, no GPUs to provision. You call the API and AWS handles the rest.

Batch vs Real-Time Streaming

Batch Transcription

  • Submit audio files stored in S3
  • Job runs asynchronously — poll for completion
  • Results written to S3 as JSON + text
  • Best for: recorded calls, meetings, video files
  • Used via boto3 transcribe client

Streaming Transcription

  • WebSocket or HTTP/2 bidirectional stream
  • Partial results returned as audio arrives
  • Best for: live captions, voice assistants, phone calls
  • Requires amazon-transcribe async SDK (PyPI) — not standard boto3
  • Latency: ~300–500ms for partial results

Key Features

  • 100+ languages — automatic language identification available
  • Speaker diarization — identifies and labels individual speakers in multi-person audio
  • Custom vocabulary — add domain-specific terms (product names, jargon) to improve accuracy
  • Custom language models — fine-tune on your own text corpus for domain adaptation
  • Automatic punctuation and formatting — produces readable output without manual post-processing
  • Word-level confidence scores — flag uncertain transcriptions for human review
  • PII redaction — automatically masks SSNs, phone numbers, credit card numbers, and more
  • Vocabulary filters — block or replace specific words in output
  • Amazon Transcribe Medical — HIPAA-eligible variant with clinical vocabulary
  • Amazon Transcribe Call Analytics — adds sentiment analysis, call categories, and Bedrock-powered summaries

boto3 Usage (Batch)

Batch transcription is the most common pattern. Store your audio in S3, submit a job, and poll until complete:

import boto3
import time

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Start a transcription job
transcribe.start_transcription_job(
    TranscriptionJobName="my-meeting-2026-03",
    Media={"MediaFileUri": "s3://my-bucket/recordings/meeting.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName="my-output-bucket",
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 4,
    },
)

# Poll until complete
while True:
    response = transcribe.get_transcription_job(
        TranscriptionJobName="my-meeting-2026-03"
    )
    status = response["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ["COMPLETED", "FAILED"]:
        break
    print(f"Status: {status} — waiting...")
    time.sleep(15)

if status == "COMPLETED":
    uri = response["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
    print(f"Transcript available at: {uri}")

Accuracy vs Competitors (2025)

ServiceTypical WERBest For
OpenAI Whisper v3 / gpt-4o-transcribe~8%Best zero-shot accuracy, batch
Amazon Transcribe~18–22%AWS-native, real-time streaming, compliance
Google Speech-to-Text~17–21%GCP-native stacks

WER = Word Error Rate. Lower is better. Transcribe's main edge over Whisper is real-time streaming, HIPAA/compliance features, and deep AWS integration — not raw accuracy.

Pricing (US East, 2025)

TierPrice per minute
Standard (0–250K min/mo)$0.024
Volume tier 2 (250K–1M min/mo)$0.015
Volume tier 3 (1M+ min/mo)$0.0102
Medical$0.075/min
Free tier60 min/month for first 12 months

Integration with Bedrock

Transcribe and Bedrock are commonly combined in AWS pipelines. The audio layer (Transcribe) and the intelligence layer (Bedrock) complement each other:

  • Meeting notes pipeline: Transcribe audio → pass transcript to Claude via Bedrock → generate structured summary, action items, decisions
  • Call analytics: Transcribe Call Analytics (with built-in Bedrock LLM) adds sentiment, categories, and AI summaries to call recordings
  • Voice chatbot: Real-time Transcribe stream → Bedrock LLM → Polly TTS → spoken response
  • Bedrock Data Automation (BDA): Pass audio through BDA for enhanced transcription plus LLM-driven analysis in one managed pipeline

When to Use Transcribe vs Whisper

Choose Transcribe when:

  • You need real-time streaming transcription
  • HIPAA compliance is required (Medical variant)
  • You're already deep in the AWS ecosystem
  • Call analytics (sentiment, categories) add value
  • Volume pricing makes sense at scale

Choose Whisper/OpenAI when:

  • Maximum accuracy is the top priority
  • You're working in the OpenAI ecosystem
  • Batch-only workflow (no real-time needed)
  • Simpler single-vendor setup preferred

Checklist: Do You Understand This?

  • Amazon Transcribe converts audio to text — batch (via S3 + boto3) and real-time streaming
  • Supports 100+ languages, speaker diarization, PII redaction, custom vocabulary
  • Word Error Rate ~18–22% — good for AWS-native stacks; Whisper is more accurate for batch-only
  • Standard pricing: $0.024/min; free tier 60 min/month for 12 months
  • Pairs naturally with Bedrock LLMs for meeting notes, call analytics, and voice pipelines
  • Transcribe Medical adds HIPAA-eligible processing for healthcare use cases

Page built: 01 Jun 2026