Intermediate

AWS Polly — Text-to-Speech

Amazon Polly converts text to lifelike spoken audio. With 100+ voices across 40+ languages and four voice engine tiers — including a Generative engine added in 2025 — it's the go-to TTS choice for AWS-native stacks. It trades voice variety and peak quality (ElevenLabs wins those) for cost-effectiveness, scale, and seamless AWS integration.

Four Voice Engines

Standard

Traditional concatenative synthesis. Robotic by modern standards but cheapest. Good for high-volume, cost-sensitive pipelines where naturalness is secondary.

Neural

Deep learning-based. Significantly more natural than Standard. Available in 36 languages/variants. The default choice for most production workloads.

Long-Form

Optimised for reading long documents — articles, books, reports. More expressive pacing and intonation variation than Neural for sustained reading.

Generative (new 2025)

Highest quality. 31 voices across 20 locales as of Nov 2025 — with major expansions in Aug and Nov 2025. Best for premium user-facing applications.

Languages & Voices

100+ voices total across all engines
40+ languages and variants — including English (US, UK, AU, IN), Spanish (ES, US, MX), French (FR, CA), German, Japanese, Korean, Hindi, Brazilian Portuguese, Arabic, and more
SSML support — full Speech Synthesis Markup Language for fine-grained control: pauses, emphasis, speaking rate, pitch, whisper effect, phoneme overrides
Custom lexicons — define exactly how brand names or technical terms are pronounced

boto3 Usage

For short text, use synthesize_speech() — it returns an audio stream you write to a file:

import boto3

polly = boto3.client("polly", region_name="us-east-1")

response = polly.synthesize_speech(
    Text="Welcome to the AWS Bedrock platform. Let's get started.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="neural",   # standard | neural | long-form | generative
)

# Write the audio stream to a file
with open("output.mp3", "wb") as f:
    f.write(response["AudioStream"].read())

For long text that exceeds the 3,000-character limit, use the async task API:

# For long documents — output written to S3
response = polly.start_speech_synthesis_task(
    Text=long_text,
    OutputFormat="mp3",
    VoiceId="Matthew",
    Engine="long-form",
    OutputS3BucketName="my-audio-bucket",
    OutputS3KeyPrefix="polly-output/",
)
task_id = response["SynthesisTask"]["TaskId"]
print(f"Task started: {task_id}")

For SSML input, set TextType="ssml" and wrap your text in <speak> tags. The Engine parameter selects which engine tier to use — not all voices are available in all engines.

Pricing (per 1 million characters)

Engine	Price / 1M chars	Free tier (first 12 months)
Standard	$4.00	5M chars/month
Neural	$16.00	1M chars/month
Long-Form	$100.00	500K chars/month
Generative	$30.00	100K chars/month

1 million characters ≈ 8–10 hours of audio depending on speaking rate. New AWS customers also receive $200 in credits (from July 2025) applicable to Polly usage.

How Polly Compares

Service	Voices	Quality	Best For
Amazon Polly (Neural)	100+	Good	AWS stacks, high-volume, $16/1M
Amazon Polly (Generative)	31	Very good	Premium user-facing audio, $30/1M
OpenAI TTS	13	Good	OpenAI ecosystem, $15/1M
Google Cloud TTS (Neural)	380+	Good	GCP stacks, most voice variety
ElevenLabs	1,200+	Best-in-class	Creative, narration, voice cloning

Integration with Bedrock

Polly is the TTS layer in many Bedrock-powered voice pipelines:

Voice chatbot: Transcribe (audio → text) → Bedrock LLM (Claude, Nova, etc.) → Polly (text → audio response)
Video auto-dubbing: Transcribe video audio → Translate text to target language → Bedrock refines translation → Polly synthesizes dubbed audio
Document narration: Extract text from S3 documents → Polly Long-Form engine → MP3 audio output
IVR systems: Dynamic responses from Bedrock LLMs, spoken by Polly in real-time via Connect

Amazon Connect (AWS's contact centre platform) has native Polly integration — customer-facing voice applications built on Connect automatically use Polly for TTS without additional configuration.

When to Use Polly

Use Polly when:

You're already building on AWS
High-volume TTS at low cost is the priority
You need SSML for precise prosody control
AWS Connect or Lambda integration is needed
Document narration at scale (Long-Form engine)

Consider alternatives when:

Voice cloning is needed (ElevenLabs)
Maximum naturalness for creative/narration (ElevenLabs)
Widest language variety (Google — 380+ voices)
You're on GCP or OpenAI stacks already

Checklist: Do You Understand This?

Amazon Polly has four engine tiers: Standard ($4), Neural ($16), Long-Form ($100), Generative ($30) per 1M chars
100+ voices across 40+ languages; Generative engine has 31 voices in 20 locales (as of late 2025)
Use synthesize_speech() for short text; start_speech_synthesis_task() for long documents via S3
SSML support gives precise control over pauses, emphasis, speed, and pronunciation
Pairs naturally with Transcribe + Bedrock for full voice pipeline (STT → LLM → TTS)
ElevenLabs leads on quality; Polly leads on AWS integration and volume cost