AWS Polly — Text-to-Speech
Amazon Polly converts text to lifelike spoken audio. With 100+ voices across 40+ languages and four voice engine tiers — including a Generative engine added in 2025 — it's the go-to TTS choice for AWS-native stacks. It trades voice variety and peak quality (ElevenLabs wins those) for cost-effectiveness, scale, and seamless AWS integration.
Four Voice Engines
Standard
Traditional concatenative synthesis. Robotic by modern standards but cheapest. Good for high-volume, cost-sensitive pipelines where naturalness is secondary.
Neural
Deep learning-based. Significantly more natural than Standard. Available in 36 languages/variants. The default choice for most production workloads.
Long-Form
Optimised for reading long documents — articles, books, reports. More expressive pacing and intonation variation than Neural for sustained reading.
Generative (new 2025)
Highest quality. 31 voices across 20 locales as of Nov 2025 — with major expansions in Aug and Nov 2025. Best for premium user-facing applications.
Languages & Voices
- 100+ voices total across all engines
- 40+ languages and variants — including English (US, UK, AU, IN), Spanish (ES, US, MX), French (FR, CA), German, Japanese, Korean, Hindi, Brazilian Portuguese, Arabic, and more
- SSML support — full Speech Synthesis Markup Language for fine-grained control: pauses, emphasis, speaking rate, pitch, whisper effect, phoneme overrides
- Custom lexicons — define exactly how brand names or technical terms are pronounced
boto3 Usage
For short text, use synthesize_speech() — it returns an audio stream you write to a file:
import boto3
polly = boto3.client("polly", region_name="us-east-1")
response = polly.synthesize_speech(
Text="Welcome to the AWS Bedrock platform. Let's get started.",
OutputFormat="mp3",
VoiceId="Joanna",
Engine="neural", # standard | neural | long-form | generative
)
# Write the audio stream to a file
with open("output.mp3", "wb") as f:
f.write(response["AudioStream"].read())For long text that exceeds the 3,000-character limit, use the async task API:
# For long documents — output written to S3
response = polly.start_speech_synthesis_task(
Text=long_text,
OutputFormat="mp3",
VoiceId="Matthew",
Engine="long-form",
OutputS3BucketName="my-audio-bucket",
OutputS3KeyPrefix="polly-output/",
)
task_id = response["SynthesisTask"]["TaskId"]
print(f"Task started: {task_id}")For SSML input, set TextType="ssml" and wrap your text in <speak> tags. The Engine parameter selects which engine tier to use — not all voices are available in all engines.
Pricing (per 1 million characters)
| Engine | Price / 1M chars | Free tier (first 12 months) |
|---|---|---|
| Standard | $4.00 | 5M chars/month |
| Neural | $16.00 | 1M chars/month |
| Long-Form | $100.00 | 500K chars/month |
| Generative | $30.00 | 100K chars/month |
1 million characters ≈ 8–10 hours of audio depending on speaking rate. New AWS customers also receive $200 in credits (from July 2025) applicable to Polly usage.
How Polly Compares
| Service | Voices | Quality | Best For |
|---|---|---|---|
| Amazon Polly (Neural) | 100+ | Good | AWS stacks, high-volume, $16/1M |
| Amazon Polly (Generative) | 31 | Very good | Premium user-facing audio, $30/1M |
| OpenAI TTS | 13 | Good | OpenAI ecosystem, $15/1M |
| Google Cloud TTS (Neural) | 380+ | Good | GCP stacks, most voice variety |
| ElevenLabs | 1,200+ | Best-in-class | Creative, narration, voice cloning |
Integration with Bedrock
Polly is the TTS layer in many Bedrock-powered voice pipelines:
- Voice chatbot: Transcribe (audio → text) → Bedrock LLM (Claude, Nova, etc.) → Polly (text → audio response)
- Video auto-dubbing: Transcribe video audio → Translate text to target language → Bedrock refines translation → Polly synthesizes dubbed audio
- Document narration: Extract text from S3 documents → Polly Long-Form engine → MP3 audio output
- IVR systems: Dynamic responses from Bedrock LLMs, spoken by Polly in real-time via Connect
Amazon Connect (AWS's contact centre platform) has native Polly integration — customer-facing voice applications built on Connect automatically use Polly for TTS without additional configuration.
When to Use Polly
Use Polly when:
- You're already building on AWS
- High-volume TTS at low cost is the priority
- You need SSML for precise prosody control
- AWS Connect or Lambda integration is needed
- Document narration at scale (Long-Form engine)
Consider alternatives when:
- Voice cloning is needed (ElevenLabs)
- Maximum naturalness for creative/narration (ElevenLabs)
- Widest language variety (Google — 380+ voices)
- You're on GCP or OpenAI stacks already
Checklist: Do You Understand This?
- Amazon Polly has four engine tiers: Standard ($4), Neural ($16), Long-Form ($100), Generative ($30) per 1M chars
- 100+ voices across 40+ languages; Generative engine has 31 voices in 20 locales (as of late 2025)
- Use
synthesize_speech()for short text;start_speech_synthesis_task()for long documents via S3 - SSML support gives precise control over pauses, emphasis, speed, and pronunciation
- Pairs naturally with Transcribe + Bedrock for full voice pipeline (STT → LLM → TTS)
- ElevenLabs leads on quality; Polly leads on AWS integration and volume cost