Intermediate

AWS Polly — Text-to-Speech

Amazon Polly converts text to lifelike spoken audio. With 100+ voices across 40+ languages and four voice engine tiers — including a Generative engine added in 2025 — it's the go-to TTS choice for AWS-native stacks. It trades voice variety and peak quality (ElevenLabs wins those) for cost-effectiveness, scale, and seamless AWS integration.

Four Voice Engines

Standard

Traditional concatenative synthesis. Robotic by modern standards but cheapest. Good for high-volume, cost-sensitive pipelines where naturalness is secondary.

Neural

Deep learning-based. Significantly more natural than Standard. Available in 36 languages/variants. The default choice for most production workloads.

Long-Form

Optimised for reading long documents — articles, books, reports. More expressive pacing and intonation variation than Neural for sustained reading.

Generative (new 2025)

Highest quality. 31 voices across 20 locales as of Nov 2025 — with major expansions in Aug and Nov 2025. Best for premium user-facing applications.

Languages & Voices

  • 100+ voices total across all engines
  • 40+ languages and variants — including English (US, UK, AU, IN), Spanish (ES, US, MX), French (FR, CA), German, Japanese, Korean, Hindi, Brazilian Portuguese, Arabic, and more
  • SSML support — full Speech Synthesis Markup Language for fine-grained control: pauses, emphasis, speaking rate, pitch, whisper effect, phoneme overrides
  • Custom lexicons — define exactly how brand names or technical terms are pronounced

boto3 Usage

For short text, use synthesize_speech() — it returns an audio stream you write to a file:

import boto3

polly = boto3.client("polly", region_name="us-east-1")

response = polly.synthesize_speech(
    Text="Welcome to the AWS Bedrock platform. Let's get started.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="neural",   # standard | neural | long-form | generative
)

# Write the audio stream to a file
with open("output.mp3", "wb") as f:
    f.write(response["AudioStream"].read())

For long text that exceeds the 3,000-character limit, use the async task API:

# For long documents — output written to S3
response = polly.start_speech_synthesis_task(
    Text=long_text,
    OutputFormat="mp3",
    VoiceId="Matthew",
    Engine="long-form",
    OutputS3BucketName="my-audio-bucket",
    OutputS3KeyPrefix="polly-output/",
)
task_id = response["SynthesisTask"]["TaskId"]
print(f"Task started: {task_id}")

For SSML input, set TextType="ssml" and wrap your text in <speak> tags. The Engine parameter selects which engine tier to use — not all voices are available in all engines.

Pricing (per 1 million characters)

EnginePrice / 1M charsFree tier (first 12 months)
Standard$4.005M chars/month
Neural$16.001M chars/month
Long-Form$100.00500K chars/month
Generative$30.00100K chars/month

1 million characters ≈ 8–10 hours of audio depending on speaking rate. New AWS customers also receive $200 in credits (from July 2025) applicable to Polly usage.

How Polly Compares

ServiceVoicesQualityBest For
Amazon Polly (Neural)100+GoodAWS stacks, high-volume, $16/1M
Amazon Polly (Generative)31Very goodPremium user-facing audio, $30/1M
OpenAI TTS13GoodOpenAI ecosystem, $15/1M
Google Cloud TTS (Neural)380+GoodGCP stacks, most voice variety
ElevenLabs1,200+Best-in-classCreative, narration, voice cloning

Integration with Bedrock

Polly is the TTS layer in many Bedrock-powered voice pipelines:

  • Voice chatbot: Transcribe (audio → text) → Bedrock LLM (Claude, Nova, etc.) → Polly (text → audio response)
  • Video auto-dubbing: Transcribe video audio → Translate text to target language → Bedrock refines translation → Polly synthesizes dubbed audio
  • Document narration: Extract text from S3 documents → Polly Long-Form engine → MP3 audio output
  • IVR systems: Dynamic responses from Bedrock LLMs, spoken by Polly in real-time via Connect

Amazon Connect (AWS's contact centre platform) has native Polly integration — customer-facing voice applications built on Connect automatically use Polly for TTS without additional configuration.

When to Use Polly

Use Polly when:

  • You're already building on AWS
  • High-volume TTS at low cost is the priority
  • You need SSML for precise prosody control
  • AWS Connect or Lambda integration is needed
  • Document narration at scale (Long-Form engine)

Consider alternatives when:

  • Voice cloning is needed (ElevenLabs)
  • Maximum naturalness for creative/narration (ElevenLabs)
  • Widest language variety (Google — 380+ voices)
  • You're on GCP or OpenAI stacks already

Checklist: Do You Understand This?

  • Amazon Polly has four engine tiers: Standard ($4), Neural ($16), Long-Form ($100), Generative ($30) per 1M chars
  • 100+ voices across 40+ languages; Generative engine has 31 voices in 20 locales (as of late 2025)
  • Use synthesize_speech() for short text; start_speech_synthesis_task() for long documents via S3
  • SSML support gives precise control over pauses, emphasis, speed, and pronunciation
  • Pairs naturally with Transcribe + Bedrock for full voice pipeline (STT → LLM → TTS)
  • ElevenLabs leads on quality; Polly leads on AWS integration and volume cost

Page built: 01 Jun 2026