Batch Inference
Batch inference lets you process large volumes of requests asynchronously. Instead of sending requests one by one and waiting for each response, you upload a file of requests to S3, submit a batch job, and Bedrock processes them in the background. Results land in another S3 path when the job completes.
When to Use It
Good for batch
- Nightly document classification
- Bulk embedding generation
- Dataset annotation or enrichment
- Evaluation runs across many test cases
- Report generation on fixed schedules
Not for batch
- Real-time user-facing requests
- Tasks where response time matters
- Interactive chat or streaming
- Anything that needs a result in under a minute
Pricing Benefit
Batch inference is priced at 50% of on-demand inference cost. For high-volume workloads where latency is not a concern, this halves your inference costs compared to real-time calls. The tradeoff is completion time — jobs may take minutes to hours depending on queue depth and job size. Maximum job duration is 24 hours.
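The saving composes directly with per-token pricing. A quick sketch of the arithmetic (the per-1K-token rate below is made up purely for illustration, not a real Bedrock price):

```python
def batch_cost(tokens, on_demand_price_per_1k):
    """Batch inference is billed at 50% of the on-demand rate."""
    return tokens / 1000 * on_demand_price_per_1k * 0.5

# 10M tokens at a hypothetical $0.001 per 1K tokens:
# on-demand would be $10.00; batch is half of that.
print(batch_cost(10_000_000, 0.001))
```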
Input Format — JSONL
Input is a JSONL file (one JSON object per line) stored in S3. Each line has a recordId (your identifier for matching results back to inputs) and a modelInput matching the model's InvokeModel body format:
```jsonl
// input.jsonl — one request per line
{"recordId": "doc-001", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 256, "messages": [{"role": "user", "content": "Classify sentiment: Great product!"}]}}
{"recordId": "doc-002", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 256, "messages": [{"role": "user", "content": "Classify sentiment: Terrible experience."}]}}
{"recordId": "doc-003", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 256, "messages": [{"role": "user", "content": "Classify sentiment: It was okay."}]}}
```

Output is also JSONL, with matching recordId values so you can join results back to inputs. Failed records include error details; the job doesn't fail completely if individual records error.
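As a concrete sketch of both sides of the format (the file name, document IDs, and texts are illustrative, and the output-line shape assumes the successful/failed record structure described above):

```python
import json

# --- Build the input file: one {recordId, modelInput} object per line ---
texts = {
    "doc-001": "Great product!",
    "doc-002": "Terrible experience.",
}

with open("input.jsonl", "w") as f:
    for record_id, text in texts.items():
        f.write(json.dumps({
            "recordId": record_id,
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 256,
                "messages": [
                    {"role": "user", "content": f"Classify sentiment: {text}"}
                ],
            },
        }) + "\n")

# --- Join output lines back to inputs by recordId ---
def join_results(output_lines):
    """Split output JSONL lines into successes and failures, keyed by recordId."""
    ok, failed = {}, {}
    for line in output_lines:
        rec = json.loads(line)
        if "modelOutput" in rec:            # successful record
            ok[rec["recordId"]] = rec["modelOutput"]
        else:                               # failed record carries error details
            failed[rec["recordId"]] = rec.get("error")
    return ok, failed
```

Because results arrive keyed by recordId rather than in input order, the join step is what lets you write outputs back onto the original dataset.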
Workflow

The flow has four steps: upload the input JSONL to S3, create the batch job, poll the job status until it completes, and read the results from the output S3 prefix. Creating the job:
```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Create the batch job
job = bedrock.create_model_invocation_job(
    jobName="sentiment-batch-001",
    # Full Bedrock model ID for Claude 3.5 Haiku (verify availability in your region)
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
    inputDataConfig={
        "s3InputDataConfig": {
            "s3Uri": "s3://my-bucket/input.jsonl",
            "s3InputFormat": "JSONL",
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-bucket/output/"}
    },
    roleArn="arn:aws:iam::ACCOUNT:role/BedrockBatchRole",
)
print(job["jobArn"])
```

Limits
Maximum records per job: 50,000. Maximum file size: 512 MB. Maximum job duration: 24 hours. Not all models support batch inference — check the Bedrock documentation for the current supported model list. Claude Haiku models are the most cost-effective choice for large batch jobs.
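Creating the job covers only the submission step of the workflow; you still have to wait for completion before reading results. A minimal polling sketch, assuming boto3's get_model_invocation_job call and the terminal statuses listed in the Bedrock API (the terminal-status set here is an assumption to verify against current docs):

```python
import time

# Terminal job statuses (assumed set; confirm against the Bedrock API reference)
TERMINAL = {"Completed", "PartiallyCompleted", "Failed", "Stopped", "Expired"}

def wait_for_job(bedrock, job_arn, poll_seconds=60):
    """Poll a batch job until it reaches a terminal state; return that state."""
    while True:
        status = bedrock.get_model_invocation_job(jobIdentifier=job_arn)["status"]
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
```

Usage is `status = wait_for_job(bedrock, job["jobArn"])`; once the status is Completed (or PartiallyCompleted), the result JSONL files sit under the output prefix given at job creation.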
Checklist: Do You Understand This?
- What is the pricing advantage of batch inference over on-demand?
- What format does the input file use, and what does each line contain?
- Can you describe the four steps to run a batch job?
- What types of workloads are well suited to batch inference?