Batch vs Realtime AI Processing
Not all AI work needs to happen while a user is waiting. Batch processing — submitting large volumes of requests to be processed asynchronously — typically cuts per-token cost by 50% and removes realtime throughput constraints. The trade-off is latency: batch results may take minutes to hours. Choosing the right mode for each workload is a significant cost optimisation opportunity.
Decision Framework
| Factor | Realtime | Batch |
|---|---|---|
| Latency tolerance | User is waiting; must respond within seconds | Results needed within minutes to hours; no user waiting |
| Interaction type | Interactive: user sends message, expects immediate response | Non-interactive: system submits work and polls for results |
| Volume pattern | Unpredictable bursts; must handle peak load | High, predictable volume; can be smoothed over time |
| Cost priority | Cost is secondary to responsiveness | Cost efficiency is primary; latency is acceptable trade-off |
| Streaming required | Yes — token-by-token streaming improves perceived latency | No — full response returned when batch job completes |
Batch API Capabilities (2025)
OpenAI Batch API
- 50% cost reduction vs synchronous API
- 24-hour completion window
- Up to 50,000 requests per batch file
- Supports all chat completion and embedding models
- JSONL input format; results returned as JSONL
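A minimal sketch of building those JSONL request lines, assuming placeholder document IDs and prompts (the model name here is illustrative; substitute your own):

```python
import json

def build_batch_line(custom_id: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    """Build one JSONL line for the OpenAI Batch API.

    Each line names the endpoint to call and carries the same body a
    synchronous /v1/chat/completions request would use.
    """
    request = {
        "custom_id": custom_id,  # your idempotent ID; echoed back in the results
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    return json.dumps(request)

# One line per request; upload the resulting file with purpose="batch",
# then create the batch with completion_window="24h" via the OpenAI SDK.
lines = [build_batch_line(f"doc-{i}", f"Summarise document {i}") for i in range(3)]
batch_file_contents = "\n".join(lines)
```

The `custom_id` is what lets you join results back to source documents, since batch results are not guaranteed to come back in submission order.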
Anthropic Message Batches
- 50% cost reduction; results within 24 hours
- Up to 10,000 requests per batch
- Supports all Claude models
- Streaming not available in batch mode
- Use for: document enrichment, training data generation, offline analysis
AWS Bedrock Async Inference
- Asynchronous invocation via S3 input/output
- No 24-hour constraint — completes when done
- Suitable for very large documents or multi-step pipelines
- Integrates with Step Functions for orchestration
- Cost savings depend on model; check Bedrock pricing
Batch Job Design Patterns
A well-designed batch job handles partial failures, supports idempotent retries, and processes efficiently at scale.
```python
async def run_batch_enrichment(documents: list[Document]):
    # 1. Split documents into batches under the provider limit
    #    (max 10K requests per batch for Anthropic)
    batches = chunk(documents, size=5_000)
    for batch in batches:
        # 2. Build requests with idempotent custom_ids, skipping
        #    documents that were already processed
        requests = [
            {"custom_id": doc.id, "params": build_request(doc)}
            for doc in batch
            if not await cache.is_processed(doc.id)  # idempotent check
        ]
        # 3. Submit and poll for completion
        batch = await anthropic.messages.batches.create(requests=requests)
        results = await poll_until_complete(batch.id)
        # 4. Handle partial failures — some items may have errored
        for result in results:
            if result.result.type == "succeeded":
                await store_result(result.custom_id, result.result.message)
            else:
                await queue_for_retry(result.custom_id)  # retry individually
```
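The `poll_until_complete` helper used above can be sketched as a generic poll loop. The `fetch_status` and `fetch_results` callables and the `"ended"` status string are assumptions modelled on Anthropic's batch `processing_status`; adapt them to your provider's SDK:

```python
import asyncio

async def poll_until_complete(batch_id, fetch_status, fetch_results,
                              interval_s: float = 30.0,
                              timeout_s: float = 86_400.0):
    """Poll a batch job until it finishes, then fetch its results.

    fetch_status(batch_id)  -> str, e.g. "in_progress" or "ended"
    fetch_results(batch_id) -> list of per-item results
    """
    waited = 0.0
    while True:
        status = await fetch_status(batch_id)
        if status == "ended":
            return await fetch_results(batch_id)
        if waited >= timeout_s:
            raise TimeoutError(f"batch {batch_id} still {status!r} after {timeout_s}s")
        await asyncio.sleep(interval_s)
        waited += interval_s
```

Injecting the fetch functions keeps the loop provider-agnostic and easy to test with fakes; a production version would also use a longer, possibly increasing, poll interval to avoid hammering the status endpoint.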
Batch job checklist
- Idempotent request IDs — safe to resubmit without duplication
- Partial failure handling — individual errors should not fail the whole batch
- Progress tracking — log completion count; alert if batch stalls
- Cost cap — set maximum batch size to bound total spend per run
- Result expiry — providers delete batch results after a retention window (29 days for Anthropic, for example); download promptly
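The cost-cap item can be made concrete with a rough sizing calculation. This is a sketch; the token counts and per-million-token prices in the example call are placeholders, not real pricing:

```python
def max_batch_size(budget_usd: float,
                   avg_input_tokens: int,
                   avg_output_tokens: int,
                   input_price_per_mtok: float,
                   output_price_per_mtok: float,
                   batch_discount: float = 0.5) -> int:
    """Largest number of requests whose estimated cost fits the budget."""
    per_request = (avg_input_tokens * input_price_per_mtok
                   + avg_output_tokens * output_price_per_mtok) / 1_000_000
    per_request *= (1 - batch_discount)  # batch APIs typically discount 50%
    return int(budget_usd // per_request)

# With placeholder prices of $3/$15 per million input/output tokens,
# a $100 run caps the batch at roughly 14,800 requests.
cap = max_batch_size(budget_usd=100.0, avg_input_tokens=2_000,
                     avg_output_tokens=500, input_price_per_mtok=3.0,
                     output_price_per_mtok=15.0)
```

Capping batch size this way bounds worst-case spend per run even if average output length is estimated badly.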
Common batch mistakes
- Resubmitting already-processed documents — no idempotency check
- Not downloading results before provider expiry window
- Using batch mode for user-facing requests where latency matters
- Single giant batch file — if it fails, lose progress on all items; use smaller chunks
- No retry queue for individual item failures
Hybrid Architecture
Most enterprise systems need both modes. A hybrid architecture routes to the right processing tier based on the workload's latency requirements.
Realtime tier
User-facing queries, interactive chat, on-demand code assistance. Synchronous API calls with streaming. Cost: full per-token pricing.
Async queue tier
Semi-interactive: user triggers a task but doesn't wait at the screen. Webhook callback when done. SLA: 5-15 minutes. Cost: full pricing but better throughput control.
Batch tier
Nightly enrichment, training data generation, bulk document processing. Provider batch API. SLA: hours. Cost: 50% discount.
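The three tiers can be sketched as a simple routing rule; the `Workload` fields and threshold below are illustrative, not a prescribed API:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    user_waiting: bool       # is someone at the screen right now?
    deadline_seconds: float  # how soon results are needed

def route(workload: Workload) -> str:
    """Pick a processing tier from the workload's latency requirements."""
    if workload.user_waiting:
        return "realtime"     # synchronous call with streaming
    if workload.deadline_seconds <= 15 * 60:
        return "async_queue"  # webhook callback, 5-15 minute SLA
    return "batch"            # provider batch API, 50% discount
```

For example, a user's chat question routes to `"realtime"`, a report the user will check in 10 minutes routes to `"async_queue"`, and overnight enrichment of 100,000 product descriptions routes to `"batch"`.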
Checklist: Do You Understand This?
- What is the cost saving from using OpenAI or Anthropic batch APIs compared to synchronous calls?
- What is the maximum completion window for OpenAI and Anthropic batch APIs?
- Why must batch job requests use idempotent IDs — what problem does this solve?
- What should happen when an individual item in a batch fails — and why should it not fail the whole batch?
- Classify these workloads as realtime, async queue, or batch: (a) answering a user's chat question; (b) enriching 100,000 product descriptions overnight; (c) generating a report that a user requested and will check in 10 minutes.
- What is the result expiry window for batch results at Anthropic — and what happens if you miss it?