Batch vs Realtime AI Processing
Not all AI work needs to happen while a user is waiting. Batch processing — submitting large volumes of requests to be processed asynchronously — typically cuts per-token cost by 50% and removes realtime throughput constraints. The trade-off is latency: batch results may take minutes to hours. Choosing the right mode for each workload is a significant cost optimisation opportunity.
Decision Framework
| Factor | Realtime | Batch |
|---|---|---|
| Latency tolerance | User is waiting; must respond within seconds | Results needed within minutes to hours; no user waiting |
| Interaction type | Interactive: user sends message, expects immediate response | Non-interactive: system submits work and polls for results |
| Volume pattern | Unpredictable bursts; must handle peak load | High, predictable volume; can be smoothed over time |
| Cost priority | Cost is secondary to responsiveness | Cost efficiency is primary; latency is acceptable trade-off |
| Streaming required | Yes — token-by-token streaming improves perceived latency | No — full response returned when batch job completes |
Batch API Capabilities (2025)
OpenAI Batch API
- 50% cost reduction vs synchronous API
- 24-hour completion window
- Up to 50,000 requests per batch file
- Supports all chat completion and embedding models
- JSONL input format; results returned as JSONL
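A minimal sketch of building those JSONL request lines, assuming placeholder document IDs and prompts (the model name here is illustrative; substitute your own):

```python
import json

def build_batch_line(custom_id: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    """Build one JSONL line for the OpenAI Batch API.

    Each line names the endpoint to call and carries the same body a
    synchronous /v1/chat/completions request would use.
    """
    request = {
        "custom_id": custom_id,  # your idempotent ID; echoed back in the results
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    return json.dumps(request)

# One line per request; upload the resulting file with purpose="batch",
# then create the batch with completion_window="24h" via the OpenAI SDK.
lines = [build_batch_line(f"doc-{i}", f"Summarise document {i}") for i in range(3)]
batch_file_contents = "\n".join(lines)
```

The `custom_id` is what lets you join results back to source documents, since batch results are not guaranteed to come back in submission order.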
Anthropic Message Batches
- 50% cost reduction; results within 24 hours
- Up to 10,000 requests per batch
- Supports all Claude models
- Streaming not available in batch mode
- Use for: document enrichment, training data generation, offline analysis
AWS Bedrock Async Inference
- Asynchronous invocation via S3 input/output
- No 24-hour constraint — completes when done
- Suitable for very large documents or multi-step pipelines
- Integrates with Step Functions for orchestration
- Cost savings depend on model; check Bedrock pricing
Batch Job Design Patterns
A well-designed batch job handles partial failures, supports idempotent retries, and processes efficiently at scale.
```python
async def run_batch_enrichment(documents: list[Document]):
    # 1. Split documents into batches under the provider limit
    #    (max 10K requests per batch for Anthropic)
    batches = chunk(documents, size=5_000)
    for batch in batches:
        # 2. Build requests with idempotent custom_ids, skipping
        #    documents that were already processed
        requests = [
            {"custom_id": doc.id, "params": build_request(doc)}
            for doc in batch
            if not await cache.is_processed(doc.id)  # idempotent check
        ]
        # 3. Submit and poll for completion
        batch = await anthropic.messages.batches.create(requests=requests)
        results = await poll_until_complete(batch.id)
        # 4. Handle partial failures — some items may have errored
        for result in results:
            if result.result.type == "succeeded":
                await store_result(result.custom_id, result.result.message)
            else:
                await queue_for_retry(result.custom_id)  # retry individually
```
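The `poll_until_complete` helper used above can be sketched as a generic poll loop. The `fetch_status` and `fetch_results` callables and the `"ended"` status string are assumptions modelled on Anthropic's batch `processing_status`; adapt them to your provider's SDK:

```python
import asyncio

async def poll_until_complete(batch_id, fetch_status, fetch_results,
                              interval_s: float = 30.0,
                              timeout_s: float = 86_400.0):
    """Poll a batch job until it finishes, then fetch its results.

    fetch_status(batch_id)  -> str, e.g. "in_progress" or "ended"
    fetch_results(batch_id) -> list of per-item results
    """
    waited = 0.0
    while True:
        status = await fetch_status(batch_id)
        if status == "ended":
            return await fetch_results(batch_id)
        if waited >= timeout_s:
            raise TimeoutError(f"batch {batch_id} still {status!r} after {timeout_s}s")
        await asyncio.sleep(interval_s)
        waited += interval_s
```

Injecting the fetch functions keeps the loop provider-agnostic and easy to test with fakes; a production version would also use a longer, possibly increasing, poll interval to avoid hammering the status endpoint.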
Batch job checklist
- Idempotent request IDs — safe to resubmit without duplication
- Partial failure handling — individual errors should not fail the whole batch
- Progress tracking — log completion count; alert if batch stalls
- Cost cap — set maximum batch size to bound total spend per run
- Result expiry — providers delete batch results after a retention window (29 days for Anthropic, for example); download promptly
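The cost-cap item can be made concrete with a rough sizing calculation. This is a sketch; the token counts and per-million-token prices in the example call are placeholders, not real pricing:

```python
def max_batch_size(budget_usd: float,
                   avg_input_tokens: int,
                   avg_output_tokens: int,
                   input_price_per_mtok: float,
                   output_price_per_mtok: float,
                   batch_discount: float = 0.5) -> int:
    """Largest number of requests whose estimated cost fits the budget."""
    per_request = (avg_input_tokens * input_price_per_mtok
                   + avg_output_tokens * output_price_per_mtok) / 1_000_000
    per_request *= (1 - batch_discount)  # batch APIs typically discount 50%
    return int(budget_usd // per_request)

# With placeholder prices of $3/$15 per million input/output tokens,
# a $100 run caps the batch at roughly 14,800 requests.
cap = max_batch_size(budget_usd=100.0, avg_input_tokens=2_000,
                     avg_output_tokens=500, input_price_per_mtok=3.0,
                     output_price_per_mtok=15.0)
```

Capping batch size this way bounds worst-case spend per run even if average output length is estimated badly.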
Common batch mistakes
- Resubmitting already-processed documents — no idempotency check
- Not downloading results before provider expiry window
- Using batch mode for user-facing requests where latency matters
- Single giant batch file — if it fails, lose progress on all items; use smaller chunks
- No retry queue for individual item failures
Hybrid Architecture
Most enterprise systems need both modes. A hybrid architecture routes to the right processing tier based on the workload's latency requirements.
Realtime tier
User-facing queries, interactive chat, on-demand code assistance. Synchronous API calls with streaming. Cost: full per-token pricing.
Async queue tier
Semi-interactive: user triggers a task but doesn't wait at the screen. Webhook callback when done. SLA: 5-15 minutes. Cost: full pricing but better throughput control.
Batch tier
Nightly enrichment, training data generation, bulk document processing. Provider batch API. SLA: hours. Cost: 50% discount.
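The three tiers can be sketched as a simple routing rule; the `Workload` fields and threshold below are illustrative, not a prescribed API:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    user_waiting: bool       # is someone at the screen right now?
    deadline_seconds: float  # how soon results are needed

def route(workload: Workload) -> str:
    """Pick a processing tier from the workload's latency requirements."""
    if workload.user_waiting:
        return "realtime"     # synchronous call with streaming
    if workload.deadline_seconds <= 15 * 60:
        return "async_queue"  # webhook callback, 5-15 minute SLA
    return "batch"            # provider batch API, 50% discount
```

For example, a user's chat question routes to `"realtime"`, a report the user will check in 10 minutes routes to `"async_queue"`, and overnight enrichment of 100,000 product descriptions routes to `"batch"`.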
Checklist: Do You Understand This?
- What is the cost saving from using OpenAI or Anthropic batch APIs compared to synchronous calls?
- What is the maximum completion window for OpenAI and Anthropic batch APIs?
- Why must batch job requests use idempotent IDs — what problem does this solve?
- What should happen when an individual item in a batch fails — and why should it not fail the whole batch?
- Classify these workloads as realtime, async queue, or batch: (a) answering a user's chat question; (b) enriching 100,000 product descriptions overnight; (c) generating a report that a user requested and will check in 10 minutes.
- What is the result expiry window for batch results at Anthropic — and what happens if you miss it?