🧠 All Things AI
Advanced

Batch vs Realtime AI Processing

Not all AI work needs to happen while a user is waiting. Batch processing (submitting large volumes of requests to be processed asynchronously) cuts cost by 50% on the major provider batch APIs and sidesteps synchronous rate limits, since batch quotas are separate from realtime ones. The trade-off is latency: batch results may take minutes to hours. Choosing the right mode for each workload is a significant cost optimisation opportunity.
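The 50% discount compounds quickly at bulk-enrichment scale. A quick worked example, using illustrative token counts and hypothetical per-token prices (check your provider's actual price sheet):

```python
# Cost of enriching 100,000 documents, assuming hypothetical prices of
# $3.00 per million input tokens and $15.00 per million output tokens.
DOCS = 100_000
IN_TOKENS_PER_DOC = 2_000
OUT_TOKENS_PER_DOC = 500
PRICE_IN = 3.00 / 1_000_000    # $ per input token (assumed)
PRICE_OUT = 15.00 / 1_000_000  # $ per output token (assumed)

realtime_cost = DOCS * (IN_TOKENS_PER_DOC * PRICE_IN + OUT_TOKENS_PER_DOC * PRICE_OUT)
batch_cost = realtime_cost * 0.5  # 50% batch discount

print(f"realtime: ${realtime_cost:,.2f}, batch: ${batch_cost:,.2f}")
# → realtime: $1,350.00, batch: $675.00
```

At these assumed prices, moving the job to batch saves $675 per run with no code change beyond the submission path.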

Decision Framework

| Factor | Realtime | Batch |
| --- | --- | --- |
| Latency tolerance | User is waiting; must respond within seconds | Results needed within minutes to hours; no user waiting |
| Interaction type | Interactive: user sends a message and expects an immediate response | Non-interactive: system submits work and polls for results |
| Volume pattern | Unpredictable bursts; must handle peak load | High, predictable volume; can be smoothed over time |
| Cost priority | Cost is secondary to responsiveness | Cost efficiency is primary; latency is an acceptable trade-off |
| Streaming required | Yes: token-by-token streaming improves perceived latency | No: full response returned when the batch job completes |

Batch API Capabilities (2025)

OpenAI Batch API

  • 50% cost reduction vs synchronous API
  • 24-hour completion window
  • Up to 50,000 requests per batch file
  • Supports all chat completion and embedding models
  • JSONL input format; results returned as JSONL
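Each line of an OpenAI batch input file is a self-contained JSON request with a `custom_id`, an HTTP method, a target endpoint, and a request body. A minimal builder sketch (the model id and helper name are illustrative, not prescribed by the API):

```python
import json

def to_openai_batch_line(custom_id: str, prompt: str,
                         model: str = "gpt-4o-mini") -> str:
    """One JSONL line in the OpenAI Batch API input format."""
    return json.dumps({
        "custom_id": custom_id,              # your key for matching results
        "method": "POST",
        "url": "/v1/chat/completions",       # endpoint the batch will call
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

# Build a three-request batch file body
jsonl = "\n".join(
    to_openai_batch_line(f"doc-{i}", f"Summarise document {i}")
    for i in range(3)
)
```

The resulting string is uploaded as a file, and the batch is created against that file id; results come back as JSONL keyed by the same `custom_id`.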

Anthropic Message Batches

  • 50% cost reduction; results within 24 hours
  • Up to 10,000 requests per batch
  • Supports all Claude models
  • Streaming not available in batch mode
  • Use for: document enrichment, training data generation, offline analysis
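Anthropic's batch format differs from OpenAI's: instead of uploading a JSONL file, you pass a list of request objects, each pairing a `custom_id` with the `params` of a normal Messages call. A sketch of one entry (the model id shown is an example, not a recommendation):

```python
def to_anthropic_batch_request(custom_id: str, prompt: str) -> dict:
    """One entry for the Message Batches API requests list."""
    return {
        "custom_id": custom_id,   # your key for matching results
        "params": {               # same shape as a normal Messages request
            "model": "claude-sonnet-4-20250514",  # example model id
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
```

A list of these dicts is what gets submitted in a single batch-create call.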

AWS Bedrock Async Inference

  • Asynchronous invocation via S3 input/output
  • No 24-hour constraint — completes when done
  • Suitable for very large documents or multi-step pipelines
  • Integrates with Step Functions for orchestration
  • Cost savings depend on model; check Bedrock pricing
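Bedrock batch jobs are created with the `create_model_invocation_job` call on the boto3 `bedrock` client, pointing at S3 for input and output. A sketch of the job configuration (bucket URIs, role ARN, and model id are placeholders you must supply):

```python
def bedrock_batch_job_config(job_name: str, model_id: str, role_arn: str,
                             input_s3: str, output_s3: str) -> dict:
    """Kwargs for bedrock.create_model_invocation_job (a sketch; all
    values here are placeholders -- see the Bedrock docs for requirements
    such as minimum record counts and role permissions)."""
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,  # role Bedrock assumes to read/write S3
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3}},
    }

cfg = bedrock_batch_job_config(
    "nightly-enrichment", "anthropic.claude-3-haiku-20240307-v1:0",
    "arn:aws:iam::123456789012:role/bedrock-batch",
    "s3://my-bucket/in/records.jsonl", "s3://my-bucket/out/",
)
# boto3.client("bedrock").create_model_invocation_job(**cfg)
```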

Batch Job Design Patterns

A well-designed batch job handles partial failures, supports idempotent retries, and processes efficiently at scale.

    async def run_batch_enrichment(documents: list[Document]):
        # 1. Chunk into provider-sized batches (Anthropic max: 10,000 requests)
        batches = chunk(documents, size=5_000)

        for batch in batches:
            # 2. Build requests with an idempotent custom_id
            requests = [
                {"custom_id": doc.id, "params": build_request(doc)}
                for doc in batch
                if not await cache.is_processed(doc.id)  # idempotency check
            ]

            # 3. Submit and poll for completion
            batch_id = await anthropic.batches.create(requests)
            results = await poll_until_complete(batch_id)

            # 4. Handle partial failures: some items may have errored
            for result in results:
                if result.result.type == "succeeded":
                    await store_result(result.custom_id, result.result.message)
                else:
                    await queue_for_retry(result.custom_id)  # retry individually
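The pattern above assumes a `poll_until_complete` helper. One way to sketch it, with capped exponential backoff so long-running batches aren't polled aggressively (the `client` interface here is hypothetical; adapt it to your SDK's retrieve/results calls):

```python
import asyncio

async def poll_until_complete(client, batch_id: str,
                              interval: float = 30.0,
                              max_wait: float = 86_400.0):
    """Poll a batch until it finishes, backing off between checks.

    `client` is any object exposing async retrieve(batch_id) and
    results(batch_id) methods (an assumed interface, not a real SDK).
    """
    waited = 0.0
    while waited < max_wait:
        batch = await client.retrieve(batch_id)
        if batch.status in ("ended", "completed", "failed"):
            return await client.results(batch_id)
        await asyncio.sleep(interval)
        waited += interval
        interval = min(interval * 2, 600.0)  # back off, capped at 10 minutes
    raise TimeoutError(f"batch {batch_id} did not complete within {max_wait}s")
```

Provider webhooks, where available, are preferable to polling; this helper is the fallback when you must poll.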

Batch job checklist

  • Idempotent request IDs — safe to resubmit without duplication
  • Partial failure handling — individual errors should not fail the whole batch
  • Progress tracking — log completion count; alert if batch stalls
  • Cost cap — set maximum batch size to bound total spend per run
  • Result expiry — providers delete batch results after a retention window (29 days for Anthropic); download promptly
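For the first checklist item, a deterministic `custom_id` is enough: derive it from the document id plus a prompt version, so resubmitting the same work always produces the same id and can be deduplicated. A small sketch (the naming scheme is an assumption, not a provider requirement):

```python
import hashlib

def make_custom_id(doc_id: str, prompt_version: str) -> str:
    """Deterministic custom_id: the same document and prompt version
    always map to the same id, so resubmission is safe to deduplicate.
    Bumping prompt_version deliberately forces reprocessing."""
    digest = hashlib.sha256(f"{doc_id}:{prompt_version}".encode()).hexdigest()[:16]
    return f"{doc_id}-{digest}"
```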

Common batch mistakes

  • Resubmitting already-processed documents — no idempotency check
  • Not downloading results before provider expiry window
  • Using batch mode for user-facing requests where latency matters
  • Single giant batch file — if it fails, lose progress on all items; use smaller chunks
  • No retry queue for individual item failures
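The "single giant batch file" mistake is avoided by the `chunk` helper the earlier snippet relies on; a minimal implementation:

```python
def chunk(items: list, size: int) -> list[list]:
    """Split work into fixed-size batches so a failure in one batch
    doesn't lose progress on the rest."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```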

Hybrid Architecture

Most enterprise systems need both modes. A hybrid architecture routes to the right processing tier based on the workload's latency requirements.

Realtime tier

User-facing queries, interactive chat, on-demand code assistance. Synchronous API calls with streaming. Cost: full per-token pricing.

Async queue tier

Semi-interactive: user triggers a task but doesn't wait at the screen. Webhook callback when done. SLA: 5-15 minutes. Cost: full pricing but better throughput control.

Batch tier

Nightly enrichment, training data generation, bulk document processing. Provider batch API. SLA: hours. Cost: 50% discount.
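The three tiers reduce to two routing questions: is a user waiting at the screen, and if not, will they come back for the result soon? A routing sketch (tier names and SLA strings are illustrative):

```python
TIERS = {
    # Illustrative SLA and discount per tier, per the descriptions above
    "realtime":    {"sla": "seconds",  "discount": 0.0},
    "async_queue": {"sla": "5-15 min", "discount": 0.0},
    "batch":       {"sla": "hours",    "discount": 0.5},
}

def route(user_waiting: bool, user_returns_soon: bool) -> str:
    """Route a workload to a processing tier (a sketch)."""
    if user_waiting:
        return "realtime"       # interactive chat, on-demand assistance
    if user_returns_soon:
        return "async_queue"    # user-triggered task, webhook on completion
    return "batch"              # nightly enrichment, bulk processing
```

For example, a user's chat question routes to `realtime`, a report the user will check in ten minutes to `async_queue`, and overnight enrichment of 100,000 product descriptions to `batch`.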

Checklist: Do You Understand This?

  • What is the cost saving from using OpenAI or Anthropic batch APIs compared to synchronous calls?
  • What is the maximum completion window for OpenAI and Anthropic batch APIs?
  • Why must batch job requests use idempotent IDs — what problem does this solve?
  • What should happen when an individual item in a batch fails — and why should it not fail the whole batch?
  • Classify these workloads as realtime, async queue, or batch: (a) answering a user's chat question; (b) enriching 100,000 product descriptions overnight; (c) generating a report that a user requested and will check in 10 minutes.
  • What is the result expiry window for batch results at Anthropic — and what happens if you miss it?