
Rate Limit Handling for AI Systems

Rate limits are not just a nuisance — they are a reliability problem. A burst of traffic that triggers 429 responses can cascade into queue buildup, user-facing errors, and degraded service. Handling rate limits correctly requires a combination of backoff logic, request shaping, internal budget controls, and multi-provider routing for failover.

Provider Rate Limits in 2025

| Provider | Limit dimensions | Entry tier limits (approximate) | Tier upgrade path |
|---|---|---|---|
| Anthropic (Claude) | RPM (requests/min), TPM (tokens/min), TPD (tokens/day) | Tier 1: 50 RPM / 40K TPM; Tier 4 (Build): 4,000 RPM / 400K TPM | Usage-based automatic promotion after spending thresholds; request higher tier via console |
| OpenAI | RPM, RPD (requests/day), TPM, TPD, image limits | Tier 1: 500 RPM / 200K TPM; Tier 5: 10,000 RPM / 2M TPM | Automatic tier promotion based on cumulative spend; Tier 5 at $250K+ spend |
| Google (Gemini API) | RPM, TPM, RPD | Free tier: 15 RPM / 1M TPM; Pay-as-you-go: 2,000 RPM / 4M TPM | Contact Google Cloud for enterprise limits; Vertex AI has separate higher limits |
| AWS Bedrock | Requests per minute per model; provisioned throughput available | On-demand limits vary by model and region; provisioned throughput bypasses soft limits | Request limit increases via AWS Support; provisioned throughput for guaranteed capacity |

Exact limits change frequently and are tier-dependent. Always check the provider's documentation for current limits at your tier. Plan for the limits of your current tier, not the tier you expect to reach.

Exponential Backoff with Jitter

Naive retry on 429 makes the problem worse — all retrying clients hit the API at the same time, creating a thundering herd. Exponential backoff with jitter spreads retry load across time.

```python
from anthropic import AsyncAnthropic, RateLimitError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

client = AsyncAnthropic()

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential_jitter(initial=1, max=60),  # 1s → 2s → 4s ... plus random jitter
    stop=stop_after_attempt(6),
)
async def call_llm(prompt: str) -> str:
    return await client.messages.create(...)

# The jitter is critical — without it, multiple clients retrying in sync
# still hit the API simultaneously after the same backoff interval.
# wait_exponential_jitter grows the base wait exponentially
# (initial * 2^attempt, capped at max) and adds a small random
# offset to each wait, so synchronized clients drift apart.
```

Correct backoff behaviour

  • Exponential base wait: 1s, 2s, 4s, 8s, 16s, 32s (cap at 60s)
  • Add full jitter: final wait = random(0, base_wait)
  • Respect Retry-After header if provider sends it
  • Maximum attempt cap: 5-6 retries before returning error to caller
  • Distinguish 429 (retryable) from 400 (not retryable — bad request)
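
The bullets above can be sketched as a manual retry loop. This is a minimal illustration, not production code; `send_request` is a hypothetical callable standing in for your HTTP layer, assumed to return a status code, a header dict, and a body.

```python
import random
import time

MAX_ATTEMPTS = 6
BASE_WAIT = 1.0   # seconds
MAX_WAIT = 60.0

def call_with_backoff(send_request):
    """Retry on 429 with full-jitter exponential backoff.

    `send_request` is a hypothetical callable returning
    (status_code, headers, body).
    """
    for attempt in range(MAX_ATTEMPTS):
        status, headers, body = send_request()
        if status == 429:
            # Full jitter: final wait = random(0, base_wait)
            base = min(BASE_WAIT * 2 ** attempt, MAX_WAIT)
            wait = random.uniform(0, base)
            # Respect Retry-After if the provider sends it
            retry_after = headers.get("Retry-After")
            if retry_after is not None:
                wait = max(wait, float(retry_after))
            time.sleep(wait)
            continue
        if status >= 400:
            # 400, 401, etc. will not succeed on retry
            raise RuntimeError(f"non-retryable error {status}")
        return body
    raise RuntimeError("rate limited after max retries")
```

Note that the loop caps attempts, distinguishes retryable from non-retryable statuses, and takes the larger of the jittered wait and the provider's `Retry-After` hint, matching the checklist above.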

Common mistakes

  • Retrying immediately on 429 — no wait at all
  • Fixed sleep (time.sleep(1)) — no backoff, thundering herd persists
  • No retry cap — infinite retries block threads/goroutines indefinitely
  • Retrying 400 and 401 errors — these won't succeed regardless of retries
  • Retrying inside the LLM call without propagating errors to the queue layer

Request Queuing and Load Spreading

| Strategy | How it works | When to use |
|---|---|---|
| Token bucket (client-side) | Track tokens consumed per minute locally; delay requests when approaching limit | Single service making LLM calls; prevents 429s proactively |
| Request queue (async) | Enqueue LLM requests; worker processes them at rate below provider limit | Batch workloads; non-interactive requests where latency SLA > 30 seconds is acceptable |
| Time-based spreading | Schedule burst workloads across off-peak hours (overnight processing) | Nightly document processing; training data generation; bulk enrichment |
| Multi-provider routing | Route to alternative provider when primary is at limit | User-facing workloads where latency SLA cannot be relaxed |
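
As an illustration of the first row, a client-side token bucket fits in a few lines. The class and parameter names here are assumptions for the sketch, not a library API; the bucket refills continuously at the TPM rate and blocks the caller when it would overdraw.

```python
import time

class TokenBucket:
    """Client-side TPM limiter: refill continuously, block when empty."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.tokens = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0   # tokens refilled per second
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, tokens: int):
        """Block until `tokens` are available, then consume them."""
        while True:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            deficit = tokens - self.tokens
            time.sleep(deficit / self.rate)    # wait for refill

# Usage sketch (estimated_tokens would come from a tokenizer):
#   bucket = TokenBucket(tokens_per_minute=40_000)
#   bucket.acquire(estimated_tokens)
#   ... make the LLM call ...
```

Because the bucket delays requests before they leave the client, it prevents 429s proactively instead of reacting to them, which is exactly the trade-off the table describes.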

Internal Rate Limiting — Protect Your Own Budget

Provider rate limits protect the provider. You also need internal limits to protect your budget from runaway agents, abusive users, and misconfigured jobs.

  • Per-user daily token limit — prevents one user consuming the entire budget
  • Per-use-case budget envelope — ring-fences cost for each feature; one runaway job cannot starve others
  • Per-agent-run step limit — agents with tool-calling loops must have a maximum step count; also set a per-run token budget
  • Per-team monthly limit — allocates cost and creates accountability for efficient use
  • Hard limit vs soft limit: hard = reject the request when limit exceeded; soft = alert and continue (useful for monitoring before enforcing)
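
A minimal sketch of the first and last bullets combined: a per-user daily token budget where the soft limit alerts and the hard limit rejects. The class name and the `print`-based alert are illustrative assumptions; a real system would persist spend and page through your alerting stack.

```python
import time
from collections import defaultdict

class DailyTokenBudget:
    """Track per-user token spend; soft limit alerts, hard limit rejects."""

    def __init__(self, soft_limit: int, hard_limit: int):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.spend = defaultdict(int)
        self.day = time.strftime("%Y-%m-%d")

    def _roll_day(self):
        today = time.strftime("%Y-%m-%d")
        if today != self.day:          # reset counters at midnight
            self.day = today
            self.spend.clear()

    def charge(self, user_id: str, tokens: int) -> bool:
        """Return False (reject) past the hard limit; alert past the soft one."""
        self._roll_day()
        if self.spend[user_id] + tokens > self.hard_limit:
            return False               # hard limit: reject the request
        self.spend[user_id] += tokens
        if self.spend[user_id] > self.soft_limit:
            # soft limit: continue serving, but surface the overrun
            print(f"ALERT: {user_id} past soft limit ({self.spend[user_id]} tokens)")
        return True
```

Running with the soft limit below the hard limit for a while, as the last bullet suggests, lets you observe real usage before enforcement starts rejecting traffic.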

LiteLLM budget management (open source)

```yaml
# litellm_config.yaml — per-user and per-model budget enforcement
general_settings:
  master_key: sk-...

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      tpm: 100000        # internal TPM limit, not provider limit
      rpm: 500

litellm_settings:
  max_budget: 100        # USD per day across all models
  budget_duration: 1d
```

Monitoring Rate Limit Health

Key metrics to track

  • 429 rate as a percentage of total requests (alert at > 2%)
  • Queue depth — rising queue = approaching rate ceiling
  • P99 latency spike from retry overhead
  • Retry count distribution per request
  • Daily token consumption vs tier limit (alert at 80%)
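
A sliding-window tracker along these lines can compute the first metric. The class name and window size are assumptions for the sketch; in practice this would feed your metrics pipeline rather than be queried inline.

```python
import time
from collections import deque

class StatusWindow:
    """Rolling window of request outcomes; computes the 429 rate."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = deque()          # (timestamp, status_code)

    def record(self, status: int):
        now = time.monotonic()
        self.events.append((now, status))
        cutoff = now - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()      # drop events outside the window

    def rate_429(self) -> float:
        if not self.events:
            return 0.0
        n429 = sum(1 for _, s in self.events if s == 429)
        return n429 / len(self.events)

# Usage sketch: alert at > 2% per the metric above
#   if window.rate_429() > 0.02:
#       page_oncall()   # hypothetical alerting hook
```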

Alert thresholds

  • 429 rate > 5% → likely at tier limit; request upgrade or add provider
  • Queue depth > 100 sustained → backpressure problem; spread load
  • P99 latency > 3× P50 → retry overhead affecting tail latency
  • Single use case > 80% of daily budget → investigate or cap
  • Agent run exceeding 50 steps → likely in a loop; kill and alert

Checklist: Do You Understand This?

  • Why does retrying immediately on a 429 error make the problem worse rather than better?
  • What is jitter in the context of exponential backoff — and why is it essential in multi-client systems?
  • Name three error codes that should NOT be retried, and explain why.
  • What is the difference between a hard limit and a soft limit for internal rate controls?
  • Why do AI agents with tool-calling loops need a step count limit in addition to a token budget limit?
  • At what 429 rate percentage should you alert and begin planning a tier upgrade or provider addition?