Rate Limit Handling for AI Systems
Rate limits are not just a nuisance — they are a reliability problem. A burst of traffic that triggers 429 responses can cascade into queue buildup, user-facing errors, and degraded service. Handling rate limits correctly requires a combination of backoff logic, request shaping, internal budget controls, and multi-provider routing for failover.
Provider Rate Limits in 2025
| Provider | Limit dimensions | Entry tier limits (approximate) | Tier upgrade path |
|---|---|---|---|
| Anthropic (Claude) | RPM (requests/min), TPM (tokens/min), TPD (tokens/day) | Tier 1: 50 RPM / 40K TPM; Tier 4 (Build): 4,000 RPM / 400K TPM | Usage-based automatic promotion after spending thresholds; request higher tier via console |
| OpenAI | RPM, RPD (requests/day), TPM, TPD, image limits | Tier 1: 500 RPM / 200K TPM; Tier 5: 10,000 RPM / 2M TPM | Automatic tier promotion based on cumulative spend; Tier 5 at $250K+ spend |
| Google (Gemini API) | RPM, TPM, RPD | Free tier: 15 RPM / 1M TPM; Pay-as-you-go: 2,000 RPM / 4M TPM | Contact Google Cloud for enterprise limits; Vertex AI has separate higher limits |
| AWS Bedrock | Requests per minute per model; provisioned throughput available | On-demand limits vary by model and region; provisioned throughput bypasses soft limits | Request limit increases via AWS Support; provisioned throughput for guaranteed capacity |
Exact limits change frequently and are tier-dependent. Always check the provider's documentation for current limits at your tier. Plan for the limits of your current tier, not the tier you expect to reach.
Exponential Backoff with Jitter
Naive retry on 429 makes the problem worse — all retrying clients hit the API at the same time, creating a thundering herd. Exponential backoff with jitter spreads retry load across time.
```python
from anthropic import AsyncAnthropic, RateLimitError
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential_jitter)

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential_jitter(initial=1, max=60),  # 1s → 2s → 4s ... capped at 60s, plus jitter
    stop=stop_after_attempt(6),
)
async def call_llm(prompt: str) -> str:
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",  # example model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# The jitter is critical: without it, multiple clients retrying in sync still
# hit the API simultaneously after the same backoff interval. Note that
# tenacity's wait_exponential_jitter computes the exponential wait, caps it at
# `max`, and adds up to `jitter` seconds (default 1) of uniform random noise.
```
Correct backoff behaviour
- Exponential base wait: 1s, 2s, 4s, 8s, 16s, 32s (cap at 60s)
- Add full jitter: final wait = random(0, base_wait)
- Respect Retry-After header if provider sends it
- Maximum attempt cap: 5-6 retries before returning error to caller
- Distinguish 429 (retryable) from 400 (not retryable — bad request)
Common mistakes
- Retrying immediately on 429 — no wait at all
- Fixed sleep (time.sleep(1)) — no backoff, thundering herd persists
- No retry cap — infinite retries block threads/goroutines indefinitely
- Retrying 400 and 401 errors — these won't succeed regardless of retries
- Retrying inside the LLM call without propagating errors to the queue layer
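The retryable/non-retryable distinction is easy to get wrong in the heat of an incident, so it is worth encoding explicitly. A sketch with a hypothetical helper name (`is_retryable`); the 5xx set is illustrative, though 529 is in fact the status Anthropic returns when its servers are overloaded:

```python
def is_retryable(status: int) -> bool:
    """429 and transient server errors are worth retrying; client errors are not."""
    if status == 429:
        return True
    if status in (500, 502, 503, 529):  # 529: Anthropic "overloaded" response
        return True
    return False  # 400/401/403/404: fix the request or credentials instead
```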
Request Queuing and Load Spreading
| Strategy | How it works | When to use |
|---|---|---|
| Token bucket (client-side) | Track tokens consumed per minute locally; delay requests when approaching limit | Single service making LLM calls; prevents 429s proactively |
| Request queue (async) | Enqueue LLM requests; a worker drains them at a rate below the provider limit | Batch workloads; non-interactive requests that can tolerate latencies of 30+ seconds |
| Time-based spreading | Schedule burst workloads across off-peak hours (overnight processing) | Nightly document processing; training data generation; bulk enrichment |
| Multi-provider routing | Route to alternative provider when primary is at limit | User-facing workloads where latency SLA cannot be relaxed |
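The first row of the table, a client-side token bucket, can be sketched in a few lines. This assumes a single process and blocking callers; a shared deployment would need the same logic behind a lock or in Redis. The class name and interface are illustrative:

```python
import time

class TokenBucket:
    """Client-side TPM limiter: refills continuously, blocks when drained."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.tokens = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0  # tokens replenished per second
        self.last = time.monotonic()

    def acquire(self, tokens: int) -> None:
        """Block until `tokens` are available, then consume them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            time.sleep((tokens - self.tokens) / self.rate)  # wait for refill
```

Calling `acquire(estimated_tokens)` before each request converts 429s from the provider into a short local wait, which is far cheaper than a retry round-trip.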
Internal Rate Limiting — Protect Your Own Budget
Provider rate limits protect the provider. You also need internal limits to protect your budget from runaway agents, abusive users, and misconfigured jobs.
- Per-user daily token limit — prevents one user consuming the entire budget
- Per-use-case budget envelope — ring-fences cost for each feature; one runaway job cannot starve others
- Per-agent-run step limit — agents with tool-calling loops must have a maximum step count; also set a per-run token budget
- Per-team monthly limit — allocates cost and creates accountability for efficient use
- Hard limit vs soft limit: hard = reject the request when limit exceeded; soft = alert and continue (useful for monitoring before enforcing)
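The hard/soft distinction in the last bullet fits in one small function. A minimal sketch with hypothetical names (`Budget`, `charge`); a real implementation would persist usage and emit the alert to your monitoring system:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    limit_tokens: int
    used_tokens: int = 0
    hard: bool = True  # hard: reject over limit; soft: allow but flag

def charge(budget: Budget, tokens: int) -> tuple[bool, bool]:
    """Return (allowed, over_limit). Hard budgets reject; soft budgets alert."""
    over = budget.used_tokens + tokens > budget.limit_tokens
    if over and budget.hard:
        return False, True           # reject the request outright
    budget.used_tokens += tokens     # soft limit: record usage and continue
    return True, over                # caller alerts when over is True
```

Running a new limit in soft mode first shows how often it would fire before you let it start rejecting traffic.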
LiteLLM budget management (open source)
```yaml
# litellm_config.yaml — per-user and per-model budget enforcement
model_list:                   # top-level in LiteLLM proxy config
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      tpm: 100000             # internal TPM limit, not the provider limit
      rpm: 500

general_settings:
  master_key: sk-...

litellm_settings:
  max_budget: 100             # USD across all models
  budget_duration: 1d         # budget resets daily
```
Monitoring Rate Limit Health
Key metrics to track
- 429 rate as a percentage of total requests (alert at > 2%)
- Queue depth — rising queue = approaching rate ceiling
- P99 latency spike from retry overhead
- Retry count distribution per request
- Daily token consumption vs tier limit (alert at 80%)
Alert thresholds
- 429 rate > 5% → likely at tier limit; request upgrade or add provider
- Queue depth > 100 sustained → backpressure problem; spread load
- P99 latency > 3× P50 → retry overhead affecting tail latency
- Single use case > 80% of daily budget → investigate or cap
- Agent run exceeding 50 steps → likely in a loop; kill and alert
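The first three thresholds above can be evaluated from metrics you are likely already collecting. A sketch; the function name and thresholds mirror the list and are illustrative, and a real system would read these values from its metrics backend:

```python
def rate_limit_alerts(total_requests: int, http_429: int, queue_depth: int,
                      p50_ms: float, p99_ms: float) -> list[str]:
    """Return the alert messages triggered by the current metric snapshot."""
    alerts = []
    if total_requests and http_429 / total_requests > 0.05:
        alerts.append("429 rate > 5%: at tier limit; upgrade or add a provider")
    if queue_depth > 100:
        alerts.append("queue depth > 100: backpressure; spread load")
    if p50_ms and p99_ms > 3 * p50_ms:
        alerts.append("P99 > 3x P50: retry overhead hitting tail latency")
    return alerts
```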
Checklist: Do You Understand This?
- Why does retrying immediately on a 429 error make the problem worse rather than better?
- What is jitter in the context of exponential backoff — and why is it essential in multi-client systems?
- Name three error codes that should NOT be retried, and explain why.
- What is the difference between a hard limit and a soft limit for internal rate controls?
- Why do AI agents with tool-calling loops need a step count limit in addition to a token budget limit?
- At what 429 rate percentage should you alert and begin planning a tier upgrade or provider addition?