Rate Limit Handling for AI Systems
Rate limits are not just a nuisance — they are a reliability problem. A burst of traffic that triggers 429 responses can cascade into queue buildup, user-facing errors, and degraded service. Handling rate limits correctly requires a combination of backoff logic, request shaping, internal budget controls, and multi-provider routing for failover.
Provider Rate Limits in 2025
| Provider | Limit dimensions | Entry tier limits (approximate) | Tier upgrade path |
|---|---|---|---|
| Anthropic (Claude) | RPM (requests/min), TPM (tokens/min), TPD (tokens/day) | Tier 1: 50 RPM / 40K TPM; Tier 4 (Build): 4,000 RPM / 400K TPM | Usage-based automatic promotion after spending thresholds; request higher tier via console |
| OpenAI | RPM, RPD (requests/day), TPM, TPD, image limits | Tier 1: 500 RPM / 200K TPM; Tier 5: 10,000 RPM / 2M TPM | Automatic tier promotion based on cumulative spend; Tier 5 at $250K+ spend |
| Google (Gemini API) | RPM, TPM, RPD | Free tier: 15 RPM / 1M TPM; Pay-as-you-go: 2,000 RPM / 4M TPM | Contact Google Cloud for enterprise limits; Vertex AI has separate higher limits |
| AWS Bedrock | Requests per minute per model; provisioned throughput available | On-demand limits vary by model and region; provisioned throughput bypasses soft limits | Request limit increases via AWS Support; provisioned throughput for guaranteed capacity |
Exact limits change frequently and are tier-dependent. Always check the provider's documentation for current limits at your tier. Plan for the limits of your current tier, not the tier you expect to reach.
Exponential Backoff with Jitter
Naive retry on 429 makes the problem worse — all retrying clients hit the API at the same time, creating a thundering herd. Exponential backoff with jitter spreads retry load across time.
```python
from anthropic import AsyncAnthropic, RateLimitError
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential_jitter)

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential_jitter(initial=1, max=60),  # 1s → 2s → 4s ... capped at 60s, plus jitter
    stop=stop_after_attempt(6),
)
async def call_llm(prompt: str) -> str:
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",  # example model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# The jitter is critical: without it, multiple clients retrying in sync still
# hit the API simultaneously after the same backoff interval. Note that
# tenacity's wait_exponential_jitter computes the exponential wait, caps it at
# `max`, and adds up to `jitter` seconds (default 1) of uniform random noise.
```
Correct backoff behaviour
- Exponential base wait: 1s, 2s, 4s, 8s, 16s, 32s (cap at 60s)
- Add full jitter: final wait = random(0, base_wait)
- Respect Retry-After header if provider sends it
- Maximum attempt cap: 5-6 retries before returning error to caller
- Distinguish 429 (retryable) from 400 (not retryable — bad request)
Common mistakes
- Retrying immediately on 429 — no wait at all
- Fixed sleep (time.sleep(1)) — no backoff, thundering herd persists
- No retry cap — infinite retries block threads/goroutines indefinitely
- Retrying 400 and 401 errors — these won't succeed regardless of retries
- Retrying inside the LLM call without propagating errors to the queue layer
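The retryable/non-retryable distinction is easy to get wrong in the heat of an incident, so it is worth encoding explicitly. A sketch with a hypothetical helper name (`is_retryable`); the 5xx set is illustrative, though 529 is in fact the status Anthropic returns when its servers are overloaded:

```python
def is_retryable(status: int) -> bool:
    """429 and transient server errors are worth retrying; client errors are not."""
    if status == 429:
        return True
    if status in (500, 502, 503, 529):  # 529: Anthropic "overloaded" response
        return True
    return False  # 400/401/403/404: fix the request or credentials instead
```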
Request Queuing and Load Spreading
| Strategy | How it works | When to use |
|---|---|---|
| Token bucket (client-side) | Track tokens consumed per minute locally; delay requests when approaching limit | Single service making LLM calls; prevents 429s proactively |
| Request queue (async) | Enqueue LLM requests; a worker drains them at a rate below the provider limit | Batch workloads; non-interactive requests that can tolerate latencies of 30+ seconds |
| Time-based spreading | Schedule burst workloads across off-peak hours (overnight processing) | Nightly document processing; training data generation; bulk enrichment |
| Multi-provider routing | Route to alternative provider when primary is at limit | User-facing workloads where latency SLA cannot be relaxed |
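The first row of the table, a client-side token bucket, can be sketched in a few lines. This assumes a single process and blocking callers; a shared deployment would need the same logic behind a lock or in Redis. The class name and interface are illustrative:

```python
import time

class TokenBucket:
    """Client-side TPM limiter: refills continuously, blocks when drained."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.tokens = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0  # tokens replenished per second
        self.last = time.monotonic()

    def acquire(self, tokens: int) -> None:
        """Block until `tokens` are available, then consume them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            time.sleep((tokens - self.tokens) / self.rate)  # wait for refill
```

Calling `acquire(estimated_tokens)` before each request converts 429s from the provider into a short local wait, which is far cheaper than a retry round-trip.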
Internal Rate Limiting — Protect Your Own Budget
Provider rate limits protect the provider. You also need internal limits to protect your budget from runaway agents, abusive users, and misconfigured jobs.
- Per-user daily token limit — prevents one user consuming the entire budget
- Per-use-case budget envelope — ring-fences cost for each feature; one runaway job cannot starve others
- Per-agent-run step limit — agents with tool-calling loops must have a maximum step count; also set a per-run token budget
- Per-team monthly limit — allocates cost and creates accountability for efficient use
- Hard limit vs soft limit: hard = reject the request when limit exceeded; soft = alert and continue (useful for monitoring before enforcing)
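The hard/soft distinction in the last bullet fits in one small function. A minimal sketch with hypothetical names (`Budget`, `charge`); a real implementation would persist usage and emit the alert to your monitoring system:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    limit_tokens: int
    used_tokens: int = 0
    hard: bool = True  # hard: reject over limit; soft: allow but flag

def charge(budget: Budget, tokens: int) -> tuple[bool, bool]:
    """Return (allowed, over_limit). Hard budgets reject; soft budgets alert."""
    over = budget.used_tokens + tokens > budget.limit_tokens
    if over and budget.hard:
        return False, True           # reject the request outright
    budget.used_tokens += tokens     # soft limit: record usage and continue
    return True, over                # caller alerts when over is True
```

Running a new limit in soft mode first shows how often it would fire before you let it start rejecting traffic.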
LiteLLM budget management (open source)
```yaml
# litellm_config.yaml — per-user and per-model budget enforcement
model_list:                   # top-level in LiteLLM proxy config
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      tpm: 100000             # internal TPM limit, not the provider limit
      rpm: 500

general_settings:
  master_key: sk-...

litellm_settings:
  max_budget: 100             # USD across all models
  budget_duration: 1d         # budget resets daily
```
Monitoring Rate Limit Health
Key metrics to track
- 429 rate as a percentage of total requests (alert at > 2%)
- Queue depth — rising queue = approaching rate ceiling
- P99 latency spike from retry overhead
- Retry count distribution per request
- Daily token consumption vs tier limit (alert at 80%)
Alert thresholds
- 429 rate > 5% → likely at tier limit; request upgrade or add provider
- Queue depth > 100 sustained → backpressure problem; spread load
- P99 latency > 3× P50 → retry overhead affecting tail latency
- Single use case > 80% of daily budget → investigate or cap
- Agent run exceeding 50 steps → likely in a loop; kill and alert
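The first three thresholds above can be evaluated from metrics you are likely already collecting. A sketch; the function name and thresholds mirror the list and are illustrative, and a real system would read these values from its metrics backend:

```python
def rate_limit_alerts(total_requests: int, http_429: int, queue_depth: int,
                      p50_ms: float, p99_ms: float) -> list[str]:
    """Return the alert messages triggered by the current metric snapshot."""
    alerts = []
    if total_requests and http_429 / total_requests > 0.05:
        alerts.append("429 rate > 5%: at tier limit; upgrade or add a provider")
    if queue_depth > 100:
        alerts.append("queue depth > 100: backpressure; spread load")
    if p50_ms and p99_ms > 3 * p50_ms:
        alerts.append("P99 > 3x P50: retry overhead hitting tail latency")
    return alerts
```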
Checklist: Do You Understand This?
- Why does retrying immediately on a 429 error make the problem worse rather than better?
- What is jitter in the context of exponential backoff — and why is it essential in multi-client systems?
- Name three error codes that should NOT be retried, and explain why.
- What is the difference between a hard limit and a soft limit for internal rate controls?
- Why do AI agents with tool-calling loops need a step count limit in addition to a token budget limit?
- At what 429 rate percentage should you alert and begin planning a tier upgrade or provider addition?