Cost Guardrails for AI Systems
Cost guardrails prevent unplanned spend from reaching the provider. Provider rate limits protect the provider's infrastructure — they do not protect your budget. Internal guardrails enforce your spending policy before a request leaves your system. Without them, a single misconfigured agent, an abusive user, or a runaway batch job can exhaust a monthly budget in hours.
Hard Limits vs Soft Limits
Hard limits
- Request is rejected when the limit is exceeded
- User receives an explicit error: "Daily usage limit reached"
- Prevents any further spend beyond the defined ceiling
- Use for: per-user daily limits, per-agent-run budgets, emergency kill switch
- Risk: degrades user experience if limits are set too low
Soft limits
- Alert fires when limit is approached; request still proceeds
- Operations team is notified to investigate before hard limit is reached
- Use for: team monthly budget warnings (alert at 80%, hard stop at 100%)
- Useful as a monitoring tool before enforcing hard limits in a new system
- Risk: spend continues past the soft limit if alerts are not acted on
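The hard/soft split can be reduced to a single budget check with three outcomes. The sketch below is illustrative (names such as `check_budget` and the 80% soft threshold are assumptions, not from any particular library):

```python
from enum import Enum

class BudgetStatus(Enum):
    OK = "ok"
    SOFT_LIMIT = "soft_limit"   # fire an alert, but allow the request
    HARD_LIMIT = "hard_limit"   # reject the request

def check_budget(spend: float, limit: float, soft_ratio: float = 0.8) -> BudgetStatus:
    """Classify current spend against a budget: soft alert at 80%, hard stop at 100%."""
    if spend >= limit:
        return BudgetStatus.HARD_LIMIT
    if spend >= soft_ratio * limit:
        return BudgetStatus.SOFT_LIMIT
    return BudgetStatus.OK
```

The caller notifies operations on SOFT_LIMIT and continues; on HARD_LIMIT it raises the explicit error the user sees.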
Guardrail Levels
| Level | Scope | Typical limit | Purpose |
|---|---|---|---|
| Per-request | Single LLM call | max_tokens in API call (e.g., 2,000 output tokens) | Prevents runaway output; sets latency ceiling; must be set on every call |
| Per-agent-run | Single agent execution | Maximum steps (e.g., 25) + maximum total tokens (e.g., 100K) | Prevents tool-calling loops from consuming unlimited tokens over many steps |
| Per-user daily | Individual user, 24-hour window | Token budget (e.g., 500K tokens/day) or dollar amount (e.g., $5/day) | Prevents single user from consuming disproportionate share; enables fair-use enforcement |
| Per-use-case | Named feature or workflow | Monthly dollar envelope (e.g., "document processing: $2,000/month") | Ring-fences cost per feature; runaway in one use case cannot starve others |
| Per-team monthly | Organisational team | Monthly dollar budget allocated by finance | Enables chargeback; creates team accountability for AI spend |
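The levels in the table can be captured as one layered configuration object, so every enforcement point reads from the same source. This is a sketch; the field names and values are illustrative defaults taken from the table, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GuardrailConfig:
    # Per-request: hard cap on output tokens, set on every LLM call
    max_output_tokens: int = 2_000
    # Per-agent-run: step and token ceilings for a single execution
    max_agent_steps: int = 25
    max_agent_tokens: int = 100_000
    # Per-user daily: token budget per user per 24-hour window
    user_daily_tokens: int = 500_000
    # Per-use-case: monthly dollar envelope for a named feature
    use_case_monthly_usd: float = 2_000.0

DEFAULTS = GuardrailConfig()
```

Freezing the dataclass prevents a running service from silently mutating its own limits; changes go through config deployment instead.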
Implementation Patterns
Middleware token counter (before LLM call)
```python
import tiktoken  # or the Anthropic tokenizer

async def guarded_llm_call(request: LLMRequest, user_id: str) -> LLMResponse:
    # Count tokens before sending to the provider
    estimated_input_tokens = count_tokens(request.messages)
    estimated_cost = estimate_cost(estimated_input_tokens, model=request.model)

    # Reject if this request would push the user past their daily budget
    user_today_spend = await budget_store.get_daily_spend(user_id)
    if user_today_spend + estimated_cost > USER_DAILY_LIMIT:
        raise DailyLimitExceededError(f"Daily limit of ${USER_DAILY_LIMIT} reached")

    # Send to the LLM and record the actual cost from the usage report
    response = await llm_client.call(request)
    actual_cost = calculate_actual_cost(response.usage)
    await budget_store.record_spend(user_id, actual_cost)
    return response
```
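The helpers referenced above (`count_tokens`, `estimate_cost`) might look like the following sketch. The per-token heuristic and the price table are assumptions for illustration; a real implementation would use the provider's tokenizer and current published pricing:

```python
# Illustrative per-million-input-token prices in USD; NOT current provider pricing.
PRICE_PER_M_INPUT = {"gpt-4o": 2.50, "claude-sonnet": 3.00}

def count_tokens(messages: list[dict]) -> int:
    """Rough pre-flight estimate: ~4 characters per token. A real tokenizer
    (tiktoken, or the provider's token-counting endpoint) is more accurate."""
    return sum(len(m["content"]) for m in messages) // 4

def estimate_cost(input_tokens: int, model: str) -> float:
    """Estimated input-side cost in USD. The output side is unknown until the
    call returns, so it is reconciled from response.usage afterwards."""
    return input_tokens / 1_000_000 * PRICE_PER_M_INPUT[model]
```

Because the pre-flight estimate only covers input tokens, the guardrail errs on the permissive side; recording the actual cost after the call keeps the budget store accurate.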
LiteLLM budget management
- Built-in per-user and per-team budget enforcement
- Hard limits enforced at the proxy layer before requests reach providers
- Budget dashboard with spend by model, user, and team
- Virtual keys with individual budget ceilings per API key
AWS Bedrock spend alerts
- AWS Budgets: alert when AI service cost exceeds threshold
- Cost Anomaly Detection: ML-based flagging of unexpected spend spikes
- Resource tagging: tag Bedrock API calls with use case and team for attribution
- Account-level SCP: Service Control Policies can hard-block certain model calls
Runaway Agent Protection
Agents with tool-calling loops require both a step limit AND a token budget
Neither limit alone is sufficient. A step limit alone does not cap cost: the agent can generate long outputs on each step. A token budget alone does not bound the number of actions: the agent can take many low-token steps before the budget trips, and each step may be a destructive tool call. Both limits are required. Set the step limit conservatively for the use case (a simple research agent rarely needs more than 15 steps; alert at 10).
```python
class BudgetedAgent:
    MAX_STEPS = 20
    MAX_TOKENS = 150_000  # total for the run

    def run(self, task: str) -> AgentResult:
        steps = 0
        tokens_used = 0
        done = False
        while not done:
            # Check both guardrails before every step
            if steps >= self.MAX_STEPS:
                return AgentResult(status="step_limit_exceeded", steps=steps)
            if tokens_used >= self.MAX_TOKENS:
                return AgentResult(status="token_budget_exceeded", tokens=tokens_used)
            response = self.llm_call()
            tokens_used += response.usage.total_tokens
            steps += 1
            done = response.is_final  # agent signals task completion
        return AgentResult(status="completed", steps=steps, tokens=tokens_used)
```
Anomaly Detection
- Alert when a use case's daily spend doubles compared to its rolling 7-day average — early signal of misuse or runaway process
- Alert when a single user consumes > 10% of total daily spend — unusual usage pattern
- Alert when agent run token count exceeds 2× the P95 for that agent type — possible loop
- Alert when total organisational AI spend grows > 50% month-over-month without a corresponding increase in usage volume — signals efficiency degradation
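The first rule, daily spend doubling against its rolling 7-day average, can be sketched as a pure function over a spend history (the function name and 8-day minimum history are illustrative choices):

```python
from statistics import mean

def spend_doubled(daily_spend: list[float], factor: float = 2.0) -> bool:
    """Flag when the latest day's spend exceeds `factor` times the rolling
    average of the preceding 7 days. Needs at least 8 days of history."""
    if len(daily_spend) < 8:
        return False  # not enough history to establish a baseline
    baseline = mean(daily_spend[-8:-1])  # the 7 days before today
    return daily_spend[-1] > factor * baseline
```

The same shape generalises to the other rules: compare today's observation against a baseline (rolling average, P95, prior month) and alert when the ratio crosses a threshold.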
Checklist: Do You Understand This?
- What is the difference between a hard limit and a soft limit — and when should you use each?
- Why must max_tokens be set on every LLM call, and what happens if you omit it?
- Why does an agent require both a step limit and a token budget — why is one insufficient?
- What is a per-use-case budget envelope and why does it prevent a runaway process from starving other features?
- Design a guardrail architecture for a B2B SaaS product where AI cost is a variable cost per customer.
- At what anomaly threshold should you alert for daily spend growth, and what should the investigation workflow be?