Cost Guardrails for AI Systems
Cost guardrails prevent unplanned spend from reaching the provider. Provider rate limits protect the provider's infrastructure — they do not protect your budget. Internal guardrails enforce your spending policy before a request leaves your system. Without them, a single misconfigured agent, an abusive user, or a runaway batch job can exhaust a monthly budget in hours.
Hard Limits vs Soft Limits
Hard limits
- Request is rejected when the limit is exceeded
- User receives an explicit error: "Daily usage limit reached"
- Prevents any further spend beyond the defined ceiling
- Use for: per-user daily limits, per-agent-run budgets, emergency kill switch
- Risk: degrades user experience if limits are set too low
Soft limits
- Alert fires when limit is approached; request still proceeds
- Operations team is notified to investigate before hard limit is reached
- Use for: team monthly budget warnings (alert at 80%, hard stop at 100%)
- Useful as a monitoring tool before enforcing hard limits in a new system
- Risk: spend continues past the soft limit if alerts are not acted on
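The hard/soft split can be reduced to a single budget check with three outcomes. The sketch below is illustrative (names such as `check_budget` and the 80% soft threshold are assumptions, not from any particular library):

```python
from enum import Enum

class BudgetStatus(Enum):
    OK = "ok"
    SOFT_LIMIT = "soft_limit"   # fire an alert, but allow the request
    HARD_LIMIT = "hard_limit"   # reject the request

def check_budget(spend: float, limit: float, soft_ratio: float = 0.8) -> BudgetStatus:
    """Classify current spend against a budget: soft alert at 80%, hard stop at 100%."""
    if spend >= limit:
        return BudgetStatus.HARD_LIMIT
    if spend >= soft_ratio * limit:
        return BudgetStatus.SOFT_LIMIT
    return BudgetStatus.OK
```

The caller notifies operations on SOFT_LIMIT and continues; on HARD_LIMIT it raises the explicit error the user sees.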
Guardrail Levels
| Level | Scope | Typical limit | Purpose |
|---|---|---|---|
| Per-request | Single LLM call | max_tokens in API call (e.g., 2,000 output tokens) | Prevents runaway output; sets latency ceiling; must be set on every call |
| Per-agent-run | Single agent execution | Maximum steps (e.g., 25) + maximum total tokens (e.g., 100K) | Prevents tool-calling loops from consuming unlimited tokens over many steps |
| Per-user daily | Individual user, 24-hour window | Token budget (e.g., 500K tokens/day) or dollar amount (e.g., $5/day) | Prevents single user from consuming disproportionate share; enables fair-use enforcement |
| Per-use-case | Named feature or workflow | Monthly dollar envelope (e.g., "document processing: $2,000/month") | Ring-fences cost per feature; runaway in one use case cannot starve others |
| Per-team monthly | Organisational team | Monthly dollar budget allocated by finance | Enables chargeback; creates team accountability for AI spend |
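The levels in the table can be captured as one layered configuration object, so every enforcement point reads from the same source. This is a sketch; the field names and values are illustrative defaults taken from the table, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GuardrailConfig:
    # Per-request: hard cap on output tokens, set on every LLM call
    max_output_tokens: int = 2_000
    # Per-agent-run: step and token ceilings for a single execution
    max_agent_steps: int = 25
    max_agent_tokens: int = 100_000
    # Per-user daily: token budget per user per 24-hour window
    user_daily_tokens: int = 500_000
    # Per-use-case: monthly dollar envelope for a named feature
    use_case_monthly_usd: float = 2_000.0

DEFAULTS = GuardrailConfig()
```

Freezing the dataclass prevents a running service from silently mutating its own limits; changes go through config deployment instead.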
Implementation Patterns
Middleware token counter (before LLM call)
```python
import tiktoken  # or the Anthropic tokenizer

async def guarded_llm_call(request: LLMRequest, user_id: str) -> LLMResponse:
    # Count tokens before sending to the provider
    estimated_input_tokens = count_tokens(request.messages)
    estimated_cost = estimate_cost(estimated_input_tokens, model=request.model)

    # Reject if this request would push the user past their daily budget
    user_today_spend = await budget_store.get_daily_spend(user_id)
    if user_today_spend + estimated_cost > USER_DAILY_LIMIT:
        raise DailyLimitExceededError(f"Daily limit of ${USER_DAILY_LIMIT} reached")

    # Send to the LLM and record the actual cost from the usage report
    response = await llm_client.call(request)
    actual_cost = calculate_actual_cost(response.usage)
    await budget_store.record_spend(user_id, actual_cost)
    return response
```
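The helpers referenced above (`count_tokens`, `estimate_cost`) might look like the following sketch. The per-token heuristic and the price table are assumptions for illustration; a real implementation would use the provider's tokenizer and current published pricing:

```python
# Illustrative per-million-input-token prices in USD; NOT current provider pricing.
PRICE_PER_M_INPUT = {"gpt-4o": 2.50, "claude-sonnet": 3.00}

def count_tokens(messages: list[dict]) -> int:
    """Rough pre-flight estimate: ~4 characters per token. A real tokenizer
    (tiktoken, or the provider's token-counting endpoint) is more accurate."""
    return sum(len(m["content"]) for m in messages) // 4

def estimate_cost(input_tokens: int, model: str) -> float:
    """Estimated input-side cost in USD. The output side is unknown until the
    call returns, so it is reconciled from response.usage afterwards."""
    return input_tokens / 1_000_000 * PRICE_PER_M_INPUT[model]
```

Because the pre-flight estimate only covers input tokens, the guardrail errs on the permissive side; recording the actual cost after the call keeps the budget store accurate.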
LiteLLM budget management
- Built-in per-user and per-team budget enforcement
- Hard limits enforced at the proxy layer before requests reach providers
- Budget dashboard with spend by model, user, and team
- Virtual keys with individual budget ceilings per API key
AWS Bedrock spend alerts
- AWS Budgets: alert when AI service cost exceeds threshold
- Cost Anomaly Detection: ML-based flagging of unexpected spend spikes
- Resource tagging: tag Bedrock API calls with use case and team for attribution
- Account-level SCP: Service Control Policies can hard-block certain model calls
Runaway Agent Protection
Agents with tool-calling loops require both a step limit AND a token budget
Neither limit alone is sufficient. A step limit alone does not cap cost: the agent can generate long outputs on each step. A token budget alone does not bound the number of actions: the agent can take many low-token steps before the budget trips, and each step may be a destructive tool call. Both limits are required. Set the step limit conservatively for the use case (a simple research agent rarely needs more than 15 steps; alert at 10).
```python
class BudgetedAgent:
    MAX_STEPS = 20
    MAX_TOKENS = 150_000  # total for the run

    def run(self, task: str) -> AgentResult:
        steps = 0
        tokens_used = 0
        done = False
        while not done:
            # Check both guardrails before every step
            if steps >= self.MAX_STEPS:
                return AgentResult(status="step_limit_exceeded", steps=steps)
            if tokens_used >= self.MAX_TOKENS:
                return AgentResult(status="token_budget_exceeded", tokens=tokens_used)
            response = self.llm_call()
            tokens_used += response.usage.total_tokens
            steps += 1
            done = response.is_final  # agent signals task completion
        return AgentResult(status="completed", steps=steps, tokens=tokens_used)
```
Anomaly Detection
- Alert when a use case's daily spend doubles compared to its rolling 7-day average — early signal of misuse or runaway process
- Alert when a single user consumes > 10% of total daily spend — unusual usage pattern
- Alert when agent run token count exceeds 2× the P95 for that agent type — possible loop
- Alert when total organisational AI spend grows > 50% month-over-month without a corresponding increase in usage volume — signals efficiency degradation
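The first rule, daily spend doubling against its rolling 7-day average, can be sketched as a pure function over a spend history (the function name and 8-day minimum history are illustrative choices):

```python
from statistics import mean

def spend_doubled(daily_spend: list[float], factor: float = 2.0) -> bool:
    """Flag when the latest day's spend exceeds `factor` times the rolling
    average of the preceding 7 days. Needs at least 8 days of history."""
    if len(daily_spend) < 8:
        return False  # not enough history to establish a baseline
    baseline = mean(daily_spend[-8:-1])  # the 7 days before today
    return daily_spend[-1] > factor * baseline
```

The same shape generalises to the other rules: compare today's observation against a baseline (rolling average, P95, prior month) and alert when the ratio crosses a threshold.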
Checklist: Do You Understand This?
- What is the difference between a hard limit and a soft limit — and when should you use each?
- Why must max_tokens be set on every LLM call, and what happens if you omit it?
- Why does an agent require both a step limit and a token budget — why is one insufficient?
- What is a per-use-case budget envelope and why does it prevent a runaway process from starving other features?
- Design a guardrail architecture for a B2B SaaS product where AI cost is a variable cost per customer.
- At what anomaly threshold should you alert for daily spend growth, and what should the investigation workflow be?