Agent Guardrails & Permissions
An agent that can browse the web, write files, execute code, and send emails has real-world consequences. Without guardrails, a single misconstrued instruction or prompt injection can cause data deletion, credential leakage, or runaway API spending. Guardrails are the safety layer that sits between what a user asks and what the agent is actually allowed to do. This page covers the four guardrail layers, how to design permissions, and the controls that prevent agents from causing irreversible harm.
What Guardrails Do
A guardrail is a programmatic or model-based check that intercepts agent behaviour at a defined boundary. Guardrails do not make an agent smarter β they constrain it. The goal is to ensure the agent cannot exceed its intended scope even when given adversarial inputs or when the model makes a reasoning error.
| Layer | What it checks | When it runs |
|---|---|---|
| Input guardrail | User message content, PII, toxicity, jailbreak patterns | Before reaching the LLM |
| Tool permission layer | Which tools the agent may call, with what arguments | Before tool execution |
| Output guardrail | Model response for hallucinations, PII leakage, policy violations | After LLM generates, before delivery |
| Action confirmation layer | Irreversible or high-impact actions before execution | Before destructive tool calls |
Layer 1 β Input Guardrails
The first gate. Input guardrails run on every user message before it reaches the model. They catch problems cheaply β a regex or small classifier is orders of magnitude cheaper than sending a malicious prompt to a powerful model and cleaning up afterward.
Checks to implement
- PII detection: regex + ML classifier for SSNs, credit cards, passwords in prompts
- Jailbreak / prompt injection detection: pattern matching + embedding similarity against known attacks
- Topic scope: off-topic classifier for domain-restricted agents (e.g., customer support bot rejecting code-gen requests)
- Language / content policy: toxicity classifier, hate-speech filters
- Rate limiting: per-user, per-session limits to prevent abuse loops
Implementation approach
- Run cheap checks first (regex) before expensive ones (LLM classifiers)
- Use a fast small model (e.g., Haiku, GPT-4o mini) as the guard, not the main model
- Return canned refusal messages β do not explain which rule was triggered (reduces gaming)
- Log all rejections with reason codes for monitoring
- Libraries:
guardrails-ai, NeMo Guardrails, LlamaGuard
Layer 2 β Tool Permissions
This is the most critical layer for agents with real-world effects. Every tool an agent can call must be classified by its reversibility and blast radius, then protected accordingly.
Classify every tool by risk
| Risk tier | Characteristics | Examples | Default policy |
|---|---|---|---|
| Read-only | No side effects, fully reversible | web search, read file, query DB | Allow freely |
| Write / append | Creates or modifies state, reversible with effort | write file, create ticket, POST API | Allow with argument validation |
| Destructive | Deletes data, hard to reverse | delete record, close account, archive project | Require confirmation |
| Irreversible / high-blast | Cannot be undone, affects many users or systems | send email to 10k users, drop table, charge card | Human-in-the-loop required |
Argument-level validation
Classifying tools by type is not enough β validate the arguments too. An agent calling delete_file(path) should be blocked if path resolves outside the allowed working directory. An agent calling send_email(to, body)should be blocked if to is not on the allowlist.
Argument validation rules:
- Validate path arguments against an allowed directory whitelist (prevent path traversal)
- Validate resource IDs against the current user's owned resources (prevent IDOR)
- Cap numeric arguments (e.g.,
limit,count) to sane maximums - Block shell metacharacters in string arguments (
;,&&,|, backticks) - Reject arguments that reference environment variables, system paths, or credential files
Layer 3 β Output Guardrails
After the model generates a response, output guardrails inspect it before delivery. This catches what input guardrails cannot: hallucinated facts, unintended PII in the response, and policy-violating content generated by the model itself.
Output checks to run
- PII scan: mask SSNs, emails, phone numbers before displaying to user
- Grounding check: flag claims not supported by retrieved context (RAG agents)
- Format validation: confirm structured output (JSON) parses against schema
- Length / repetition check: detect runaway output or stuck loops
- Toxicity / brand safety: block outputs violating content policy
Output guardrail pitfalls
- Over-blocking: overly strict classifiers reject valid responses, frustrating users
- Latency: each check adds round-trip time β chain them in parallel where possible
- False confidence: passing output guardrails does not mean the response is factually correct
- Bypass via encoding: attackers encode harmful content in base64 or Unicode variants to evade text classifiers
Layer 4 β Action Confirmation (Human-in-the-Loop)
For irreversible or high-impact actions, the agent must pause and request human approval before proceeding. This is not optional for production agents β it is a design requirement.
Confirmation checkpoint pattern
- Agent plans an action and identifies it as destructive/irreversible
- Agent emits a structured
confirmation_requiredevent with: action, arguments, estimated impact, reversal cost - Orchestrator routes to human operator or approval UI
- Human approves, modifies, or cancels
- Agent proceeds only on explicit approval β timeout = cancel, not proceed
What triggers confirmation
- Any delete, drop, truncate, purge operation
- Outbound communications (email, Slack, webhooks) with external recipients
- Financial transactions above a configured threshold
- Privilege escalation (requesting new permissions or credentials)
- Actions affecting more than N records (configurable blast-radius limit)
- Any action the agent itself flags as uncertain or low-confidence
Prompt Injection Defence
Prompt injection is the top security risk for agents in 2025. An agent that reads external content (web pages, emails, documents, tool results) can be manipulated by malicious text embedded in that content. This is called cross-prompt injection attack (XPIA) β the environment injects instructions into the agent's context.
Defences that work
- Delimit external content: wrap retrieved/tool content in XML tags or a separator the system prompt explicitly defines as untrusted
- Dual-prompt architecture: system prompt instructs the model to treat content between
<external>tags as data, never as instructions - Injection classifier: run a second model pass on retrieved content before inserting into context, checking for embedded instructions
- Capability isolation: an agent that reads emails should not have write/send capabilities in the same execution
- Audit all tool results: log what entered the context window β not just user messages
Common injection vectors
- Hidden text in web pages (white text on white background)
- Instructions in document metadata or alt-text
- Malicious tool result payloads (attacker-controlled APIs)
- Email bodies containing "Ignore previous instructionsβ¦"
- Markdown or HTML that renders instructions visibly to the agent but not the user
Designing a Permissions Model
Apply least-privilege to agents just as you would to microservices. The agent should only have access to the tools and data it needs for the current task β and nothing more.
Least-privilege agent design checklist:
- Define a tool allowlist per agent role β not a global tool pool
- Scope data access by session: inject only the current user's data, not the full dataset
- Use scoped API keys per agent instance β rotate after task completion
- Separate read agents from write agents in multi-agent systems; handoff requires explicit approval
- Never put admin credentials in the agent's context β use a broker service that validates each request
- Log all tool calls with args, timestamps, and the reasoning step that triggered them
Putting It Together β The Guardrail Stack
Request flow with full guardrail stack:
User message
β [Input guardrail] PII / injection / topic check
β LLM (with system prompt + tools)
β [Tool permission layer] risk classification + arg validation
β [Action confirmation] human approval if destructive
β Tool executes
β LLM generates final response
β [Output guardrail] PII mask / grounding / format check
User receives response
Each layer is independent. A failure at any layer blocks the request and logs the event. The layers do not share state β this ensures a bypass at one layer does not automatically compromise others.
Guardrail Failure Modes
Under-guarding
- No input sanitisation β prompt injection succeeds on first attempt
- Tools exposed without argument validation β path traversal, IDOR attacks succeed
- Irreversible actions allowed without confirmation β accidental deletions, mass emails sent
- No output filtering β PII from one user leaked to another in shared deployments
Over-guarding
- Too many confirmation steps β agent becomes unusable, humans approve everything reflexively
- Overly strict classifiers β blocks legitimate requests, erodes user trust
- Guardrail latency dominates response time β each check adds 200β800ms
- False sense of security β guardrails as a checkbox rather than a real threat model
Checklist: Do You Understand This?
- Can you name the four guardrail layers and explain when each one runs?
- What is a cross-prompt injection attack (XPIA) and how does the dual-prompt architecture defend against it?
- How would you classify a
send_emailtool by risk tier, and what controls would you apply? - What is the least-privilege principle for agents, and what does violating it look like in practice?
- When does an action confirmation checkpoint trigger, and what happens on timeout?
- What is the difference between under-guarding and over-guarding, and which is more dangerous in a production deployment?