Policy Enforcement
Policy enforcement for AI systems is not a single control — it is a stack of overlapping layers, each catching what the others miss. No single layer is sufficient: system prompt instructions can be jailbroken, classifiers have false negative rates, and human review does not scale to every conversation. The goal is a defence-in-depth stack where bypassing every layer simultaneously costs more than the attacker is willing to invest.
The Policy Enforcement Stack
| Layer | What it enforces | Latency cost | Bypass risk |
|---|---|---|---|
| System prompt | Scope definition, persona, prohibited topics, output format rules | None (part of prompt) | High — jailbreak-susceptible |
| Input classifier | Blocks known-bad input patterns before reaching model | 20-100ms | Medium — novel attacks bypass |
| Model generation | Provider safety fine-tuning (Constitutional AI, RLHF) | None (built-in) | Medium — jailbreak-susceptible |
| Output classifier | Blocks policy-violating responses before delivery | 50-200ms | Low — independent of model |
| Human review (sampling) | Catches classifier false negatives; finds policy drift | N/A (async) | Very low — human judgement |
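The five layers above compose into a single request path: input classifier, then generation (with the system prompt and the model's built-in safety tuning), then output classifier, with a sampled fraction queued for async human review. A minimal sketch of that flow — all function bodies are hypothetical stand-ins, not any real classifier or model API:

```python
import random

block_log: list[dict] = []
review_queue: list[tuple[str, str]] = []
REVIEW_SAMPLE_RATE = 0.02  # 2% of conversations queued for async human review

# Hypothetical stand-ins for the real layers; each check returns (allowed, reason).
def input_classifier(text: str) -> tuple[bool, str]:
    jailbreak = "ignore previous instructions" in text.lower()  # toy known-bad pattern
    return (not jailbreak, "jailbreak_pattern" if jailbreak else "")

def generate(text: str) -> str:
    return f"[model response to: {text}]"  # placeholder for the model call

def output_classifier(text: str) -> tuple[bool, str]:
    return (True, "")  # placeholder; a real check scores the candidate response

def handle(user_input: str) -> str:
    ok, reason = input_classifier(user_input)       # layer: input classifier
    if not ok:
        block_log.append({"layer": "input", "reason": reason})
        return "Sorry, I can't help with that request."
    response = generate(user_input)                 # layers: system prompt + model
    ok, reason = output_classifier(response)        # layer: output classifier
    if not ok:
        block_log.append({"layer": "output", "reason": reason})
        return "Sorry, I couldn't complete that. Please rephrase your request."
    if random.random() < REVIEW_SAMPLE_RATE:        # layer: sampled human review
        review_queue.append((user_input, response))
    return response
```

Note that the two classifier layers run on opposite sides of the model call, which is why the output classifier's bypass risk is lower: it sees what the model actually produced, independent of how the input was phrased.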
Input Policy: What to Block Before the Model
Common input policy categories
- Explicitly prohibited content types (per your use case and jurisdiction)
- PII in contexts where it should not be submitted (e.g., code review tool should not receive SSNs)
- Known jailbreak patterns (role-play overrides, many-shot setups)
- Competitor mentions in contexts where a response would be harmful
- Regulatory prohibitions (specific financial or medical advice in non-approved contexts)
Input policy pitfalls
- Over-blocking: keyword blocking is too broad — use classifiers, not keyword lists
- Silent blocking: the user gets no response and no indication of why — always return an explicit refusal
- Failing to log blocked inputs: blocked inputs reveal attack patterns and are essential for classifier tuning
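The logging point deserves a concrete shape: a blocked input should produce a structured record you can aggregate for threshold tuning and attack-pattern analysis. A hypothetical record format (field names are illustrative; whether to retain raw text versus a hash depends on your privacy policy):

```python
import hashlib
import json
import time

def log_blocked_input(text: str, category: str, score: float) -> dict:
    """Emit a structured record for a blocked input.

    The raw text is stored only as a hash here so repeat attacks can be
    correlated without retaining user content in the log stream.
    """
    record = {
        "ts": time.time(),
        "category": category,       # which policy category triggered the block
        "score": round(score, 3),   # classifier confidence, for threshold tuning
        "input_sha256": hashlib.sha256(text.encode()).hexdigest(),
        "length": len(text),
    }
    print(json.dumps(record))       # in production: ship to your log pipeline
    return record
```

Aggregating these records by `category` and `score` over time is what tells you whether you are over-blocking (many blocks clustered just above the threshold) or under-logging a new attack pattern.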
Output Policy: What to Filter After Generation
Common output policy checks
- PII in output: block or redact before delivery
- Tone violations: outputs that are harmful, discriminatory, or off-brand
- Factual claims in regulated domains (financial advice, medical diagnosis, legal opinion)
- Competitor mentions: flag for review; block in some contexts
- Prohibited content that reaches the user because the model's built-in refusal failed
Output policy design principles
- Fail safe: when classifier is uncertain, block and ask user to rephrase — do not pass
- Explain the block: tell the user why their request could not be completed (without revealing policy details that help circumvention)
- Separate classifiers for different policy categories — one model per concern performs better than one model for all
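The fail-safe and separate-classifier principles can be sketched together: one scorer per policy category, a single conservative threshold so that uncertain scores block rather than pass, and a refusal message that explains the outcome without exposing thresholds. The scorers here are hypothetical placeholders, not a real classifier API:

```python
# Hypothetical per-category scorers, each returning a violation probability in [0, 1].
# The threshold is deliberately low: an uncertain score blocks (fail safe).
BLOCK_THRESHOLD = 0.5

def check_output(response: str, scorers: dict) -> tuple[bool, str]:
    """Run one classifier per policy category; block on the first
    uncertain or violating score rather than passing it through."""
    for category, scorer in scorers.items():
        if scorer(response) >= BLOCK_THRESHOLD:
            return False, category
    return True, ""

def refusal_message(category: str) -> str:
    # Explain the block without revealing thresholds or classifier internals.
    return (f"I couldn't complete that request: the response may contain "
            f"restricted {category} content. Please rephrase and try again.")
```

Keeping one scorer per concern also lets you tune, retrain, and evaluate each category independently — a regression in the PII classifier does not force a redeploy of the tone classifier.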
Guardrails Tooling Comparison
| Tool | What it blocks | Latency | Deployment |
|---|---|---|---|
| AWS Bedrock Guardrails | Topic filters, PII, harmful content, word filters | 50-150ms | Cloud (AWS); integrated with Bedrock models |
| Azure AI Content Safety | Hate, violence, sexual, self-harm, jailbreak detection | 50-200ms | Cloud (Azure); REST API; any model |
| LlamaGuard | Input/output safety classification; customisable categories | 100-500ms (self-hosted) | Self-hosted; open-source Meta model |
| NeMo Guardrails | Programmable dialogue rails; topic blocking; hallucination checks | 200-800ms | Self-hosted; NVIDIA open-source |
| Lakera Guard | Prompt injection, jailbreak, PII; specialised LLM security | 30-100ms | Cloud API; low latency; security-focused |
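Because these tools expose different APIs and response shapes, one way to avoid lock-in is a thin adapter interface in your own code. The `Guard` protocol and `KeywordGuard` below are hypothetical illustrations — not any vendor's actual API — showing the shape a real adapter would translate into:

```python
from typing import Protocol

class Guard(Protocol):
    """Hypothetical common interface: each adapter translates its vendor's
    response into (allowed, triggered_categories)."""
    def check(self, text: str) -> tuple[bool, list[str]]: ...

class KeywordGuard:
    """Toy stand-in adapter; a real adapter would wrap a vendor API client
    and map its response fields onto the Guard interface."""
    def __init__(self, banned: set[str]):
        self.banned = banned

    def check(self, text: str) -> tuple[bool, list[str]]:
        hits = sorted(w for w in self.banned if w in text.lower())
        return (not hits, hits)
```

With this seam in place, swapping a cloud guard for a self-hosted one (or running two in parallel during evaluation) is a one-line change at the call site.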
Sampling and Human Review
No automated policy is complete. Sample a percentage of conversations for human review to catch false negatives and discover new policy gaps.
- Sample rate: 1-5% of conversations for general review; 100% for high-risk use cases
- Bias sampling: oversample conversations where classifiers were uncertain (near threshold)
- Annotation: reviewers flag policy violations, label the failure category, and escalate confirmed violations
- Feedback loop: confirmed violations update the classifier training data and the red team test suite
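The biased-sampling bullet above can be made concrete: sample at a high rate near the classifier's decision threshold, where false negatives concentrate, and at a low base rate elsewhere. A minimal sketch (the linear decay and the specific rates are illustrative assumptions):

```python
import random

def review_rate(score: float, threshold: float = 0.5,
                base_rate: float = 0.02, max_rate: float = 0.5) -> float:
    """Human-review sampling rate for a conversation scored by a classifier.

    Decays linearly from max_rate at the threshold (maximum uncertainty)
    down to base_rate far from it (confident allow or confident block).
    """
    distance = abs(score - threshold)          # 0 at the threshold, up to ~0.5
    return max(base_rate, max_rate * (1 - distance / 0.5))

def sample_for_review(score: float, **kwargs) -> bool:
    return random.random() < review_rate(score, **kwargs)
```

This keeps reviewer load close to the general 1-5% rate for clear-cut conversations while concentrating attention on the borderline cases most likely to hide false negatives.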
Preventing Policy Drift
Policy drift is the slow erosion of enforcement over time. Use cases expand beyond the original scope; system prompts accumulate exceptions; classifiers are not updated as new violation patterns emerge; teams stop reviewing samples. The result is a policy that exists on paper but is not enforced in practice. Prevent it with:
- Quarterly policy reviews
- Classifier eval runs after every model update
- An annual red team exercise that specifically tests whether each policy layer is functioning
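One of those prevention measures — re-running classifier evals after model updates — can be automated as a regression gate: re-score a labelled set of known violations and fail the deploy if recall drops. A minimal sketch (function names and the recall floor are illustrative assumptions):

```python
def eval_classifier(classifier, labelled_cases, min_recall: float = 0.95):
    """Regression gate for a policy classifier.

    labelled_cases: (text, is_violation) pairs from confirmed past violations
    (e.g. the human-review feedback loop and red team findings).
    Returns (recall, passed); run after every model or prompt update.
    """
    violations = [text for text, is_violation in labelled_cases if is_violation]
    caught = sum(1 for text in violations if classifier(text))
    recall = caught / len(violations)          # assumes a non-empty violation set
    return recall, recall >= min_recall
```

Wiring this into CI means the "classifiers are not updated" failure mode becomes a visible broken build rather than a silent gap.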
Checklist: Do You Understand This?
- What are the five layers of a policy enforcement stack — and what does each one catch that the others miss?
- Why should you use classifiers for input policy rather than keyword blocklists?
- What is "fail safe" design for an output classifier — and why does it matter?
- Choose two guardrails tools and explain which use case each is best suited for.
- What sampling strategy catches classifier false negatives most efficiently?
- Define policy drift and describe two concrete mechanisms that cause it in enterprise AI deployments.