Policy Enforcement
Policy enforcement for AI systems is not a single control — it is a stack of overlapping layers, each catching what the others miss. No single layer is sufficient: system prompt instructions can be jailbroken, classifiers have false negative rates, and human review does not scale to every conversation. The goal is a defence-in-depth stack where bypassing every layer simultaneously costs more than the attacker is willing to invest.
The Policy Enforcement Stack
| Layer | What it enforces | Latency cost | Bypass risk |
|---|---|---|---|
| System prompt | Scope definition, persona, prohibited topics, output format rules | None (part of prompt) | High — jailbreak-susceptible |
| Input classifier | Blocks known-bad input patterns before reaching model | 20-100ms | Medium — novel attacks bypass |
| Model generation | Provider safety fine-tuning (Constitutional AI, RLHF) | None (built-in) | Medium — jailbreak-susceptible |
| Output classifier | Blocks policy-violating responses before delivery | 50-200ms | Low — independent of model |
| Human review (sampling) | Catches classifier false negatives; finds policy drift | N/A (async) | Very low — human judgement |
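The five layers above compose into a single request path: input classifier, then generation (with the system prompt and the model's built-in safety tuning), then output classifier, with a sampled fraction queued for async human review. A minimal sketch of that flow — all function bodies are hypothetical stand-ins, not any real classifier or model API:

```python
import random

block_log: list[dict] = []
review_queue: list[tuple[str, str]] = []
REVIEW_SAMPLE_RATE = 0.02  # 2% of conversations queued for async human review

# Hypothetical stand-ins for the real layers; each check returns (allowed, reason).
def input_classifier(text: str) -> tuple[bool, str]:
    jailbreak = "ignore previous instructions" in text.lower()  # toy known-bad pattern
    return (not jailbreak, "jailbreak_pattern" if jailbreak else "")

def generate(text: str) -> str:
    return f"[model response to: {text}]"  # placeholder for the model call

def output_classifier(text: str) -> tuple[bool, str]:
    return (True, "")  # placeholder; a real check scores the candidate response

def handle(user_input: str) -> str:
    ok, reason = input_classifier(user_input)       # layer: input classifier
    if not ok:
        block_log.append({"layer": "input", "reason": reason})
        return "Sorry, I can't help with that request."
    response = generate(user_input)                 # layers: system prompt + model
    ok, reason = output_classifier(response)        # layer: output classifier
    if not ok:
        block_log.append({"layer": "output", "reason": reason})
        return "Sorry, I couldn't complete that. Please rephrase your request."
    if random.random() < REVIEW_SAMPLE_RATE:        # layer: sampled human review
        review_queue.append((user_input, response))
    return response
```

Note that the two classifier layers run on opposite sides of the model call, which is why the output classifier's bypass risk is lower: it sees what the model actually produced, independent of how the input was phrased.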
Input Policy: What to Block Before the Model
Common input policy categories
- Explicitly prohibited content types (per your use case and jurisdiction)
- PII in contexts where it should not be submitted (e.g., code review tool should not receive SSNs)
- Known jailbreak patterns (role-play overrides, many-shot setups)
- Competitor mentions in contexts where a response would be harmful
- Regulatory prohibitions (specific financial or medical advice in non-approved contexts)
Input policy pitfalls
- Over-blocking: keyword blocking is too broad — use classifiers, not keyword lists
- Silent blocking: the user gets no response and no indication of why — always return an explicit refusal
- Failing to log blocked inputs: blocked inputs reveal attack patterns and are essential for classifier tuning
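The logging point deserves a concrete shape: a blocked input should produce a structured record you can aggregate for threshold tuning and attack-pattern analysis. A hypothetical record format (field names are illustrative; whether to retain raw text versus a hash depends on your privacy policy):

```python
import hashlib
import json
import time

def log_blocked_input(text: str, category: str, score: float) -> dict:
    """Emit a structured record for a blocked input.

    The raw text is stored only as a hash here so repeat attacks can be
    correlated without retaining user content in the log stream.
    """
    record = {
        "ts": time.time(),
        "category": category,       # which policy category triggered the block
        "score": round(score, 3),   # classifier confidence, for threshold tuning
        "input_sha256": hashlib.sha256(text.encode()).hexdigest(),
        "length": len(text),
    }
    print(json.dumps(record))       # in production: ship to your log pipeline
    return record
```

Aggregating these records by `category` and `score` over time is what tells you whether you are over-blocking (many blocks clustered just above the threshold) or under-logging a new attack pattern.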
Output Policy: What to Filter After Generation
Common output policy checks
- PII in output: block or redact before delivery
- Tone violations: outputs that are harmful, discriminatory, or off-brand
- Factual claims in regulated domains (financial advice, medical diagnosis, legal opinion)
- Competitor mentions: flag for review; block in some contexts
- Prohibited content that reaches the user because the model's built-in refusal failed
Output policy design principles
- Fail safe: when classifier is uncertain, block and ask user to rephrase — do not pass
- Explain the block: tell the user why their request could not be completed (without revealing policy details that help circumvention)
- Separate classifiers for different policy categories — one model per concern performs better than one model for all
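The fail-safe and separate-classifier principles can be sketched together: one scorer per policy category, a single conservative threshold so that uncertain scores block rather than pass, and a refusal message that explains the outcome without exposing thresholds. The scorers here are hypothetical placeholders, not a real classifier API:

```python
# Hypothetical per-category scorers, each returning a violation probability in [0, 1].
# The threshold is deliberately low: an uncertain score blocks (fail safe).
BLOCK_THRESHOLD = 0.5

def check_output(response: str, scorers: dict) -> tuple[bool, str]:
    """Run one classifier per policy category; block on the first
    uncertain or violating score rather than passing it through."""
    for category, scorer in scorers.items():
        if scorer(response) >= BLOCK_THRESHOLD:
            return False, category
    return True, ""

def refusal_message(category: str) -> str:
    # Explain the block without revealing thresholds or classifier internals.
    return (f"I couldn't complete that request: the response may contain "
            f"restricted {category} content. Please rephrase and try again.")
```

Keeping one scorer per concern also lets you tune, retrain, and evaluate each category independently — a regression in the PII classifier does not force a redeploy of the tone classifier.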
Guardrails Tooling Comparison
| Tool | What it blocks | Latency | Deployment |
|---|---|---|---|
| AWS Bedrock Guardrails | Topic filters, PII, harmful content, word filters | 50-150ms | Cloud (AWS); integrated with Bedrock models |
| Azure AI Content Safety | Hate, violence, sexual, self-harm, jailbreak detection | 50-200ms | Cloud (Azure); REST API; any model |
| LlamaGuard | Input/output safety classification; customisable categories | 100-500ms (self-hosted) | Self-hosted; open-source Meta model |
| NeMo Guardrails | Programmable dialogue rails; topic blocking; hallucination checks | 200-800ms | Self-hosted; NVIDIA open-source |
| Lakera Guard | Prompt injection, jailbreak, PII; specialised LLM security | 30-100ms | Cloud API; low latency; security-focused |
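Because these tools expose different APIs and response shapes, one way to avoid lock-in is a thin adapter interface in your own code. The `Guard` protocol and `KeywordGuard` below are hypothetical illustrations — not any vendor's actual API — showing the shape a real adapter would translate into:

```python
from typing import Protocol

class Guard(Protocol):
    """Hypothetical common interface: each adapter translates its vendor's
    response into (allowed, triggered_categories)."""
    def check(self, text: str) -> tuple[bool, list[str]]: ...

class KeywordGuard:
    """Toy stand-in adapter; a real adapter would wrap a vendor API client
    and map its response fields onto the Guard interface."""
    def __init__(self, banned: set[str]):
        self.banned = banned

    def check(self, text: str) -> tuple[bool, list[str]]:
        hits = sorted(w for w in self.banned if w in text.lower())
        return (not hits, hits)
```

With this seam in place, swapping a cloud guard for a self-hosted one (or running two in parallel during evaluation) is a one-line change at the call site.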
Sampling and Human Review
No automated policy is complete. Sample a percentage of conversations for human review to catch false negatives and discover new policy gaps.
- Sample rate: 1-5% of conversations for general review; 100% for high-risk use cases
- Bias sampling: oversample conversations where classifiers were uncertain (near threshold)
- Annotation: reviewers flag policy violations, label the failure category, and escalate confirmed violations
- Feedback loop: confirmed violations update the classifier training data and the red team test suite
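The biased-sampling bullet above can be made concrete: sample at a high rate near the classifier's decision threshold, where false negatives concentrate, and at a low base rate elsewhere. A minimal sketch (the linear decay and the specific rates are illustrative assumptions):

```python
import random

def review_rate(score: float, threshold: float = 0.5,
                base_rate: float = 0.02, max_rate: float = 0.5) -> float:
    """Human-review sampling rate for a conversation scored by a classifier.

    Decays linearly from max_rate at the threshold (maximum uncertainty)
    down to base_rate far from it (confident allow or confident block).
    """
    distance = abs(score - threshold)          # 0 at the threshold, up to ~0.5
    return max(base_rate, max_rate * (1 - distance / 0.5))

def sample_for_review(score: float, **kwargs) -> bool:
    return random.random() < review_rate(score, **kwargs)
```

This keeps reviewer load close to the general 1-5% rate for clear-cut conversations while concentrating attention on the borderline cases most likely to hide false negatives.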
Preventing Policy Drift
Policy drift is the slow erosion of enforcement over time. Use cases expand beyond the original scope; system prompts accumulate exceptions; classifiers are not updated as new violation patterns emerge; teams stop reviewing samples. The result is a policy that exists on paper but is not enforced in practice. Prevent it with:
- Quarterly policy reviews
- Classifier eval runs after every model update
- An annual red team exercise that specifically tests whether each policy layer is functioning
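One of those prevention measures — re-running classifier evals after model updates — can be automated as a regression gate: re-score a labelled set of known violations and fail the deploy if recall drops. A minimal sketch (function names and the recall floor are illustrative assumptions):

```python
def eval_classifier(classifier, labelled_cases, min_recall: float = 0.95):
    """Regression gate for a policy classifier.

    labelled_cases: (text, is_violation) pairs from confirmed past violations
    (e.g. the human-review feedback loop and red team findings).
    Returns (recall, passed); run after every model or prompt update.
    """
    violations = [text for text, is_violation in labelled_cases if is_violation]
    caught = sum(1 for text in violations if classifier(text))
    recall = caught / len(violations)          # assumes a non-empty violation set
    return recall, recall >= min_recall
```

Wiring this into CI means the "classifiers are not updated" failure mode becomes a visible broken build rather than a silent gap.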
Checklist: Do You Understand This?
- What are the five layers of a policy enforcement stack — and what does each one catch that the others miss?
- Why should you use classifiers for input policy rather than keyword blocklists?
- What is "fail safe" design for an output classifier — and why does it matter?
- Choose two guardrails tools and explain which use case each is best suited for.
- What sampling strategy catches classifier false negatives most efficiently?
- Define policy drift and describe two concrete mechanisms that cause it in enterprise AI deployments.