
Policy Enforcement

Policy enforcement for AI systems is not a single control — it is a stack of overlapping layers, each catching what the others miss. No single layer is sufficient: system prompt instructions can be jailbroken, classifiers have false negative rates, and human review does not scale to every conversation. The goal is a defence-in-depth stack where the cost of bypassing multiple layers simultaneously exceeds the attacker's effort.
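The layered idea can be sketched as a simple pipeline: each layer inspects the text in order, and the first layer to object blocks the request. This is a minimal illustration, not any particular vendor's API; the layer names and toy string checks are hypothetical stand-ins for real classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LayerResult:
    allowed: bool
    layer: str            # which layer blocked, or "none" if all passed
    reason: Optional[str] = None

# Each layer is a function returning a block reason, or None to pass.
Layer = tuple[str, Callable[[str], Optional[str]]]

def enforce(text: str, layers: list[Layer]) -> LayerResult:
    """Run text through each policy layer in order; the first block wins.

    Layers that pass cost nothing extra, so cheap checks should come first."""
    for name, check in layers:
        reason = check(text)
        if reason is not None:
            return LayerResult(False, name, reason)
    return LayerResult(True, "none")

# Toy layers for illustration only -- real deployments use trained classifiers.
demo_layers: list[Layer] = [
    ("input_classifier",
     lambda t: "jailbreak pattern" if "ignore previous" in t.lower() else None),
    ("output_classifier",
     lambda t: "pii detected" if "ssn" in t.lower() else None),
]
```

The point of the structure is that layers are independent: swapping the toy lambdas for a hosted guardrails API call changes nothing about the control flow.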

The Policy Enforcement Stack

| Layer | What it enforces | Latency cost | Bypass risk |
| --- | --- | --- | --- |
| System prompt | Scope definition, persona, prohibited topics, output format rules | None (part of prompt) | High — jailbreak-susceptible |
| Input classifier | Blocks known-bad input patterns before reaching model | 20-100ms | Medium — novel attacks bypass |
| Model generation | Provider safety fine-tuning (Constitutional AI, RLHF) | None (built-in) | Medium — jailbreak-susceptible |
| Output classifier | Blocks policy-violating responses before delivery | 50-200ms | Low — independent of model |
| Human review (sampling) | Catches classifier false negatives; finds policy drift | N/A (async) | Very low — human judgement |

Input Policy: What to Block Before the Model

Common input policy categories

  • Explicitly prohibited content types (per your use case and jurisdiction)
  • PII in contexts where it should not be submitted (e.g., a code review tool should not receive SSNs)
  • Known jailbreak patterns (role-play overrides, many-shot setups)
  • Competitor mentions in contexts where response would be harmful
  • Regulatory prohibitions (specific financial or medical advice in non-approved contexts)
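Two of the categories above can be sketched as a single input check that returns every violated category rather than stopping at the first. The regexes here are deliberately simplistic illustrations, and the pitfalls below apply to them directly: pattern matching over-blocks, so production systems should use trained classifiers.

```python
import re

# Toy patterns for two input-policy categories (illustrative only).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ROLEPLAY_OVERRIDE_RE = re.compile(
    r"(?i)\b(pretend you are|you are now|act as)\b"
    r".*\b(no|without)\s+(rules|restrictions|filters)\b"
)

def check_input_policy(text: str) -> list[str]:
    """Return all policy categories the input violates (empty list = clean).

    Returning every category, not just the first, makes logging and
    classifier tuning easier downstream."""
    violations = []
    if SSN_RE.search(text):
        violations.append("pii:ssn")
    if ROLEPLAY_OVERRIDE_RE.search(text):
        violations.append("jailbreak:roleplay_override")
    return violations
```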

Input policy pitfalls

  • Over-blocking: keyword blocking is too broad — use classifiers, not keyword lists
  • Silent blocking: user gets no response and cannot understand why
  • Logging blocked inputs: essential for tuning; blocked inputs reveal attack patterns
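A structured log record for blocked inputs might look like the sketch below. Hashing the raw text instead of storing it verbatim is one possible design: it lets repeated attack patterns be correlated without persisting prohibited content, though whether to retain the plaintext is a policy decision in its own right.

```python
import datetime
import hashlib
import json

def log_blocked_input(text: str, category: str, user_id: str) -> str:
    """Build a JSON log entry for a blocked input.

    Stores a SHA-256 of the text rather than the text itself, so repeat
    attacks hash to the same value and can be counted without retention."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user_id,
        "category": category,
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "length": len(text),
    }
    return json.dumps(record)
```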

Output Policy: What to Filter After Generation

Common output policy checks

  • PII in output: block or redact before delivery
  • Tone violations: outputs that are harmful, discriminatory, or off-brand
  • Factual claims in regulated domains (financial advice, medical diagnosis, legal opinion)
  • Competitor mentions: flag for review; block in some contexts
  • Prohibited content that the model generated despite its own safety training — the output classifier is the last line of defence when a refusal fails
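The "block or redact" option for PII can be sketched as a redaction pass over the generated text. The patterns are hypothetical minimal examples; a real deployment would pair them with an ML-based PII detector for names, addresses, and other free-form identifiers.

```python
import re

# Illustrative PII patterns only; not exhaustive.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_output(text: str) -> tuple[str, list[str]]:
    """Replace PII spans with [REDACTED:<type>] before delivery.

    Returns the redacted text plus the list of PII types found, so the
    hits can be logged and sampled for human review."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            hits.append(label)
            text = pattern.sub(f"[REDACTED:{label}]", text)
    return text, hits
```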

Output policy design principles

  • Fail safe: when classifier is uncertain, block and ask user to rephrase — do not pass
  • Explain the block: tell the user why their request could not be completed (without revealing policy details that help circumvention)
  • Separate classifiers for different policy categories — one model per concern performs better than one model for all
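The fail-safe principle amounts to a decision rule over the classifier's violation score: block clear violations, and also block the uncertain middle band rather than passing it through. The thresholds below are arbitrary illustrative values; in practice they are tuned from labelled review data.

```python
def decide(violation_score: float,
           block_threshold: float = 0.8,
           uncertain_band: float = 0.3) -> str:
    """Fail-safe decision for an output classifier score in [0, 1].

    Returns:
      "block"           -- clear violation, suppress the response
      "block_uncertain" -- score in the uncertain band; fail safe by
                           blocking and asking the user to rephrase
      "pass"            -- confidently clean, deliver the response
    """
    if violation_score >= block_threshold:
        return "block"
    if violation_score >= block_threshold - uncertain_band:
        return "block_uncertain"  # uncertainty => do not deliver
    return "pass"
```

The asymmetry is deliberate: a false block costs one annoyed user and a rephrase, while a false pass delivers a policy violation.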

Guardrails Tooling Comparison

| Tool | What it blocks | Latency | Deployment |
| --- | --- | --- | --- |
| AWS Bedrock Guardrails | Topic filters, PII, harmful content, word filters | 50-150ms | Cloud (AWS); integrated with Bedrock models |
| Azure AI Content Safety | Hate, violence, sexual, self-harm, jailbreak detection | 50-200ms | Cloud (Azure); REST API; any model |
| LlamaGuard | Input/output safety classification; customisable categories | 100-500ms (self-hosted) | Self-hosted; open-source Meta model |
| NeMo Guardrails | Programmable dialogue rails; topic blocking; hallucination checks | 200-800ms | Self-hosted; NVIDIA open-source |
| Lakera Guard | Prompt injection, jailbreak, PII; specialised LLM security | 30-100ms | Cloud API; low latency; security-focused |

Sampling and Human Review

No automated policy is complete. Sample a percentage of conversations for human review to catch false negatives and discover new policy gaps.

  • Sample rate: 1-5% of conversations for general review; 100% for high-risk use cases
  • Bias sampling: oversample conversations where classifiers were uncertain (near threshold)
  • Annotation: reviewers flag policy violations, label the failure category, and escalate confirmed violations
  • Feedback loop: confirmed violations update the classifier training data and the red team test suite

Preventing Policy Drift

Policy drift is the slow erosion of enforcement over time

Use cases expand beyond original scope. System prompts accumulate exceptions. Classifiers are not updated as new violation patterns emerge. Teams stop reviewing samples. The result is a policy that exists on paper but is not enforced in practice. Prevent it with:

  • Quarterly policy reviews
  • Classifier eval runs after every model update
  • An annual red-team exercise that specifically tests whether each policy layer is still functioning

Checklist: Do You Understand This?

  • What are the five layers of a policy enforcement stack — and what does each one catch that the others miss?
  • Why should you use classifiers for input policy rather than keyword blocklists?
  • What is "fail safe" design for an output classifier — and why does it matter?
  • Choose two guardrails tools and explain which use case each is best suited for.
  • What sampling strategy catches classifier false negatives most efficiently?
  • Define policy drift and describe two concrete mechanisms that cause it in enterprise AI deployments.