Guardrails
A validation layer that runs before, during, or after LLM generation to check, filter, or block content based on safety, compliance, quality, or policy rules. Guardrails catch problems that schema validation alone can't: toxic content, PII leakage, off-topic responses, policy violations, and hallucinated claims.
Structure
Guardrails can run on input (before the agent processes it), output (after generation), or both. Input guardrails prevent prompt injection and off-topic requests. Output guardrails catch unsafe, non-compliant, or low-quality responses.
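The two hook points can be sketched as a thin wrapper around the agent call. Everything here is illustrative, not a particular framework's API: `agent_fn`, `GuardrailViolation`, and the specific banned phrases are assumptions made for the sketch.

```python
class GuardrailViolation(Exception):
    """Raised when an input or output check fails."""

def input_guardrail(user_input: str) -> str:
    # Reject obvious prompt-injection phrases before the agent sees the input.
    banned = ["ignore previous instructions", "reveal your system prompt"]
    lowered = user_input.lower()
    if any(phrase in lowered for phrase in banned):
        raise GuardrailViolation("possible prompt injection")
    return user_input

def output_guardrail(response: str) -> str:
    # Block responses that stray outside the agent's allowed scope.
    if "medical advice" in response.lower():
        raise GuardrailViolation("out-of-scope content")
    return response

def guarded_call(agent_fn, user_input: str) -> str:
    checked = input_guardrail(user_input)   # runs before the agent
    response = agent_fn(checked)            # normal generation
    return output_guardrail(response)       # runs after generation
```

Real deployments typically replace the keyword checks with classifier models, but the wrapping structure stays the same.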
How It Works
- Define rules — specify what should be checked (safety, PII, policy, format, factuality)
- Run checks — the guardrail evaluates the input or output against each rule
- Decide action — pass (output is fine), block (reject entirely), modify (filter/redact), or retry (regenerate with feedback)
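The steps above can be sketched as a small check-then-act loop. The rule names, action strings, and `apply_guardrails` helper are hypothetical; a "retry" action regenerates with feedback folded back into the prompt.

```python
from typing import Callable, Optional

# A rule returns an action name ("block", "retry", "modify"), or None if the output passes.
Rule = Callable[[str], Optional[str]]

def check_length(output: str) -> Optional[str]:
    # Too-long answers trigger a regeneration with feedback.
    return "retry" if len(output) > 500 else None

def check_email(output: str) -> Optional[str]:
    # Email addresses trigger redaction rather than a full block.
    return "modify" if "@" in output else None

def apply_guardrails(generate: Callable[[str], str], prompt: str,
                     rules: list[Rule], max_retries: int = 2) -> str:
    output = generate(prompt)
    for attempt in range(max_retries + 1):
        actions = {rule(output) for rule in rules}
        if "block" in actions:
            return "[blocked by guardrail]"        # reject entirely
        if "retry" in actions and attempt < max_retries:
            # Regenerate with the failure fed back to the model.
            output = generate(prompt + "\n(Previous answer failed a check; be concise.)")
            continue
        if "modify" in actions:
            output = output.replace("@", " [at] ")  # crude redaction for the sketch
        return output
    return output
```

Blocking takes priority over retrying, and retrying over modifying; a production system would make this ordering an explicit policy.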
Types of guardrails:
- Content safety — toxicity detection, hate speech, NSFW content
- PII detection — redact names, emails, SSNs, phone numbers before output
- Policy compliance — enforce brand tone, legal disclaimers, scope boundaries
- Factual grounding — verify claims against retrieved sources (reduce hallucination)
- Format validation — schema compliance (overlaps with Structured Output)
- Topic boundaries — reject off-topic requests or responses
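As a concrete instance of the PII-detection type, a minimal redaction pass might look like the following. The regexes are deliberately simplistic placeholders; real systems use dedicated PII detectors with far better recall.

```python
import re

# Minimal patterns for three common PII categories (illustrative only).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    # Replace each match with a labeled placeholder before the text leaves the agent.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```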
Key Characteristics
- Defense in depth — catches issues that prompting alone can't prevent
- Configurable — rules can be updated without changing the agent
- Latency cost — each guardrail check adds processing time
- False positives — overly aggressive guardrails block legitimate output
- Not foolproof — determined adversaries can sometimes bypass guardrails
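The "configurable" point above suggests keeping rules as data rather than code, so they can be updated without redeploying the agent. One way to sketch that, with invented field names (`banned_phrases`, `max_length`, `redact_patterns`):

```python
import json
import re

# In practice this JSON would be loaded from a file or a config service.
CONFIG = json.loads("""
{
  "banned_phrases": ["internal use only"],
  "max_length": 200,
  "redact_patterns": {"EMAIL": "[\\\\w.+-]+@[\\\\w-]+\\\\.[\\\\w.]+"}
}
""")

def enforce(output: str, config: dict) -> str:
    # Rules are read from config at call time, so updating the JSON changes behavior.
    if len(output) > config["max_length"]:
        return "[blocked: too long]"
    if any(p in output.lower() for p in config["banned_phrases"]):
        return "[blocked: policy]"
    for label, pattern in config["redact_patterns"].items():
        output = re.sub(pattern, f"[{label}]", output)
    return output
```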
When to Use
- Customer-facing agents where safety and brand reputation matter
- Regulated industries requiring compliance (healthcare, finance, legal)
- Agents that handle sensitive data (PII, credentials, financial info)
- Agents that need defense against prompt injection and jailbreak attempts
- Any production agent where the cost of a bad output is high