Skill

guardrails

A defensive pattern in which inputs and outputs are inspected by dedicated safety agents or rules to prevent malicious use, jailbreaks, and harmful content. Use when the user asks to "add safety checks", "set up guardrails", or "prevent harmful outputs", or mentions agent boundaries, output validation, or content filtering.

Install

npx claudepluginhub lauraflorentin/skills-marketplace --plugin agentic-skills

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Guardrails are the firewall of an AI system. They sit between the user and the agent (Input Guardrail) and between the agent and the user (Output Guardrail). They enforce policy, security, and tone. Unlike the main agent, which tries to be helpful, the guardrail tries to be safe and compliant.

SKILL.md

Stats

Parent Repo Stars: 0

Parent Repo Forks: 0

Last Commit: Mar 10, 2026

Guardrails & Safety

When to Use

Jailbreak Prevention: Stopping users from tricking the model ("Ignore previous instructions...").
PII Protection: Detecting and redacting phone numbers, emails, or credit cards.
Topic Adherence: Ensuring a customer support bot doesn't discuss politics or religion.
Brand Safety: Preventing the model from generating offensive or competitor-promoting content.
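
As a rough illustration of the first two items, the sketch below uses plain regular expressions and substring checks. The patterns, the `PII_PATTERNS` and `INJECTION_MARKERS` names, and both helper functions are assumptions made for this example; they are not part of the skill, and real PII or jailbreak detection needs much broader coverage than a few patterns.

```python
import re

# Illustrative patterns only (assumed for this sketch, not exhaustive).
# CARD is listed before PHONE so a long card number is not partly
# consumed by the phone pattern.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "PHONE": re.compile(
        r"(?:\+?\d{1,3}[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"
    ),
}

# Hypothetical marker phrases; attackers rephrase freely, so treat this
# as a first-pass filter rather than a complete defense.
INJECTION_MARKERS = (
    "ignore previous instructions",
    "disregard the system prompt",
)


def redact_pii(text: str) -> str:
    """Replace each detected item with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def looks_like_jailbreak(text: str) -> bool:
    """Return True when the text contains a known injection phrase."""
    lowered = text.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)
```

Redaction lets the response still reach the user with the sensitive span masked, whereas a jailbreak match would normally cause the request to be refused outright.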

Use Cases

Input Filter: Blocking prompts that violate usage policies.
Output Filter: Blocking model responses that contain hate speech or hallucinations.
Sandboxing: Ensuring code generated by the agent acts within safe bounds (e.g., no network access).
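
For the sandboxing case, one cheap pre-check is to inspect generated Python for imports of network modules before it runs. The sketch below is an assumption-laden illustration: `BLOCKED_MODULES` and `find_blocked_imports` are hypothetical names, and a static check like this is easy to evade (for example through `__import__`), so it would complement OS-level isolation rather than replace it.

```python
import ast

# Hypothetical denylist chosen for this example.
BLOCKED_MODULES = {"socket", "urllib", "http", "requests", "ftplib", "smtplib"}


def find_blocked_imports(source: str) -> set[str]:
    """Return the blocked top-level modules imported by the given source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports like "from . import x".
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in BLOCKED_MODULES:
                found.add(root)
    return found
```

A caller would refuse to execute the generated code whenever the returned set is non-empty.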

Implementation Pattern

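# safety_agent, main_agent, and log_violation are assumed to be supplied
# by the host application; this function only shows the control flow.
def guarded_execution(user_input):
    """Run the main agent between an input guardrail and an output guardrail."""
    # Layer 1: Input Guardrail
    # Check for prompt injection or policy violations before the agent sees the input
    if not safety_agent.check_input(user_input).safe:
        return "I cannot answer that request."

    # Layer 2: Main Execution
    response = main_agent.run(user_input)

    # Layer 3: Output Guardrail
    # Check for PII or harmful content in the response
    if not safety_agent.check_output(response).safe:
        log_violation(user_input, response)
        return "Response withheld due to safety policy."

    return response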
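
The pattern above refers to `safety_agent`, `main_agent`, and `log_violation` without defining them. To try the control flow end to end, the following sketch supplies minimal stand-ins. `Verdict`, `KeywordSafetyAgent`, `EchoAgent`, and the in-memory `violations` list are all hypothetical, and the keyword checks are placeholders for a real classifier or policy model.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    safe: bool
    reason: str = ""


class KeywordSafetyAgent:
    """Hypothetical stand-in for safety_agent."""

    def check_input(self, text: str) -> Verdict:
        if "ignore previous instructions" in text.lower():
            return Verdict(False, "possible prompt injection")
        return Verdict(True)

    def check_output(self, text: str) -> Verdict:
        if "@" in text:  # crude placeholder for a PII detector
            return Verdict(False, "possible email address")
        return Verdict(True)


class EchoAgent:
    """Hypothetical stand-in for main_agent; a real one would call an LLM."""

    def run(self, text: str) -> str:
        return f"You said: {text}"


violations = []  # in-memory log, for demonstration only


def log_violation(user_input, response):
    violations.append((user_input, response))


safety_agent = KeywordSafetyAgent()
main_agent = EchoAgent()


def guarded_execution(user_input):
    # Same three layers as the pattern: input check, run, output check.
    if not safety_agent.check_input(user_input).safe:
        return "I cannot answer that request."
    response = main_agent.run(user_input)
    if not safety_agent.check_output(response).safe:
        log_violation(user_input, response)
        return "Response withheld due to safety policy."
    return response
```

With these stubs, a benign prompt passes through unchanged, an injection phrase is refused before the agent runs, and a response containing an email-like string is withheld and recorded in `violations`.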

Troubleshooting

Problem	Cause	Fix
Guardrail blocks legitimate requests	Over-broad pattern matching	Tune guardrail thresholds using a labeled test set; track false positive rate
Agent bypasses guardrails	Prompt injection in user input	Apply guardrails before injecting user content into agent context
Guardrail adds too much latency	Synchronous pre-call check	Run guardrail in parallel with the first LLM call; cancel if flagged
Silent failures	Guardrail raises exception but agent continues	Treat guardrail exceptions as hard stops; log and escalate
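
The last two rows can be sketched together with `asyncio`: the model call starts immediately, the guardrail runs alongside it, and any guardrail exception is treated as a violation (fail closed). The `check_input`, `call_model`, and `guarded_call` names and the sleep-based latencies are assumptions for illustration, not part of the skill.

```python
import asyncio
import logging

REFUSAL = "I cannot answer that request."


async def check_input(text: str) -> bool:
    """Hypothetical guardrail; returns True when the input is safe."""
    await asyncio.sleep(0.01)  # stands in for a classifier call
    return "ignore previous instructions" not in text.lower()


async def call_model(text: str) -> str:
    """Hypothetical main-agent call."""
    await asyncio.sleep(0.05)  # stands in for LLM latency
    return f"Answer to: {text}"


async def guarded_call(text: str, guard=check_input, model=call_model) -> str:
    # Start the model call right away so the guardrail adds no serial latency.
    model_task = asyncio.create_task(model(text))
    try:
        safe = await guard(text)
    except Exception:
        # Fail closed: a guardrail error is handled like a violation.
        logging.exception("guardrail failed; treating as a hard stop")
        safe = False
    if not safe:
        model_task.cancel()
        # Wait for the cancellation to complete before returning.
        await asyncio.gather(model_task, return_exceptions=True)
        return REFUSAL
    return await model_task
```

Running the two in parallel is only appropriate when the model call has no side effects before the guardrail finishes; an agent that can invoke tools should still be gated by a synchronous input check.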