# LLM safety guardrails and content moderation
Monitors all inputs and outputs for safety violations using configurable content filters. Triggers automatically on every request to block harmful, violent, or inappropriate content before delivery.
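The flow is, conceptually, check input, generate, then check output. A minimal Python sketch of that loop, assuming a `moderate` classifier stub and a `generate` callable (both hypothetical names, not the skill's actual API):

```python
def moderate(text: str) -> list[str]:
    """Placeholder for the real content classifier; returns violated categories."""
    return []  # assume clean for this sketch

def guarded_request(prompt: str, generate) -> str:
    # Check the input before it reaches the model.
    if violations := moderate(prompt):
        return f"Request blocked (categories: {', '.join(violations)})."
    response = generate(prompt)
    # Check the output before delivery.
    if violations := moderate(response):
        return "Response withheld: output failed safety validation."
    return response

print(guarded_request("Summarize this article.", lambda p: "A short summary."))
```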
Install via the plugin marketplace:

    /plugin marketplace add pluginagentmarketplace/custom-plugin-prompt-engineering
    /plugin install prompt-engineering-assistant@pluginagentmarketplace-prompt-engineering

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Bundled files:

- assets/config.yaml
- assets/schema.json
- references/GUIDE.md
- references/PATTERNS.md
- scripts/validate.py

Bonded to: prompt-security-agent
Skill("custom-plugin-prompt-engineering:safety-guardrails")
    parameters:
      safety_level:
        type: enum
        values: [permissive, standard, strict, maximum]
        default: standard
      content_filters:
        type: array
        values: [harmful, hate, violence, adult, pii]
        default: [harmful, hate, violence]
      output_validation:
        type: boolean
        default: true
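For illustration, these parameters could be modeled and validated in code roughly like this (the `SafetyConfig` dataclass and its checks are a hypothetical sketch, not the skill's implementation):

```python
from dataclasses import dataclass, field
from enum import Enum

class SafetyLevel(str, Enum):
    PERMISSIVE = "permissive"
    STANDARD = "standard"
    STRICT = "strict"
    MAXIMUM = "maximum"

VALID_FILTERS = {"harmful", "hate", "violence", "adult", "pii"}

@dataclass
class SafetyConfig:
    safety_level: SafetyLevel = SafetyLevel.STANDARD
    content_filters: list = field(default_factory=lambda: ["harmful", "hate", "violence"])
    output_validation: bool = True

    def __post_init__(self):
        # Reject filter names outside the documented enum values.
        unknown = set(self.content_filters) - VALID_FILTERS
        if unknown:
            raise ValueError(f"Unknown content filters: {sorted(unknown)}")

config = SafetyConfig(safety_level=SafetyLevel.STRICT, content_filters=["harmful", "pii"])
```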
| Guardrail | Purpose | Implementation |
|---|---|---|
| Input filtering | Block harmful requests | Pattern matching |
| Output filtering | Prevent harmful outputs | Content analysis |
| Topic boundaries | Stay on-topic | Scope enforcement |
| Format validation | Ensure safe formats | Schema checking |
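A minimal sketch of the first row, input filtering via pattern matching, assuming a simple regex blocklist; the patterns and the `check_input` helper are illustrative only, not the skill's real rule set:

```python
import re

# Illustrative patterns only; a production filter would use a much larger,
# curated rule set plus a model-based classifier.
BLOCK_PATTERNS = {
    "harmful": re.compile(r"\bhow to (make|build) a (bomb|weapon)\b", re.I),
    "pii": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-like pattern
}

def check_input(text: str, active_filters: list) -> list:
    """Return the filter categories the text trips, limited to active filters."""
    return [
        name for name, pattern in BLOCK_PATTERNS.items()
        if name in active_filters and pattern.search(text)
    ]

print(check_input("My SSN is 123-45-6789", ["harmful", "pii"]))  # -> ['pii']
```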
    content_categories:
      harmful:
        - dangerous_activities
        - illegal_actions
        - self_harm
      hate_speech:
        - discrimination
        - slurs
        - targeted_harassment
      violence:
        - graphic_violence
        - threats
        - weapons_instructions
      pii:
        - personal_data
        - financial_info
        - credentials
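For reporting, it can help to resolve a flagged subcategory back to its parent category. A small sketch, with the dict mirroring the taxonomy above and a hypothetical `parent_category` helper:

```python
from typing import Optional

CONTENT_CATEGORIES = {
    "harmful": ["dangerous_activities", "illegal_actions", "self_harm"],
    "hate_speech": ["discrimination", "slurs", "targeted_harassment"],
    "violence": ["graphic_violence", "threats", "weapons_instructions"],
    "pii": ["personal_data", "financial_info", "credentials"],
}

def parent_category(subcategory: str) -> Optional[str]:
    """Resolve a flagged subcategory to its top-level category, or None."""
    for category, subs in CONTENT_CATEGORIES.items():
        if subcategory in subs:
            return category
    return None

assert parent_category("credentials") == "pii"
```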
## Content Guidelines
NEVER generate content that:
1. Provides instructions for harmful activities
2. Contains hate speech or discrimination
3. Describes graphic violence
4. Exposes personal information
5. Bypasses safety measures
If a request violates these guidelines:
1. Decline politely
2. Explain which guideline applies
3. Offer a safe alternative if possible (see the response templates below)
    validation_rules:
      format_check:
        - valid_json_if_requested
        - no_executable_code_in_text
        - no_embedded_commands
      content_check:
        - no_pii_exposure
        - no_harmful_instructions
        - appropriate_for_audience
      consistency_check:
        - matches_role_constraints
        - within_topic_boundaries
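A sketch of two of the `format_check` rules, assuming plain-text output and a toy regex for embedded commands (the regex and the `format_check` function are illustrative assumptions, not the skill's implementation):

```python
import json
import re

# Illustrative markers for embedded shell/command-injection attempts.
EMBEDDED_COMMAND = re.compile(r"(\$\(|`|&&\s*rm\b)")

def format_check(output: str, json_requested: bool) -> list:
    """Return the list of failed format rules (empty list = pass)."""
    failures = []
    if json_requested:
        try:
            json.loads(output)
        except ValueError:
            failures.append("valid_json_if_requested")
    if EMBEDDED_COMMAND.search(output):
        failures.append("no_embedded_commands")
    return failures

print(format_check('{"ok": true}', json_requested=True))  # -> []
```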
Refusal template:

    I can't help with that request because [reason].
    Here's what I can help with instead:
    - [Alternative 1]
    - [Alternative 2]
    Would any of these work for you?
Clarification template:

    I notice this request is [description of concern].
    To ensure I'm being helpful in the right way:
    1. Could you clarify [specific aspect]?
    2. Here's a safe approach to [related task]:
       [Safe alternative]
| Issue | Cause | Solution |
|---|---|---|
| Over-blocking | Rules too strict | Tune sensitivity (e.g. lower the safety level) |
| Under-blocking | Rules too permissive | Add patterns or raise the safety level |
| False positives | Ambiguous content | Use context-aware rules |
| Inconsistent blocking | Conflicting rules | Define rule priority |
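For the over-blocking row, one common approach, if the filters are backed by a scoring classifier, is to map each `safety_level` to a score threshold; the thresholds below are made-up values for illustration:

```python
# Hypothetical score thresholds: a lower threshold blocks more content.
THRESHOLDS = {
    "permissive": 0.9,
    "standard": 0.7,
    "strict": 0.5,
    "maximum": 0.3,
}

def should_block(classifier_score: float, safety_level: str) -> bool:
    """Block when a category classifier's score meets the level's threshold."""
    return classifier_score >= THRESHOLDS[safety_level]

# Over-blocking at "strict"? Step down a level instead of editing rules.
print(should_block(0.6, "strict"))    # True
print(should_block(0.6, "standard"))  # False
```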
See: Anthropic Constitutional AI, OpenAI Moderation API