Plans AI safety measures including alignment, guardrails, red teaming, and regulatory compliance (EU AI Act, NIST AI RMF). Use when designing safety frameworks, implementing guardrails, or conducting red team exercises for AI systems.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install ai-ml-planning@melodic-software
Use this skill when:
- Designing safety frameworks for AI systems
- Implementing guardrails
- Conducting red team exercises
- Planning regulatory compliance (EU AI Act, NIST AI RMF)
AI safety encompasses the practices, techniques, and governance structures needed to ensure AI systems behave as intended, avoid harm, and comply with regulations. Effective safety planning must be integrated from project inception, not bolted on afterward.
┌────────────────────────────────────────────────────────────────────┐
│                        AI SAFETY FRAMEWORK                         │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                       GOVERNANCE LAYER                       │  │
│  │  Risk Classification │ Policies │ Compliance │ Auditing      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                 │                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                       TECHNICAL LAYER                        │  │
│  │ Input Guards │ Output Filters │ Model Alignment │ Monitoring │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                 │                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                      OPERATIONAL LAYER                       │  │
│  │  Red Teaming │ Incident Response │ Continuous Testing        │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
The EU AI Act assigns every AI system to one of four risk tiers, each with its own obligations:

| Risk Level | Description | Requirements | Examples |
|---|---|---|---|
| Unacceptable | Banned uses | Prohibited | Social scoring, subliminal manipulation |
| High-Risk | Significant impact | Full compliance | HR screening, credit scoring, medical |
| Limited Risk | Transparency needed | Disclosure | Chatbots, emotion recognition |
| Minimal Risk | Low concern | Best practices | Spam filters, games |
High-risk systems must meet the following requirements:

| Requirement | Description |
|---|---|
| Risk Management | Continuous risk assessment and mitigation |
| Data Governance | Training data quality, bias testing |
| Documentation | Technical documentation, logging |
| Transparency | Clear disclosure to users |
| Human Oversight | Meaningful human control |
| Accuracy/Robustness | Performance standards, testing |
| Cybersecurity | Security by design |
public class AiActClassifier
{
    public RiskClassification Classify(AiSystemProfile profile)
    {
        // Check for prohibited uses (EU AI Act Article 5)
        if (IsProhibitedUse(profile))
            return RiskClassification.Unacceptable;

        // Check Annex III high-risk categories
        if (IsHighRiskCategory(profile))
            return RiskClassification.HighRisk;

        // Check for transparency obligations
        if (RequiresTransparency(profile))
            return RiskClassification.LimitedRisk;

        return RiskClassification.MinimalRisk;
    }

    private bool IsProhibitedUse(AiSystemProfile profile)
    {
        // Prohibited practices include social scoring and subliminal manipulation
        var prohibitedAreas = new[]
        {
            "social_scoring",
            "subliminal_manipulation"
        };

        return prohibitedAreas.Contains(profile.ApplicationArea);
    }

    private bool IsHighRiskCategory(AiSystemProfile profile)
    {
        var highRiskAreas = new[]
        {
            "biometric_identification",
            "critical_infrastructure",
            "education_vocational",
            "employment_hr",
            "essential_services",
            "law_enforcement",
            "migration_asylum",
            "justice_democracy"
        };

        return highRiskAreas.Contains(profile.ApplicationArea);
    }

    private bool RequiresTransparency(AiSystemProfile profile)
    {
        return profile.InteractsWithHumans
            || profile.GeneratesContent
            || profile.DetectsEmotions
            || profile.UsesDeepfakes;
    }
}
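
For high-risk systems, the requirements table above can also be tracked programmatically as a release gate. A minimal sketch, assuming a hypothetical checklist model (the `ComplianceItem` record and evidence descriptions are illustrative, not prescribed by the Act):

```csharp
using System.Collections.Generic;
using System.Linq;

public enum ComplianceStatus { NotStarted, InProgress, Complete }

public record ComplianceItem(string Requirement, string Evidence, ComplianceStatus Status);

public class HighRiskComplianceChecklist
{
    // One entry per high-risk requirement from the table above.
    private readonly List<ComplianceItem> _items = new()
    {
        new("Risk Management", "Risk register and mitigation log", ComplianceStatus.NotStarted),
        new("Data Governance", "Data quality report, bias test results", ComplianceStatus.NotStarted),
        new("Documentation", "Technical documentation, logging design", ComplianceStatus.NotStarted),
        new("Transparency", "User disclosure text", ComplianceStatus.NotStarted),
        new("Human Oversight", "Oversight procedure and escalation path", ComplianceStatus.NotStarted),
        new("Accuracy/Robustness", "Evaluation results and test reports", ComplianceStatus.NotStarted),
        new("Cybersecurity", "Threat model and security review", ComplianceStatus.NotStarted)
    };

    public void MarkComplete(string requirement, string evidence)
    {
        var index = _items.FindIndex(i => i.Requirement == requirement);
        if (index >= 0)
            _items[index] = _items[index] with { Evidence = evidence, Status = ComplianceStatus.Complete };
    }

    // Release gate: every requirement must be complete before the system ships.
    public bool IsReleaseReady() => _items.All(i => i.Status == ComplianceStatus.Complete);
}
```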
The NIST AI Risk Management Framework organizes safety work into four core functions:

| Function | Description | Key Activities |
|---|---|---|
| Govern | Culture of risk management | Policies, accountability, oversight |
| Map | Context and risk identification | Stakeholders, impacts, constraints |
| Measure | Risk assessment | Metrics, testing, monitoring |
| Manage | Risk treatment | Mitigations, responses, priorities |
A risk map template for the Map function:

## AI Risk Map: [System Name]
### Stakeholder Analysis
| Stakeholder | Interest | Potential Harm | Severity |
|-------------|----------|----------------|----------|
| [Group 1] | [Interest] | [Harm] | [H/M/L] |
| [Group 2] | [Interest] | [Harm] | [H/M/L] |
### Risk Categories
- **Reliability**: [Assessment]
- **Safety**: [Assessment]
- **Security**: [Assessment]
- **Accountability**: [Assessment]
- **Transparency**: [Assessment]
- **Explainability**: [Assessment]
- **Privacy**: [Assessment]
- **Fairness**: [Assessment]
### Identified Risks
| Risk ID | Description | Likelihood | Impact | Mitigation |
|---------|-------------|------------|--------|------------|
| R-001 | [Risk] | [H/M/L] | [H/M/L] | [Action] |
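
To prioritize the identified risks, the H/M/L ratings can be mapped to numbers and combined into a single score. A minimal sketch; the `Rating` enum and the multiplicative scoring rule are illustrative conventions, not part of the NIST framework itself:

```csharp
using System.Collections.Generic;
using System.Linq;

public enum Rating { Low = 1, Medium = 2, High = 3 }

public record RiskEntry(string Id, string Description, Rating Likelihood, Rating Impact, string Mitigation)
{
    // Simple multiplicative score: 1 (Low/Low) up to 9 (High/High).
    public int Score => (int)Likelihood * (int)Impact;
}

public static class RiskRegister
{
    // Highest-scoring risks first, so mitigation effort goes where likelihood and impact are both high.
    public static IReadOnlyList<RiskEntry> Prioritize(IEnumerable<RiskEntry> risks) =>
        risks.OrderByDescending(r => r.Score).ToList();
}
```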
Guardrails for AI systems fall into several types:

| Type | Purpose | Implementation |
|---|---|---|
| Input Guards | Block harmful prompts | Content filters, injection detection |
| Output Filters | Prevent harmful outputs | PII detection, toxicity filters |
| Topic Restrictions | Limit scope | Topic classifiers, keyword blockers |
| Behavioral Constraints | Enforce policies | System prompts, fine-tuning |
| Rate Limits | Prevent abuse | Usage quotas, throttling |
public class GuardrailPipeline
{
private readonly List<IInputGuard> _inputGuards;
private readonly List<IOutputFilter> _outputFilters;
private readonly IAuditLogger _auditLogger;
public async Task<GuardrailResult> ProcessRequest(
UserRequest request,
CancellationToken ct)
{
// Pre-processing guards
foreach (var guard in _inputGuards)
{
var result = await guard.Check(request, ct);
if (!result.Allowed)
{
await _auditLogger.LogBlocked(request, guard.Name, result.Reason);
return GuardrailResult.Blocked(
$"Request blocked by {guard.Name}: {result.Reason}");
}
}
// Generate response
var response = await GenerateResponse(request, ct);
// Post-processing filters
foreach (var filter in _outputFilters)
{
response = await filter.Filter(response, ct);
if (response.WasFiltered)
{
await _auditLogger.LogFiltered(request, filter.Name);
}
}
return GuardrailResult.Success(response);
}
}
public class PromptInjectionGuard : IInputGuard
{
    // ML-based injection classifier (injected dependency)
    private readonly IInjectionClassifier _injectionClassifier;

    public string Name => "PromptInjectionGuard";
public async Task<GuardResult> Check(UserRequest request, CancellationToken ct)
{
var indicators = new[]
{
"ignore previous instructions",
"disregard your training",
"you are now",
"pretend you are",
"act as if",
"system prompt:",
"new instructions:",
"\\[INST\\]",
"\\[/INST\\]"
};
var normalizedInput = request.Content.ToLowerInvariant();
foreach (var indicator in indicators)
{
if (Regex.IsMatch(normalizedInput, indicator, RegexOptions.IgnoreCase))
{
return GuardResult.Blocked(
$"Potential prompt injection detected: {indicator}");
}
}
// ML-based detection for sophisticated attacks
var mlScore = await _injectionClassifier.Classify(request.Content, ct);
if (mlScore > 0.8)
{
return GuardResult.Blocked("ML classifier detected injection attempt");
}
return GuardResult.Allowed();
}
}
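
Rate limiting, listed in the guardrail table above, can be implemented as another input guard so abusive clients are rejected before any model call. A minimal in-memory sketch using a fixed one-minute window; it assumes `UserRequest` exposes a `UserId` and reuses the `IInputGuard`/`GuardResult` shapes from the examples above:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class RateLimitGuard : IInputGuard
{
    // Per-user request counters for the current one-minute window.
    private readonly ConcurrentDictionary<string, (DateTime WindowStart, int Count)> _counters = new();
    private readonly int _maxRequestsPerMinute;

    public RateLimitGuard(int maxRequestsPerMinute = 60)
    {
        _maxRequestsPerMinute = maxRequestsPerMinute;
    }

    public string Name => "RateLimitGuard";

    public Task<GuardResult> Check(UserRequest request, CancellationToken ct)
    {
        var now = DateTime.UtcNow;

        // Reset the counter when the window has elapsed, otherwise increment it.
        var entry = _counters.AddOrUpdate(
            request.UserId,
            _ => (now, 1),
            (_, existing) => now - existing.WindowStart >= TimeSpan.FromMinutes(1)
                ? (now, 1)
                : (existing.WindowStart, existing.Count + 1));

        return Task.FromResult(entry.Count > _maxRequestsPerMinute
            ? GuardResult.Blocked($"Rate limit exceeded ({_maxRequestsPerMinute} requests/minute)")
            : GuardResult.Allowed());
    }
}
```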
public class ContentSafetyFilter : IOutputFilter
{
private readonly IContentSafetyClient _safetyClient;
public async Task<FilteredResponse> Filter(
LlmResponse response,
CancellationToken ct)
{
var analysis = await _safetyClient.AnalyzeAsync(
response.Content,
new AnalysisOptions
{
Categories = new[]
{
ContentCategory.Hate,
ContentCategory.Violence,
ContentCategory.SelfHarm,
ContentCategory.Sexual
},
OutputType = OutputType.FourSeverityLevels
},
ct);
// Check if any category exceeds threshold
var violations = analysis.CategoriesAnalysis
.Where(c => c.Severity >= Severity.Medium)
.ToList();
if (violations.Any())
{
return new FilteredResponse
{
WasFiltered = true,
OriginalContent = response.Content,
FilteredContent = GenerateSafeResponse(violations),
Violations = violations
};
}
return new FilteredResponse
{
WasFiltered = false,
FilteredContent = response.Content
};
}
}
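
PII detection, listed under output filters above, fits the same interface. A minimal regex-based redaction sketch; real deployments would typically call a dedicated PII detection service, and the email/phone patterns here are deliberately simple illustrations reusing the `IOutputFilter`/`FilteredResponse` shapes from the examples above:

```csharp
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

public class PiiRedactionFilter : IOutputFilter
{
    // Simple illustrative patterns; production PII detection needs much broader coverage.
    private static readonly Regex EmailPattern = new(@"[\w.+-]+@[\w-]+\.[\w.-]+", RegexOptions.Compiled);
    private static readonly Regex PhonePattern = new(@"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", RegexOptions.Compiled);

    public Task<FilteredResponse> Filter(LlmResponse response, CancellationToken ct)
    {
        var redacted = EmailPattern.Replace(response.Content, "[REDACTED EMAIL]");
        redacted = PhonePattern.Replace(redacted, "[REDACTED PHONE]");

        return Task.FromResult(new FilteredResponse
        {
            WasFiltered = redacted != response.Content,
            OriginalContent = response.Content,
            FilteredContent = redacted
        });
    }
}
```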
Red team exercises probe the system against several attack objectives:

| Objective | Description | Techniques |
|---|---|---|
| Jailbreaking | Bypass safety controls | Prompt manipulation, roleplay |
| Data Extraction | Leak training data | Completion attacks, memorization |
| Bias Exploitation | Trigger unfair outputs | Targeted demographic testing |
| Harmful Content | Generate prohibited content | Edge cases, adversarial inputs |
| Denial of Service | Degrade performance | Resource exhaustion, loops |
A red team test plan template:

## Red Team Test Plan: [System Name]
### Test Categories
#### 1. Prompt Injection
- [ ] Direct injection attempts
- [ ] Indirect injection (via user content)
- [ ] Multi-turn manipulation
- [ ] Encoding tricks (base64, unicode)
#### 2. Jailbreak Attempts
- [ ] Roleplay scenarios ("pretend you are...")
- [ ] Hypothetical framing
- [ ] Translation tricks
- [ ] System prompt extraction
#### 3. Harmful Content Generation
- [ ] Violence and weapons
- [ ] Self-harm content
- [ ] Hate speech and discrimination
- [ ] Illegal activities
#### 4. Privacy Attacks
- [ ] PII extraction attempts
- [ ] Training data extraction
- [ ] Membership inference
#### 5. Bias Testing
- [ ] Demographic disparities
- [ ] Stereotype reinforcement
- [ ] Cultural bias
### Severity Classification
| Severity | Description | Response Time |
|----------|-------------|---------------|
| Critical | System compromised, severe harm | Immediate |
| High | Safety bypass, harmful output | 24 hours |
| Medium | Partial bypass, concerning output | 1 week |
| Low | Minor issues, edge cases | Next release |
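
The response times in the severity table can be enforced by attaching a due date to each finding as it is logged. A small sketch; the `Finding` record is a hypothetical shape that mirrors the table above:

```csharp
using System;

public enum FindingSeverity { Low, Medium, High, Critical }

public record Finding(string Title, FindingSeverity Severity, DateTime Reported)
{
    // Due dates mirror the response times in the severity table.
    public DateTime? DueBy => Severity switch
    {
        FindingSeverity.Critical => Reported,           // Immediate
        FindingSeverity.High => Reported.AddHours(24),  // 24 hours
        FindingSeverity.Medium => Reported.AddDays(7),  // 1 week
        _ => null                                       // Low: handled in the next release
    };
}
```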
public class AutomatedRedTeam
{
private readonly List<IAttackGenerator> _attackGenerators;
private readonly ITargetSystem _target;
public async Task<RedTeamReport> Execute(
RedTeamConfig config,
CancellationToken ct)
{
var results = new List<AttackResult>();
foreach (var generator in _attackGenerators)
{
var attacks = await generator.GenerateAttacks(config, ct);
foreach (var attack in attacks)
{
var response = await _target.Query(attack.Prompt, ct);
var success = await EvaluateAttackSuccess(
attack,
response,
ct);
results.Add(new AttackResult
{
Attack = attack,
Response = response,
Success = success,
Category = attack.Category
});
if (success)
{
await LogVulnerability(attack, response);
}
}
}
return new RedTeamReport
{
TotalAttacks = results.Count,
SuccessfulAttacks = results.Count(r => r.Success),
ByCategory = results.GroupBy(r => r.Category)
.ToDictionary(g => g.Key, g => g.Count(r => r.Success)),
CriticalFindings = results.Where(r => r.Success && r.Attack.Severity == Severity.Critical)
};
}
}
Track these safety metrics in production and alert when they drift from target:

| Category | Metrics | Target |
|---|---|---|
| Refusal Rate | % harmful requests refused | > 99% |
| False Positive Rate | % benign requests refused | < 5% |
| Jailbreak Resistance | % jailbreak attempts blocked | > 95% |
| Toxicity Score | Average output toxicity | < 0.1 |
| Bias Score | Demographic parity | > 0.9 |
public class SafetyMonitor
{
    // Collaborators injected via constructor (abstractions specific to this example)
    private readonly ISafetyClassifier _safetyClassifier;
    private readonly IMetricsStore _metrics;
    private readonly IAlertingService _alerting;
    private readonly SafetyThresholds _thresholds;

public async Task RecordInteraction(
Interaction interaction,
CancellationToken ct)
{
// Analyze response safety
var safetyScore = await _safetyClassifier.Score(
interaction.Response,
ct);
// Log metrics
await _metrics.RecordAsync(new SafetyMetrics
{
Timestamp = DateTime.UtcNow,
SafetyScore = safetyScore,
WasRefused = interaction.WasRefused,
Category = interaction.Category,
Latency = interaction.Latency
});
// Alert on concerning patterns
if (safetyScore < _thresholds.CriticalSafetyScore)
{
await _alerting.SendAlert(new SafetyAlert
{
Severity = AlertSeverity.Critical,
Interaction = interaction,
Score = safetyScore
});
}
// Check for pattern-based attacks
await DetectAttackPatterns(interaction, ct);
}
}
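
The per-interaction records above can be rolled up into the metrics from the targets table. A minimal aggregation sketch, assuming each logged interaction has been labelled with whether the request was harmful and whether the system refused it (the `LabeledInteraction` shape is hypothetical):

```csharp
using System.Collections.Generic;
using System.Linq;

public record LabeledInteraction(bool RequestWasHarmful, bool WasRefused, double ToxicityScore);

public static class SafetyMetricsReport
{
    public static (double RefusalRate, double FalsePositiveRate, double AvgToxicity) Compute(
        IReadOnlyList<LabeledInteraction> interactions)
    {
        var harmful = interactions.Where(i => i.RequestWasHarmful).ToList();
        var benign = interactions.Where(i => !i.RequestWasHarmful).ToList();

        // Refusal rate: share of harmful requests that were refused (target > 99%).
        var refusalRate = harmful.Count == 0
            ? 1.0
            : harmful.Count(i => i.WasRefused) / (double)harmful.Count;

        // False positive rate: share of benign requests that were refused (target < 5%).
        var falsePositiveRate = benign.Count == 0
            ? 0.0
            : benign.Count(i => i.WasRefused) / (double)benign.Count;

        // Average output toxicity across all interactions (target < 0.1).
        var avgToxicity = interactions.Count == 0
            ? 0.0
            : interactions.Average(i => i.ToxicityScore);

        return (refusalRate, falsePositiveRate, avgToxicity);
    }
}
```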
An AI safety plan template pulls these elements together:

# AI Safety Plan: [Project Name]
## 1. Risk Classification
- **EU AI Act Category**: [Unacceptable/High/Limited/Minimal]
- **NIST AI RMF Profile**: [Reference]
- **Internal Risk Score**: [1-5]
## 2. Identified Risks
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| [Risk 1] | [H/M/L] | [H/M/L] | [Strategy] |
## 3. Guardrails
### Input Guards
- [ ] Prompt injection detection
- [ ] Content filtering
- [ ] Rate limiting
### Output Filters
- [ ] Toxicity filtering
- [ ] PII detection
- [ ] Topic restrictions
## 4. Testing Plan
- [ ] Pre-launch red teaming
- [ ] Continuous adversarial testing
- [ ] Bias evaluation
## 5. Monitoring
- [ ] Safety metrics dashboard
- [ ] Alerting thresholds
- [ ] Incident response plan
## 6. Compliance
- [ ] Documentation complete
- [ ] Human oversight defined
- [ ] Audit trail configured
Inputs from:
- bias-assessment skill → Fairness requirements

Outputs to:
- explainability-planning skill → Transparency requirements
- hitl-design skill → Human oversight design