From magic-powers
Use when adding safety layers to AI features - output validation, hallucination detection, content filtering, PII redaction, input sanitization.

npx claudepluginhub kienbui1995/magic-powers --plugin magic-powers

This skill uses the workspace's default tool permissions.
LLMs will confidently produce harmful, incorrect, or leaked content if you don't add guardrails. Every AI feature needs input validation, output validation, and fallback behavior.
```
User Input → Input Guard → LLM → Output Guard → User
                  ↓                    ↓
            Block/sanitize      Validate/filter
```
Input guards:

| Guard | What | Implementation |
|---|---|---|
| Prompt injection detection | Block "ignore instructions" attacks | Classifier or regex filter |
| Input length limit | Prevent context stuffing | Max token count |
| PII detection | Redact before sending to LLM | Regex + NER model |
| Topic filtering | Block off-topic requests | Classifier |
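The input guards above can be chained into one pre-flight check. A minimal sketch, assuming a character-based length limit and a single PII pattern (the regexes, limit, and function name are illustrative, not part of the skill):

```python
import re

MAX_INPUT_CHARS = 8000  # illustrative limit; tune to your context budget

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
INJECTION_RE = re.compile(r"ignore (previous|all|your) (instructions|rules)", re.I)

def run_input_guards(user_input: str) -> tuple[bool, str]:
    """Return (allowed, sanitized_input): block injection, redact PII, truncate."""
    if INJECTION_RE.search(user_input):
        return False, ""  # hard block: prompt injection attempt
    sanitized = EMAIL_RE.sub("[EMAIL]", user_input)  # redact before the LLM sees it
    return True, sanitized[:MAX_INPUT_CHARS]  # crude length limit
```

A real pipeline would add the classifier-based guards from the table; this shows only the cheap regex tier.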
Output guards:

| Guard | What | Implementation |
|---|---|---|
| Hallucination check | Verify claims against source | Cross-reference with retrieved docs |
| PII leak detection | Catch leaked personal data | Regex scan on output |
| Format validation | Ensure JSON/structured output | Schema validation |
| Toxicity filter | Block harmful content | Classifier (Perspective API, etc.) |
| Confidence threshold | Reject low-confidence answers | "I don't know" fallback |
```
IF output fails any guard:
  → Don't show raw LLM output
  → Return safe fallback: "I'm not sure about that. Let me connect you with support."
  → Log the failure for review
```
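That fallback rule can be sketched as a wrapper around the guard list. The logger name and the `no_email_guard` example are illustrative assumptions; real guards would include the schema, toxicity, and hallucination checks from the table:

```python
import logging
import re

logger = logging.getLogger("output_guards")
SAFE_FALLBACK = "I'm not sure about that. Let me connect you with support."

def guard_output(raw_output: str, guards) -> str:
    """Run each output guard; on any failure, log it and return the safe fallback."""
    for guard in guards:
        ok, reason = guard(raw_output)
        if not ok:
            logger.warning("output guard failed: %s", reason)  # log for later review
            return SAFE_FALLBACK  # never show the raw LLM output
    return raw_output

# Example guard: flag output that leaks an email address (PII leak check)
def no_email_guard(text: str):
    leaked = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return (leaked is None, "pii_leak" if leaked else "")
```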
Modern jailbreaks are sophisticated: detect by pattern and behavior, not just keywords. Common patterns include instruction overrides ("ignore previous instructions"), persona escapes ("act as DAN, with no restrictions"), and long inputs that bury an injection.

Defense approach:
```python
import re
from dataclasses import dataclass

@dataclass
class JailbreakResult:
    detected: bool
    signals: list
    action: str

def detect_jailbreak_attempt(user_input: str) -> JailbreakResult:
    signals = []
    # Pattern-based (fast, cheap)
    if re.search(r"ignore (previous|all|your) (instructions|rules)", user_input, re.I):
        signals.append("instruction_override")
    if re.search(r"(pretend|act as|roleplay|you are now) .*(no restrictions|DAN|uncensored)", user_input, re.I):
        signals.append("persona_escape")
    # Semantic check (moderate cost); count_tokens and classifier are your own helpers
    if count_tokens(user_input) > 2000:  # long inputs may hide injection
        injection_score = classifier.score(user_input, "prompt_injection")
        if injection_score > 0.7:
            signals.append("long_form_injection")
    return JailbreakResult(
        detected=len(signals) > 0,
        signals=signals,
        action="block" if len(signals) >= 2 else "flag_for_review",
    )
```
Indirect prompt injection through data: instructions hidden in retrieved documents, web pages, or uploaded files can hijack the model just like direct user input, so run retrieved content through the same guards.
Log all jailbreak attempts for pattern analysis — coordinated attacks show up as clusters.
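One cheap way to surface those clusters, assuming each attempt is logged as a (timestamp, signal) pair; the hourly bucketing and threshold here are arbitrary examples, not part of the skill:

```python
from collections import Counter
from datetime import datetime  # attempt timestamps are datetime objects

def find_attack_clusters(attempts, threshold=5):
    """attempts: iterable of (timestamp, signal) pairs from the jailbreak log.
    Flags any signal seen at least `threshold` times within the same clock hour."""
    buckets = Counter()
    for ts, signal in attempts:
        hour = ts.replace(minute=0, second=0, microsecond=0)  # bucket by hour
        buckets[(hour, signal)] += 1
    return sorted({sig for (_, sig), count in buckets.items() if count >= threshold})
```

A burst of the same signal inside one bucket is a hint of a coordinated probe worth reviewing by hand.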
AI outputs can embed demographic bias. Add systematic checks for high-stakes decisions.

Dimensions to monitor: gender, ethnicity (including name-based proxies), age, and any other protected attribute relevant to the decision.
Fairness testing in eval:
```python
# Run the same query with different demographic contexts — expect consistent quality
test_cases = [
    {"query": "Evaluate this resume", "candidate": {"name": "James Smith", "gender": "M"}},
    {"query": "Evaluate this resume", "candidate": {"name": "Jamal Smith", "gender": "M"}},
    {"query": "Evaluate this resume", "candidate": {"name": "Jane Smith", "gender": "F"}},
]

# max_quality_variance is your own eval helper; quality scores should not
# significantly differ across the demographic variants.
assert max_quality_variance(test_cases) < 0.05  # 5% tolerance
```
For high-stakes use cases (hiring, lending, medical, legal), redact PII before it reaches the model:

```
Before LLM: "John Smith (john@email.com) ordered..."
Redacted:   "[NAME] ([EMAIL]) ordered..."
After LLM:  re-inject PII only if needed in the response
```
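A regex pass handles structured PII like emails and phone numbers; names generally need an NER model on top, as the table notes. A minimal sketch (the patterns are illustrative, not exhaustive):

```python
import re

PII_PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(text: str) -> str:
    """Replace structured PII with placeholders before the text reaches the LLM."""
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text
```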
EU AI Act (entered into force 2024; obligations phase in 2025-2027):
| Risk level | Examples | Requirements |
|---|---|---|
| Unacceptable | Social scoring, subliminal manipulation | Prohibited |
| High | Hiring, credit, medical, law enforcement | Impact assessment, human oversight, logging |
| Limited | Chatbots, deepfakes | Disclosure required |
| Minimal | Spam filters, games | No specific requirements |
For high-risk AI systems:
```python
import hashlib
from datetime import datetime, timezone

# Required logging for EU AI Act compliance.
# audit_log and get_data_sources are your own infrastructure.
def log_high_risk_decision(decision, user_id, model_version, confidence):
    audit_log.write({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        # pseudonymize with a stable hash (builtin hash() varies per process)
        "user_id": hashlib.sha256(str(user_id).encode()).hexdigest(),
        "model_version": model_version,
        "confidence": confidence,
        "human_reviewed": False,
        "data_sources": get_data_sources(),
    })
```
Sensitive domain guardrails (medical/legal/financial):
```python
SENSITIVE_DOMAIN_RESPONSES = {
    "medical": "This is general information only. Consult a qualified healthcare provider for medical advice.",
    "legal": "This is not legal advice. Consult a licensed attorney for guidance on your specific situation.",
    "financial": "This is not financial advice. Consult a registered financial advisor before making investment decisions.",
}

def add_domain_disclaimer(output: str, detected_domain: str) -> str:
    if detected_domain in SENSITIVE_DOMAIN_RESPONSES:
        return output + f"\n\n⚠️ {SENSITIVE_DOMAIN_RESPONSES[detected_domain]}"
    return output
Always include: