Classifies AI failures into content, behavioral, technical, and safety categories with severity levels. Helps teams log, trend, prioritize, and analyze issues like hallucinations and refusals.
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation

This skill uses the workspace's default tool permissions.
Not all AI failures are the same. A hallucination is different from a refusal, which is different from a tone mismatch. A failure taxonomy classifies failure types so teams can track, prioritise, and address them systematically.
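The categories and severity levels above can be sketched as a small data model. This is a minimal illustration, not the skill's actual implementation; the names (`Category`, `Severity`, `Failure`, `triage`) and the severity weighting are assumptions chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    CONTENT = "content"        # e.g. hallucinations, factual errors
    BEHAVIORAL = "behavioral"  # e.g. refusals, tone mismatch
    TECHNICAL = "technical"    # e.g. timeouts, malformed output
    SAFETY = "safety"          # e.g. harmful or policy-violating output

class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

@dataclass
class Failure:
    description: str
    category: Category
    severity: Severity

def triage(failures):
    """Rank categories by total severity so the worst buckets surface first."""
    scores = Counter()
    for f in failures:
        scores[f.category] += f.severity.value
    return scores.most_common()

log = [
    Failure("cited a nonexistent paper", Category.CONTENT, Severity.HIGH),
    Failure("refused a benign request", Category.BEHAVIORAL, Severity.MEDIUM),
    Failure("response truncated mid-sentence", Category.TECHNICAL, Severity.LOW),
]
print(triage(log))
```

Logging each failure with both a category and a severity is what makes trending possible: the same tally can be recomputed per week or per release to see whether a fix actually moved the numbers.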
Content Failures: