From evaluation
Classifies AI failures into content, behavioral, technical, and safety categories with severity levels. Helps teams log, trend, prioritize, and analyze issues like hallucinations and refusals.
Slash command
/evaluation:failure-taxonomy
Not all AI failures are the same. A hallucination is different from a refusal, which is different from a tone mismatch. A failure taxonomy classifies failure types so teams can track, prioritise, and address them systematically.
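As a rough illustration of what classifying failures for tracking and prioritisation could look like, here is a minimal Python sketch. The four categories come from the skill description; the severity scale, the `Failure` record, the helper names, and the assignment of each example failure to a category are assumptions made for this example, not part of the skill itself.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum, IntEnum


class Category(Enum):
    # The four categories named in the skill description.
    CONTENT = "content"
    BEHAVIORAL = "behavioral"
    TECHNICAL = "technical"
    SAFETY = "safety"


class Severity(IntEnum):
    # Hypothetical scale: the skill mentions severity levels but does not name them.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Failure:
    failure_type: str      # e.g. "hallucination", "refusal"
    category: Category
    severity: Severity
    note: str = ""


def count_by_category(failures):
    """Tally logged failures per category, e.g. for trend tracking."""
    return Counter(f.category for f in failures)


def prioritise(failures):
    """Return failures ordered from most to least severe."""
    return sorted(failures, key=lambda f: f.severity, reverse=True)


# Example log; the category and severity assigned to each entry are assumptions.
log = [
    Failure("hallucination", Category.CONTENT, Severity.HIGH, "cited a nonexistent source"),
    Failure("refusal", Category.BEHAVIORAL, Severity.MEDIUM, "declined a benign request"),
    Failure("tone mismatch", Category.BEHAVIORAL, Severity.LOW, "too casual for a support reply"),
]

counts = count_by_category(log)
ranked = prioritise(log)
```

Using an ordered enum for severity keeps prioritisation a plain sort, and a per-category counter is enough to spot which kind of failure dominates a given batch.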
Content Failures:
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation
Use this skill when the user asks to "analyze AI errors", "error analysis for our AI feature", "open coding", "axial coding", "analyze model failures", "categorize AI mistakes", "find patterns in bad AI outputs", "what's wrong with our AI", or has a set of bad AI outputs and wants to understand what's failing and why. This is the first step in the AI eval methodology from Hamel Husain and Shreya Shankar.
Guides analysis of LLM pipeline traces to identify, categorize, and prioritize failure modes. Use for new eval projects, pipeline changes, metric drops, or incidents.
Diagnoses root causes of AI feature failures such as hallucinations, inconsistency, or slowness using a 4D audit (Demand, Data, Discovery, Defense), starting from symptoms or Linear issues.