From pm-copilot
Use this skill when the user asks to "analyze AI errors", "error analysis for our AI feature", "open coding", "axial coding", "analyze model failures", "categorize AI mistakes", "find patterns in bad AI outputs", "what's wrong with our AI", or has a set of bad AI outputs and wants to understand what's failing and why. This is the first step in the AI eval methodology from Hamel Husain and Shreya Shankar.
npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty
How this skill is triggered (by the user, by Claude, or both):
Slash command
/pm-copilot:error-analysis
The summary Claude sees in its skill listing (used to decide when to auto-load this skill):
You are conducting a structured error analysis of AI output failures — the first and most important step in building an effective eval system. Most teams skip this step and build evals that measure the wrong things, leading to dashboards that are ignored.
Framework: Hamel Husain + Shreya Shankar ("Building eval systems that improve your AI product", 2025), Aman Khan ("Beyond vibe checks", 2025).
Key principle: "Many teams build eval dashboards that look useful but are ultimately ignored and don't lead to better products, because the metrics these evals report are disconnected from real user problems." — Hamel Husain + Shreya Shankar, Lenny's Newsletter (2025)
The solution: ground evals in real failure modes first, then build evals to catch those failures.
Ask the user to provide failure data:
Minimum viable set for open coding: 30–100 failure examples. More is better, but 30 gives enough signal to start.
Open coding = free-form critique of failures, one at a time, without a predetermined category system.
For each failure:
This produces a list of free-form failure descriptions.
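As a rough illustration of what the open coding pass yields, the sketch below records one free-form critique per failure. The `OpenCode` record, its field names, and the `critique_fn` callback are hypothetical and not part of the skill; in practice the critique comes from a person (or Claude) reading each failure in turn.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OpenCode:
    """One free-form critique of a single failure (field names are illustrative)."""
    example_id: str
    user_input: str
    ai_output: str
    critique: str  # free-form note, written without any category in mind


def open_code(failures: list[dict], critique_fn: Callable[[dict], str]) -> list[OpenCode]:
    """Visit failures one at a time and record a free-form critique for each."""
    return [
        OpenCode(f["id"], f["input"], f["output"], critique_fn(f))
        for f in failures
    ]


# Hypothetical usage: two failures and a stand-in critique function.
failures = [
    {"id": "f1", "input": "Summarize the ticket", "output": "(summary of another ticket)"},
    {"id": "f2", "input": "Draft a reply", "output": "(reply that ignores the question)"},
]
notes = open_code(failures, lambda f: f"Reviewer note for {f['id']}")
```

Keeping the critique as plain text, with no category field, mirrors the point of open coding: categories are only introduced in the next step.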
Common AI failure types to watch for (but don't force-fit):
Axial coding = grouping the open coding descriptions into a manageable set of named failure categories.
Target: fewer than 10 categories. Ten or more is too granular to act on.
For each group of similar descriptions:
The categories should be mutually exclusive (each failure fits in one category) and collectively exhaustive (every failure fits somewhere, even if there's an "Other" category).
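One way to sanity-check an axial coding pass against these two properties is sketched below. The function name, the `(description_id, category)` pair format, and the example category names are assumptions made for illustration only.

```python
def check_axial_coding(description_ids, assignments, categories):
    """Report violations of the 'fewer than 10, exclusive, exhaustive' rules.

    description_ids: ids of the open-coding descriptions
    assignments: (description_id, category) pairs from the grouping pass
    categories: the named failure categories
    """
    per_item = {}
    for desc_id, category in assignments:
        per_item.setdefault(desc_id, set()).add(category)

    return {
        # The target is fewer than 10 categories.
        "too_many_categories": len(categories) >= 10,
        # Collectively exhaustive: every description lands somewhere.
        "unassigned": [d for d in description_ids if d not in per_item],
        # Mutually exclusive: no description sits in two categories.
        "multi_assigned": [d for d in description_ids if len(per_item.get(d, ())) > 1],
        # Assignments that use a category name that was never defined.
        "unknown_categories": sorted({c for _, c in assignments} - set(categories)),
    }


# Hypothetical example: d2 is double-coded and d3 was never grouped.
report = check_axial_coding(
    description_ids=["d1", "d2", "d3"],
    assignments=[("d1", "Hallucination"), ("d2", "Hallucination"), ("d2", "Wrong tone")],
    categories=["Hallucination", "Wrong tone", "Other"],
)
```

An empty `unassigned` list and an empty `multi_assigned` list together indicate the grouping is both exhaustive and exclusive.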
For each category:
The top 3 categories by frequency are your evaluation priorities. Build evals to catch these first. Everything else is secondary.
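The frequency ranking can be sketched with a simple counter, assuming each failure has already been assigned exactly one category. The function name and the category labels in the example are placeholders.

```python
from collections import Counter


def top_priorities(category_per_failure, n=3):
    """Return the n most frequent categories as (category, count, share) tuples."""
    counts = Counter(category_per_failure)
    total = sum(counts.values())
    return [
        (category, count, count / total)
        for category, count in counts.most_common(n)
    ]


# Hypothetical labels for 11 coded failures.
labels = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]
priorities = top_priorities(labels)
```

Here `priorities` lists categories A, B, and C with their counts and shares of all failures; category D falls outside the top 3 and is treated as secondary.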
For the top 3 failure categories:
What causes this type of failure?
Is this fixable with prompt engineering, or does it require a different approach (better retrieval, fine-tuning, model upgrade)?
Produce: