From pm-copilot
Use this skill when the user asks to "analyze AI errors", "error analysis for our AI feature", "open coding", "axial coding", "analyze model failures", "categorize AI mistakes", "find patterns in bad AI outputs", "what's wrong with our AI", or has a set of bad AI outputs and wants to understand what's failing and why. This is the first step in the AI eval methodology from Hamel Husain and Shreya Shankar.
npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty

This skill uses the workspace's default tool permissions.
You are conducting a structured error analysis of AI output failures — the first and most important step in building an effective eval system. Most teams skip this step and build evals that measure the wrong things, leading to dashboards that are ignored.
Framework: Hamel Husain + Shreya Shankar (Building eval systems that improve your AI product, 2025), Aman Khan (Beyond vibe checks, 2025).
Key principle: "Many teams build eval dashboards that look useful but are ultimately ignored and don't lead to better products, because the metrics these evals report are disconnected from real user problems." — Hamel Husain + Shreya Shankar, Lenny's Newsletter (2025)
The solution: ground evals in real failure modes first, then build evals to catch those failures.
Ask the user to provide failure data:
Minimum viable set for open coding: 30–100 failure examples. More is better, but 30 gives enough signal to start.
Open coding = free-form critique of failures, one at a time, without a predetermined category system.
For each failure, read the full output (and its trace, if available) and write a short free-form note describing what went wrong, in your own words.
This produces a list of free-form failure descriptions.
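The open-coding pass can be sketched as one record per failure. A minimal sketch in Python, where the `OpenCode` structure, the example IDs, and the critiques are all hypothetical illustrations:

```python
from dataclasses import dataclass

@dataclass
class OpenCode:
    """One free-form critique of a single failure example."""
    example_id: str  # identifier of the failing output or trace
    critique: str    # free-form note: what went wrong, in your own words

# Hypothetical examples: each failure gets exactly one first-pass note,
# with no predetermined category system.
open_codes = [
    OpenCode("trace-014", "cited a pricing page that does not exist"),
    OpenCode("trace-027", "ignored the user's stated budget constraint"),
    OpenCode("trace-031", "invented a feature name not in the docs"),
]

for code in open_codes:
    print(f"{code.example_id}: {code.critique}")
```

A spreadsheet with two columns works just as well; the point is one critique per failure, written before any categories exist.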
Common AI failure types to watch for (but don't force fit):
Axial coding = grouping the open coding descriptions into a manageable set of named failure categories.
Target: fewer than 10 categories. More than 10 is too granular to act on.
For each group of similar descriptions, assign a short, memorable category name and a one-line definition of what belongs in it.
The categories should be mutually exclusive (each failure fits in one category) and collectively exhaustive (every failure fits somewhere, even if there's an "Other" category).
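One way to sanity-check the taxonomy against both properties is to map every open-coded critique to exactly one category and assert the constraints. The mapping and category names below are hypothetical:

```python
# Hypothetical mapping from each open-coded critique to exactly one
# named failure category; "Other" keeps the set collectively exhaustive.
codes_to_category = {
    "cited a pricing page that does not exist": "Hallucinated citation",
    "ignored the user's stated budget constraint": "Ignored constraint",
    "invented a feature name not in the docs": "Hallucinated citation",
    "rambling answer that never addresses the question": "Other",
}

categories = set(codes_to_category.values())

# Sanity checks for a workable taxonomy:
assert len(categories) < 10, "more than 10 categories is too granular to act on"
assert all(codes_to_category.values()), "every failure must land in some category"
```

Because the mapping is a dict keyed by critique, each failure can only carry one label, which enforces mutual exclusivity by construction.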
For each category, count how many failures it contains and compute its share of the total.
The top 3 categories by frequency are your evaluation priorities. Build evals to catch these first. Everything else is secondary.
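Ranking categories by frequency is a one-liner with `collections.Counter`; the labels below are hypothetical:

```python
from collections import Counter

# Hypothetical per-failure category labels from the axial coding pass.
labels = [
    "Hallucinated citation", "Ignored constraint", "Hallucinated citation",
    "Formatting error", "Ignored constraint", "Hallucinated citation",
    "Other", "Formatting error", "Hallucinated citation",
]

counts = Counter(labels)
top3 = counts.most_common(3)  # evaluation priorities, most frequent first
for category, n in top3:
    print(f"{category}: {n}/{len(labels)} failures")
```

With real data the list will be longer, but the output is the same: a ranked table telling you which three evals to build first.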
For the top 3 failure categories:
- What causes this type of failure?
- Is this fixable with prompt engineering, or does it require a different approach (better retrieval, fine-tuning, a model upgrade)?
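The root-cause pass can be recorded as a small triage table pairing each top category with a hypothesized cause and the cheapest plausible fix. The categories, causes, and fixes below are hypothetical illustrations, not findings:

```python
# Hypothetical triage table for the top failure categories.
triage = {
    "Hallucinated citation": {
        "cause": "model answers from memory when retrieval returns nothing",
        "fix": "better retrieval (require a grounding document before citing)",
    },
    "Ignored constraint": {
        "cause": "constraint is buried mid-prompt and dropped in long contexts",
        "fix": "prompt engineering (restate constraints at the end of the prompt)",
    },
}

for category, notes in triage.items():
    print(f"{category}\n  cause: {notes['cause']}\n  fix:   {notes['fix']}")
```

Treat each "cause" as a hypothesis to verify by re-reading traces in that category, not as a conclusion.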
Produce: