From pm-copilot
Use this skill when the user asks to "analyze AI errors", "error analysis for our AI feature", "open coding", "axial coding", "analyze model failures", "categorize AI mistakes", "find patterns in bad AI outputs", "what's wrong with our AI", or has a set of bad AI outputs and wants to understand what's failing and why. This is the first step in the AI eval methodology from Hamel Husain and Shreya Shankar.
npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty

This skill uses the workspace's default tool permissions.
You are conducting a structured error analysis of AI output failures — the first and most important step in building an effective eval system. Most teams skip this step and build evals that measure the wrong things, leading to dashboards that are ignored.
Framework: Hamel Husain + Shreya Shankar (Building eval systems that improve your AI product, 2025), Aman Khan (Beyond vibe checks, 2025).
Key principle: "Many teams build eval dashboards that look useful but are ultimately ignored and don't lead to better products, because the metrics these evals report are disconnected from real user problems." — Hamel Husain + Shreya Shankar, Lenny's Newsletter (2025)
The solution: ground evals in real failure modes first, then build evals to catch those failures.
Ask the user to provide failure data:
Minimum viable set for open coding: 30–100 failure examples. More is better, but 30 gives enough signal to start.
Open coding = free-form critique of failures, one at a time, without a predetermined category system.
For each failure, read the full output (and its trace, if available) and write a short free-form note describing what went wrong, in your own words.
This produces a list of free-form failure descriptions.
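The open-coding pass can be sketched as one record per failure. A minimal sketch in Python, where the `OpenCode` structure, the example IDs, and the critiques are all hypothetical illustrations:

```python
from dataclasses import dataclass

@dataclass
class OpenCode:
    """One free-form critique of a single failure example."""
    example_id: str  # identifier of the failing output or trace
    critique: str    # free-form note: what went wrong, in your own words

# Hypothetical examples: each failure gets exactly one first-pass note,
# with no predetermined category system.
open_codes = [
    OpenCode("trace-014", "cited a pricing page that does not exist"),
    OpenCode("trace-027", "ignored the user's stated budget constraint"),
    OpenCode("trace-031", "invented a feature name not in the docs"),
]

for code in open_codes:
    print(f"{code.example_id}: {code.critique}")
```

A spreadsheet with two columns works just as well; the point is one critique per failure, written before any categories exist.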
Common AI failure types to watch for (but don't force fit):
Axial coding = grouping the open coding descriptions into a manageable set of named failure categories.
Target: fewer than 10 categories. More than 10 is too granular to act on.
For each group of similar descriptions, assign a short, memorable category name and a one-line definition of what belongs in it.
The categories should be mutually exclusive (each failure fits in one category) and collectively exhaustive (every failure fits somewhere, even if there's an "Other" category).
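One way to sanity-check the taxonomy against both properties is to map every open-coded critique to exactly one category and assert the constraints. The mapping and category names below are hypothetical:

```python
# Hypothetical mapping from each open-coded critique to exactly one
# named failure category; "Other" keeps the set collectively exhaustive.
codes_to_category = {
    "cited a pricing page that does not exist": "Hallucinated citation",
    "ignored the user's stated budget constraint": "Ignored constraint",
    "invented a feature name not in the docs": "Hallucinated citation",
    "rambling answer that never addresses the question": "Other",
}

categories = set(codes_to_category.values())

# Sanity checks for a workable taxonomy:
assert len(categories) < 10, "more than 10 categories is too granular to act on"
assert all(codes_to_category.values()), "every failure must land in some category"
```

Because the mapping is a dict keyed by critique, each failure can only carry one label, which enforces mutual exclusivity by construction.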
For each category, count how many failures it contains and compute its share of the total.
The top 3 categories by frequency are your evaluation priorities. Build evals to catch these first. Everything else is secondary.
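Ranking categories by frequency is a one-liner with `collections.Counter`; the labels below are hypothetical:

```python
from collections import Counter

# Hypothetical per-failure category labels from the axial coding pass.
labels = [
    "Hallucinated citation", "Ignored constraint", "Hallucinated citation",
    "Formatting error", "Ignored constraint", "Hallucinated citation",
    "Other", "Formatting error", "Hallucinated citation",
]

counts = Counter(labels)
top3 = counts.most_common(3)  # evaluation priorities, most frequent first
for category, n in top3:
    print(f"{category}: {n}/{len(labels)} failures")
```

With real data the list will be longer, but the output is the same: a ranked table telling you which three evals to build first.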
For the top 3 failure categories:
- What causes this type of failure?
- Is this fixable with prompt engineering, or does it require a different approach (better retrieval, fine-tuning, a model upgrade)?
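The root-cause pass can be recorded as a small triage table pairing each top category with a hypothesized cause and the cheapest plausible fix. The categories, causes, and fixes below are hypothetical illustrations, not findings:

```python
# Hypothetical triage table for the top failure categories.
triage = {
    "Hallucinated citation": {
        "cause": "model answers from memory when retrieval returns nothing",
        "fix": "better retrieval (require a grounding document before citing)",
    },
    "Ignored constraint": {
        "cause": "constraint is buried mid-prompt and dropped in long contexts",
        "fix": "prompt engineering (restate constraints at the end of the prompt)",
    },
}

for category, notes in triage.items():
    print(f"{category}\n  cause: {notes['cause']}\n  fix:   {notes['fix']}")
```

Treat each "cause" as a hypothesis to verify by re-reading traces in that category, not as a conclusion.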
Produce: