From pm-thought-partner
Analyzes real AI/LLM traces: judge pass/fail, categorize failures from data, compute rates, prioritize fixes. For 50+ test cases or production failures.
npx claudepluginhub breethomas/bette-think

This skill uses the workspace's default tool permissions.
Guide the PM through reading real LLM pipeline traces and building a catalog of how the system fails. Categories emerge from traces, never brainstormed.
Guides analysis of LLM pipeline traces to identify, categorize, and prioritize failure modes. Use for new eval projects, pipeline changes, metric drops, or incidents.
When this skill is invoked, start with:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
UPGRADE EVALS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Error analysis on real traces. Categories emerge from data.
What AI feature are we analyzing?
How are you collecting traces today?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Prerequisite: /start-evals or equivalent (the PM has basic familiarity with pass/fail judgment on outputs).

Ask the PM where traces live. Help them pull ~100 representative traces.
Target: ~100 traces. New failure types stop appearing around this number. Adjust for system complexity.
From real user data (preferred): pull traces from the PM's production logging system, API, or database.
From synthetic data (when real data is sparse): use /generate-test-data to create diverse inputs.

What to capture per trace: input, all intermediate LLM calls, tool uses, retrieved documents, reasoning steps, and final output. The more of the pipeline visible, the better the root cause analysis.
Claude Code executes: Write Python scripts to pull traces from the PM's logging system, API, or database. Format them for review.
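A minimal sketch of what such a script might look like, assuming traces sit in a SQLite table named traces with input, steps, and output columns; every table, column, and file name here is a placeholder for the PM's actual logging setup.

import json
import sqlite3

conn = sqlite3.connect("app_logs.db")   # placeholder log database
rows = conn.execute(
    "SELECT trace_id, user_input, steps_json, final_output "
    "FROM traces ORDER BY created_at DESC LIMIT 100"
).fetchall()

# Write one JSON object per line so each trace can be presented for review
with open("traces_for_review.jsonl", "w") as f:
    for trace_id, user_input, steps_json, final_output in rows:
        f.write(json.dumps({
            "trace_id": trace_id,
            "input": user_input,
            "steps": json.loads(steps_json),   # intermediate LLM calls, tool uses, retrievals
            "output": final_output,
        }) + "\n")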
Present each trace to the PM. For each one, ask: did the system produce a good result? Pass or Fail.
For failures, the PM notes what went wrong. Focus on the first thing that went wrong in the trace — errors cascade, so downstream symptoms disappear when the root cause is fixed.
Write observations, not explanations. "SQL missed the budget constraint" not "The model probably didn't understand the budget."
Template for tracking:
| Trace ID | Pass/Fail | What went wrong |
|----------|-----------|-----------------|
| 001 | Fail | Missing filter: pet-friendly requirement ignored in SQL |
| 002 | Fail | Proposed unavailable times despite calendar conflicts |
| 003 | Fail | Used casual tone for luxury client; wrong property type |
| 004 | Pass | - |
If the PM is stuck articulating what feels wrong: prompt with common failure types — made-up facts, malformed output, ignored user requirements, wrong tone, tool misuse. These are conversation starters, not a category list.
Do NOT start with a pre-defined failure list. Let categories emerge from what the PM actually sees.
Claude Code executes: Build the tracking spreadsheet/table. Pull and format individual traces for review. Compute running statistics.
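One possible shape for that tracking table, sketched as a small Python helper; the review_log.csv file name and column names simply mirror the template above and are not prescribed by the skill.

import csv
from pathlib import Path

LOG = Path("review_log.csv")

def append_review(trace_id: str, passed: bool, note: str = "-") -> None:
    # Append one reviewed trace to the log, writing the header on first use
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["trace_id", "pass_fail", "what_went_wrong"])
        writer.writerow([trace_id, "Pass" if passed else "Fail", note])

def running_stats() -> None:
    # Report the running pass rate across everything reviewed so far
    with LOG.open() as f:
        rows = list(csv.DictReader(f))
    if not rows:
        print("no traces reviewed yet")
        return
    passes = sum(r["pass_fail"] == "Pass" for r in rows)
    print(f"{len(rows)} traces reviewed, pass rate {passes / len(rows):.0%}")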
After reviewing 30-50 traces, start grouping similar notes into categories. Don't wait until all 100 are done — grouping early sharpens what to look for in remaining traces.
When to split vs. group: split failures that have different root causes even when the symptoms look similar; group failures that share the same root cause even when the symptoms differ.
LLM-assisted clustering (use only after the PM has reviewed 30-50 traces):
Claude Code executes: Run clustering on failure annotations. Present suggested groupings to the PM for review. LLMs tend to cluster by surface similarity (e.g., grouping "app crashes on login" and "login is slow" because both mention login, even though the root causes differ). The PM validates whether groupings reflect actual root causes.
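A sketch of what that clustering step could look like, assuming sentence-transformers and scikit-learn are installed; the example notes come from the template table above and the cluster count is arbitrary. The output is only a suggestion for the PM to accept, merge, or split.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Failure notes copied from the tracking template above
notes = [
    "Missing filter: pet-friendly requirement ignored in SQL",
    "Proposed unavailable times despite calendar conflicts",
    "Used casual tone for luxury client; wrong property type",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(notes)

# Cluster count is a guess; the PM decides whether the groupings hold up
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for note, cluster in zip(notes, kmeans.labels_):
    print(cluster, note)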
Aim for 5-10 categories that are distinct from one another, grounded in the reviewed traces, and specific enough to label pass/fail.
Go back through all traces and apply binary labels (pass/fail) for each failure category. Each trace gets a column per category.
Claude Code executes: Build the labeling spreadsheet or script. Automate where possible — some categories may be detectable with simple code checks.
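A sketch of the kind of simple check that can label a category without a judgment call; the two category names here (malformed JSON output, missing budget filter in generated SQL) are hypothetical examples, not categories the skill prescribes.

import json
import re

def label_trace(output: str, sql: str) -> dict:
    labels = {}
    # Malformed output: does the final response parse as JSON at all?
    try:
        json.loads(output)
        labels["malformed_json"] = 0
    except json.JSONDecodeError:
        labels["malformed_json"] = 1
    # Missing budget filter: the generated SQL should constrain price
    labels["missing_budget_filter"] = 0 if re.search(r"\bprice\s*<=?", sql, re.I) else 1
    return labels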
Claude Code executes:
import pandas as pd

labeled_df = pd.read_csv("labeled_traces.csv")                       # one row per trace, one 0/1 column per category
failure_rates = labeled_df[failure_columns].sum() / len(labeled_df)  # failure_columns: list of category column names
print(failure_rates.sort_values(ascending=False))                    # most frequent failure category first
Present the ranked list to the PM. The most frequent failure category is where to focus first.
Work through each category with the PM in this order:
Can we just fix it? Many failures have obvious fixes that don't need an evaluator: a prompt tweak, a bug fix in the pipeline code, or a correction to a tool definition.
If a clear fix resolves the failure, do that first. Only consider an evaluator for failures that persist after fixing.
Is an evaluator worth the effort? Not every remaining failure needs one. Ask the PM: how often does this failure occur, and is this a part of the pipeline they will keep iterating on?
Reserve evaluators for failures the PM will iterate on repeatedly.
For failures that warrant an evaluator: prefer code-based checks (regex, parsing, schema validation) for anything objective. Use /build-judge only for failures that require judgment. Critical requirements (safety, compliance) may warrant an evaluator even after fixing the prompt, as a guardrail.
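For the schema-validation flavor of a code-based check, a minimal sketch using the jsonschema package (assumed installed); the required fields are invented for illustration and would come from the feature's actual output contract.

from jsonschema import Draft7Validator

OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["listings", "tone"],
    "properties": {
        "listings": {"type": "array", "minItems": 1},
        "tone": {"enum": ["formal", "casual"]},
    },
}

def evaluate_output(output: dict) -> list[str]:
    # Return schema violations; an empty list means the output passes
    return [e.message for e in Draft7Validator(OUTPUT_SCHEMA).iter_errors(output)]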
What to tell your engineer: For each category, produce a one-line summary: category name, frequency, recommended action (fix prompt / fix code / build evaluator / skip).
Expect 2-3 rounds of reviewing and refining categories. After each round, present results in this format:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ERROR ANALYSIS RESULTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Traces reviewed: [count]
Overall pass rate: [%]
FAILURE CATEGORIES (ranked by frequency):
| Category | Rate | Action |
|----------|------|--------|
| [name] | [%] | Fix prompt |
| [name] | [%] | Build evaluator → /build-judge |
| [name] | [%] | Fix code |
| [name] | [%] | Skip (low impact) |
NEXT STEPS:
1. Fix [top category] — [specific action]
2. Build judge for [category] — run /build-judge
3. Re-run analysis after fixes
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Stop reviewing when new traces aren't revealing new kinds of failures. Roughly: ~100 traces reviewed with no new failure types in the last 20.
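A quick way to make that stopping rule concrete, sketched as a small helper; it assumes you track, per reviewed trace, the set of failure categories observed, and the 20-trace window matches the rule of thumb above.

def saturated(categories_per_trace: list[set], window: int = 20) -> bool:
    # Saturated when the last `window` reviewed traces introduced no category
    # that was not already seen in earlier traces
    if len(categories_per_trace) < window:
        return False
    seen_before = set().union(*categories_per_trace[:-window]) if categories_per_trace[:-window] else set()
    recent = set().union(*categories_per_trace[-window:])
    return recent <= seen_before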
When production volume is high, use a mix:
| Strategy | When to Use | Method |
|---|---|---|
| Random | Default starting point | Sample uniformly from recent traces |
| Outlier | Surface unusual behavior | Sort by response length, latency, tool call count; review extremes |
| Failure-driven | After guardrail violations or user complaints | Prioritize flagged traces |
| Uncertainty | When automated judges exist | Focus on traces where judges disagree or have low confidence |
| Stratified | Ensure coverage across user segments | Sample within each dimension |
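A sketch of how a few of these strategies might look over a traces DataFrame, assuming columns like latency_ms and user_segment exist; the file name, column names, and sample sizes are placeholders.

import pandas as pd

traces = pd.read_json("traces_for_review.jsonl", lines=True)

# Random: uniform sample of recent traces
random_sample = traces.sample(n=min(50, len(traces)), random_state=0)

# Outlier: review the slowest responses
outliers = traces.nlargest(10, "latency_ms")

# Stratified: a few traces per user segment
stratified = traces.groupby("user_segment", group_keys=False).apply(
    lambda g: g.sample(n=min(len(g), 5), random_state=0)
)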
Next steps after error analysis:
/generate-test-data, /build-judge, /eval-rag, /calibrate

Methodology: Adapted from Hamel Husain's error-analysis skill (evals-skills, MIT license).
PM adaptation: Claude Code pulls and reads traces; the PM provides pass/fail judgment and domain expertise.