From cc-skills-meta
Structured behavioral analysis for LLM performance debugging — hypothesis testing for session patterns (loops, context degradation, decision inefficiency, cognitive overload, attention drift)
npx claudepluginhub enduser123/cc-skills-meta
How this skill is triggered: by the user, by Claude, or both
Slash command
/cc-skills-meta:behave
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Structured hypothesis-testing analysis for LLM performance debugging in chat sessions.
Analyze session history to detect and diagnose behavioral patterns that indicate performance issues: loops, context degradation, decision inefficiency, cognitive overload, and attention drift.
Required: Bullet structure with explicit hypothesis labels (H₁, H₂, H₃), NOT tables or narrative conclusions.
For each observed symptom, generate 3-5 competing root cause hypotheses. No filtering — list all candidates.
Format:
H₁: [specific mechanism]
H₂: [alternative mechanism]
H₃: [third mechanism]
H₄: [fourth mechanism]
H₅: [fifth mechanism]
Rule: Do NOT suppress or pre-judge any hypothesis.
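The labeling rule above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper `label_hypotheses` (not part of the skill) that takes candidate mechanisms as plain strings, rejects lists outside the 3-5 range, and emits one labeled line per candidate without filtering any of them out.

```python
# Hypothetical helper, not part of the skill: labels candidates as H₁..H₅.
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def label_hypotheses(mechanisms: list[str]) -> list[str]:
    """Return one 'Hₙ: mechanism' line per candidate, keeping every candidate."""
    if not 3 <= len(mechanisms) <= 5:
        raise ValueError(f"expected 3-5 hypotheses, got {len(mechanisms)}")
    return [
        f"H{str(index).translate(SUBSCRIPTS)}: {mechanism}"
        for index, mechanism in enumerate(mechanisms, start=1)
    ]


# Illustrative input only; the mechanisms are placeholders.
for line in label_hypotheses(["mechanism A", "mechanism B", "mechanism C"]):
    print(line)
```

The helper deliberately performs no ranking or pruning, which mirrors the rule that no hypothesis is suppressed at this stage.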
For each hypothesis, specify:
TEST METHOD: How to distinguish this from alternatives
DATA NEEDED: What evidence would falsify this
COST: Execution cost (log search, code trace, re-run, etc.)
Order tests from cheapest to most expensive.
| Cost Level | Examples |
|---|---|
| Cheap | Log search, grep, Read existing files |
| Medium | Add diagnostic print, verify data exists, swap tool |
| Expensive | Re-run in fresh context, cross-environment test |
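One way to picture the ordering step is a small record per hypothesis plus a sort keyed on the cost tier. This is only a sketch: the `PlannedTest` class, the `COST_RANK` mapping, and the lowercase tier names are assumptions introduced for illustration, not part of the skill.

```python
from dataclasses import dataclass

# Assumed numeric ranks for the three cost tiers in the table above.
COST_RANK = {"cheap": 0, "medium": 1, "expensive": 2}


@dataclass(frozen=True)
class PlannedTest:
    hypothesis: str   # label such as "H₁"
    method: str       # TEST METHOD: how to distinguish this from alternatives
    data_needed: str  # DATA NEEDED: evidence that would falsify the hypothesis
    cost: str         # COST tier: "cheap", "medium", or "expensive"


def order_by_cost(plans: list[PlannedTest]) -> list[PlannedTest]:
    """Sort cheapest-first; sorted() is stable, so ties keep their input order."""
    return sorted(plans, key=lambda plan: COST_RANK[plan.cost])
```

Because the sort is stable, two tests in the same tier run in the order they were listed, so the analyst still controls sequencing within a tier.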
State which hypotheses remain unfalsified based on available evidence.
Rule: DO NOT converge on a single hypothesis until the alternatives are ruled out.
Format:
Unfalsified after available evidence: H₁, H₂, H₃ (H₄ ruled out by [evidence])
| Remaining Candidates | Confidence Level |
|---|---|
| 1 hypothesis | HIGH (all alternatives rejected) |
| 2-3 hypotheses | MODERATE (list candidates) |
| 4+ hypotheses | LOW (cannot converge without more data) |
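The table above amounts to a simple threshold mapping. The sketch below assumes two hypothetical helpers, `confidence_level` and `unfalsified_line`. Raising an error when no hypothesis remains and joining several ruled-out notes with semicolons are both assumptions, since the source does not cover those cases.

```python
def confidence_level(remaining: int) -> str:
    """Map the count of unfalsified hypotheses to the table's confidence label."""
    if remaining < 1:
        # Assumption: the table has no row for zero remaining candidates.
        raise ValueError("at least one hypothesis must remain unfalsified")
    if remaining == 1:
        return "HIGH"
    if remaining <= 3:
        return "MODERATE"
    return "LOW"


def unfalsified_line(remaining: list[str], ruled_out: dict[str, str]) -> str:
    """Build the 'Unfalsified after available evidence' line."""
    line = "Unfalsified after available evidence: " + ", ".join(remaining)
    if ruled_out:
        notes = "; ".join(
            f"{label} ruled out by {evidence}" for label, evidence in ruled_out.items()
        )
        line += f" ({notes})"
    return line
```

For example, `confidence_level(3)` returns `"MODERATE"`, matching the 2-3 hypotheses row.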
Only after Steps 1-4: Produce structured output with confidence qualifier.
Finding: [brief description of observed symptom]
Hypotheses:
H₁: [specific mechanism]
H₂: [alternative mechanism]
H₃: [third mechanism]
H₄: [fourth mechanism]
Test sequence:
[Cheapest] [test description] → [expected distinguishing evidence]
[Medium] [test description] → [expected distinguishing evidence]
[Expensive] [test description] → [expected distinguishing evidence]
Unfalsified after available evidence: H₁, H₂, ...
Confidence: [HIGH/MODERATE/LOW] ([reason])
Finding: [restated with appropriate confidence qualifier]
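To keep the output in this shape every time, the template can be treated as a record with a render step. The sketch below is illustrative only: the `Report` class and its field names are hypothetical and simply mirror the template's lines in order.

```python
from dataclasses import dataclass


@dataclass
class Report:
    symptom: str                       # first "Finding:" line
    hypotheses: list[str]              # already labeled, e.g. "H₁: ..."
    tests: list[tuple[str, str, str]]  # (cost tier, test, expected evidence)
    unfalsified: list[str]             # labels still standing
    confidence: str                    # HIGH, MODERATE, or LOW
    reason: str                        # short justification for the confidence
    conclusion: str                    # restated finding with qualifier

    def render(self) -> str:
        """Emit the template's lines in the documented order."""
        lines = [f"Finding: {self.symptom}", "Hypotheses:"]
        lines += self.hypotheses
        lines.append("Test sequence:")
        lines += [f"[{tier}] {test} → {evidence}" for tier, test, evidence in self.tests]
        lines.append("Unfalsified after available evidence: " + ", ".join(self.unfalsified))
        lines.append(f"Confidence: {self.confidence} ({self.reason})")
        lines.append(f"Finding: {self.conclusion}")
        return "\n".join(lines)
```

Keeping the symptom and the restated conclusion as separate fields makes it harder to skip the final step of attaching a confidence qualifier.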
Symptoms:
Hypotheses for loops:
Symptoms:
Hypotheses for context degradation:
Symptoms:
Hypotheses for decision inefficiency:
Symptoms:
Hypotheses for cognitive overload:
Symptoms:
Hypotheses for attention drift:
Finding: Lines 41-54 empty Python output
Hypotheses:
H₁: Query succeeded but returned no matching data
H₂: Code path not reached (wrong conditional)
H₃: Stdout capture failed (environment issue)
H₄: Silent failure in code execution
Test sequence:
[Cheapest] Add print("REACHED") before query → check logs for execution confirmation
[Medium] Verify target data exists before querying → ls/Read confirmation
[Medium] Swap diagnostic tool → test with grep/glob alternative
[Expensive] Re-run in fresh terminal context → cross-environment comparison
Unfalsified after available evidence: H₁, H₂, H₃ (H₄ ruled out by RC:0 exit)
Confidence: MODERATE (3 candidates remain; cannot converge without re-execution)
Finding: Model ran empty diagnostic without confirming prerequisites
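The cheapest test in this example, the `print("REACHED")` marker, can be sketched as follows. Everything here is a stand-in: `run_query`, its `enabled` flag, and the `match` key are hypothetical, and the sketch assumes H₄ has already been ruled out as in the example.

```python
import contextlib
import io


def run_query(records: list[dict], enabled: bool) -> None:
    """Hypothetical stand-in for the code under diagnosis."""
    if enabled:               # H₂: this branch may never execute
        print("REACHED")      # marker added by the cheapest test
        for row in records:
            if row.get("match"):
                print(row)    # H₁: branch runs, but nothing matches


def consistent_hypotheses(records: list[dict], enabled: bool) -> list[str]:
    """Return the labels still consistent with the captured output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        run_query(records, enabled)
    lines = buffer.getvalue().splitlines()
    if "REACHED" not in lines:
        # Either the path was skipped, or the capture itself failed.
        return ["H₂", "H₃"]
    if len(lines) == 1:
        # Marker printed and captured, but no rows followed it.
        return ["H₁"]
    return []  # rows were printed, so the empty-output symptom did not reproduce
```

A missing marker narrows the field to H₂ and H₃ without any re-execution in a fresh context, which is why this test runs first.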
DO NOT:
/diagnose: Structured diagnostic protocol for code bugs
/trace: Manual trace-through verification
/gto: Gap/Task/Opportunity analysis
Version: 1.0.0