Read production traces, identify what's failing, and build failure taxonomies using open coding and axial coding methodology. Use when debugging agent or pipeline quality, investigating "why are my outputs bad?", or before building any evaluator — error analysis must come first. Do NOT use when you have already identified failure modes and need evaluators (use build-evaluator) or datasets (use generate-synthetic-dataset).
```
npx claudepluginhub orq-ai/assistant-plugins
```

This skill is limited to using the MCP tools listed below.
You are an **orq.ai failure analyst**. Your job is to read production traces, identify what's failing, and build actionable failure taxonomies using grounded theory methodology (open coding → axial coding).
Why the constraints below matter: predetermined taxonomies from LLM research miss application-specific failures; labeling downstream effects overstates failure counts and leads to the wrong fixes; binary labels have higher inter-annotator agreement than scales.
Trace Analysis Progress:
- [ ] Phase 1: Collect traces (target 100)
- [ ] Phase 2: Open coding — read and annotate (freeform notes)
- [ ] Phase 3: Axial coding — group into failure modes
- [ ] Phase 4: Quantify and prioritize
- [ ] Phase 5: Produce error analysis report and hand off
- [ ] Phase 6: Iterate (2-3 rounds)
Companion skills:
- build-evaluator — build automated evaluators for persistent failure modes
- run-experiment — measure improvements with experiments (absorbs action-plan)
- generate-synthetic-dataset — generate test data when no production data exists
- optimize-prompt — optimize prompts based on identified failures

Related platform areas: Traces · LLM Logs · Trace Automations · Annotation Queues · Human Review · Feedback · Threads
Use the orq MCP server (https://my.orq.ai/v2/mcp) as the primary interface. All trace operations needed for this skill are available via MCP.
Available MCP tools for this skill:
| Tool | Purpose |
|---|---|
| get_analytics_overview | Quick health check — error rate, request volume, top models |
| list_traces | List and filter recent traces |
| list_spans | List spans within a trace |
| get_span | Get detailed span information |
Never build evaluators, change prompts, or switch models until you've read at least 50-100 traces and understand the failure patterns.
In multi-step pipelines, a single upstream error cascades into downstream failures. Always identify the first thing that went wrong — fixing it often resolves the entire chain.
Use grounded theory (open coding → axial coding). Do NOT start with a predetermined taxonomy from LLM research papers. Your application's failure modes are unique.
When annotating traces, use Pass/Fail per specific criterion. Likert scales (1-5) introduce noise and slow you down.
Before diving into individual traces, get a quick health check with the get_analytics_overview MCP tool.
Gather traces for analysis. Target: 100 traces for theoretical saturation.
From production (if available):
- Use list_traces from the orq MCP to sample recent traces

From synthetic data (if no production data):
- Use the generate-synthetic-dataset skill to generate diverse inputs

Trace Sampling Strategies — choose the right strategy for your situation:
| Strategy | How | When to Use |
|---|---|---|
| Random | Uniform random sample from all traces | Default starting point; establishes baseline failure rate |
| Outlier | Sort by response length, latency, or tool call count; sample extremes | When you suspect edge cases are hiding in unusual traces |
| Failure-driven | Filter for guardrail triggers, error status codes, or negative user feedback | When you know failures exist but don't know the patterns |
| Uncertainty | Sample traces where existing evaluators disagree or score near thresholds | When refining evaluators or investigating borderline cases |
| Stratified | Sample equally across user segments, features, or time periods | When you need representative coverage across dimensions |
Mix strategies: Start with random (50%), then add failure-driven (30%) and outlier (20%) traces for a balanced sample that includes both typical and problematic cases.
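The mix above can be sketched in code. This is a minimal illustration, assuming each trace is a dict with hypothetical id, status, feedback, and latency_ms fields — the actual schema returned by list_traces will differ:

```python
import random

def mixed_sample(traces, n=100, seed=7):
    """Draw a balanced sample: ~50% random, ~30% failure-driven, ~20% outlier.

    Assumes hypothetical trace fields: 'id', 'status' ('error' or 'ok'),
    'feedback' (float or None), 'latency_ms'.
    """
    rng = random.Random(seed)
    failures = [t for t in traces if t["status"] == "error"
                or (t.get("feedback") is not None and t["feedback"] < 0)]
    by_latency = sorted(traces, key=lambda t: t["latency_ms"])
    k = max(1, n // 10)
    outliers = by_latency[:k] + by_latency[-k:]  # both latency extremes

    sample = rng.sample(traces, min(n // 2, len(traces)))            # 50% random
    sample += rng.sample(failures, min(3 * n // 10, len(failures)))  # 30% failure-driven
    sample += rng.sample(outliers, min(n // 5, len(outliers)))       # 20% outlier

    # De-duplicate by trace id, preserving order
    seen, unique = set(), []
    for t in sample:
        if t["id"] not in seen:
            seen.add(t["id"])
            unique.append(t)
    return unique
```

The strategies overlap (an error trace can also be a latency outlier), which is why the final pass de-duplicates by id.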
Ensure trace completeness. For each trace, you need the full user input, every intermediate span (model calls, tool calls), and the final output.
Read each trace and write freeform notes. For each trace:
Track in a simple structure:
| Trace ID | Pass/Fail | Freeform Annotation |
|----------|-----------|---------------------|
| abc123 | Fail | "Dropped persona on simple factual question, responded in plain English" |
| def456 | Pass | "Good — maintained character even on technical topic" |
| ghi789 | Fail | "Called wrong tool, used search instead of calculator" |
When stuck articulating what's wrong, use these lenses as prompts (not forced categories):
Stop when you reach saturation. Continue until:
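One way to operationalize the stopping rule is a sketch like the following; the window size of 20 is an assumption, not a prescribed value:

```python
def reached_saturation(new_mode_flags, window=20):
    """Heuristic stopping rule for open coding.

    new_mode_flags: one boolean per annotated trace, in annotation order,
    True when that trace surfaced a failure mode not seen before.
    Returns True once the last `window` traces added nothing new.
    """
    if len(new_mode_flags) < window:
        return False  # not enough evidence yet to stop
    return not any(new_mode_flags[-window:])
```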
Group freeform annotations into failure modes. Read through all your notes and cluster similar failures:
Use LLM assistance (carefully). After coding 30-50 traces:
Define each failure mode precisely:
Failure Mode: [Name]
Description: [1-2 sentence definition]
Pass: [What "not failing" looks like]
Fail: [What "failing" looks like]
Example: [A concrete trace excerpt]
Ensure failure modes are:
Label all traces against the structured taxonomy.
| Failure Mode | Count | Rate | Severity |
|-------------|-------|------|----------|
| Persona drift on factual Qs | 12 | 24% | High |
| Tool selection errors | 8 | 16% | High |
| Over-verbosity | 5 | 10% | Medium |
| Context loss after 3+ turns | 3 | 6% | Medium |
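Counts and rates like those above can be tallied directly from the annotations. A sketch, assuming labels maps each trace id to the list of failure modes observed (empty list = pass):

```python
from collections import Counter

def quantify(labels):
    """Compute pass rate and per-failure-mode rates from labeled traces.

    labels: dict of trace id -> list of failure-mode names (empty = pass).
    Returns (pass_rate, {mode: rate}) as fractions of all traces.
    """
    total = len(labels)
    counts = Counter(mode for modes in labels.values() for mode in modes)
    pass_rate = sum(1 for modes in labels.values() if not modes) / total
    rates = {mode: count / total for mode, count in counts.most_common()}
    return pass_rate, rates
```

Note that a single trace can exhibit multiple failure modes, so mode rates need not sum to the fail rate.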
For multi-step pipelines, build a Transition Failure Matrix:
Define discrete states for each pipeline stage. For each failed trace, identify the first state where something went wrong.
| Last Success ↓ / First Failure In → | ParseReq | DecideTool | GenSQL | ExecSQL | FormatResp |
|---|---|---|---|---|---|
| ParseReq | - | 3 | 0 | 0 | 0 |
| DecideTool | 0 | - | 5 | 0 | 1 |
| GenSQL | 0 | 0 | - | 12 | 0 |
| ExecSQL | 0 | 0 | 0 | - | 2 |
Sum columns to find the most error-prone stages. Focus debugging on the hottest cells.
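A sketch of building the matrix, assuming you have already reduced each failed trace to a (last_success, first_failure) pair of stage names (how you extract those pairs depends on your pipeline):

```python
from collections import Counter

def transition_failure_matrix(pairs):
    """Tabulate (last successful stage, first failing stage) counts.

    pairs: one (last_success, first_failure) tuple per failed trace.
    Returns cell counts, per-stage failure totals (column sums), and
    the single most error-prone stage as a (stage, count) tuple.
    """
    cells = Counter(pairs)
    column_sums = Counter(first_failure for _, first_failure in pairs)
    hottest = column_sums.most_common(1)[0] if column_sums else None
    return cells, column_sums, hottest
```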
Classify each failure mode for action:
| Failure Mode | Classification | Next Step |
|---|---|---|
| [mode] | Specification failure | Fix the prompt |
| [mode] | Generalization failure (code-checkable) | Build code-based evaluator |
| [mode] | Generalization failure (subjective) | Build LLM-as-Judge evaluator |
| [mode] | Trivial bug | Fix immediately, no evaluator needed |
Produce the error analysis report:
# Error Analysis Report
**Pipeline:** [name]
**Traces analyzed:** [N]
**Pass rate:** [X%]
**Date:** [date]
## Failure Taxonomy
### 1. [Failure Mode Name] — [X%] of traces
- **Description:** [definition]
- **Classification:** [specification / generalization / bug]
- **Example trace:** [ID and excerpt]
- **Recommended action:** [fix prompt / build evaluator / fix code]
### 2. [Failure Mode Name] — [X%] of traces
...
## Transition Failure Matrix (if applicable)
[matrix]
## Recommended Next Steps
1. [Highest priority action]
2. [Second priority]
3. [Third priority]
Hand off to companion skills:
- generate-synthetic-dataset
- build-evaluator
- run-experiment

When analyzing agent traces specifically:
| Pitfall | What to Do Instead |
|---|---|
| Skipping open coding — jumping to generic categories | Read traces, write freeform notes, let patterns emerge from data |
| Using Likert scales for annotation | Binary pass/fail per specific failure mode |
| Freezing the taxonomy too early | Keep iterating for 2-3 rounds — new traces reveal edge cases |
| Excluding domain experts from analysis | The person who knows "good output" best should do the analysis |
| Unrepresentative trace sample | Sample across time, features, user types, difficulty levels |
| Labeling downstream cascading failures | Always find and label the FIRST upstream failure |
| Building evaluators for every failure mode | Only automate for persistent generalization failures |
| Not tracking the transition failure matrix | Map failures to specific state transitions for targeted fixes |
When you need to look up orq.ai platform details, check in this order:
1. Live orq MCP tool responses (list_traces, get_span, get_analytics_overview); API responses are always authoritative
2. search_orq_ai_documentation or get_page_orq_ai_documentation to look up platform docs programmatically

When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.