Analyzes AgentV evaluation traces and JSONL result files using the `agentv trace` and `agentv compare` CLI commands to inspect eval results, detect regressions, identify failure patterns, analyze tool trajectories, and compute cost/latency/score stats.
Install the plugin with `npx claudepluginhub entityprocess/agentv --plugin agentv-dev`. This skill uses the workspace's default tool permissions.
Analyze evaluation traces headlessly using `agentv trace` primitives and `jq`.
# List result files (most recent first)
agentv trace list [--limit N] [--format json|table]
# Show results with trace details
agentv trace show <result-file> [--test-id <id>] [--tree] [--format json|table]
# Percentile statistics
agentv trace stats <result-file> [--group-by target|suite|test-id] [--format json|table]
# A/B comparison between runs
agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold 0.1] [--format json|table]
agentv trace list
Pick the result file to analyze; the most recent run is listed first.
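When scripting this step, the flags from the reference above can trim the listing to just the newest entry:
# Only the most recent result file, as JSON
agentv trace list --limit 1 --format json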
agentv trace stats <result-file>
Read the percentile table. Key signals: the score distribution, the P50 vs P99 latency spread, and the cost per test.
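For a quick health check before drilling into individual tests, the raw results can be summarized directly. This is a sketch that uses only fields appearing in the recipes below (`score`, `assertions[].passed`):
# Test count, mean score, and how many tests passed every assertion
agentv trace show <result-file> --format json \
  | jq '{tests: length, mean_score: ([.[].score] | add / length), fully_passing: ([.[] | select(all(.assertions[]?; .passed))] | length)}'
Then pull out the failing tests (here, score below 0.8) together with their failed assertions, tool usage, duration, and cost: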
agentv trace show <result-file> --format json \
  | jq '[.[] | select(.score < 0.8) | {test_id, score, assertions: [.assertions[] | select(.passed | not)], trace: {tools: (.trace.tool_calls | keys)}, duration_ms, cost_usd}]'
For each failing test, drill into the individual run to see where it went wrong.
# Flat view with trace summary
agentv trace show <result-file> --test-id <id>
# Tree view (if output messages available)
agentv trace show <result-file> --test-id <id> --tree
The tree view shows the agent's execution path: LLM calls interspersed with tool invocations. Look for signals such as repeated or failed tool calls and detours that add cost without contributing to the answer.
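To see which tools a particular test actually invoked without rendering the whole tree, a quick filter over the raw results works; the fields match the recipes below, and `<test-id>` is a placeholder:
# Tool-call counts for a single test (tool_calls maps tool name -> call count)
agentv trace show <result-file> --format json \
  | jq --arg id "<test-id>" '.[] | select(.test_id == $id) | .trace.tool_calls // {}'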
agentv compare <baseline.jsonl> <candidate.jsonl>
Look for per-test deltas: regressions (negative delta), improvements, and whether the overall mean is driven by a few large swings.
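A compact summary of the comparison can be computed from the JSON output; the field names follow the regression recipe further down (`matched`, `delta`):
# Mean delta plus regression/improvement counts
agentv compare <baseline.jsonl> <candidate.jsonl> --format json \
  | jq '{mean_delta: ([.matched[].delta] | add / length), regressions: ([.matched[] | select(.delta < 0)] | length), improvements: ([.matched[] | select(.delta > 0)] | length)}'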
# By target provider
agentv trace stats <result-file> --group-by target
# By suite
agentv trace stats <result-file> --group-by suite
Compare providers side-by-side: which is cheaper, faster, more accurate?
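The same comparison can be aggregated from the raw results when totals are also useful. Note that `.target` is an assumed field name for the provider on each test record; check the actual JSON keys first:
# Per-target mean score and total cost (.target is an assumption, adjust to the real key)
agentv trace show <result-file> --format json \
  | jq 'group_by(.target) | .[] | {target: .[0].target, tests: length, avg_score: ([.[].score] | add / length), total_cost: ([.[].cost_usd] | add)}'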
All commands support `--format json` for piping to `jq`:
# Top 3 most expensive tests
agentv trace show <result-file> --format json \
| jq 'sort_by(-.cost_usd) | .[0:3] | .[] | {test_id, cost: .cost_usd, score}'
# Tests where token usage exceeds 10k
agentv trace show <result-file> --format json \
| jq '[.[] | select(.token_usage.input + .token_usage.output > 10000) | {test_id, tokens: (.token_usage.input + .token_usage.output)}]'
# Score distribution by suite
agentv trace show <result-file> --format json \
| jq 'group_by(.suite) | .[] | {suite: .[0].suite, count: length, avg_score: ([.[].score] | add / length)}'
# Tool usage frequency across all tests
agentv trace show <result-file> --format json \
| jq '[.[].trace.tool_calls // {} | to_entries[]] | group_by(.key) | .[] | {tool: .[0].key, total_calls: ([.[].value] | add)}'
# Find regressions > 0.1 between two runs
agentv compare baseline.jsonl candidate.jsonl --format json \
| jq '.matched[] | select(.delta < -0.1) | {test_id: .testId, delta, from: .score1, to: .score2}'
When analyzing traces, think about:
Efficiency: Are tool calls and tokens proportional to task complexity? A high tokens-per-tool-call ratio may indicate verbose prompts or unnecessary context (see the sketch after this list).
Error patterns: Do failures cluster by target, suite, or tool usage? Clustered failures usually share a root cause, such as a single provider, one suite's prompts, or a misbehaving tool.
Cost optimization: Identify tests with high cost but acceptable scores — can they use a cheaper model? Compare --group-by target stats.
Latency distribution: P50 vs P99 spread indicates consistency. Large spread means unpredictable performance — investigate P99 outliers.
Regression detection: After a prompt/config change, compare before/after. Mean delta > 0 is good, but check individual test regressions — a few large losses can hide behind many small wins.
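A minimal sketch of the efficiency check mentioned above, assuming (as in the tool-usage recipe) that `trace.tool_calls` maps each tool name to a call count:
# Tokens per tool call, highest first (skips tests with no tool calls)
agentv trace show <result-file> --format json \
  | jq '[.[] | select((.trace.tool_calls // {}) != {}) | {test_id, tokens: (.token_usage.input + .token_usage.output), calls: ([.trace.tool_calls[]] | add)} | . + {tokens_per_call: (.tokens / .calls)}] | sort_by(-.tokens_per_call)'
Tests at the top of that list are the first candidates for trimming context or tightening prompts.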