Analyzes AgentV evaluation traces and JSONL result files using the `agentv trace` and `agentv compare` CLI commands to inspect eval results, detect regressions, identify failure patterns, analyze tool trajectories, and compute cost/latency/score stats.
Install the plugin with `npx claudepluginhub entityprocess/agentv --plugin agentv-dev`. This skill uses the workspace's default tool permissions.
Analyze evaluation traces headlessly using `agentv trace` primitives and `jq`.
# List result files (most recent first)
agentv trace list [--limit N] [--format json|table]
# Show results with trace details
agentv trace show <result-file> [--test-id <id>] [--tree] [--format json|table]
# Percentile statistics
agentv trace stats <result-file> [--group-by target|suite|test-id] [--format json|table]
# A/B comparison between runs
agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold 0.1] [--format json|table]
agentv trace list
Pick the result file to analyze; the most recent run is listed first.
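When scripting this step, the flags from the reference above can trim the listing to just the newest entry:
# Only the most recent result file, as JSON
agentv trace list --limit 1 --format json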
agentv trace stats <result-file>
Read the percentile table. Key signals: the score distribution, the P50 vs P99 latency spread, and the cost per test.
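For a quick health check before drilling into individual tests, the raw results can be summarized directly. This is a sketch that uses only fields appearing in the recipes below (`score`, `assertions[].passed`):
# Test count, mean score, and how many tests passed every assertion
agentv trace show <result-file> --format json \
  | jq '{tests: length, mean_score: ([.[].score] | add / length), fully_passing: ([.[] | select(all(.assertions[]?; .passed))] | length)}'
Then pull out the failing tests (here, score below 0.8) together with their failed assertions, tool usage, duration, and cost: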
agentv trace show <result-file> --format json \
  | jq '[.[] | select(.score < 0.8) | {test_id, score, assertions: [.assertions[] | select(.passed | not)], trace: {tools: (.trace.tool_calls | keys)}, duration_ms, cost_usd}]'
For each failing test, drill into the individual run to see where it went wrong.
# Flat view with trace summary
agentv trace show <result-file> --test-id <id>
# Tree view (if output messages available)
agentv trace show <result-file> --test-id <id> --tree
The tree view shows the agent's execution path: LLM calls interspersed with tool invocations. Look for signals such as repeated or failed tool calls and detours that add cost without contributing to the answer.
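To see which tools a particular test actually invoked without rendering the whole tree, a quick filter over the raw results works; the fields match the recipes below, and `<test-id>` is a placeholder:
# Tool-call counts for a single test (tool_calls maps tool name -> call count)
agentv trace show <result-file> --format json \
  | jq --arg id "<test-id>" '.[] | select(.test_id == $id) | .trace.tool_calls // {}'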
agentv compare <baseline.jsonl> <candidate.jsonl>
Look for per-test deltas: regressions (negative delta), improvements, and whether the overall mean is driven by a few large swings.
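A compact summary of the comparison can be computed from the JSON output; the field names follow the regression recipe further down (`matched`, `delta`):
# Mean delta plus regression/improvement counts
agentv compare <baseline.jsonl> <candidate.jsonl> --format json \
  | jq '{mean_delta: ([.matched[].delta] | add / length), regressions: ([.matched[] | select(.delta < 0)] | length), improvements: ([.matched[] | select(.delta > 0)] | length)}'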
# By target provider
agentv trace stats <result-file> --group-by target
# By suite
agentv trace stats <result-file> --group-by suite
Compare providers side-by-side: which is cheaper, faster, more accurate?
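The same comparison can be aggregated from the raw results when totals are also useful. Note that `.target` is an assumed field name for the provider on each test record; check the actual JSON keys first:
# Per-target mean score and total cost (.target is an assumption, adjust to the real key)
agentv trace show <result-file> --format json \
  | jq 'group_by(.target) | .[] | {target: .[0].target, tests: length, avg_score: ([.[].score] | add / length), total_cost: ([.[].cost_usd] | add)}'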
All commands support `--format json` for piping to `jq`:
# Top 3 most expensive tests
agentv trace show <result-file> --format json \
| jq 'sort_by(-.cost_usd) | .[0:3] | .[] | {test_id, cost: .cost_usd, score}'
# Tests where token usage exceeds 10k
agentv trace show <result-file> --format json \
| jq '[.[] | select(.token_usage.input + .token_usage.output > 10000) | {test_id, tokens: (.token_usage.input + .token_usage.output)}]'
# Score distribution by suite
agentv trace show <result-file> --format json \
| jq 'group_by(.suite) | .[] | {suite: .[0].suite, count: length, avg_score: ([.[].score] | add / length)}'
# Tool usage frequency across all tests
agentv trace show <result-file> --format json \
| jq '[.[].trace.tool_calls // {} | to_entries[]] | group_by(.key) | .[] | {tool: .[0].key, total_calls: ([.[].value] | add)}'
# Find regressions > 0.1 between two runs
agentv compare baseline.jsonl candidate.jsonl --format json \
| jq '.matched[] | select(.delta < -0.1) | {test_id: .testId, delta, from: .score1, to: .score2}'
When analyzing traces, think about:
Efficiency: Are tool calls and tokens proportional to task complexity? A high tokens-per-tool-call ratio may indicate verbose prompts or unnecessary context (see the sketch after this list).
Error patterns: Do failures cluster by target, suite, or tool usage? Clustered failures usually share a root cause, such as a single provider, one suite's prompts, or a misbehaving tool.
Cost optimization: Identify tests with high cost but acceptable scores — can they use a cheaper model? Compare --group-by target stats.
Latency distribution: P50 vs P99 spread indicates consistency. Large spread means unpredictable performance — investigate P99 outliers.
Regression detection: After a prompt/config change, compare before/after. Mean delta > 0 is good, but check individual test regressions — a few large losses can hide behind many small wins.
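A minimal sketch of the efficiency check mentioned above, assuming (as in the tool-usage recipe) that `trace.tool_calls` maps each tool name to a call count:
# Tokens per tool call, highest first (skips tests with no tool calls)
agentv trace show <result-file> --format json \
  | jq '[.[] | select((.trace.tool_calls // {}) != {}) | {test_id, tokens: (.token_usage.input + .token_usage.output), calls: ([.trace.tool_calls[]] | add)} | . + {tokens_per_call: (.tokens / .calls)}] | sort_by(-.tokens_per_call)'
Tests at the top of that list are the first candidates for trimming context or tightening prompts.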