Help us improve
Share bugs, ideas, or general feedback.
From agentops-toolkit
Read, regenerate, and explain AgentOps evaluation reports. Trigger on "show report", "explain scores", "regenerate report", "what do these metrics mean". Operates on results.json and report.md produced by `agentops eval run`.
npx claudepluginhub azure/agentops --plugin agentops-acceleratorHow this skill is triggered — by the user, by Claude, or both
Slash command
/agentops-toolkit:agentops-reportThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Help the user understand a finished AgentOps run.
Guides technical evaluation of code review feedback: read fully, restate for understanding, verify against codebase, respond with reasoning or pushback before implementing.
Share bugs, ideas, or general feedback.
Help the user understand a finished AgentOps run.
Latest run: .agentops/results/latest/. Each run produces:
results.json - machine-readable metrics, per-row scores, thresholds.report.md - human-readable summary suitable for PR comments.cloud_evaluation.json (only when publish: true) - deep-link to
the Foundry Evaluations panel. mode: classic when execution: local
(metrics uploaded to Classic Foundry), mode: cloud when
execution: cloud (preview, server-side run via the OpenAI Evals API).agentops report generate # uses .agentops/results/latest/results.json
agentops report generate --in <results.json> --out <report.md>
report generate always reads the flat 1.0 results schema and emits
Markdown. There is no HTML format.
Common metrics and their meaning:
| Metric | Range | Higher is better? | Notes |
|---|---|---|---|
similarity | 1-5 | yes | LLM-judged similarity to expected. |
coherence | 1-5 | yes | Answer is internally consistent. |
fluency | 1-5 | yes | Natural language quality. |
groundedness | 1-5 | yes | Answer is supported by context (RAG). |
relevance | 1-5 | yes | Answer is on-topic for input. |
f1_score | 0-1 | yes | Token overlap with expected. |
tool_call_accuracy | 0-1 | yes | Predicted tool calls match tool_calls. |
intent_resolution | 0-1 | yes | User intent was resolved. |
task_completion | 0-1 | yes | Multi-step task finished. |
avg_latency_seconds | seconds | no | Wall-clock latency per row. |
Pass/fail rows are derived from thresholds: in agentops.yaml. The
exit code of the original run reflects the gate:
0 → all thresholds passed2 → one or more thresholds failed1 → runtime errorresults.json (row_metrics[] and item_evaluations[]) and
suggest concrete prompt or retrieval changes.run_metrics.avg_latency_seconds and
per-row latency.agentops eval run --baseline <previous-results.json> and explain the
generated Comparison vs Baseline section.results.json by hand - re-run the eval.