Help us improve
Share bugs, ideas, or general feedback.
From agentops-toolkit
Run AgentOps evaluations end-to-end against any agent (Foundry hosted/prompt agent, HTTP/JSON endpoint, or raw model deployment). Trigger on phrases like "run eval", "evaluate my agent", "benchmark", "agentops eval", "compare runs". Uses the flat agentops.yaml schema.
npx claudepluginhub azure/agentops --plugin agentops-acceleratorHow this skill is triggered — by the user, by Claude, or both
Slash command
/agentops-toolkit:agentops-evalThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
End-to-end workflow: install → init → configure → run → read report.
Guides technical evaluation of code review feedback: read fully, restate for understanding, verify against codebase, respond with reasoning or pushback before implementing.
Share bugs, ideas, or general feedback.
End-to-end workflow: install → init → configure → run → read report.
pip install "agentops-toolkit[foundry] @ git+https://github.com/Azure/agentops.git@develop".agentops.yaml does not exist at the project root, run agentops init.
The init wizard prompts (azd-style) for the Foundry project endpoint,
agent reference, and dataset path, persists each answer to
.azure/<env>/.env + agentops.yaml as it goes, and installs coding
skills. Pass --no-prompt plus the explicit flags
(--project-endpoint, --agent, --dataset, …) for non-interactive
runs. Run agentops init show later to inspect the resolved config.Run the deterministic local triage first:
agentops eval analyze
Use its output to decide whether the repo is ready for agentops eval run or
needs skill-assisted setup. If it recommends agentops-config, fix the target
and protocol. If it recommends agentops-dataset, create/map realistic JSONL
rows. If it recommends agentops-eval, inspect the app scenario and evaluator
expectations before running.
Read the codebase (README, entry point, env vars) and pick the right value
for the agent: field of agentops.yaml:
| Pattern in code / env | agent: value |
|---|---|
AIProjectClient, azure-ai-projects, Foundry agent ID like name:1 | "<name>:<version>" (Foundry prompt agent) |
Foundry hosted agent endpoint URL ending in /agents/... | "https://<resource>.services.ai.azure.com/api/projects/<p>/agents/..." |
| Plain HTTP/JSON endpoint (FastAPI, Express, ACA, AKS) | "https://<host>/<path>" |
| Raw Foundry/Azure OpenAI model deployment | "model:<deployment-name>" |
If nothing is found, ask the user once for the agent identifier.
agentops.yaml points to a JSONL file (default
.agentops/data/smoke.jsonl). Each row needs at least input and a label
that maps to the metric you care about (expected, context,
tool_calls...). If the dataset is empty or unrelated, run the
agentops-dataset skill before running the eval.
agentops eval run
Optional flags:
--config <path> - point at a different agentops.yaml.--output <dir> - choose where to write results.json and report.md
(defaults to .agentops/results/<timestamp>/).Exit codes:
0 - succeeded and all thresholds passed2 - succeeded but at least one threshold failed (gate-friendly)1 - runtime/configuration erroragentops report generate # regenerate report.md from latest results.json
agentops report generate --in <results.json>
Open .agentops/results/latest/report.md. To compare two runs, hand both
results.json files to the user or run the next eval with
--baseline <previous-results.json> so AgentOps adds a Comparison vs
Baseline section to the report.
For production promotion, generate the Doctor evidence pack:
agentops doctor --evidence-pack
Open .agentops/release/latest/evidence.md. It summarizes eval, baseline,
Doctor, workflow, Foundry, monitoring, AI Landing Zone, and trace-regression
readiness without creating a second exit-code contract.
If the user has exported Foundry/App Insights traces, preview candidate regression rows first:
agentops eval promote-traces --source <traces.jsonl>
Only write files after review:
agentops eval promote-traces --source <traces.jsonl> --apply
Default self-similarity labels are for drift detection, not human-verified
ground truth. Use --label-mode pending when reviewers must fill expected
answers before the dataset gates releases.
Two modes are supported. Both write a deep-link into
.agentops/results/latest/cloud_evaluation.json and require
AZURE_AI_FOUNDRY_PROJECT_ENDPOINT (or the inline project_endpoint).
Classic Foundry Evaluations panel (works for any target kind): AgentOps runs locally first, then uploads the metrics it computed.
execution: local
publish: true
# project_endpoint: "https://<resource>.services.ai.azure.com/api/projects/<p>"
New Foundry Evaluations panel (preview): Foundry runs the agent +
evaluators server-side via the OpenAI Evals API. Only works for
name:version Foundry agents. publish is implicit - a cloud run is
always recorded by Foundry. The local JSONL remains the dataset source of
truth; AgentOps syncs it to Foundry Data/Datasets by default and uses that
dataset version in the Evals run.
execution: cloud
# project_endpoint: "https://<resource>.services.ai.azure.com/api/projects/<p>"
dataset_sync:
mode: auto # sync local JSONL to Foundry Data/Datasets
With execution: local and no publish: true, AgentOps runs locally
and only writes local artifacts.
After a cloud run, inspect .agentops/results/latest/cloud_evaluation.json.
Its dataset block explains whether the run used inline rows or a Foundry
dataset reference.
evaluators: block - most users do
not need it.thresholds: to gate CI:
thresholds:
coherence: ">=3"
avg_latency_seconds: "<=10"
auth_header_env: MY_TOKEN_VAR and AgentOps adds
Authorization: Bearer $MY_TOKEN_VAR.