The open-source behavior regression gate for AI agents.
Think Playwright, but for tool-calling and multi-turn AI agents.
Your agent can still return 200 and be wrong. A model or provider update can change tool choice, skip a clarification, or degrade output quality without changing your code or breaking a health check. EvalView catches those silent regressions before users do — and gives you the loop to investigate them, grade the confidence, and broadcast the verdict to your team.
You don't need frontier-lab resources to run a serious agent regression loop. EvalView gives solo devs, startups, and small AI teams the same core discipline: snapshot behavior, detect drift, classify changes, and review or heal them safely.
Traditional tests tell you if your agent is up. EvalView tells you if it still behaves correctly. It tracks drift across outputs, tools, model IDs, and runtime fingerprints with graded confidence — not a binary alarm — so you can tell "the provider changed" from "my system regressed."

30-second live demo.
Most eval tools stop at detect and compare. EvalView helps you classify changes, inspect drift, and auto-heal the safe cases.
- Catch silent regressions that normal tests miss
- Separate provider/model drift from real system regressions
- Auto-heal flaky failures with retries, review gates, and audit logs
- Replay deterministically — cassettes capture real tool calls once so CI never re-hits live services
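The cassette idea is simple enough to sketch generically: record each real tool call once, then serve the recorded result on every later run. This is an illustrative sketch of the pattern, not EvalView's actual cassette format or API; the class and file names are made up.

```python
import json
from pathlib import Path

class Cassette:
    """Record a tool call's real result once, then replay it on later runs.

    A generic sketch of the record/replay pattern; EvalView's on-disk
    format and internals are not shown here.
    """

    def __init__(self, path):
        self.path = Path(path)
        self.calls = json.loads(self.path.read_text()) if self.path.exists() else {}

    def call(self, tool_name, args, live_fn):
        key = f"{tool_name}:{json.dumps(args, sort_keys=True)}"
        if key not in self.calls:            # first run: hit the live service
            self.calls[key] = live_fn(**args)
            self.path.write_text(json.dumps(self.calls, indent=2))
        return self.calls[key]               # later runs: replay from disk

# Usage: the first call records; every subsequent call replays from the file,
# so CI never re-hits the live service.
cassette = Cassette("lookup_order.cassette.json")
result = cassette.call("lookup_order", {"order_id": "A42"},
                       live_fn=lambda order_id: {"status": "shipped"})
```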
Built with frontier-lab rigor and startup-team practicality:
- Targeted behavior runs instead of giant always-on eval suites
- Deterministic diffs first, LLM judgment where it adds signal
- Faster loops from change → eval → review → ship
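"Deterministic diffs first" can be sketched in a few lines: compare the ordered tool calls two runs made, and only escalate to an LLM judge when this cheap, reproducible check still finds a difference. This is a generic illustration, not EvalView's internal implementation.

```python
from difflib import unified_diff

def diff_tool_sequence(baseline, current):
    """Deterministic first pass: diff the ordered tool calls of two runs.

    Returns None when the sequences match, otherwise a unified diff.
    No LLM involved; only still-differing runs need judged evaluation.
    """
    if baseline == current:
        return None
    return "\n".join(unified_diff(baseline, current,
                                  fromfile="baseline", tofile="current",
                                  lineterm=""))

delta = diff_tool_sequence(
    ["lookup_order", "check_policy", "process_refund"],
    ["lookup_order", "check_policy", "process_refund", "escalate_to_human"],
)
# delta is a unified diff showing the added escalate_to_human step.
```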
How we run EvalView with this operating model →
✓ login-flow PASSED
⚠ refund-request TOOLS_CHANGED
- lookup_order → check_policy → process_refund
+ lookup_order → check_policy → process_refund → escalate_to_human
✗ billing-dispute REGRESSION -30 pts
Score: 85 → 55 Output similarity: 35%
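A similarity percentage like the one above can come from any deterministic text metric. A minimal sketch using Python's difflib, which is one plausible choice and not necessarily the metric EvalView uses:

```python
from difflib import SequenceMatcher

def output_similarity(baseline: str, current: str) -> int:
    """Percent similarity between two agent outputs, 0..100.

    SequenceMatcher.ratio() is a cheap, deterministic baseline metric;
    a real harness might normalize whitespace or compare token-by-token.
    """
    return round(SequenceMatcher(None, baseline, current).ratio() * 100)

# Identical outputs score 100; unrelated outputs score near 0.
score = output_similarity("Refund approved for order A42.",
                          "I've escalated this dispute to a human agent.")
```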
The money screen is the one-line verdict that lands under every check — a single ship/don't-ship decision derived from the diff, quarantine state, cost delta, and drift confidence:
─────────────────────────────────────────────
VERDICT: 🛑 BLOCK RELEASE
─────────────────────────────────────────────
• 1 regression: billing-dispute
• 1 test changed behavior: refund-request
• Cost up 14% vs baseline
Likely cause & next actions:
1. Rerun statistically to distinguish flake from real drift
(high severity, high confidence)
→ evalview check --statistical 5
2. Review tool descriptions for: escalate_to_human
(high severity, high confidence)
Tool selection changed — usually a prompt edit nudged the model
→ evalview replay refund-request --trace
→ evalview golden update refund-request # if the new path is correct
Four tiers: SAFE_TO_SHIP, SHIP_WITH_QUARANTINE, INVESTIGATE, BLOCK_RELEASE. The verdict is part of --json output, the PR comment, and the cloud ship page — CLI, CI, and dashboard all tell the same story.
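Because the verdict ships in the --json output, a CI script can gate on the tier directly. The "verdict"/"tier" field names below are assumptions for illustration; confirm the real JSON shape from your own report before wiring this in.

```python
import json

# Tiers that should fail the CI job (an assumed policy: quarantined ships pass).
BLOCKING = {"BLOCK_RELEASE", "INVESTIGATE"}

def gate(report: dict) -> int:
    """Return a process exit code: 0 lets the release proceed, 1 stops it."""
    tier = report["verdict"]["tier"]   # hypothetical field path, for illustration
    return 1 if tier in BLOCKING else 0

# In CI you would load the report from `evalview check --json` output;
# here a hard-coded sample stands in for it.
sample = json.loads('{"verdict": {"tier": "BLOCK_RELEASE"}}')
exit_code = gate(sample)
```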
Quick Start
pip install evalview
evalview init # Detect agent, auto-configure profile + starter suite
evalview snapshot # Save current behavior as baseline
evalview check # Catch regressions after every change
That's it. Three commands to regression-test any AI agent. init auto-detects your agent type (chat, tool-use, multi-step, RAG, coding) and configures the right evaluators, thresholds, and assertions.
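The same check can run on every pull request. A minimal GitHub Actions sketch, assuming the baseline snapshot and cassettes are committed to the repo and that evalview check exits nonzero on a blocking verdict (an assumption to confirm against the docs):

```yaml
name: evalview
on: pull_request
jobs:
  regression-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install evalview
      # Replays from committed cassettes, so CI never hits live providers.
      - run: evalview check --json > evalview-report.json
```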
After check, the investigative loop: