The open-source behavior regression gate for AI agents.
Think Playwright, but for tool-calling and multi-turn AI agents.
Your agent can still return 200 and be wrong. A model or provider update can change tool choice, skip a clarification, or degrade output quality without changing your code or breaking a health check. EvalView catches those silent regressions before users do — and gives you the loop to investigate them, grade the confidence, and broadcast the verdict to your team.
You don't need frontier-lab resources to run a serious agent regression loop. EvalView gives solo devs, startups, and small AI teams the same core discipline: snapshot behavior, detect drift, classify changes, and review or heal them safely.
Traditional tests tell you if your agent is up. EvalView tells you if it still behaves correctly. It tracks drift across outputs, tools, model IDs, and runtime fingerprints with graded confidence — not a binary alarm — so you can tell "the provider changed" from "my system regressed."

30-second live demo.
Most eval tools stop at detect and compare. EvalView helps you classify changes, inspect drift, and auto-heal the safe cases.
- Catch silent regressions that normal tests miss
- Separate provider/model drift from real system regressions
- Auto-heal flaky failures with retries, review gates, and audit logs
- Replay deterministically — cassettes capture real tool calls once so CI never re-hits live services
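The cassette idea is simple enough to sketch generically: record each real tool call once, then serve the recorded result on every later run. This is an illustrative sketch of the pattern, not EvalView's actual cassette format or API; the class and file names are made up.

```python
import json
from pathlib import Path

class Cassette:
    """Record a tool call's real result once, then replay it on later runs.

    A generic sketch of the record/replay pattern; EvalView's on-disk
    format and internals are not shown here.
    """

    def __init__(self, path):
        self.path = Path(path)
        self.calls = json.loads(self.path.read_text()) if self.path.exists() else {}

    def call(self, tool_name, args, live_fn):
        key = f"{tool_name}:{json.dumps(args, sort_keys=True)}"
        if key not in self.calls:            # first run: hit the live service
            self.calls[key] = live_fn(**args)
            self.path.write_text(json.dumps(self.calls, indent=2))
        return self.calls[key]               # later runs: replay from disk

# Usage: the first call records; every subsequent call replays from the file,
# so CI never re-hits the live service.
cassette = Cassette("lookup_order.cassette.json")
result = cassette.call("lookup_order", {"order_id": "A42"},
                       live_fn=lambda order_id: {"status": "shipped"})
```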
Built with frontier-lab rigor and startup-team practicality:
- Targeted behavior runs instead of giant always-on eval suites
- Deterministic diffs first, LLM judgment where it adds signal
- Faster loops from change → eval → review → ship
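"Deterministic diffs first" can be sketched in a few lines: compare the ordered tool calls two runs made, and only escalate to an LLM judge when this cheap, reproducible check still finds a difference. This is a generic illustration, not EvalView's internal implementation.

```python
from difflib import unified_diff

def diff_tool_sequence(baseline, current):
    """Deterministic first pass: diff the ordered tool calls of two runs.

    Returns None when the sequences match, otherwise a unified diff.
    No LLM involved; only still-differing runs need judged evaluation.
    """
    if baseline == current:
        return None
    return "\n".join(unified_diff(baseline, current,
                                  fromfile="baseline", tofile="current",
                                  lineterm=""))

delta = diff_tool_sequence(
    ["lookup_order", "check_policy", "process_refund"],
    ["lookup_order", "check_policy", "process_refund", "escalate_to_human"],
)
# delta is a unified diff showing the added escalate_to_human step.
```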
How we run EvalView with this operating model →
✓ login-flow PASSED
⚠ refund-request TOOLS_CHANGED
- lookup_order → check_policy → process_refund
+ lookup_order → check_policy → process_refund → escalate_to_human
✗ billing-dispute REGRESSION -30 pts
Score: 85 → 55 Output similarity: 35%
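A similarity percentage like the one above can come from any deterministic text metric. A minimal sketch using Python's difflib, which is one plausible choice and not necessarily the metric EvalView uses:

```python
from difflib import SequenceMatcher

def output_similarity(baseline: str, current: str) -> int:
    """Percent similarity between two agent outputs, 0..100.

    SequenceMatcher.ratio() is a cheap, deterministic baseline metric;
    a real harness might normalize whitespace or compare token-by-token.
    """
    return round(SequenceMatcher(None, baseline, current).ratio() * 100)

# Identical outputs score 100; unrelated outputs score near 0.
score = output_similarity("Refund approved for order A42.",
                          "I've escalated this dispute to a human agent.")
```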
The money screen is the one-line verdict that lands under every check — a single ship/don't-ship decision derived from the diff, quarantine state, cost delta, and drift confidence:
─────────────────────────────────────────────
VERDICT: 🛑 BLOCK RELEASE
─────────────────────────────────────────────
• 1 regression: billing-dispute
• 1 test changed behavior: refund-request
• Cost up 14% vs baseline
Likely cause & next actions:
1. Rerun statistically to distinguish flake from real drift
(high severity, high confidence)
→ evalview check --statistical 5
2. Review tool descriptions for: escalate_to_human
(high severity, high confidence)
Tool selection changed — usually a prompt edit nudged the model
→ evalview replay refund-request --trace
→ evalview golden update refund-request # if the new path is correct
Four tiers: SAFE_TO_SHIP, SHIP_WITH_QUARANTINE, INVESTIGATE, BLOCK_RELEASE. The verdict is part of --json output, the PR comment, and the cloud ship page — CLI, CI, and dashboard all tell the same story.
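Because the verdict ships in the --json output, a CI script can gate on the tier directly. The "verdict"/"tier" field names below are assumptions for illustration; confirm the real JSON shape from your own report before wiring this in.

```python
import json

# Tiers that should fail the CI job (an assumed policy: quarantined ships pass).
BLOCKING = {"BLOCK_RELEASE", "INVESTIGATE"}

def gate(report: dict) -> int:
    """Return a process exit code: 0 lets the release proceed, 1 stops it."""
    tier = report["verdict"]["tier"]   # hypothetical field path, for illustration
    return 1 if tier in BLOCKING else 0

# In CI you would load the report from `evalview check --json` output;
# here a hard-coded sample stands in for it.
sample = json.loads('{"verdict": {"tier": "BLOCK_RELEASE"}}')
exit_code = gate(sample)
```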
Quick Start
pip install evalview
evalview init # Detect agent, auto-configure profile + starter suite
evalview snapshot # Save current behavior as baseline
evalview check # Catch regressions after every change
That's it. Three commands to regression-test any AI agent. init auto-detects your agent type (chat, tool-use, multi-step, RAG, coding) and configures the right evaluators, thresholds, and assertions.
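The same check can run on every pull request. A minimal GitHub Actions sketch, assuming the baseline snapshot and cassettes are committed to the repo and that evalview check exits nonzero on a blocking verdict (an assumption to confirm against the docs):

```yaml
name: evalview
on: pull_request
jobs:
  regression-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install evalview
      # Replays from committed cassettes, so CI never hits live providers.
      - run: evalview check --json > evalview-report.json
```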
After check, the investigative loop: