npx claudepluginhub hidai25/eval-view
This skill uses the workspace's default tool permissions.
Use this skill when the user wants continuous regression monitoring during development. Watch mode observes file changes and automatically re-runs `evalview check` with debounced triggers.
EvalView's watch mode uses watchdog to monitor directories for file changes (.py, .yaml, .yml, .json, .md, .txt, .toml, .cfg, .ini). When a change is detected, it runs a regression check via the gate() API and displays a live scorecard with pass/fail status, score deltas, tool changes, and streak tracking.
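The extension filter and debounce behavior described above can be sketched in plain Python. This is a minimal illustration and not EvalView's actual implementation: `is_watched`, `Debouncer`, and the injectable `clock` parameter are hypothetical names, and the case-insensitive extension match is an assumption. Only the extension list and the 2.0-second default come from this page.

```python
import time
from pathlib import Path

# Extensions taken from the list above.
WATCHED_EXTENSIONS = {".py", ".yaml", ".yml", ".json", ".md", ".txt", ".toml", ".cfg", ".ini"}


def is_watched(path: str) -> bool:
    """Return True if the file extension should trigger a check.

    Lowercasing the suffix is an assumption made for this sketch.
    """
    return Path(path).suffix.lower() in WATCHED_EXTENSIONS


class Debouncer:
    """Hypothetical helper: coalesce a burst of change events into one run."""

    def __init__(self, interval: float = 2.0, clock=time.monotonic):
        self.interval = interval
        self._clock = clock
        self._last_event = None  # time of the most recent relevant change

    def notify(self, path: str) -> None:
        """Record a change event; files with other extensions are ignored."""
        if is_watched(path):
            self._last_event = self._clock()

    def ready(self) -> bool:
        """True once `interval` seconds have passed with no further changes."""
        if self._last_event is None:
            return False
        if self._clock() - self._last_event >= self.interval:
            self._last_event = None  # reset so one burst triggers one run
            return True
        return False
```

In a real watcher, a file-system event handler would call `notify()` for each change and a loop would poll `ready()` before starting a check. Passing the clock in as a parameter keeps the timing logic testable without sleeping.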
Watch mode is a CLI command (not an MCP tool). Help the user run it:
evalview watch
--quick: Skip LLM judge, deterministic checks only ($0 cost, sub-second)
--path src/ --path tests/: Watch specific directories (default: current directory)
--test "my-test": Only check a specific test by name
--test-dir tests/evalview: Path to test cases directory (default: tests)
--interval 1: Debounce interval in seconds (default: 2.0)
--fail-on REGRESSION,TOOLS_CHANGED: Comma-separated statuses that count as failure (default: REGRESSION)
--sound: Terminal bell on regression
# Basic: watch everything, full checks
evalview watch
# Fast development loop: no LLM judge, 1-second debounce
evalview watch --quick --interval 1
# Watch specific directories and one test
evalview watch --path src/ --path tests/ --test "calculator-division"
# Strict mode: fail on any behavioral change
evalview watch --fail-on REGRESSION,TOOLS_CHANGED,OUTPUT_CHANGED --sound
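To make the --fail-on semantics concrete, here is a small sketch of how a comma-separated status list could be turned into a pass/fail decision. The function names `parse_fail_on` and `is_failure` are hypothetical and not part of EvalView. The status names and the REGRESSION default are the ones shown on this page.

```python
def parse_fail_on(value: str = "REGRESSION") -> set[str]:
    """Split a comma-separated --fail-on value into a set of status names."""
    return {part.strip() for part in value.split(",") if part.strip()}


def is_failure(status: str, fail_on: set[str]) -> bool:
    """Return True if a check result with this status counts as a failure."""
    return status in fail_on


# With the default, only REGRESSION fails; strict mode widens the set.
default_set = parse_fail_on()
strict_set = parse_fail_on("REGRESSION,TOOLS_CHANGED,OUTPUT_CHANGED")
```

Under these assumptions, a TOOLS_CHANGED result passes with the default set but fails with the strict set from the last example above.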
Watch mode requires the watchdog package. If not installed:
pip install "evalview[watch]"
Watch mode ignores .evalview/, .git/, venv/, node_modules/, __pycache__/, and other common non-source directories automatically.
--quick mode is ideal for tight development loops: it skips the LLM judge, costs nothing, and runs in under a second.
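A rough sketch of how a watcher might skip the directories named above, assuming they are matched by exact path component. The name `is_ignored` is hypothetical, the set holds only the directories listed on this page, and the "other common non-source directories" are not enumerated here.

```python
from pathlib import PurePath

# Directory names listed above; the full ignore list is not specified here.
IGNORED_DIRS = {".evalview", ".git", "venv", "node_modules", "__pycache__"}


def is_ignored(path: str) -> bool:
    """Return True if any component of the path is an ignored directory."""
    return any(part in IGNORED_DIRS for part in PurePath(path).parts)
```

Matching whole path components rather than substrings means a directory such as `venv_tools/` would still be watched.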