npx claudepluginhub hidai25/eval-view

This skill uses the workspace's default tool permissions.
Skills included:

- Use this skill after making changes to an AI agent (prompt edits, model swaps, tool changes, code refactors) to verify nothing broke.
- Starts EvalView watch mode to monitor file changes and automatically re-run regression checks. Useful for continuous monitoring during Python development.
- Evaluates LLM agents via behavioral testing, capability assessment, reliability metrics, benchmarking, and production monitoring, a domain where top agents score below 50% on real-world benchmarks.
- Writes, edits, reviews, and validates AgentV EVAL.yaml files for agent skill evaluations. Adds test cases, configures graders, converts from evals.json or chat transcripts.
EvalView compares current agent behavior against saved golden baselines. It runs your test cases, evaluates the outputs, and reports a diff status for each test.
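Test cases are plain YAML files. The exact EvalView schema is not reproduced in this doc, so the sketch below uses hypothetical field names (name, input, expected_behavior, grader) purely to illustrate the shape of a test case, not the documented format:

# tests/evalview/refund-policy.yaml (illustrative only; field names are assumptions)
name: refund-policy
input: "A customer asks for a refund two days after purchase."
expected_behavior: "Cites the 30-day refund policy and offers to process the refund."
grader: llm_judge  # hypothetical grader identifier; graders are configurable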
1. Locate the test directory. Look for tests/evalview/ in the project. If it exists, use that. Otherwise check for a tests/ directory with .yaml test files.
2. Run a regression check using the run_check MCP tool: call run_check with the detected test_path, and pass the test parameter with a test name to run a single test (see the sketch after this list).
3. Interpret the results.
4. If changes are intentional, offer to update the baseline by calling run_snapshot with an explanatory notes parameter.
5. Optionally, generate a visual report by calling generate_visual_report for a detailed HTML breakdown of traces, diffs, scores, and timelines.
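Assuming the parameter names above, a single-test run_check call could be expressed as the following arguments; the exact tool-call syntax depends on your MCP client, so treat this as a sketch:

# Hypothetical run_check invocation, written out as YAML arguments
tool: run_check
arguments:
  test_path: tests/evalview/   # the detected test directory
  test: refund-policy          # optional: run only this named test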
The same checks are available from the CLI:

evalview check tests/evalview/
evalview check tests/evalview/ --test "my-test"
evalview snapshot tests/evalview/ --notes "updated after prompt refactor"
Call run_check frequently; it calls the Python API directly, with no subprocess overhead.