Populates an Eval Suite Planning & Logging Template from an agent description, producing a .xlsx workbook and HTML review page. Use before generating test cases or running evals.
Slash command
/eval-guide:eval-suite-planner
This skill produces the **Plan** artifact of the `/eval-guide` lifecycle: a populated copy of the customer's **Eval Suite Planning & Logging Template** plus an interactive HTML review page. The workbook is the source-of-truth artifact; do not replace it with a scenario table, quality-signal table, generic spreadsheet, default `.docx` report, or HTML-only plan.
The skill aligns to skills/eval-guide/playbook.md and skills/eval-guide/eval-suite-template.md. Use the 10-step playbook as the methodology spine and the XLSX template as the output shape.
Copy the blank XLSX template and populate existing cells/rows only. Do not modify the template.
Do not rename sheets, add sheets, delete sheets, add columns, change headers, rewrite README text, edit Dropdown Lists, change styles, change data validation, or convert the template into a different spreadsheet.
If a blank template workbook is available in the session, use it. If not, ask the user to provide the template; do not silently invent a new workbook.
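The copy-then-populate rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the helper names `file_sha256` and `copy_template` are hypothetical, and the sketch only covers copying the template and confirming that the original file is byte-identical afterwards.

```python
import hashlib
import shutil
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_template(template: Path, destination: Path) -> Path:
    """Copy the blank template to `destination` without touching the original.

    Hypothetical helper: it raises instead of inventing a workbook when the
    template is missing, mirroring the "ask the user" rule.
    """
    if not template.is_file():
        raise FileNotFoundError(
            f"Template not found at {template}; ask the user to provide it."
        )
    if destination.resolve() == template.resolve():
        raise ValueError("Destination must differ from the template path.")

    before = file_sha256(template)
    shutil.copyfile(template, destination)
    # The template itself must be byte-identical after the copy.
    if file_sha256(template) != before:
        raise RuntimeError("Template changed during copy.")
    return destination
```

All later population would then happen on the returned copy, never on the template path.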
Ask targeted questions only when a workbook field materially affects the plan and cannot be inferred safely:
If the user wants speed or cannot answer, populate the field with `TBD - confirm before baseline`.
When invoked as /eval-suite-planner <agent description>:
- Use the Target pass rate, Target rationale, Gate type, Intended use, Run cadence, and Notes columns to express this; do not add a new column.
- Notes;
- 3 . Run Log only when useful:
  - Run type = Baseline;
  - Actionable next step = Validate grader, then run baseline;
  - Status = Open.
- Intended use = Both or Regression;
- Gate; the slim subset likely affected by model/tool/policy changes can be Both or Regression;
- Run cadence using existing dropdown values such as Per-change, Nightly, Weekly, or Milestone-only.
- 4 . Reusable Library:
Use skills/eval-guide/eval-suite-template.md as the exact tab/column map.
README
Do not edit.
1 . Planning
Populate only existing input cells:
For the template's Min pass rate - Capability row, reflect v5 Step 4 accurately: use launch floor / high-risk capability floor / regression-governance language, not a generic scenario pass-rate target.
2 . Eval Suite Registry
Populate one row per eval set. Do not populate one row per test case or legacy planning artifact.
Required row semantics:
- Category: Capability or Trust & Safety.
- Dimension tested: capability dimension or T&S category from the template dropdowns.
- Purpose / diagnostic signal: what failure in this set diagnoses.
- Target pass rate: absolute gate for T&S; launch floor or Regression / direction after baseline for most capability sets.
- Target rationale: v5 Step 4 rationale.
- Gate type: closest existing dropdown value.
- Intended use: Gate, Regression, or Both.
- Run cadence: cadence for Step 8.
- Human input type, Human input author, Grounding source dependency, Source change -> review?: Step 5.
- Reusable asset?, Reuse tier, Set status: Step 10 and lifecycle status.
- Notes: assumptions, open questions, Step 4 nuance, and Step 6 grader-validation plan.

3 . Run Log
Use this for Step 7 baseline/iteration logging. During planning, add placeholder baseline rows only if useful; keep result fields blank.
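Returning to the Eval Suite Registry row semantics, a pre-write check on two of the columns might look like the sketch below. It is illustrative only: the `RegistryRow` shape, the `eval_set` field, and `row_problems` are hypothetical names. The allowed values are taken from the Category and Intended use descriptions above and assume the template's dropdowns match them exactly.

```python
from dataclasses import dataclass

# Values as described in the row semantics; the template's Dropdown Lists
# sheet remains the authority (assumption: they match exactly).
CATEGORIES = {"Capability", "Trust & Safety"}
INTENDED_USES = {"Gate", "Regression", "Both"}


@dataclass
class RegistryRow:
    """One row per eval set (hypothetical in-memory shape, subset of columns)."""
    eval_set: str
    category: str
    intended_use: str
    notes: str = ""


def row_problems(row: RegistryRow) -> list[str]:
    """Return human-readable problems; an empty list means the row looks valid."""
    problems = []
    if row.category not in CATEGORIES:
        problems.append(f"{row.eval_set}: unknown Category {row.category!r}")
    if row.intended_use not in INTENDED_USES:
        problems.append(f"{row.eval_set}: unknown Intended use {row.intended_use!r}")
    return problems
```

Collecting problems as a list, rather than raising on the first one, lets every issue be surfaced in Notes or on the review page in a single pass.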
4 . Reusable Library
Populate candidate reusable assets only. Do not duplicate every eval set; promote assets that could help other agents.
Dropdown Lists
Do not edit.
Create eval-suite-<agent-name>-<YYYY-MM-DD>.xlsx as a populated copy of the template.
Then create eval-suite-<agent-name>-<YYYY-MM-DD>-review.html next to the workbook using skills/eval-guide/plan-review-page.md.
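The two output names above can be derived together so the review page always sits next to the workbook. The sketch below is one possible approach: `slugify` and `output_paths` are hypothetical helpers, and the slug rule (lowercase, hyphen-separated) is an assumption, since the source does not say how `<agent-name>` is normalized.

```python
import re
from datetime import date
from pathlib import Path


def slugify(agent_name: str) -> str:
    """Lowercase the name and replace runs of non-alphanumerics with '-'.

    Assumed normalization; the skill does not specify one.
    """
    return re.sub(r"[^a-z0-9]+", "-", agent_name.lower()).strip("-")


def output_paths(agent_name: str, out_dir: Path, day: date) -> tuple[Path, Path]:
    """Return (workbook path, HTML review page path) in the same directory."""
    stem = f"eval-suite-{slugify(agent_name)}-{day.isoformat()}"
    return out_dir / f"{stem}.xlsx", out_dir / f"{stem}-review.html"
```

For example, `output_paths("HR Benefits Bot", Path("out"), date(2025, 1, 15))` yields `out/eval-suite-hr-benefits-bot-2025-01-15.xlsx` and the matching `-review.html` path.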
Do not paste the summary, eval-set table, or checklist into chat. The HTML page carries that content. The final chat response should be only the workbook path, the HTML review page path, and any blocker/manual action.
Include these in the HTML review page checklist instead of displaying them in chat:
| # | Checkpoint | What to verify |
|---|---|---|
| 1 | Objective, risk tier, owner | The objective is decision-oriented, the five-factor risk tier is right, and a named owner can sign off. |
| 2 | Eval-set decomposition | Capability sets isolate one diagnostic capability each; T&S sets remain separate from capability. |
| 3 | Step 4 bars | T&S has absolute hard gates; capability uses launch floors / regression-direction unless high-risk. |
| 4 | Human inputs | Rubrics, ground truths, golden answers, and source dependencies have owners. |
| 5 | Grader validation | Each set has a plausible grader type and validation plan before baseline. |
| 6 | Regression partition | Capability and slim T&S regression sets have cadence; gate-only T&S sets run at milestones. |
| 7 | Template integrity | No sheets, columns, headers, dropdowns, README text, or formatting were changed. |
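
Part of checkpoint 7 can be checked mechanically. The sketch below compares sheet names and tab order between the blank template and the populated copy using only the standard library. It is a partial check under stated assumptions: `sheet_names` and `sheets_unchanged` are hypothetical helpers, the workbook part is assumed to live at the conventional `xl/workbook.xml` path, and columns, headers, dropdowns, README text, and formatting are not covered.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by SpreadsheetML workbook parts.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def sheet_names(xlsx_path) -> list[str]:
    """Read sheet names, in tab order, from an .xlsx file's workbook part."""
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return [sheet.attrib["name"] for sheet in root.iter(f"{MAIN_NS}sheet")]


def sheets_unchanged(template_path, populated_path) -> bool:
    """True when both workbooks have the same sheet names in the same order."""
    return sheet_names(template_path) == sheet_names(populated_path)
```

A `False` result means a sheet was renamed, added, deleted, or reordered, and the populated copy should be rebuilt from the template.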
Do not create a .docx unless the user explicitly asks for a narrative report.
- /eval-generator: Generate test cases from the populated workbook registry.
- /eval-result-interpreter: Interpret baseline / iteration results using Step 6-7 and gate status.
- /eval-triage-and-improvement: Diagnose failures and feed the Step 9 optimization loop.
- /eval-library-promoter: Promote Step 10 reusable assets.
- /eval-guide: Orchestrated workflow with dashboard review checkpoints.

npx claudepluginhub microsoft/eval-guide