From eval-guide
Answers AI agent evaluation methodology questions using Microsoft's agent evaluation ecosystem, covering grader types, dataset design, criteria writing, non-determinism, tool-call evaluation, and multi-turn agents.
Slash command: /eval-guide:eval-faq
Answer any question about eval methodology, grader types, dataset design, criteria writing, non-determinism, tool-call evaluation, multi-turn agent evaluation, eval tooling, capability vs. regression evals, and interpreting results — specifically in the context of AI agent evaluation. The primary methodology is skills/eval-guide/playbook.md: Practical Guidance on Agent Evaluation: a 10-step playbook. Microsoft's agent evaluation documentation (MS Learn pages, the Eval Scenario Library, the Triage & Improvement Playbook, and the Eval Guidance Kit) remains the authoritative supporting source set for Copilot Studio mechanics and reference patterns, supplemented by select industry sources for topics Microsoft does not cover deeply.
When invoked as /eval-faq <question>, follow this process exactly:
Use this topic-to-URL routing table to decide what to fetch. Fetch FIRST, then answer. Fetch only the URL(s) that match the question topic — do not fetch all URLs every time.
| Question topic | Fetch this URL | Section to extract | Notes |
|---|---|---|---|
| Scenario types, business-problem vs capability scenarios, what cases to write, dataset structure | https://github.com/microsoft/ai-agent-eval-scenario-library | Business-Problem scenarios, Capability scenarios, eval-set-template | 5 business-problem + 9 capability scenario types |
| Quality signals, policy accuracy, source attribution, personalization, action enablement, privacy | https://github.com/microsoft/ai-agent-eval-scenario-library | Quality signals section and method mapping tables | Quality signal to evaluation method mapping |
| Red-teaming, adversarial testing, attack surface reduction, XPIA, encoding attacks, ASR metrics | https://github.com/microsoft/ai-agent-eval-scenario-library | Red-teaming section: Probe-Measure-Harden framework | Red-team ASR thresholds: <2% harmful, <1% PII, <5% jailbreak |
| Evaluation method selection, keyword match vs compare meaning vs general quality | https://github.com/microsoft/ai-agent-eval-scenario-library | resources/evaluation-method-selection-guide.md | 4 evaluation methods with selection criteria |
| Eval generation, writing eval cases from a prompt template, synthesizing test sets | https://github.com/microsoft/ai-agent-eval-scenario-library | resources/eval-generation-prompt.md | Template for generating eval cases |
| Agent profile template, defining agent scope for eval | https://github.com/microsoft/ai-agent-eval-scenario-library | resources/agent-profile-template.yaml | Agent profile definition for scoping evals |
| Score interpretation, what scores mean, risk tier-based thresholds, hard/soft gates, readiness decisions, SHIP/ITERATE/BLOCK | https://github.com/microsoft/triage-and-improvement-playbook | Layer 1: Score Interpretation, readiness decision tree | Supporting source for Step 4/6/7 readiness decisions |
| Failure triage, debugging eval failures, root cause analysis, diagnostic questions | https://github.com/microsoft/triage-and-improvement-playbook | Layer 2: Failure Triage, 26 diagnostic questions | 5-question eval verification, 7 eval setup failure sub-types |
| Remediation, fixing failures, instruction budget, actions per failure pattern | https://github.com/microsoft/triage-and-improvement-playbook | Layer 3: Remediation Mapping | Actions mapped to failure patterns |
| Pattern analysis, cross-signal patterns, trend analysis, concentration analysis | https://github.com/microsoft/triage-and-improvement-playbook | Layer 4: Pattern Analysis | 7 cross-signal patterns, trend analysis |
| Root cause types, eval-setup problem vs agent-quality problem, eval setup issue vs agent config vs platform limitation | https://github.com/microsoft/triage-and-improvement-playbook | Root Cause Types section | Supporting taxonomy mapped to Step 7's two root buckets |
| Non-determinism handling, run variance, flaky results | https://github.com/microsoft/triage-and-improvement-playbook | Non-determinism section | 3 runs minimum, +/-5% normal, +/-10% investigate |
| 4-stage iterative framework, Define, Set Baseline & Iterate, Systematic Expansion, Operationalize | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/evaluation-iterative-framework | Full framework — all 4 stages | Supporting MS Learn lifecycle/cadence source under the 10-step playbook |
| Eval checklist, readiness checklist, pre-launch verification | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/evaluation-checklist | Full checklist | Maps to Eval Guidance Kit documents |
| Grader types, code-based vs LLM-judge vs human graders, common evaluation approaches | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/architecture/common-evaluation-approaches | Echo, Historical Replay, Synthesized Personas; grader types | 3 approaches + 3 grader categories |
| 7 test methods, General Quality, Compare Meaning, Capability Use, Keyword Match, Text Similarity, Exact Match, Custom | https://learn.microsoft.com/en-us/microsoft-copilot-studio/analytics-agent-evaluation-overview | 7 test methods section | General Quality sub-dimensions: Relevance, Groundedness, Completeness, Abstention |
| Test set creation, building eval datasets in Copilot Studio | https://learn.microsoft.com/en-us/microsoft-copilot-studio/analytics-agent-evaluation-create | Test set creation methods | Generate, import, or manually write test cases |
| Test set editing, user profiles, connections, modifying test methods | https://learn.microsoft.com/en-us/microsoft-copilot-studio/analytics-agent-evaluation-edit | Manage user profiles and connections, edit test methods | Multi-profile eval for simulating different users; GCC limitations |
| Running evals, viewing results, test results interpretation | https://learn.microsoft.com/en-us/microsoft-copilot-studio/analytics-agent-evaluation-results | Run tests and view results | 89-day result retention; export results immediately |
| Agent evaluation overview, why use automated testing, test chat vs eval | https://learn.microsoft.com/en-us/microsoft-copilot-studio/analytics-agent-evaluation-intro | About agent evaluation | GCC limitations: no user profiles, no Text similarity method |
| Rubric refinement workflow, aligning AI grading with human judgment | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/kit-rubrics-refinement-workflow | 8-step workflow: Run, Review, Grade, Refine, Save, Re-run, Repeat | Alignment matrix, Standard vs Full refinement views, example marking |
| Rubric best practices, tips for rubric refinement | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/kit-rubrics-best-practices | Best practices for refinement | Quality over quantity for examples; don't chase 100% alignment |
| Rubric reference guide, grade definitions, rubric structure | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/kit-rubrics-reference | Rubrics reference | Grade scale definitions, rubric components |
| Copilot Studio Kit overview, kit capabilities | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/kit-overview | Kit overview | Parent page for all Kit features including rubrics |
| 11 scenario validation themes, evaluation frameworks | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/architecture/evaluation-frameworks | 11 scenario validation themes | |
| Defining eval purpose, what to evaluate, scoping eval | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/evaluation-define-purpose | Full page | |
| Eval Guidance Kit, checklist documents, framework PowerPoint | https://aka.ms/EvalGuidanceKit | Checklist, Framework, failure-log-template | Resolves to GitHub PowerPnPGuidanceHub |
| pass@k vs pass^k metrics, non-determinism statistics, 0% pass@100 interpretation | https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents | pass@k, pass^k, capability evals sections | Supplementary: Microsoft non-determinism guidance is primary |
| Capability vs regression evals, eval-driven development | https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents | Capability evals, regression evals sections | Supplementary industry context under the 10-step playbook |
| LLM-as-judge calibration, position bias, verbosity bias, self-enhancement bias | https://eugeneyan.com/writing/llm-evaluators/ | Biases and calibration sections | Supplementary: bias percentages not in Microsoft sources |
| Critique shadowing, judge prompt design, error analysis methodology | https://hamel.dev/blog/posts/llm-judge/ | Judge prompt design, calibration | Supplementary: deep LLM judge methodology |
| Eval platforms, tooling comparison, Braintrust, LangSmith | https://www.braintrust.dev/articles/top-5-platforms-agent-evals-2025 | Platform comparison | Supplementary: lightweight tooling reference |
| Any question not clearly matching above | https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/evaluation-overview | Primary source; supplement with the relevant knowledge base section | Default fallback is MS Learn |
Fetch rules:
Synthesize the fetched content with the knowledge base below. The 10-step playbook is the methodology spine; Microsoft fetched content supplies supporting details and Copilot Studio specifics, then external sources fill gaps.
Answer style rules — no exceptions:
Use the sections below as your primary reference when fetched content does not cover the question, or to supplement fetched content with additional details.
The core methodology is skills/eval-guide/playbook.md: Practical Guidance on Agent Evaluation: a 10-step playbook. Use the MS Learn pages below as supporting sources, not the spine.
Supporting MS Learn lifecycle source: The MS Learn iterative framework is still useful for lifecycle/cadence questions and maps into the playbook, but it is no longer the canonical methodology for this toolkit.
Step 9 turns production signals into improvements: thumbs-down (highest signal), escalations, manual overrides, support tickets, and qualitative feedback -> cluster -> decide fix location (agent config/retrieval/tools, rubric/expected answer, or new eval cases) -> ship -> re-evaluate against the Step 8 regression suite. A production failure with no matching eval case is a coverage gap, not proof that the prompt is bad.
Step 10 promotes reusable assets into a shared eval library with three tiers: Required (org-wide deploy gate), Recommended (applies to most agents in a class), and Opt-in (borrow when relevant). Good candidates are trust & safety sets, tone/citation/refusal rubrics, failure-pattern templates, and production-derived edge cases.
Per Microsoft's Eval Scenario Library, scenarios divide into two categories:
5 Business-Problem scenarios (test whether the agent solves the real user problem):
9 Capability scenarios (test a specific isolated ability):
Anti-pattern: Skewing your dataset 80%+ toward happy-path cases. Per the Scenario Library, balance across business-problem and capability scenarios for meaningful coverage. Target roughly 50% happy-path, 30% edge cases, 20% adversarial.
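The target mix above can be checked mechanically before a dataset is imported. The following is a minimal sketch, not part of the Scenario Library: the type labels (`happy_path`, `edge_case`, `adversarial`), the `mix_report` helper, and the 10-point tolerance are all assumptions chosen for illustration.

```python
from collections import Counter

# Target shares from the guidance above. The label names are hypothetical.
TARGET_MIX = {"happy_path": 0.50, "edge_case": 0.30, "adversarial": 0.20}


def mix_report(case_types, tolerance=0.10):
    """Return {type: (actual_share, target_share, within_tolerance)}.

    tolerance is an assumed slack in absolute share, not a documented value.
    """
    counts = Counter(case_types)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("dataset is empty")
    report = {}
    for case_type, target in TARGET_MIX.items():
        share = counts.get(case_type, 0) / total
        report[case_type] = (round(share, 3), target, abs(share - target) <= tolerance)
    return report


# A dataset skewed 85% toward happy-path cases.
skewed = ["happy_path"] * 17 + ["edge_case"] * 2 + ["adversarial"] * 1
for case_type, (share, target, ok) in mix_report(skewed).items():
    status = "ok" if ok else "rebalance"
    print(f"{case_type}: {share:.0%} (target {target:.0%}) {status}")
```

Treat the output as a prompt to review coverage rather than a hard gate; the 50/30/20 split is itself described as a rough target.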
Microsoft's Eval Scenario Library includes five reusable quality dimensions that can inform eval-set design, but the toolkit records them as workbook registry rows rather than treating them as the primary planning artifact:
Each dimension can map to methods such as Keyword Match, Compare Meaning, Capability Use, or General Quality, but the selected method and governance belong to the eval set in the workbook registry.
Per MS Learn agent evaluation guidance, seven test methods cover different evaluation needs:
Per MS Learn (common-evaluation-approaches), three approaches for generating test interactions:
Per the Triage Playbook, score interpretation follows a 4-layer framework:
Layer 1 — Score Interpretation: Apply risk tier, workbook-defined gates, grader-validation caveats, and the readiness decision tree:
Layer 2 — Failure Triage: When scores are low, run the 5-question eval verification first (is the eval itself correct?) before blaming the agent. Then apply 26 diagnostic questions across 6 domains to identify the root cause. Seven eval setup failure sub-types cover common grader/dataset bugs.
Layer 3 — Remediation Mapping: Each failed eval set should map to a specific fix location. Watch for the instruction budget problem — adding instructions to fix one failure pattern can degrade another.
Layer 4 — Pattern Analysis: Look for concentration (failures clustered in specific scenario types), cross-signal correlations (7 documented cross-signal patterns), and trends over time.
Step 7 root buckets: Every failure is exactly one of: (1) Eval-setup problem — the response is acceptable and the eval/ground truth/rubric/method is wrong, or (2) Agent-quality problem — the eval caught a real issue. The Triage Playbook's Eval Setup / Agent Configuration / Platform Limitation categories are useful operational subtypes mapped onto those two buckets. Always rule out eval setup first — many early "failures" are grader or dataset bugs, not agent bugs.
Per the Triage Playbook: agents are non-deterministic. Run a minimum of 3 trials per case. Score variance of +/-5% across runs is normal. Variance of +/-10% or more requires investigation — either the eval is flaky or the agent has a genuine instability.
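As a rough illustration of the run-variance rule, the sketch below classifies a set of per-run pass rates. It assumes "variance" means the largest deviation from the mean, in percentage points; the Triage Playbook may define it differently. The `classify_run_variance` name and the "watch" label for the band between 5 and 10 points are hypothetical.

```python
from statistics import mean


def classify_run_variance(pass_rates):
    """Classify run-to-run spread for per-run pass rates given in percent.

    Assumption: spread = max absolute deviation from the mean, in points.
    """
    if len(pass_rates) < 3:
        raise ValueError("run at least 3 trials before interpreting variance")
    avg = mean(pass_rates)
    spread = max(abs(rate - avg) for rate in pass_rates)
    if spread <= 5.0:
        label = "normal"
    elif spread < 10.0:
        label = "watch"  # assumed label; the source does not name this band
    else:
        label = "investigate"
    return round(avg, 1), round(spread, 1), label


print(classify_run_variance([82.0, 85.0, 80.0]))
print(classify_run_variance([60.0, 80.0, 85.0]))
```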
Additional industry context from Anthropic: pass@k ("succeeded at least once in k runs") and pass^k ("succeeded every time in k runs") diverge sharply as k grows. At k=10 with 70% per-trial success and independent trials, pass@k = 1 - 0.3^10, which is above 99.99%, while pass^k = 0.7^10, which is approximately 3%. The same agent looks excellent or catastrophic depending on which metric you report. For customer-facing agents, pass^k is the right question. A 0% pass@100 is almost always a task specification problem, not an agent problem; fix the task definition before blaming the model.
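The two metrics can be computed directly from a per-trial success rate. This sketch assumes trials are independent and share the same success probability, which real agents only approximate; the function names are illustrative.

```python
def pass_at_k(p, k):
    """Probability of at least one success in k independent trials."""
    return 1 - (1 - p) ** k


def pass_hat_k(p, k):
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


p, k = 0.70, 10
print(f"pass@{k} = {pass_at_k(p, k):.4%}")
print(f"pass^{k} = {pass_hat_k(p, k):.4%}")
```

Reporting both numbers side by side makes the gap between "can it ever succeed" and "does it succeed reliably" visible to stakeholders.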
Per Microsoft's Eval Scenario Library, red-teaming uses the Probe-Measure-Harden framework:
Red-team thresholds: ASR <2% for harmful content, <1% for PII leakage, <5% for jailbreak. Integrate red-teaming into CI/CD — point-in-time testing misses regressions from prompt changes and model upgrades.
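A CI/CD gate on these thresholds could look like the sketch below. The category keys, the `asr_gate` helper, and the input shape (successful attacks, total attempts) are assumptions; only the three threshold values come from the text above.

```python
# Attack success rate (ASR) limits from the red-team thresholds above.
ASR_THRESHOLDS = {
    "harmful_content": 0.02,
    "pii_leakage": 0.01,
    "jailbreak": 0.05,
}


def asr_gate(results):
    """results: {category: (successful_attacks, total_attempts)}.

    Returns {category: (asr, passed)}; passing requires ASR strictly below the limit.
    """
    report = {}
    for category, limit in ASR_THRESHOLDS.items():
        successes, attempts = results[category]
        if attempts <= 0:
            raise ValueError(f"no attempts recorded for {category}")
        asr = successes / attempts
        report[category] = (asr, asr < limit)
    return report


run = {"harmful_content": (1, 100), "pii_leakage": (1, 100), "jailbreak": (4, 100)}
report = asr_gate(run)
for category, (asr, passed) in report.items():
    print(f"{category}: ASR {asr:.1%} -> {'pass' if passed else 'fail'}")
```

In this sample run the PII category fails because 1% is not strictly below the 1% limit.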
Multi-turn adversarial patterns: Single-turn tests are insufficient for deployed conversational agents. Three attack patterns require multi-turn evaluation: (1) Context manipulation — requests shift gradually across turns, (2) Permission escalation — false admin claims introduced across conversation, (3) Role-playing escalation — fictional framing established early then escalated. Include at least 2-3 multi-turn adversarial scenarios in any eval suite.
Per MS Learn (common-evaluation-approaches), three grader categories:
Grading hierarchy (cheapest to most expensive): Run code-based checks first, then LLM judges on passing cases, then human review on a calibration sample. Per the Scenario Library, the 4 evaluation methods (Keyword Match, Compare Meaning, Capability Use, General Quality) map to these grader categories.
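The cheapest-first ordering can be sketched as a small cascade. Everything here is illustrative: `grade_cascade`, the case dictionary shape, the 10% human sample rate, and both grader functions are assumptions, and `stub_judge` is a plain string check standing in for a real LLM judge call.

```python
import random


def grade_cascade(cases, code_check, llm_judge, human_sample_rate=0.1, seed=0):
    """Run code checks first, the judge only on survivors, then sample for human review."""
    results = []
    for case in cases:
        if not code_check(case):
            # Cheap deterministic failure: skip the more expensive judge.
            results.append({"id": case["id"], "verdict": "fail", "stage": "code"})
            continue
        verdict = "pass" if llm_judge(case) else "fail"
        results.append({"id": case["id"], "verdict": verdict, "stage": "llm_judge"})

    # Human review covers a calibration sample of judge-graded cases.
    judged = [r for r in results if r["stage"] == "llm_judge"]
    sample_size = 0
    if judged:
        sample_size = min(len(judged), max(1, round(len(judged) * human_sample_rate)))
    human_queue = random.Random(seed).sample(judged, sample_size)
    return results, [r["id"] for r in human_queue]


# Stand-in graders for illustration only.
def has_citation(case):
    return "[source]" in case["response"]


def stub_judge(case):
    return "30 days" in case["response"]


cases = [
    {"id": "c1", "response": "Refunds are available within 30 days. [source]"},
    {"id": "c2", "response": "Refunds are available any time."},
    {"id": "c3", "response": "Refunds take 90 days. [source]"},
]
results, human_queue = grade_cascade(cases, has_citation, stub_judge)
print(results)
print("human review queue:", human_queue)
```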
Calibration threshold: If your LLM judge and a human expert agree on fewer than 80% of cases (kappa < 0.6), your criteria are ambiguous. Rewrite criteria before trusting scores.
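Raw agreement and Cohen's kappa can both be computed from paired judge and human labels. This is a minimal sketch with a hypothetical helper name and made-up labels. With evenly balanced labels, 80% agreement works out to a kappa of about 0.6, but the two figures drift apart when one label dominates, so it is worth reporting both.

```python
def agreement_and_kappa(judge, human):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1:
        return observed, 1.0  # degenerate case: both raters used a single label
    return observed, (observed - expected) / (1 - expected)


# Illustrative labels only: 1 = pass, 0 = fail.
judge_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
human_labels = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
observed, kappa = agreement_and_kappa(judge_labels, human_labels)
print(f"agreement = {observed:.0%}, kappa = {kappa:.2f}")
```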
Per the Eval Scenario Library, use the eval-set-template.md to structure your dataset. Use the eval-generation-prompt.md template to generate cases from an agent profile.
Use the agent profile template (agent-profile-template.yaml) to define scope before writing cases.
CSV and scoring conventions: Copilot Studio import CSVs have exactly two columns: Question and Expected response. Assign the testing method in the Copilot Studio UI after import; keep set_type, category, method, gate, target, regression class, human-review flag, and source/ground-truth provenance in the manifest (.docx report + stage-N-data.json). Standardize scoring across the suite; for most agents, binary pass/fail is the correct default.
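The two-column convention can be enforced when exporting cases. The sketch below uses Python's standard `csv` module; the in-memory case shape, the `write_import_csv` and `split_manifest` helpers, and the sample rows are assumptions for illustration. It keeps metadata such as `set_type`, `category`, `method`, and `gate` out of the CSV so they can go into the manifest instead.

```python
import csv
import io

CSV_FIELDS = ("question", "expected_response")


def write_import_csv(cases, fileobj):
    """Write only the two import columns; all other fields stay out of the CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(["Question", "Expected response"])
    for case in cases:
        writer.writerow([case["question"], case["expected_response"]])


def split_manifest(cases):
    """Return the non-CSV metadata for each case, destined for the manifest."""
    return [{k: v for k, v in case.items() if k not in CSV_FIELDS} for case in cases]


cases = [
    {
        "question": "What is the refund window for annual plans?",
        "expected_response": "Refunds are available within 30 days, per the policy.",
        "set_type": "capability",
        "category": "policy_accuracy",
        "method": "Compare Meaning",
        "gate": "hard",
    },
]

buffer = io.StringIO(newline="")
write_import_csv(cases, buffer)
print(buffer.getvalue())
print(split_manifest(cases))
```

Quoting of commas and line breaks inside answers is handled by `csv.writer`, which is the main reason to avoid building the rows with string concatenation.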
Per the 10-step playbook, evaluation starts at Step 1 — Plan the eval effort before the agent is built:
Anti-pattern: Writing evals after building the feature. That produces evals calibrated to what you built, not what you intended.
Per Step 7 and the Triage Playbook (Layer 2), never trust a score you have not manually verified. The first question is whether the failure is an eval-setup problem: Is the test set correct? Is the grader measuring the right thing? Is the expected answer actually right? Is the agent getting the right context? Is the eval environment matching production?
Axial coding process for failure analysis:
Per Step 7, always include "eval-setup problem" as a category — many failures in a new eval are grader, rubric, stale ground-truth, or manifest bugs rather than agent-quality problems.
Additional industry context from Hamel Husain: The axial coding methodology and "highest ROI activity in AI engineering" framing come from Hamel Husain's error analysis work. His key insight: most practitioners skip categorization and jump to "fix the prompt," missing structural patterns.
Per the Eval Scenario Library's Tool Invocations capability scenario and MS Learn's Capability Use test method:
Per MS Learn's evaluation approaches, multi-turn workflows require conversation-level evaluation, not turn-level:
Per MS Learn's evaluation frameworks (11 scenario validation themes):
The evaluation approach differs significantly based on agent complexity:
| Dimension | Simple Q&A agent | Multi-step / agentic workflow |
|---|---|---|
| Primary metric | Response accuracy (Compare Meaning, General Quality) | Task completion — did the end-to-end job get done? |
| Grading unit | Single turn: one input, one output | Conversation or trajectory: full sequence of steps |
| Key eval-set focus | Grounded answers and policy accuracy | Action enablement, tool invocation, and Q&A correctness |
| Test method mix | Heavy on Compare Meaning + General Quality | Add Capability Use for tool calls, Keyword Match for intermediate checkpoints |
| Failure modes to watch | Wrong answer, hallucination, refusal | Compounding errors, wrong tool selection, unnecessary steps, partial completion |
| Edge cases | Ambiguous queries, out-of-scope questions | Mid-workflow failures, tool timeouts, user corrections mid-conversation |
| Eval complexity | Low — deterministic input/output pairs work well | High — must evaluate intermediate steps AND final outcome |
Practical guidance:
No single eval method catches every failure. Per the Eval Scenario Library's 4 evaluation methods and the Triage Playbook's multi-layer approach:
Per MS Learn's General Quality test method, LLM judges evaluate across sub-dimensions (Relevance, Groundedness, Completeness, Abstention). Calibrate judges against these defined dimensions.
Additional industry context from Eugene Yan (bias data):
Additional industry context from Hamel Husain (critique shadowing): When building LLM judges from scratch, use the 7-step Critique Shadowing methodology: (1) Identify one expert, (2) Create diverse dataset, (3) Collect binary pass/fail with written critiques, (4) Fix obvious errors, (5) Build judge prompts iteratively using expert examples, (6) Error analysis on disagreements, (7) Build specialized judges for specific failure modes. Target >90% agreement with domain expert before production use.
Per the Eval Scenario Library's Knowledge Grounding guidance:
Per Step 9 of the 10-step playbook, eval is not a pre-launch gate — it is a continuous optimization loop:
When the agent passes evals but fails in production: Per the Triage Playbook, this is almost always a distribution mismatch. Pull 20 recent production failures. Check whether any would fail against your current eval dataset. If none would, your dataset needs production cases, not a better prompt.
Per the Triage Playbook's readiness decision tree:
Per the Triage Playbook's Layer 4 (Pattern Analysis): look for failure concentration in specific scenario types, cross-signal correlations, and trends over time. When a grader's verdict disagrees with your intuition, investigate — either the grader is wrong (fix the criterion) or your intuition is wrong (update your mental model).
For tooling questions, the primary recommendation is Microsoft's Copilot Studio evaluation features for production Copilot agents. For teams needing third-party platforms:
After answering the question, check whether the user would benefit from running a sibling eval skill. If so, append a one-line recommendation at the end of your answer.
| If the question involves... | Suggest this skill | One-liner to append |
|---|---|---|
| Creating an eval plan or scoping what to evaluate | /eval-suite-planner | "For a populated Eval Suite Template workbook, run /eval-suite-planner." |
| Generating test cases, writing CSV datasets, building eval sets | /eval-generator | "To generate ready-to-import test case CSVs, run /eval-generator." |
| Interpreting scores, reading results, understanding pass rates | /eval-result-interpreter | "To interpret a specific set of eval results, paste them into /eval-result-interpreter." |
| Debugging failures, triaging low scores, root cause analysis, remediation | /eval-triage-and-improvement | "To triage specific failures with the full diagnostic framework, run /eval-triage-and-improvement." |
| What is eval, why eval matters, explaining eval to stakeholders | /eval-guide | "For an end-to-end eval explainer you can share with stakeholders, run /eval-guide." |
Rules:
Do not suggest /eval-faq (that is this skill; they are already here).
/eval-faq What eval scenarios should I use for a RAG agent?
/eval-faq How do I interpret a 75% knowledge grounding score?
/eval-faq What is the difference between business-problem and capability scenarios?
/eval-faq When should I use a model-graded grader instead of a deterministic one?
/eval-faq What makes a good adversarial test case?
/eval-faq How many cases do I need in a dataset to get meaningful signal?
/eval-faq My eval passes 100% on first run — is that good?
/eval-faq How do I write a good criterion for a model-graded grader?
/eval-faq What should I do when a grader disagrees with my gut feeling about an output?
/eval-faq How do I handle non-determinism in my eval results?
/eval-faq My agent makes tool calls — how do I eval those?
/eval-faq I suspect my grader is wrong — how do I debug it?
/eval-faq What should I eval in production after I ship?
/eval-faq Should I use pass@k or pass^k for my agent?
/eval-faq How do I calibrate my LLM-as-judge grader?
/eval-faq When do I stop adding eval cases and just ship?
/eval-faq My agent finds a different tool sequence than I expected — is that a failure?
/eval-faq How do I know if my grader is actually measuring what I think it is?
/eval-faq What is the difference between a capability eval and a regression suite?
/eval-faq How do I eval a multi-turn conversational agent?
/eval-faq What eval platform or tool should I use?
/eval-faq My agent passes evals but fails in production — why?
/eval-faq How do I score intermediate steps in a multi-step agent?
/eval-faq How is evaluating a multi-step workflow different from a simple Q&A agent?
/eval-faq What does 0% pass@100 mean — is my agent broken?
/eval-faq How do I avoid LLM judge bias in my grader?
/eval-faq Which eval sets should I include?
/eval-faq What is the Probe-Measure-Harden red-teaming framework?
/eval-faq What are the 7 test methods in Copilot Studio?
/eval-faq How do I use the Triage Playbook to debug failing scores?
/eval-faq How does the MS Learn iterative framework relate to the 10-step playbook?
/eval-faq What are the 3 root cause types for eval failures?
/eval-faq How do I decide between SHIP, ITERATE, and BLOCK?
/eval-faq What red-team ASR thresholds should I target?
/eval-faq How do I generate eval cases from a prompt template?
/eval-faq What is the critique shadowing methodology for building LLM judges?
/eval-faq Should I use a 1-5 scale or pass/fail for my LLM judge?
/eval-faq How do I continuously red-team my agent in CI/CD?
/eval-faq How do I systematically analyze eval failures to find patterns?
/eval-faq How do I know if my eval is too easy?
/eval-faq How do I write an LLM grader prompt that actually works?
/eval-faq Should I score factuality and tone in the same eval criterion?
/eval-faq When should I use the Custom test method instead of General Quality?
/eval-faq How do I set up a Custom test method for compliance checking?
npx claudepluginhub microsoft/eval-guide