From stdd-agents
Use when creating or updating agent evaluation suites. Defines eval structure, rubrics, and validation patterns.
npx claudepluginhub craigtkhill/stdd-agents --plugin stdd-agents
Slash command: /stdd-agents:evaluation
Guidelines for creating comprehensive evaluation suites.
Use this skill when:
All evaluations in evals/ follow a consistent structure with both code-based and LLM-as-judge validations.
Use this template for all eval spec.yaml files:
feature:
  name: "[Feature Name] Evaluation"
  as_a: evaluator
  i_want: validate feature behavior
  solutions:
    - Ground truth validation
    - Code-based validation
    - LLM-as-judge validation
  requirements:
    - id: REQ-EVAL-XX-001
      eval: G
      description: Description of ground truth requirement
    - id: REQ-EVAL-XX-002
      eval: C
      description: Description of code-based requirement
    - id: REQ-EVAL-XX-003
      eval: L
      description: Description of LLM-judged requirement
    - id: REQ-EVAL-XX-004
      eval: O
      description: Description of planned requirement
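As a rough illustration of a code-based check, the requirement entries in a spec like the template above could be linted mechanically. The sketch below is not part of stdd-agents: the function name `lint_requirements` is hypothetical, it takes requirement entries that have already been loaded into plain dicts (so it does not depend on how the YAML is nested), and it assumes the abbreviation is 2-3 uppercase letters, the sequence numbers run 001, 002, ... in file order, and the eval codes are G, C, L, and O.

```python
import re

# Assumed ID shape: REQ-EVAL-XX-NNN (XX = 2-3 uppercase letters, NNN = 3 digits).
REQ_ID = re.compile(r"REQ-EVAL-([A-Z]{2,3})-(\d{3})")
# Eval codes taken from the template: ground truth, code-based, LLM judge, planned.
EVAL_CODES = {"G", "C", "L", "O"}


def lint_requirements(requirements):
    """Return a list of problem strings; an empty list means no issues found."""
    problems = []
    abbreviations = set()
    for position, req in enumerate(requirements, start=1):
        req_id = req.get("id", "")
        match = REQ_ID.fullmatch(req_id)
        if match is None:
            problems.append(f"{req_id!r}: does not match REQ-EVAL-XX-NNN")
        else:
            abbreviations.add(match.group(1))
            # Assumption: numbering is sequential in file order, starting at 001.
            if int(match.group(2)) != position:
                problems.append(f"{req_id}: expected sequence number {position:03d}")
        if req.get("eval") not in EVAL_CODES:
            problems.append(f"{req_id}: eval must be one of {sorted(EVAL_CODES)}")
    if len(abbreviations) > 1:
        problems.append(f"mixed abbreviations: {sorted(abbreviations)}")
    return problems
```

Calling `lint_requirements` on the four template entries (with XX replaced by a real abbreviation such as AG) should return an empty list; an entry with a skipped number or an unknown eval code produces one message per problem.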
Template Rules:
- Requirement IDs use the format REQ-EVAL-XX-NNN
- XX = 2-3 letter eval abbreviation (e.g., AG for action_generation, AS for action_scenarios)
- NNN = Sequential 3-digit number starting at 001
- [G] = Ground truth validation (matches expected output)
- [C] = Code-based validation (deterministic checks)
- [L] = LLM-as-judge validation (quality assessment)
- [O] = Not yet implemented (planned for future)

Use this template for all rubric.md files:
# [Feature Name] Reasoning Trace Rubric
## Format
`[PASS/FAIL] RUBRIC-ID: Criterion description`
## Based on: [Concrete example with specific values]
### [Category Name]
- [ ] RUB-XX-001: Specific, objective criterion
- [ ] RUB-XX-002: Another specific criterion
Template Rules:
- Rubric IDs use the format RUB-XX-NNN (XX matches the spec.yaml abbreviation)
- `- [ ]` format for LLM judge to mark pass/fail
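To make the rubric format concrete, here is a minimal sketch of how a harness might read the checklist and the judge's verdict lines. It is an illustration under assumptions, not the stdd-agents implementation: the function names are hypothetical, the abbreviation is assumed to be 2-3 uppercase letters, and `[PASS/FAIL]` is read as meaning each verdict line starts with either `[PASS]` or `[FAIL]`.

```python
import re

# Checklist line in rubric.md, e.g. "- [ ] RUB-AG-001: Specific, objective criterion".
RUBRIC_ITEM = re.compile(r"- \[ \] (RUB-[A-Z]{2,3}-\d{3}): (.+)")
# Judge verdict line, assumed to look like "[PASS] RUB-AG-001: Criterion description".
VERDICT = re.compile(r"\[(PASS|FAIL)\] (RUB-[A-Z]{2,3}-\d{3}): (.+)")


def rubric_ids(rubric_text):
    """Collect rubric IDs from checklist lines, in document order."""
    ids = []
    for line in rubric_text.splitlines():
        match = RUBRIC_ITEM.fullmatch(line.strip())
        if match:
            ids.append(match.group(1))
    return ids


def parse_verdicts(judge_output):
    """Map each judged rubric ID to True (PASS) or False (FAIL)."""
    verdicts = {}
    for line in judge_output.splitlines():
        match = VERDICT.fullmatch(line.strip())
        if match:
            verdicts[match.group(2)] = match.group(1) == "PASS"
    return verdicts


def unjudged(rubric_text, judge_output):
    """List rubric IDs the judge output did not cover."""
    verdicts = parse_verdicts(judge_output)
    return [rid for rid in rubric_ids(rubric_text) if rid not in verdicts]
```

Checking `unjudged` before scoring is one way to catch a judge response that silently skipped a criterion, which would otherwise look the same as a pass in a simple tally.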