Skill

evaluation

Use when creating or updating agent evaluation suites. Defines eval structure, rubrics, and validation patterns.

Install

npx claudepluginhub craigtkhill/stdd-agents --plugin stdd-agents

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Guidelines for creating comprehensive evaluation suites.

SKILL.md

Similar Skills

using-git-worktrees

Creates isolated Git worktrees for feature branches with prioritized directory selection, gitignore safety checks, auto project setup for Node/Python/Rust/Go, and baseline verification.

superpowers

168.3k

subagent-driven-development

3 files

Executes implementation plans in current session by dispatching fresh subagents per independent task, with two-stage reviews: spec compliance then code quality.

superpowers

168.3k

dispatching-parallel-agents

Dispatches parallel agents to independently tackle 2+ tasks like separate test failures or subsystems without shared state or dependencies.

superpowers

168.3k

Stats

Stars0

Forks0

Last CommitFeb 22, 2026

Actions

View Source View Plugin View on GitHub View README

Evaluation Skill

Guidelines for creating comprehensive evaluation suites.

When to Use This Skill

Use this skill when:

Creating a NEW evaluation suite for a feature
Updating an EXISTING evaluation suite
Understanding the evaluation framework patterns
Writing spec.yaml, rubric.md, or evaluation files

Evaluation Framework Overview

All evaluations in evals/ follow a consistent structure with both code-based and LLM-as-judge validations.

spec.yaml Template

Use this template for all eval spec.yaml files:

feature:
  name: "[Feature Name] Evaluation"
  as_a: evaluator
  i_want: validate feature behavior
  solutions:
    - Ground truth validation
    - Code-based validation
    - LLM-as-judge validation

requirements:
  - id: REQ-EVAL-XX-001
    eval: G
    description: Description of ground truth requirement
  - id: REQ-EVAL-XX-002
    eval: C
    description: Description of code-based requirement
  - id: REQ-EVAL-XX-003
    eval: L
    description: Description of LLM-judged requirement
  - id: REQ-EVAL-XX-004
    eval: O
    description: Description of planned requirement

Template Rules:

Identifier Format: REQ-EVAL-XX-NNN
- XX = 2-3 letter eval abbreviation (e.g., AG for action_generation, AS for action_scenarios)
- NNN = Sequential 3-digit number starting at 001
Implementation Types:
- [G] = Ground truth validation (matches expected output)
- [C] = Code-based validation (deterministic checks)
- [L] = LLM-as-judge validation (quality assessment)
- [O] = Not yet implemented (planned for future)
Categories: Group related requirements logically

rubric.md Template

Use this template for all rubric.md files:

# [Feature Name] Reasoning Trace Rubric

## Format
`[PASS/FAIL] RUBRIC-ID: Criterion description`

## Based on: [Concrete example with specific values]

### [Category Name]
- [ ] RUB-XX-001: Specific, objective criterion
- [ ] RUB-XX-002: Another specific criterion

Template Rules:

Identifier Format: RUB-XX-NNN (matches spec.yaml abbreviation)
Categories: Organize criteria into logical groups
Criteria: Write concrete, objectively verifiable rules, not subjective assessments
Specificity: Reference actual values, fields, or behaviors that can be checked
Checkboxes: Use - [ ] format for LLM judge to mark pass/fail
Avoid subjective language: Do not use vague terms; state exactly what to verify