Use when comparing two variants (code, LLM prompts, CLI commands, or any executable) against defined criteria with the same inputs. Do NOT use when variants cannot produce observable, comparable output.
A structured harness for comparing two variants using parallel subagent execution and model grader judgment. Works for any comparison where both sides produce observable output.
| Type | How it runs | Example |
|---|---|---|
| `code` | Implement in isolated worktree, run tests | Refactoring, algorithm swap |
| `llm` | Call LLM with given config against all INPUTS | Prompt A vs prompt B |
| `command` | Run shell command against all INPUTS, capture stdout/stderr | CLI flag comparison |
| `custom` | Freeform; subagent follows instructions literally | API call, config swap |
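As a concrete illustration of the `command` type, here is a minimal shell sketch of running two variants against the same shared inputs. The commands (`wc -w` vs `wc -c`) and the inputs are invented for the example; a real subagent would substitute the variant definitions from the spec.

```shell
# Hypothetical command-type comparison: the same inputs piped through both variants.
inputs=("hello world" "foo bar baz")
variant_a='wc -w'   # Variant A: count words (example command)
variant_b='wc -c'   # Variant B: count bytes (example command)
for input in "${inputs[@]}"; do
  out_a=$(printf '%s' "$input" | $variant_a 2>&1)  # capture stdout and stderr
  out_b=$(printf '%s' "$input" | $variant_b 2>&1)
  printf 'input=%q A=%s B=%s\n' "$input" "$out_a" "$out_b"
done
```

Capturing stderr alongside stdout matters here: a variant that errors on some input should surface that error in its recorded output rather than silently producing an empty result.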
| Field | Description | Default |
|---|---|---|
| TASK | What is being compared | (required) |
| VARIANT_A | type: + config/instructions | type: code, current code unchanged |
| VARIANT_B | type: + config/instructions | (required) |
| INPUTS | Shared test inputs passed to both variants | (optional, but required for llm/command) |
| Evals | Checklist of judgment criteria | (required) |
| Grader | auto or none | auto |
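A hypothetical filled-in spec, with every value invented purely to show the shape, might look like:

```
TASK: Compare two ways of measuring input size
VARIANT_A: type: command, `wc -w`
VARIANT_B: type: command, `wc -c`
INPUTS: "hello world", "foo bar baz"
Evals:
- Output is a single integer per input
- No errors on any input
Grader: auto
```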
When Grader is auto, the model grader always runs; it also runs `bats tests/` if either variant is of type code.
Worktree rule: Create worktrees only for variants of type code. A code variant always runs in its own worktree; non-code variants do not.
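The isolation the Agent tool provides corresponds roughly to `git worktree`. A self-contained sketch, assuming a throwaway scratch repo (all paths and names here are invented for the example):

```shell
# Sketch: a code variant gets its own worktree and branch (hypothetical paths).
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=eval -c user.email=eval@example.com \
  commit -q --allow-empty -m "base"
wt="$repo-variant-b"                                  # sibling dir for Variant B
git -C "$repo" worktree add -q -b eval/variant-b "$wt"  # isolated checkout
ls -d "$wt"
```

Each code variant editing its own worktree means neither subagent can see or clobber the other's changes, which is what makes the parallel launch below safe.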
Guard: If VARIANT_A and VARIANT_B are textually identical, stop — do not proceed.
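A minimal sketch of that guard for file-based variant definitions (file names and contents are invented for the example):

```shell
# Guard: stop when the two variant definitions are textually identical.
printf 'wc -w\n' > variant_a.cfg   # hypothetical Variant A definition
printf 'wc -c\n' > variant_b.cfg   # hypothetical Variant B definition
if diff -q variant_a.cfg variant_b.cfg >/dev/null; then
  verdict="stop: variants are identical"
else
  verdict="proceed"
fi
echo "$verdict"
```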
Setup: for each variant of type code, create a worktree via the Agent tool with isolation: "worktree"; non-code variants need no worktree. Launch two subagents in a single parallel message. Each subagent:
- code: implement changes in the worktree, run `bats tests/`, capture full output
- llm: call the model with the prompt config against each input, collect all responses
- command: run the command against each input, capture stdout/stderr
- custom: follow the freeform instructions, capture all observable output

Each subagent reports back:

- VARIANT: A or B
- TYPE: variant type
- EXEC_SUMMARY: what was run
- OUTPUTS: raw outputs per input (responses, stdout, test results)
- NOTES: errors, anomalies, or partial failures

On failure: if a variant cannot execute, record the error in NOTES and return what was collected. Do not fabricate output.
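For instance, a subagent that ran the hypothetical `wc -c` command variant might return a record shaped like this (all values invented):

```
VARIANT: B
TYPE: command
EXEC_SUMMARY: ran `wc -c` against 2 inputs
OUTPUTS:
  "hello world" -> 11
  "foo bar baz" -> 11
NOTES: none
```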
If grader is auto, produce the following report:

## Eval-Harness Report
**Task**: <task>
**Variant A**: <type + description>
**Variant B**: <type + description>
### Execution Summary
<EXEC_SUMMARY per variant; errors if any>
### Per-Criterion Breakdown
| Criterion | Variant A | Variant B |
|-----------|-----------|-----------|
| ... | WIN/LOSE/TIE | WIN/LOSE/TIE |
### Model Grader Verdict
Winner: <Variant A | Variant B | Tie>
### Reasoning
<Judge's reasoning>
### Recommendation
<Which variant to adopt and why; if a variant failed to execute, say so explicitly>
If grader is none: omit the Model Grader Verdict and Reasoning sections; include only the Execution Summary and the raw outputs.