Use when comparing two variants (code, LLM prompts, CLI commands, or any executable) against defined criteria with the same inputs. Do NOT use when variants cannot produce observable, comparable output.
A structured harness for comparing two variants using parallel subagent execution and model grader judgment. Works for any comparison where both sides produce observable output.
| Type | How it runs | Example |
|---|---|---|
| `code` | Implement in isolated worktree, run tests | Refactoring, algorithm swap |
| `llm` | Call LLM with given config against all INPUTS | Prompt A vs prompt B |
| `command` | Run shell command against all INPUTS, capture stdout/stderr | CLI flag comparison |
| `custom` | Freeform; subagent follows instructions literally | API call, config swap |
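As a concrete illustration of the `command` type, here is a minimal shell sketch of running two variants against the same shared inputs. The commands (`wc -w` vs `wc -c`) and the inputs are invented for the example; a real subagent would substitute the variant definitions from the spec.

```shell
# Hypothetical command-type comparison: the same inputs piped through both variants.
inputs=("hello world" "foo bar baz")
variant_a='wc -w'   # Variant A: count words (example command)
variant_b='wc -c'   # Variant B: count bytes (example command)
for input in "${inputs[@]}"; do
  out_a=$(printf '%s' "$input" | $variant_a 2>&1)  # capture stdout and stderr
  out_b=$(printf '%s' "$input" | $variant_b 2>&1)
  printf 'input=%q A=%s B=%s\n' "$input" "$out_a" "$out_b"
done
```

Capturing stderr alongside stdout matters here: a variant that errors on some input should surface that error in its recorded output rather than silently producing an empty result.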
| Field | Description | Default |
|---|---|---|
| TASK | What is being compared | (required) |
| VARIANT_A | type: + config/instructions | type: code, current code unchanged |
| VARIANT_B | type: + config/instructions | (required) |
| INPUTS | Shared test inputs passed to both variants | (optional, but required for llm/command) |
| Evals | Checklist of judgment criteria | (required) |
| Grader | auto or none | auto |
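A hypothetical filled-in spec, with every value invented purely to show the shape, might look like:

```
TASK: Compare two ways of measuring input size
VARIANT_A: type: command, `wc -w`
VARIANT_B: type: command, `wc -c`
INPUTS: "hello world", "foo bar baz"
Evals:
- Output is a single integer per input
- No errors on any input
Grader: auto
```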
When Grader is auto, the model grader always runs; it also runs `bats tests/` if either variant is of type code.
Worktree rule: Create worktrees only for variants of type code. A code variant always runs in its own worktree; non-code variants do not.
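The isolation the Agent tool provides corresponds roughly to `git worktree`. A self-contained sketch, assuming a throwaway scratch repo (all paths and names here are invented for the example):

```shell
# Sketch: a code variant gets its own worktree and branch (hypothetical paths).
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=eval -c user.email=eval@example.com \
  commit -q --allow-empty -m "base"
wt="$repo-variant-b"                                  # sibling dir for Variant B
git -C "$repo" worktree add -q -b eval/variant-b "$wt"  # isolated checkout
ls -d "$wt"
```

Each code variant editing its own worktree means neither subagent can see or clobber the other's changes, which is what makes the parallel launch below safe.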
Guard: If VARIANT_A and VARIANT_B are textually identical, stop — do not proceed.
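A minimal sketch of that guard for file-based variant definitions (file names and contents are invented for the example):

```shell
# Guard: stop when the two variant definitions are textually identical.
printf 'wc -w\n' > variant_a.cfg   # hypothetical Variant A definition
printf 'wc -c\n' > variant_b.cfg   # hypothetical Variant B definition
if diff -q variant_a.cfg variant_b.cfg >/dev/null; then
  verdict="stop: variants are identical"
else
  verdict="proceed"
fi
echo "$verdict"
```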
Setup: for each variant of type code, create a worktree via the Agent tool with isolation: "worktree"; non-code variants need no worktree. Launch two subagents in a single parallel message. Each subagent:
- code: implement changes in the worktree, run `bats tests/`, capture full output
- llm: call the model with the prompt config against each input, collect all responses
- command: run the command against each input, capture stdout/stderr
- custom: follow the freeform instructions, capture all observable output

Each subagent reports back:

- VARIANT: A or B
- TYPE: variant type
- EXEC_SUMMARY: what was run
- OUTPUTS: raw outputs per input (responses, stdout, test results)
- NOTES: errors, anomalies, or partial failures

On failure: if a variant cannot execute, record the error in NOTES and return what was collected. Do not fabricate output.
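For instance, a subagent that ran the hypothetical `wc -c` command variant might return a record shaped like this (all values invented):

```
VARIANT: B
TYPE: command
EXEC_SUMMARY: ran `wc -c` against 2 inputs
OUTPUTS:
  "hello world" -> 11
  "foo bar baz" -> 11
NOTES: none
```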
If grader is auto, produce the following report:

## Eval-Harness Report
**Task**: <task>
**Variant A**: <type + description>
**Variant B**: <type + description>
### Execution Summary
<EXEC_SUMMARY per variant; errors if any>
### Per-Criterion Breakdown
| Criterion | Variant A | Variant B |
|-----------|-----------|-----------|
| ... | WIN/LOSE/TIE | WIN/LOSE/TIE |
### Model Grader Verdict
Winner: <Variant A | Variant B | Tie>
### Reasoning
<Judge's reasoning>
### Recommendation
<Which variant to adopt and why; if a variant failed to execute, say so explicitly>
If grader is none: omit the Model Grader Verdict and Reasoning sections; include only the Execution Summary and the raw outputs.