Runs iterative refinement loops to improve artifacts like codebases, documents, prompts, pipelines using evidence-based judges, optional evaluators, and auto-selected single/multi-judge modes based on complexity.
npx claudepluginhub 2389-research/claude-plugins --plugin simmer
Slash command: /simmer:simmer
Iterative refinement loop — take an artifact (single file or workspace) and hone it repeatedly against user-defined criteria until it's as good as it can get.
Related skills (test-kitchen family):
- test-kitchen:omakase-off: don't know what you want → parallel designs → react → pick
- test-kitchen:cookoff: know what you want, it's code → parallel implementations → fixed criteria → steal the best
- simmer: know what you want, it's anything → user-defined criteria → iterate until good
"Simmer this" / "Refine this" / "Optimize this pipeline"
↓
┌─────────────────────────────────────┐
│ SETUP (identify + criteria) │
│ Load simmer-setup subskill │
│ │
│ Output: artifact, rubric, N iters, │
│ evaluator (optional), │
│ background (optional) │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ LOOP (default 3 iterations) │
│ │
│ Each iteration: │
│ 1. Dispatch generator subagent │
│ 2. Run evaluator (if present) │
│ 3. Dispatch judge subagent │
│ 4. Load reflect subskill │
│ │
│ Generator gets: candidate + ASI │
│ + background │
│ Judge gets: candidate + rubric │
│ + evaluator output (if any) │
│ Reflect gets: full score history │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ OUTPUT │
│ Best candidate → result file │
│ Score trajectory displayed │
└─────────────────────────────────────┘
Trigger when user wants iterative refinement of any kind:
Judge mode is auto-selected by setup based on problem complexity:
| Condition | JUDGE_MODE |
|---|---|
| text/creative, ≤2 criteria, short artifact (email, tweet, tagline) | single |
| text/creative, ≥3 criteria or long/complex artifact | board |
| code/testable (any) | board |
| pipeline/engineering (any) | board |
| User says "with a single judge" | single (override) |
| User says "with a judge board" or "with a panel" | board (override) |
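The selection table can be read as a small decision function. The sketch below is one possible encoding, not the setup subskill's actual logic. The function name, the parameter names, and the `"text"`, `"code"`, `"pipeline"` labels are hypothetical, and sending unknown problem classes to the board is an assumption.

```python
from typing import Optional


def select_judge_mode(
    problem_class: str,
    criteria_count: int,
    short_artifact: bool,
    user_override: Optional[str] = None,
) -> str:
    """Return "single" or "board" following the table above.

    problem_class uses hypothetical labels: "text", "code", "pipeline".
    """
    # An explicit user request ("with a single judge" / "with a panel") wins.
    if user_override in ("single", "board"):
        return user_override
    # Code/testable and pipeline/engineering work always gets the board.
    if problem_class in ("code", "pipeline"):
        return "board"
    # Text/creative: single judge only for few criteria AND a short artifact.
    if problem_class == "text" and criteria_count <= 2 and short_artifact:
        return "single"
    # Everything else, including unknown classes (assumption), gets the board.
    return "board"
```

For example, a tagline with two criteria resolves to `single`, while the same tagline with three criteria resolves to `board`.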
Plateau upgrade: If the loop started with a single judge and detects a plateau (3 iterations without improvement), offer: "Scores have plateaued. Switch to judge board for deeper diagnosis?" If the user accepts, switch to JUDGE_MODE: board for remaining iterations.
Not simmer: If the artifact is code and the user wants parallel implementations, use cookoff instead.
Announce: "I'm using the simmer skill to set up iterative refinement."
Track progress (TodoWrite if available, otherwise inline):
Invoke simmer:simmer-setup.
Do not attempt to identify the artifact or ask about criteria yourself — that is the setup subskill's job.
Shortcut: If the user (or calling system) has already provided artifact, criteria (each with at least one sentence describing what a high score looks like), iteration count, mode, and optionally evaluator/background, skip the setup subskill entirely. Construct the setup brief directly and proceed to Phase 2.
Setup returns a brief:
ARTIFACT: [content, file path, or directory path]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA:
- [criterion 1]: [what better looks like]
- [criterion 2]: [what better looks like]
- [criterion 3]: [what better looks like]
PRIMARY: [criterion name — omit if equally weighted]
EVALUATOR: [command to run — omit for judge-only mode]
BACKGROUND: [constraints, available resources, domain knowledge — omit if not needed]
OUTPUT_CONTRACT: [valid output format description — omit for text/creative]
VALIDATION_COMMAND: [quick check command — omit if no cheap validation exists]
SEARCH_SPACE: [what's in scope to explore — omit if unconstrained]
JUDGE_MODE: [single | board — auto-selected by setup based on complexity. User can override]
JUDGE_PANEL: [optional custom judge definitions — omit to use defaults for problem class]
ITERATIONS: [N]
MODE: [seedless | from-file | from-paste | from-workspace]
OUTPUT_DIR: [path, default: docs/simmer]
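A calling system that wants to use the shortcut could check a brief before skipping setup. This is a minimal sketch under stated assumptions: the brief is held as a dict keyed by the field names above, `CRITERIA` maps each criterion name to its description, and "at least one sentence" is approximated as a non-empty description. `can_skip_setup` is a hypothetical helper, not part of simmer.

```python
REQUIRED_FIELDS = ("ARTIFACT", "CRITERIA", "ITERATIONS", "MODE")
VALID_MODES = {"seedless", "from-file", "from-paste", "from-workspace"}


def can_skip_setup(brief: dict) -> bool:
    """True if the brief already carries everything the shortcut requires."""
    # Every required field must be present and non-empty.
    if any(not brief.get(field) for field in REQUIRED_FIELDS):
        return False
    # MODE must be one of the four values listed in the brief template.
    if brief["MODE"] not in VALID_MODES:
        return False
    criteria = brief["CRITERIA"]
    if not isinstance(criteria, dict):
        return False
    # Each criterion needs a description of what a high score looks like.
    return all(isinstance(d, str) and bool(d.strip()) for d in criteria.values())
```

EVALUATOR and BACKGROUND are optional in the brief, so the check ignores them.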
For single-file mode:
mkdir -p {OUTPUT_DIR}
For workspace mode:
# Create initial commit to snapshot the seed state
cd {ARTIFACT}
git add -A && git commit -m "simmer: iteration 0 — seed state"
Iteration counting:
"N iterations" means N generate-judge-reflect cycles AFTER the initial seed judgment. The seed judgment is iteration 0 (not counted toward N). So ITERATIONS: 3 means iteration 0 (judge the seed only), followed by iterations 1, 2, and 3 (generate, judge, reflect).
For seedless mode: iteration 1 generates the initial candidate AND judges it. ITERATIONS: 3 means 3 generation passes total.
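The counting rule is easy to get off by one, so here is a small sketch that lists the iteration numbers for a run. `iteration_schedule` is a hypothetical helper. Each tuple is (iteration number, whether the generator runs).

```python
from typing import List, Tuple


def iteration_schedule(iterations: int, mode: str) -> List[Tuple[int, bool]]:
    """List (iteration_number, generator_runs) pairs for a simmer run."""
    schedule: List[Tuple[int, bool]] = []
    if mode != "seedless":
        # A seed exists: iteration 0 only judges it and does not count toward N.
        schedule.append((0, False))
    # Iterations 1..N each run the generator. In seedless mode, iteration 1
    # produces the initial candidate and judges it.
    schedule.extend((n, True) for n in range(1, iterations + 1))
    return schedule
```

With `ITERATIONS: 3`, a seeded run has four judged iterations (0 through 3) and a seedless run has three (1 through 3).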
Iteration 0 (seed):
Single-file mode:
{OUTPUT_DIR}/iteration-0-candidate.md
Workspace mode:
Each iteration:
Step 1: Generator (subagent)
Invoke simmer:simmer-generator as a subagent.
Single-file subagent prompt:
You are the generator in a simmer refinement loop.
Invoke the skill: simmer:simmer-generator
ITERATION: [N]
ARTIFACT_TYPE: single-file
CRITERIA:
[rubric from setup]
CURRENT CANDIDATE:
[full text of current best candidate]
JUDGE FEEDBACK (ASI from previous round):
[ASI text, or "First iteration — generate initial candidate" if seedless iteration 1]
Write your improved candidate to: {OUTPUT_DIR}/iteration-[N]-candidate.md
(or appropriate extension matching artifact type)
Report: what specifically changed and why (2-3 sentences).
Workspace subagent prompt:
You are the generator in a simmer refinement loop.
Invoke the skill: simmer:simmer-generator
ITERATION: [N]
ARTIFACT_TYPE: workspace
WORKSPACE: [directory path]
CRITERIA:
[rubric from setup]
BACKGROUND:
[constraints, available resources, domain knowledge from setup]
OUTPUT_CONTRACT:
[valid output format — omit if not specified in setup]
VALIDATION_COMMAND:
[quick check command — omit if not specified in setup]
SEARCH_SPACE:
[what's in scope to explore — omit if not specified in setup]
JUDGE FEEDBACK (ASI from previous round):
[ASI text — may describe coordinated changes across multiple files]
EXPLORATION STATUS:
[from reflect: what's been tried vs untried — omit on iteration 1 or if no search space]
Make your changes directly in the workspace directory.
You may edit multiple files in a single iteration when the ASI calls for coordinated changes.
If making infrastructure changes, run VALIDATION_COMMAND (if available) before reporting success.
Report: what specifically changed and why (2-3 sentences).
Step 2: Run Evaluator (if present)
If the setup brief includes an EVALUATOR command:
cd {ARTIFACT} # for workspace mode
{EVALUATOR}
Capture stdout and stderr. This output will be passed to the judge.
Timeouts: Set generous timeouts for evaluator commands. If the evaluator involves LLM inference, network calls, or large data processing, allow 10-60 minutes per run. The orchestrator should not time out before the evaluator completes.
If no evaluator, skip this step.
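If the orchestrator runs the evaluator from a script rather than a shell, the step might look like the sketch below. `run_evaluator` and the output layout are hypothetical. The 3600-second default follows the upper end of the 10-60 minute guidance above.

```python
import subprocess
from typing import Optional


def run_evaluator(command: str, workdir: Optional[str] = None,
                  timeout_s: int = 3600) -> str:
    """Run the EVALUATOR command and return its output for the judge."""
    # shell=True because EVALUATOR is given as a single command string.
    # cwd is the workspace directory in workspace mode, None otherwise.
    result = subprocess.run(
        command,
        shell=True,
        cwd=workdir,
        capture_output=True,  # capture both stdout and stderr
        text=True,
        timeout=timeout_s,    # raises subprocess.TimeoutExpired if exceeded
    )
    # A non-zero exit code is reported to the judge rather than raised,
    # since a failing evaluator run is itself useful evidence.
    return (
        f"exit code: {result.returncode}\n"
        f"--- stdout ---\n{result.stdout}\n"
        f"--- stderr ---\n{result.stderr}"
    )
```

The returned string is what would be pasted into the EVALUATOR OUTPUT block of the judge prompt.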
Step 3: Judge (subagent or judge board)
If JUDGE_MODE: board: Invoke simmer:simmer-judge-board instead of the single judge. Pass it all the same context below, plus JUDGE_PANEL if specified in the setup brief. The board dispatches multiple judges, runs deliberation, and returns output in the exact same format as a single judge. The rest of the loop (reflect, generator) is unchanged.
Include file paths so judges can investigate. In addition to pasted content, pass:
Judges need to read these files themselves — not just the pre-digested summaries in the prompt. A judge who reads the evaluator script discovers exact-match scoring on iteration 0 instead of learning it through 3 iterations of trial and error.
Otherwise: Invoke simmer:simmer-judge as a subagent.
Without evaluator:
You are the judge in a simmer refinement loop.
Invoke the skill: simmer:simmer-judge
ITERATION: [N]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA:
[rubric from setup]
CANDIDATE:
[full text of candidate, or key files from workspace]
SEED CALIBRATION:
[full text of original seed artifact, or key seed files]
SEED SCORES:
[iteration 0 scores — omit this block on iteration 0]
Score this candidate against the criteria using the seed as a calibration reference.
Do NOT look at or consider any intermediate iteration scores.
With evaluator:
You are the judge in a simmer refinement loop.
Invoke the skill: simmer:simmer-judge
ITERATION: [N]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA:
[rubric from setup]
CANDIDATE:
[full text of candidate, or key files from workspace]
EVALUATOR OUTPUT:
[stdout and stderr from the evaluator command]
SEED CALIBRATION:
[full text of original seed artifact, or key seed files]
SEED SCORES:
[iteration 0 scores — omit this block on iteration 0]
OUTPUT_CONTRACT:
[valid output format — omit if not specified in setup]
SEARCH_SPACE:
[what's in scope to explore — omit if not specified in setup]
PREVIOUS ASI:
[the ASI from the previous judge round — omit on iteration 0]
ITERATION HISTORY:
[condensed trajectory: iteration number, scores, config, key change for each
prior iteration — omit on iteration 0]
EXPLORATION STATUS:
[from reflect: what's been tried vs untried in the search space — omit on
iteration 0 or if no search space specified]
Interpret the evaluator output alongside the criteria.
Check evaluator output against the output contract if specified.
Score this candidate using the seed as a calibration reference.
Use the iteration history, previous ASI, and exploration status to inform
your ASI — analyze what's been tried, what worked, what didn't, and propose
an evidence-based direction. You may research approaches if the current
path is stuck.
Step 4: Reflect (inline, load subskill)
Invoke simmer:simmer-reflect.
Provide: full score history across all iterations so far, current iteration number, max iterations, judge output from this round.
After reflect completes, display the updated trajectory table to the user. Show the full table so far — the user should see scores accumulate row by row as the loop runs. This is especially important during long evaluator runs where the user otherwise sees nothing for 10-15 minutes per iteration.
Iteration 2 complete.
| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0 | 4 | 5 | 3 | 4.0 | seed |
| 1 | 7 | 5 | 4 | 5.3 | specific problem statement |
| 2 | 7 | 6 | 6 | 6.3 | low-friction CTA |
Best so far: iteration 2 (6.3/10). 1 iteration remaining.
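The composite values in the example match an unweighted mean of the criterion scores rounded to one decimal. That is an inference from the numbers shown, not a documented formula. Under that assumption, the table can be rendered as follows. `composite`, `trajectory_table`, and the history entry keys are hypothetical.

```python
from typing import Dict, List


def composite(scores: Dict[str, int]) -> float:
    """Unweighted mean of criterion scores, rounded to one decimal (assumed)."""
    return round(sum(scores.values()) / len(scores), 1)


def trajectory_table(criteria: List[str], history: List[dict]) -> str:
    """Render the running score table, one row per judged iteration."""
    header = "| Iter | " + " | ".join(criteria) + " | Composite | Key Change |"
    divider = "|" + "---|" * (len(criteria) + 3)
    rows = []
    for entry in history:
        cells = " | ".join(str(entry["scores"][c]) for c in criteria)
        rows.append(
            f"| {entry['iter']} | {cells} | "
            f"{composite(entry['scores'])} | {entry['change']} |"
        )
    return "\n".join([header, divider, *rows])
```

Calling `trajectory_table` after every reflect step and printing the whole result gives the row-by-row accumulation described above.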
Handling regression: If reflect reports that this iteration scored lower than best-so-far:
git checkout <best-commit> -- <workspace-files>. Do NOT revert trajectory.md or other tracking files in {OUTPUT_DIR}.
Plateau detection: If the best-so-far score (primary criterion if set, otherwise composite) has not improved for 3 consecutive iterations (including regressions that were rolled back):
If currently using single judge (JUDGE_MODE: single): Offer upgrade: "Best score has not improved for 3 iterations (best: N.N/10 at iteration M). Switch to judge board for deeper diagnosis, or stop?" If the user accepts the upgrade, switch to JUDGE_MODE: board and add 2 iterations to the remaining count (the board typically needs 2-3 iterations to surface and act on new insights). The board's multi-perspective deliberation often surfaces blind spots the single judge missed.
If already using board: Offer early termination: "Best score has not improved for 3 iterations with the judge board (best: N.N/10 at iteration M). Continue or stop?"
This catches both flat plateaus and oscillation around a ceiling. Especially important when evaluator runs are expensive (minutes to hours per iteration).
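One way to express the plateau rule in code: compare the best score from the last 3 iterations with the best score before them. This is a sketch, not the reflect subskill's implementation. `has_plateaued`, `plateau_offer`, and the returned labels are hypothetical. `tracked_scores[i]` is the judged score at iteration i (primary criterion if set, otherwise composite), including iterations that were later rolled back.

```python
from typing import List


def has_plateaued(tracked_scores: List[float], window: int = 3) -> bool:
    """True if none of the last `window` iterations beat the earlier best."""
    # There must be at least one earlier score to compare against.
    if len(tracked_scores) <= window:
        return False
    best_before = max(tracked_scores[:-window])
    # Ties count as no improvement, so oscillation around a ceiling is caught.
    return max(tracked_scores[-window:]) <= best_before


def plateau_offer(judge_mode: str) -> str:
    """Pick which offer to make once a plateau is detected."""
    # Single judge: offer the board upgrade. Board: offer early termination.
    return "upgrade_to_board" if judge_mode == "single" else "offer_early_stop"
```

If the user accepts the board upgrade, the orchestrator would also add 2 to the remaining iteration count, as described above.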
After all iterations complete:
Single-file mode:
{OUTPUT_DIR}/result.md
Workspace mode:
If user continues: carry forward best candidate as new seed, continue iteration numbering (e.g., iterations 4, 5, 6), run 3 more.
Single-file mode:
{OUTPUT_DIR}/
iteration-0-candidate.md # Seed (or seedless first generation)
iteration-1-candidate.md # Each improved candidate
iteration-2-candidate.md
iteration-3-candidate.md
trajectory.md # Running score table
result.md # Final best output
Workspace mode:
{WORKSPACE}/ # The target directory
[project files] # Modified in place by generator
{OUTPUT_DIR}/ # Tracking files (can be inside or outside workspace)
trajectory.md # Running score table
Iterations are tracked via git commits in workspace mode rather than separate candidate files.
{OUTPUT_DIR} defaults to docs/simmer. Override via setup brief's OUTPUT_DIR field.
If you cannot dispatch separate subagents (e.g., nested Claude sessions are blocked, or you're running in a constrained environment), execute all roles sequentially.
Context discipline is aspirational in single-agent mode. You will see prior scores and evaluator output. Mitigate bias by:
Per-iteration checklist (single-agent):
{OUTPUT_DIR}/trajectory.md. Note best-so-far. If regression, flag it and roll back to best candidate. Skip the formal "output to orchestrator" block; just update the file and continue.
This is critical for consistent results:
| Subskill | Receives | Does NOT receive |
|---|---|---|
| Generator | Current candidate, criteria, ASI from last judge, background, exploration status | Score history, previous candidates, evaluator output |
| Judge (text/creative) | Current candidate, criteria, iteration number, seed + seed scores | Intermediate scores, intermediate candidates, previous ASI, trajectory |
| Judge (code/pipeline) | Current candidate, criteria, iteration number, seed + seed scores, evaluator output, previous ASI, iteration history, search space, exploration status | Full candidate history |
| Judge Board | Same as single judge per problem class, plus: other panelists' scores during deliberation | Other panelists' ASI candidates (withheld until synthesis) |
| Reflect | Full score history, all iteration summaries, search space | Candidate content (just scores + summaries) |
The generator improves based on specific feedback (ASI) and available resources (background), not scores. The judge scores against criteria definitions, evaluator output, and the seed as a fixed calibration reference — no intermediate scores. The judge board preserves these same rules per panelist — deliberation adds within-iteration cross-judge visibility only, no new cross-iteration information. The reflect subskill is the only one that sees the full trajectory.
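An orchestrator can enforce these rules mechanically by filtering its state through an allow-list before building each subagent prompt. The sketch below mirrors the table for four of the roles. The role names, state keys, and `context_for` helper are all hypothetical.

```python
# Keys each role may see, taken from the "Receives" column of the table.
JUDGE_TEXT = {"candidate", "criteria", "iteration", "seed", "seed_scores"}
ALLOWED_CONTEXT = {
    "generator": {"candidate", "criteria", "asi", "background",
                  "exploration_status"},
    "judge_text": JUDGE_TEXT,
    "judge_code": JUDGE_TEXT | {"evaluator_output", "previous_asi",
                                "iteration_history", "search_space",
                                "exploration_status"},
    "reflect": {"score_history", "iteration_summaries", "search_space"},
}


def context_for(role: str, state: dict) -> dict:
    """Return only the parts of the orchestrator state this role may see."""
    allowed = ALLOWED_CONTEXT[role]
    # Anything not on the allow-list is dropped, so a new state field stays
    # hidden from every role until it is explicitly added here.
    return {key: value for key, value in state.items() if key in allowed}
```

An allow-list is used rather than a deny-list so that forgetting to classify a new field errs on the side of withholding it.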
| Dependency | Usage |
|---|---|
| parallel-agents | superpowers:dispatching-parallel-agents; fallback: dispatch sequentially |
Giving the generator score history
Giving the judge previous scores
Trying to fix everything at once (single-file mode)
Treating ASI as always single-edit (workspace mode)
Sharing candidate history with the judge
Not tracking best candidate separately
Not rolling back on regression (workspace mode)
git checkout <best-commit> -- <files>. Do NOT revert trajectory.md.
Dual-writing embedded artifacts
Reverting trajectory on git rollback (workspace mode)
git checkout <commit> reverts ALL files including trajectory.md tracking
git checkout <commit> -- file1 file2. Keep trajectory.md and other tracking files outside the rollback scope.
User: "Simmer this" [pastes a pitch email]
Claude: I'm using the simmer skill to set up iterative refinement.
[Invokes simmer-setup]
Setup identifies: pitch email, suggests criteria
User accepts: value prop clarity, tone match, call to action strength
Iterations: 3
[Iteration 0: Judge scores seed — 4.0/10]
[Iteration 1: Generator fixes value prop → 5.3/10]
[Iteration 2: Generator fixes CTA → 6.3/10]
[Iteration 3: Generator fixes tone and sharpens CTA → 7.3/10]
Trajectory:
| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0 | 4 | 5 | 3 | 4.0 | seed |
| 1 | 7 | 5 | 4 | 5.3 | specific problem statement |
| 2 | 7 | 6 | 6 | 6.3 | lower-friction CTA |
| 3 | 7 | 7 | 8 | 7.3 | specific asset in CTA |
Best candidate: iteration 3 (7.3/10)
3 iterations complete. Run 3 more?
User: "Simmer this pipeline — find the best model and prompt setup"
Claude: I'm using the simmer skill to set up iterative refinement.
[Invokes simmer-setup]
Setup identifies: workspace at ./pipeline/
Evaluator: python evaluate.py --input output.json
Background: "Available models: claude-sonnet, gpt-4o-mini, llama-8b, llama-70b.
Topologies: single-call, multi-step chain, parallel fan-out.
Budget: <$0.01/call, <2s latency."
Criteria: accuracy, cost efficiency, latency
Iterations: 5
[Iteration 0: Run evaluator on seed, judge scores — 3.7/10]
accuracy: 6/10, cost: 2/10, latency: 3/10
ASI: "Using claude-sonnet for a simple extraction task. The model is
overkill — accuracy is fine but cost is 5x over budget. Switch to
gpt-4o-mini which handles extraction well at 1/10th the cost."
[Iteration 1: Generator swaps model + adjusts prompt → 6.7/10]
accuracy: 5/10, cost: 8/10, latency: 7/10
ASI: "Cost and latency are great now but accuracy dropped on multi-step
reasoning tasks (cases 7, 12). Split into two calls — extraction
on mini, reasoning on sonnet — to get accuracy back without
blowing the budget."
[Iteration 2: Generator restructures to 2-step chain → 7.0/10]
accuracy: 7/10, cost: 7/10, latency: 7/10
ASI: "Architecture is solid. The extraction prompt is too generic —
add 3 few-shot examples from the test cases to anchor the format."
[Iteration 3: Generator adds few-shot examples → 7.7/10]
...
Best candidate: iteration 3 (7.7/10)
5 iterations complete. Run 3 more?