Plugin

simmer

Name: simmer
Author: 2389-research

Run iterative artifact refinement loops where a panel of AI judges scores code against user criteria, deliberates, and generates evidence-based improvement proposals, with automatic rollback on regression

developer-tools

code-quality

What's Inside

Skills: 6

simmer-generator

/simmer-generator

Generator subskill for simmer. Produces an improved version of the artifact based on the judge's ASI feedback. Handles both single-file and workspace targets. Do not invoke directly — dispatched as a subagent by the simmer orchestrator.

simmer-judge-board

/simmer-judge-board

Judge board subskill for simmer. Dispatches a panel of judges with different lenses, runs one deliberation round where they challenge each other's scores, then synthesizes consensus scores + single ASI. Drop-in replacement for simmer-judge that produces identical output format. Do not invoke directly — dispatched by the simmer orchestrator when JUDGE_MODE is board.

simmer-judge

/simmer-judge

Judge subskill for simmer. Scores a candidate artifact against user-defined criteria on a 1-10 scale and produces ASI (highest-leverage direction) for the next generator round. Supports judge-only, runnable evaluator, and hybrid evaluation modes. Do not invoke directly — dispatched as a subagent by the simmer orchestrator.

simmer-reflect

/simmer-reflect

Reflect subskill for simmer. Records iteration results in trajectory table, tracks best candidate, handles regression rollback, and passes ASI forward to the next round. Supports both single-file and workspace modes. Do not invoke directly — called by simmer orchestrator after each judge round.

simmer-setup

/simmer-setup

Setup subskill for simmer. Inspects the artifact or workspace, infers evaluation contracts and search space, proposes a complete assessment to the user, and produces a setup brief after confirmation. Conversational, not form-based — the agent does the work of understanding the problem, then presents what it found. Do not invoke directly — called by simmer orchestrator.

Stats

Version: 3.0.1

Stars: 14

Forks: 3

Maintenance: Fair

License: MIT

Last Commit: May 22, 2026

Added: Apr 3, 2026

Available In

2389-research (84)

Simmer

You wrote a prompt. It works. But is it good? Simmer runs your artifact through multiple rounds of criteria-driven refinement — each round, a panel of judges reads your code, understands the problem, and proposes specific improvements.

Iterative artifact refinement — take any artifact or workspace and hone it over multiple rounds using criteria-driven feedback.

Installation

/plugin marketplace add 2389-research/claude-plugins
/plugin install simmer@2389-research

What This Plugin Provides

One skill (simmer) with four subskills that run the refinement loop:

Setup — identify the artifact (file or workspace), elicit 2-3 quality criteria, determine evaluation method
Generator — produce an improved version based on the judge's ASI (Actionable Side Information — the single highest-leverage fix)
Judge — score the candidate 1-10 per criterion, produce the ASI
Reflect — record the trajectory, track the best candidate across iterations
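
The four steps above can be sketched as a single loop. This is a minimal illustration, not the plugin's implementation: `simmer_loop`, `Round`, `composite`, and the `judge`/`generate` callables are hypothetical names, and two behaviors are assumptions made for the sketch: the composite is an unweighted mean, and after a regression the generator restarts from the best candidate while the latest ASI is still passed forward.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, int]  # criterion name -> score on the 1-10 scale


@dataclass
class Round:
    """One row of the trajectory recorded by the reflect step."""
    iteration: int
    scores: Scores
    composite: float
    asi: str


def composite(scores: Scores) -> float:
    # Assumption: composite is the unweighted mean of per-criterion scores.
    return float(mean(scores.values()))


def simmer_loop(
    seed: str,
    criteria: List[str],
    judge: Callable[[str, List[str]], Tuple[Scores, str]],
    generate: Callable[[str, str], str],
    iterations: int,
) -> Tuple[str, float, List[Round]]:
    """Hypothetical judge -> generate -> reflect loop.

    judge(candidate, criteria) returns (scores, asi);
    generate(candidate, asi) returns a new candidate.
    """
    # Iteration 0: score the seed before changing anything.
    scores, asi = judge(seed, criteria)
    best, best_score = seed, composite(scores)
    trajectory = [Round(0, scores, best_score, asi)]

    for i in range(1, iterations + 1):
        # Always build from the best candidate so far; this is what
        # makes a regression roll back automatically.
        candidate = generate(best, asi)
        scores, asi = judge(candidate, criteria)
        score = composite(scores)
        trajectory.append(Round(i, scores, score, asi))
        if score > best_score:
            best, best_score = candidate, score

    return best, best_score, trajectory
```

In this sketch the trajectory keeps every round, including regressed ones, so the final table still shows what was tried and why it was discarded.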

Trigger Phrases

Say any of these to start a simmer loop:

"simmer this", "refine this", "hone this", "iterate on this", "make this better", "improve this over a few rounds", "polish this", "tighten this up", "optimize this pipeline", "find the best model for this task", "tune this configuration"

Any request to iteratively improve an artifact triggers simmer.

When to Use Simmer vs. Cookoff vs. Omakase-Off

| Situation | Use | Why |
|-----------|-----|-----|
| You have an artifact and want it better | simmer | Iterative refinement against criteria |
| You want parallel code implementations competing | cookoff | Parallel generation + fixed criteria + pick winner |
| You don't know what you want yet | omakase-off | Parallel designs + react + discover preference |

Rule of thumb: Simmer = serial iteration. Cookoff/omakase-off = parallel exploration.

Quick Example: Single File

User: "Simmer this" [pastes a pitch email]

Claude: I'm using the simmer skill to set up iterative refinement.

What does "better" mean for this? I'd suggest:
1. Value prop clarity
2. Tone match
3. Call to action strength

User: "Those work"

[Iteration 0: Judge scores seed — 4.0/10]
[Iteration 1: Generator fixes value prop → 5.3/10]
[Iteration 2: Generator fixes CTA → 6.3/10]
[Iteration 3: Generator fixes tone → 7.3/10]

Trajectory:
| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0    | 4         | 5    | 3   | 4.0       | seed       |
| 1    | 7         | 5    | 4   | 5.3       | specific problem statement |
| 2    | 7         | 6    | 6   | 6.3       | low-friction CTA |
| 3    | 7         | 7    | 8   | 7.3       | peer-sharing tone |

Best candidate: iteration 3 (7.3/10)
3 iterations complete. Run 3 more?
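
The composite values in the trajectory above are consistent with an unweighted mean of the three criterion scores, rounded to one decimal place. The sketch below reproduces them under that assumption; the formula is inferred from the numbers in the table, not taken from the plugin, and the names `trajectory` and `composite` are illustrative.

```python
# Per-criterion scores copied from the trajectory table above.
trajectory = [
    (0, {"value_prop": 4, "tone": 5, "cta": 3}),
    (1, {"value_prop": 7, "tone": 5, "cta": 4}),
    (2, {"value_prop": 7, "tone": 6, "cta": 6}),
    (3, {"value_prop": 7, "tone": 7, "cta": 8}),
]


def composite(scores: dict) -> float:
    # Assumption: unweighted mean, rounded to one decimal place.
    return round(sum(scores.values()) / len(scores), 1)


composites = [composite(scores) for _, scores in trajectory]

# The best candidate is the iteration with the highest composite.
best_iteration, best_scores = max(trajectory, key=lambda row: composite(row[1]))

print(composites)      # [4.0, 5.3, 6.3, 7.3]
print(best_iteration)  # 3
```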

Quick Example: Workspace with Evaluator

User: "Simmer this pipeline — find the best setup for this task"

Claude inspects the workspace, finds evaluate.sh and config.json,
and proposes criteria + evaluation method:

  Evaluator: ./evaluate.sh
  Criteria: accuracy, cost efficiency, latency
  Search space: models, prompt text, pipeline topology

User: "Looks good, coverage is the priority. 5 iterations."

[Iteration 0: Run evaluator on seed, judge scores — 3.7/10]
[Iteration 1: Generator swaps to cheaper model → 5.3/10]
[Iteration 2: Generator splits into 2-step chain → 7.0/10]
[Iteration 3: Generator adds few-shot examples → 7.7/10]
...

Best candidate: iteration 4 (8.1/10)

Works On Anything

| Artifact type | Suggested criteria |
|---------------|--------------------|
| Document / spec | clarity, completeness, actionability |
| Creative writing | narrative tension, specificity, voice consistency |
| Email / comms | value prop clarity, tone match, call to action strength |
| Prompt / instructions | instruction precision, output predictability, edge case coverage |
| API design | contract completeness, developer ergonomics, consistency |
| Pipeline / workflow | coverage, efficiency, noise |
| Configuration / infra | correctness, resource efficiency, maintainability |

Evaluation Modes

| Mode | When to use |
|------|-------------|
| Judge-only (default) | Text artifacts: judge scores against criteria |
| Runnable | Code/pipelines: judge interprets script output |
| Hybrid | Both: run script AND judge results against criteria |

No format contract on evaluator output. The judge reads whatever your script produces — test results, metrics, error logs, anything.
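
Because the evaluator output has no required format, a runnable-mode wrapper only needs to capture whatever the script emits and hand it to the judge as text. The following is a minimal sketch of that idea, not the plugin's code: `run_evaluator`, its parameters, and the report layout are all hypothetical.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_evaluator(
    command: Sequence[str],
    cwd: Optional[str] = None,
    timeout_s: float = 600.0,  # assumed default, not taken from the plugin
) -> str:
    """Run an evaluator command and return its raw output as one text report.

    Nothing is parsed: stdout, stderr, and the exit code are passed through
    so a judge can interpret test results, metrics, or error logs as-is.
    """
    result = subprocess.run(
        list(command),
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return (
        f"exit code: {result.returncode}\n"
        f"--- stdout ---\n{result.stdout}\n"
        f"--- stderr ---\n{result.stderr}"
    )


# Stand-in for something like ./evaluate.sh: a one-line Python "evaluator".
report = run_evaluator([sys.executable, "-c", "print('accuracy=0.82')"])
print(report)
```

A failing evaluator is still useful input in this design: a non-zero exit code and a stack trace on stderr reach the judge unchanged rather than aborting the round.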

Judge Board

Simmer auto-selects between a single judge and a multi-judge board based on complexity:

Simple (short email, tweet, ≤2 criteria) → single judge, fast
Complex (3 or more criteria, long artifact, code, pipelines) → judge board with deliberation
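
The selection rule above could be expressed as a small heuristic. This is a sketch only: the source does not say how "long" is measured or what the thresholds are, so `select_judge_mode`, the `long_threshold_chars` cutoff, and the `"single"` mode name are assumptions. Only `"board"` appears in the plugin's description, as a value of `JUDGE_MODE`.

```python
def select_judge_mode(
    num_criteria: int,
    artifact_chars: int,
    is_code_or_pipeline: bool,
    long_threshold_chars: int = 2000,  # hypothetical cutoff for "long artifact"
) -> str:
    """Pick a judge mode from rough complexity signals.

    Returns "board" for complex targets and "single" (an assumed name)
    for simple ones.
    """
    is_complex = (
        num_criteria >= 3
        or is_code_or_pipeline
        or artifact_chars > long_threshold_chars
    )
    return "board" if is_complex else "single"


# A short tweet with two criteria stays on the fast single-judge path.
print(select_judge_mode(num_criteria=2, artifact_chars=240, is_code_or_pipeline=False))
# A pipeline goes to the board regardless of length or criteria count.
print(select_judge_mode(num_criteria=2, artifact_chars=240, is_code_or_pipeline=True))
```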
