Run iterative loops to refine code artifacts or workspaces using AI judges for criteria-based scoring, consensus feedback, highest-leverage directions (ASI), optional runnable evaluators, targeted improvements, progress tracking, and regression rollbacks.
npx claudepluginhub 2389-research/claude-plugins --plugin simmer
Generator subskill for simmer. Produces an improved version of the artifact based on the judge's ASI feedback. Handles both single-file and workspace targets. Do not invoke directly; dispatched as a subagent by the simmer orchestrator.
Judge board subskill for simmer. Dispatches a panel of judges with different lenses, runs one deliberation round where they challenge each other's scores, then synthesizes consensus scores + single ASI. Drop-in replacement for simmer-judge that produces identical output format. Do not invoke directly — dispatched by the simmer orchestrator when JUDGE_MODE is board.
Judge subskill for simmer. Scores a candidate artifact against user-defined criteria on a 1-10 scale and produces ASI (highest-leverage direction) for the next generator round. Supports judge-only, runnable evaluator, and hybrid evaluation modes. Do not invoke directly — dispatched as a subagent by the simmer orchestrator.
Reflect subskill for simmer. Records iteration results in trajectory table, tracks best candidate, handles regression rollback, and passes ASI forward to the next round. Supports both single-file and workspace modes. Do not invoke directly — called by simmer orchestrator after each judge round.
Setup subskill for simmer. Inspects the artifact or workspace, infers evaluation contracts and search space, proposes a complete assessment to the user, and produces a setup brief after confirmation. Conversational, not form-based — the agent does the work of understanding the problem, then presents what it found. Do not invoke directly — called by simmer orchestrator.
You wrote a prompt. It works. But is it good? Simmer runs your artifact through multiple rounds of criteria-driven refinement — each round, a panel of judges reads your code, understands the problem, and proposes specific improvements.
Iterative artifact refinement — take any artifact or workspace and hone it over multiple rounds using criteria-driven feedback.
/plugin marketplace add 2389-research/claude-plugins
/plugin install simmer@2389-research
One skill (simmer) with four subskills that run the refinement loop:
Say any of these to start a simmer loop:
"simmer this", "refine this", "hone this", "iterate on this", "make this better", "improve this over a few rounds", "polish this", "tighten this up", "optimize this pipeline", "find the best model for this task", "tune this configuration"
Any request to iteratively improve an artifact triggers simmer.
| Situation | Use | Why |
|---|---|---|
| You have an artifact and want it better | simmer | Iterative refinement against criteria |
| You want parallel code implementations competing | cookoff | Parallel generation + fixed criteria + pick winner |
| You don't know what you want yet | omakase-off | Parallel designs + react + discover preference |
Rule of thumb: Simmer = serial iteration. Cookoff/omakase-off = parallel exploration.
User: "Simmer this" [pastes a pitch email]
Claude: I'm using the simmer skill to set up iterative refinement.
What does "better" mean for this? I'd suggest:
1. Value prop clarity
2. Tone match
3. Call to action strength
User: "Those work"
[Iteration 0: Judge scores seed — 4.0/10]
[Iteration 1: Generator fixes value prop → 5.3/10]
[Iteration 2: Generator fixes CTA → 6.3/10]
[Iteration 3: Generator fixes tone → 7.3/10]
Trajectory:
| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0 | 4 | 5 | 3 | 4.0 | seed |
| 1 | 7 | 5 | 4 | 5.3 | specific problem statement |
| 2 | 7 | 6 | 6 | 6.3 | low-friction CTA |
| 3 | 7 | 7 | 8 | 7.3 | peer-sharing tone |
Best candidate: iteration 3 (7.3/10)
3 iterations complete. Run 3 more?
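The composite column in the trajectory above matches an unweighted mean of the three criterion scores, rounded to one decimal. The sketch below reproduces those numbers and picks the best candidate. The formula is inferred from the table, not taken from simmer's source, and the names `composite`, `best_candidate`, and `should_roll_back` are hypothetical. The rollback rule (fall back to the best earlier candidate when the latest round scores lower) is likewise an assumption about how the reflect step might behave.

```python
# Per-criterion scores copied from the trajectory table above.
trajectory = [
    {"iter": 0, "scores": {"value_prop": 4, "tone": 5, "cta": 3}},
    {"iter": 1, "scores": {"value_prop": 7, "tone": 5, "cta": 4}},
    {"iter": 2, "scores": {"value_prop": 7, "tone": 6, "cta": 6}},
    {"iter": 3, "scores": {"value_prop": 7, "tone": 7, "cta": 8}},
]


def composite(scores):
    """Unweighted mean of the criterion scores, rounded to one decimal (assumed formula)."""
    return round(sum(scores.values()) / len(scores), 1)


def best_candidate(rows):
    """Return (iteration, composite) for the top-scoring row; the earliest row wins ties."""
    best = max(rows, key=lambda row: composite(row["scores"]))
    return best["iter"], composite(best["scores"])


def should_roll_back(rows):
    """Assumed rule: roll back when the latest round scores below the best seen so far."""
    _, best_score = best_candidate(rows)
    return composite(rows[-1]["scores"]) < best_score


composites = [composite(row["scores"]) for row in trajectory]
print(composites)                    # [4.0, 5.3, 6.3, 7.3]
print(best_candidate(trajectory))    # (3, 7.3)
print(should_roll_back(trajectory))  # False
```

In this run every round improves on the last, so no rollback would trigger and iteration 3 stays the best candidate.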
User: "Simmer this pipeline — find the best setup for this task"
Claude inspects the workspace, finds evaluate.sh and config.json,
and proposes criteria + evaluation method:
Evaluator: ./evaluate.sh
Criteria: accuracy, cost efficiency, latency
Search space: models, prompt text, pipeline topology
User: "Looks good, accuracy is the priority. 5 iterations."
[Iteration 0: Run evaluator on seed, judge scores — 3.7/10]
[Iteration 1: Generator swaps to cheaper model → 5.3/10]
[Iteration 2: Generator splits into 2-step chain → 7.0/10]
[Iteration 3: Generator adds few-shot examples → 7.7/10]
...
Best candidate: iteration 4 (8.1/10)
| Artifact type | Suggested criteria |
|---|---|
| Document / spec | clarity, completeness, actionability |
| Creative writing | narrative tension, specificity, voice consistency |
| Email / comms | value prop clarity, tone match, call to action strength |
| Prompt / instructions | instruction precision, output predictability, edge case coverage |
| API design | contract completeness, developer ergonomics, consistency |
| Pipeline / workflow | coverage, efficiency, noise |
| Configuration / infra | correctness, resource efficiency, maintainability |
| Mode | When to use |
|---|---|
| Judge-only (default) | Text artifacts — judge scores against criteria |
| Runnable | Code/pipelines — judge interprets script output |
| Hybrid | Both — run script AND judge results against criteria |
No format contract on evaluator output. The judge reads whatever your script produces — test results, metrics, error logs, anything.
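Because the evaluator has no output contract, a harness only needs to capture whatever the script emits and hand it to the judge as text. The following is a minimal sketch of that idea using Python's standard `subprocess` module. It is not simmer's implementation: `run_evaluator` and the report layout are hypothetical, and the demo command is a stand-in for a real script such as `./evaluate.sh`.

```python
import subprocess
import sys


def run_evaluator(command, timeout_s=300):
    """Run an evaluator command and return its raw output as one text blob.

    Nothing is parsed: the exit code, stdout, and stderr are passed through
    unchanged so that a judge can interpret them.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout_s
    )
    return (
        f"exit code: {result.returncode}\n"
        f"--- stdout ---\n{result.stdout}"
        f"--- stderr ---\n{result.stderr}"
    )


# Stand-in evaluator for illustration; a workspace would use e.g. ["./evaluate.sh"].
demo_command = [sys.executable, "-c", "print('accuracy=0.82 latency_ms=140')"]
report = run_evaluator(demo_command)
print(report)
```

Keeping stderr and the exit code in the report means a crashing evaluator still gives the judge something to read, which fits the "error logs, anything" case described above.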
Simmer auto-selects between a single judge and a multi-judge board based on complexity: