From simmer
Scores candidate artifacts against user criteria on 1-10 scale, produces ASI (highest-leverage directions) for next simmer iteration. Supports judge-only, runnable evaluator, hybrid modes.
npx claudepluginhub 2389-research/claude-plugins --plugin simmerThis skill uses the workspace's default tool permissions.
Score the candidate against each criterion. Identify the highest-leverage direction to pursue next. Your feedback directly drives the next improvement — be specific and actionable.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Builds scalable data pipelines, modern data warehouses, and real-time streaming architectures using Spark, dbt, Airflow, Kafka, and cloud platforms like Snowflake, BigQuery.
Builds production Apache Airflow DAGs with best practices for operators, sensors, testing, and deployment. For data pipelines, workflow orchestration, and batch job scheduling.
Score the candidate against each criterion. Identify the highest-leverage direction to pursue next. Your feedback directly drives the next improvement — be specific and actionable.
Text/creative (judge-only, no evaluator): You do NOT receive intermediate iteration scores, previous ASI, or previous candidates. You receive only the seed as a fixed calibration reference. This prevents score anchoring on subjective judgments.
Code/testable and pipeline/engineering (evaluator present): You receive additional context to enable strategic reasoning:
This additional context lets you reason about why the current approach isn't working and propose informed directions rather than guessing. You still score against the criteria and seed — the history informs your ASI, not your scores.
| Mode | What you receive | How to score |
|---|---|---|
| Judge-only | Candidate + criteria | Score against criteria descriptions using your judgment |
| Runnable | Candidate + criteria + evaluator output | Interpret evaluator output (test results, metrics, logs) alongside criteria |
| Hybrid | Candidate + criteria + evaluator output | Run evaluator provides data, you judge that data against criteria |
In all modes, you score against the criteria. The evaluator output is additional evidence — it doesn't replace your judgment, it informs it.
Evaluator output has no required format. It could be:
3 passed, 2 failed — FAILED: test_reasoning ...)accuracy: 0.82, cost: $0.003, latency: 340ms)Read it as you would read any diagnostic information. Extract what's relevant to the criteria. If the evaluator output is unclear or empty, score based on the candidate and criteria alone.
Stochastic evaluators: If evaluator output shows high variance between runs (common with LLM-based evaluators), note this in your reasoning. Small score changes (1 point or less) on stochastic evaluators may not represent real improvement. The ASI should target changes large enough to exceed the noise floor. If a run produces unexpectedly poor results with the same configuration as a previous better run, note this as a potential infrastructure issue (resource contention, model loading, network latency). Consider recommending a re-run before scoring if the evaluator should be deterministic for a given configuration.
Complete failures: If the evaluator output shows a complete failure (0% on all metrics, errors only, empty output, invalid format), treat this as a FAILURE rather than a normal regression. Score all criteria at 1/10. The ASI should diagnose the failure cause (model incompatibility, JSON format issue, timeout, prompt too long for model) rather than suggesting incremental improvements. Example: "llama4 returned invalid JSON for all test cases — this model doesn't follow JSON formatting instructions reliably. Revert to a known-working model."
On iteration 0, you score the seed — these scores become the calibration baseline.
On iteration 1+, you receive the seed artifact and its scores as a reference point. This gives you two anchors:
Score the current candidate on its own merits using these two anchors. You CAN score below the seed if the candidate regressed. You CAN score equal to the seed if no meaningful improvement occurred on that criterion. The seed is a reference, not a floor.
Do NOT try to remember or reconstruct scores from intermediate iterations. Score against the criterion descriptions and the seed reference only.
Score each criterion on a 1-10 integer scale. No half-points, no decimals. Integer only.
For each criterion:
| Score | Meaning |
|---|---|
| 9-10 | Exceptional — hard to meaningfully improve |
| 7-8 | Strong — clear strengths, minor gaps |
| 5-6 | Adequate — core is there, notable weaknesses |
| 3-4 | Weak — significant problems, needs major work |
| 1-2 | Failing — fundamental issues, near-total rewrite needed |
Compute composite: average of all criterion scores, one decimal place.
When criteria trade off against each other (improving one worsens another), note this explicitly in your reasoning. The composite may not move even when real progress occurs — e.g., coverage improves from 32% to 65% but noise worsens proportionally, so the average stays flat.
In this case, focus your ASI on the dimension with the most remaining headroom rather than trying to balance all criteria simultaneously. If composite has stagnated but individual criteria are moving, call it out: "Composite is flat because coverage and noise are trading off. The next move should focus on reducing noise without sacrificing coverage."
When evaluator output provides precise metrics (percentages, counts, latencies), note the raw metric in your reasoning even though the score is an integer. If the same integer score applies across multiple iterations, the raw metric in the trajectory's evaluator details section serves as the true discriminator for the reflect subskill. Do not use fractional scores — they create false precision in judge-only mode where no evaluator metrics exist.
If the setup brief includes an OUTPUT_CONTRACT, check whether the evaluator output indicates the contract was violated (invalid format, missing fields, wrong schema). Contract violations are more severe than poor scores — they indicate an infrastructure problem, not a quality problem. Score all criteria at 1/10 and direct the ASI at fixing the contract violation rather than optimizing quality.
After scoring, identify the highest-leverage direction to pursue next. The ASI is the most important output — it directly drives what the generator does. Invest time here.
The ASI is a single focused fix — one specific edit that would improve the candidate the most.
The ASI must be:
For very sparse seeds (under ~3 sentences), the ASI should name the single most foundational missing element rather than trying to summarize all gaps.
The ASI is a single strategic direction — one coherent move that may involve coordinated changes across multiple files.
Before writing the ASI, analyze and research:
Analyze evaluator output patterns. Don't just read the top-line metrics. Look at what specifically failed — which test cases, which entity types, what error patterns. Cluster the failures: are they near-misses (spelling), systematic gaps (a whole category), or noise (hallucinations)? The pattern determines the fix.
Review what's been tried. You have the iteration history and exploration status. If the last 2 rounds both tried prompt changes and scores oscillated, prompt changes alone aren't the answer. If there are untried models or topologies in the search space, consider whether the bottleneck is the approach, not the prompt.
Research if stuck. If the current approach has plateaued or oscillated, you are allowed and encouraged to research solutions — read documentation, look at how similar problems are solved, search for relevant techniques. A human engineer hitting a wall would look up best practices, not just keep tweaking the same thing. You should too.
Explore proactively. If the search space has untried options and you're past the halfway point of iterations, suggest exploring one — even if the current approach is still improving. Comparison data across configurations is more valuable than incremental refinement on a single configuration. You can always refine the best option later, but you can't compare if you never tried the alternative. Example: "We're on iteration 2 of 3 and haven't tried qwen3.5:4b yet. Try it now so we have comparison data before iterations run out."
Then propose a direction that's informed by evidence, not just intuition.
The ASI must be:
Example (evidence-based, researched):
Evaluator shows 15 of 22 "missed" entities have near-matches in the "extra"
list (e.g., "mephisto red" vs "mephiston red", "dry brushing" vs "drybrushing").
This is a normalization problem, not a prompt problem — prompt changes have
oscillated for 2 rounds without fixing it. Add a post-processing normalization
step: lowercase, strip whitespace, collapse common spelling variants. This
should convert ~15 false-miss/false-positive pairs into true matches without
changing the model or prompt.
Example (exploring search space):
We've run 3 iterations on qwen3.5:27b with prompt changes — coverage stuck at
62-69%. The 27b model handles JSON formatting well but misses domain-specific
entities consistently. The 9b model hasn't been tried yet and may have different
training data coverage for hobby domains. Try qwen3.5:9b with the current
prompt — validate.sh will catch any format regression. If the 9b performs
similarly, the bottleneck is the single-call topology, not the model.
Not acceptable (vibes-based, no evidence):
Try improving the prompt to capture more entities.
Not acceptable (multiple unrelated directions):
1. Switch to a larger model for reasoning
2. Also add retry logic for timeouts
3. Also refactor the config to use YAML instead of JSON
ITERATION [N] SCORES:
[criterion 1]: [N]/10 — [reasoning] — [specific improvement]
[criterion 2]: [N]/10 — [reasoning] — [specific improvement]
[criterion 3]: [N]/10 — [reasoning] — [specific improvement]
COMPOSITE: [N.N]/10
ASI (highest-leverage direction):
[concrete, specific, actionable instruction for the generator]
CRITICAL: Use this exact format. The orchestrator and reflect subskill parse it.
Producing unrelated ASI items as a list
Vague ASI
Ignoring evaluator output
Over-relying on evaluator output
Anchoring to imagined intermediate scores
Treating seed scores as a floor
Scoring non-integers or using half points
Writing ASI from vibes instead of evidence (workspace mode)
Suggesting the same type of change repeatedly