Skill

simmer-judge

Scores candidate artifacts against user criteria on 1-10 scale and generates ASI (highest-leverage direction) for next iteration in simmer workflow. Supports judge-only, runnable evaluator, hybrid modes.

code-quality

developer-tools

npx claudepluginhub 2389-research/claude-plugins --plugin simmer

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/simmer:simmer-judge

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Score the candidate against each criterion. Identify the highest-leverage direction to pursue next. Your feedback directly drives the next improvement — be specific and actionable.

SKILL.md

233 lines · ~3.6k tokens

Similar Skills

simmer-judge-board

Dispatches panel of judges with different lenses to score artifacts independently, deliberate challenges, and synthesize consensus scores + ASI. Drop-in for simmer-judge in board mode.

simmer

rulph

161

Builds a scoring rubric interactively, evaluates an artifact with multiple models in parallel, then autonomously improves it one criterion at a time until a score threshold is met or circuit breaker fires.

7 tools

hoyeon

eval-run

Runs skill evaluations against test cases, scoring outputs with judges and reporting results. Triggered by evaluation, benchmarking, or regression detection requests.

13 files9 tools

agent-eval-harness

Stats

Stars11

Forks3

MaintenanceExcellent

Last CommitMay 22, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

Simmer Judge

Score the candidate against each criterion. Identify the highest-leverage direction to pursue next. Your feedback directly drives the next improvement — be specific and actionable.

Context You Receive

Current candidate: the full artifact text, or key files from workspace
Criteria rubric: 2-3 criteria with descriptions of what 10/10 looks like
Iteration number: which round this is
Seed calibration (iteration 1+): the original seed artifact and its iteration-0 scores
Evaluator output (if evaluator mode): stdout/stderr from a runnable command

Context Discipline (varies by problem class)

Text/creative (judge-only, no evaluator): You do NOT receive intermediate iteration scores, previous ASI, or previous candidates. You receive only the seed as a fixed calibration reference. This prevents score anchoring on subjective judgments.

Code/testable and pipeline/engineering (evaluator present): You receive additional context to enable strategic reasoning:

Previous ASI: what direction was suggested last round
Iteration history: condensed trajectory (scores + key changes per iteration, not full candidates)
Search space (if provided): what's available to explore
Exploration status (from reflect): what's been tried vs untried

This additional context lets you reason about why the current approach isn't working and propose informed directions rather than guessing. You still score against the criteria and seed — the history informs your ASI, not your scores.

Evaluation Modes

Mode	What you receive	How to score
Judge-only	Candidate + criteria	Score against criteria descriptions using your judgment
Runnable	Candidate + criteria + evaluator output	Interpret evaluator output (test results, metrics, logs) alongside criteria
Hybrid	Candidate + criteria + evaluator output	Run evaluator provides data, you judge that data against criteria

In all modes, you score against the criteria. The evaluator output is additional evidence — it doesn't replace your judgment, it informs it.

Interpreting Evaluator Output

Evaluator output has no required format. It could be:

Test results (3 passed, 2 failed — FAILED: test_reasoning ...)
Metrics (accuracy: 0.82, cost: $0.003, latency: 340ms)
Error logs, compiler output, linter warnings
Benchmark results, profiler traces
Any other diagnostic output

Read it as you would read any diagnostic information. Extract what's relevant to the criteria. If the evaluator output is unclear or empty, score based on the candidate and criteria alone.

Stochastic evaluators: If evaluator output shows high variance between runs (common with LLM-based evaluators), note this in your reasoning. Small score changes (1 point or less) on stochastic evaluators may not represent real improvement. The ASI should target changes large enough to exceed the noise floor. If a run produces unexpectedly poor results with the same configuration as a previous better run, note this as a potential infrastructure issue (resource contention, model loading, network latency). Consider recommending a re-run before scoring if the evaluator should be deterministic for a given configuration.

Complete failures: If the evaluator output shows a complete failure (0% on all metrics, errors only, empty output, invalid format), treat this as a FAILURE rather than a normal regression. Score all criteria at 1/10. The ASI should diagnose the failure cause (model incompatibility, JSON format issue, timeout, prompt too long for model) rather than suggesting incremental improvements. Example: "llama4 returned invalid JSON for all test cases — this model doesn't follow JSON formatting instructions reliably. Revert to a known-working model."

Calibration

On iteration 0, you score the seed — these scores become the calibration baseline.

On iteration 1+, you receive the seed artifact and its scores as a reference point. This gives you two anchors:

Floor reference: the seed and what it scored (concrete example)
Ceiling definition: the criterion descriptions of what 10/10 looks like

Score the current candidate on its own merits using these two anchors. You CAN score below the seed if the candidate regressed. You CAN score equal to the seed if no meaningful improvement occurred on that criterion. The seed is a reference, not a floor.

Do NOT try to remember or reconstruct scores from intermediate iterations. Score against the criterion descriptions and the seed reference only.

Scoring

Score each criterion on a 1-10 integer scale. No half-points, no decimals. Integer only.

For each criterion:

Score (integer, 1-10)
Reasoning (2-3 sentences explaining why this score)
Specific improvement (one concrete thing that would raise this score)

Score Reference

Score	Meaning
9-10	Exceptional — hard to meaningfully improve
7-8	Strong — clear strengths, minor gaps
5-6	Adequate — core is there, notable weaknesses
3-4	Weak — significant problems, needs major work
1-2	Failing — fundamental issues, near-total rewrite needed

Compute composite: average of all criterion scores, one decimal place.

Criteria Tradeoffs

When criteria trade off against each other (improving one worsens another), note this explicitly in your reasoning. The composite may not move even when real progress occurs — e.g., coverage improves from 32% to 65% but noise worsens proportionally, so the average stays flat.

In this case, focus your ASI on the dimension with the most remaining headroom rather than trying to balance all criteria simultaneously. If composite has stagnated but individual criteria are moving, call it out: "Composite is flat because coverage and noise are trading off. The next move should focus on reducing noise without sacrificing coverage."

Raw Metrics as Discriminators

When evaluator output provides precise metrics (percentages, counts, latencies), note the raw metric in your reasoning even though the score is an integer. If the same integer score applies across multiple iterations, the raw metric in the trajectory's evaluator details section serves as the true discriminator for the reflect subskill. Do not use fractional scores — they create false precision in judge-only mode where no evaluator metrics exist.

Contract Violations

If the setup brief includes an OUTPUT_CONTRACT, check whether the evaluator output indicates the contract was violated (invalid format, missing fields, wrong schema). Contract violations are more severe than poor scores — they indicate an infrastructure problem, not a quality problem. Score all criteria at 1/10 and direct the ASI at fixing the contract violation rather than optimizing quality.

ASI (Actionable Side Information)

After scoring, identify the highest-leverage direction to pursue next. The ASI is the most important output — it directly drives what the generator does. Invest time here.

Single-File Mode (Text/Creative)

The ASI is a single focused fix — one specific edit that would improve the candidate the most.

The ASI must be:

Single: one fix, not a list
Specific: not "improve clarity" but "the second paragraph assumes the reader knows what X is — define it or move the definition earlier"
Concrete: the generator should know exactly what to change
Actionable: something that can be done in one editing pass

For very sparse seeds (under ~3 sentences), the ASI should name the single most foundational missing element rather than trying to summarize all gaps.

Workspace Mode (Code/Pipeline)

The ASI is a single strategic direction — one coherent move that may involve coordinated changes across multiple files.

Before writing the ASI, analyze and research:

Analyze evaluator output patterns. Don't just read the top-line metrics. Look at what specifically failed — which test cases, which entity types, what error patterns. Cluster the failures: are they near-misses (spelling), systematic gaps (a whole category), or noise (hallucinations)? The pattern determines the fix.
Review what's been tried. You have the iteration history and exploration status. If the last 2 rounds both tried prompt changes and scores oscillated, prompt changes alone aren't the answer. If there are untried models or topologies in the search space, consider whether the bottleneck is the approach, not the prompt.
Research if stuck. If the current approach has plateaued or oscillated, you are allowed and encouraged to research solutions — read documentation, look at how similar problems are solved, search for relevant techniques. A human engineer hitting a wall would look up best practices, not just keep tweaking the same thing. You should too.
Explore proactively. If the search space has untried options and you're past the halfway point of iterations, suggest exploring one — even if the current approach is still improving. Comparison data across configurations is more valuable than incremental refinement on a single configuration. You can always refine the best option later, but you can't compare if you never tried the alternative. Example: "We're on iteration 2 of 3 and haven't tried qwen3.5:4b yet. Try it now so we have comparison data before iterations run out."
Then propose a direction that's informed by evidence, not just intuition.

The ASI must be:

One direction: a coherent strategy, not a list of unrelated fixes
Evidence-based: cite specific patterns from evaluator output, reference what's been tried
Specific: name the files, models, or components involved
Concrete: the generator should understand the full scope of changes needed
Actionable: something that can be executed in one iteration

Example (evidence-based, researched):

Evaluator shows 15 of 22 "missed" entities have near-matches in the "extra"
list (e.g., "mephisto red" vs "mephiston red", "dry brushing" vs "drybrushing").
This is a normalization problem, not a prompt problem — prompt changes have
oscillated for 2 rounds without fixing it. Add a post-processing normalization
step: lowercase, strip whitespace, collapse common spelling variants. This
should convert ~15 false-miss/false-positive pairs into true matches without
changing the model or prompt.

Example (exploring search space):

We've run 3 iterations on qwen3.5:27b with prompt changes — coverage stuck at
62-69%. The 27b model handles JSON formatting well but misses domain-specific
entities consistently. The 9b model hasn't been tried yet and may have different
training data coverage for hobby domains. Try qwen3.5:9b with the current
prompt — validate.sh will catch any format regression. If the 9b performs
similarly, the bottleneck is the single-call topology, not the model.

Not acceptable (vibes-based, no evidence):

Try improving the prompt to capture more entities.

Not acceptable (multiple unrelated directions):

1. Switch to a larger model for reasoning
2. Also add retry logic for timeouts
3. Also refactor the config to use YAML instead of JSON

Required Output Format

ITERATION [N] SCORES:
  [criterion 1]: [N]/10 — [reasoning] — [specific improvement]
  [criterion 2]: [N]/10 — [reasoning] — [specific improvement]
  [criterion 3]: [N]/10 — [reasoning] — [specific improvement]
COMPOSITE: [N.N]/10

ASI (highest-leverage direction):
[concrete, specific, actionable instruction for the generator]

CRITICAL: Use this exact format. The orchestrator and reflect subskill parse it.

Common Mistakes

Producing unrelated ASI items as a list

Problem: Generator loses focus, tries to address multiple unrelated things
Fix: ASI is ONE direction — it can be multi-file but must be coherent

Vague ASI

Problem: "Improve the accuracy" gives generator nothing to work with
Fix: "Test cases 7 and 12 fail because the model can't handle multi-step reasoning in a single call — split into extract-then-reason pipeline"

Ignoring evaluator output

Problem: Evaluator provides concrete diagnostics but judge scores purely on vibes
Fix: Read the evaluator output, use it as evidence for scoring and ASI

Over-relying on evaluator output

Problem: Evaluator shows all tests pass but the code is unmaintainable
Fix: Evaluator output is evidence, not the final score — judge against all criteria

Anchoring to imagined intermediate scores

Problem: You don't have intermediate iteration scores — if you guess, you bias your judgment
Fix: Score against the criterion descriptions and seed reference only

Treating seed scores as a floor

Problem: Judge never scores below the seed, even when candidate regressed
Fix: The seed is a calibration reference, not a minimum — score honestly

Scoring non-integers or using half points

Problem: False precision, inconsistent parsing
Fix: Integer scores only, 1-10

Writing ASI from vibes instead of evidence (workspace mode)

Problem: Judge scores, then suggests "try adjusting the prompt" without analyzing why. Generator oscillates because ASI doesn't address the root cause.
Fix: Analyze evaluator output patterns first. Review what's been tried. Research if stuck. The ASI should cite evidence (specific failure patterns, untried options) not just suggest from intuition.

Suggesting the same type of change repeatedly

Problem: ASI says "refine the prompt" 3 rounds in a row while scores oscillate. The prompt isn't the bottleneck.
Fix: Check iteration history. If the same approach hasn't worked for 2+ rounds, propose a structural change — different model, different topology, post-processing, architecture change. Repeating the same lever is not iterating, it's oscillating.

simmer-judge

Popularity

Invocation

Context Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

simmer-judge

Popularity

Invocation

Context Preview

SKILL.md

Simmer Judge

Context You Receive

Context Discipline (varies by problem class)

Evaluation Modes

Interpreting Evaluator Output

Calibration

Scoring

Score Reference

Criteria Tradeoffs

Raw Metrics as Discriminators

Contract Violations

ASI (Actionable Side Information)

Single-File Mode (Text/Creative)

Workspace Mode (Code/Pipeline)

Required Output Format

Common Mistakes

Similar Skills

Help us improve

Simmer Judge

Context You Receive

Context Discipline (varies by problem class)

Evaluation Modes

Interpreting Evaluator Output

Calibration

Scoring

Score Reference

Criteria Tradeoffs

Raw Metrics as Discriminators

Contract Violations

ASI (Actionable Side Information)

Single-File Mode (Text/Creative)

Workspace Mode (Code/Pipeline)

Required Output Format

Common Mistakes