Inspects artifacts or workspaces, infers evaluation contracts and search space, proposes assessments conversationally, and produces setup briefs for simmer refinement loops.
npx claudepluginhub 2389-research/claude-plugins --plugin simmer

This skill uses the workspace's default tool permissions.
Inspect the artifact, infer what "better" means and how to measure it, propose an assessment to the user, produce the setup brief that drives the entire refinement loop.
Core principle: Inspect first, infer second, propose third, confirm last. The agent does the thinking — the user validates, adjusts, or overrides. Never ask the user to describe something the agent can read.
Look for:
If ambiguous, ask once:
What are we refining?
1. A file (give me the path)
2. Something you'll paste
3. A workspace/directory (give me the path)
4. Generate from a description (I'll create the starting point)
Set mode and artifact type:
| Mode | Artifact Type | When |
|---|---|---|
| from-file | single-file | User provides a file path |
| from-paste | single-file | User pastes content |
| from-workspace | workspace | User provides a directory path |
| seedless | single-file or workspace | User describes what to create |
For single-file (from-file or from-paste):
For workspace (from-workspace):
- evaluate.*, test.*, benchmark.*, or scripts referenced in configs/READMEs
- validate.*, check.*, or quick-test variants
- config.json, config.yaml, .env, etc. — these reveal what parameters can be varied
- strategies/, plugins/, models/ that indicate extensibility points
- prompt.md, system.txt, template files — things the generator can modify

For seedless:
Infer the problem class from what you found during inspection. Never ask the user what class this is.
IF mode == "seedless" AND description is prose/creative:
→ text/creative
ELSE IF artifact_type == "workspace" AND (evaluator script found OR user mentioned evaluator):
→ pipeline/engineering
ELSE IF evaluator found OR artifact is code:
→ code/testable
ELSE:
→ text/creative
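The decision tree above can be sketched as a small function. The argument names are illustrative, not part of any fixed interface:

```python
def infer_problem_class(mode, artifact_type, evaluator_found,
                        description_is_prose=False,
                        user_mentioned_evaluator=False,
                        artifact_is_code=False):
    """Map inspection results to a problem class, mirroring the rules above."""
    if mode == "seedless" and description_is_prose:
        return "text/creative"
    if artifact_type == "workspace" and (evaluator_found or user_mentioned_evaluator):
        return "pipeline/engineering"
    if evaluator_found or artifact_is_code:
        return "code/testable"
    return "text/creative"  # default when nothing testable was found
```

Note the ordering matters: a workspace with an evaluator classifies as pipeline/engineering even though the second rule would also match.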
Text/Creative — infer criteria only:
Code/Testable — infer criteria + evaluation:
Pipeline/Engineering — infer everything:
Use this table when proposing criteria. The agent should prefer criteria inferred from the actual artifact over these generic seeds.
| Artifact type | Suggested criteria |
|---|---|
| Document / spec | clarity, completeness, actionability |
| Creative writing | narrative tension, specificity, voice consistency |
| Email / comms | value prop clarity, tone match, call to action strength |
| Prompt / instructions | instruction precision, output predictability, edge case coverage |
| API design | contract completeness, developer ergonomics, consistency |
| Code (non-cookoff) | simplicity, robustness, readability |
| Adventure hook / game content | narrative tension, player agency, specificity |
| Blog post / article | argument clarity, engagement, structure |
| Pipeline / workflow | coverage, efficiency, noise |
| Configuration / infra | correctness, resource efficiency, maintainability |
Sufficiency check: Before proposing, check whether the user's initial prompt + inspection results already provide everything needed for the brief:
If all fields are covered (from user prompt + inspection), skip the proposal and go directly to Phase 5 (emit brief). This is the common case when running as a subagent — the calling prompt provides intent/constraints and inspection fills in contracts.
If some fields are missing or ambiguous, present everything you inferred as a single conversational assessment. The user confirms, adjusts, or overrides. This is ONE message, not a sequence of questions.
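As a rough illustration, the sufficiency check might look like this. The exact set of required fields per problem class is an assumption here, not a fixed contract:

```python
# Fields that must be known (from user prompt + inspection) before
# skipping the proposal and emitting the brief directly.
REQUIRED = {
    "text/creative": {"artifact", "criteria"},
    "code/testable": {"artifact", "criteria", "evaluator"},
    "pipeline/engineering": {"artifact", "criteria", "evaluator",
                             "output_contract", "search_space"},
}

def is_sufficient(problem_class, known_fields):
    """True when every required field for this class is already covered."""
    return REQUIRED[problem_class] <= set(known_fields)
```

When this returns True, skip straight to emitting the brief; otherwise, present the single conversational assessment.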
This is a [artifact type] — I'll use judge-only evaluation (no scripts to run).
For criteria, I'd suggest:
- [criterion 1]: [inferred description of what good looks like]
- [criterion 2]: [inferred description]
- [criterion 3]: [inferred description]
3 iterations, starting from [seed description].
Sound right, or want to adjust anything?
This is [what the code does]. I found [evaluator/test script] which I'll
use to evaluate each iteration.
For criteria:
- [criterion 1]: [inferred from evaluator metrics or code purpose]
- [criterion 2]: [inferred]
- [criterion 3]: [inferred]
[If output contract inferred]: Output should be [format description].
[If constraints found]: I see [model/API/resource constraints].
3 iterations. Which criterion matters most, or are they equal?
This is a pipeline optimization problem. Here's what I found:
**Evaluator:** [script path] — measures [what it measures, from reading the script]
**Output contract:** [inferred from pipeline output format / evaluator expectations]
**Validation:** [script path or proposed subset command] — [what it checks, estimated time]
**Search space:** [inferred from config + directory structure]
- Models: [from config values]
- Prompts: [prompt files found]
- Topology: [strategy dirs, extensibility points]
For criteria:
- [criterion 1]: [from evaluator metrics] — [primary?]
- [criterion 2]: [from evaluator metrics]
- [criterion 3]: [from evaluator metrics]
**Constraints:** [API endpoints, available infrastructure from config]
[N] iterations. Does this look right? Anything to add or change?
The user can:
One round of confirmation is the goal. If the user's adjustments are clear, incorporate them and emit the brief. Don't re-propose unless the user asks to see the updated assessment.
After the proposal, if the user confirmed but key fields are still unknown, ask only about what you couldn't infer. Frame each question with why you need it.
Examples of things you might not be able to infer:
Do NOT ask about things you already inferred. If you found evaluate.sh and read what it measures, don't ask "how should we evaluate?"
Maximum 3 criteria. If you inferred more than 3 from the evaluator, pick the 3 most impactful and propose those. If the user mentioned criteria in their initial request, use those over inferred ones.
For each criterion, you need a description of what 10/10 looks like. If you can infer this from the evaluator (e.g., "100% coverage" from a coverage metric), use it. If not, ask: "What does a 10/10 look like for [criterion]?"
Produce this exact format — it is consumed by every subsequent subskill:
ARTIFACT: [full content if from-paste, file path if from-file, directory path if from-workspace, description if seedless]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA:
- [criterion 1]: [what a 10/10 looks like]
- [criterion 2]: [what a 10/10 looks like]
- [criterion 3]: [what a 10/10 looks like]
PRIMARY: [criterion name — omit if equally weighted]
EVALUATOR: [command to run — omit for judge-only]
BACKGROUND: [constraints and resources — omit if not provided]
OUTPUT_CONTRACT: [valid output format description — omit for text/creative]
VALIDATION_COMMAND: [quick check command — omit if no cheap validation exists]
SEARCH_SPACE: [what's in scope to explore — omit if unconstrained]
JUDGE_MODE: [single | board — auto-select, see below]
ITERATIONS: [N]
MODE: [seedless | from-file | from-paste | from-workspace]
OUTPUT_DIR: [path, default: docs/simmer]
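A minimal sketch of emitting this format, assuming criteria arrive as (name, description) pairs and optional fields are simply absent from the input. Field names follow the template above; everything else is illustrative:

```python
def emit_brief(fields):
    """Serialize the brief, omitting optional fields rather than leaving them blank."""
    order = ["ARTIFACT", "ARTIFACT_TYPE", "CRITERIA", "PRIMARY",
             "EVALUATOR", "BACKGROUND", "OUTPUT_CONTRACT",
             "VALIDATION_COMMAND", "SEARCH_SPACE", "JUDGE_MODE",
             "ITERATIONS", "MODE", "OUTPUT_DIR"]
    lines = []
    for key in order:
        value = fields.get(key)
        if value is None:
            continue  # optional field not provided — omit the line entirely
        if key == "CRITERIA":
            lines.append("CRITERIA:")
            lines.extend(f"- {name}: {desc}" for name, desc in value)
        else:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)
```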
Default iteration count is 3. Override if user specified ("simmer this, 10 rounds"). Don't ask about iteration count — 3 is the default.
Select JUDGE_MODE based on problem complexity. Do not ask the user — infer from the artifact and criteria.
| Condition | JUDGE_MODE |
|---|---|
| text/creative AND ≤2 criteria AND short artifact (email, tweet, tagline, single paragraph) | single |
| text/creative AND 3 criteria or longer/complex artifact | board |
| code/testable (any) | board |
| pipeline/engineering (any) | board |
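The selection table can be approximated in code. The word-count threshold standing in for "short artifact" is an assumption, not part of the spec:

```python
def select_judge_mode(problem_class, n_criteria, artifact_words=0):
    """Auto-select single vs board, mirroring the table above."""
    if problem_class in ("code/testable", "pipeline/engineering"):
        return "board"  # any code or pipeline problem gets a board
    # text/creative: single judge only for small, low-criteria artifacts
    if n_criteria <= 2 and artifact_words <= 150:
        return "single"
    return "board"
```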
User overrides: If the user said "with a single judge," use single. If they said "with a judge board" or "with a panel," use board. User intent wins — but check for mismatches:
Complexity mismatch nudge: If the user requested single judge but the artifact is complex (workspace, evaluator present, 3 criteria, code/pipeline), nudge once:
"This is a [workspace/pipeline/complex] artifact with [N criteria / an evaluator / multiple files] — a judge board would give you better diagnosis here. Want to use the board instead, or stick with single judge?"
One question, one time. If they say single, respect it and move on.
The OUTPUT_DIR defaults to docs/simmer. Override if the user specifies a different location or if running in a test/scratch context.
Return this brief to the orchestrator. Do not proceed to generation or judging — that is the orchestrator's job.
User: "Simmer a DND adventure hook for a one-shot session.
The party is level 5, exploring a haunted lighthouse."
Agent: [no inspection needed — seedless creative writing]
Agent: "This is a creative writing piece — I'll use judge-only evaluation.
For criteria I'd suggest:
- Narrative tension: scenes have stakes, time pressure, and consequences
- Player agency: multiple meaningful decision points, not just a railroad
- Specificity: concrete details (names, descriptions, sensory hooks) not generic fantasy
3 iterations, seedless. Sound right?"
User: "yes"
→ Emit brief, hand to orchestrator
User: "Simmer this pipeline at /tmp/pipeline —
I want the best model/prompt/topology for extracting entities."
Agent: [reads directory: config.json, run_pipeline.py, prompt.md,
evaluate.sh, validate.sh, strategies/]
[reads config.json: model=qwen3.5:4b, strategy=single-call]
[reads evaluate.sh: measures coverage, precision, tokens per video]
[reads validate.sh: runs 1 video, checks JSON format]
[reads prompt.md: extraction prompt with JSON schema]
[reads run_pipeline.py: supports pluggable strategies from strategies/ dir]
Agent: "This is a pipeline optimization problem. Here's what I found:
**Evaluator:** ./evaluate.sh — runs all 3 test videos, measures coverage
against ground truth, precision, and token count per model.
**Output contract:** JSON object with 'entities' array, each element has
'name' (string) and 'type' (string). Inferred from pipeline output format
and evaluator expectations.
**Validation:** ./validate.sh — runs 1 video, checks output is valid JSON
with correct schema. Takes ~1 min vs 5-30 min for full eval.
**Search space:**
- Models: currently qwen3.5:4b in config.json (what else is available?)
- Prompts: prompt.md (unlimited changes)
- Topology: single-call currently, but run_pipeline.py supports pluggable
strategies from strategies/ dir — multi-call is explorable
For criteria:
- Coverage: extracts maximum entities from ground truth (primary?)
- Efficiency: lowest token count while maintaining coverage
- Noise: minimal false positives
What models do you have available? And is coverage the primary goal?"
User: "qwen3.5 4b/9b/27b on Ollama. Yes, coverage is primary. 5 iterations."
→ Emit brief with all inferred + confirmed values
User: "Simmer this extraction prompt at local_extract_test.py as PROMPT_V4_FULL"
Agent: [reads local_extract_test.py — finds prompt, test harness,
evaluator command pattern]
Agent: "This is a prompt optimization with a test harness. I see the evaluator
runs via: OLLAMA_MODEL=qwen3.5:27b python3 local_extract_test.py v4_full
Output should be JSON with 'entities' array (name + type from the taxonomy).
For criteria:
- Coverage: extracts every entity from ground truth
- Noise: zero false positives
- Conceptual depth: captures painting theory, not just concrete items
Which matters most?"
User: "Coverage is primary. 3 iterations."
→ Emit brief
- Asking the user to describe something you can read
- Presenting a form instead of an assessment
- Asking too many questions in sequence
- Not reading evaluator/validation scripts
- Over-inspecting for text/creative
- Proposing more than 3 criteria
- Not asking what 10/10 looks like
- Starting to generate or judge
- Forcing evaluator output format
- Re-proposing after minor adjustments