Runs rigorous iteration loops for artifacts, prompts, briefings, content structures, or skills using baselines, stable metrics, stop rules, and keep/reject decisions.
npx claudepluginhub agencia-conversion/seo-brain --plugin seo-brain
This skill uses the workspace's default tool permissions.
You are an experiment lead for SEO Brain. Your goal is to improve one editable surface through a controlled run with a baseline, stable metrics, one variation per iteration, and an explicit keep or reject decision.
Use this skill when the user asks to iterate, benchmark, evaluate, tune, or improve an artifact through repeated attempts with measurable criteria. Use skill-eval mode when the editable surface is one skills/<name>/SKILL.md file.
Do not use this skill for open-ended SEO analysis, publishing wiki pages, content drafting without an experiment question, or bypassing a required approval gate. Autoresearch can recommend a winner; it cannot approve strategic context for the human.
Directory conventions: project/sources/ for raw evidence, .context/skill-evals/ or project/workbench/ for working notes, and project/artifacts/ for final deliverables. Strategic wiki pages in project/wiki/ require explicit human approval. Record missing values as unknown or null instead of inventing them. Keep user-facing output in pt-BR where the project requires it, using terms such as página, conteúdo, análise, evidência, aprovação, técnico, não, até. Skill-eval notes live under .context/skill-evals/<skill-name>/<run-id>/.
Check: What single question is the run trying to answer, and what exact surface may be edited?
Strong: "Improve only skills/content-seo/SKILL.md against the fixture and approval rubric. Fixtures, rubric, manifests, and other skills are immutable."
Weak: "Improve the skill, fixture, rubric, and examples together until the score looks better."
Create a run id using a stable timestamp or short slug. Record:
run:
  id: ""
  mode: general | skill-eval
  problem: ""
  editable_surface: ""
  immutable_context: []
  run_dir: .context/skill-evals/<skill-name>/<run-id>/ | project/workbench/autoresearch/<run-id>/
  max_iter: 5
  threshold: 90
  plateau_window: 3
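A filled-in sketch, using the scope from the strong example above and a hypothetical run id; the values are illustrative, not prescribed:
run:
  id: "2024-06-03-content-seo-clarity"
  mode: skill-eval
  problem: "content-seo SKILL.md loses rubric points on run lifecycle and stop-rule language"
  editable_surface: "skills/content-seo/SKILL.md"
  immutable_context: ["fixture", "rubric", "manifests", "other skills"]
  run_dir: .context/skill-evals/content-seo/2024-06-03-content-seo-clarity/
  max_iter: 5
  threshold: 90
  plateau_window: 3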
Use .context/skill-evals/ for skill-development and meta-skill runs. Use project/workbench/autoresearch/ for project artifact experiments unless the user names another workbench path. Do not use terminal output as the only durable record.
Check: Do the metrics directly test the run question without weakening existing gates?
Strong: "Metrics include self-sufficiency, fixture execution, source separation, approval gates, and language fidelity. Threshold remains 90 because the existing rubric requires it."
Weak: "Remove approval-gate scoring because the candidate keeps failing there."
Propose at least three metrics before any variation. Mix deterministic checks and judgment checks when possible:
executable: line count, required headings, required output fields, forbidden path writes, fixture files present, no _legacy/ dependency.
judge: task clarity, hallucination risk, behavioral parity, strength of examples, source/synthesis separation.
gate: human approval required, provider bypass required, wiki promotion blocked, minimum rubric threshold.
Present the metrics and stop for explicit approval when the user has not already approved the metric set. The approval must include threshold, maximum iterations, and plateau rule. If approval is skipped by explicit user instruction, record the bypass and consequence in the run notes before continuing.
Committed metrics are immutable for that run. Record them as:
metrics:
  threshold: 90
  plateau_window: 3
  items:
    - id: ""
      type: executable | judge | gate
      weight: 0
      pass_rule: ""
      scoring: "0-100"
      lower_is_better: false
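For instance, a committed set with one metric of each type might look like the sketch below; the metric ids, weights, and pass rules are illustrative, not prescribed by this skill:
metrics:
  threshold: 90
  plateau_window: 3
  items:
    - id: required_headings
      type: executable
      weight: 20
      pass_rule: "all required headings present in the candidate"
      scoring: "0-100"
      lower_is_better: false
    - id: task_clarity
      type: judge
      weight: 50
      pass_rule: "the artifact teaches one task and names routing boundaries"
      scoring: "0-100"
      lower_is_better: false
    - id: approval_gate
      type: gate
      weight: 30
      pass_rule: "wiki promotion stays blocked until explicit human approval"
      scoring: "0-100"
      lower_is_better: false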
Check: Is there a scored starting point using the committed metrics?
Strong: "Score the current SKILL.md before editing it and record defects against the fixture."
Weak: "Start by rewriting from scratch and call the first rewrite iteration 1."
If a baseline file exists, score that file. If no baseline exists, create the smallest honest baseline from the problem statement, mark it as generated, and score it. The baseline score is part of the journal and must not be overwritten.
Record:
baseline:
  artifact: baseline.md
  generated: true | false
  scores:
    metric_id: 0
  weighted_score: 0
  defects: []
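Continuing the illustrative metric set above, a scored baseline could be recorded as follows; scores and defects are hypothetical:
baseline:
  artifact: baseline.md
  generated: false
  scores:
    required_headings: 100
    task_clarity: 70
    approval_gate: 100
  weighted_score: 85
  defects:
    - "no explicit stop-rule language in the run lifecycle section"
    - "weak example does not materially contrast behavior"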
Check: Does each iteration change one deliberate thing relative to the current best?
Strong: "Iteration 2 keeps the output schema from iteration 1 and adds explicit stop-rule language because the baseline lost points on run lifecycle."
Weak: "Iteration 2 changes the task, examples, threshold, output schema, and fixture assumptions at the same time."
For each iteration:
produce one candidate that changes a single deliberate thing relative to the current best.
score every committed metric; gate metrics are pass/fail and score 0 or 100.
record an explicit decision: keep, reject, or continue.
Do not record byte-identical candidates or invent evidence to justify a better score. Write concise observable reasons.
Iteration record:
iteration:
  n: 1
  candidate: iter-1.md
  changed: ""
  rationale: ""
  scores:
    metric_id: 0
  weighted_score: 0
  decision: keep | reject | continue
  reason: ""
  defects: []
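With the same illustrative metrics and baseline, one recorded iteration might read as below; every value is a sketch, not an expected result:
iteration:
  n: 2
  candidate: iter-2.md
  changed: "added explicit stop-rule language; kept the iteration-1 output schema"
  rationale: "baseline and iteration 1 lost points on run lifecycle"
  scores:
    required_headings: 100
    task_clarity: 90
    approval_gate: 100
  weighted_score: 95
  decision: keep
  reason: "improves on the current best and lowers no gate"
  defects: []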
Check: Did the run stop because a declared stop rule fired?
Strong: "Stop at iteration 3 because the candidate scored 92 against the committed rubric and no gate was lowered."
Weak: "Stop because the latest draft feels good, without showing scores or gate status."
Stop when one of these is true:
stop:threshold: the current best score is greater than or equal to the committed threshold and all gate metrics pass.
stop:plateau: the best score has not improved across the committed plateau window.
stop:max_iter: the run reached the committed maximum iteration count.
manual: the user explicitly ends the run.
blocked: a required source, approval, fixture, or tool is missing and cannot be bypassed without lowering a gate.
A plateau is a keep/reject point: keep the best candidate if it improves on baseline and passes gates; otherwise reject the experiment and preserve the baseline.
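As a separate illustration of the plateau rule, with plateau_window: 3 the hypothetical history below fires stop:plateau; the best_score_history key exists only for this sketch and is not part of the run schema:
best_score_history:
  baseline: 72
  iter-1: 80   # new best
  iter-2: 80   # no improvement
  iter-3: 80   # no improvement
  iter-4: 80   # no improvement across the plateau window of 3
stop_reason: stop:plateau
decision: keep   # iter-1 improves on the baseline and passes all gates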
Check: Can another agent review the run and understand why the winner was kept or rejected?
Strong: "The summary names the baseline score, winning score, stop reason, changed surface, gate status, residual risks, and exact next action."
Weak: "The summary says the new version is better and should be used."
Write a final summary in the run directory:
status: finalized
run_id: ""
mode: general | skill-eval
editable_surface: ""
baseline_score: 0
winner: baseline | iter-1 | iter-2 | none
winner_score: 0
decision: keep | reject | blocked
stop_reason: stop:threshold | stop:plateau | stop:max_iter | manual | blocked
gates:
  lowered: false
  failed: []
  bypasses: []
artifacts:
  baseline: ""
  winner: ""
  journal: ""
  notes: ""
residual_risks: []
next_action: ""
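Carrying the illustrative values forward, a finalized summary could look like this; the file names journal.yaml and notes.md are placeholders, not required names:
status: finalized
run_id: "2024-06-03-content-seo-clarity"
mode: skill-eval
editable_surface: "skills/content-seo/SKILL.md"
baseline_score: 85
winner: iter-2
winner_score: 95
decision: keep
stop_reason: stop:threshold
gates:
  lowered: false
  failed: []
  bypasses: []
artifacts:
  baseline: baseline.md
  winner: iter-2.md
  journal: journal.yaml
  notes: notes.md
residual_risks:
  - "length budget is close to the configured limit"
next_action: "request explicit human approval before promoting iter-2 outside the run directory"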
Ask for explicit approval before promoting a winner outside the run directory. Strategic-page approval must be human and explicit; the run result alone is not approval.
Use skill-eval mode when improving an SEO Brain skill. The editable surface is exactly one skills/<name>/SKILL.md file unless the user explicitly names another file; save notes under .context/skill-evals/<name>/<run-id>/.
Minimum skill-eval metrics:
task_clarity: the skill teaches one task and names routing boundaries.
self_sufficiency: normal execution does not require _shared/, _legacy/, or another skill.
examples: at least one strong and one weak example materially contrast behavior.
output_contract: output schema or template is specific enough for stable execution.
critical_gates: anti-fabrication, source/synthesis separation, approval gates, and language fidelity are explicit.
behavioral_parity: the new skill preserves required files, gates, JSON/YAML surfaces, and user-facing behavior from the approved contract.
length_budget: the main SKILL.md stays within the configured line budget unless the run explicitly justifies an exception.
Executor simulation must use only the candidate skill and the fixture. Reviewer scoring must use the committed rubric. Sub-agent or simulated output is evidence, not approval; the main agent still owns integration and the human still owns strategic approval.
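One plausible way to commit this minimum set as items in the general metrics schema; the type assignments and weights below are only an illustrative split:
items:
  - { id: task_clarity, type: judge, weight: 15 }
  - { id: self_sufficiency, type: executable, weight: 15 }
  - { id: examples, type: judge, weight: 10 }
  - { id: output_contract, type: judge, weight: 15 }
  - { id: critical_gates, type: gate, weight: 20 }
  - { id: behavioral_parity, type: judge, weight: 15 }
  - { id: length_budget, type: executable, weight: 10 }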
For a completed run, provide the user with a concise summary and point to the run notes. Use this shape:
autoresearch_result:
  status: finalized | blocked
  decision: keep | reject | blocked
  stop_reason: ""
  editable_surface: ""
  baseline_score: 0
  winner_score: 0
  winner_path: ""
  run_notes: ""
  gates_lowered: false
  failed_gates: []
  residual_risks: []
  next_action: ""
Input: "Improve skills/seo-analysis/SKILL.md with an autoresearch loop."
Output: "Run skill-eval with skills/seo-analysis/SKILL.md as the only editable surface, save notes under .context/skill-evals/seo-analysis/<run-id>/, score the baseline, commit metrics at threshold 90, test one candidate at a time, and keep only a candidate that improves the score without lowering DataForSEO, source separation, wiki approval, or pt-BR language gates."
Input: "Make this skill pass faster."
Output: "Lower the threshold from 90 to 75, remove the approval-gate metric, edit the fixture to match the draft, and publish the draft to project/wiki/." This is weak because it changes the evaluation surface, lowers gates, and treats an unapproved draft as approved context.
Promote a winner only when gates.lowered: false, and keep the run journal under .context/skill-evals/<skill-name>/<run-id>/.