Continuously improves an existing agent skill based on eval results using the RED-GREEN-REFACTOR cycle. Apply when a skill's routing accuracy is low, trigger descriptions need sharpening, or os-eval-runner scores are below target. The cycle: (1) run a RED baseline to observe the failure mode, (2) apply a focused patch and verify with os-eval-runner (GREEN), (3) refactor to close loopholes until the score meets the threshold. Integrates with os-eval-runner as the objective eval gate. NOT for scaffolding new skills -- use create-skill (agent-scaffolders) for that.
From agent-agentic-os. Install with:

```shell
npx claudepluginhub richfrem/agent-plugins-skills --plugin agent-agentic-os
```

This skill is limited to the tools declared in its `allowed-tools` frontmatter. Bundled files:

- evals/evals.json
- evals/results.tsv
- references/memory/improvement-ledger-spec.md
- references/operations/skill_optimization_guide.md
- references/testing/test-registry-protocol.md
- scripts/eval_runner.py
Adapts the RED-GREEN-REFACTOR cycle from software testing to skill authoring. The key insight: a skill is a testable contract. The failure to follow the contract is observable. Always observe the failure BEFORE writing the fix.
Integrated with:
- os-eval-runner -- runs eval_runner.py as the GREEN verification step
- Triple-Loop Retrospective -- uses this methodology to gate every proposed skill patch
- evals/evals.json + results.tsv -- autoresearch eval format for longitudinal tracking

| Software TDD | Skill Authoring Equivalent |
|---|---|
| Test case | Pressure scenario: a user prompt that should trigger the skill |
| RED phase | Run a baseline WITHOUT the skill. Observe: does the agent violate the intended protocol? |
| GREEN phase | Write the skill. Run os-eval-runner. KEEP only if score >= baseline. |
| REFACTOR phase | Identify loopholes from eval failures. Patch frontmatter or examples. Re-eval. |
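The mapping above can be sketched as a driver loop. This is a minimal sketch: `run_eval`, `apply_patch`, and `revert_patch` are assumed callables standing in for `scripts/eval_runner.py` and manual SKILL.md edits, not part of any real tool.

```python
def improve_skill(run_eval, apply_patch, revert_patch, max_iterations=5):
    """Drive the RED-GREEN-REFACTOR loop against an eval gate.

    run_eval() -> float is assumed to wrap scripts/eval_runner.py and
    return a routing-accuracy score; apply_patch/revert_patch mutate
    and restore SKILL.md. All three callables are stand-ins.
    """
    baseline = run_eval()                 # RED: observe the failure first
    history = []
    for _ in range(max_iterations):
        apply_patch()                     # GREEN: one focused change
        score = run_eval()
        if score >= baseline:             # KEEP: patch survives, raise the bar
            baseline = score
            history.append(("KEEP", score))
        else:                             # DISCARD: revert, try another hypothesis
            revert_patch()
            history.append(("DISCARD", score))
    return baseline, history
```

The key property the sketch encodes is that a DISCARD always reverts: the skill never regresses below the last KEEP.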
Never write a new skill without first observing a failure.
The RED scenario is the evidence that the skill is needed. Without it, you cannot prove the skill changed behavior, and you have nothing real to put in the `<example>` block's commentary.

```
# Document the RED scenario before writing:
# Write to: context/memory/tests/[TIMESTAMP]_[SKILL_SLUG].md
# Fields: pressure_scenario, expected_behavior, observed_failure, acceptance_criterion
```
Before proposing any change in an active improvement loop, run:
```shell
python3 ./scripts/eval_runner.py \
  --skill <experiment-dir> \
  --snapshot
```
This tells you: current score, iteration history, false-positive vs false-negative rate, and the dominant problem type (PRECISION or RECALL). If the snapshot shows PRECISION (too many false positives), do not add more keywords — that makes it worse. If it shows RECALL, do not add adversarial examples without also adding trigger phrases.
If --snapshot is not yet available (pre-Enhancement-2), read evals/results.tsv directly
for score trend and evals/traces/ for the most recent DISCARD's per-input detail.
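A minimal trend reader over results.tsv might look like the sketch below. The `score` and `verdict` column names are assumptions; check them against the real TSV header before relying on this.

```python
import csv

def score_trend(tsv_path, score_field="score", verdict_field="verdict"):
    """Return (scores, last_discard_row) from an append-only results.tsv.

    Column names are assumed defaults -- adjust them to the actual
    header of evals/results.tsv.
    """
    scores, last_discard = [], None
    with open(tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            scores.append(float(row[score_field]))
            if row.get(verdict_field) == "DISCARD":
                last_discard = row   # most recent DISCARD wins
    return scores, last_discard
```

The last DISCARD row tells you which iteration to pull from evals/traces/ for per-input detail.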
Before editing any file, output a hypothesis block. If you cannot fill all 5 fields from trace data or eval history, read more traces before proposing. Mutations without a grounded hypothesis are exploratory noise — not systematic improvement.
```
HYPOTHESIS:
  Failure mode: [exact input that triggered incorrectly + the incorrect verdict]
  Root cause: [which specific keyword, phrase, or missing example caused it]
  Change: [one sentence -- add/remove/modify WHAT in SKILL.md]
  Effect: [which specific eval inputs should flip from wrong -> correct]
  Risk: [which inputs might regress -- name them specifically]
```
Acceptable example:
```
HYPOTHESIS:
  Failure mode: "audit all hyperlinks in markdown files" triggered (should_trigger=false)
  Root cause: keyword 'audit' in description matched this unrelated request
  Change: Remove 'audit'; replace with 'broken-link audit' (compound, more specific)
  Effect: iter_002 false positive should no longer trigger
  Risk: "audit my symlink manifest" (iter_006, should_trigger=true) may also stop triggering
```
Not acceptable — do not write mutations based on vague hypotheses like "description too vague, improve it." That produces random mutations and early plateau.
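The five-field contract can be enforced mechanically. The sketch below mirrors the template above; the field names and vague-phrase list are illustrative, not part of eval_runner.py.

```python
from dataclasses import dataclass, fields

@dataclass
class Hypothesis:
    failure_mode: str   # exact input + the incorrect verdict
    root_cause: str     # specific keyword, phrase, or missing example
    change: str         # one-sentence SKILL.md mutation
    effect: str         # eval inputs expected to flip wrong -> correct
    risk: str           # inputs that might regress, named specifically

# Phrases that signal an ungrounded, exploratory mutation (illustrative list)
VAGUE = {"", "improve it", "description too vague", "make it better"}

def is_grounded(h: Hypothesis) -> bool:
    """A mutation is allowed only when all 5 fields are filled and non-vague."""
    return all(getattr(h, f.name).strip().lower() not in VAGUE for f in fields(h))
```

A gate like this turns "read more traces before proposing" into a hard precondition rather than advice.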
Before writing a single line of SKILL.md, check the test registry (context/memory/tests/registry.md): has this hypothesis been tested and falsified before? If yes, do not re-test -- pick a different approach.

Frontmatter template:

```yaml
---
name: skill-slug # lowercase-hyphen, matches directory name
version: 1.0.0
description: >
  Trigger description. This is the MOST IMPORTANT field -- it determines routing accuracy.
  Rules:
  - Lead with the primary use case, not the skill name
  - Include 2-3 <example> blocks: one standard use, one adversarial (when NOT to trigger),
    one edge case
  - Use specific vocabulary in the description text -- terms that only appear in this skill's domain
  - NEVER add a `keywords:` YAML field -- it disables description scanning entirely (known footgun -- see os-eval-runner Troubleshooting)
  - Avoid generic verbs (do, run, execute) as primary triggers -- they appear everywhere
trigger: comma-separated, specific trigger phrases that ONLY appear in this skill's context
allowed-tools: Read, Write, Edit, Bash # list only what the skill actually needs
---
```
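The frontmatter rules lend themselves to a small lint pass. A sketch, assuming frontmatter is the first `---`-delimited block; this is a heuristic check, not part of any real tool:

```python
import re

def lint_skill(skill_md: str) -> list:
    """Flag the frontmatter anti-patterns described above. Heuristic, not a YAML parser."""
    problems = []
    m = re.match(r"^---\n(.*?)\n---", skill_md, re.S)
    if not m:
        return ["missing frontmatter block"]
    fm = m.group(1)
    if re.search(r"^keywords:", fm, re.M):
        problems.append("keywords: field disables description scanning")
    if not re.search(r"^description:", fm, re.M):
        problems.append("missing description field")
    if len(re.findall(r"<example>", skill_md)) < 2:
        problems.append("fewer than 2 <example> blocks")
    return problems
```

Running a pass like this before the eval gate catches the known footguns without spending an eval iteration on them.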
Trigger description anti-patterns that degrade routing accuracy: generic verbs as primary triggers, a `keywords:` field, and leading with the skill name instead of the use case.

Every non-trivial skill needs at least two example blocks:
<example>
<commentary>Standard use: agent correctly invokes this skill</commentary>
User: [exact or paraphrased pressure scenario from RED phase]
Agent: [first sentence of correct behavior -- invoke the skill, not explain it]
</example>
<example>
<commentary>Adversarial: agent correctly does NOT invoke this skill</commentary>
User: [request that SOUNDS similar but belongs to a different skill]
Agent: [correct behavior: invokes the OTHER skill instead]
</example>
Body template:

```markdown
# Skill Name

One-paragraph description of what the skill does and why.

## When to Use
- [condition 1]
- [condition 2]

## Iron Law (if applicable)
[The single most important rule that must not be violated. State it as an absolute.]

## Step-by-Step Protocol
[Numbered steps. If >7 steps, extract a sub-phase.]

## Common Failures
| Failure | Why it happens | Prevention |
|---|---|---|

## References
- [related skill or reference doc]
```
After writing the SKILL.md, run the eval gate. Do not apply the skill without a KEEP verdict.
```shell
python3 ./scripts/eval_runner.py \
  --skill path/to/new/SKILL.md
```
Interpreting results:
- STATUS: KEEP -- score >= baseline. Apply the skill.
- STATUS: BASELINE -- first run. Record the score. Do not apply yet -- write an eval scenario in evals/evals.json targeting the pressure scenario from Phase 1.
- STATUS: DISCARD -- score same or lower. Do not apply. Go to Phase 4 (REFACTOR).

If the eval returns BASELINE on a new skill, write one eval scenario in evals/evals.json in the autoresearch format, run again, and compare to that baseline before shipping.
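One way to script the gate. This sketch assumes eval_runner.py prints a `STATUS: ...` line; verify the exact output format against the real script before relying on it.

```python
import re
import subprocess

def parse_status(output: str) -> str:
    """Extract the STATUS verdict from eval output (line format is an assumption)."""
    m = re.search(r"STATUS:\s*(KEEP|BASELINE|DISCARD)", output)
    if not m:
        raise ValueError("no STATUS line in eval output")
    return m.group(1)

def eval_gate(skill_path: str) -> str:
    """Run the gate; only a KEEP verdict permits applying the skill."""
    out = subprocess.run(
        ["python3", "./scripts/eval_runner.py", "--skill", skill_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_status(out)
```

Raising on a missing STATUS line keeps a malformed run from being silently treated as a pass.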
If eval returns DISCARD or review reveals gaps:

- identify the specific failing input from the latest eval trace
- add an `<example>` block covering that specific input
- record the falsified hypothesis in context/memory/tests/registry.md

REFACTOR anti-patterns: re-testing a hypothesis the registry already falsified, vague "improve the description" mutations, and piling on keywords when the snapshot shows a PRECISION problem.

Skill types:
| Type | When to use | Key property |
|---|---|---|
| Protocol skill | Sequential multi-step procedure | Steps are MANDATORY, order matters |
| Reference skill | Lookup table or decision guide | Agent reads it, does not execute steps |
| Gating skill | Iron Law enforcement (verification, TDD) | Must include Common Failures table |
| Coordination skill | Agent-to-agent or multi-session | Must specify event bus interaction pattern |
```
plugins/<your-plugin>/skills/<skill-slug>/
  SKILL.md          <- single authoritative source (never duplicate)
  evals/
    evals.json      <- eval scenarios in autoresearch format
    results.tsv     <- longitudinal KEEP/DISCARD history (append-only)
  references/       <- supporting docs (file-level symlinks if shared)
  scripts/          <- helper scripts (file-level symlinks if shared)
```
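The expected layout can be sanity-checked before running the gate. A small sketch; the paths simply mirror the tree above:

```python
from pathlib import Path

# Files the eval gate depends on, per the layout above
REQUIRED = ["SKILL.md", "evals/evals.json", "evals/results.tsv"]

def check_layout(skill_dir) -> list:
    """Return the required files missing from a skill directory."""
    root = Path(skill_dir)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]
```

An empty return means the skill is structurally ready for eval_runner.py.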
If a reference doc or script is shared with another skill in the same plugin, keep a single copy and use file-level symlinks from references/ or scripts/.

When Triple-Loop Retrospective proposes a new skill or skill patch, it MUST include a documented RED scenario.
A proposal that skips the RED scenario MUST be rejected -- the learning loop cannot improve what it cannot measure.
For scaffolding a brand-new skill, use create-skill (filesystem scaffolding) instead.

> [!TIP]
> See INSTALL.md for instructions on how to install missing dependencies.
How they work together:
- create-skill (agent-scaffolders) -- runs the discovery interview, creates the directory, writes starter files
- os-skill-improvement (this skill) -- takes the scaffolded skill and drives the RED-GREEN-REFACTOR quality cycle
- os-eval-runner (agent-agentic-os) -- provides the objective eval gate used in step 2