Bootstraps a skill evaluation lab repo for an autoresearch improvement run. Trigger with "set up an eval lab", "bootstrap the eval repo", "prepare the test repo for skill evaluation", "create an eval environment for this skill", "set up the lab space for this skill", or when starting a new skill optimization run that needs a standalone test environment. <example> Context: User wants to start an improvement run on a skill in an isolated lab repo. user: "Set up an eval lab for the link-checker skill" assistant: [triggers os-eval-lab, runs intake interview, bootstraps lab repo, installs engine, copies plugin files, generates eval-instructions.md] </example> <example> Context: User has a lab repo but needs it configured. user: "Prepare the test repo at <USER_HOME>/Projects/test-my-skill-eval for skill evaluation" assistant: [triggers os-eval-lab, installs engine, copies plugin files, generates eval-instructions.md] </example>
From agent-agentic-os
npx claudepluginhub richfrem/agent-plugins-skills --plugin agent-agentic-os
- assets/templates/autoresearch/evals.json.template
- assets/templates/autoresearch/program.md.template
- assets/templates/autoresearch/results.tsv.template
- assets/templates/eval-instructions.template.md
- evals/evals.json
- evals/results.tsv
- references/operating-protocols.md
- references/program.md
- scripts/generate_eval_instructions.py
You bootstrap evaluation lab environments for autoresearch improvement runs. A lab repo is a
standalone git repo with a hard copy of the plugin files (no symlinks), the
os-eval-runner engine installed, and a customized eval-instructions.md ready for
an eval agent to follow.
The template used to generate eval-instructions.md lives at:
assets/templates/eval-instructions.template.md (relative to this skill root)
Ask each unanswered question. If provided in $ARGUMENTS, confirm rather than re-ask.
Q1 — Lab repo path?
The local filesystem path to the lab git repository (e.g. <USER_HOME>/Projects/test-link-checker-eval).
If it doesn't exist: "Should I create a new directory at that path and initialize it as a git repo?"
Q2 — Target plugin path?
The canonical plugin path in agent-plugins-skills (e.g. plugins/link-checker). This is
what gets hard-copied into the lab repo.
Q3 — Target skill name?
The skill folder name to optimize (e.g. link-checker-agent). This is the skill whose
SKILL.md will be mutated each iteration.
Q4 — GitHub repo URL?
The remote URL for the lab repo (e.g. https://github.com/username/test-skill-eval.git).
Set as origin in the lab repo.
Q5 — Round label?
Short label used in log and survey filenames (e.g. link-checker-round1).
Default: <skill-name>-round1.
Q6 — agent-plugins-skills root path?
The absolute local path to the agent-plugins-skills repo (needed for the npx install path
and master plugin path). Default: ask the user or detect from context.
Q7 — What are you optimizing for? (primary metric)
Present these options and ask the user to pick one:
| Option | Metric | KEEP condition | Best when |
|---|---|---|---|
| quality_score (default) | routing_accuracy × 0.7 + heuristic × 0.3 | score ≥ baseline AND f1 ≥ baseline | General SKILL.md improvement |
| f1 | F1 score | f1 ≥ baseline | Routing balance: precision and recall matter equally |
| precision | Routing precision | precision ≥ baseline | Skill is over-triggering (too many false positives) |
| recall | Routing recall | recall ≥ baseline | Skill is under-triggering (missing true positives) |
| heuristic | Structural health score | heuristic ≥ baseline | Routing is already good; fixing structural/doc issues |
If the user is unsure: diagnose first — run eval_runner.py --snapshot to see whether
false-positive or false-negative rate is the dominant problem, then suggest the matching metric.
Default: quality_score if the user has no preference.
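The metric table and KEEP conditions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Scores` container, `keep`, and `suggest_metric` are hypothetical names, not part of evaluate.py or eval_runner.py, and the tie-breaking rule in `suggest_metric` is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical container for one evaluation's numbers."""
    routing_accuracy: float
    heuristic: float
    f1: float
    precision: float
    recall: float

    @property
    def quality_score(self) -> float:
        # Weights taken from the table: 0.7 routing accuracy, 0.3 heuristic.
        return self.routing_accuracy * 0.7 + self.heuristic * 0.3


def keep(metric: str, candidate: Scores, baseline: Scores) -> bool:
    """Return True when the candidate meets the KEEP condition for `metric`."""
    if metric == "quality_score":
        # quality_score additionally requires that f1 does not regress.
        return (candidate.quality_score >= baseline.quality_score
                and candidate.f1 >= baseline.f1)
    if metric in ("f1", "precision", "recall", "heuristic"):
        return getattr(candidate, metric) >= getattr(baseline, metric)
    raise ValueError(f"unknown metric: {metric}")


def suggest_metric(false_positives: int, false_negatives: int) -> str:
    """Map the dominant routing error to a metric (assumed tie rule: default)."""
    if false_positives > false_negatives:
        return "precision"      # over-triggering
    if false_negatives > false_positives:
        return "recall"         # under-triggering
    return "quality_score"      # no dominant error, fall back to the default
```

Note that a candidate can raise quality_score and still be discarded if its f1 drops below the baseline.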
Q8 — What optimization strategy? (how much context the proposer sees)
Present these options:
| Strategy | Proposer sees | Token cost | Best when |
|---|---|---|---|
| scores-only | results.tsv rows (score history) | ~0.002 MTok/iter | Simple routing fix, fast cheap iteration |
| traces (default) | results.tsv + last 3 trace files | ~0.1 MTok/iter | Most cases: enough signal without high cost |
| full | results.tsv + ALL trace files | ~1–10 MTok/iter | Complex structural failures needing causal diagnosis |
The strategy is written into program.md as an instruction to the proposer. It does not change
evaluate.py behavior — only what the proposer agent reads before proposing mutations.
Default: traces unless the user specifies otherwise.
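As a rough sketch of what each strategy means for the proposer's reading list, the function below picks the files for a given strategy. The function name and the assumption that trace files arrive ordered oldest to newest are illustrative; the actual selection is done by the proposer agent following program.md, not by code like this.

```python
from pathlib import Path
from typing import List


def proposer_context(strategy: str, results_tsv: Path,
                     trace_files: List[Path]) -> List[Path]:
    """Return the files the proposer would read under `strategy`.

    `trace_files` is assumed to be ordered oldest to newest.
    """
    if strategy == "scores-only":
        return [results_tsv]
    if strategy == "traces":
        # Last 3 trace files; fewer if the run has not produced 3 yet.
        return [results_tsv] + trace_files[-3:]
    if strategy == "full":
        return [results_tsv] + list(trace_files)
    raise ValueError(f"unknown strategy: {strategy}")
```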
Q9 — Which CLI proposer for mutations?
The improvement loop delegates mutation proposals to an external CLI for cheap, fast iteration.
| Option | Command | Best when |
|---|---|---|
| copilot (default) | copilot -p "..." | GitHub Copilot CLI installed |
| gemini | gemini -p "..." | Gemini CLI installed |
| self | agent self-proposes | No CLI available (slowest, most tokens) |
Check availability: which copilot / which gemini. Default to copilot if both are present.
The choice is written into eval-instructions.md Step 4 so the eval agent knows which command to use.
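The availability check above could be expressed with `shutil.which`, which mirrors the shell's `which`. This is a minimal sketch; `pick_proposer` is a hypothetical helper, and the injectable `which` parameter exists only to make the logic easy to exercise.

```python
import shutil
from typing import Callable, Optional


def pick_proposer(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Pick the proposer CLI: copilot if present, else gemini, else self."""
    # Order matters: copilot wins when both CLIs are installed.
    for cli in ("copilot", "gemini"):
        if which(cli):
            return cli
    return "self"
```

Calling `pick_proposer()` with no arguments checks the real PATH.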
Confirm before proceeding:
Lab repo: /path/to/lab-repo (e.g. <USER_HOME>/Projects/...)
Plugin (master): plugins/<plugin-name> → /abs/path/agent-plugins-skills/plugins/<plugin-name>
Skill: <skill-name>
GitHub remote: https://github.com/...
Round label: <label>
Primary metric: quality_score (or: f1 / precision / recall / heuristic)
Strategy: traces (or: scores-only / full)
Proposer CLI: copilot (or: gemini / self)
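One way to hold the intake answers and their defaults together is a small dataclass that renders the confirmation summary. This is an illustrative sketch only: `LabConfig` and its field names are hypothetical, and the defaults are the ones stated in Q5 and Q7 to Q9.

```python
from dataclasses import dataclass


@dataclass
class LabConfig:
    """Hypothetical record of the intake answers."""
    lab_repo: str
    plugin_path: str
    skill_name: str
    github_remote: str
    round_label: str = ""                 # Q5, defaulted below
    primary_metric: str = "quality_score"  # Q7 default
    strategy: str = "traces"              # Q8 default
    proposer_cli: str = "copilot"         # Q9 default

    def __post_init__(self) -> None:
        # Q5 default: <skill-name>-round1
        if not self.round_label:
            self.round_label = f"{self.skill_name}-round1"

    def summary(self) -> str:
        rows = [
            ("Lab repo", self.lab_repo),
            ("Plugin (master)", self.plugin_path),
            ("Skill", self.skill_name),
            ("GitHub remote", self.github_remote),
            ("Round label", self.round_label),
            ("Primary metric", self.primary_metric),
            ("Strategy", self.strategy),
            ("Proposer CLI", self.proposer_cli),
        ]
        return "\n".join(f"{label}: {value}" for label, value in rows)
```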
The plugin path from Q2 has the form plugins/<plugin-folder>/skills/<skill-folder>.
Parse it explicitly using cut — do NOT infer PLUGIN_NAME from the plugins/ root word:
APS_ROOT=<abs-path-to-agent-plugins-skills> # from Q6
PLUGIN_PATH=plugins/<plugin-name> # from Q2, e.g. plugins/mermaid-to-png
SKILL_NAME=<skill-folder-name> # from Q3, e.g. convert-mermaid
# Extract plugin folder name (segment 2, NOT segment 1 which is 'plugins')
PLUGIN_NAME=$(echo "$PLUGIN_PATH" | cut -d'/' -f2) # → mermaid-to-png
LAB_PATH=<lab-repo-path> # from Q1
GH_URL=<github-repo-url> # from Q4, passed later as --repo-url
SKILL_EVAL_SOURCE="$LAB_PATH/.agents/skills/os-eval-runner"
# Verify before proceeding:
echo "PLUGIN_NAME=$PLUGIN_NAME SKILL_NAME=$SKILL_NAME"
echo "SKILL_EVAL_SOURCE=$SKILL_EVAL_SOURCE"
⚠️ PLUGIN_NAME=mermaid-to-png, NOT plugins. Always verify the echo output.
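The same segment-2 rule can be checked programmatically. The sketch below is a hypothetical Python counterpart to the `cut` call, with one deliberate difference: it rejects paths that do not start with `plugins/` instead of silently returning the wrong segment.

```python
from pathlib import PurePosixPath


def plugin_name_from_path(plugin_path: str) -> str:
    """Return the plugin folder name (segment 2) of a plugins/<name>[/...] path."""
    parts = PurePosixPath(plugin_path).parts
    # Segment 1 must be the literal 'plugins' root; segment 2 is the plugin name.
    if len(parts) < 2 or parts[0] != "plugins":
        raise ValueError(f"expected plugins/<plugin-name>[/...], got: {plugin_path!r}")
    return parts[1]
```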
[!WARNING] Workspace Permissions: The lab repo path is usually outside your current active workspace. Before modifying any files in the lab directory, you MUST ask the user for full file access / to turn off workspace validation. Once they confirm, you may use your normal code-editing tools securely. If they choose not to grant permission, you must bypass your internal file tools entirely and use only native bash operations (mkdir, cp, echo via run_command) to create the lab environment. Using file tools without permission will result in frozen operations.
Run these steps in the lab repo directory in order:
cd <lab-repo>
git remote remove origin 2>/dev/null
git remote add origin <GITHUB_URL>
git remote -v
If not yet a git repo:
git init && git add . && git commit -m "init: <skill-name> eval sandbox"
rm -rf .agent .agents .gemini .claude
cp -RL <APS_ROOT>/plugins/<plugin-name> <lab-repo>/<plugin-folder-name>
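# '**' only recurses when bash globstar is enabled, so use find to catch nested caches
find <lab-repo>/<plugin-folder-name> -type d -name __pycache__ -prune -exec rm -rf {} +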
[!TIP] See INSTALL.md for instructions on how to install missing dependencies. If -y crashes: run without it and press Enter to accept defaults. Both skills are required: os-eval-runner gates iterations, copilot-cli-agent proposes mutations.
cd <lab-repo>
git add . && git commit -m "seed: install os-eval-runner engine"
git push origin main
python3 --version # must be 3.8+
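For the hard-copy step above (`cp -RL` followed by the `__pycache__` cleanup), a rough Python equivalent is sketched here. It is an assumption-laden illustration, not what the skill runs: `hard_copy_plugin` is a hypothetical name, and unlike `cp` it requires that the destination does not exist yet.

```python
import shutil
from pathlib import Path


def hard_copy_plugin(src: Path, dest: Path) -> None:
    """Copy `src` to `dest`, dereferencing symlinks and skipping __pycache__."""
    shutil.copytree(
        src,
        dest,
        symlinks=False,  # copy link targets' contents, like cp -L
        ignore=shutil.ignore_patterns("__pycache__"),
    )
```

Skipping the cache directories during the copy avoids the separate cleanup pass.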
Use the generate_eval_instructions.py script provided in this skill's scripts directory to generate the filled instruction file:
python3 "$APS_ROOT/plugins/agent-agentic-os/skills/os-eval-lab-setup/scripts/generate_eval_instructions.py" \
--template "$APS_ROOT/plugins/agent-agentic-os/assets/templates/eval-instructions.template.md" \
--out "$LAB_PATH/eval-instructions.md" \
--skill-display-name "<Human-readable skill name>" \
--skill-name "$SKILL_NAME" \
--plugin-dir "$PLUGIN_NAME" \
--mutation-target "SKILL.md" \
--repo-url "$GH_URL" \
--round-label "<The round label>" \
--engine-source "$SKILL_EVAL_SOURCE" \
--master-plugin-path "$APS_ROOT/plugins/$PLUGIN_NAME"
Report to the user:
- .agents/skills/os-eval-runner/
- eval-instructions.md written at lab repo root

Execution Options (Confirm with User): Ask the user how they want to run the loop:
- "Follow eval-instructions.md."
- Autonomous: gemini CLI in headless mode.

If Autonomous: Run this exact bash command from your active workspace:
# Run from within the lab repository
nohup gemini --yolo --model gemini-3-flash-preview -p "You are the L1 Triple-Loop Orchestrator. Read eval-instructions.md completely and follow every step precisely. You are running headlessly — do NOT pause to ask for human confirmation on the evals.json setup; populate the JSON yourself and immediately execute all 10 iteration loops using copilot (gpt-mini with --allow-all-paths --allow-all-urls -y) as your L2 proposer. Generate eval_progress.png at the end." > gemini_orchestrator_<skill-name>.log 2>&1 < /dev/null &
[!IMPORTANT] If you hit Tool execution denied by policy, run gemini trust <lab-repo-path> in your master terminal.
When the run completes (or you observe it finishing via logs), use the os-eval-backport skill in this repo to review and apply approved changes back to master sources.
When os-eval-runner is installed as a peer in the lab repo alongside the target skill,
the improvement loop may propose changes to os-eval-runner itself — its SKILL.md, scripts,
or evals — in addition to the target skill. This is expected and welcome, not a bug.
Why it happens: the agent can read all installed skills and proposes the highest-leverage
change it can find, regardless of which skill it's in. The lab copy of os-eval-runner
is a safe mutation target because:
- evaluate.py still gates every change, including changes to eval_runner.py itself
- os-eval-backport review is the gate before any change reaches the canonical source

At backport review: treat changes to os-eval-runner files with extra scrutiny, because
the evaluator modifying its own scoring logic is high-leverage. Verify the change doesn't
introduce a scoring bias that inflates future KEEP rates. See os-eval-backport SKILL.md
for the review checklist.
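To make the extra scrutiny concrete, a reviewer could first separate engine changes from target-skill changes. The helper below is a hypothetical sketch, not part of os-eval-backport; it assumes the engine lives under `.agents/skills/os-eval-runner/` in the lab repo and that paths are given relative to the lab repo root.

```python
from typing import Dict, Iterable, List

# Assumed engine location inside the lab repo.
ENGINE_PREFIX = ".agents/skills/os-eval-runner/"


def split_for_review(changed_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket changed files so engine edits can be reviewed separately."""
    buckets: Dict[str, List[str]] = {"engine": [], "target": []}
    for path in changed_paths:
        key = "engine" if path.startswith(ENGINE_PREFIX) else "target"
        buckets[key].append(path)
    return buckets
```

Anything in the `engine` bucket would then go through the scoring-bias check before backport.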
This pattern is structurally equivalent to what Meta-Harness (Lee et al., arXiv:2603.28052) calls "harness self-improvement": the outer loop discovers improvements to the evaluation machinery itself, not just the target. The backport gate is the Pareto review that controls what flows to production.