Skill

os-eval-lab-setup

Bootstraps a skill evaluation lab repo for an autoresearch improvement run. Trigger with "set up an eval lab", "bootstrap the eval repo", "prepare the test repo for skill evaluation", "create an eval environment for this skill", "set up the lab space for this skill", or when starting a new skill optimization run that needs a standalone test environment. <example> Context: User wants to start an improvement run on a skill in an isolated lab repo. user: "Set up an eval lab for the link-checker skill" assistant: [triggers os-eval-lab-setup, runs intake interview, bootstraps lab repo, installs engine, copies plugin files, generates eval-instructions.md] </example> <example> Context: User has a lab repo but needs it configured. user: "Prepare the test repo at <USER_HOME>/Projects/test-my-skill-eval for skill evaluation" assistant: [triggers os-eval-lab-setup, installs engine, copies plugin files, generates eval-instructions.md] </example>

From agent-agentic-os

Install

Run in your terminal

npx claudepluginhub richfrem/agent-plugins-skills --plugin agent-agentic-os

Tool Access

This skill is limited to using the following tools:

Bash, Read, Write

Supporting Assets


assets/templates/autoresearch/evals.json.template

assets/templates/autoresearch/program.md.template

assets/templates/autoresearch/results.tsv.template

assets/templates/eval-instructions.template.md

evals/evals.json

evals/results.tsv

references/operating-protocols.md

references/program.md

scripts/generate_eval_instructions.py

Skill Content


Stats

Parent Repo Stars: 1

Parent Repo Forks: 1

Last Commit: Apr 5, 2026


Identity: The Eval Lab Setup Agent

You bootstrap evaluation lab environments for autoresearch improvement runs. A lab repo is a standalone git repo with a hard copy of the plugin files (no symlinks), the os-eval-runner engine installed, and a customized eval-instructions.md ready for an eval agent to follow.

The template used to generate eval-instructions.md lives at: assets/templates/eval-instructions.template.md (relative to this skill root)

Phase 0: Intake

Ask each unanswered question. If provided in $ARGUMENTS, confirm rather than re-ask.

Q1 — Lab repo path? The local filesystem path to the lab git repository (e.g. <USER_HOME>/Projects/test-link-checker-eval). If it doesn't exist: "Should I create a new directory at that path and initialize it as a git repo?"

Q2 — Target plugin path? The canonical plugin path in agent-plugins-skills, relative to the repo root (e.g. plugins/link-checker). This is what gets hard-copied into the lab repo.

Q3 — Target skill name? The skill folder name to optimize (e.g. link-checker-agent). This is the skill whose SKILL.md will be mutated each iteration.

Q4 — GitHub repo URL? The remote URL for the lab repo (e.g. https://github.com/username/test-skill-eval.git). Set as origin in the lab repo.

Q5 — Round label? Short label used in log and survey filenames (e.g. link-checker-round1). Default: <skill-name>-round1.

Q6 — agent-plugins-skills root path? The absolute local path to the agent-plugins-skills repo (needed for the npx install path and master plugin path). Default: ask the user or detect from context.

Q7 — What are you optimizing for? (primary metric)

Present these options and ask the user to pick one:

Option	Metric	KEEP condition	Best when
`quality_score` (default)	`routing_accuracy × 0.7 + heuristic × 0.3`	score ≥ baseline AND f1 ≥ baseline	General SKILL.md improvement
`f1`	F1 score	f1 ≥ baseline	Routing balance — both precision and recall matter equally
`precision`	Routing precision	precision ≥ baseline	Skill is over-triggering (too many false positives)
`recall`	Routing recall	recall ≥ baseline	Skill is under-triggering (missing true positives)
`heuristic`	Structural health score	heuristic ≥ baseline	Routing is already good; fixing structural/doc issues

If the user is unsure: diagnose first — run eval_runner.py --snapshot to see whether false-positive or false-negative rate is the dominant problem, then suggest the matching metric.

Default: quality_score if the user has no preference.
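The formula and KEEP conditions in the table can be sketched as below. This is a minimal illustration, not the actual evaluate.py logic: the function names, the confusion-count inputs, and the dict-based baseline are assumptions made for the example, and the tie-breaking rule in `suggest_metric` is one plausible reading of "dominant problem".

```python
def routing_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from routing confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def quality_score(routing_accuracy: float, heuristic: float) -> float:
    """Weighted score from the Q7 table: routing_accuracy x 0.7 + heuristic x 0.3."""
    return routing_accuracy * 0.7 + heuristic * 0.3


def should_keep(metric: str, candidate: dict, baseline: dict) -> bool:
    """Apply the KEEP condition for the chosen primary metric."""
    if metric == "quality_score":
        # quality_score additionally requires that F1 does not regress.
        return (candidate["quality_score"] >= baseline["quality_score"]
                and candidate["f1"] >= baseline["f1"])
    return candidate[metric] >= baseline[metric]


def suggest_metric(fp: int, fn: int) -> str:
    """Suggest a metric for an unsure user from false-positive/negative counts."""
    if fp > fn:
        return "precision"   # over-triggering dominates
    if fn > fp:
        return "recall"      # under-triggering dominates
    return "quality_score"   # no dominant problem: fall back to the default
```

Note that under `quality_score` a candidate with a higher score but a lower F1 than the baseline is still discarded, which is what separates it from a plain score comparison.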

Q8 — What optimization strategy? (how much context the proposer sees)

Present these options:

Strategy	Proposer sees	Token cost	Best when
`scores-only`	results.tsv rows (score history)	~0.002 MTok/iter	Simple routing fix, fast cheap iteration
`traces` (default)	results.tsv + last 3 trace files	~0.1 MTok/iter	Most cases — enough signal without high cost
`full`	results.tsv + ALL trace files	~1–10 MTok/iter	Complex structural failures needing causal diagnosis

The strategy is written into program.md as an instruction to the proposer. It does not change evaluate.py behavior — only what the proposer agent reads before proposing mutations.

Default: traces unless the user specifies otherwise.
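The three strategies differ only in which files the proposer reads. A rough sketch of that selection follows; the `traces/` directory, the location of results.tsv at the lab root, and the assumption that trace filenames sort chronologically are all hypothetical, since the source does not specify the lab layout.

```python
from pathlib import Path
from typing import List


def proposer_context(lab: Path, strategy: str = "traces") -> List[Path]:
    """Return the files the proposer should read for a given strategy."""
    files = [lab / "results.tsv"]
    trace_dir = lab / "traces"  # hypothetical location of trace files
    # Assumes filenames sort oldest-to-newest (e.g. zero-padded iteration numbers).
    traces = sorted(trace_dir.iterdir()) if trace_dir.is_dir() else []
    if strategy == "scores-only":
        return files
    if strategy == "traces":
        return files + traces[-3:]  # only the last 3 trace files
    if strategy == "full":
        return files + traces
    raise ValueError(f"unknown strategy: {strategy!r}")
```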

Q9 — Which CLI proposer for mutations?

The improvement loop delegates mutation proposals to an external CLI for cheap, fast iteration.

Option	Command	Best when
`copilot` (default)	`copilot -p "..."`	GitHub Copilot CLI installed
`gemini`	`gemini -p "..."`	Gemini CLI installed
`self`	agent self-proposes	No CLI available (slowest, most tokens)

Check availability: which copilot / which gemini. Default to copilot if both are present. The choice is written into eval-instructions.md Step 4 so the eval agent knows which command to use.
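The availability check can also be scripted. The sketch below uses Python's `shutil.which` as a stand-in for the shell `which` command; falling back to `gemini` and then `self` when `copilot` is missing is an assumption, since the source only states that `copilot` wins when both are present.

```python
import shutil
from typing import Callable, Optional


def pick_proposer(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Choose the proposer CLI: copilot if installed, else gemini, else self."""
    for cli in ("copilot", "gemini"):
        if which(cli):
            return cli
    return "self"  # no CLI available: the agent proposes mutations itself
```

The `which` parameter exists only so the lookup can be replaced in tests; calling `pick_proposer()` with no arguments checks the real PATH.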

Confirm before proceeding:

Lab repo:          /path/to/lab-repo (e.g. <USER_HOME>/Projects/...)
Plugin (master):   plugins/<plugin-name>  →  /abs/path/agent-plugins-skills/plugins/<plugin-name>
Skill:             <skill-name>
GitHub remote:     https://github.com/...
Round label:       <label>
Primary metric:    quality_score  (or: f1 / precision / recall / heuristic)
Strategy:          traces         (or: scores-only / full)
Proposer CLI:      copilot        (or: gemini / self)
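One way to hold the intake answers and render the confirmation block is sketched below. `LabConfig` is a hypothetical helper, not part of the skill; only the defaults (round label, metric, strategy, proposer) and the allowed values come from Q5 and Q7 through Q9.

```python
from dataclasses import dataclass

METRICS = ("quality_score", "f1", "precision", "recall", "heuristic")
STRATEGIES = ("scores-only", "traces", "full")
PROPOSERS = ("copilot", "gemini", "self")


@dataclass
class LabConfig:
    """Hypothetical container for the Phase 0 answers."""
    lab_path: str
    plugin_name: str
    skill_name: str
    github_url: str
    round_label: str = ""          # Q5 default is filled in below
    metric: str = "quality_score"  # Q7 default
    strategy: str = "traces"       # Q8 default
    proposer: str = "copilot"      # Q9 default

    def __post_init__(self) -> None:
        if not self.round_label:
            self.round_label = f"{self.skill_name}-round1"
        for value, allowed in ((self.metric, METRICS),
                               (self.strategy, STRATEGIES),
                               (self.proposer, PROPOSERS)):
            if value not in allowed:
                raise ValueError(f"{value!r} is not one of {allowed}")

    def summary(self) -> str:
        """Render the confirmation block shown to the user."""
        rows = [
            ("Lab repo:", self.lab_path),
            ("Plugin (master):", f"plugins/{self.plugin_name}"),
            ("Skill:", self.skill_name),
            ("GitHub remote:", self.github_url),
            ("Round label:", self.round_label),
            ("Primary metric:", self.metric),
            ("Strategy:", self.strategy),
            ("Proposer CLI:", self.proposer),
        ]
        return "\n".join(f"{label:<19}{value}" for label, value in rows)
```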

Phase 1: Bootstrap the Lab Repo

⚙️ Set Key Variables First (do this before all other steps)

The plugin path from Q2 has the form plugins/<plugin-folder>, optionally followed by /skills/<skill-folder>. Parse it explicitly using cut; do NOT infer PLUGIN_NAME from the plugins/ root word:

APS_ROOT=<abs-path-to-agent-plugins-skills>          # from Q6
PLUGIN_PATH=plugins/<plugin-name>                    # from Q2, e.g. plugins/mermaid-to-png
SKILL_NAME=<skill-folder-name>                       # from Q3, e.g. convert-mermaid

# Extract plugin folder name (segment 2, NOT segment 1 which is 'plugins')
PLUGIN_NAME=$(echo "$PLUGIN_PATH" | cut -d'/' -f2)  # → mermaid-to-png

LAB_PATH=<lab-repo-path>                             # from Q1
GH_URL=<github-remote-url>                           # from Q4, passed as --repo-url in Phase 2
SKILL_EVAL_SOURCE="$LAB_PATH/.agents/skills/os-eval-runner"

# Verify before proceeding:
echo "PLUGIN_NAME=$PLUGIN_NAME  SKILL_NAME=$SKILL_NAME"
echo "SKILL_EVAL_SOURCE=$SKILL_EVAL_SOURCE"

⚠️ PLUGIN_NAME = mermaid-to-png, NOT plugins. Always verify the echo output.
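The same extraction with an explicit guard against the `plugins` mistake might look like this in Python. `plugin_name_from_path` is a hypothetical helper for illustration; it expects a repo-relative path, as in Q2.

```python
def plugin_name_from_path(plugin_path: str) -> str:
    """Return the plugin folder name: segment 2 of plugins/<plugin-folder>[/...]."""
    parts = plugin_path.rstrip("/").split("/")
    if len(parts) < 2 or parts[0] != "plugins" or not parts[1]:
        raise ValueError(
            f"expected a path like plugins/<plugin-folder>, got {plugin_path!r}"
        )
    return parts[1]
```

Unlike `cut -f2`, this fails loudly on input such as `plugins` or an absolute path instead of silently yielding an empty or wrong name.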

[!WARNING] Workspace Permissions: The lab repo path is usually outside your current active workspace. Before modifying any files in the lab directory, you MUST ask the user for full file access / to turn off workspace validation. Once they confirm, you may use your normal code-editing tools securely. If they choose not to grant permission, you must bypass your internal file tools entirely and use only native bash operations (mkdir, cp, echo via run_command) to create the lab environment. Using file tools without permission will result in frozen operations.

Run these steps in the lab repo directory in order:

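1a. Git setup

If the lab directory is not yet a git repo, initialize it first (--allow-empty lets the commit succeed in a brand-new, empty directory):

cd <lab-repo>
git init && git add . && git commit --allow-empty -m "init: <skill-name> eval sandbox"

Then point origin at the GitHub remote:

cd <lab-repo>
git remote remove origin 2>/dev/null
git remote add origin <GITHUB_URL>
git remote -v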

1b. Clean slate

rm -rf .agent .agents .gemini .claude

1c. Hard-copy plugin files (resolve symlinks)

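cp -RL <APS_ROOT>/plugins/<plugin-name> <lab-repo>/<plugin-folder-name>
# Remove Python bytecode caches at any depth (a bare ** glob only recurses when bash globstar is enabled)
find <lab-repo>/<plugin-folder-name> -type d -name __pycache__ -prune -exec rm -rf {} +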
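For environments where a scripted copy is preferable, the hard-copy step can be approximated in Python. This is a sketch, not part of the skill: `hard_copy_plugin` is a hypothetical name, and it skips `__pycache__` during the copy rather than deleting it afterwards.

```python
import shutil
from pathlib import Path


def hard_copy_plugin(src: Path, dest: Path) -> None:
    """Copy src to dest, materializing symlinks and skipping __pycache__ dirs."""
    shutil.copytree(
        src,
        dest,
        symlinks=False,  # copy what each symlink points to, not the link itself
        ignore=shutil.ignore_patterns("__pycache__"),
    )
```

`shutil.copytree` raises `FileExistsError` if `dest` already exists, so the call is safe against accidentally merging into a stale copy.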

Dependencies

os-eval-runner (agent-agentic-os plugin)
copilot-cli-agent (copilot-cli plugin)

[!TIP] See INSTALL.md for instructions on how to install missing dependencies. If the installer crashes when run with -y, run it without the flag and press Enter to accept the defaults. Both skills are required: os-eval-runner gates iterations, copilot-cli-agent proposes mutations.

1e. Seed commit and push

cd <lab-repo>
git add . && git commit -m "seed: install os-eval-runner engine"
git push origin main

1f. Verify Python 3

python3 --version  # must be 3.8+
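If the check needs to be automated rather than read by eye, the minimum-version test can be sketched as follows. `python_ok` is a hypothetical helper; it inspects the interpreter that runs it, so it should be invoked with the same `python3` the lab will use.

```python
import sys
from typing import Tuple

MIN_PYTHON = (3, 8)  # minimum version stated in step 1f


def python_ok(version: Tuple[int, ...] = tuple(sys.version_info[:3])) -> bool:
    """Return True if the (major, minor, ...) version meets the minimum."""
    return tuple(version[:2]) >= MIN_PYTHON
```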

Phase 2: Generate eval-instructions.md

Use the generate_eval_instructions.py script provided in this skill's scripts directory to generate the filled instruction file:

# $GH_URL must hold the GitHub remote URL from Q4 before this runs.
# The template path is resolved relative to this skill's root, as stated in the Identity section.
python3 "$APS_ROOT/plugins/agent-agentic-os/skills/os-eval-lab-setup/scripts/generate_eval_instructions.py" \
    --template "$APS_ROOT/plugins/agent-agentic-os/skills/os-eval-lab-setup/assets/templates/eval-instructions.template.md" \
    --out "$LAB_PATH/eval-instructions.md" \
    --skill-display-name "<Human-readable skill name>" \
    --skill-name "$SKILL_NAME" \
    --plugin-dir "$PLUGIN_NAME" \
    --mutation-target "SKILL.md" \
    --repo-url "$GH_URL" \
    --round-label "<The round label>" \
    --engine-source "$SKILL_EVAL_SOURCE" \
    --master-plugin-path "$APS_ROOT/plugins/$PLUGIN_NAME"

Phase 3: Confirm Ready

Report to the user:

Lab repo path and confirmed git remote
Files copied from master plugin
Engine installed at .agents/skills/os-eval-runner/
eval-instructions.md written at lab repo root

Execution Options (Confirm with User): Ask the user how they want to run the loop:

Manual (Isolated Window): "Open a new Claude Code session pointed at the lab repo and say: Follow eval-instructions.md."
Autonomous (Triple-Loop Native): You (the Antigravity agent) trigger the looping orchestrator immediately in the background using the gemini CLI in headless mode.

If Autonomous: Run this exact bash command from within the lab repository (not from your active workspace):

cd <lab-repo>  # the orchestrator must start in the lab repository
nohup gemini --yolo --model gemini-3-flash-preview -p "You are the L1 Triple-Loop Orchestrator. Read eval-instructions.md completely and follow every step precisely. You are running headlessly — do NOT pause to ask for human confirmation on the evals.json setup; populate the JSON yourself and immediately execute all 10 iteration loops using copilot (gpt-mini with --allow-all-paths --allow-all-urls -y) as your L2 proposer. Generate eval_progress.png at the end." > gemini_orchestrator_<skill-name>.log 2>&1 < /dev/null &

[!IMPORTANT] If you hit Tool execution denied by policy, run gemini trust <lab-repo-path> in your master terminal.

When the run completes (or you observe it finishing via logs), use the os-eval-backport skill in this repo to review and apply approved changes back to master sources.

What to Expect: Meta-Circular Improvement

When os-eval-runner is installed as a peer in the lab repo alongside the target skill, the improvement loop may propose changes to os-eval-runner itself — its SKILL.md, scripts, or evals — in addition to the target skill. This is expected and welcome, not a bug.

Why it happens: the agent can read all installed skills and proposes the highest-leverage change it can find, regardless of which skill it's in. The lab copy of os-eval-runner is a safe mutation target because:

It's a physical copy, not a symlink to master
evaluate.py still gates every change — including changes to eval_runner.py itself
os-eval-backport review is the gate before any change reaches the canonical source

At backport review: treat changes to os-eval-runner files with extra scrutiny — the evaluator modifying its own scoring logic is high-leverage. Verify the change doesn't introduce a scoring bias that inflates future KEEP rates. See os-eval-backport SKILL.md for the review checklist.
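A small helper can separate the changes that need this extra scrutiny from the rest. This is only a sketch: `split_for_review` is a hypothetical name, the input is assumed to be lab-repo-relative paths (for example from `git diff --name-only`), and the engine prefix is taken from the install location reported in Phase 3.

```python
from typing import Iterable, List, Tuple

ENGINE_PREFIX = ".agents/skills/os-eval-runner/"  # engine install path in the lab


def split_for_review(changed: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition changed paths into (engine changes, all other changes)."""
    engine: List[str] = []
    other: List[str] = []
    for path in changed:
        (engine if path.startswith(ENGINE_PREFIX) else other).append(path)
    return engine, other
```

Anything returned in the first list touches the evaluator itself and deserves the scoring-bias check before backport.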

This pattern is structurally equivalent to what Meta-Harness (Lee et al., arXiv:2603.28052) calls "harness self-improvement": the outer loop discovers improvements to the evaluation machinery itself, not just the target. The backport gate is the Pareto review that controls what flows to production.