---
name: pi-autoresearch-loop
description: Autonomous experiment loop for pi that continuously tries optimizations, measures results, and keeps what works
triggers:
- autoresearch
- autonomous experiment loop
- optimize automatically
- run experiment loop
- continuous optimization
- benchmark and improve
- start autoresearch session
- keep what works discard what doesnt
---
# pi-autoresearch — Autonomous Experiment Loop
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection
Autonomous experiment loop extension for [pi](https://github.com/antiwork/pi). Continuously proposes changes, benchmarks them, commits wins, reverts losses, and repeats — forever. Works for any measurable target: test speed, bundle size, build time, LLM training loss, Lighthouse scores.
---
## Installation
```bash
pi install https://github.com/davebcn87/pi-autoresearch
```

Manual install:

```bash
cp -r extensions/pi-autoresearch ~/.pi/agent/extensions/
cp -r skills/autoresearch-create ~/.pi/agent/skills/
```

Then `/reload` in pi.

To start a session, run:

```
/skill:autoresearch-create
```
The agent will set up the session and create `autoresearch.md` and `autoresearch.sh`.

Every session is fully recoverable from two files:
| File | Purpose |
|---|---|
| `autoresearch.jsonl` | Append-only log — one JSON line per run (metric, status, commit, description) |
| `autoresearch.md` | Living document — objective, what's been tried, dead ends, key wins |
A fresh agent with zero memory can read these two files and continue exactly where the previous session left off.
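For example, a fresh session could derive its resume point from the log alone. A minimal sketch (the sample lines mirror the log format documented later in this README; direction "lower" is assumed for the best-value computation):

```javascript
// Sketch: derive the resume point from an autoresearch.jsonl log.
// The sample lines below follow the format shown in this README.
const log = [
  '{"run":1,"metric_value":51.7,"status":"keep","description":"baseline","commit":"a1b2c3d"}',
  '{"run":2,"metric_value":53.1,"status":"discard","description":"bigger pool","commit":null}',
  '{"run":3,"metric_value":47.8,"status":"keep","description":"no coverage","commit":"e4f5g6h"}',
].join("\n");

const runs = log.trim().split("\n").map((line) => JSON.parse(line));

// Next run number is one past the last logged run
const nextRun = runs.length ? runs[runs.length - 1].run + 1 : 1;

// Best value among kept runs (assuming lower is better)
const kept = runs.filter((r) => r.status === "keep" && r.metric_value != null);
const best = kept.length ? Math.min(...kept.map((r) => r.metric_value)) : null;

console.log(`Resume at run ${nextRun}; best so far: ${best}`);
```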
The session itself is driven by three files:

| File | Purpose |
|---|---|
| `autoresearch.md` | Session document — objective, metrics, files in scope, experiment history |
| `autoresearch.sh` | Benchmark script — pre-checks, runs the workload, outputs `METRIC name=number` lines |
| `autoresearch.checks.sh` | (optional) Backpressure checks — tests, types, lint. Failures block a keep |
## Tools

### `init_experiment`

One-time session configuration. Call once at session start.

```javascript
await init_experiment({
  name: "vitest-speed",
  metric: "seconds",
  unit: "s",
  direction: "lower", // "lower" | "higher"
});
```
### `run_experiment`

Runs any shell command, times wall-clock duration, captures stdout/stderr.

```javascript
const result = await run_experiment({
  command: "pnpm test --run",
  timeout_seconds: 120, // optional, default 300
  checks_timeout_seconds: 300, // optional, for checks script
});
// result: { exit_code, duration_seconds, stdout, stderr }
```
### `log_experiment`

Records the result, auto-commits on keep, updates the status widget and dashboard.

```javascript
await log_experiment({
  metric_value: 42.3,
  status: "keep", // "keep" | "discard" | "crash" | "checks_failed"
  description: "Enable parallel test workers in vitest config",
  commit_message: "perf: parallel vitest workers → 42.3s (-18%)",
});
```
## The loop

Once started, the agent runs this cycle indefinitely:

```
propose change → edit files → run_experiment → measure metric
        ↓
metric improved?
  YES → log_experiment(keep)    → auto-commit → update autoresearch.md
  NO  → log_experiment(discard) → git revert  → try next idea
        ↓
repeat forever (until interrupted)
```
Interrupt anytime with Escape, then ask for a summary of what was tried.
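One iteration of the cycle above might be sketched like this. This is an illustration, not the extension's implementation: the tool calls follow the signatures documented above, while the metric name `seconds`, the parsing regex, and the improvement test (direction "lower") are assumptions:

```javascript
// Sketch of one loop iteration, using the tools documented above.
async function iterate(best) {
  // The agent edits files first, then benchmarks the change:
  const result = await run_experiment({ command: "bash autoresearch.sh" });

  if (result.exit_code !== 0) {
    await log_experiment({
      metric_value: null,
      status: "crash",
      description: "benchmark script failed",
    });
    return best; // changes reverted, best unchanged
  }

  // Parse the primary metric from stdout (e.g. "METRIC seconds=42.3")
  const value = Number(result.stdout.match(/^METRIC seconds=([\d.]+)$/m)[1]);

  const improved = value < best; // direction "lower"
  await log_experiment({
    metric_value: value,
    status: improved ? "keep" : "discard",
    description: "proposed change (illustrative)",
  });
  return improved ? value : best;
}
```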
## Benchmark script

`autoresearch.sh` must output at least one `METRIC` line:

```bash
#!/bin/bash
set -euo pipefail

# Pre-checks
[ -f package.json ] || { echo "No package.json"; exit 1; }

# Run workload
pnpm test --run

# Output metric — required format
echo "METRIC seconds=$SECONDS"
```
Multiple metrics are supported:

```bash
echo "METRIC duration_seconds=42.3"
echo "METRIC test_count=847"
echo "METRIC memory_mb=512"
```

The primary metric (set in `init_experiment`) drives keep/discard decisions. The others are recorded for analysis.
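A helper that extracts these `METRIC` lines from stdout might look like the following sketch (the extension's real parser is not shown in this README; `parseMetrics` is a hypothetical name):

```javascript
// Sketch: parse `METRIC name=number` lines from benchmark stdout.
// Lines with spaces around "=" or other text are ignored.
function parseMetrics(stdout) {
  const metrics = {};
  for (const line of stdout.split("\n")) {
    const m = line.match(/^METRIC (\w+)=(-?[\d.]+)$/);
    if (m) metrics[m[1]] = Number(m[2]);
  }
  return metrics;
}

parseMetrics("METRIC duration_seconds=42.3\nMETRIC test_count=847\n");
// → { duration_seconds: 42.3, test_count: 847 }
```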
## Checks script

Create `autoresearch.checks.sh` to guard correctness after every passing benchmark:

```bash
#!/bin/bash
set -euo pipefail
pnpm test --run  # full test suite
pnpm typecheck   # TypeScript
pnpm lint        # ESLint / Biome
```
Behavior:

- If any check fails, the run is logged as `checks_failed` and the changes are reverted (same as `crash`)
- `checks_failed` is recorded separately from `crash` so you can distinguish correctness failures from benchmark errors
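Putting the benchmark, metric, and checks outcomes together, the status decision can be sketched as follows (an illustration only; argument names and exact rules are assumptions, not the extension's code):

```javascript
// Sketch: classify a run's status from its outcomes.
function classifyRun({ benchExitCode, metricFound, checksExitCode, improved }) {
  // Benchmark errored or produced no METRIC line: a broken run
  if (benchExitCode !== 0 || !metricFound) return "crash";
  // Checks script present and failing: correctness failure
  if (checksExitCode !== undefined && checksExitCode !== 0) return "checks_failed";
  // Otherwise the metric decides
  return improved ? "keep" : "discard";
}

classifyRun({ benchExitCode: 0, metricFound: true, checksExitCode: 1, improved: true });
// → "checks_failed": a faster but broken change is still reverted
```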
The status widget is always visible above the editor:

```
🔬 autoresearch 12 runs 8 kept │ best: 42.3s
```
Open the dashboard with `/autoresearch` for a full results table with status, metric values, descriptions, and the best run highlighted.
Keyboard shortcuts:

- `Ctrl+X` — toggle dashboard
- `Escape` — close dashboard / interrupt loop

Example skill configurations for different domains:

```javascript
// Test speed
{
  command: "pnpm test --run",
  metric: "seconds",
  direction: "lower",
  scope: ["vitest.config.ts", "src/**/*.test.ts"],
}

// Bundle size
{
  command: "pnpm build && du -sb dist | cut -f1",
  metric: "bytes",
  direction: "lower",
  scope: ["vite.config.ts", "src/index.ts"],
}

// LLM training loss
{
  command: "uv run train.py --epochs 1",
  metric: "val_bpb",
  direction: "lower",
  scope: ["train.py", "model.py", "config.yaml"],
}

// Build speed
{
  command: "pnpm build",
  metric: "seconds",
  direction: "lower",
  scope: ["tsconfig.json", "vite.config.ts"],
}

// Lighthouse performance
{
  command: "lighthouse http://localhost:3000 --output=json | jq '.categories.performance.score'",
  metric: "score",
  direction: "higher",
  scope: ["src/pages/index.tsx", "public/"],
}
```
The skill writes and maintains `autoresearch.md` throughout the session:
```markdown
# autoresearch: vitest-speed

## Objective
Reduce test suite wall-clock time. Baseline: 51.7s.

## Metric
- Name: seconds
- Direction: lower is better
- Baseline: 51.7s
- Best so far: 42.3s (run 8)

## Files in scope
- vitest.config.ts
- src/**/*.test.ts

## What's been tried
- [kept] Run 8: Enable parallel workers → 42.3s (-18%)
- [discarded] Run 5: Increase pool size to 16 → 53.1s (+3%)
- [kept] Run 3: Disable coverage in CI → 47.8s (-8%)

## Dead ends
- Increasing pool beyond 8 causes memory pressure, net negative

## Next ideas
- [ ] Try forks pool instead of threads
- [ ] Investigate slow test files with --reporter=verbose
```
`autoresearch.jsonl` contains one JSON object per line:
```jsonl
{"run":1,"metric_value":51.7,"status":"keep","description":"baseline","commit":"a1b2c3d","timestamp":"2025-01-15T10:00:00Z"}
{"run":2,"metric_value":49.2,"status":"keep","description":"disable coverage","commit":"e4f5g6h","timestamp":"2025-01-15T10:03:21Z"}
{"run":3,"metric_value":53.1,"status":"discard","description":"increase pool to 16","commit":null,"timestamp":"2025-01-15T10:07:45Z"}
{"run":4,"metric_value":null,"status":"crash","description":"invalid vitest config syntax","commit":null,"timestamp":"2025-01-15T10:09:12Z"}
```
Read the log programmatically:
```javascript
import { readFileSync } from "fs";

const runs = readFileSync("autoresearch.jsonl", "utf-8")
  .trim()
  .split("\n")
  .map((line) => JSON.parse(line));

const kept = runs.filter((r) => r.status === "keep");
// keeps the lowest value; flip the comparison for direction "higher"
const best = kept.reduce((a, b) => (a.metric_value < b.metric_value ? a : b));

console.log(`Best: ${best.metric_value} — ${best.description}`);
```
The agent can resume from either file. Recommended resume prompt:
```
Read autoresearch.jsonl and autoresearch.md, then continue the experiment loop.
Don't restart — pick up from run N and keep going.
```
Or use the skill:
```
/skill:autoresearch-create resume
```
```
┌──────────────────────┐      ┌──────────────────────────┐
│ Extension (global)   │      │ Skill (per-domain)       │
│                      │      │                          │
│ run_experiment       │◄─────│ command: pnpm test       │
│ log_experiment       │      │ metric: seconds (lower)  │
│ widget + dashboard   │      │ scope: vitest configs    │
│                      │      │ ideas: pool, parallel…   │
└──────────────────────┘      └──────────────────────────┘
           │
           ▼
autoresearch.jsonl  ← append-only run log
autoresearch.md     ← living session document
```
The extension is domain-agnostic infrastructure. The skill encodes domain knowledge. One extension serves unlimited domains.
## Troubleshooting

### Loop not starting after the skill runs

- Check that `autoresearch.sh` is executable: `chmod +x autoresearch.sh`
- Confirm it prints a `METRIC name=number` line on success
- Run `bash autoresearch.sh` manually to debug

### Widget not showing

- Run `/reload` in pi to reload the extension
- Verify the extension exists at `~/.pi/agent/extensions/pi-autoresearch/`

### `run_experiment` times out

- Raise `timeout_seconds` in your `run_experiment` call

### Checks script blocking everything

- Test `autoresearch.checks.sh` exit codes manually: `bash autoresearch.checks.sh`
- Raise `checks_timeout_seconds` if tests are slow

### Session lost after context reset

- Read `autoresearch.jsonl` + `autoresearch.md` to resume

### Metric value not captured

- The `METRIC` line must be printed to stdout, not stderr
- The format is `METRIC name=number` (no spaces around `=`)

## License

MIT