---
name: autoagent-harness-engineering
description: Autonomous agent harness engineering using AutoAgent — meta-agent that iteratively builds, benchmarks, and improves AI agent harnesses overnight
triggers:
- set up autoagent for autonomous agent engineering
- run the meta-agent harness loop
- configure agent.py harness for benchmarking
- add benchmark tasks to autoagent
- iterate on agent harness with autoagent
- run harbor benchmark with autoagent
- how do I use autoagent to improve my agent
- set up autonomous harness engineering experiment
---
# AutoAgent Harness Engineering
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
AutoAgent is an autonomous harness engineering framework. You give an AI agent a task, and it builds and iterates on an agent harness overnight — modifying system prompts, tools, agent configuration, and orchestration, then running benchmarks, checking scores, and keeping or discarding changes automatically. Think of it as autoresearch, but for agent engineering.
---
## Installation & Setup
### Requirements
- Docker (Desktop or Engine)
- Python 3.10+
- [uv](https://docs.astral.sh/uv/)
- Model provider credentials (e.g., `OPENAI_API_KEY`)
### Install
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and install dependencies
git clone https://github.com/kevinrgu/autoagent
cd autoagent
uv sync
# Configure environment variables (unquoted EOF expands $OPENAI_API_KEY
# from your current shell; alternatively, edit .env by hand)
cat > .env << EOF
OPENAI_API_KEY=$OPENAI_API_KEY
# Add other provider keys as needed
EOF
# Build the base Docker image
docker build -f Dockerfile.base -t autoagent-base .
```

## Repository Layout

```text
autoagent/
├── agent.py          # Single-file harness under test (primary edit surface)
├── program.md        # Meta-agent instructions + directive (human edits this)
├── Dockerfile.base   # Base Docker image for task containers
├── tasks/            # Benchmark tasks in Harbor format
│   └── my-task/
│       ├── task.toml
│       ├── instruction.md
│       ├── tests/
│       │   ├── test.sh
│       │   └── test.py
│       ├── environment/
│       │   └── Dockerfile
│       └── files/
├── jobs/             # Harbor job outputs (auto-generated)
├── results.tsv       # Experiment log (created by meta-agent, gitignored)
├── run.log           # Latest run output
└── .agent/           # Optional workspace artifacts
```

- `program.md`: instructions for the meta-agent plus the directive (what kind of agent to build). You edit this.
- `agent.py`: the entire harness under test. Contains config, tool definitions, agent registry, routing/orchestration, and the Harbor adapter (a fixed section). The meta-agent edits this.

## The Loop

```text
meta-agent reads program.md
→ inspects current agent.py
→ runs benchmark
→ diagnoses failures
→ modifies agent.py (prompt, tools, config, orchestration)
→ runs benchmark again
→ keeps change if score improves, discards if not
→ repeats
```
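Conceptually, the keep/discard loop looks like the minimal Python sketch below. Every helper here (`propose_change`, `apply_change`, `run_benchmark`, `revert_change`, `log_result`) is a hypothetical stand-in for actions the meta-agent performs via tool calls; the real loop is carried out by a coding agent reading program.md.

```python
import random

def propose_change() -> str:
    """Hypothetical: pick a harness edit to try next."""
    return "tweak SYSTEM_PROMPT"

def apply_change(change: str) -> None: ...
def revert_change(change: str) -> None: ...
def log_result(change: str, score: float) -> None: ...

def run_benchmark() -> float:
    """Hypothetical: run `uv run harbor run ...` and parse the score."""
    return random.random()

def harness_loop(baseline: float, max_iters: int = 20) -> float:
    best_score = baseline
    for _ in range(max_iters):
        change = propose_change()
        apply_change(change)           # edit agent.py's editable section
        score = run_benchmark()
        if score > best_score:         # keep only improvements
            log_result(change, score)  # e.g. append a row to results.tsv
            best_score = score
        else:
            revert_change(change)      # roll back anything that didn't help
    return best_score
```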
## Running Benchmarks

### Single task

```bash
rm -rf jobs
mkdir -p jobs
uv run harbor run \
-p tasks/ \
--task-name "<task-name>" \
-l 1 \
-n 1 \
--agent-import-path agent:AutoAgent \
-o jobs \
--job-name latest > run.log 2>&1
```

### All tasks

```bash
rm -rf jobs
mkdir -p jobs
uv run harbor run \
-p tasks/ \
-n 100 \
--agent-import-path agent:AutoAgent \
-o jobs \
--job-name latest > run.log 2>&1
```

Flags:
| Flag | Description |
|---|---|
| `-p tasks/` | Path to tasks directory |
| `--task-name` | Run a specific named task |
| `-l 1` | Limit to 1 task |
| `-n <N>` | Concurrency (default 4; use 100 for max parallelism) |
| `--agent-import-path` | Python import path to your agent class |
| `-o jobs` | Output directory for job results |
| `--job-name` | Label for this run |
## Starting the Meta-Agent

Point any coding agent (Claude Code, Cursor, Codex, etc.) at the repo and say:

> Read program.md and let's kick off a new experiment!
## Adding Benchmark Tasks

Tasks live in `tasks/`, following the Harbor task format.

```text
tasks/my-task/
├── task.toml         # Config (timeouts, metadata)
├── instruction.md    # Prompt sent to the agent
├── tests/
│   ├── test.sh       # Entry point; writes /logs/reward.txt
│   └── test.py       # Verification logic
├── environment/
│   └── Dockerfile    # FROM autoagent-base
└── files/            # Reference files mounted into container
```

### task.toml example

```toml
[task]
name = "my-task"
description = "A sample benchmark task"
timeout = 300
[task.metadata]
category = "reasoning"
difficulty = "medium"
```
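To catch config typos before a long run, the file can be parsed with the standard library. A small sketch; note `tomllib` requires Python 3.11+, so on 3.10 substitute the third-party `tomli` package:

```python
# Validate that task.toml parses and has the fields shown above
import tomllib  # Python 3.11+; on 3.10: import tomli as tomllib

with open("tasks/my-task/task.toml", "rb") as f:
    cfg = tomllib.load(f)

assert "name" in cfg["task"], "task.name is required"
print(cfg["task"]["name"], cfg["task"].get("timeout", "no timeout set"))
```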
### instruction.md example

```text
You are given a dataset of customer reviews. Your task is to:
1. Classify each review as positive, negative, or neutral
2. Extract the main topic of each review
3. Output a JSON file at /output/results.json
The reviews are located at /files/reviews.csv
```

### tests/test.sh example

```bash
#!/bin/bash
set -e
# Run verification and write score to /logs/reward.txt
python /tests/test.py
```

### tests/test.py example

```python
import json
import os

EXPECTED_COUNT = 50
reward_path = "/logs/reward.txt"
output_path = "/output/results.json"

try:
    with open(output_path) as f:
        results = json.load(f)
    # Score based on completeness and format
    score = 0.0
    if isinstance(results, list):
        score += 0.3  # correct format
        correct = sum(1 for r in results if "label" in r and "topic" in r)
        score += 0.7 * (correct / EXPECTED_COUNT)
    score = min(1.0, max(0.0, score))
except Exception:
    score = 0.0

os.makedirs("/logs", exist_ok=True)
with open(reward_path, "w") as f:
    f.write(str(score))
```
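When a verifier misbehaves, it helps to read the reward file back directly. This sketch assumes the container paths shown above and would run inside the task container:

```python
# Confirm the verifier wrote a parseable score in [0.0, 1.0]
with open("/logs/reward.txt") as f:
    reward = float(f.read().strip())

assert 0.0 <= reward <= 1.0, f"reward out of range: {reward}"
print(f"reward OK: {reward}")
```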
### environment/Dockerfile example

```dockerfile
FROM autoagent-base
# Add any task-specific dependencies
RUN pip install pandas scikit-learn
# Copy reference files
COPY files/ /files/
```

## agent.py Structure

agent.py is the single-file harness. It has two sections: an editable section (config, tools, registry, routing) and a fixed Harbor adapter at the end.

```python
# ── CONFIG ──────────────────────────────────────────────────────────────────
MODEL = "gpt-4o"
TEMPERATURE = 0.0
MAX_TOKENS = 4096
SYSTEM_PROMPT = """You are a helpful AI assistant..."""
# ── TOOL DEFINITIONS ─────────────────────────────────────────────────────────
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the filesystem",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path to read"}
                },
                "required": ["path"],
            },
        },
    },
    # ... more tools
]

# ── AGENT REGISTRY ───────────────────────────────────────────────────────────
AGENTS = {
    "default": {
        "model": MODEL,
        "system_prompt": SYSTEM_PROMPT,
        "tools": TOOLS,
        "temperature": TEMPERATURE,
    }
}

# ── ROUTING / ORCHESTRATION ──────────────────────────────────────────────────
def route(task: dict) -> str:
    """Return agent name based on task properties."""
    return "default"

# ── HARBOR ADAPTER (FIXED — DO NOT EDIT) ─────────────────────────────────────
class AutoAgent:
    """Harbor-compatible agent entry point."""
    # ... Harbor integration + trajectory serialization
```
## program.md Template

```markdown
# Meta-Agent Program
## Context
This repo implements an autonomous harness engineering loop. The meta-agent
reads this file, inspects agent.py, runs benchmarks, modifies the harness,
and iterates to maximize benchmark score.
## Directive
Build an agent that can [DESCRIBE YOUR TASK HERE].
## Constraints
- Focus on [specific capabilities]
- The agent should prioritize [accuracy/speed/cost]
- Target benchmark: [task name]
## Current Status
- Baseline score: [X.X]
- Last experiment: [description]
- Next hypothesis: [what to try]
## Experiment Log
| Run | Change | Score | Delta |
|-----|--------|-------|-------|
| 1   | baseline | 0.42  | -     |
```
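results.tsv is created and maintained by the meta-agent, so its exact columns are up to the agent. Purely as an illustration, appending one experiment row might look like this (the column layout here is an assumption):

```python
# Append one experiment row to results.tsv (hypothetical column layout)
import csv
from datetime import datetime, timezone

row = [
    datetime.now(timezone.utc).isoformat(),  # timestamp
    "raise executor temperature to 0.1",     # change description
    "0.47",                                  # score
    "+0.05",                                 # delta vs. previous best
]
with open("results.tsv", "a", newline="") as f:
    csv.writer(f, delimiter="\t").writerow(row)
```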
## Example: Multi-Agent Registry and Routing

```python
# In agent.py editable section
AGENTS = {
    "classifier": {
        "model": "gpt-4o-mini",
        "system_prompt": "You classify tasks by type. Output JSON.",
        "tools": [],
        "temperature": 0.0,
    },
    "executor": {
        "model": "gpt-4o",
        "system_prompt": "You execute complex tasks with tools.",
        "tools": TOOLS,
        "temperature": 0.1,
    },
    "verifier": {
        "model": "gpt-4o",
        "system_prompt": "You verify outputs for correctness.",
        "tools": [],
        "temperature": 0.0,
    },
}

def route(task: dict) -> str:
    instruction = task.get("instruction", "").lower()
    if "verify" in instruction or "check" in instruction:
        return "verifier"
    elif len(instruction) < 100:
        return "classifier"
    return "executor"
```
## Example: Tool Definitions

```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "bash",
            "description": "Run a bash command and return output",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string"}
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"}
                },
                "required": ["path", "content"],
            },
        },
    },
]
```
## Example: System Prompt

```python
SYSTEM_PROMPT = """You are an expert AI assistant.
Before answering, always:
1. Restate the task in your own words
2. Break it into subtasks
3. Execute each subtask step by step
4. Verify your output
Use tools liberally. If a tool fails, retry with different parameters.
Always write final outputs to the specified output path.
"""
## Docker Hygiene

Docker images and containers accumulate across runs:

```bash
# Clean Harbor's cached task images and task cache
uv run harbor cache clean -f
# Full Docker cleanup (all unused images, build cache)
docker system prune -a -f
# Lighter: just dead containers
docker container prune -f
# If Docker becomes unresponsive after many concurrent runs
killall Docker && open -a Docker
```

## Troubleshooting

### Scores missing or zero

- Check that `/logs/reward.txt` exists and contains a float between 0.0 and 1.0
- Look in `jobs/latest/` for task-specific logs
- `cat run.log` to see Harbor output and errors
- Make sure `test.sh` is executable: `chmod +x tasks/my-task/tests/test.sh`

### Base image missing

```bash
# Ensure base image exists
docker images | grep autoagent-base
# Rebuild if missing
docker build -f Dockerfile.base -t autoagent-base .
```

### Agent import errors

```bash
# Verify agent.py exports AutoAgent
python -c "from agent import AutoAgent; print('OK')"

# Check that the import path flag matches:
#   --agent-import-path agent:AutoAgent
```
### Docker overload or flaky runs

```bash
# Reduce concurrency
uv run harbor run -p tasks/ -n 4 --agent-import-path agent:AutoAgent -o jobs --job-name latest

# Clean up between runs
rm -rf jobs && mkdir -p jobs
```
### Meta-agent plateaus

- Sharpen `program.md` with clearer directives and constraints
- Review `results.tsv` for experiment history
- Add `.agent/` workspace files with domain-specific context
- Make sure `tasks/` contains representative, well-scored tasks

## Provider Keys

```bash
# OpenAI
OPENAI_API_KEY=...
# Anthropic (if using Claude)
ANTHROPIC_API_KEY=...
# Google (if using Gemini)
GOOGLE_API_KEY=...
# Azure OpenAI (if applicable)
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=...
```

Set these in `.env` at the project root; `uv run` picks them up via python-dotenv or your shell environment.
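Before kicking off an overnight run, it is worth confirming the keys are actually visible to the process (adjust the tuple to the providers you use):

```python
# Check that expected provider keys are set in the environment
import os

required = ("OPENAI_API_KEY",)  # add ANTHROPIC_API_KEY, GOOGLE_API_KEY, etc.
missing = [k for k in required if not os.environ.get(k)]
print("missing keys:", ", ".join(missing) if missing else "none")
```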
## Quick Reference

| Action | Command |
|---|---|
| Install deps | `uv sync` |
| Build base image | `docker build -f Dockerfile.base -t autoagent-base .` |
| Run single task | `uv run harbor run -p tasks/ --task-name <name> -n 1 --agent-import-path agent:AutoAgent -o jobs --job-name latest` |
| Run all tasks | `uv run harbor run -p tasks/ -n 100 --agent-import-path agent:AutoAgent -o jobs --job-name latest` |
| View run logs | `cat run.log` |
| View job outputs | `ls jobs/latest/` |
| Clean Harbor cache | `uv run harbor cache clean -f` |
| Clean Docker | `docker system prune -a -f` |
| Start meta-agent | Say: "Read program.md and let's kick off a new experiment!" |