Unified ACP client and evaluation harness. Connect to ACP-compatible agents programmatically, capture full trajectories (tools, thoughts, plans), and pipe to downstream analysis tools.
Executes prompts against ACP-compatible agents and captures complete trajectories for downstream evaluation.
```
/plugin marketplace add plaited/plaited
/plugin install plaited-workshop-claude@plaited/plaited
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill provides a unified toolkit for ACP client usage and agent evaluation, optimized for TypeScript/JavaScript projects using Bun.
Use this when comparing agents on the same prompts, evaluating skills or MCP servers, generating training data, or running regression tests against captured trajectories.
The harness is a foundation layer - it captures trajectories; scoring happens downstream.
```mermaid
flowchart LR
    Harness["ACP Harness"] -->|"trajectories"| Scoring["Braintrust / Custom Script"]
    Scoring -->|"scores"| Decision["Informed Choices"]
```
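A downstream scorer can be as small as a script that aggregates the summary JSONL before anything reaches Braintrust; a hedged sketch, assuming the summary format documented below:

```ts
// score.ts - minimal sketch of a downstream aggregation step.
// Assumes the summary JSONL format documented later in this skill.
const lines = (await Bun.file('results.jsonl').text()).trim().split('\n')
const records = lines.map((line) => JSON.parse(line))

const passed = records.filter((r) => r.status === 'passed').length
const totalMs = records.reduce((sum, r) => sum + r.duration, 0)

console.log(`pass rate: ${passed}/${records.length}`)
console.log(`avg duration: ${(totalMs / records.length).toFixed(0)}ms`)
```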
| Use Case | Harness Provides | You Build |
|---|---|---|
| Cross-agent skill eval | Same prompts → multiple agents → trajectories | Scoring pipeline (Braintrust, custom) |
| Tool comparison | Trajectory with tool/skill attribution | Diff analysis, preference data |
| Training data | Structured I/O with tool calls, plans, thoughts | SFT/DPO formatting for world-agent |
| Regression testing | Deterministic prompt → trajectory capture | CI integration, golden file comparison (see sketch below) |
| Multi-agent apps | createACPClient transport layer | Session management, UI, agent switching |
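The golden-file comparison from the regression-testing row might look like this sketch (file names and the exit-code convention are illustrative):

```ts
// regression-check.ts - hedged sketch of a golden-file comparison for CI.
// Compares tool-call sequences per prompt id against a stored baseline.
const parse = async (path: string) =>
  (await Bun.file(path).text()).trim().split('\n').map((l) => JSON.parse(l))

const golden = new Map((await parse('golden.jsonl')).map((r) => [r.id, r] as const))
const current = await parse('results.jsonl')

let failures = 0
for (const run of current) {
  const baseline = golden.get(run.id)
  if (!baseline) continue
  if (JSON.stringify(run.toolCalls) !== JSON.stringify(baseline.toolCalls)) {
    console.error(`${run.id}: tool calls drifted from the golden file`)
    failures++
  }
}
process.exit(failures > 0 ? 1 : 0)
```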
Skills can be installed across multiple agents, enabling cross-agent comparison:
| Agent | Skills Directory | Install Command |
|---|---|---|
| Claude Code | .claude/skills/ | ./install-workshop.sh --agent claude |
| Cursor | .claude/skills/ | ./install-workshop.sh --agent cursor |
| OpenCode | .opencode/skill/ | ./install-workshop.sh --agent opencode |
| Amp | .agents/skills/ | ./install-workshop.sh --agent amp |
| Goose | .claude/skills/ | ./install-workshop.sh --agent goose |
| Factory | .factory/skills/ | ./install-workshop.sh --agent factory |
```bash
# Run same prompt with built-in tool
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  -o results-builtin.jsonl

# Run same prompt with custom skill installed
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  --cwd /project/with/typescript-lsp-skill \
  -o results-skill.jsonl

# Compare trajectories - which used better tools? faster? more accurate?
diff <(jq '.toolCalls' results-builtin.jsonl) <(jq '.toolCalls' results-skill.jsonl)
```
Recommendation: Run evaluations in Docker containers for consistent, isolated execution.
```bash
# Build and run with Docker Compose
docker compose -f docker-compose.acp.yml run --rm acp-harness

# Or build directly
docker build -f Dockerfile.acp -t acp-harness .
docker run --rm -e ANTHROPIC_API_KEY acp-harness
```
Docker provides consistent, isolated, and reproducible runs across local machines and CI.
See assets/ for example container configurations:

- `Dockerfile.acp` - Base container with Bun and git
- `docker-compose.acp.yml` - Compose file with volume mounts for results

This harness is optimized for TypeScript/JavaScript projects using Bun; it is not designed for other languages or runtimes.
| Resource | Description |
|---|---|
| scripts/run-harness.ts | Execute prompts against an agent, capture trajectories |
| client-api.md | createACPClient configuration, helpers |
| output-formats.md | JSONL schemas, format options |
| downstream.md | Integration patterns (Braintrust, jq, LLM-as-judge) |
| llm-judge-templates.md | Evaluation prompt templates |
```mermaid
flowchart LR
    Prompts["prompts.jsonl"] --> Harness["run-harness.ts"]
    Agent["ACP Agent"] --> Harness
    Harness --> Summary["summary.jsonl"]
    Harness --> Judge["results.md + results.full.jsonl"]
    Summary --> JQ["jq analysis"]
    Judge --> LLM["LLM-as-judge"]
```
Create a `prompts.jsonl` file with evaluation cases, then run:

```bash
bun scripts/run-harness.ts <prompts.jsonl> --agent <command> [options]
```
| Flag | Description | Default |
|---|---|---|
| `prompts.jsonl` | Input file with evaluation prompts | Required |
| `-a, --agent` | ACP agent command | `"claude-code-acp"` |
| `-o, --output` | Output file/path | stdout |
| `-c, --cwd` | Working directory for agent | current |
| `-t, --timeout` | Request timeout in ms | `60000` |
| `-f, --format` | Output format: `summary`, `judge` | `summary` |
| `--progress` | Show progress to stderr | `false` |
| `--append` | Append to output file | `false` |
| `--mcp-server` | MCP server config JSON (repeatable) | none |
```bash
# Summary format (default) - minimal JSONL
bun scripts/run-harness.ts prompts.jsonl -o results.jsonl

# Judge format - creates two files for two-tier evaluation
bun scripts/run-harness.ts prompts.jsonl --format judge -o results
# Creates: results.md (summary with step IDs) + results.full.jsonl (complete trajectory)

# With MCP server (stdio transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem","/data"]}'

# With MCP server (HTTP transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"http","name":"api","url":"http://localhost:3000"}'

# Different agent (Droid ACP adapter)
bun scripts/run-harness.ts prompts.jsonl --agent droid-acp -o results.jsonl

# Stream with progress
bun scripts/run-harness.ts prompts.jsonl --progress -o results.jsonl
```
Each line in prompts.jsonl:
{"id":"test-001","input":"Create a primary button","expected":"should contain <button>","metadata":{"category":"ui"}}
{"id":"test-002","input":"Write a bThread for form validation","metadata":{"category":"behavioral"}}
| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier |
| `input` | Yes | Prompt text for the agent |
| `expected` | No | Expected output (for downstream scoring) |
| `metadata` | No | Tags, category, difficulty for filtering |
| `timeout` | No | Override default timeout for this prompt |
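Read as a schema, the field table above corresponds to roughly this TypeScript shape (the type name is illustrative):

```ts
// Sketch of one line of prompts.jsonl, derived from the field table above.
type PromptCase = {
  id: string                         // unique identifier
  input: string                      // prompt text for the agent
  expected?: string                  // expected output, for downstream scoring
  metadata?: Record<string, unknown> // tags, category, difficulty for filtering
  timeout?: number                   // per-prompt override of the default timeout, in ms
}
```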
**Summary format** - minimal JSONL for quick metrics and analysis:
{"id":"test-001","input":"Create a button","output":"I created...","toolCalls":["Write"],"status":"passed","duration":1234}
**Judge format** - creates two files for LLM-as-judge evaluation:
**`<output>.md`** - Markdown summary with step IDs and code previews:

````markdown
## Evaluation Record: test-001

**Input:** Create a primary button

**Trajectory:**
1. [THOUGHT] I'll create a styled button template... [->test-001-step-1]
2. [TOOL:Write] -> completed (234ms) [->test-001-step-2]
   File: src/button.tsx (847 chars)
   ```tsx
   import { createStyles } from 'plaited'
   type ButtonProps = {
     label: string
   // ... 30 lines omitted ...
   export const Button = ({ label }: ButtonProps) => (
     <button {...styles.btn}>{label}</button>
   )
   ```

**Output:** I created the button template with primary styling.
**Metadata:** category=ui, agent=claude-code-acp
**Status:** passed
**Duration:** 1234ms
````
**`<output>.full.jsonl`** - Complete trajectory with step IDs for correlation:
```jsonl
{"id":"test-001","input":"...","output":"...","trajectory":[{"type":"thought","content":"...","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{...},"output":{...},"duration":234,"stepId":"test-001-step-2"}],...}
Usage patterns by judge context window:
| Judge Model | Strategy |
|---|---|
| Gemini (1M+ tokens) | Feed results.full.jsonl directly |
| Claude/GPT-4 (128-200k) | Use results.full.jsonl for most runs |
| Smaller models | Use results.md, retrieve specific steps by ID as needed |
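For the smaller-model row, step retrieval can be a short lookup script; a hedged sketch, assuming the `stepId` fields shown in the judge-format example:

```ts
// get-step.ts - sketch: fetch one trajectory step by its stepId so a
// small-context judge can request detail on demand.
// Usage: bun get-step.ts results.full.jsonl test-001-step-2
const [file, stepId] = process.argv.slice(2)
const lines = (await Bun.file(file).text()).trim().split('\n')

for (const line of lines) {
  const record = JSON.parse(line)
  const step = record.trajectory?.find((s: { stepId: string }) => s.stepId === stepId)
  if (step) {
    console.log(JSON.stringify(step, null, 2))
    break
  }
}
```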
For programmatic use, the same trajectory capture is available through `createACPClient`:

```ts
import { createACPClient, createPrompt, summarizeResponse } from 'plaited/acp'

// Requires: npm install -g @zed-industries/claude-code-acp
const client = createACPClient({
  command: ['claude-code-acp'],
  cwd: '/path/to/project',
})

await client.connect()
const session = await client.createSession()

const { updates } = await client.promptSync(
  session.id,
  createPrompt('Create a button with hover state')
)

// Full trajectory is in updates
const summary = summarizeResponse(updates)
console.log({
  text: summary.text,
  toolCalls: summary.completedToolCalls,
  hasErrors: summary.hasErrors,
})

await client.disconnect()
```
See client-api.md for complete API documentation.
The harness outputs standard JSONL that pipes to any tool:
```bash
# Filter with jq
cat results.jsonl | jq 'select(.metadata.category == "ui")'

# Count tool usage
cat results.jsonl | jq -s 'map(.toolCalls | length) | add'

# Feed full trajectory to Gemini (large context)
cat results.full.jsonl | your-gemini-judge.ts
```
See downstream.md for integration patterns with Braintrust, Gemini, and custom scorers.
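As a sketch of the LLM-as-judge leg, a script can assemble one judge prompt per summary record and pipe it to whatever model you use (the rubric wording here is illustrative; see llm-judge-templates.md for the maintained templates):

```ts
// judge-prompt.ts - hedged sketch: build an LLM-as-judge prompt per record.
const records = (await Bun.file('results.jsonl').text())
  .trim().split('\n').map((l) => JSON.parse(l))

for (const r of records) {
  const prompt = [
    'You are grading an agent trajectory.',
    `Task: ${r.input}`,
    r.expected ? `Expected: ${r.expected}` : '',
    `Agent output: ${r.output}`,
    `Tools used: ${r.toolCalls.join(', ')}`,
    'Score 1-5 for correctness and tool choice, with a one-line rationale.',
  ].filter(Boolean).join('\n')
  console.log(prompt) // pipe into the judge model of your choice
}
```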
| Target | How to Evaluate |
|---|---|
| Agent capability | Direct prompts, analyze trajectory quality |
| Skills | Set --cwd to project with skill, test skill-specific prompts |
| MCP Servers | Use --mcp-server flag, verify tool usage in trajectory |
| Behavioral programs | Analyze trajectory for bThread coordination patterns |
```bash
# Evaluate a skill: point --cwd at a project with the skill installed
bun scripts/run-harness.ts skill-prompts.jsonl \
  --cwd /project/with/skill \
  -o results.jsonl

# Evaluate an MCP server: expose it via --mcp-server, then inspect toolCalls
bun scripts/run-harness.ts mcp-prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem"]}' \
  -o results.jsonl
```
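To confirm the MCP tools were actually exercised, a hedged check over the summary output (the tool names are hypothetical; substitute the ones your server exposes):

```ts
// verify-mcp.ts - sketch: confirm the agent called MCP-served tools.
const expectedTools = ['read_file', 'write_file'] // hypothetical fs-server tools
const records = (await Bun.file('results.jsonl').text())
  .trim().split('\n').map((l) => JSON.parse(l))

for (const r of records) {
  const used = expectedTools.filter((t) => r.toolCalls.includes(t))
  console.log(`${r.id}: used ${used.length}/${expectedTools.length} MCP tools`)
}
```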