# acp-harness

## Install

```bash
npx claudepluginhub plaited/acp-harness --plugin acp-harness
```

## Description

Unified ACP client and evaluation harness. Connect to ACP-compatible agents programmatically, capture full trajectories (tools, thoughts, plans), and pipe to downstream analysis tools.

## Tool Access

This skill uses the workspace's default tool permissions.

## Purpose

This skill provides a unified toolkit for ACP client usage and agent evaluation, optimized for TypeScript/JavaScript projects using Bun. It has two parts:

  1. ACP Client API - Headless programmatic access to ACP-compatible agents
  2. Evaluation Harness - Run prompts against agents and capture full trajectories

Use this when:

  • Comparing skills across different agents (Claude Code, Cursor, OpenCode, Amp, Goose, Factory)
  • Evaluating built-in tools vs MCP servers vs skills for the same task
  • Generating training data with full trajectory capture
  • Running regression tests in CI/CD pipelines
  • Building multi-agent applications on a headless transport layer

## Foundation Use Cases

The harness is a foundation layer: it captures trajectories; scoring happens downstream.

```mermaid
flowchart LR
    Harness["ACP Harness"] -->|"trajectories"| Scoring["Braintrust / Custom Script"]
    Scoring -->|"scores"| Decision["Informed Choices"]
```

| Use Case | Harness Provides | You Build |
| --- | --- | --- |
| Cross-agent skill eval | Same prompts → multiple agents → trajectories | Scoring pipeline (Braintrust, custom) |
| Tool comparison | Trajectory with tool/skill attribution | Diff analysis, preference data |
| Training data | Structured I/O with tool calls, plans, thoughts | SFT/DPO formatting |
| Regression testing | Deterministic prompt → trajectory capture | CI integration, golden file comparison |
| Multi-agent apps | `createACPClient` transport layer | Session management, UI, agent switching |
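For the tool-comparison use case, the diff analysis you build downstream can be quite small. A minimal TypeScript sketch, assuming the summary JSONL format (`id` and `toolCalls` fields) described under Output Formats; `compareToolUsage` and the file contents are illustrative, not part of the harness:

```typescript
// Minimal summary-record shape (assumed from the summary output format).
type SummaryRecord = { id: string; toolCalls: string[] }

// Parse JSONL text into records, skipping blank lines.
const parseJsonl = (text: string): SummaryRecord[] =>
  text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as SummaryRecord)

// For each prompt id, report the tools each run used and whether they differ.
const compareToolUsage = (baseline: string, candidate: string) => {
  const byId = (records: SummaryRecord[]) =>
    new Map(records.map((r) => [r.id, r.toolCalls]))
  const a = byId(parseJsonl(baseline))
  const b = byId(parseJsonl(candidate))
  const ids = new Set([...a.keys(), ...b.keys()])
  return [...ids].map((id) => ({
    id,
    baseline: a.get(id) ?? [],
    candidate: b.get(id) ?? [],
    changed:
      JSON.stringify(a.get(id) ?? []) !== JSON.stringify(b.get(id) ?? []),
  }))
}
```

The output pairs naturally with preference-data formatting: rows where `changed` is true are the candidates for human or LLM-judge review.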

## Agents Supporting Skills

Skills can be installed across multiple agents, enabling cross-agent comparison:

| Agent | Skills Directory | Install Command |
| --- | --- | --- |
| Claude Code | `.claude/skills/` | `./install.sh --agent claude` |
| Cursor | `.claude/skills/` | `./install.sh --agent cursor` |
| OpenCode | `.opencode/skill/` | `./install.sh --agent opencode` |
| Amp | `.agents/skills/` | `./install.sh --agent amp` |
| Goose | `.claude/skills/` | `./install.sh --agent goose` |
| Factory | `.factory/skills/` | `./install.sh --agent factory` |

### Example: Comparing Built-in vs Skill

```bash
# Run the same prompt with the built-in tool
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  -o results-builtin.jsonl

# Run the same prompt with the custom skill installed
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  --cwd /project/with/typescript-lsp-skill \
  -o results-skill.jsonl

# Compare trajectories: which run used better tools? Was it faster? More accurate?
diff <(jq '.toolCalls' results-builtin.jsonl) <(jq '.toolCalls' results-skill.jsonl)
```

## Execution Environment

**Recommendation:** Run evaluations in Docker containers for consistent, isolated execution.

```bash
# Build and run with Docker Compose
docker compose -f docker-compose.acp.yml run --rm acp-harness

# Or build directly
docker build -f Dockerfile.acp -t acp-harness .
docker run --rm -e ANTHROPIC_API_KEY acp-harness
```

Docker provides:

  • Consistent environment across local and CI
  • Filesystem isolation without app-level sandboxing
  • Reproducible results for training data generation

See `assets/` for example container configurations:

  • `Dockerfile.acp` - Base container with Bun and git
  • `docker-compose.acp.yml` - Compose file with volume mounts for results

## Non-Goals

This harness is optimized for TypeScript/JavaScript projects using Bun; it is not designed for other runtimes or toolchains.

## Quick Reference

| Resource | Description |
| --- | --- |
| `scripts/run-harness.ts` | Execute prompts against an agent, capture trajectories |
| `client-api.md` | `createACPClient` configuration, helpers |
| `output-formats.md` | JSONL schemas, format options |
| `downstream.md` | Integration patterns (Braintrust, jq, LLM-as-judge) |
| `llm-judge-templates.md` | Evaluation prompt templates |

## Evaluation Workflow

```mermaid
flowchart LR
    Prompts["prompts.jsonl"] --> Harness["run-harness.ts"]
    Agent["ACP Agent"] --> Harness
    Harness --> Summary["summary.jsonl"]
    Harness --> Judge["results.md + results.full.jsonl"]
    Summary --> JQ["jq analysis"]
    Judge --> LLM["LLM-as-judge"]
```

  1. Prepare - Create prompts.jsonl with evaluation cases
  2. Execute - Run the harness against the target agent
  3. Capture - Trajectories are streamed to output files
  4. Analyze - Pipe to downstream tools for scoring
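The Analyze step can be as small as a script over the summary output. A sketch in TypeScript, assuming the summary-format fields (`status`, `duration`) shown under Output Formats; `summarize` is a hypothetical helper, not part of the harness:

```typescript
// Assumed shape of one summary JSONL line (see Output Formats).
type ResultRow = { id: string; status: string; duration: number }

// Compute quick run-level metrics: total, pass count, pass rate, mean latency.
const summarize = (jsonl: string) => {
  const rows: ResultRow[] = jsonl
    .split('\n')
    .filter((l) => l.trim() !== '')
    .map((l) => JSON.parse(l))
  const passed = rows.filter((r) => r.status === 'passed').length
  const totalMs = rows.reduce((sum, r) => sum + r.duration, 0)
  return {
    total: rows.length,
    passed,
    passRate: rows.length === 0 ? 0 : passed / rows.length,
    meanDurationMs: rows.length === 0 ? 0 : totalMs / rows.length,
  }
}
```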

## Harness Script

### Basic Usage

```bash
bun scripts/run-harness.ts <prompts.jsonl> --agent <command> [options]
```

### Arguments

| Flag | Description | Default |
| --- | --- | --- |
| `prompts.jsonl` | Input file with evaluation prompts | Required |
| `-a, --agent` | ACP agent command | `"claude-code-acp"` |
| `-o, --output` | Output file/path | stdout |
| `-c, --cwd` | Working directory for agent | current |
| `-t, --timeout` | Request timeout in ms | `60000` |
| `-f, --format` | Output format: `summary`, `judge` | `summary` |
| `--progress` | Show progress on stderr | `false` |
| `--append` | Append to output file | `false` |
| `--mcp-server` | MCP server config JSON (repeatable) | none |

### Examples

```bash
# Summary format (default): minimal JSONL
bun scripts/run-harness.ts prompts.jsonl -o results.jsonl

# Judge format: creates two files for two-tier evaluation
bun scripts/run-harness.ts prompts.jsonl --format judge -o results
# Creates: results.md (summary with step IDs) + results.full.jsonl (complete trajectory)

# With an MCP server (stdio transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem","/data"]}'

# With an MCP server (HTTP transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"http","name":"api","url":"http://localhost:3000"}'

# Different agent (Droid ACP adapter)
bun scripts/run-harness.ts prompts.jsonl --agent droid-acp -o results.jsonl

# Stream with progress
bun scripts/run-harness.ts prompts.jsonl --progress -o results.jsonl
```

### Input Format

Each line in prompts.jsonl is a JSON object:

```jsonl
{"id":"test-001","input":"Create a primary button","expected":"should contain <button>","metadata":{"category":"ui"}}
{"id":"test-002","input":"Write a function for form validation","metadata":{"category":"logic"}}
```

| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique identifier |
| `input` | Yes | Prompt text for the agent |
| `expected` | No | Expected output (for downstream scoring) |
| `metadata` | No | Tags, category, difficulty for filtering |
| `timeout` | No | Override the default timeout for this prompt |
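A TypeScript sketch of a pre-flight check for these fields, useful before a long run; `validatePromptLine` is a hypothetical helper you might write, not part of the harness:

```typescript
// Prompt-case shape, following the field table above.
type PromptCase = {
  id: string
  input: string
  expected?: string
  metadata?: Record<string, string>
  timeout?: number
}

// Parse one prompts.jsonl line and enforce required fields.
const validatePromptLine = (line: string): PromptCase => {
  const obj = JSON.parse(line)
  if (typeof obj.id !== 'string' || obj.id === '') {
    throw new Error('prompt line missing required "id"')
  }
  if (typeof obj.input !== 'string' || obj.input === '') {
    throw new Error(`prompt ${obj.id}: missing required "input"`)
  }
  if (obj.timeout !== undefined && typeof obj.timeout !== 'number') {
    throw new Error(`prompt ${obj.id}: "timeout" must be a number (ms)`)
  }
  return obj as PromptCase
}
```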

### Output Formats

#### Summary Format (default)

Minimal JSONL for quick metrics and analysis:

```jsonl
{"id":"test-001","input":"Create a button","output":"I created...","toolCalls":["Write"],"status":"passed","duration":1234}
```

#### Judge Format (two-tier)

Creates two files for LLM-as-judge evaluation:

**`<output>.md`** - Markdown summary with step IDs and code previews:

````markdown
## Evaluation Record: test-001

**Input:** Create a primary button

**Trajectory:**
1. [THOUGHT] I'll create a styled button... [->test-001-step-1]
2. [TOOL:Write] -> completed (234ms) [->test-001-step-2]
   File: src/button.tsx (847 chars)
   ```tsx
   import { createStyles } from '@plaited/acp'

   type ButtonProps = {
     label: string

   // ... 30 lines omitted ...

   export const Button = ({ label }: ButtonProps) => (
     <button {...styles.btn}>{label}</button>
   )
   ```
3. [MESSAGE] I created the button... [->test-001-step-3]

Output: I created the button with primary styling.
Metadata: category=ui, agent=claude-code-acp
Status: passed
Duration: 1234ms
````

**`<output>.full.jsonl`** - Complete trajectory with step IDs for correlation:

```jsonl
{"id":"test-001","input":"...","output":"...","trajectory":[{"type":"thought","content":"...","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{...},"output":{...},"duration":234,"stepId":"test-001-step-2"}],...}
```
Usage patterns by judge context window:

| Judge Model | Strategy |
| --- | --- |
| Gemini (1M+ tokens) | Feed `results.full.jsonl` directly |
| Claude/GPT-4 (128-200k) | Use `results.full.jsonl` for most runs |
| Smaller models | Use `results.md`, retrieve specific steps by ID as needed |
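For the smaller-model strategy, a sketch of the step retrieval: index `results.full.jsonl` by `stepId` so a judge reading `results.md` can request individual steps. Field names follow the trajectory example above; `indexSteps` is illustrative, not part of the harness:

```typescript
// Assumed shapes from the full-trajectory JSONL example above.
type Step = { stepId: string; type: string; [key: string]: unknown }
type FullRecord = { id: string; trajectory: Step[] }

// Build a stepId -> step lookup across all records in the file.
const indexSteps = (jsonl: string): Map<string, Step> => {
  const map = new Map<string, Step>()
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue
    const record: FullRecord = JSON.parse(line)
    for (const step of record.trajectory) map.set(step.stepId, step)
  }
  return map
}
```

A judge prompt can then cite `[->test-001-step-2]` and the wrapper resolves it to the full tool input/output on demand.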

## Programmatic Usage

```typescript
import { createACPClient, createPrompt, summarizeResponse } from '@plaited/acp'

// Requires: npm install -g @zed-industries/claude-code-acp
const client = createACPClient({
  command: ['claude-code-acp'],
  cwd: '/path/to/project',
})

await client.connect()
const session = await client.createSession()

const { updates } = await client.promptSync(
  session.id,
  createPrompt('Create a button with hover state')
)

// The full trajectory is in `updates`
const summary = summarizeResponse(updates)
console.log({
  text: summary.text,
  toolCalls: summary.completedToolCalls,
  hasErrors: summary.hasErrors,
})

await client.disconnect()
```

See `client-api.md` for complete API documentation.

## Downstream Integration

The harness outputs standard JSONL that pipes to any tool:

```bash
# Filter with jq
cat results.jsonl | jq 'select(.metadata.category == "ui")'

# Count tool usage
cat results.jsonl | jq -s 'map(.toolCalls | length) | add'

# Feed the full trajectory to Gemini (large context)
cat results.full.jsonl | your-gemini-judge.ts
```

See `downstream.md` for integration patterns with Braintrust, Gemini, and custom scorers.

## Evaluation Targets

| Target | How to Evaluate |
| --- | --- |
| Agent capability | Direct prompts; analyze trajectory quality |
| Skills | Set `--cwd` to a project with the skill; test skill-specific prompts |
| MCP Servers | Use the `--mcp-server` flag; verify tool usage in the trajectory |

### Skill Evaluation

```bash
bun scripts/run-harness.ts skill-prompts.jsonl \
  --cwd /project/with/skill \
  -o results.jsonl
```

### MCP Server Evaluation

```bash
bun scripts/run-harness.ts mcp-prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem"]}' \
  -o results.jsonl
```

## Related

  • `@plaited/acp` - Core ACP client module