Unified ACP client and evaluation harness. Connect to ACP-compatible agents programmatically, capture full trajectories (tools, thoughts, plans), and pipe to downstream analysis tools.
```
/plugin marketplace add plaited/acp-harness
/plugin install acp-harness@plaited-marketplace
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill provides a unified toolkit for ACP client usage and agent evaluation, optimized for TypeScript/JavaScript projects using Bun.
Use this when you need to capture agent trajectories for downstream analysis. The harness is a foundation layer: it captures trajectories; scoring happens downstream.
```mermaid
flowchart LR
    Harness["ACP Harness"] -->|"trajectories"| Scoring["Braintrust / Custom Script"]
    Scoring -->|"scores"| Decision["Informed Choices"]
```
| Use Case | Harness Provides | You Build |
|---|---|---|
| Cross-agent skill eval | Same prompts → multiple agents → trajectories | Scoring pipeline (Braintrust, custom) |
| Tool comparison | Trajectory with tool/skill attribution | Diff analysis, preference data |
| Training data | Structured I/O with tool calls, plans, thoughts | SFT/DPO formatting |
| Regression testing | Deterministic prompt → trajectory capture | CI integration, golden file comparison |
| Multi-agent apps | createACPClient transport layer | Session management, UI, agent switching |
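For example, the regression-testing row can be sketched as a golden-file comparison in TypeScript. The record shape follows the harness's summary output; the `regressions` helper is illustrative, not part of the harness:

```typescript
// Illustrative golden-file check: flag cases whose tool usage drifted
// from a previously approved run. Shape matches the summary JSONL format.
type Summary = { id: string; output: string; toolCalls: string[] }

const regressions = (golden: Summary[], current: Summary[]): string[] => {
  const byId = new Map(current.map((r) => [r.id, r]))
  return golden
    .filter((g) => {
      const c = byId.get(g.id)
      // Missing case or changed tool sequence counts as a regression
      return !c || c.toolCalls.join(',') !== g.toolCalls.join(',')
    })
    .map((g) => g.id)
}

const regressed = regressions(
  [{ id: 'test-001', output: '...', toolCalls: ['Write'] }],
  [{ id: 'test-001', output: '...', toolCalls: ['Write', 'Bash'] }],
)
console.log(regressed) // ['test-001']
```

A CI job would load the golden file, run the harness, and fail the build when `regressions` returns a non-empty list.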
Skills can be installed across multiple agents, enabling cross-agent comparison:
| Agent | Skills Directory | Install Command |
|---|---|---|
| Claude Code | .claude/skills/ | ./install.sh --agent claude |
| Cursor | .claude/skills/ | ./install.sh --agent cursor |
| OpenCode | .opencode/skill/ | ./install.sh --agent opencode |
| Amp | .agents/skills/ | ./install.sh --agent amp |
| Goose | .claude/skills/ | ./install.sh --agent goose |
| Factory | .factory/skills/ | ./install.sh --agent factory |
```bash
# Run the same prompts with built-in tools
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  -o results-builtin.jsonl

# Run the same prompts with a custom skill installed
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  --cwd /project/with/typescript-lsp-skill \
  -o results-skill.jsonl

# Compare trajectories: which run used better tools? Faster? More accurate?
diff <(jq '.toolCalls' results-builtin.jsonl) <(jq '.toolCalls' results-skill.jsonl)
```
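Beyond `diff` and `jq`, the same comparison can be scripted. A minimal TypeScript sketch, where the `SummaryRecord` shape follows the summary JSONL format and the `diffToolCalls` helper is illustrative:

```typescript
// Illustrative comparison of tool usage between two harness runs,
// matched by prompt id. Shape follows the summary JSONL format.
type SummaryRecord = { id: string; toolCalls: string[] }

const diffToolCalls = (builtin: SummaryRecord[], skill: SummaryRecord[]) => {
  const byId = new Map(skill.map((r) => [r.id, r]))
  return builtin.flatMap((b) => {
    const s = byId.get(b.id)
    if (!s) return []
    const onlyBuiltin = b.toolCalls.filter((t) => !s.toolCalls.includes(t))
    const onlySkill = s.toolCalls.filter((t) => !b.toolCalls.includes(t))
    return onlyBuiltin.length || onlySkill.length
      ? [{ id: b.id, onlyBuiltin, onlySkill }]
      : []
  })
}

// Example: the skill run replaced a raw Write with a hypothetical LSP tool
const diffs = diffToolCalls(
  [{ id: 'test-001', toolCalls: ['Write', 'Read'] }],
  [{ id: 'test-001', toolCalls: ['typescript-lsp', 'Read'] }],
)
console.log(diffs)
```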
Recommendation: Run evaluations in Docker containers for consistent, isolated execution.
```bash
# Build and run with Docker Compose
docker compose -f docker-compose.acp.yml run --rm acp-harness

# Or build directly
docker build -f Dockerfile.acp -t acp-harness .
docker run --rm -e ANTHROPIC_API_KEY acp-harness
```
Docker provides consistent, isolated execution across runs.
See assets/ for example container configurations:

- `Dockerfile.acp` - Base container with Bun and git
- `docker-compose.acp.yml` - Compose file with volume mounts for results

This harness is optimized for TypeScript/JavaScript projects using Bun; it is not designed for projects outside that stack.
| Resource | Description |
|---|---|
| scripts/run-harness.ts | Execute prompts against agent, capture trajectories |
| client-api.md | createACPClient configuration, helpers |
| output-formats.md | JSONL schemas, format options |
| downstream.md | Integration patterns (Braintrust, jq, LLM-as-judge) |
| llm-judge-templates.md | Evaluation prompt templates |
```mermaid
flowchart LR
    Prompts["prompts.jsonl"] --> Harness["run-harness.ts"]
    Agent["ACP Agent"] --> Harness
    Harness --> Summary["summary.jsonl"]
    Harness --> Judge["results.md + results.full.jsonl"]
    Summary --> JQ["jq analysis"]
    Judge --> LLM["LLM-as-judge"]
```
Prepare a `prompts.jsonl` with evaluation cases, then run:

```bash
bun scripts/run-harness.ts <prompts.jsonl> --agent <command> [options]
```
| Flag | Description | Default |
|---|---|---|
| prompts.jsonl | Input file with evaluation prompts | Required |
| -a, --agent | ACP agent command | "claude-code-acp" |
| -o, --output | Output file/path | stdout |
| -c, --cwd | Working directory for agent | current |
| -t, --timeout | Request timeout in ms | 60000 |
| -f, --format | Output format: summary, judge | summary |
| --progress | Show progress to stderr | false |
| --append | Append to output file | false |
| --mcp-server | MCP server config JSON (repeatable) | none |
```bash
# Summary format (default) - minimal JSONL
bun scripts/run-harness.ts prompts.jsonl -o results.jsonl

# Judge format - creates two files for two-tier evaluation
bun scripts/run-harness.ts prompts.jsonl --format judge -o results
# Creates: results.md (summary with step IDs) + results.full.jsonl (complete trajectory)

# With MCP server (stdio transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem","/data"]}'

# With MCP server (HTTP transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"http","name":"api","url":"http://localhost:3000"}'

# Different agent (Droid ACP adapter)
bun scripts/run-harness.ts prompts.jsonl --agent droid-acp -o results.jsonl

# Stream with progress
bun scripts/run-harness.ts prompts.jsonl --progress -o results.jsonl
```
Each line in `prompts.jsonl` is a JSON object:

```jsonl
{"id":"test-001","input":"Create a primary button","expected":"should contain <button>","metadata":{"category":"ui"}}
{"id":"test-002","input":"Write a function for form validation","metadata":{"category":"logic"}}
```
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique identifier |
| input | Yes | Prompt text for the agent |
| expected | No | Expected output (for downstream scoring) |
| metadata | No | Tags, category, difficulty for filtering |
| timeout | No | Override default timeout for this prompt |
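A loader can enforce the required fields before a run. A hedged sketch based on the field table above; the `parsePromptLine` helper is illustrative, not part of the harness:

```typescript
// Illustrative validator for one prompts.jsonl line.
// Required fields per the schema: id and input.
type PromptCase = {
  id: string
  input: string
  expected?: string
  metadata?: Record<string, string>
  timeout?: number
}

const parsePromptLine = (line: string): PromptCase => {
  const obj = JSON.parse(line)
  if (typeof obj.id !== 'string' || typeof obj.input !== 'string') {
    throw new Error(`prompt line missing required id/input: ${line}`)
  }
  return obj as PromptCase
}

const c = parsePromptLine(
  '{"id":"test-001","input":"Create a primary button","metadata":{"category":"ui"}}',
)
console.log(c.id, c.metadata?.category) // test-001 ui
```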
Minimal JSONL for quick metrics and analysis:
{"id":"test-001","input":"Create a button","output":"I created...","toolCalls":["Write"],"status":"passed","duration":1234}
Creates two files for LLM-as-judge evaluation:
**`<output>.md`** - Markdown summary with step IDs and code previews:
## Evaluation Record: test-001
**Input:** Create a primary button
**Trajectory:**
1. [THOUGHT] I'll create a styled button... [->test-001-step-1]
2. [TOOL:Write] -> completed (234ms) [->test-001-step-2]
File: src/button.tsx (847 chars)
```tsx
import { createStyles } from '@plaited/acp'
type ButtonProps = {
label: string
// ... 30 lines omitted ...
export const Button = ({ label }: ButtonProps) => (
<button {...styles.btn}>{label}</button>
)
```
**Output:** I created the button with primary styling.
**Metadata:** category=ui, agent=claude-code-acp
**Status:** passed
**Duration:** 1234ms
**`<output>.full.jsonl`** - Complete trajectory with step IDs for correlation:
```jsonl
{"id":"test-001","input":"...","output":"...","trajectory":[{"type":"thought","content":"...","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{...},"output":{...},"duration":234,"stepId":"test-001-step-2"}],...}
Usage patterns by judge context window:
| Judge Model | Strategy |
|---|---|
| Gemini (1M+ tokens) | Feed results.full.jsonl directly |
| Claude/GPT-4 (128-200k) | Use results.full.jsonl for most runs |
| Smaller models | Use results.md, retrieve specific steps by ID as needed |
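For the retrieve-by-ID strategy, a judge wrapper can resolve a step ID from the `.md` summary back to the full record. An illustrative sketch; shapes follow the `results.full.jsonl` example, and `getStep` is not a harness API:

```typescript
// Illustrative lookup: map a step ID (e.g. "test-001-step-2" from the
// Markdown summary) to the matching entry in the full trajectory.
type Step = { stepId: string; type: string; [k: string]: unknown }
type FullRecord = { id: string; trajectory: Step[] }

const getStep = (records: FullRecord[], stepId: string): Step | undefined =>
  records.flatMap((r) => r.trajectory).find((s) => s.stepId === stepId)

const records: FullRecord[] = [
  {
    id: 'test-001',
    trajectory: [
      { stepId: 'test-001-step-1', type: 'thought' },
      { stepId: 'test-001-step-2', type: 'tool_call' },
    ],
  },
]
console.log(getStep(records, 'test-001-step-2')?.type) // tool_call
```

A small judge can then fetch only the steps it needs instead of ingesting the whole trajectory.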
```ts
import { createACPClient, createPrompt, summarizeResponse } from '@plaited/acp'

// Requires: npm install -g @zed-industries/claude-code-acp
const client = createACPClient({
  command: ['claude-code-acp'],
  cwd: '/path/to/project',
})

await client.connect()
const session = await client.createSession()

const { updates } = await client.promptSync(
  session.id,
  createPrompt('Create a button with hover state')
)

// Full trajectory is in updates
const summary = summarizeResponse(updates)
console.log({
  text: summary.text,
  toolCalls: summary.completedToolCalls,
  hasErrors: summary.hasErrors,
})

await client.disconnect()
```
See client-api.md for complete API documentation.
The harness outputs standard JSONL that pipes to any tool:
```bash
# Filter with jq
cat results.jsonl | jq 'select(.metadata.category == "ui")'

# Count tool usage
cat results.jsonl | jq -s 'map(.toolCalls | length) | add'

# Feed full trajectory to Gemini (large context)
cat results.full.jsonl | your-gemini-judge.ts
```
See downstream.md for integration patterns with Braintrust, Gemini, and custom scorers.
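As a starting point for a custom scorer, a substring check against the optional `expected` field can be sketched as follows. The helper and its scoring rule are illustrative, not a harness API:

```typescript
// Illustrative scorer: score 1 when the agent's output contains the
// expected substring, 0 otherwise; prompts without `expected` pass by default.
type Result = { id: string; output: string; expected?: string }
type Scored = { id: string; score: number }

const scoreSubstring = (results: Result[]): Scored[] =>
  results.map((r) => ({
    id: r.id,
    score: r.expected === undefined ? 1 : r.output.includes(r.expected) ? 1 : 0,
  }))

const scores = scoreSubstring([
  { id: 'test-001', output: '<button>Click</button>', expected: '<button>' },
  { id: 'test-002', output: 'no button here', expected: '<button>' },
])
console.log(scores)
```

The same loop is where a Braintrust upload or an LLM-as-judge call would slot in.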
| Target | How to Evaluate |
|---|---|
| Agent capability | Direct prompts, analyze trajectory quality |
| Skills | Set --cwd to project with skill, test skill-specific prompts |
| MCP Servers | Use --mcp-server flag, verify tool usage in trajectory |
```bash
# Evaluate a skill: point --cwd at a project with the skill installed
bun scripts/run-harness.ts skill-prompts.jsonl \
  --cwd /project/with/skill \
  -o results.jsonl

# Evaluate an MCP server: verify tool usage in the resulting trajectory
bun scripts/run-harness.ts mcp-prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem"]}' \
  -o results.jsonl
```
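To verify tool usage in the trajectory for an MCP server, filter the captured tool calls by server. The `serverName__tool` prefix convention below is an assumption for illustration; check your agent's actual tool naming:

```typescript
// Illustrative check that an MCP server's tools were actually used.
// ASSUMPTION: tool names are prefixed "serverName__", which varies by agent.
type Summary = { id: string; toolCalls: string[] }

const usedMcpTools = (record: Summary, serverName: string): string[] =>
  record.toolCalls.filter((t) => t.startsWith(`${serverName}__`))

const hits = usedMcpTools(
  { id: 'test-001', toolCalls: ['fs__read_file', 'Write'] },
  'fs',
)
console.log(hits) // ['fs__read_file']
```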