Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
From everything-claude-codenpx claudepluginhub hieuck/pro5chromemanagerThis skill uses the workspace's default tool permissions.
Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.
Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
Designs and optimizes AI agent action spaces, tool definitions, observation formats, error recovery, and context for higher task completion rates.
A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.
Note: Install agent-eval from its repository after reviewing the source.
Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
- src/http_client.py
prompt: |
Add retry logic with exponential backoff to all HTTP requests.
Max 3 retries. Initial delay 1s, max delay 30s.
judge:
- type: pytest
command: pytest tests/test_http_client.py -v
- type: grep
pattern: "exponential_backoff|retry"
files: src/http_client.py
commit: "abc1234" # pin to specific commit for reproducibility
Each agent run gets its own git worktree — no Docker required. This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo.
| Metric | What It Measures |
|---|---|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
Create a tasks/ directory with YAML files, one per task:
mkdir tasks
# Write task definitions (see template above)
Execute agents against your tasks:
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
Each run:
Generate a comparison report:
agent-eval report --format table
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent │ Pass Rate │ Cost │ Time │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code │ 3/3 │ $0.12 │ 45s │ 100% │
│ aider │ 2/3 │ $0.08 │ 38s │ 67% │
└──────────────┴───────────┴────────┴────────┴─────────────┘
judge:
- type: pytest
command: pytest tests/ -v
- type: command
command: npm run build
judge:
- type: grep
pattern: "class.*Retry"
files: src/**/*.py
judge:
- type: llm
prompt: |
Does this implementation correctly handle exponential backoff?
Check for: max retries, increasing delays, jitter.