Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.
Note: Install agent-eval from its repository after reviewing the source.
Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to a specific commit for reproducibility
```
Each agent run gets its own git worktree, so no Docker is required. Worktrees provide both isolation and reproducibility: agents cannot interfere with each other or corrupt the base repo, and every run starts from the same pinned commit.
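Conceptually, each run does something like the sketch below. This is a minimal illustration of the worktree model, not agent-eval's actual code; `agent_cmd` and `judge_cmd` stand in for the real agent and judge invocations.

```python
import subprocess

def run_isolated(agent_cmd: list[str], judge_cmd: list[str],
                 commit: str, run_id: int) -> bool:
    """One attempt in a throwaway worktree pinned to `commit` (illustrative)."""
    worktree = f".worktrees/run-{run_id}"
    # A fresh checkout per run: the base repo is never modified.
    subprocess.run(["git", "worktree", "add", worktree, commit], check=True)
    try:
        subprocess.run(agent_cmd, cwd=worktree, check=True)  # agent edits files here
        judge = subprocess.run(judge_cmd, cwd=worktree)      # judge decides pass/fail
        return judge.returncode == 0
    finally:
        subprocess.run(["git", "worktree", "remove", "--force", worktree],
                       check=True)
```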
| Metric | What It Measures |
|---|---|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
Create a tasks/ directory with YAML files, one per task:
```sh
mkdir tasks
# Write task definitions (see template above)
```
Execute agents against your tasks:
```sh
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```
Each run checks out a fresh worktree at the pinned commit, hands the task prompt to the agent, executes the judges against the result, and records pass/fail, cost, and wall-clock time.
Generate a comparison report:
```sh
agent-eval report --format table
```
```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │ 67%         │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```
Judges can combine multiple checks, such as a test suite plus a build command:

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```
Grep judges check that expected patterns appear in the touched files:

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```
LLM judges evaluate qualities that tests and pattern matches can't capture:

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```
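For reference, the kind of implementation this judge is probing for looks roughly like the sketch below. It is an illustration, not part of agent-eval; the `requests` library and the function name are assumptions for the example.

```python
import random
import time

import requests  # assumed HTTP library; the task's real client may differ

def get_with_retry(url: str, max_retries: int = 3,
                   initial_delay: float = 1.0, max_delay: float = 30.0):
    """Hypothetical passing answer: capped exponential backoff with jitter."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            time.sleep(random.uniform(0, delay))  # jittered sleep
            delay = min(delay * 2, max_delay)     # exponential growth, capped
```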