Benchmarks WozCode vs vanilla Claude Code on user's git repo, measuring cost, turns, and time savings for 2-10 coding prompts. Triggers on 'compare woz', 'benchmark woz', or /woz-benchmark.
Install: npx claudepluginhub withwoz/wozcode-plugin --plugin woz
This skill is limited to using the following tools:
Run a side-by-side comparison of WozCode vs vanilla Claude Code on the user's own codebase. Each prompt runs twice against a fresh copy of the repo with `git reset --hard` between runs, so the target MUST be a clean git repo.
Compares coding agents like Claude Code, Aider, and Codex head-to-head on custom repo tasks using YAML definitions, git worktrees, and judges for pass rate, cost, time, consistency metrics.
Displays WOZCODE savings report showing calls saved, time saved, tokens saved, and lifetime totals. Useful for reviewing efficiency from Claude Code plugin usage.
Provides advanced cost intelligence for Claude Code sessions including efficiency scoring, waste detection, ROI metrics, model savings calculator, anomaly detection, and GitHub badge generation.
Share bugs, ideas, or general feedback.
TRIGGER: "compare woz", "how much does woz save", "benchmark woz", "woz vs claude", "show me the savings", "is woz worth it", or /woz-benchmark.
Make sure the user is logged in first (/woz-login). Ask for all three inputs in ONE short message (< 10 lines): the target repo path, the coding prompts (2-10), and any environment setup the repo needs (e.g. secrets or a .env; skip if the repo is self-contained). Do not re-explain what the benchmark does; the user already invoked it.
Do NOT ask about the model. Default to opus in the YAML config. Only switch to sonnet or haiku if the user volunteers a different choice in their answer (e.g. "use sonnet" or "try it on haiku").
Keep examples OUT of the user message unless they ask for help picking prompts. The user doesn't need 4 bullet points of good-vs-bad prompt examples.
Before writing any config, verify the target is usable:
```shell
test -d <target>
git -C <target> rev-parse --git-dir
git -C <target> status --porcelain
```
If the directory doesn't exist, isn't a git repo, or has uncommitted changes, STOP and tell the user how to fix it (e.g. "please commit or stash your changes — the benchmark resets the repo between runs and would lose your work").
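Wrapped in shell, the three checks above might look like the sketch below. The `check_target` helper and its error wording are illustrative, not part of the plugin:

```shell
# Hypothetical preflight gate: prints a short reason and returns
# non-zero if the target can't be benchmarked safely.
check_target() {
  target="$1"
  if [ ! -d "$target" ]; then
    echo "error: $target does not exist"; return 1
  fi
  if ! git -C "$target" rev-parse --git-dir >/dev/null 2>&1; then
    echo "error: $target is not a git repo"; return 1
  fi
  if [ -n "$(git -C "$target" status --porcelain)" ]; then
    echo "error: please commit or stash your changes first"; return 1
  fi
}
```

The dirty-tree check matters most: `git reset --hard` between runs would silently destroy any uncommitted work.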
Use the Write tool to create a YAML file at /tmp/woz-benchmark-<timestamp>.yaml (get the timestamp from date +%s). Format:
```yaml
model: opus
maxTurns: 15
prompts:
  - "first prompt from the user"
  - "second prompt from the user"
setup:
  commands:
    - "curl -L https://example.com/dataset.csv -o data/sample.csv"
    - "psql $DATABASE_URL -f seed.sql"
```
Notes:
- model: opus. Only use a different model if the user volunteered one in their answer.
- Omit the setup: block if the user didn't give any environment setup commands.
- maxTurns: 15 is a safety cap so a single prompt can't run away.
Give a one-line warning: "This'll take several minutes — each prompt runs twice." Then run:
```shell
node --no-warnings=ExperimentalWarning ${CLAUDE_PLUGIN_ROOT}/scripts/benchmark.js --target <target> --config <yaml-path> --user-env
```
--user-env loads the user's project CLAUDE.md hierarchy on BOTH sides. Do NOT pass --screenshots, --codex, --judge, or --trace.
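Put together, the config write and invocation might look like this sketch. The prompt, the target path, and the echoed (rather than executed) command are illustrative only:

```shell
# Write a minimal config at a timestamped /tmp path.
cfg="/tmp/woz-benchmark-$(date +%s).yaml"
cat > "$cfg" <<'EOF'
model: opus
maxTurns: 15
prompts:
  - "add input validation to the signup form"
EOF

# Echoed for illustration; the real step runs this via node.
echo node --no-warnings=ExperimentalWarning \
  "${CLAUDE_PLUGIN_ROOT}/scripts/benchmark.js" \
  --target "$HOME/projects/myapp" --config "$cfg" --user-env
```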
The benchmark prints a detailed text report at the end. Relay the full report to the user, then add a clear, sales-oriented savings summary at the top. Compute the deltas from the report's totals:
💰 WozCode Savings Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Cost saved: $X.XX (Y% cheaper)
Tokens saved: X,XXX (Y% fewer)
Turns saved: N (Y% fewer)
Time saved: X min (Y% faster)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Frame the numbers positively. If WozCode was slower or more expensive on a specific prompt, call it out honestly but note the aggregate.
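The percentage math is straightforward. With illustrative totals (say $1.20 vanilla vs $0.80 WozCode; not real benchmark output), the cost line could be derived like this:

```shell
# Illustrative totals, not real benchmark figures.
vanilla_cost=1.20
woz_cost=0.80
awk -v a="$vanilla_cost" -v b="$woz_cost" \
  'BEGIN { printf "Cost saved: $%.2f (%.0f%% cheaper)\n", a - b, (a - b) / a * 100 }'
# → Cost saved: $0.40 (33% cheaper)
```

The same delta/percentage pattern applies to tokens, turns, and time.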
Finally, tell the user where the detailed JSON report was saved (the benchmark prints this path).
No manual cleanup is needed: the config and report live in /tmp, and the OS cleans it up.