Run the WebBench browser agent benchmark — main entry point and orchestrator. Triggers: "run benchmark", "run WebBench", "start benchmark", "benchmark browser agent", "web bench", "execute WebBench", "run web-bench". Parses user configuration (category filter, sample size, resume), delegates to load-dataset, execute-task, evaluate-task, and generate-report skills. This is the user-invocable orchestrator that ties the full benchmark pipeline together.
Install: `npx claudepluginhub lespaceman/athena-workflow-marketplace --plugin web-bench`
Main entry point for running the WebBench browser agent benchmark. This skill is used in interactive (single-session) mode. For multi-session workflow execution, see the system prompt.
Parse configuration from: $ARGUMENTS
Supported flags:
| Flag | Description | Default |
|---|---|---|
| `--category <CAT>` | Filter tasks by category (READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION) | All categories |
| `--sample <N>` | Random sample of N tasks (deterministic seed=42) | Full dataset |
| `--resume` | Resume from existing web-bench-results.jsonl, skipping completed task IDs | Fresh run |
| `--report-only` | Skip execution; just generate a report from existing results | Full run |
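The flag parsing and deterministic sampling described above could be sketched as follows. This is a minimal illustration, not the skill's actual implementation; the helper names `parse_config` and `select_tasks` are assumptions, while the flag names, categories, and seed=42 come from the table.

```python
import argparse
import random

def parse_config(argv):
    """Parse benchmark flags (hypothetical helper mirroring the flag table)."""
    parser = argparse.ArgumentParser(prog="run-benchmark")
    parser.add_argument("--category",
                        choices=["READ", "CREATE", "UPDATE", "DELETE", "FILE_MANIPULATION"],
                        default=None, help="Filter tasks by category (default: all)")
    parser.add_argument("--sample", type=int, default=None,
                        help="Random sample of N tasks (deterministic, seed=42)")
    parser.add_argument("--resume", action="store_true",
                        help="Skip task IDs already present in web-bench-results.jsonl")
    parser.add_argument("--report-only", action="store_true",
                        help="Skip execution; only generate the report")
    return parser.parse_args(argv)

def select_tasks(tasks, config):
    """Apply the category filter, then a seeded sample so reruns pick the same tasks."""
    if config.category:
        tasks = [t for t in tasks if t["category"] == config.category]
    if config.sample is not None and config.sample < len(tasks):
        tasks = random.Random(42).sample(tasks, config.sample)
    return tasks
```

Seeding a fresh `random.Random(42)` per call is what makes `--sample` deterministic: the same dataset and flags always yield the same subset.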
Examples:

- `run-benchmark --category READ --sample 50`: 50 random READ tasks
- `run-benchmark --resume`: continue from where the last run stopped
- `run-benchmark --report-only`: just aggregate existing results

When run interactively (not via the workflow loop), this skill executes the full pipeline in a single session:
Setup:

1. Check for existing run files (`web-bench-tasks.jsonl`, `web-bench-results.jsonl`)
2. If `--resume` and results exist: determine completed task IDs and skip them
3. Invoke the load-dataset skill to download and prepare the dataset

For each task in `web-bench-tasks.jsonl` (skipping completed tasks if resuming):

1. Record the start time (`date +%s%3N`)
2. Apply the execute-task methodology and have the browser-capable calling context perform the browser automation
3. Apply the evaluate-task methodology and score the result
4. Record the end time (`date +%s%3N`) and compute the duration
5. Append a result line to `web-bench-results.jsonl`:

   ```json
   {"id": 42, "url": "...", "category": "READ", "task": "...", "score": 1.0, "verdict": "PASS", "reasoning": "...", "error": null, "duration_ms": 34200, "tokens_used": {"input": 12450, "output": 3200}, "timestamp": "2026-03-19T14:30:00Z"}
   ```

6. Print a status line, e.g. `[42/2454] PASS (1.0) — acehardware.com — READ — 34.2s`

After all tasks are processed (or if `--report-only`):

1. Apply the generate-report methodology
2. Aggregate `web-bench-results.jsonl` into `web-bench-report.md`

Token usage should be tracked per task. The agent should estimate tokens consumed during task execution by recording:
If exact token counts are available from the session metadata, prefer those over estimates.
After each task, print a status line:
```
[1/50] PASS (1.0) acehardware.com READ 34.2s 15,650 tokens
[2/50] FAIL (0.0) airbnb.com CREATE 12.1s 8,200 tokens [auth_required]
[3/50] PARTIAL (0.5) amazon.com READ 45.8s 22,100 tokens
```
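A status line like the examples above could be formatted with a small helper; this is a sketch (the function name and the optional `tag` parameter for error classes like `auth_required` are assumptions), but the field order, units, and thousands separators follow the examples.

```python
def status_line(index, total, verdict, score, site, category,
                duration_ms, tokens, tag=None):
    """Format one per-task progress line: index, verdict, score, site,
    category, wall-clock seconds, and token count."""
    line = (f"[{index}/{total}] {verdict} ({score:.1f}) {site} {category} "
            f"{duration_ms / 1000:.1f}s {tokens:,} tokens")
    if tag:  # e.g. an error class such as auth_required
        line += f" [{tag}]"
    return line
```

For example, `status_line(1, 50, "PASS", 1.0, "acehardware.com", "READ", 34200, 15650)` reproduces the first line shown above.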