From web-bench
Run the WebBench browser agent benchmark — main entry point and orchestrator. Triggers: "run benchmark", "run WebBench", "start benchmark", "benchmark browser agent", "web bench", "execute WebBench", "run web-bench". Parses user configuration (category filter, sample size, resume), delegates to load-dataset, execute-task, evaluate-task, and generate-report skills. This is the user-invocable orchestrator that ties the full benchmark pipeline together.
Slash command: /web-bench:run-benchmark
Main entry point for running the WebBench browser agent benchmark. This skill is used in interactive (single-session) mode. For multi-session workflow execution, see the system prompt.
Parse configuration from: $ARGUMENTS
Supported flags:
| Flag | Description | Default |
|---|---|---|
| `--category <CAT>` | Filter tasks by category (READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION) | All categories |
| `--sample <N>` | Random sample of N tasks (deterministic, seed=42) | Full dataset |
| `--resume` | Resume from the existing `web-bench-results.jsonl`, skipping completed task IDs | Fresh run |
| `--report-only` | Skip execution and generate the report from existing results | Full run |
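
As a rough illustration of the flag table, the sketch below parses the four flags with `argparse` and applies the category filter and seeded sample. The function names `parse_config` and `select_tasks` are hypothetical, and filtering before sampling is an assumption; the skill itself does not specify the order or the sampling routine, only that the seed is 42.

```python
import argparse
import random
import shlex

CATEGORIES = ["READ", "CREATE", "UPDATE", "DELETE", "FILE_MANIPULATION"]


def parse_config(arguments):
    """Parse the raw $ARGUMENTS string into a namespace (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="run-benchmark")
    parser.add_argument("--category", choices=CATEGORIES, default=None)
    parser.add_argument("--sample", type=int, default=None)
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--report-only", action="store_true")
    return parser.parse_args(shlex.split(arguments))


def select_tasks(tasks, category=None, sample=None, seed=42):
    """Filter by category, then draw a reproducible sample.

    Assumption: the filter is applied before sampling, and each task is a
    dict with a "category" key.
    """
    if category is not None:
        tasks = [t for t in tasks if t["category"] == category]
    if sample is not None and sample < len(tasks):
        # A dedicated Random instance keeps the draw independent of global state.
        tasks = random.Random(seed).sample(tasks, sample)
    return tasks


if __name__ == "__main__":
    cfg = parse_config("--category READ --sample 50")
    print(cfg.category, cfg.sample, cfg.resume, cfg.report_only)
```

Using a private `random.Random(seed)` rather than `random.seed(42)` means repeated runs pick the same tasks even if other code consumes the global generator.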
Examples:
- `run-benchmark --category READ --sample 50`: 50 random READ tasks
- `run-benchmark --resume`: continue from where the last run stopped
- `run-benchmark --report-only`: aggregate existing results only

When run interactively (not via the workflow loop), this skill executes the full pipeline in a single session:
- Check for existing state files (`web-bench-tasks.jsonl`, `web-bench-results.jsonl`)
- If `--resume` is set and results exist: determine the completed task IDs and skip them
- Use the `load-dataset` skill to download and prepare the dataset

For each task in `web-bench-tasks.jsonl` (skipping completed tasks if resuming):
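- Record the start time with `date +%s%3N` (requires GNU `date`; BSD/macOS `date` does not support `%N`)
- Follow the `execute-task` methodology and have the browser-capable calling context perform the browser automation
- Follow the `evaluate-task` methodology and score the result
- Record the end time with `date +%s%3N` and compute the duration
- Append the result to `web-bench-results.jsonl`: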
{"id": 42, "url": "...", "category": "READ", "task": "...", "score": 1.0, "verdict": "PASS", "reasoning": "...", "error": null, "duration_ms": 34200, "tokens_used": {"input": 12450, "output": 3200}, "timestamp": "2026-03-19T14:30:00Z"}
- Print progress: `[42/2454] PASS (1.0) — acehardware.com — READ — 34.2s`

After all tasks are processed (or if `--report-only` is set):
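- Follow the `generate-report` methodology
- Aggregate `web-bench-results.jsonl` into `web-bench-report.md`

Token usage should be tracked per task. The agent should estimate the tokens consumed during task execution by recording the input and output token counts, which populate the `input` and `output` fields of `tokens_used` in each result record.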
If exact token counts are available from the session metadata, prefer those over estimates.
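
The resume check, timing, token fields, and append step can be sketched together as follows. This is a minimal illustration, not the skill's actual implementation: the helper names are hypothetical, and it assumes each task in `web-bench-tasks.jsonl` carries the same `id`, `url`, `category`, and `task` keys that appear in the result record. Token counts are passed in by the caller, whether exact or estimated.

```python
import json
import os
import time
from datetime import datetime, timezone

RESULTS_PATH = "web-bench-results.jsonl"


def completed_ids(path=RESULTS_PATH):
    """Return the set of task IDs already present in the results file."""
    if not os.path.exists(path):
        return set()
    ids = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                ids.add(json.loads(line)["id"])
    return ids


def now_ms():
    """Milliseconds since the epoch, equivalent to GNU `date +%s%3N`."""
    return time.time_ns() // 1_000_000


def build_record(task, score, verdict, reasoning, start_ms, end_ms,
                 input_tokens, output_tokens, error=None):
    """Assemble one result record in the shape of the example JSON line."""
    return {
        "id": task["id"],
        "url": task["url"],
        "category": task["category"],
        "task": task["task"],
        "score": score,
        "verdict": verdict,
        "reasoning": reasoning,
        "error": error,
        "duration_ms": end_ms - start_ms,
        "tokens_used": {"input": input_tokens, "output": output_tokens},
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def append_record(record, path=RESULTS_PATH):
    """Append one JSON object per line so an interrupted run can be resumed."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Appending one line per task, rather than rewriting the file, is what makes `--resume` cheap: the completed IDs can be recovered by scanning the file once.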
After each task, print a status line:
[1/50] PASS (1.0) acehardware.com READ 34.2s 15,650 tokens
[2/50] FAIL (0.0) airbnb.com CREATE 12.1s 8,200 tokens [auth_required]
[3/50] PARTIAL(0.5) amazon.com READ 45.8s 22,100 tokens
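
One way to produce these status lines from a result record is sketched below. The function name `format_status` is hypothetical. Two details are inferred from the examples rather than stated by the skill: the verdict appears to be padded to seven characters (hence `PARTIAL(0.5)` with no space), and the token figure appears to be input plus output (12,450 + 3,200 = 15,650 in the sample record). Deriving the site name by stripping `www.` from the URL host is also an assumption.

```python
from urllib.parse import urlparse


def format_status(index, total, record):
    """Render one per-task status line from a result record."""
    host = urlparse(record["url"]).hostname or record["url"]
    if host.startswith("www."):
        host = host[4:]
    tokens = record["tokens_used"]["input"] + record["tokens_used"]["output"]
    line = (
        f"[{index}/{total}] {record['verdict']:<7}({record['score']:.1f}) "
        f"{host} {record['category']} "
        f"{record['duration_ms'] / 1000:.1f}s {tokens:,} tokens"
    )
    # The error tag is appended only when the record carries one.
    if record.get("error"):
        line += f" [{record['error']}]"
    return line


if __name__ == "__main__":
    sample = {
        "url": "https://www.amazon.com/",
        "category": "READ",
        "score": 0.5,
        "verdict": "PARTIAL",
        "error": None,
        "duration_ms": 45800,
        "tokens_used": {"input": 20000, "output": 2100},
    }
    print(format_status(3, 50, sample))
    # prints: [3/50] PARTIAL(0.5) amazon.com READ 45.8s 22,100 tokens
```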
npx claudepluginhub lespaceman/athena-workflow-marketplace --plugin web-bench