From web-bench
Evaluate whether a WebBench task was successfully completed using LLM-as-judge scoring. Triggers: "evaluate task", "score task", "judge result", "grade benchmark task". Examines the execution trace, final page state, and extracted data against the original task description. Produces a structured verdict (PASS/PARTIAL/FAIL) with reasoning. Does NOT execute browser actions — use execute-task for that.
Slash command
/web-bench:evaluate-task
Judge whether a completed task execution meets the success criteria defined in the original task description. This is a post-hoc evaluation — no browser interaction, only analysis of the execution trace and captured state.
You receive:
{"id": 42, "url": "...", "category": "READ", "task": "Navigate to the news section and summarize..."}

| Verdict | Score | Criteria |
|---|---|---|
| PASS | 1.0 | Task fully completed. All requested information extracted or all requested actions performed. |
| PARTIAL | 0.5 | Task partially completed. Some but not all requirements met. Meaningful progress was made. |
| FAIL | 0.0 | Task not completed. No meaningful progress, wrong information, or blocked before starting. |
If the execution trace contains blockers, evaluate based on the blocker type:
| Blocker | Verdict | Reasoning |
|---|---|---|
| Login required (no credentials) | FAIL | Infrastructure limitation — task requires auth |
| CAPTCHA | FAIL | Infrastructure limitation — cannot solve programmatically |
| Site down / 404 | FAIL | External dependency — site unavailable |
| Geo-restricted | FAIL | Infrastructure limitation — content not accessible |
| Paywall | FAIL | Infrastructure limitation — paid content |
| Pop-up could not be dismissed | PARTIAL or FAIL | Depends on whether task could proceed |
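
The blocker table above can be read as a simple lookup. The following is a minimal Python sketch of that reading, not the skill's actual implementation. Only `auth_required` is a key that appears in this skill's example output; the other key names, the `task_could_proceed` flag, and the helper `verdict_for_blocker` are hypothetical. Mapping the pop-up case to PARTIAL when the task could proceed is one interpretation of "PARTIAL or FAIL".

```python
# Blockers the table always maps to FAIL.
# Only "auth_required" comes from the skill's example output;
# the remaining key names are assumptions made for this sketch.
ALWAYS_FAIL = {
    "auth_required",   # login required, no credentials
    "captcha",
    "site_down",       # site down / 404
    "geo_restricted",
    "paywall",
}


def verdict_for_blocker(blocker: str, task_could_proceed: bool = False) -> str:
    """Return the verdict the blocker table assigns to a blocker (hypothetical helper)."""
    if blocker in ALWAYS_FAIL:
        return "FAIL"
    if blocker == "popup_not_dismissed":
        # The table leaves this open; here PARTIAL means the task could still proceed.
        return "PARTIAL" if task_could_proceed else "FAIL"
    raise ValueError(f"unknown blocker: {blocker!r}")
```

Raising on unknown blockers keeps an unexpected trace from being silently scored as FAIL.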
Score each dimension, then use the scores to determine the overall verdict:
Navigation (required): Did the agent reach the correct page/section?
Comprehension (required): Did the agent understand what was being asked?
Completeness (required): Did the agent fulfill ALL parts of the task?
Accuracy (for READ tasks): Is the extracted information correct?
Confirmation (for WRITE tasks): Is there evidence the action was performed?
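
Which dimensions apply depends on the task category. The sketch below selects them in Python; the lowercase dimension keys and the function `dimensions_for` are hypothetical names, and treating any category other than READ or WRITE as "required dimensions only" is an assumption.

```python
# Dimensions every task is scored on.
REQUIRED = ("navigation", "comprehension", "completeness")

# Extra dimension per task category, as listed above.
CATEGORY_SPECIFIC = {
    "READ": ("accuracy",),
    "WRITE": ("confirmation",),
}


def dimensions_for(category: str) -> tuple:
    """Return the dimensions to score for a task category (hypothetical helper)."""
    # Unknown categories fall back to the required dimensions only (assumption).
    return REQUIRED + CATEGORY_SPECIFIC.get(category.upper(), ())
```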
Break the task description into discrete, verifiable requirements:
Task: "Navigate to the news section and summarize the headline and key points from the latest science policy update."
Requirements:
1. Navigate to the news section
2. Find the latest science policy update
3. Extract the headline
4. Extract key points
For each requirement, determine if the execution trace shows it was fulfilled:
1. Navigate to news section → DONE (action 2: clicked "News", URL changed to /news)
2. Find latest science policy update → DONE (action 3: found article "New Science Policy...")
3. Extract headline → DONE (extracted: "New Science Policy Framework Announced")
4. Extract key points → NOT DONE (only headline extracted, no key points)
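
The requirement check above reduces to counting fulfilled items. A minimal sketch, assuming that all requirements met maps to PASS, at least one to PARTIAL, and none to FAIL. The skill text does not state this rule explicitly, and a judge may still downgrade a result for wrong information. `verdict_from_requirements` is a hypothetical helper.

```python
def verdict_from_requirements(fulfilled: list) -> tuple:
    """Map per-requirement results (True = DONE) to (verdict, score, requirements_met)."""
    met = sum(1 for done in fulfilled if done)
    total = len(fulfilled)
    if total > 0 and met == total:
        return "PASS", 1.0, met
    if met > 0:
        return "PARTIAL", 0.5, met
    return "FAIL", 0.0, met


# The worked example above: requirements 1-3 DONE, requirement 4 NOT DONE.
example = verdict_from_requirements([True, True, True, False])
```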
Provide clear, structured reasoning:
Verdict: PARTIAL (0.5)
Reasoning: Agent successfully navigated to the news section and identified the correct article.
The headline was extracted accurately. However, the task also requested "key points" from the
article, which were not extracted. 3 of 4 requirements met.
Return a structured evaluation result:
{
"task_id": 42,
"verdict": "PARTIAL",
"score": 0.5,
"reasoning": "Agent navigated correctly and extracted the headline, but missed the key points requirement. 3/4 requirements fulfilled.",
"requirements_met": 3,
"requirements_total": 4,
"blocker": null
}
If a blocker prevented execution:
{
"task_id": 43,
"verdict": "FAIL",
"score": 0.0,
"reasoning": "Task requires account login. No credentials available — infrastructure limitation.",
"requirements_met": 0,
"requirements_total": 3,
"blocker": "auth_required"
}
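
A result in this shape can be assembled and sanity-checked before it is returned. The sketch below uses the field names from the two examples above; the function `build_result` and its validation rules are assumptions for illustration, not part of the skill.

```python
import json

# Score for each verdict, taken from the verdict table.
SCORES = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}


def build_result(task_id, verdict, reasoning, requirements_met,
                 requirements_total, blocker=None):
    """Build the structured evaluation result (hypothetical helper)."""
    if verdict not in SCORES:
        raise ValueError(f"unknown verdict: {verdict!r}")
    if not 0 <= requirements_met <= requirements_total:
        raise ValueError("requirements_met must be between 0 and requirements_total")
    return {
        "task_id": task_id,
        "verdict": verdict,
        "score": SCORES[verdict],  # derived, so verdict and score cannot disagree
        "reasoning": reasoning,
        "requirements_met": requirements_met,
        "requirements_total": requirements_total,
        "blocker": blocker,        # None serializes to JSON null
    }


result = build_result(43, "FAIL", "Task requires account login.", 0, 3,
                      blocker="auth_required")
serialized = json.dumps(result, indent=2)
```

Deriving `score` from `verdict` instead of accepting both as inputs avoids inconsistent pairs such as PARTIAL with 1.0.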
npx claudepluginhub lespaceman/athena-workflow-marketplace --plugin web-bench