From web-bench
Evaluate whether a WebBench task was successfully completed using LLM-as-judge scoring. Triggers: "evaluate task", "score task", "judge result", "grade benchmark task". Examines the execution trace, final page state, and extracted data against the original task description. Produces a structured verdict (PASS/PARTIAL/FAIL) with reasoning. Does NOT execute browser actions — use execute-task for that.
npx claudepluginhub lespaceman/athena-workflow-marketplace --plugin web-benchThis skill is limited to using the following tools:
Judge whether a completed task execution meets the success criteria defined in the original task description. This is a post-hoc evaluation — no browser interaction, only analysis of the execution trace and captured state.
Enforces rigorous task verification via /need-vet: clarify if needed, fully execute and test (code runs, app interactions, edge cases), append verified tag only after checks.
Dispatches subagents role-playing real users to complete site-specific tasks like browsing, subscribing, purchasing, and searching using browser MCP tools, verifying end-to-end flows and reporting issues.
Verifies task completion with checklist: collects evidence like screenshots/videos in REPORT.md, asks report level via user question, initiates reviews (code-security, e2e, ui-ux). Use post-implementation.
Share bugs, ideas, or general feedback.
Judge whether a completed task execution meets the success criteria defined in the original task description. This is a post-hoc evaluation — no browser interaction, only analysis of the execution trace and captured state.
You receive:
{"id": 42, "url": "...", "category": "READ", "task": "Navigate to the news section and summarize..."}| Verdict | Score | Criteria |
|---|---|---|
| PASS | 1.0 | Task fully completed. All requested information extracted or all requested actions performed. |
| PARTIAL | 0.5 | Task partially completed. Some but not all requirements met. Meaningful progress was made. |
| FAIL | 0.0 | Task not completed. No meaningful progress, wrong information, or blocked before starting. |
If the execution trace contains blockers, evaluate based on the blocker type:
| Blocker | Verdict | Reasoning |
|---|---|---|
| Login required (no credentials) | FAIL | Infrastructure limitation — task requires auth |
| CAPTCHA | FAIL | Infrastructure limitation — cannot solve programmatically |
| Site down / 404 | FAIL | External dependency — site unavailable |
| Geo-restricted | FAIL | Infrastructure limitation — content not accessible |
| Paywall | FAIL | Infrastructure limitation — paid content |
| Pop-up could not be dismissed | PARTIAL or FAIL | Depends on whether task could proceed |
Score each dimension and use them to determine the overall verdict:
Navigation (required): Did the agent reach the correct page/section?
Comprehension (required): Did the agent understand what was being asked?
Completeness (required): Did the agent fulfill ALL parts of the task?
Accuracy (for READ tasks): Is the extracted information correct?
Confirmation (for WRITE tasks): Is there evidence the action was performed?
Break the task description into discrete, verifiable requirements:
Task: "Navigate to the news section and summarize the headline and key points from the latest science policy update."
Requirements:
1. Navigate to the news section
2. Find the latest science policy update
3. Extract the headline
4. Extract key points
For each requirement, determine if the execution trace shows it was fulfilled:
1. Navigate to news section → DONE (action 2: clicked "News", URL changed to /news)
2. Find latest science policy update → DONE (action 3: found article "New Science Policy...")
3. Extract headline → DONE (extracted: "New Science Policy Framework Announced")
4. Extract key points → NOT DONE (only headline extracted, no key points)
Provide clear, structured reasoning:
Verdict: PARTIAL (0.5)
Reasoning: Agent successfully navigated to the news section and identified the correct article.
The headline was extracted accurately. However, the task also requested "key points" from the
article, which were not extracted. 3 of 4 requirements met.
Return a structured evaluation result:
{
"task_id": 42,
"verdict": "PARTIAL",
"score": 0.5,
"reasoning": "Agent navigated correctly and extracted the headline, but missed the key points requirement. 3/4 requirements fulfilled.",
"requirements_met": 3,
"requirements_total": 4,
"blocker": null
}
If a blocker prevented execution:
{
"task_id": 43,
"verdict": "FAIL",
"score": 0.0,
"reasoning": "Task requires account login. No credentials available — infrastructure limitation.",
"requirements_met": 0,
"requirements_total": 3,
"blocker": "auth_required"
}