This skill should be used when the user asks about evaluation methodology, scoring rubrics, testing tools, "how does scoring work", "evaluation criteria", "rubric format", "add a rubric", "create testing tool", "evaluate my feature", "run evaluation", "why did evaluation fail", or needs guidance on adversarial evaluation, graded scoring, or hard thresholds in the ADE workflow. Make sure to use this skill whenever the user mentions quality assessment, code review scoring, feature validation, or wants to understand why a feature passed or failed evaluation.
Install with:

```bash
npx claudepluginhub alexsds/ade-workflow --plugin ade
```

This skill uses the workspace's default tool permissions.
Adversarial evaluation methodology for the ADE workflow. Anthropic found that "out of the box, Claude is a poor QA agent" — evaluators identify real issues then self-justify approval anyway. The core failure mode is: find a problem, then explain why it's actually fine. ADE counters this with graded scoring, hard thresholds, and an adversarial stance that structurally prevents self-justification.
The evaluator exists because self-evaluation does not work. When the same agent that writes code also evaluates it, confirmation bias dominates. The agent has invested effort in the implementation and unconsciously defends it. Separating implementation (Generator) from evaluation (Evaluator) eliminates this bias. The Evaluator has no stake in the code — its only job is to find failures.
Adversarial stance: Approach every feature from the perspective that bugs exist. Hunt for failures, not confirmations of correctness. Start each evaluation assuming something is wrong and try to prove it.
Graded scoring: Score each criterion 1-10 with hard thresholds. Any single criterion below threshold means the feature fails. There is no averaging, no weighting override, no "close enough." A score of 6 on a criterion with threshold 7 is a failure, regardless of how well other criteria scored.
Separate evaluator: The agent that writes code must never evaluate it. Self-evaluation bias causes agents to praise their own mediocre work. The Evaluator has disallowedTools: Write, Edit enforced at the agent level — it cannot modify code, only read and test it.
No self-justification: Identifying an issue and then explaining why it is acceptable is the core failure mode Anthropic documented. If an issue exists, the feature fails. Report it clearly with specific fixes. The Generator needs accurate feedback, not encouragement.
When the Generator hands off a feature for review:
1. Check ${CLAUDE_PLUGIN_ROOT}/rubrics/ and .ade/rubrics/. Match the "Applies To" section in each rubric against the feature type and select all matching rubrics — a feature can match multiple (a discovery sketch follows below).
2. Check ${CLAUDE_PLUGIN_ROOT}/testing-tools/ and .ade/testing-tools/. Select tools that match the feature type.

Multiple rubrics can apply to a single feature. For example, a user registration page might match:

- frontend-design.md — because it has UI components
- ux-flows.md — because it is a multi-step user journey
- api-quality.md — because it has backend endpoints
- code-architecture.md — because it involves business logic

All matching rubrics apply. All criteria in all matching rubrics must pass their thresholds.
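One way to see what each rubric claims to cover is to print the "Applies To" sections from both locations. This is a sketch, not one of the plugin's shipped scripts, and the evaluator's actual discovery logic may differ:

```bash
# Print each rubric's "Applies To" section so it can be matched against the
# feature type. Sketch only.
for f in "${CLAUDE_PLUGIN_ROOT}"/rubrics/*.md .ade/rubrics/*.md; do
  [ -f "$f" ] || continue
  echo "== $f"
  awk '/^## Applies To/{keep=1; next} /^## /{keep=0} keep' "$f"
done
```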
Rubrics live in two locations:
- ${CLAUDE_PLUGIN_ROOT}/rubrics/ — plugin defaults shipped with ADE
- .ade/rubrics/ — project-specific custom rubrics added by the user

Project rubrics override plugin defaults with the same filename. If .ade/rubrics/frontend-design.md exists, it replaces the plugin default frontend-design.md.
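A minimal sketch of that resolution order, assuming the project directory is checked first (the actual lookup logic may differ):

```bash
# Resolve a rubric by filename: a project copy in .ade/rubrics/ wins over the
# plugin default. "frontend-design.md" is just an example filename.
name="frontend-design.md"
if [ -f ".ade/rubrics/$name" ]; then
  rubric=".ade/rubrics/$name"                      # project override
else
  rubric="${CLAUDE_PLUGIN_ROOT}/rubrics/$name"     # plugin default
fi
echo "Using rubric: $rubric"
```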
| Rubric | Applies To | Key Criteria |
|---|---|---|
| frontend-design.md | UI components, pages, layouts, visual elements | Design quality (7), originality (7), craft (6), functionality (6) |
| code-architecture.md | Backend logic, data models, module structure | Separation of concerns (7), clarity (7), error handling (6), testability (6) |
| api-quality.md | REST/GraphQL/WebSocket endpoints | API design (7), response quality (7), validation (6), security (6) |
| ux-flows.md | Multi-step workflows, user journeys, onboarding | Flow coherence (7), edge cases (7), information architecture (6), feedback (6) |
Numbers in parentheses are the threshold scores. Criteria with threshold 7 are the most demanding — these represent the qualities that AI agents tend to struggle with (design originality, architectural clarity) and where the evaluator must be strictest.
Each rubric is a .md file with three required sections:
- `## Applies To` — What types of features this rubric evaluates. Keep it specific so the evaluator can match correctly.
- `## Criteria` — Each criterion as an H3 heading with name, weight (high/medium/low), threshold (N/10), and a description of what to evaluate. Include concrete examples of what each score looks like.
- `## Scoring Guide` — What each score range (9-10, 7-8, 5-6, 3-4, 1-2) means for this rubric.

Example criterion structure:
```markdown
### Design Quality (weight: high)
Threshold: 7/10

The UI should feel like a coherent whole with a unified mood and identity —
not separate components strung together. Colors, typography, layout, and
imagery should work together toward a single vision.

Evaluate: Does the page feel like one designer built it? Is there a
consistent visual language? Do all elements contribute to the same mood?
```
To add a custom rubric, create a .md file in .ade/rubrics/ following the format above. The evaluator automatically discovers and uses rubrics from both locations. Validate the format before use:
```bash
bash ${CLAUDE_PLUGIN_ROOT}/scripts/validate-rubric.sh .ade/rubrics/my-rubric.md
```
The validator checks for required sections (## Applies To, ## Criteria, ## Scoring Guide) and verifies that criteria have Threshold: N/10 format.
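Putting the format and the validator together, a sketch of adding and checking a custom rubric. The rubric name and its criteria below are invented for illustration, not content shipped with ADE:

```bash
# Create a hypothetical project rubric, then validate its structure.
mkdir -p .ade/rubrics
cat > .ade/rubrics/cli-ergonomics.md <<'EOF'
## Applies To
Command-line interfaces, flags, and interactive prompts.

## Criteria

### Flag Consistency (weight: high)
Threshold: 7/10
Flags should follow one naming convention and behave predictably across commands.

## Scoring Guide
9-10 exceptional, 7-8 solid, 5-6 acceptable, 3-4 below standards, 1-2 failing.
EOF

bash ${CLAUDE_PLUGIN_ROOT}/scripts/validate-rubric.sh .ade/rubrics/cli-ergonomics.md
```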
Testing tools live in two locations:
- ${CLAUDE_PLUGIN_ROOT}/testing-tools/ — plugin defaults
- .ade/testing-tools/ — project-specific custom tools

| Tool | Applies To | Method |
|---|---|---|
| playwright.md | UI, web pages, user interactions | Playwright MCP server, falls back to curl |
| api-tester.md | REST/GraphQL endpoints | curl with jq for response verification |
| unit-test-runner.md | All application code | Auto-detects test framework (jest, vitest, pytest, etc.) |
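The table describes unit-test-runner's framework auto-detection only at a high level; a hedged sketch of what such detection might look like (the heuristics below are assumptions, not the tool's actual logic):

```bash
# Guess the test framework from project files and run it.
if [ -f package.json ] && grep -q '"vitest"' package.json; then
  npx vitest run
elif [ -f package.json ] && grep -q '"jest"' package.json; then
  npx jest
elif [ -f pytest.ini ] || grep -q '\[tool.pytest' pyproject.toml 2>/dev/null; then
  pytest
else
  echo "No recognized test framework found" >&2
fi
```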
Playwright MCP: The plugin ships a .mcp.json that configures Playwright MCP automatically. If the MCP server fails to start (e.g., Node.js not installed), the evaluator falls back to curl and Bash-based testing. No manual setup required for most environments.
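Whether through api-tester.md or the curl fallback, endpoint checks reduce to the same pattern. A minimal sketch with a hypothetical endpoint and response fields:

```bash
# Hit an endpoint, check the status code, and verify expected fields with jq.
status=$(curl -s -o /tmp/resp.json -w '%{http_code}' http://localhost:3000/api/users/42)
[ "$status" = "200" ] || echo "FAIL: expected 200, got $status"
jq -e '.email and .createdAt' /tmp/resp.json >/dev/null || echo "FAIL: missing expected fields"
```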
Each tool is a .md file with required sections:
- `## Applies To` — What types of features this tool tests
- `## Setup` — Required configuration (MCP servers, dependencies, prerequisites)
- `## How To Use` — Step-by-step instructions for testing with this tool
- `## What To Check` — Specific verification checklist for the evaluator

To add a custom testing tool, create a .md file in .ade/testing-tools/. Validate with:
```bash
bash ${CLAUDE_PLUGIN_ROOT}/scripts/validate-testing-tool.sh .ade/testing-tools/my-tool.md
```
| Score | Meaning |
|---|---|
| 9-10 | Exceptional — exceeds expectations, distinctive and polished |
| 7-8 | Solid — meets standards, cohesive and deliberate |
| 5-6 | Acceptable — functional but generic, rough edges, needs improvement |
| 3-4 | Below standards — significant issues, inconsistent, confusing |
| 1-2 | Failing — broken, incoherent, fundamental problems |
A criterion with threshold 7 means scores 1-6 fail. Any single criterion below threshold causes the entire feature to fail. There is no averaging, no weighting override, no "overall pass."
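A minimal sketch of that rule with hypothetical criteria and scores; one score of 6 against a threshold of 7 fails the whole feature even though every other criterion passes:

```bash
# Hard-threshold check: any criterion below its threshold fails the feature.
# Criterion names and scores are made up for illustration.
fail=0
while read -r criterion score threshold; do
  if [ "$score" -lt "$threshold" ]; then
    echo "FAIL  $criterion: $score < $threshold"
    fail=1
  fi
done <<'EOF'
design_quality 8 7
originality 6 7
craft 7 6
EOF
if [ "$fail" -eq 0 ]; then echo "APPROVED"; else echo "FAILED"; fi
```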
Score inflation: The natural tendency is to score generously. A feature that "basically works" is a 5-6, not a 7-8. Reserve 7+ for work that is genuinely solid and deliberate.
AI slop penalty: Claude naturally defaults to purple/white gradients, generic card grids, and template-looking UIs. The originality criterion exists specifically to catch this. Default-looking work scores 4-5 on originality, not 7.
Threshold confusion: A score of 6 on a threshold-7 criterion is a failure. It does not matter if the score is "close" — close is not passing. Report the failure with specific fixes.
Justification trap: After finding issues, resist the urge to explain why they are acceptable. This is the exact failure mode Anthropic documented. Issues found = feature fails. Report them.
On pass (all criteria meet thresholds):
```
Feature: [name]
Status: APPROVED
Rubric: [rubric name]

| Criterion | Score | Threshold | Status |
|----------------|-------|-----------|--------|
| [criterion] | X | Y | PASS |

All criteria met. Moving to next feature.
```
On fail (any criterion below threshold):
```
Feature: [name]
Status: FAILED
Rubric: [rubric name]

| Criterion | Score | Threshold | Status |
|----------------|-------|-----------|--------|
| [criterion] | X | Y | FAIL |

Failures:
- [criterion]: [specific description of what is wrong]

What to fix:
1. [specific actionable fix with concrete guidance]
2. [another specific fix]
```
The report must be specific and actionable. "Design needs improvement" is not useful feedback. "The header uses a generic purple-to-white gradient and the card layout is the default shadcn template — replace with a distinctive color palette and a layout that reflects the app's identity" is useful feedback.
For detailed scoring examples and methodology, consult:
- references/scoring-guide.md — Detailed scoring examples for each criterion level (what a 9 vs. 7 vs. 5 vs. 3 looks like), evaluation philosophy, and weight guidelines explaining why certain criteria carry high weight.