Evaluate the output quality of an agent or pipeline run. Use this skill when asked to "review this agent output", "score this result", "evaluate agent quality", or "suggest improvements" to an agent's response or pipeline output.
npx claudepluginhub ats-kinoshita-iso/agent-workshop --plugin agent-patterns
This skill uses the workspace's default tool permissions.
Evaluate the quality of an agent or pipeline output against defined criteria. Produce a structured review with a numeric score and specific improvement suggestions.
Before scoring, identify which dimensions apply to this output:
| Dimension | Description | Applicable? |
|---|---|---|
| Correctness | Output matches the expected answer or solves the problem | Always |
| Completeness | All required sub-tasks or fields are addressed | Always |
| Format compliance | Output matches the required format (JSON, markdown, etc.) | If format specified |
| Conciseness | No unnecessary verbosity or repetition | Always |
| Safety | No harmful, biased, or policy-violating content | Always |
| Tool use quality | Tools called correctly with valid arguments | If tools were used |
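The applicability column above can be sketched as a small selection function. This is a minimal illustration only: the dimension names come from the table, while the function name `applicable_dimensions` and the two boolean flags are hypothetical inputs introduced for this sketch.

```python
from typing import List

# Dimensions the table marks as "Always" applicable, in table order.
ALWAYS_APPLICABLE = ["Correctness", "Completeness", "Conciseness", "Safety"]


def applicable_dimensions(format_specified: bool, tools_used: bool) -> List[str]:
    """Return the dimensions to score for one output.

    Both flags are assumptions of this sketch: the reviewer decides
    whether a format was specified and whether tools were used.
    """
    dims = list(ALWAYS_APPLICABLE)
    if format_specified:
        dims.append("Format compliance")  # applies only if a format was specified
    if tools_used:
        dims.append("Tool use quality")  # applies only if tools were used
    return dims
```

For a plain-text answer produced without tools, `applicable_dimensions(False, False)` returns only the four always-applicable dimensions.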
Rate each applicable dimension on a scale of 1-5:
1 = Failing (major problems)
2 = Poor (significant issues)
3 = Acceptable (meets minimum bar)
4 = Good (minor issues only)
5 = Excellent (no issues)
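One way to keep ratings on this scale is to validate them before they are recorded. The sketch below is an illustration rather than part of the skill itself: the labels are copied from the scale above, and the names `SCALE` and `describe_score` are hypothetical.

```python
# Labels copied from the 1-5 scale above.
SCALE = {
    1: "Failing (major problems)",
    2: "Poor (significant issues)",
    3: "Acceptable (meets minimum bar)",
    4: "Good (minor issues only)",
    5: "Excellent (no issues)",
}


def describe_score(score: int) -> str:
    """Return the scale line for a score, rejecting anything off the scale."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in SCALE:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return f"{score} = {SCALE[score]}"
```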
For each dimension scored below 4, list concrete issues:
Format:
Issue: <dimension>
Found: "<exact quote from output>"
Problem: <why this is wrong>
Fix: <specific improvement>
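The issue template above could be produced programmatically along these lines. This is a sketch under assumptions: the `Issue` dataclass, `render_issue`, and `dimensions_needing_issues` are hypothetical names, and only the four field labels and the "below 4" threshold come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Issue:
    dimension: str  # the dimension the issue belongs to
    found: str      # exact quote from the output
    problem: str    # why this is wrong
    fix: str        # specific improvement


def render_issue(issue: Issue) -> str:
    """Render one issue in the four-line format shown above."""
    return "\n".join([
        f"Issue: {issue.dimension}",
        f'Found: "{issue.found}"',
        f"Problem: {issue.problem}",
        f"Fix: {issue.fix}",
    ])


def dimensions_needing_issues(scores: Dict[str, int]) -> List[str]:
    """Return the dimensions scored below 4, which require concrete issues."""
    return [dim for dim, score in scores.items() if score < 4]
```

A dimension scored exactly 4 is not returned, since the rule applies only to scores below 4.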
Calculate the overall score as a weighted average of the scores for the applicable dimensions, rounded to one decimal place so that it falls within exactly one band. Then apply the verdict for that band:
| Score | Verdict |
|---|---|
| 4.5 - 5.0 | EXCELLENT -- ready to use |
| 3.5 - 4.4 | GOOD -- minor improvements recommended |
| 2.5 - 3.4 | ACCEPTABLE -- improvements needed before production use |
| 1.5 - 2.4 | POOR -- significant rework required |
| 1.0 - 1.4 | FAILING -- output should be discarded and regenerated |
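The score calculation and verdict lookup can be sketched as follows. The text does not say which weights to use, so this sketch assumes equal weights when none are supplied. It also rounds to one decimal place so that a score cannot fall between two bands. The names `overall_score`, `verdict`, and `BANDS` are hypothetical; the band boundaries and labels are taken from the table above.

```python
from typing import Dict, Optional

# (lower bound, verdict) pairs from the table, highest band first.
BANDS = [
    (4.5, "EXCELLENT"),
    (3.5, "GOOD"),
    (2.5, "ACCEPTABLE"),
    (1.5, "POOR"),
    (1.0, "FAILING"),
]


def overall_score(scores: Dict[str, int],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of dimension scores, rounded to one decimal place."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    if weights is None:
        # Assumption of this sketch: equal weights when none are given.
        weights = {dim: 1.0 for dim in scores}
    total_weight = sum(weights[dim] for dim in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    raw = sum(scores[dim] * weights[dim] for dim in scores) / total_weight
    return round(raw, 1)


def verdict(score: float) -> str:
    """Map an overall score to the verdict of the band it falls in."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise ValueError(f"score {score} is below the 1.0 minimum")
```

With scores of 5, 4, 4, and 5 and equal weights, the average is 4.5, which lands in the EXCELLENT band.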
List 1-3 actionable improvements in priority order:
For each suggestion, include: