Launch a sub-agent judge to evaluate results produced in the current conversation
Launches a sub-agent to evaluate work quality with evidence-based scoring and actionable feedback.
/plugin marketplace add NeoLabHQ/context-engineering-kit
/plugin install sadd@context-engineering-kit
[evaluation-focus]
The evaluation is report-only: findings are presented without automatic changes.
</context>
Before launching the judge, identify what needs evaluation:
Identify the work to evaluate:
Extract evaluation context:
Provide scope for user:
Evaluation Scope:
- Original request: [summary]
- Work produced: [description]
- Files involved: [list]
- Evaluation focus: [from arguments or "general quality"]
Launching judge sub-agent...
IMPORTANT: Pass only the extracted context to the judge - not the entire conversation. This prevents context pollution and enables focused assessment.
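As a rough illustration of keeping the judge's input narrow, the extracted context can be held in one small structure and rendered into the scope summary shown above. This is a minimal sketch, not part of the plugin: `EvaluationContext` and `render_scope` are hypothetical names, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationContext:
    """Hypothetical container for the extracted context handed to the judge."""
    original_request: str
    work_produced: str
    files_involved: list[str] = field(default_factory=list)
    # Default used when no evaluation focus is passed as an argument.
    evaluation_focus: str = "general quality"


def render_scope(ctx: EvaluationContext) -> str:
    """Render the scope summary shown to the user before the judge is launched."""
    files = ", ".join(ctx.files_involved) if ctx.files_involved else "(none)"
    return "\n".join([
        "Evaluation Scope:",
        f"- Original request: {ctx.original_request}",
        f"- Work produced: {ctx.work_produced}",
        f"- Files involved: {files}",
        f"- Evaluation focus: {ctx.evaluation_focus}",
    ])


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    example = EvaluationContext(
        original_request="Add retry logic to the HTTP client",
        work_produced="Exponential backoff wrapper plus unit tests",
        files_involved=["http_client.py", "test_http_client.py"],
    )
    print(render_scope(example))
```

Because the judge only ever sees the fields of this structure, anything not copied into it (side discussions, abandoned attempts) stays out of the evaluation.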
Use the Task tool to spawn a single judge agent with the prompt and context below. Adjust the criteria, rubric, and weights to match the solution type and complexity; the default prompt is, for example:
Judge Agent Prompt:
You are an Expert Judge evaluating the quality of work produced in a development session.
## Work Under Evaluation
[ORIGINAL TASK]
{paste the original request/task}
[/ORIGINAL TASK]
[WORK OUTPUT]
{summary of what was created/modified}
[/WORK OUTPUT]
[FILES INVOLVED]
{list of files with brief descriptions}
[/FILES INVOLVED]
[EVALUATION FOCUS]
{from arguments, or "General quality assessment"}
[/EVALUATION FOCUS]
Read ${CLAUDE_PLUGIN_ROOT}/tasks/judge.md and execute.
## Evaluation Criteria
### Criterion 1: Instruction Following (weight: 0.30)
Does the work follow all explicit instructions and requirements?
**Guiding Questions**:
- Does the output fulfill the original request?
- Were all explicit requirements addressed?
- Are there gaps or unexpected deviations?
| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | All instructions followed precisely, no deviations |
| Good | 4 | Minor deviations that do not affect outcome |
| Adequate | 3 | Major instructions followed, minor ones missed |
| Poor | 2 | Significant instructions ignored |
| Failed | 1 | Fundamentally misunderstood the task |
### Criterion 2: Output Completeness (weight: 0.25)
Are all requested aspects thoroughly covered?
**Guiding Questions**:
- Are all components of the request addressed?
- Is there appropriate depth for each component?
- Are there obvious gaps or missing pieces?
| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | All aspects thoroughly covered with appropriate depth |
| Good | 4 | Most aspects covered with minor gaps |
| Adequate | 3 | Key aspects covered, some notable gaps |
| Poor | 2 | Major aspects missing |
| Failed | 1 | Fundamental aspects not addressed |
### Criterion 3: Solution Quality (weight: 0.25)
Is the approach appropriate and well-implemented?
**Guiding Questions**:
- Is the chosen approach sound and appropriate?
- Does the implementation follow best practices?
- Are there correctness issues or errors?
| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Optimal approach, clean implementation, best practices followed |
| Good | 4 | Good approach with minor issues |
| Adequate | 3 | Reasonable approach, some quality concerns |
| Poor | 2 | Problematic approach or significant quality issues |
| Failed | 1 | Fundamentally flawed approach |
### Criterion 4: Reasoning Quality (weight: 0.10)
Is the reasoning clear, logical, and well-documented?
**Guiding Questions**:
- Is the decision-making transparent?
- Were appropriate methods/tools used?
- Can someone understand why this approach was taken?
| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Clear, logical reasoning throughout |
| Good | 4 | Generally sound reasoning with minor gaps |
| Adequate | 3 | Basic reasoning present |
| Poor | 2 | Reasoning unclear or flawed |
| Failed | 1 | No apparent reasoning |
### Criterion 5: Response Coherence (weight: 0.10)
Is the output well-structured and easy to understand?
**Guiding Questions**:
- Is the output organized logically?
- Can someone unfamiliar with the task understand it?
- Is it professionally presented?
| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Well-structured, clear, professional |
| Good | 4 | Generally coherent with minor issues |
| Adequate | 3 | Understandable but could be clearer |
| Poor | 2 | Difficult to follow |
| Failed | 1 | Incoherent or confusing |
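The five weights above sum to 1.00, which suggests the per-criterion scores are meant to be combined into a single weighted total. The text does not spell out the formula, so the sketch below assumes a plain weighted sum rounded to two decimals. The dictionary keys and the `weighted_score` function are hypothetical names, and the sample scores are invented.

```python
import math

# Default weights from the rubric; the keys are hypothetical identifiers.
WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "solution_quality": 0.25,
    "reasoning_quality": 0.10,
    "response_coherence": 0.10,
}


def weighted_score(scores: dict[str, int], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion scores (1-5) into one weighted total (assumed formula)."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("criterion weights must sum to 1.0")
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the rubric criteria")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name}: score {score} is outside the 1-5 scale")
    total = sum(scores[name] * weights[name] for name in weights)
    return round(total, 2)


if __name__ == "__main__":
    # Invented scores, for illustration only.
    sample = {
        "instruction_following": 4,
        "output_completeness": 5,
        "solution_quality": 4,
        "reasoning_quality": 3,
        "response_coherence": 4,
    }
    print(weighted_score(sample))
```

If the weights are adjusted for a particular solution type, the sum check catches a rubric that no longer adds up to 1.0 before any score is reported.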
After receiving the judge's evaluation:
Validate the evaluation:
If validation fails:
Present results to user:
| Score Range | Verdict | Interpretation | Recommendation |
|---|---|---|---|
| 4.50 - 5.00 | EXCELLENT | Exceptional quality, exceeds expectations | Ready as-is |
| 4.00 - 4.49 | GOOD | Solid quality, meets professional standards | Minor improvements optional |
| 3.50 - 3.99 | ACCEPTABLE | Adequate but has room for improvement | Improvements recommended |
| 3.00 - 3.49 | NEEDS IMPROVEMENT | Below standard, requires work | Address issues before use |
| 1.00 - 2.99 | INSUFFICIENT | Does not meet basic requirements | Significant rework needed |
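The table above can be read as a lookup from an overall score to a verdict and recommendation. The following is a minimal sketch of that lookup; `BANDS` and `verdict_for` are hypothetical names, and rounding to two decimals before comparing is an assumption made so that every score falls into exactly one band.

```python
# (lower bound, verdict, recommendation), highest band first, copied from the table.
BANDS = [
    (4.50, "EXCELLENT", "Ready as-is"),
    (4.00, "GOOD", "Minor improvements optional"),
    (3.50, "ACCEPTABLE", "Improvements recommended"),
    (3.00, "NEEDS IMPROVEMENT", "Address issues before use"),
    (1.00, "INSUFFICIENT", "Significant rework needed"),
]


def verdict_for(score: float) -> tuple[str, str]:
    """Map an overall score to its (verdict, recommendation) pair."""
    # Assumption: compare at two-decimal precision, matching the table's ranges.
    rounded = round(score, 2)
    if not 1.00 <= rounded <= 5.00:
        raise ValueError(f"score {score} is outside the 1.00-5.00 range")
    for lower_bound, verdict, recommendation in BANDS:
        if rounded >= lower_bound:
            return verdict, recommendation
    raise AssertionError("unreachable: the lowest band starts at 1.00")


if __name__ == "__main__":
    for value in (4.75, 4.15, 3.0, 2.99):
        print(value, verdict_for(value))
```

Checking bands from the top down means only the lower bound of each range has to be stored, so the boundaries cannot drift out of sync with each other.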