Execute tasks through competitive multi-agent generation, multi-judge evaluation, and evidence-based synthesis
Executes competitive multi-agent generation with adaptive strategy selection to produce superior results through parallel implementation, multi-judge evaluation, and evidence-based synthesis.
/plugin marketplace add NeoLabHQ/context-engineering-kit
/plugin install sadd@context-engineering-kit
Arguments: Task description and optional output path/criteria
CRITICAL: You are not an implementation agent or a judge; you shouldn't read files provided as context for sub-agents or tasks. You shouldn't read reports, and you shouldn't overwhelm your context with unnecessary information. You MUST follow the process step by step. Any deviation will be considered a failure and you will be killed!
This command implements a four-phase adaptive competitive orchestration pattern:
Phase 1: Competitive Generation with Self-Critique
┌─ Agent 1 → Draft → Critique → Revise → Solution A ─┐
Task ───┼─ Agent 2 → Draft → Critique → Revise → Solution B ─┼─┐
└─ Agent 3 → Draft → Critique → Revise → Solution C ─┘ │
│
Phase 2: Multi-Judge Evaluation with Verification │
┌─ Judge 1 → Evaluate → Verify → Revise → Report A ─┐ │
├─ Judge 2 → Evaluate → Verify → Revise → Report B ─┼──┤
└─ Judge 3 → Evaluate → Verify → Revise → Report C ─┘ │
│
Phase 2.5: Adaptive Strategy Selection │
Analyze Consensus ──────────────────────────────────────┤
├─ Clear Winner? → SELECT_AND_POLISH │
├─ All Flawed (<3.0)? → REDESIGN (return Phase 1)│
└─ Split Decision? → FULL_SYNTHESIS │
│ │
Phase 3: Evidence-Based Synthesis │ │
(Only if FULL_SYNTHESIS) │ │
Synthesizer ─────────────────────┴───────────────────────┴─→ Final Solution
Before starting, ensure the reports directory exists:
mkdir -p .specs/reports
Report naming convention: .specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md
Where:
- {solution-name} - Derived from the output path (e.g., users-api from output specs/api/users.md)
- {YYYY-MM-DD} - Current date
- [1|2|3] - Judge number

Note: Solutions remain in their specified output locations; only evaluation reports go to .specs/reports/
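To make the convention concrete, here is a minimal TypeScript sketch of how a report path could be derived; the helper name and the rule for building {solution-name} from the output path are illustrative assumptions, not part of the plugin.

```typescript
// Hypothetical helper, for illustration only: derive a judge's report path
// from the solution's output path, the current date, and the judge number.
// Assumption: {solution-name} is taken from the output file's stem (the real
// rule may also fold in the parent directory, e.g. "users-api").
function reportPath(outputPath: string, judge: 1 | 2 | 3, date: Date = new Date()): string {
  const stem = outputPath.split("/").pop()!.replace(/\.[^.]+$/, ""); // "users.md" -> "users"
  const day = date.toISOString().slice(0, 10);                       // "2025-01-15"
  return `.specs/reports/${stem}-${day}.${judge}.md`;
}

// e.g. on 2025-01-15: reportPath("specs/caching.md", 2) -> ".specs/reports/caching-2025-01-15.2.md"
```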
Launch 3 independent agents in parallel (recommended: Opus for quality):
Each agent writes its solution to a separate file.
Solution naming convention: {solution-file}.[a|b|c].[ext]
Where:
- {solution-file} - Derived from the task (e.g., "create users.ts" results in users as the solution file)
- [a|b|c] - Unique identifier per sub-agent
- [ext] - File extension (e.g., md, ts, etc.)

Key principle: Diversity through independence - agents explore different approaches.
CRITICAL: You MUST provide the filename with the [a|b|c] identifier to agents and judges!!! Missing it will result in your TERMINATION immediately!
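As a rough sketch (the helper name is hypothetical), the three candidate filenames can be derived from the intended output path by inserting the identifier before the extension:

```typescript
// Hypothetical helper: derive the three candidate solution paths by inserting
// the [a|b|c] identifier before the file extension. Assumes the path has an extension.
function candidatePaths(outputPath: string): string[] {
  const dot = outputPath.lastIndexOf(".");
  const base = outputPath.slice(0, dot); // "specs/api/users"
  const ext = outputPath.slice(dot + 1); // "md"
  return ["a", "b", "c"].map((id) => `${base}.${id}.${ext}`);
}

// candidatePaths("specs/api/users.md")
//   -> ["specs/api/users.a.md", "specs/api/users.b.md", "specs/api/users.c.md"]
```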
Prompt template for generators:
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<output>
{define the expected output following the pattern {solution-file}.[a|b|c].[ext], based on the task description and context. Each [a|b|c] is a unique identifier per sub-agent. You MUST provide the filename with the identifier!!!}
</output>
Instructions:
Let's approach this systematically to produce the best possible solution.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Consider multiple approaches - what are the different ways to solve this?
3. Think through the tradeoffs step by step and choose the approach you believe is best
4. Implement it completely
5. Generate 5 verification questions about critical aspects
6. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
7. Revise solution:
- Fix identified issues
8. Explain what was changed and why
Launch 3 independent judges in parallel (recommended: Opus for rigor):
Each judge writes its report to .specs/reports/{solution-name}-{date}.[1|2|3].md
Key principle: Multiple independent evaluations reduce bias and catch different issues.
Prompt template for judges:
You are evaluating {number} solutions to this task:
<task>
{task_description}
</task>
<solutions>
{list of paths to all candidate solutions}
</solutions>
<output>
Write full report to: {.specs/reports/{solution-name}-{date}.[1|2|3].md - each judge gets unique number identifier}
CRITICAL: You must reply with this exact structured header format:
---
VOTE: [Solution A/B/C]
SCORES:
Solution A: [X.X]/5.0
Solution B: [X.X]/5.0
Solution C: [X.X]/5.0
CRITERIA:
- {criterion_1}: [X.X]/5.0
- {criterion_2}: [X.X]/5.0
...
---
[Summary of your evaluation]
</output>
Evaluation criteria (with weights):
1. {criterion_1} ({weight_1}%)
2. {criterion_2} ({weight_2}%)
...
Read ${CLAUDE_PLUGIN_ROOT}/tasks/judge.md for the evaluation methodology and execute it using the criteria above.
Instructions:
1. For each criterion, analyze ALL solutions
2. Write a combined report:
   - Provide specific evidence (quote exact text) for your assessments
   - Compare strengths and weaknesses
   - Score each solution on each criterion
   - Calculate weighted total scores
3. Generate 5 verification questions about your evaluation.
4. Answer verification questions:
- Re-examine solutions for each question
- Find counter-evidence if it exists
- Check for systematic bias (length, confidence, etc.)
5. Revise your evaluation and update it accordingly.
6. Reply with structured output:
- VOTE: Which solution you recommend
- SCORES: Weighted total score for each solution (0.0-5.0)
CRITICAL: Base your evaluation on evidence, not impressions. Quote specific text.
Final checklist:
- [ ] Generated and answered all verification questions
- [ ] Found and corrected all potential issues
- [ ] Checked for known biases (length, verbosity, confidence)
- [ ] Confident in revised evaluation
- [ ] Structured header with VOTE and SCORES at top of report
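For reference (this is not part of the judge prompt), the weighted total a judge reports is a simple weighted sum of per-criterion scores; the sketch below assumes the weights are percentages summing to 100.

```typescript
// Illustrative sketch: weighted total from per-criterion scores.
// Assumes weights are percentages (summing to 100) and scores are 0.0-5.0.
type CriterionScore = { weight: number; score: number };

function weightedTotal(criteria: CriterionScore[]): number {
  const total = criteria.reduce((sum, c) => sum + (c.weight / 100) * c.score, 0);
  return Math.round(total * 10) / 10; // report to one decimal place
}

// e.g. weightedTotal([{ weight: 40, score: 4.5 }, { weight: 30, score: 3.0 }, { weight: 30, score: 4.0 }]) -> 3.9
```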
The orchestrator (not a subagent) analyzes judge outputs to determine the optimal strategy.
Step 1: Parse structured headers from judge replies
Parse each judge's reply. CRITICAL: Do not read the report files themselves; they can overflow your context.
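A minimal sketch of such parsing, assuming the exact header format defined in Phase 2; the interface and function names are hypothetical.

```typescript
// Sketch: extract VOTE and SCORES from a judge's structured reply header.
// Only the reply text is parsed; the report files are never read.
interface JudgeVerdict {
  vote: "A" | "B" | "C";
  scores: Record<"A" | "B" | "C", number>;
}

function parseVerdict(reply: string): JudgeVerdict | null {
  const vote = reply.match(/^VOTE:\s*Solution\s+([ABC])/m)?.[1];
  if (!vote) return null;
  const scores = {} as JudgeVerdict["scores"];
  for (const [, sol, score] of reply.matchAll(/^Solution\s+([ABC]):\s*([\d.]+)\/5\.0/gm)) {
    scores[sol as "A" | "B" | "C"] = parseFloat(score);
  }
  return { vote: vote as JudgeVerdict["vote"], scores };
}
```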
Step 2: Check for unanimous winner
Compare all three VOTE values. If all three judges voted for the same solution → SELECT_AND_POLISH.
Step 3: Check if all solutions are fundamentally flawed
If no unanimous vote, calculate average scores:
If (avg_A < 3.0) AND (avg_B < 3.0) AND (avg_C < 3.0) → REDESIGN (return to Phase 1).
Step 4: Default to full synthesis
If none of the above conditions are met → FULL_SYNTHESIS (proceed to Phase 3); the decision logic is sketched below.
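Putting the steps together, a sketch of the decision logic, reusing the hypothetical JudgeVerdict shape from the parsing sketch above; the thresholds mirror the text.

```typescript
// Sketch of the Phase 2.5 decision: unanimous winner -> SELECT_AND_POLISH,
// all average scores below 3.0 -> REDESIGN, otherwise FULL_SYNTHESIS.
type Strategy = "SELECT_AND_POLISH" | "REDESIGN" | "FULL_SYNTHESIS";

function selectStrategy(verdicts: JudgeVerdict[]): { strategy: Strategy; winner?: "A" | "B" | "C" } {
  const votes = verdicts.map((v) => v.vote);
  if (votes.every((v) => v === votes[0])) {
    return { strategy: "SELECT_AND_POLISH", winner: votes[0] };
  }
  const avg = (sol: "A" | "B" | "C") =>
    verdicts.reduce((sum, v) => sum + v.scores[sol], 0) / verdicts.length;
  if ((["A", "B", "C"] as const).every((sol) => avg(sol) < 3.0)) {
    return { strategy: "REDESIGN" };
  }
  return { strategy: "FULL_SYNTHESIS" };
}
```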
When: Clear winner (unanimous votes)
Process:
Benefits:
Prompt template:
You are polishing the winning solution based on judge feedback.
<task>
{task_description}
</task>
<winning_solution>
{path_to_winning_solution}
Score: {winning_score}/5.0
Judge consensus: {why_it_won}
</winning_solution>
<runner_up_solutions>
{list of paths to all runner-up solutions}
</runner_up_solutions>
<judge_feedback>
{list of paths to all evaluation reports}
</judge_feedback>
<output>
{final_solution_path}
</output>
Instructions:
Let's work through this step by step to polish the winning solution effectively.
1. Take the winning solution as your base (do NOT rewrite it)
2. First, carefully review all judge feedback to understand what needs improvement
3. Apply improvements based on judge feedback:
- Fix identified weaknesses
- Add missing elements judges noted
4. Next, examine the runner-up solutions for standout elements
5. Cherry-pick 1-2 specific elements from runners-up if judges praised them
6. Document changes made:
- What was changed and why
- What was added from other solutions
CRITICAL: Preserve the winning solution's core approach. Make targeted improvements only.
When: All solutions scored <3.0/5.0 (fundamental issues across the board)
Process:
Prompt template for new implementation:
You are analyzing why all solutions failed to meet quality standards, and implementing a new solution based on that analysis.
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<failed_solutions>
{list of paths to all candidate solutions}
</failed_solutions>
<evaluation_reports>
{list of paths to all evaluation reports with low scores}
</evaluation_reports>
Instructions:
Let's break this down systematically to understand what went wrong and how to design a new solution based on those lessons.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Read through each solution and its evaluation report
3. For each solution, think step by step about:
- What was the core approach?
- What specific issues did judges identify?
- Why did this approach fail to meet the quality threshold?
4. Identify common failure patterns across all solutions:
- Are there shared misconceptions?
- Are there missing requirements that all solutions overlooked?
- Are there fundamental constraints that weren't considered?
5. Extract lessons learned:
- What approaches should be avoided?
- What constraints must be addressed?
6. Generate improved guidance for the next iteration:
- New constraints to add
- Specific approaches to try - what are the different ways to solve this?
- Key requirements to emphasize
7. Think through the tradeoffs step by step and choose the approach you believe is best
8. Implement it completely
9. Generate 5 verification questions about critical aspects
10. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
11. Revise solution:
- Fix identified issues
12. Explain what was changed and why
When: No clear winner AND solutions have merit (scores ≥3.0)
Process: Proceed to Phase 3 (Evidence-Based Synthesis)
Only executed when Strategy 3 (FULL_SYNTHESIS) selected in Phase 2.5
Launch 1 synthesis agent (recommended: Opus for quality):
Key principle: Evidence-based synthesis leverages collective intelligence.
Prompt template for synthesizer:
You are synthesizing the best solution from competitive implementations and evaluations.
<task>
{task_description}
</task>
<solutions>
{list of paths to all candidate solutions}
</solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
</evaluation_reports>
<output>
{define the expected output following the pattern solution.md, based on the task description and context. The result should be a complete solution to the task.}
</output>
Instructions:
Let's think through this synthesis step by step to create the best possible combined solution.
1. First, read all solutions and evaluation reports carefully
2. Map out the consensus:
- What strengths did multiple judges praise in each solution?
- What weaknesses did multiple judges criticize in each solution?
3. For each major component or section, think through:
- Which solution handles this best and why?
- Could a hybrid approach work better?
4. Create the best possible solution by:
- Copying text directly when one solution is clearly superior
- Combining approaches when a hybrid would be better
- Fixing all identified issues
- Preserving the best elements from each
5. Explain your synthesis decisions:
- What you took from each solution
- Why you made those choices
- How you addressed identified weaknesses
CRITICAL: Do not create something entirely new. Synthesize the best from what exists.
<output>
The command produces different outputs depending on the adaptive strategy selected:
- Candidate solutions: {solution-file}.[a|b|c].[ext] (in the specified output location)
- Evaluation reports: .specs/reports/{solution-name}-{date}.[1|2|3].md
- Final solution: {output_path}

Once command execution is complete, reply to the user with the following structure:
## Execution Summary
Original Task: {task_description}
Strategy Used: {strategy} ({reason})
### Results
| Phase | Agents | Models | Status |
|-------------------------|--------|----------|-------------|
| Phase [N]: [phase name] | [N] | [model] × 3 | [✅ Complete / ❌ Failed] |
### Files Created
Final Solution:
- {output_path} - Synthesized production-ready solution
Candidate Solutions:
- {solution-file}.[a|b|c].[ext] (Score: [X.X]/5.0)
Evaluation Reports:
- .specs/reports/{solution-name}-{date}.[1|2|3].md (Vote: [Solution A/B/C])
### Synthesis Decisions
| Element | Source | Rationale |
|----------------------|------------------|-------------|
| [element] | Solution [B/A/C] | [rationale] |
</output>
Choose 3-5 weighted criteria relevant to the task:
Code tasks:
Design tasks:
Documentation tasks:
❌ Using for trivial tasks - Overhead not justified
❌ Vague task descriptions - Leads to incomparable solutions
❌ Insufficient context - Agents can't produce quality work
❌ Weak evaluation criteria - Judges can't differentiate quality
❌ Forcing synthesis when a clear winner exists - Wastes cost and risks degrading quality
❌ Synthesizing fundamentally flawed solutions - Better to redesign than polish garbage
✅ Well-defined task with clear constraints
✅ Rich context for informed decisions
✅ Specific, measurable evaluation criteria
✅ Trust adaptive strategy selection
✅ Polish clear winners, synthesize split decisions, redesign failures
/do-competitively "Design REST API for user management (CRUD + auth)" \
--output "specs/api/users.md" \
--criteria "RESTfulness,security,scalability,developer-experience"
Phase 1 outputs:
- specs/api/users.a.md - Resource-based design with nested routes
- specs/api/users.b.md - Action-based design with RPC-style endpoints
- specs/api/users.c.md - Minimal design, missing auth consideration

Phase 2 outputs (assuming date 2025-01-15):
.specs/reports/users-api-2025-01-15.1.md:
VOTE: Solution A
SCORES: A=4.5/5.0, B=3.2/5.0, C=2.8/5.0
"Most RESTful, good security"
.specs/reports/users-api-2025-01-15.2.md:
VOTE: Solution A
SCORES: A=4.3/5.0, B=3.5/5.0, C=2.6/5.0
"Clean resource design, scalable"
.specs/reports/users-api-2025-01-15.3.md:
VOTE: Solution A
SCORES: A=4.6/5.0, B=3.0/5.0, C=2.9/5.0
"Best practices, clear structure"
Phase 2.5 decision (orchestrator parses headers):
Unanimous vote for Solution A → Strategy: SELECT_AND_POLISH
Phase 3 output:
specs/api/users.md - Solution A, polished based on judge feedback
/do-competitively "Design caching strategy for high-traffic API" \
--output "specs/caching.md" \
--criteria "performance,memory-efficiency,simplicity,reliability"
Phase 1 outputs:
- specs/caching.a.md - Redis with LRU eviction
- specs/caching.b.md - Multi-tier cache (memory + Redis)
- specs/caching.c.md - CDN + application cache

Phase 2 outputs (assuming date 2025-01-15):
.specs/reports/caching-2025-01-15.1.md:
VOTE: Solution B
SCORES: A=3.8/5.0, B=4.2/5.0, C=3.9/5.0
"Best performance, complex"
.specs/reports/caching-2025-01-15.2.md:
VOTE: Solution A
SCORES: A=4.0/5.0, B=3.9/5.0, C=3.7/5.0
"Simple, reliable, proven"
.specs/reports/caching-2025-01-15.3.md:
VOTE: Solution C
SCORES: A=3.6/5.0, B=4.0/5.0, C=4.1/5.0
"Global reach, cost-effective"
Phase 2.5 decision (orchestrator parses headers):
Split votes (B, A, C) with all average scores ≥3.0 → Strategy: FULL_SYNTHESIS
Phase 3 output:
specs/caching.md - Hybrid approach synthesized from the strongest elements of all three candidates
/do-competitively "Design authentication system with social login" \
--output "specs/auth.md" \
--criteria "security,user-experience,maintainability"
Phase 1 outputs:
- specs/auth.a.md - Custom OAuth2 implementation
- specs/auth.b.md - Session-based with social providers
- specs/auth.c.md - JWT with password-only auth

Phase 2 outputs (assuming date 2025-01-15):
.specs/reports/auth-2025-01-15.1.md:
VOTE: Solution A
SCORES: A=2.5/5.0, B=2.2/5.0, C=2.3/5.0
"Security risks, reinventing wheel"
.specs/reports/auth-2025-01-15.2.md:
VOTE: Solution B
SCORES: A=2.4/5.0, B=2.8/5.0, C=2.1/5.0
"Sessions don't scale, missing requirements"
.specs/reports/auth-2025-01-15.3.md:
VOTE: Solution C
SCORES: A=2.6/5.0, B=2.5/5.0, C=2.3/5.0
"No social login, security concerns"
Phase 2.5 decision (orchestrator parses headers):
Split votes: A, B, C (no consensus)
Average scores: A=2.5, B=2.5, C=2.2 (ALL <3.0)
Strategy: REDESIGN
Reason: All solutions below 3.0 threshold, fundamental issues
Do not stop; return to Phase 1. The loop should eventually finish with the SELECT_AND_POLISH or FULL_SYNTHESIS strategy.