Ultrathink LLM-as-Judge validation of completed work. Uses extended thinking by DEFAULT for thorough evaluation.
Validate completed work using extended thinking for thorough LLM-as-Judge analysis. Use this after finishing code to get deep, multi-dimensional evaluation of correctness, security, and quality before proceeding.
/plugin marketplace add anton-abyzov/specweave

/plugin install sw@specweave

ULTRATHINK BY DEFAULT - Validate completed work using extended thinking and the LLM-as-Judge pattern.
This command ALWAYS uses ultrathink (extended thinking) for thorough LLM-as-Judge evaluation:
DEFAULT BEHAVIOR = ULTRATHINK MODE
- Extended thinking enabled
- Deep chain-of-thought reasoning
- Thorough multi-dimensional analysis
- ~60-90 seconds for comprehensive evaluation
Use --quick only if you explicitly need faster (but less thorough) validation.
Use when you've completed work and want maximum-quality AI validation:
# DEFAULT: Ultrathink validation (recommended)
/sw:judge-llm src/file.ts
/sw:judge-llm "src/**/*.ts"
# Validate git changes (ultrathink by default)
/sw:judge-llm --staged # Staged changes
/sw:judge-llm --last-commit # Last commit
/sw:judge-llm --diff main # Diff vs branch
# Quick mode (ONLY if you need speed over thoroughness)
/sw:judge-llm src/file.ts --quick
# Additional options
/sw:judge-llm src/file.ts --strict # Fail on any concern
/sw:judge-llm src/file.ts --fix # Include fix suggestions
/sw:judge-llm src/file.ts --export # Export report to markdown
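As a rough sketch of how the flags above could be turned into a structured request, the following parser maps each documented flag to an option. The names `parseJudgeArgs`, `JudgeOptions`, and `Target`, and the specific git argument lists, are assumptions made for illustration; they are not taken from the command's actual implementation.

```typescript
// Hypothetical shapes: the real command's internals are not documented here.
type Target =
  | { kind: "paths"; patterns: string[] }
  | { kind: "git"; gitArgs: string[] };

interface JudgeOptions {
  target: Target;
  quick: boolean; // --quick: standard reasoning instead of ultrathink
  strict: boolean; // --strict: fail on any concern
  fix: boolean; // --fix: include fix suggestions
  exportReport: boolean; // --export: export report to markdown
}

function parseJudgeArgs(argv: string[]): JudgeOptions {
  const rest = [...argv];
  const patterns: string[] = [];
  let gitArgs: string[] | null = null;
  const flags = { quick: false, strict: false, fix: false, exportReport: false };

  let arg: string | undefined;
  while ((arg = rest.shift()) !== undefined) {
    switch (arg) {
      case "--quick": flags.quick = true; break;
      case "--strict": flags.strict = true; break;
      case "--fix": flags.fix = true; break;
      case "--export": flags.exportReport = true; break;
      case "--staged":
        gitArgs = ["diff", "--cached"]; // assumed mapping: staged changes
        break;
      case "--last-commit":
        gitArgs = ["show", "HEAD"]; // assumed mapping: last commit
        break;
      case "--diff": {
        const branch = rest.shift();
        if (branch === undefined) throw new Error("--diff requires a branch name");
        gitArgs = ["diff", branch]; // assumed mapping: diff vs branch
        break;
      }
      default:
        patterns.push(arg); // anything else is treated as a file path or glob
    }
  }

  const target: Target =
    gitArgs !== null ? { kind: "git", gitArgs } : { kind: "paths", patterns };
  return { target, ...flags };
}
```

Under these assumptions, `parseJudgeArgs(["--diff", "main", "--strict"])` yields a git target with arguments `diff main` and `strict` set, while ultrathink stays the default because `quick` is false unless `--quick` is passed.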
When you invoke /sw:judge-llm, Claude will:
Determine what to validate:
- `--staged` → get staged git changes
- `--last-commit` → get files from last commit
- `--diff <branch>` → get diff against branch

MANDATORY: Use extended thinking for deep LLM-as-Judge evaluation:
Claude MUST use ultrathink/extended thinking to:
1. **DEEP READ**: Thoroughly understand all code, context, and intent
2. **MULTI-DIMENSIONAL ANALYSIS**: Evaluate across ALL dimensions:
   - Correctness: Does it work exactly as intended?
   - Completeness: ALL edge cases handled? ALL requirements met?
   - Security: ANY vulnerabilities? OWASP Top 10 checked?
   - Performance: Algorithmic complexity? Memory usage? Bottlenecks?
   - Maintainability: Clean? Clear? Follows conventions?
   - Testability: Can it be tested? Are tests adequate?
   - Error handling: All failure modes covered?
3. **CRITICAL EVALUATION**: Weigh ALL findings by severity
4. **REASONED VERDICT**: Form verdict based on thorough analysis
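To make the four steps above concrete, here is a minimal sketch of how the seven dimensions might be assembled into a judge prompt. The function `buildJudgePrompt`, the `DIMENSIONS` constant, and the prompt wording are hypothetical; only the dimension names and questions come from the list above.

```typescript
// Dimension names and questions are copied from the checklist above.
const DIMENSIONS = [
  { name: "Correctness", question: "Does it work exactly as intended?" },
  { name: "Completeness", question: "ALL edge cases handled? ALL requirements met?" },
  { name: "Security", question: "ANY vulnerabilities? OWASP Top 10 checked?" },
  { name: "Performance", question: "Algorithmic complexity? Memory usage? Bottlenecks?" },
  { name: "Maintainability", question: "Clean? Clear? Follows conventions?" },
  { name: "Testability", question: "Can it be tested? Are tests adequate?" },
  { name: "Error handling", question: "All failure modes covered?" },
] as const;

// Hypothetical helper: builds the text handed to the judging model.
function buildJudgePrompt(fileName: string, source: string): string {
  const checklist = DIMENSIONS
    .map((d, i) => `${i + 1}. ${d.name}: ${d.question}`)
    .join("\n");

  return [
    "Act as a judge of completed work.",
    `File: ${fileName}`,
    "Read the code closely, evaluate it on every dimension below,",
    "weigh the findings by severity, and give a reasoned verdict:",
    "APPROVED, CONCERNS, or REJECTED.",
    "",
    checklist,
    "",
    "Code:",
    source,
  ].join("\n");
}
```

Keeping the checklist as data rather than inline prose would make it easy to confirm that no dimension is silently dropped from the prompt.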
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: APPROVED | CONCERNS | REJECTED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Mode: ULTRATHINK (extended thinking)
Confidence: 0.XX
Files Analyzed: N
REASONING:
[Detailed chain-of-thought from extended thinking]
ISSUES (if any):
🔴 CRITICAL: [title]
[description]
📍 [file:line]
💡 [suggestion]
🟡 HIGH: [title]
...
🟢 LOW: [title]
...
VERDICT: [summary sentence]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
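The output format above can be modeled as a small data structure plus a renderer. This is a sketch under stated assumptions: the `JudgeReport` and `Issue` types and the `renderReport` function are hypothetical names, the rule width is an arbitrary choice, and only the three severities shown in the template are modeled.

```typescript
type Severity = "CRITICAL" | "HIGH" | "LOW";
type Verdict = "APPROVED" | "CONCERNS" | "REJECTED";

// Hypothetical types mirroring the fields of the template above.
interface Issue {
  severity: Severity;
  title: string;
  description: string;
  location: string; // "file:line"
  suggestion: string;
}

interface JudgeReport {
  verdict: Verdict;
  mode: "ULTRATHINK" | "QUICK";
  confidence: number; // 0..1
  filesAnalyzed: number;
  reasoning: string;
  issues: Issue[];
  summary: string;
}

const ICON: Record<Severity, string> = { CRITICAL: "🔴", HIGH: "🟡", LOW: "🟢" };
const RULE = "━".repeat(57); // width is an assumption

function renderReport(r: JudgeReport): string {
  const modeLabel =
    r.mode === "ULTRATHINK" ? "ULTRATHINK (extended thinking)" : "QUICK (standard reasoning)";

  const lines = [
    RULE,
    `JUDGE-LLM VERDICT: ${r.verdict}`,
    RULE,
    `Mode: ${modeLabel}`,
    `Confidence: ${r.confidence.toFixed(2)}`,
    `Files Analyzed: ${r.filesAnalyzed}`,
    "REASONING:",
    r.reasoning,
  ];

  // The ISSUES section is only emitted when there is something to report.
  if (r.issues.length > 0) {
    lines.push("ISSUES:");
    for (const issue of r.issues) {
      lines.push(
        `${ICON[issue.severity]} ${issue.severity}: ${issue.title}`,
        issue.description,
        `📍 ${issue.location}`,
        `💡 ${issue.suggestion}`,
      );
    }
  }

  lines.push(`VERDICT: ${r.summary}`, RULE);
  return lines.join("\n");
}
```

A structured report like this would also give `--export` something straightforward to serialize, though how the real command exports is not specified here.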
| Verdict | Meaning | Action |
|---|---|---|
| APPROVED | Work is solid, no significant issues | Safe to proceed |
| CONCERNS | Issues found worth addressing | Review and fix recommended |
| REJECTED | Critical issues found | MUST fix before proceeding |
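One possible reading of the table above, expressed as code: any CRITICAL issue means REJECTED, any other issue means CONCERNS, and no issues means APPROVED. The function name `deriveVerdict` is hypothetical, and the handling of `--strict` (escalating CONCERNS to REJECTED) is an assumption, since the documentation only says strict mode fails on any concern.

```typescript
type Severity = "CRITICAL" | "HIGH" | "LOW";
type Verdict = "APPROVED" | "CONCERNS" | "REJECTED";

// Hypothetical helper: maps the severities of found issues to a verdict.
function deriveVerdict(severities: Severity[], strict = false): Verdict {
  // Critical issues always block, matching the REJECTED row of the table.
  if (severities.includes("CRITICAL")) return "REJECTED";

  // Nothing found: safe to proceed.
  if (severities.length === 0) return "APPROVED";

  // Assumption: in strict mode any remaining concern is treated as a failure.
  return strict ? "REJECTED" : "CONCERNS";
}
```

In practice the judge forms its verdict through reasoning rather than a fixed rule, so a function like this is better seen as a sanity check on the reported verdict than as a replacement for it.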
User: /sw:judge-llm src/core/auth/login.ts
Claude: [Uses extended thinking for deep analysis]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: APPROVED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Mode: ULTRATHINK (extended thinking)
Confidence: 0.91
Files Analyzed: 1
REASONING:
After thorough analysis with extended thinking:
The login implementation demonstrates excellent security practices:
- Password hashing uses bcrypt with cost factor 12 (appropriate)
- Rate limiting implemented correctly (5 attempts, 15 min exponential backoff)
- Input validation prevents SQL injection and XSS
- Error messages are generic (don't leak user existence)
- Session tokens use cryptographically secure random generation
- CSRF protection properly implemented
Edge cases handled:
- Empty input validation ✓
- Unicode normalization for usernames ✓
- Timing attack mitigation via constant-time comparison ✓
No security, performance, or maintainability issues found.
VERDICT: Production-ready implementation with excellent security posture.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
User: /sw:judge-llm --staged
Claude: [Uses extended thinking]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: CONCERNS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Mode: ULTRATHINK (extended thinking)
Confidence: 0.84
Files Analyzed: 3
REASONING:
Extended thinking analysis of staged changes reveals:
Positive aspects:
- New API endpoint follows existing patterns
- TypeScript types are correct
- Error handling present
However, thorough analysis found issues:
🟡 HIGH: Missing Input Validation
User input passed to database without sanitization
📍 src/api/users.ts:45
💡 Add Zod schema validation:
```typescript
import { z } from "zod";
// Reject anything that is not a UUID before it reaches the database
const schema = z.object({ userId: z.string().uuid() });
const { userId } = schema.parse(req.body);
```
🟡 HIGH: Information Disclosure Risk
Stack traces exposed in error responses
📍 src/api/users.ts:62
💡 Use production error handler that sanitizes output
🟢 LOW: Missing rate limiting
New endpoint has no rate limiting
📍 src/api/users.ts:30
💡 Add rate limiter middleware
VERDICT: Address HIGH issues before merging. LOW can be follow-up.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
### Example 3: Quick validation (when needed)
User: /sw:judge-llm src/utils/format.ts --quick
Claude: [Standard reasoning, no extended thinking]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: APPROVED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Mode: QUICK (standard reasoning)
Confidence: 0.75
Files Analyzed: 1
REASONING:
Utility formatting functions look correct. No obvious issues.
VERDICT: Looks good for a utility file.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## Simplest Usage
Just say in your prompt:
- "judge-llm my work"
- "use judge-llm"
- "judge-llm this"

Claude will:
1. Automatically gather context from the conversation
2. Use ULTRATHINK extended thinking by default
3. Apply thorough LLM-as-Judge evaluation
## Difference from /sw:qa
| Aspect | `/sw:qa` | `/sw:judge-llm` |
|--------|-----------------|------------------------|
| **Scope** | Increments only | Any files |
| **Input** | Increment ID | Files, git diff, context |
| **Default Mode** | Standard | **ULTRATHINK** |
| **Pattern** | 7-dimension scoring | Judge LLM reasoning |
| **Focus** | Spec quality, risks | Code correctness |
| **When** | Before increment close | After any work |
## Best Practices
1. **Use by default**: Ultrathink is worth the extra time for quality
2. **Use `--staged`**: Validate before committing
3. **Use `--strict` for critical code**: Payment, auth, security
4. **Fix CRITICAL issues immediately**: Never ignore these
5. **Trust the ultrathink analysis**: Extended thinking catches subtle issues
## Limitations
- ❌ Doesn't execute tests (use test runners)
- ❌ Doesn't auto-apply fixes (only suggests)
- ❌ May miss domain-specific issues
- ❌ Not a replacement for human review
## Related
- `/sw:qa` - Increment-bound quality assessment
- `/sw:validate` - Rule-based increment validation
- `ado-sync-judge` agent - Uses judge pattern for sync validation