Increment Quality Judge v2.0

LLM-as-Judge Pattern Implementation

AI-powered quality assessment using the LLM-as-Judge pattern - an established AI/ML evaluation technique where an LLM evaluates outputs with chain-of-thought reasoning, BMAD-pattern risk scoring, and formal quality gate decisions (PASS/CONCERNS/FAIL).

LLM-as-Judge: What It Is

LLM-as-Judge (LaaJ) is a recognized pattern in AI/ML evaluation where a large language model assesses quality using structured reasoning.

┌─────────────────────────────────────────────────────────────┐
│                 LLM-as-Judge Pattern                        │
├─────────────────────────────────────────────────────────────┤
│  Input:  spec.md, plan.md, tasks.md                        │
│                                                             │
│  Process:                                                   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ <thinking>                                          │   │
│  │   1. Read and understand the specification          │   │
│  │   2. Evaluate against 7 quality dimensions          │   │
│  │   3. Identify risks (P×I scoring)                   │   │
│  │   4. Form evidence-based verdict                    │   │
│  │ </thinking>                                         │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Output: Structured verdict with:                          │
│  • Dimension scores (0-100)                                │
│  • Risk assessment (CRITICAL/HIGH/MEDIUM/LOW)              │
│  • Quality gate decision (PASS/CONCERNS/FAIL)              │
│  • Actionable recommendations                              │
└─────────────────────────────────────────────────────────────┘

Why LLM-as-Judge works:

Consistency: Uniform evaluation criteria without human fatigue
Reasoning: Chain-of-thought explains WHY something is an issue
Scalability: Evaluates in seconds vs hours of manual review
Industry standard: Used by OpenAI, Anthropic, Google for AI evals

References:

"Judging LLM-as-a-Judge" (NeurIPS 2023)
LMSYS Chatbot Arena evaluation methodology
AlpacaEval, MT-Bench frameworks

IMPORTANT: This is a SKILL (Not an Agent)

DO NOT try to spawn this as an agent via Task tool.

This is a skill that auto-activates when you discuss quality assessment. To run quality assessment:

# Use the CLI command directly
specweave qa 0001 --pre

# Or use the slash command
/sw:qa 0001

The skill provides guidance and documentation. The CLI handles execution.

Why no agent? Having both a skill and agent with the same name (increment-quality-judge-v2) caused Claude to incorrectly construct agent type names. The skill-only approach eliminates this confusion.

What's New in v2.0

Risk Assessment Dimension - Probability × Impact scoring (0-10 scale, BMAD pattern)
Quality Gate Decisions - Formal PASS/CONCERNS/FAIL with thresholds
NFR Checking - Non-functional requirements (performance, security, scalability)
Enhanced Output - Blockers, concerns, recommendations with actionable mitigations
7 Dimensions - Added "Risk" to the existing 6 dimensions

Purpose

Provide comprehensive quality assessment that goes beyond structural validation to evaluate:

✅ Specification quality (6 dimensions)
✅ Risk levels (BMAD P×I scoring) - NEW!
✅ Quality gate readiness (PASS/CONCERNS/FAIL) - NEW!

When to Use

Auto-activates for:

/qa {increment-id} command
/qa {increment-id} --pre (pre-implementation check)
/qa {increment-id} --gate (quality gate check)
Natural language: "assess quality of increment 0001"

Keywords:

validate quality, quality check, assess spec
evaluate increment, spec review, quality score
risk assessment, qa check, quality gate
PASS/CONCERNS/FAIL

Evaluation Dimensions (7 total, was 6)

dimensions:
  clarity:
    weight: 0.18 # was 0.20
    criteria:
      - "Is the problem statement clear?"
      - "Are objectives well-defined?"
      - "Is terminology consistent?"

  testability:
    weight: 0.22 # was 0.25
    criteria:
      - "Are acceptance criteria testable?"
      - "Can success be measured objectively?"
      - "Are edge cases identifiable?"

  completeness:
    weight: 0.18 # was 0.20
    criteria:
      - "Are all requirements addressed?"
      - "Is error handling specified?"
      - "Are non-functional requirements included?"

  feasibility:
    weight: 0.13 # was 0.15
    criteria:
      - "Is the architecture scalable?"
      - "Are technical constraints realistic?"
      - "Is timeline achievable?"

  maintainability:
    weight: 0.09 # was 0.10
    criteria:
      - "Is design modular?"
      - "Are extension points identified?"
      - "Is technical debt addressed?"

  edge_cases:
    weight: 0.09 # was 0.10
    criteria:
      - "Are failure scenarios covered?"
      - "Are performance limits specified?"
      - "Are security considerations included?"

  # NEW: Risk Assessment (BMAD pattern)
  risk:
    weight: 0.11 # NEW!
    criteria:
      - "Are security risks identified and mitigated?"
      - "Are technical risks (scalability, performance) addressed?"
      - "Are implementation risks (complexity, dependencies) managed?"
      - "Are operational risks (monitoring, support) considered?"

Risk Assessment (BMAD Pattern) - NEW!

Risk Scoring Formula

Risk Score = Probability × Impact

Probability (0.0-1.0):
- 0.0-0.3: Low (unlikely to occur)
- 0.4-0.6: Medium (may occur)
- 0.7-1.0: High (likely to occur)

Impact (1-10):
- 1-3: Minor (cosmetic, no user impact)
- 4-6: Moderate (some impact, workaround exists)
- 7-9: Major (significant impact, no workaround)
- 10: Critical (system failure, data loss, security breach)

Final Score (0.0-10.0):
- 9.0-10.0: CRITICAL risk (FAIL quality gate)
- 6.0-8.9: HIGH risk (CONCERNS quality gate)
- 3.0-5.9: MEDIUM risk (PASS with monitoring)
- 0.0-2.9: LOW risk (PASS)

Risk Categories

Security Risks
- OWASP Top 10 vulnerabilities
- Data exposure, authentication, authorization
- Cryptographic failures
Technical Risks
- Architecture complexity, scalability bottlenecks
- Performance issues, technical debt
Implementation Risks
- Tight timeline, external dependencies
- Technical complexity
Operational Risks
- Lack of monitoring, difficult to maintain
- Poor documentation

Risk Assessment Prompt

You are evaluating SOFTWARE RISKS for an increment using BMAD's Probability × Impact scoring.

Read increment files:
- .specweave/increments/{id}/spec.md
- .specweave/increments/{id}/plan.md

For EACH risk you identify:

1. **Calculate PROBABILITY** (0.0-1.0)
   - Based on spec clarity, past experience, complexity
   - Low: 0.2, Medium: 0.5, High: 0.8

2. **Calculate IMPACT** (1-10)
   - 10 = Critical (security breach, data loss, system failure)
   - 7-9 = Major (significant user impact, no workaround)
   - 4-6 = Moderate (some impact, workaround exists)
   - 1-3 = Minor (cosmetic, no user impact)

3. **Calculate RISK SCORE** = Probability × Impact

4. **Provide MITIGATION** strategy

5. **Link to ACCEPTANCE CRITERIA** (if applicable)

Output format (JSON):
{
  "risks": [
    {
      "id": "RISK-001",
      "category": "security",
      "title": "Password storage not specified",
      "description": "Spec doesn't mention password hashing algorithm",
      "probability": 0.9,
      "impact": 10,
      "score": 9.0,
      "severity": "CRITICAL",
      "mitigation": "Use bcrypt or Argon2, never plain text",
      "location": "spec.md, Authentication section",
      "acceptance_criteria": "AC-US1-01"
    }
  ],
  "overall_risk_score": 7.5,
  "dimension_score": 0.35
}

Quality Gate Decisions - NEW!

Decision Logic

enum QualityGateDecision {
  PASS = "PASS",          // Ready for production
  CONCERNS = "CONCERNS",  // Issues found, should address
  FAIL = "FAIL"           // Blockers, must fix
}

Thresholds (BMAD pattern):

FAIL if any:
- Risk score ≥ 9.0 (CRITICAL)
- Test coverage < 60%
- Spec quality < 50
- Critical security vulnerabilities ≥ 1

CONCERNS if any:
- Risk score 6.0-8.9 (HIGH)
- Test coverage < 80%
- Spec quality < 70
- High security vulnerabilities ≥ 1

PASS otherwise

Output Example

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QA ASSESSMENT: Increment 0008-user-authentication
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Overall Score: 82/100 (GOOD) ✓

Dimension Scores:
  Clarity:         90/100 ✓✓
  Testability:     75/100 ⚠️
  Completeness:    88/100 ✓
  Feasibility:     85/100 ✓
  Maintainability: 80/100 ✓
  Edge Cases:      70/100 ⚠️
  Risk Assessment: 65/100 ⚠️  (7.2/10 risk score)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RISKS IDENTIFIED (3)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🔴 RISK-001: CRITICAL (9.0/10)
   Category: Security
   Title: Password storage implementation
   Description: Spec doesn't specify password hashing
   Probability: 0.9 (High) × Impact: 10 (Critical)
   Location: spec.md, Authentication section
   Mitigation: Use bcrypt/Argon2, never plain text
   AC: AC-US1-01

🟡 RISK-002: HIGH (6.0/10)
   Category: Security
   Title: Rate limiting not specified
   Description: No brute-force protection mentioned
   Probability: 0.6 (Medium) × Impact: 10 (Critical)
   Location: spec.md, Security section
   Mitigation: Add 5 failed attempts → 15 min lockout
   AC: AC-US1-03

🟢 RISK-003: LOW (2.4/10)
   Category: Technical
   Title: Session storage scalability
   Description: Plan uses in-memory sessions
   Probability: 0.4 (Medium) × Impact: 6 (Moderate)
   Location: plan.md, Architecture section
   Mitigation: Use Redis for session store

Overall Risk Score: 7.2/10 (MEDIUM-HIGH)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QUALITY GATE DECISION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🟡 CONCERNS (Not Ready for Production)

Blockers (MUST FIX):
  1. 🔴 CRITICAL RISK: Password storage (Risk ≥9)
     → Add task: "Implement bcrypt password hashing"

Concerns (SHOULD FIX):
  2. 🟡 HIGH RISK: Rate limiting not specified (Risk ≥6)
     → Update spec.md: Add rate limiting section
     → Add E2E test for rate limiting

  3. ⚠️  Testability: 75/100 (target: 80+)
     → Make acceptance criteria more measurable

Recommendations (NICE TO FIX):
  4. Edge cases: 70/100
     → Add error handling scenarios
  5. Session scalability
     → Consider Redis for session store

Decision: Address 1 blocker before proceeding

Would you like to:
  [E] Export blockers to tasks.md
  [U] Update spec.md with fixes (experimental)
  [C] Continue without changes

Workflow Integration

Quick Mode (Default)

User: /sw:qa 0001

Step 1: Rule-based validation (120 checks) - FREE, FAST
├── If FAILED → Stop, show errors
└── If PASSED → Continue

Step 2: AI Quality Assessment (Quick)
├── Spec quality (6 dimensions)
├── Risk assessment (BMAD P×I)
└── Quality gate decision (PASS/CONCERNS/FAIL)

Output: Enhanced report with risks and gate decision

Pre-Implementation Mode

User: /sw:qa 0001 --pre

Checks:
✅ Spec quality (clarity, testability, completeness)
✅ Risk assessment (identify issues early)
✅ Architecture review (plan.md soundness)
✅ Test strategy (test plan in tasks.md)

Gate decision before implementation starts

Quality Gate Mode

User: /sw:qa 0001 --gate

Comprehensive checks:
✅ All pre-implementation checks
✅ Test coverage (AC-ID coverage, gaps)
✅ E2E test coverage
✅ Documentation completeness

Final gate decision before closing increment

Enhanced Scoring Algorithm

Step 1: Dimension Evaluation (7 dimensions)

For each dimension (including NEW risk dimension), use Chain-of-Thought prompting:

<thinking>
1. Read spec.md thoroughly
2. For risk dimension specifically:
   - Identify all risks (security, technical, implementation, operational)
   - For each risk: calculate P, I, Score
   - Group by category
   - Calculate overall risk score
3. For other dimensions: evaluate criteria as before
4. Score 0.00-1.00
5. Identify issues
6. Provide suggestions
</thinking>

Score: 0.XX

Step 2: Weighted Overall Score (NEW weights)

overall_score =
  (clarity * 0.18) +
  (testability * 0.22) +
  (completeness * 0.18) +
  (feasibility * 0.13) +
  (maintainability * 0.09) +
  (edge_cases * 0.09) +
  (risk * 0.11)  // NEW!

Step 3: Quality Gate Decision

gate_decision = decide({
  spec_quality: overall_score,
  risk_score: risk_assessment.overall_risk_score,
  test_coverage: test_coverage.percentage, // if available
  security_audit: security_audit  // if available
})

Token Usage

Estimated per increment (Quick mode):

Small spec (<100 lines): ~~2,500 tokens (~~$0.025)
Medium spec (100-250 lines): ~~3,500 tokens (~~$0.035)
Large spec (>250 lines): ~~5,000 tokens (~~$0.050)

Cost increase from v1.0: +25% (added risk assessment dimension)

Optimization:

Only evaluate spec.md + plan.md for risks
Cache risk patterns for 5 min
Skip risk assessment if spec < 50 lines (too small to assess)

Configuration

{
  "qa": {
    "qualityGateThresholds": {
      "fail": {
        "riskScore": 9.0,
        "testCoverage": 60,
        "specQuality": 50,
        "criticalVulnerabilities": 1
      },
      "concerns": {
        "riskScore": 6.0,
        "testCoverage": 80,
        "specQuality": 70,
        "highVulnerabilities": 1
      }
    },
    "dimensions": {
      "risk": {
        "enabled": true,
        "weight": 0.11
      }
    }
  }
}

Migration from v1.0

v1.0 (6 dimensions):

Clarity, Testability, Completeness, Feasibility, Maintainability, Edge Cases

v2.0 (7 dimensions, NEW: Risk):

All v1.0 dimensions + Risk Assessment
Weights adjusted to accommodate new dimension
Quality gate decisions added
BMAD risk scoring added

Backward Compatibility:

v1.0 skills still work (auto-upgrade to v2.0 if risk assessment enabled)
Existing scores rescaled to new weights automatically
Can disable risk assessment in config to revert to v1.0 behavior

Best Practices

Run early and often: Use --pre mode before implementation
Fix blockers immediately: Don't proceed if FAIL
Address concerns before release: CONCERNS = should fix
Use risk scores to prioritize: Fix CRITICAL risks first
Export to tasks.md: Convert blockers/concerns to actionable tasks

Limitations

What quality-judge v2.0 CAN'T do:

❌ Understand domain-specific compliance (HIPAA, PCI-DSS)
❌ Verify technical feasibility with actual codebase
❌ Replace human expertise and security audits
❌ Predict actual probability without historical data

What quality-judge v2.0 CAN do:

✅ Catch vague or ambiguous language
✅ Identify missing security considerations (OWASP-based)
✅ Spot untestable acceptance criteria
✅ Suggest industry best practices
✅ Flag missing edge cases
✅ Assess risks systematically (BMAD pattern) - NEW!
✅ Provide formal quality gate decisions - NEW!

Summary

increment-quality-judge v2.0 adds comprehensive risk assessment and quality gate decisions:

✅ Risk assessment (BMAD P×I scoring, 0-10 scale) ✅ Quality gate decisions (PASS/CONCERNS/FAIL with thresholds) ✅ 7 dimensions (added "Risk" to existing 6) ✅ NFR checking (performance, security, scalability) ✅ Enhanced output (blockers, concerns, recommendations) ✅ Chain-of-thought (LLM-as-Judge 2025 best practices) ✅ Backward compatible (can disable risk assessment)

Use it when: You want comprehensive quality assessment with risk scoring and formal gate decisions before implementation or release.

Skip it when: Quick iteration, tight token budget, or simple features where rule-based validation suffices.

Version: 2.0.0 Since: v0.8.0 Related: /sw:qa command, QAOrchestrator agent (v0.9.0)

increment-quality-judge-v2