Enhanced AI-powered quality assessment with RISK SCORING (BMAD pattern) and quality gate decisions. Evaluates specifications, plans, and tests for clarity, testability, completeness, feasibility, maintainability, edge cases, and RISKS. Provides PASS/CONCERNS/FAIL decisions. Activates for: validate quality, quality check, assess spec, evaluate increment, spec review, quality score, risk assessment, qa check, quality gate, and the /sw:qa command.
AI-powered quality assessment with risk scoring and PASS/CONCERNS/FAIL decisions. Evaluates specs, plans, and tests across 7 dimensions including security risks. Auto-activates for quality checks, risk assessment, and `/sw:qa` commands.
/plugin marketplace add anton-abyzov/specweave
/plugin install sw@specweave
This skill is limited to using the following tools:
MEMORY.md
LLM-as-Judge Pattern Implementation
AI-powered quality assessment using the LLM-as-Judge pattern - an established AI/ML evaluation technique where an LLM evaluates outputs with chain-of-thought reasoning, BMAD-pattern risk scoring, and formal quality gate decisions (PASS/CONCERNS/FAIL).
LLM-as-Judge (LaaJ) is a recognized pattern in AI/ML evaluation where a large language model assesses quality using structured reasoning.
┌─────────────────────────────────────────────────────────────┐
│ LLM-as-Judge Pattern │
├─────────────────────────────────────────────────────────────┤
│ Input: spec.md, plan.md, tasks.md │
│ │
│ Process: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ <thinking> │ │
│ │ 1. Read and understand the specification │ │
│ │ 2. Evaluate against 7 quality dimensions │ │
│ │ 3. Identify risks (P×I scoring) │ │
│ │ 4. Form evidence-based verdict │ │
│ │ </thinking> │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Output: Structured verdict with: │
│ • Dimension scores (0-100) │
│ • Risk assessment (CRITICAL/HIGH/MEDIUM/LOW) │
│ • Quality gate decision (PASS/CONCERNS/FAIL) │
│ • Actionable recommendations │
└─────────────────────────────────────────────────────────────┘
Why LLM-as-Judge works:
References:
DO NOT try to spawn this as an agent via Task tool.
This is a skill that auto-activates when you discuss quality assessment. To run quality assessment:
# Use the CLI command directly
specweave qa 0001 --pre
# Or use the slash command
/sw:qa 0001
The skill provides guidance and documentation. The CLI handles execution.
Why no agent? Having both a skill and agent with the same name (increment-quality-judge-v2) caused Claude to incorrectly construct agent type names. The skill-only approach eliminates this confusion.
Provide comprehensive quality assessment that goes beyond structural validation to evaluate:
Auto-activates for:
- /qa {increment-id} command
- /qa {increment-id} --pre (pre-implementation check)
- /qa {increment-id} --gate (quality gate check)
Keywords:
dimensions:
clarity:
weight: 0.18 # was 0.20
criteria:
- "Is the problem statement clear?"
- "Are objectives well-defined?"
- "Is terminology consistent?"
testability:
weight: 0.22 # was 0.25
criteria:
- "Are acceptance criteria testable?"
- "Can success be measured objectively?"
- "Are edge cases identifiable?"
completeness:
weight: 0.18 # was 0.20
criteria:
- "Are all requirements addressed?"
- "Is error handling specified?"
- "Are non-functional requirements included?"
feasibility:
weight: 0.13 # was 0.15
criteria:
- "Is the architecture scalable?"
- "Are technical constraints realistic?"
- "Is timeline achievable?"
maintainability:
weight: 0.09 # was 0.10
criteria:
- "Is design modular?"
- "Are extension points identified?"
- "Is technical debt addressed?"
edge_cases:
weight: 0.09 # was 0.10
criteria:
- "Are failure scenarios covered?"
- "Are performance limits specified?"
- "Are security considerations included?"
# NEW: Risk Assessment (BMAD pattern)
risk:
weight: 0.11 # NEW!
criteria:
- "Are security risks identified and mitigated?"
- "Are technical risks (scalability, performance) addressed?"
- "Are implementation risks (complexity, dependencies) managed?"
- "Are operational risks (monitoring, support) considered?"
Risk Score = Probability × Impact
Probability (0.0-1.0):
- 0.0-0.3: Low (unlikely to occur)
- 0.4-0.6: Medium (may occur)
- 0.7-1.0: High (likely to occur)
Impact (1-10):
- 1-3: Minor (cosmetic, no user impact)
- 4-6: Moderate (some impact, workaround exists)
- 7-9: Major (significant impact, no workaround)
- 10: Critical (system failure, data loss, security breach)
Final Score (0.0-10.0):
- 9.0-10.0: CRITICAL risk (FAIL quality gate)
- 6.0-8.9: HIGH risk (CONCERNS quality gate)
- 3.0-5.9: MEDIUM risk (PASS with monitoring)
- 0.0-2.9: LOW risk (PASS)
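The scoring rule above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the plugin's actual implementation: the names `riskScore` and `severityOf` are hypothetical, and rounding to one decimal place is an assumption made here so that floating-point noise (for example `0.4 * 6`) does not push a score across a band boundary.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

// Risk score = probability × impact, rounded to one decimal place.
// (Rounding is an assumption of this sketch, not something the skill specifies.)
function riskScore(probability: number, impact: number): number {
  if (probability < 0 || probability > 1) {
    throw new RangeError("probability must be between 0.0 and 1.0");
  }
  if (impact < 1 || impact > 10) {
    throw new RangeError("impact must be between 1 and 10");
  }
  return Math.round(probability * impact * 10) / 10;
}

// Map a final score (0.0-10.0) onto the severity bands listed above.
function severityOf(score: number): Severity {
  if (score >= 9.0) return "CRITICAL";
  if (score >= 6.0) return "HIGH";
  if (score >= 3.0) return "MEDIUM";
  return "LOW";
}

// Example: a high-probability risk with critical impact.
const score = riskScore(0.9, 10);
console.log(score, severityOf(score));
```

With these bands, a probability of 0.6 and an impact of 10 lands exactly on the HIGH boundary (6.0), and a probability of 0.4 with an impact of 6 stays LOW (2.4).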
Security Risks
Technical Risks
Implementation Risks
Operational Risks
You are evaluating SOFTWARE RISKS for an increment using BMAD's Probability × Impact scoring.
Read increment files:
- .specweave/increments/{id}/spec.md
- .specweave/increments/{id}/plan.md
For EACH risk you identify:
1. **Calculate PROBABILITY** (0.0-1.0)
- Based on spec clarity, past experience, complexity
- Low: 0.2, Medium: 0.5, High: 0.8
2. **Calculate IMPACT** (1-10)
- 10 = Critical (security breach, data loss, system failure)
- 7-9 = Major (significant user impact, no workaround)
- 4-6 = Moderate (some impact, workaround exists)
- 1-3 = Minor (cosmetic, no user impact)
3. **Calculate RISK SCORE** = Probability × Impact
4. **Provide MITIGATION** strategy
5. **Link to ACCEPTANCE CRITERIA** (if applicable)
Output format (JSON):
{
"risks": [
{
"id": "RISK-001",
"category": "security",
"title": "Password storage not specified",
"description": "Spec doesn't mention password hashing algorithm",
"probability": 0.9,
"impact": 10,
"score": 9.0,
"severity": "CRITICAL",
"mitigation": "Use bcrypt or Argon2, never plain text",
"location": "spec.md, Authentication section",
"acceptance_criteria": "AC-US1-01"
}
],
"overall_risk_score": 7.5,
"dimension_score": 0.35
}
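Because the judge returns free-form JSON, it can help to check that each risk entry is internally consistent before acting on it. The sketch below is one possible check, assuming the field names from the output format above; the `Risk` interface, `checkRisk`, and the 0.05 tolerance are hypothetical choices for illustration and are not part of the skill.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

// Shape of one entry in the "risks" array (field names taken from the format above).
interface Risk {
  id: string;
  category: string;
  title: string;
  description: string;
  probability: number;
  impact: number;
  score: number;
  severity: Severity;
  mitigation: string;
  location: string;
  acceptance_criteria?: string; // only present when the risk maps to an AC
}

function severityFor(score: number): Severity {
  if (score >= 9.0) return "CRITICAL";
  if (score >= 6.0) return "HIGH";
  if (score >= 3.0) return "MEDIUM";
  return "LOW";
}

// Returns a list of human-readable problems; an empty list means the entry is consistent.
function checkRisk(risk: Risk): string[] {
  const problems: string[] = [];
  if (risk.probability < 0 || risk.probability > 1) {
    problems.push(`${risk.id}: probability ${risk.probability} is outside 0.0-1.0`);
  }
  if (risk.impact < 1 || risk.impact > 10) {
    problems.push(`${risk.id}: impact ${risk.impact} is outside 1-10`);
  }
  const expectedScore = Math.round(risk.probability * risk.impact * 10) / 10;
  if (Math.abs(expectedScore - risk.score) > 0.05) {
    problems.push(`${risk.id}: score ${risk.score} does not match P x I = ${expectedScore}`);
  }
  if (risk.severity !== severityFor(expectedScore)) {
    problems.push(`${risk.id}: severity ${risk.severity} does not match score ${expectedScore}`);
  }
  return problems;
}

// The sample entry from the output format above.
const sampleRisk: Risk = JSON.parse(`{
  "id": "RISK-001",
  "category": "security",
  "title": "Password storage not specified",
  "description": "Spec doesn't mention password hashing algorithm",
  "probability": 0.9,
  "impact": 10,
  "score": 9.0,
  "severity": "CRITICAL",
  "mitigation": "Use bcrypt or Argon2, never plain text",
  "location": "spec.md, Authentication section",
  "acceptance_criteria": "AC-US1-01"
}`);

console.log(checkRisk(sampleRisk));
```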
enum QualityGateDecision {
PASS = "PASS", // Ready for production
CONCERNS = "CONCERNS", // Issues found, should address
FAIL = "FAIL" // Blockers, must fix
}
Thresholds (BMAD pattern):
FAIL if any:
- Risk score ≥ 9.0 (CRITICAL)
- Test coverage < 60%
- Spec quality < 50
- Critical security vulnerabilities ≥ 1
CONCERNS if any:
- Risk score 6.0-8.9 (HIGH)
- Test coverage < 80%
- Spec quality < 70
- High security vulnerabilities ≥ 1
PASS otherwise
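The threshold rules above translate directly into a small decision function. The following is a hedged sketch rather than the CLI's real logic: `GateInputs` and `decideGate` are hypothetical names, and treating a missing test-coverage or vulnerability figure as "not checked" is an assumption.

```typescript
type Gate = "PASS" | "CONCERNS" | "FAIL";

interface GateInputs {
  riskScore: number;                 // overall risk score, 0.0-10.0
  specQuality: number;               // overall spec quality, 0-100
  testCoverage?: number;             // percent; skipped when not available (assumption)
  criticalVulnerabilities?: number;  // count; treated as 0 when absent (assumption)
  highVulnerabilities?: number;      // count; treated as 0 when absent (assumption)
}

function decideGate(input: GateInputs): Gate {
  const coverage = input.testCoverage;

  // FAIL if any blocker threshold is crossed.
  if (
    input.riskScore >= 9.0 ||
    (coverage !== undefined && coverage < 60) ||
    input.specQuality < 50 ||
    (input.criticalVulnerabilities ?? 0) >= 1
  ) {
    return "FAIL";
  }

  // CONCERNS if any warning threshold is crossed.
  if (
    input.riskScore >= 6.0 ||
    (coverage !== undefined && coverage < 80) ||
    input.specQuality < 70 ||
    (input.highVulnerabilities ?? 0) >= 1
  ) {
    return "CONCERNS";
  }

  return "PASS";
}

// Example: a HIGH overall risk score with otherwise good spec quality.
console.log(decideGate({ riskScore: 7.2, specQuality: 85 }));
```

Checking the FAIL conditions first matters: an input that crosses both a FAIL and a CONCERNS threshold must resolve to FAIL.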
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QA ASSESSMENT: Increment 0008-user-authentication
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Score: 80/100 (GOOD) ✓
Dimension Scores:
Clarity: 90/100 ✓✓
Testability: 75/100 ⚠️
Completeness: 88/100 ✓
Feasibility: 85/100 ✓
Maintainability: 80/100 ✓
Edge Cases: 70/100 ⚠️
Risk Assessment: 65/100 ⚠️ (7.2/10 risk score)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RISKS IDENTIFIED (3)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔴 RISK-001: CRITICAL (9.0/10)
Category: Security
Title: Password storage implementation
Description: Spec doesn't specify password hashing
Probability: 0.9 (High) × Impact: 10 (Critical)
Location: spec.md, Authentication section
Mitigation: Use bcrypt/Argon2, never plain text
AC: AC-US1-01
🟡 RISK-002: HIGH (6.0/10)
Category: Security
Title: Rate limiting not specified
Description: No brute-force protection mentioned
Probability: 0.6 (Medium) × Impact: 10 (Critical)
Location: spec.md, Security section
Mitigation: Add 5 failed attempts → 15 min lockout
AC: AC-US1-03
🟢 RISK-003: LOW (2.4/10)
Category: Technical
Title: Session storage scalability
Description: Plan uses in-memory sessions
Probability: 0.4 (Medium) × Impact: 6 (Moderate)
Location: plan.md, Architecture section
Mitigation: Use Redis for session store
Overall Risk Score: 7.2/10 (HIGH)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QUALITY GATE DECISION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🟡 CONCERNS (Not Ready for Production)
Blockers (MUST FIX):
1. 🔴 CRITICAL RISK: Password storage (Risk ≥9)
→ Add task: "Implement bcrypt password hashing"
Concerns (SHOULD FIX):
2. 🟡 HIGH RISK: Rate limiting not specified (Risk ≥6)
→ Update spec.md: Add rate limiting section
→ Add E2E test for rate limiting
3. ⚠️ Testability: 75/100 (target: 80+)
→ Make acceptance criteria more measurable
Recommendations (NICE TO FIX):
4. Edge cases: 70/100
→ Add error handling scenarios
5. Session scalability
→ Consider Redis for session store
Decision: Address 1 blocker before proceeding
Would you like to:
[E] Export blockers to tasks.md
[U] Update spec.md with fixes (experimental)
[C] Continue without changes
User: /sw:qa 0001
Step 1: Rule-based validation (120 checks) - FREE, FAST
├── If FAILED → Stop, show errors
└── If PASSED → Continue
Step 2: AI Quality Assessment (Quick)
├── Spec quality (6 dimensions)
├── Risk assessment (BMAD P×I)
└── Quality gate decision (PASS/CONCERNS/FAIL)
Output: Enhanced report with risks and gate decision
User: /sw:qa 0001 --pre
Checks:
✅ Spec quality (clarity, testability, completeness)
✅ Risk assessment (identify issues early)
✅ Architecture review (plan.md soundness)
✅ Test strategy (test plan in tasks.md)
Gate decision before implementation starts
User: /sw:qa 0001 --gate
Comprehensive checks:
✅ All pre-implementation checks
✅ Test coverage (AC-ID coverage, gaps)
✅ E2E test coverage
✅ Documentation completeness
Final gate decision before closing increment
For each dimension (including NEW risk dimension), use Chain-of-Thought prompting:
<thinking>
1. Read spec.md thoroughly
2. For risk dimension specifically:
- Identify all risks (security, technical, implementation, operational)
- For each risk: calculate P, I, Score
- Group by category
- Calculate overall risk score
3. For other dimensions: evaluate criteria as before
4. Score 0.00-1.00
5. Identify issues
6. Provide suggestions
</thinking>
Score: 0.XX
overall_score =
(clarity * 0.18) +
(testability * 0.22) +
(completeness * 0.18) +
(feasibility * 0.13) +
(maintainability * 0.09) +
(edge_cases * 0.09) +
(risk * 0.11) // NEW!
gate_decision = decide({
spec_quality: overall_score,
risk_score: risk_assessment.overall_risk_score,
test_coverage: test_coverage.percentage, // if available
security_audit: security_audit // if available
})
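As a runnable counterpart to the weighted-sum formula, here is a minimal TypeScript sketch. The weights are the ones listed in this document; the function name `overallScore`, the `Dimension` type, and the example scores are hypothetical and chosen only to show the arithmetic.

```typescript
// Dimension weights as given in this document (they sum to 1.00).
const WEIGHTS = {
  clarity: 0.18,
  testability: 0.22,
  completeness: 0.18,
  feasibility: 0.13,
  maintainability: 0.09,
  edge_cases: 0.09,
  risk: 0.11,
} as const;

type Dimension = keyof typeof WEIGHTS;

// Weighted sum of per-dimension scores, each expected on the 0.00-1.00 scale.
function overallScore(scores: Record<Dimension, number>): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    const score = scores[dim];
    if (score < 0 || score > 1) {
      throw new RangeError(`${dim} score must be between 0.00 and 1.00`);
    }
    total += score * WEIGHTS[dim];
  }
  return total;
}

// Hypothetical scores, for illustration only.
const exampleScores: Record<Dimension, number> = {
  clarity: 0.8,
  testability: 0.7,
  completeness: 0.9,
  feasibility: 0.6,
  maintainability: 0.5,
  edge_cases: 0.5,
  risk: 0.4,
};

// 0.144 + 0.154 + 0.162 + 0.078 + 0.045 + 0.045 + 0.044 ≈ 0.672
console.log(overallScore(exampleScores).toFixed(3));
```

Multiply the result by 100 to compare it against the `specQuality` thresholds, which are expressed on a 0-100 scale.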
Estimated per increment (Quick mode):
Cost increase from v1.0: +25% (added risk assessment dimension)
Optimization:
{
"qa": {
"qualityGateThresholds": {
"fail": {
"riskScore": 9.0,
"testCoverage": 60,
"specQuality": 50,
"criticalVulnerabilities": 1
},
"concerns": {
"riskScore": 6.0,
"testCoverage": 80,
"specQuality": 70,
"highVulnerabilities": 1
}
},
"dimensions": {
"risk": {
"enabled": true,
"weight": 0.11
}
}
}
}
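The configuration allows the risk dimension to be switched off, but this document does not say how the remaining weights are handled in that case. One plausible approach, shown here purely as an assumption, is to drop the disabled dimension and rescale the rest so they sum to 1 again. The `renormalize` helper is hypothetical.

```typescript
type Weights = Record<string, number>;

// Drop the disabled dimensions and rescale the remaining weights to sum to 1.
// This is an assumed strategy, not documented behavior of the skill.
function renormalize(weights: Weights, disabled: string[]): Weights {
  const kept = Object.entries(weights).filter(([name]) => !disabled.includes(name));
  const total = kept.reduce((sum, [, weight]) => sum + weight, 0);
  if (total <= 0) {
    throw new Error("at least one dimension must remain enabled");
  }
  const result: Weights = {};
  for (const [name, weight] of kept) {
    result[name] = weight / total;
  }
  return result;
}

const v2Weights: Weights = {
  clarity: 0.18,
  testability: 0.22,
  completeness: 0.18,
  feasibility: 0.13,
  maintainability: 0.09,
  edge_cases: 0.09,
  risk: 0.11,
};

const withoutRisk = renormalize(v2Weights, ["risk"]);
for (const [name, weight] of Object.entries(withoutRisk)) {
  console.log(name, weight.toFixed(2));
}
```

Under this assumption the rescaled weights round to 0.20, 0.25, 0.20, 0.15, 0.10, and 0.10, which match the "was" values noted next to each weight in the dimension list.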
v1.0 (6 dimensions):
v2.0 (7 dimensions, NEW: Risk):
Backward Compatibility:
--pre mode before implementation
What quality-judge v2.0 CAN'T do:
What quality-judge v2.0 CAN do:
increment-quality-judge v2.0 adds comprehensive risk assessment and quality gate decisions:
✅ Risk assessment (BMAD P×I scoring, 0-10 scale)
✅ Quality gate decisions (PASS/CONCERNS/FAIL with thresholds)
✅ 7 dimensions (added "Risk" to existing 6)
✅ NFR checking (performance, security, scalability)
✅ Enhanced output (blockers, concerns, recommendations)
✅ Chain-of-thought (LLM-as-Judge 2025 best practices)
✅ Backward compatible (can disable risk assessment)
Use it when: You want comprehensive quality assessment with risk scoring and formal gate decisions before implementation or release.
Skip it when: Quick iteration, tight token budget, or simple features where rule-based validation suffices.
Version: 2.0.0
Since: v0.8.0
Related: /sw:qa command, QAOrchestrator agent (v0.9.0)