Orchestrates parallel judge agent execution to evaluate implementation plans (16 judges), code artifacts (11 judges), or PRDs (4 judges); aggregates CaseScore results into validated JSON files.
npx claudepluginhub closedloop-ai/claude-plugins --plugin judges

This skill uses the workspace's default tool permissions.
Execute specialized judge agents in parallel to evaluate implementation plan quality (16 judges), code quality (11 judges), or PRD quality (4 judges). Aggregates results into `$CLOSEDLOOP_WORKDIR/plan-judges.json` (plan), `$CLOSEDLOOP_WORKDIR/code-judges.json` (code), or `$CLOSEDLOOP_WORKDIR/prd-judges.json` (prd) with validated output format.
- `--workdir`: Path to the working directory containing judge artifacts (optional)
  - Resolution precedence: `--workdir` argument → `$CLOSEDLOOP_WORKDIR` environment variable → `.closedloop-ai/judges` (default, relative to the current working directory)
  - All outputs (`plan-judges.json`, `code-judges.json`, `prd-judges.json`, `judge-input.json`, `perf.jsonl`, etc.) are written to this resolved directory
- `--artifact-type`: Artifact category to evaluate (`plan` | `code` | `prd`), default: `plan`
The judge input contract (`judge-input.json`) is maintained in:
skills/run-judges/references/judge-input-contract.md (resolve to an absolute path at runtime via Glob)
This keeps orchestration flow readable while preserving a single source of truth for contract fields and semantics.
You are orchestrating quality evaluation for a ClosedLoop artifact (implementation plan, code, or PRD). Your responsibilities:
For plan artifacts (default):
- Build `judge-input.json` with the plan task/context mapping
- Aggregate results into `$CLOSEDLOOP_WORKDIR/plan-judges.json`

For code artifacts (`--artifact-type code`):
- Build `judge-input.json` with the code task/context mapping
- Aggregate results into `$CLOSEDLOOP_WORKDIR/code-judges.json`

For PRD artifacts (`--artifact-type prd`):
- Verify `$CLOSEDLOOP_WORKDIR/prd.md` exists (graceful exit if missing)
- Build `judge-input.json` with `evaluation_type: "prd"` and `primary_artifact` pointing to `prd.md`
- Aggregate results into `$CLOSEDLOOP_WORKDIR/prd-judges.json`

Success criteria:
The run-judges skill supports per-artifact-type threshold customization via JSON configuration files. This allows you to adjust evaluation strictness for different artifact types (e.g., applying a lower threshold for test-judge when evaluating code vs plan).
Threshold overrides are defined in a JSON file with the following structure:
{
"overrides": {
"artifact_type:judge_name": <threshold_float>
}
}
Where:
- Keys use the form `"artifact_type:judge_name"` (e.g., `"code:test-judge"`, `"plan:technical-accuracy-judge"`)
- Values are threshold floats in `[0.0, 1.0]`

Example configuration:
{
"overrides": {
"code:test-judge": 0.75,
"plan:technical-accuracy-judge": 0.85
}
}
The skill checks the following locations in order, using the first valid configuration found:
1. Run-specific overrides (highest precedence): `$CLOSEDLOOP_WORKDIR/.claude/settings/threshold-overrides.json`
2. Repo-level defaults (fallback): `<project-root>/.claude/settings/threshold-overrides.json`
3. Hardcoded defaults (graceful degradation):
The following default overrides apply when evaluating code artifacts:
| Judge | Code Threshold | Plan Threshold | Rationale |
|---|---|---|---|
| test-judge | 0.75 | 0.8 | Code may have tests written separately from implementation; the lower threshold accounts for incremental test development |
All other judges use the same threshold (typically 0.8) across artifact types.
When loading threshold overrides, the skill applies the following validation rules:
Schema Validation:
- The top-level JSON object must contain an `"overrides"` key
- Override keys must use the `artifact_type:judge_name` format
- Threshold values must be floats in `[0.0, 1.0]`
- Keys must reference known artifact types (`plan`, `code`, `prd`) and judge names

Error Behavior:
- On any validation failure, log the following and fall back to defaults:
Warning: Invalid threshold-overrides.json, skipping overrides: {error}
Error recovery ensures the skill always completes judge execution, even if threshold configuration is incorrect.
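The lookup and validation rules above can be sketched in Python (a hypothetical sketch: file locations, value ranges, and the warning message follow this document, but the skill's actual implementation may differ):

```python
import json
import os

VALID_TYPES = {"plan", "code", "prd"}

def load_threshold_overrides(workdir: str, project_root: str) -> dict:
    """Return the first valid overrides mapping, or {} (graceful degradation)."""
    candidates = [
        os.path.join(workdir, ".claude", "settings", "threshold-overrides.json"),
        os.path.join(project_root, ".claude", "settings", "threshold-overrides.json"),
    ]
    for path in candidates:
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                data = json.load(f)
            overrides = data["overrides"]
            for key, value in overrides.items():
                artifact_type, _, judge = key.partition(":")
                if artifact_type not in VALID_TYPES or not judge:
                    raise ValueError(f"bad key: {key}")
                if not isinstance(value, float) or not 0.0 <= value <= 1.0:
                    raise ValueError(f"bad threshold for {key}: {value}")
            return overrides
        except (KeyError, ValueError, json.JSONDecodeError) as err:
            # Invalid config never aborts judge execution
            print(f"Warning: Invalid threshold-overrides.json, skipping overrides: {err}")
    return {}
```

A run-specific file wins outright; the repo-level file is consulted only when the run-specific one is absent or invalid.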
When executing judges:
- Assign each judge's `final_status` (pass/fail) based on metric scores
- When the artifact type is code, apply the code-specific threshold overrides described above
You MUST emit a pipeline_step event to $CLOSEDLOOP_WORKDIR/perf.jsonl at the end of each phase below. This keeps perf telemetry in the canonical schema and adds nested metadata for judge/sub-agent work.
Context: CLOSEDLOOP_WORKDIR, CLOSEDLOOP_RUN_ID, and CLOSEDLOOP_ITERATION are set by the run-loop. CLOSEDLOOP_PARENT_STEP and CLOSEDLOOP_PARENT_STEP_NAME are set as env vars on the claude invocation by run-loop; they are inherited by all Bash tool calls — no sourcing needed.
Use sub_step as numeric phase order and optional sub_step_name to capture the judge/sub-agent name when applicable (for batch-level phases where many judges run, use the batch label).
Sub-step numbering:
| Artifact | sub_step | sub_step_name |
|---|---|---|
| plan | 0 | context_manager |
| plan | 1–4 | batch_1 … batch_4 |
| plan | 5 | aggregate |
| plan | 6 | validate |
| code | 0 | context_manager |
| code | 1–3 | batch_1 … batch_3 |
| code | 4 | aggregate |
| code | 5 | validate |
| prd | 0 | context_prep (skipped — prd mode does not use context-manager-for-judges) |
| prd | 1 | prd_judges |
| prd | 2 | aggregate |
| prd | 3 | validate |
Start of phase (run Bash once at the beginning of each phase): Set the two sub-step variables at the top for the current phase, then run the block. It writes start time to a temp file so the end-of-phase Bash can compute duration. CLOSEDLOOP_PARENT_STEP and CLOSEDLOOP_PARENT_STEP_NAME are already in the environment (set by run-loop on the claude invocation).
# Set these two values for the current phase:
SUB_STEP_NUM=0
SUB_STEP_LABEL="context_manager" # context_manager | batch_1 … | aggregate | validate
mkdir -p "$CLOSEDLOOP_WORKDIR/.closedloop"
{
echo "SUB_STEP=${SUB_STEP_NUM}"
echo "SUB_STEP_NAME=${SUB_STEP_LABEL}"
echo "PARENT_STEP=${CLOSEDLOOP_PARENT_STEP:-0}"
echo "PARENT_STEP_NAME=${CLOSEDLOOP_PARENT_STEP_NAME:-unknown}"
echo "STARTED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "START_EPOCH=$(date +%s)"
} > "$CLOSEDLOOP_WORKDIR/.closedloop/perf-substep-start.env"
End of phase (run Bash once at the end of each phase, after the phase work is done): Read start time, compute duration, append one line to perf.jsonl, then remove the temp file.
source "$CLOSEDLOOP_WORKDIR/.closedloop/perf-substep-start.env"
END_EPOCH=$(date +%s)
ENDED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
DURATION=$((END_EPOCH - START_EPOCH))
jq -n -c \
--arg event "pipeline_step" \
--arg run_id "${CLOSEDLOOP_RUN_ID:-unknown}" \
--argjson iteration "${CLOSEDLOOP_ITERATION:-0}" \
--argjson step "$PARENT_STEP" \
--arg step_name "$PARENT_STEP_NAME" \
--argjson sub_step "$SUB_STEP" \
--arg sub_step_name "$SUB_STEP_NAME" \
--arg started_at "$STARTED_AT" \
--arg ended_at "$ENDED_AT" \
--argjson duration_s "$DURATION" \
--argjson exit_code 0 \
--argjson skipped false \
'{event:$event,run_id:$run_id,iteration:$iteration,step:$step,step_name:$step_name,sub_step:$sub_step,sub_step_name:$sub_step_name,started_at:$started_at,ended_at:$ended_at,duration_s:$duration_s,exit_code:$exit_code,skipped:$skipped}' >> "$CLOSEDLOOP_WORKDIR/perf.jsonl"
rm -f "$CLOSEDLOOP_WORKDIR/.closedloop/perf-substep-start.env"
Order of operations per phase: Run the "start of phase" Bash first (set SUB_STEP_NUM and SUB_STEP_LABEL at the top, then run the block), then perform the phase work, then run the "end of phase" Bash.
Before any other step, resolve the working directory and export it as CLOSEDLOOP_WORKDIR:
# Resolve working directory (precedence: --workdir arg > env var > default)
if [ -n "$ARG_WORKDIR" ]; then
WORKDIR="$ARG_WORKDIR"
elif [ -n "$CLOSEDLOOP_WORKDIR" ]; then
WORKDIR="$CLOSEDLOOP_WORKDIR"
else
WORKDIR="$(pwd)/.closedloop-ai/judges"
fi
mkdir -p "$WORKDIR"
export CLOSEDLOOP_WORKDIR="$WORKDIR"
Where $ARG_WORKDIR is the value passed via --workdir in the invocation prompt. All subsequent references to $CLOSEDLOOP_WORKDIR use this resolved value.
Before any judge execution, ensure a snapshot of judge agent definitions exists in $CLOSEDLOOP_WORKDIR/agents-snapshot/. This preserves the exact agent versions used for each evaluation run.
Action: Run the snapshot script via Bash:
bash "${CLAUDE_PLUGIN_ROOT}/skills/run-judges/scripts/ensure_agents_snapshot.sh" "$CLOSEDLOOP_WORKDIR"
The script is idempotent — it skips if manifest.json already exists.
Error handling: If the script fails or is not found, log a warning and continue — snapshot failure must not block judge execution.
Before any prerequisite checks or judge launches:
1. Glob with pattern `**/skills/run-judges/references/judge-input-contract.md`
2. Read the resolved `judge-input-contract.md` file in full.
3. Apply the contract when building `$CLOSEDLOOP_WORKDIR/judge-input.json`.

Performance: At the start of this phase run the "start of phase" Bash with `SUB_STEP_NUM=0` and `SUB_STEP_LABEL=context_manager` for both plan and code modes. At the end of the phase run the "end of phase" Bash.
Before starting, verify required inputs exist:
For plan artifacts (default):
# Validate input files exist
if [ ! -f "$CLOSEDLOOP_WORKDIR/prd.md" ]; then
echo "WARNING: $CLOSEDLOOP_WORKDIR/prd.md not found. Skipping judges."
exit 0 # Graceful skip - do not fail workflow
fi
if [ ! -f "$CLOSEDLOOP_WORKDIR/plan.json" ]; then
echo "WARNING: $CLOSEDLOOP_WORKDIR/plan.json not found. Skipping judges."
exit 0
fi
Investigation log resolution (plan mode):
After validating prd.md and plan.json, resolve supporting context for plan judges:
1. Use existing file first
   - If `$CLOSEDLOOP_WORKDIR/investigation-log.md` exists, use it as-is.
2. Check `@code:pre-explorer` availability before invoking
   - Verify `@code:pre-explorer` is resolvable in the active Claude/plugin environment before issuing a `Task()` call targeting `@code:pre-explorer`.
3. If available, invoke pre-explorer
   - Launch `@code:pre-explorer` with `WORKDIR=$CLOSEDLOOP_WORKDIR` to generate missing pre-exploration artifacts.
   - Verify `$CLOSEDLOOP_WORKDIR/investigation-log.md` exists after completion.
4. If unavailable or invocation failed, run internal fallback
   - Generate `investigation-log.md` with a lightweight local-only investigation.
   - Read `prd.md` and extract top entities/actions as search seeds.
   - Run Glob/Grep against the local repository for likely implementation files.
   - Record results under Files Discovered / Key Findings and Requirements Mapping.
   - Use the sections `## Search Strategy`, `## Files Discovered`, `## Key Findings`, `## Requirements Mapping`, and `## Uncertainties`.
5. Never block plan context preparation on investigation context
Prepare plan-context.json via context-manager-for-judges:
- Launch `@judges:context-manager-for-judges` with `artifact_type=plan`.
- Verify `$CLOSEDLOOP_WORKDIR/plan-context.json` exists.
- If context preparation fails, fall back to raw `plan.json` + `prd.md`.

Plan-mode source-of-truth policy:
- `plan-context.json` is primary and required.
- In compatibility fallback, raw `plan.json` + `prd.md` may be used for this run only.

Build plan-mode judge-input.json:
- Set `evaluation_type = plan`.
- Set `task` to the plan quality evaluation objective (16-plan-judge workflow).
- Point `primary_artifact` to `plan-context.json` in normal mode.
- In fallback mode, point `primary_artifact` to `plan.json` and include `prd.md` as supporting.
- Include `investigation-log.md` as a supporting artifact when available.
- Set `source_of_truth` ordering from primary to secondary artifacts.

For code artifacts (`--artifact-type code`):
# Resolve investigation context for code judges (best effort)
if [ ! -f "$CLOSEDLOOP_WORKDIR/investigation-log.md" ]; then
echo "INFO: investigation-log.md missing. Attempting best-effort generation via @code:pre-explorer..."
# Launch @code:pre-explorer with WORKDIR=$CLOSEDLOOP_WORKDIR
# If unavailable/fails, continue with warning (non-blocking for code judges)
fi
# Launch context-manager-for-judges agent to prepare compressed context
# This agent reads code artifacts (git diff, changed-files.json, etc.)
# and produces code-context.json with token-budgeted compression
# investigation-log.md is optional secondary context for code judging
if [ ! -f "$CLOSEDLOOP_WORKDIR/investigation-log.md" ]; then
echo "WARNING: investigation-log.md unavailable. Continuing code judges with code-context.json only."
fi
# Verify code-context.json exists after context manager completes
if [ ! -f "$CLOSEDLOOP_WORKDIR/code-context.json" ]; then
echo "ERROR: Context preparation failed - code-context.json not found"
# Abort with error CaseScore for all judges
# Generate error report with final_status=3, justification="Context preparation failed"
exit 1
fi
# Build code-mode judge-input.json
# - evaluation_type: "code"
# - task: code quality evaluation objective (11-code-judge workflow)
# - primary_artifact: code-context.json
# - supporting_artifacts: investigation-log.md (optional), plus any other run artifacts
# - source_of_truth: ["code_context", ...]
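For illustration, the commented steps above might produce an envelope like the following (a hypothetical sketch: the `task` wording and `metadata` contents here are invented, and the authoritative field semantics live in judge-input-contract.md):

```python
import json
import os

# Resolved workdir; falls back to the documented default when the env var is unset
workdir = os.environ.get("CLOSEDLOOP_WORKDIR", ".closedloop-ai/judges")

envelope = {
    "evaluation_type": "code",
    "task": "Evaluate code quality across the 11-code-judge workflow",  # illustrative wording
    "primary_artifact": os.path.join(workdir, "code-context.json"),
    "supporting_artifacts": [os.path.join(workdir, "investigation-log.md")],
    "source_of_truth": ["code_context"],
    "fallback_mode": {"active": False},
    "metadata": {"run_id": os.environ.get("CLOSEDLOOP_RUN_ID", "unknown")},  # illustrative
}

os.makedirs(workdir, exist_ok=True)
with open(os.path.join(workdir, "judge-input.json"), "w") as f:
    json.dump(envelope, f, indent=2)
```

The seven top-level keys match the required envelope fields listed in the troubleshooting table.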
For PRD artifacts (--artifact-type prd):
PRD mode does NOT use context-manager-for-judges. Context preparation is lightweight: verify the PRD document exists, then build judge-input.json directly from it.
# PRD mode context prep: check prd.md exists
if [ ! -f "$CLOSEDLOOP_WORKDIR/prd.md" ]; then
echo "WARNING: $CLOSEDLOOP_WORKDIR/prd.md not found. Skipping PRD judges."
exit 0 # Graceful exit — do not fail parent workflow
fi
# Build prd-mode judge-input.json
# - evaluation_type: "prd"
# - task: PRD quality evaluation objective (prd-auditor + 3 critics)
# - primary_artifact: $CLOSEDLOOP_WORKDIR/prd.md
# - supporting_artifacts: [] (none required)
# - source_of_truth: ["prd"]
PRD context prep notes:
- A missing `prd.md` results in a WARNING and graceful exit (code 0), not an error
- `judge-input.json` is built directly, with `primary_artifact` pointing to `$CLOSEDLOOP_WORKDIR/prd.md`

If required files are missing, emit a WARNING and exit gracefully rather than failing the parent workflow.
The run-judges skill supports three artifact types with different judge configurations:
- Plan: writes `plan-judges.json`, report_id `{RUN_ID}-plan-judges`, validated with `--category plan` (16 judges expected)
- Code: writes `code-judges.json`, report_id `{RUN_ID}-code-judges`, validated with `--category code` (11 judges expected)

Code Judge Batches:

Batch 1: Core Principles (4 judges)
- judges:dry-judge
- judges:ssot-judge
- judges:kiss-judge
- judges:code-organization-judge

Batch 2: Best Practices + SOLID Principles (4 judges)
- judges:custom-best-practices-judge
- judges:readability-judge
- judges:solid-isp-dip-judge
- judges:solid-liskov-substitution-judge

Batch 3: Technical Quality + Testing (3 judges)
- judges:solid-open-closed-judge
- judges:technical-accuracy-judge
- judges:test-judge

- PRD: writes `prd-judges.json`, report_id `{RUN_ID}-prd-judges`, validated with `--category prd` (4 judges expected), primary artifact `$CLOSEDLOOP_WORKDIR/prd.md`

PRD Execution:

Batch 1: All PRD Judges (sub_step=1)
- judges:prd-auditor — structural completeness audit of the PRD
- judges:prd-dependency-judge — evaluates dependency clarity and completeness
- judges:prd-testability-judge — evaluates requirement testability
- judges:prd-scope-judge — evaluates scope definition and boundary clarity

Performance: For each batch/phase, run "start of phase" Bash before launching the batch and "end of phase" Bash after the batch completes. Plan: batch_1=sub_step 1, batch_2=sub_step 2, batch_3=sub_step 3, batch_4=sub_step 4. Code: batch_1=sub_step 1, batch_2=sub_step 2, batch_3=sub_step 3. PRD: prd_judges=sub_step 1.
Constraint: The Task tool supports maximum 4 concurrent agents per batch.
Action: Launch judges in sequential batches based on artifact type.
<judge_batches>
Batch 1: Core Principles (DRY/SSOT/KISS + Organization)
| Agent Type | Evaluates |
|---|---|
| judges:dry-judge | Don't Repeat Yourself violations |
| judges:ssot-judge | Single Source of Truth violations |
| judges:kiss-judge | Keep It Simple violations |
| judges:code-organization-judge | File and folder structure organization |
Batch 2: Best Practices + Response Quality
| Agent Type | Evaluates |
|---|---|
| judges:custom-best-practices-judge | Adherence to custom best practices documents |
| judges:goal-alignment-judge | Alignment with stated health goals |
| judges:readability-judge | Plan readability, clarity, structure, template adherence |
| judges:verbosity-judge | Verbosity calibration to problem complexity |
Batch 3: SOLID Principles
| Agent Type | Evaluates |
|---|---|
| judges:solid-isp-dip-judge | Interface Segregation & Dependency Inversion Principles |
| judges:solid-liskov-substitution-judge | Liskov Substitution Principle adherence |
| judges:solid-open-closed-judge | Open/Closed Principle adherence |
| judges:technical-accuracy-judge | Technical accuracy (API usage, algorithms) |
Batch 4: Plan Grounding + Testing
| Agent Type | Evaluates |
|---|---|
| judges:test-judge | Test coverage, assertions, structure, best practices |
| judges:brownfield-accuracy-judge | Reuse vs reimplementation, integration-point accuracy, scope accuracy against investigation findings |
| judges:codebase-grounding-judge | File-path/module-reference accuracy and existing-code awareness grounded in investigation findings |
| judges:convention-adherence-judge | Alignment with established naming, structural, and tooling conventions in the codebase |
Batch 1: All PRD Judges (sub_step=1)
| Agent Type | Evaluates |
|---|---|
| judges:prd-auditor | Structural completeness, section coverage, clarity |
| judges:prd-dependency-judge | Dependency clarity and completeness |
| judges:prd-testability-judge | Requirement testability and measurability |
| judges:prd-scope-judge | Scope definition and boundary clarity |
</judge_batches>
<prompt_template>
Before invoking each judge, prepend the common and artifact-specific preambles:
Locate preamble files:
- `skills/artifact-type-tailored-context/preambles/common_input_preamble.md`
- `skills/artifact-type-tailored-context/preambles/{artifact_type}_preamble.md`
- Resolve via Glob pattern `**/artifact-type-tailored-context/preambles/*.md`

Read preamble content:
- Read `common_input_preamble.md`
- Read `{artifact_type}_preamble.md`

Concatenate:
- `common_input_preamble + "\n\n---\n\n" + artifact_preamble + "\n\n---\n\n" + judge_prompt`
- `common_input_preamble.md` is the only runtime source of judge input-loading contract text; judge-specific agent files should not duplicate that contract.

Pass to judge: Use the concatenated prompt as the judge's full prompt.

If either preamble file is missing:
- Emit an error CaseScore with `final_status=3`, `justification="Preamble file not found: {path}"`

For plan artifacts:
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as authoritative.
If `fallback_mode.active=true`, use fallback artifacts specified in the envelope.
For code artifacts:
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as authoritative.
Apply your {judge_name} criteria to assess code quality.
For PRD artifacts:
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` ($CLOSEDLOOP_WORKDIR/prd.md) as the authoritative PRD document.
Apply your {judge_name} criteria to assess PRD quality.
</prompt_template>
<expected_output> Each judge returns a CaseScore JSON object:
{
"type": "case_score",
"case_id": "dry-judge",
"final_status": 1,
"metrics": [
{
"metric_name": "dry_score",
"threshold": 0.8,
"score": 0.85,
"justification": "Plan follows DRY principles..."
}
]
}
Status Code Semantics:
| Code | Meaning | When to Use |
|---|---|---|
| 1 | Pass | Score meets or exceeds threshold |
| 2 | Fail | Score below threshold |
| 3 | Error | Judge execution failed |
</expected_output>
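A minimal sketch of how a judge could derive final_status from its metrics under these semantics (judges may apply more nuanced logic; treating an empty metrics list as an error is an assumption here):

```python
def derive_final_status(metrics: list[dict]) -> int:
    """1 = pass, 2 = fail, 3 = error (nothing was scored)."""
    if not metrics:
        return 3
    for m in metrics:
        threshold = m.get("threshold")
        # A metric without a threshold is informational and cannot fail the case
        if threshold is not None and m["score"] < threshold:
            return 2
    return 1
```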
<error_handling>
CRITICAL REQUIREMENT: If a judge Task call fails, you MUST construct an error CaseScore.
Error CaseScore Template:
{
"type": "case_score",
"case_id": "{judge-name}",
"final_status": 3,
"metrics": [
{
"metric_name": "{metric}_score",
"threshold": 0.8,
"score": 0.0,
"justification": "Judge execution failed: {error message}"
}
]
}
Continue-on-failure semantics: a single judge failure must not abort the run; record an error CaseScore for the failed judge and continue executing the remaining judges so the report stays complete.
</error_handling>
Performance: Run "start of phase" with sub_step 5 (plan), 4 (code), or 2 (prd), sub_step_name=aggregate. Emit 'end of phase' after the aggregation step regardless of file write outcome.
Task: Collect all CaseScore outputs and structure them into an EvaluationReport.
<output_structure>
Output file logic:
import os

if artifact_type == 'code':
    report_filename = 'code-judges.json'
    report_id = f'{RUN_ID}-code-judges'
elif artifact_type == 'prd':
    report_filename = 'prd-judges.json'
    report_id = f'{RUN_ID}-prd-judges'
else:
    report_filename = 'plan-judges.json'
    report_id = f'{RUN_ID}-plan-judges'

output_path = os.path.join(os.environ['CLOSEDLOOP_WORKDIR'], report_filename)
Plan artifact report structure (plan-judges.json):
{
"report_id": "{RUN_ID}-plan-judges",
"timestamp": "2024-02-03T15:45:30Z",
"stats": [
{ /* CaseScore from dry-judge */ },
{ /* CaseScore from ssot-judge */ },
{ /* CaseScore from kiss-judge */ },
{ /* CaseScore from code-organization-judge */ },
{ /* CaseScore from custom-best-practices-judge */ },
{ /* CaseScore from goal-alignment-judge */ },
{ /* CaseScore from readability-judge */ },
{ /* CaseScore from verbosity-judge */ },
{ /* CaseScore from solid-isp-dip-judge */ },
{ /* CaseScore from solid-liskov-substitution-judge */ },
{ /* CaseScore from solid-open-closed-judge */ },
{ /* CaseScore from technical-accuracy-judge */ },
{ /* CaseScore from test-judge */ },
{ /* CaseScore from brownfield-accuracy-judge */ },
{ /* CaseScore from codebase-grounding-judge */ },
{ /* CaseScore from convention-adherence-judge */ }
]
}
Code artifact report structure (code-judges.json):
{
"report_id": "{RUN_ID}-code-judges",
"timestamp": "2024-02-03T15:45:30Z",
"stats": [
{ /* CaseScore from dry-judge */ },
{ /* CaseScore from ssot-judge */ },
{ /* CaseScore from kiss-judge */ },
{ /* CaseScore from code-organization-judge */ },
{ /* CaseScore from custom-best-practices-judge */ },
{ /* CaseScore from readability-judge */ },
{ /* CaseScore from solid-isp-dip-judge */ },
{ /* CaseScore from solid-liskov-substitution-judge */ },
{ /* CaseScore from solid-open-closed-judge */ },
{ /* CaseScore from technical-accuracy-judge */ },
{ /* CaseScore from test-judge */ }
]
}
PRD artifact report structure (prd-judges.json):
{
"report_id": "{RUN_ID}-prd-judges",
"timestamp": "2024-02-03T15:45:30Z",
"stats": [
{ /* CaseScore from prd-auditor */ },
{ /* CaseScore from prd-dependency-judge */ },
{ /* CaseScore from prd-testability-judge */ },
{ /* CaseScore from prd-scope-judge */ }
]
}
Field requirements:
| Field | Format | How to Derive |
|---|---|---|
| report_id | {RUN_ID}-plan-judges, {RUN_ID}-code-judges, or {RUN_ID}-prd-judges | Extract RUN_ID from the $CLOSEDLOOP_WORKDIR directory name, append the suffix for the artifact type |
| timestamp | ISO 8601 | Generate with date -u +%Y-%m-%dT%H:%M:%SZ |
| stats | Array[CaseScore] | 16 CaseScore objects for plan, 11 for code, 4 for prd (one per judge) |
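The report_id and timestamp derivations can be sketched as follows (assuming the workdir's basename is the RUN_ID, per the table above; the helper name is hypothetical):

```python
import os
from datetime import datetime, timezone

def build_report_header(workdir: str, artifact_type: str) -> dict:
    # RUN_ID is taken from the final path component of the workdir
    run_id = os.path.basename(os.path.normpath(workdir))
    suffix = {"plan": "plan-judges", "code": "code-judges", "prd": "prd-judges"}[artifact_type]
    return {
        "report_id": f"{run_id}-{suffix}",
        # Same format as `date -u +%Y-%m-%dT%H:%M:%SZ`
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```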
</output_structure>
Performance: Run "start of phase" with sub_step 6 (plan), 5 (code), or 3 (prd), sub_step_name=validate. Emit 'end of phase' after each validation attempt regardless of exit code, then apply failure recovery logic.
CRITICAL: You MUST run the validation script after writing the judge report. Do not consider the task complete until validation passes.
<validation_workflow>
Step 3.1: Locate the Validation Script
The script is in this skill's scripts/ directory:
SCRIPT_PATH="scripts/validate_judge_report.py"
Step 3.2: Ensure uv is Installed
if ! command -v uv &> /dev/null; then
# Install uv — alternatives: brew install uv, pip install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
fi
Step 3.3: Run Validation
# CRITICAL: Run from the script's directory so uv can find inline dependencies.
# Resolve the directory and basename first: after cd, the original relative
# SCRIPT_PATH would no longer be valid.
SCRIPT_DIR="$(cd "$(dirname "$SCRIPT_PATH")" && pwd)"
SCRIPT_NAME="$(basename "$SCRIPT_PATH")"
cd "$SCRIPT_DIR"
# Determine category based on artifact type
CATEGORY="plan" # default
if [ "$ARTIFACT_TYPE" = "code" ]; then
CATEGORY="code"
elif [ "$ARTIFACT_TYPE" = "prd" ]; then
CATEGORY="prd"
fi
# Run validation with appropriate category
uv run "$SCRIPT_NAME" --workdir "$CLOSEDLOOP_WORKDIR" --category "$CATEGORY"
Argument requirements:
- `--workdir` must be the absolute path to `$CLOSEDLOOP_WORKDIR`
- `--category` must be `plan` (16 judges), `code` (11 judges), or `prd` (4 judges)
- The workdir is where `plan-judges.json`, `code-judges.json`, or `prd-judges.json` is located

</validation_workflow>
<validation_checks>
The script validates using strict Pydantic models:
| Check | Requirement |
|---|---|
| JSON syntax | Valid JSON format |
| Required fields | report_id, timestamp, stats array |
| Judge coverage | All expected judges present (16 for plan, 11 for code, 4 for prd) |
| Status values | final_status ∈ {1, 2, 3} |
| Metric completeness | Each judge has ≥1 metric |
| Report ID format | Ends with '-judges' (plan), '-code-judges' (code), or '-prd-judges' (prd) |
Expected judge case_ids for plan artifacts (16 total):
brownfield-accuracy-judge
code-organization-judge
codebase-grounding-judge
convention-adherence-judge
custom-best-practices-judge
dry-judge
goal-alignment-judge
kiss-judge
readability-judge
solid-isp-dip-judge
solid-liskov-substitution-judge
solid-open-closed-judge
ssot-judge
technical-accuracy-judge
test-judge
verbosity-judge
Expected judge case_ids for code artifacts (11 total):
code-organization-judge
custom-best-practices-judge
dry-judge
kiss-judge
readability-judge
solid-isp-dip-judge
solid-liskov-substitution-judge
solid-open-closed-judge
ssot-judge
technical-accuracy-judge
test-judge
Note: Code artifacts exclude the five plan-only judges: goal-alignment-judge, verbosity-judge, brownfield-accuracy-judge, codebase-grounding-judge, convention-adherence-judge
Expected judge case_ids for PRD artifacts (4 total):
prd-auditor
prd-dependency-judge
prd-testability-judge
prd-scope-judge
Note: All 4 PRD judges run in a single parallel batch.
</validation_checks>
| Code | Meaning | Action |
|---|---|---|
| 0 | Valid | Task complete ✓ |
| 1 | Invalid | Read the error, fix the report JSON, re-validate |
<failure_recovery>
Follow this sequence:
1. Read the validation error output.
2. Fix the report JSON in place (add error CaseScores for missing judges; correct invalid fields).
3. Re-run the validation script.
4. Repeat until the exit code is 0.
</failure_recovery>
<pydantic_schema>
The validation script uses these strict Pydantic models:
from typing import List, Optional

from pydantic import BaseModel

class MetricStatistics(BaseModel):
"""A single metric evaluation result."""
metric_name: str
threshold: Optional[float] = None
score: float
justification: str
class CaseScore(BaseModel):
"""Score for a single judge evaluation."""
type: Optional[str] = "case_score"
case_id: str
final_status: int # 1=pass, 2=fail, 3=error
metrics: List[MetricStatistics]
class EvaluationReport(BaseModel):
"""Top-level report containing all judge evaluations."""
report_id: str
timestamp: str
stats: List[CaseScore]
Model constraints:
- `ConfigDict(strict=True)` enforces exact type matching
- A `final_status` validator rejects values outside {1, 2, 3}

</pydantic_schema>
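For quick local checks, the same rules can be approximated without Pydantic (a dependency-free sketch, not the actual validation script; the per-category report_id suffixes follow the troubleshooting table):

```python
EXPECTED = {"plan": 16, "code": 11, "prd": 4}
SUFFIX = {"plan": "-plan-judges", "code": "-code-judges", "prd": "-prd-judges"}

def check_report(report: dict, category: str) -> list[str]:
    """Return a list of validation errors; an empty list means the report is valid."""
    errors = []
    for field in ("report_id", "timestamp", "stats"):
        if field not in report:
            errors.append(f"missing field: {field}")
    stats = report.get("stats", [])
    if len(stats) != EXPECTED[category]:
        errors.append(f"expected {EXPECTED[category]} judges, got {len(stats)}")
    for case in stats:
        if case.get("final_status") not in (1, 2, 3):
            errors.append(f"{case.get('case_id')}: final_status must be 1, 2, or 3")
        if not case.get("metrics"):
            errors.append(f"{case.get('case_id')}: has no metrics")
    if not str(report.get("report_id", "")).endswith(SUFFIX[category]):
        errors.append(f"report_id should end with '{SUFFIX[category]}'")
    return errors
```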
<completion_criteria>
Before marking this task complete, verify:
For all artifact types:
- `agents-snapshot/manifest.json` exists in `$CLOSEDLOOP_WORKDIR` (created if missing, skipped if present)

For plan artifacts (default):
- Context manager launched with `artifact_type=plan`
- `plan-context.json` exists, or compatibility mode explicitly activated
- `judge-input.json` exists with required fields
- `investigation-log.md` reused, generated via pre-explorer, or best-effort generated internally
- `plan-judges.json` written to `$CLOSEDLOOP_WORKDIR`
- Validation passed with `--category plan`

For code artifacts (`--artifact-type code`):
- `code-context.json` exists in `$CLOSEDLOOP_WORKDIR`
- `judge-input.json` exists with required fields
- `investigation-log.md` reused or generated best-effort; a missing file does not block code judging
- `code-judges.json` written to `$CLOSEDLOOP_WORKDIR`
- Validation passed with `--category code`

For PRD artifacts (`--artifact-type prd`):
- `$CLOSEDLOOP_WORKDIR/prd.md` found, or graceful exit with WARNING (code 0)
- `judge-input.json` written with `evaluation_type="prd"` and `primary_artifact=$CLOSEDLOOP_WORKDIR/prd.md`
- `prd-judges.json` written to `$CLOSEDLOOP_WORKDIR`
- Validation passed with `--category prd` (sub_step=3)

</completion_criteria>
| Error Message | Root Cause | Solution |
|---|---|---|
| "Report file does not exist" | File not written to correct location | Verify $CLOSEDLOOP_WORKDIR is set; check write path matches artifact type (plan-judges.json, code-judges.json, or prd-judges.json) |
| "Invalid JSON" | Syntax error in output file | Run python3 -m json.tool "$CLOSEDLOOP_WORKDIR/{plan,code,prd}-judges.json" to identify syntax error |
| "Missing expected judges" | Incomplete batch execution | Verify all batches launched (4 for plan, 3 for code, 1 for prd); check error CaseScores for failures; plan expects 16 judges, code expects 11, prd expects 4 |
| "final_status must be 1, 2, or 3" | Invalid status code | Use only: 1 (pass), 2 (fail), 3 (error) |
| "report_id should end with '-plan-judges'" | Incorrect ID format for plan | Use pattern: {RUN_ID}-plan-judges for plan artifacts |
| "report_id should end with '-code-judges'" | Incorrect ID format for code | Use pattern: {RUN_ID}-code-judges for code artifacts |
| "Judge {name} has no metrics" | Empty metrics array | Each CaseScore must have ≥1 MetricStatistics entry |
| "Context preparation failed" | context-manager-for-judges failed | Check context-manager agent output; verify artifact files exist |
| "judge-input.json missing" | Orchestrator did not generate envelope | Build $CLOSEDLOOP_WORKDIR/judge-input.json before launching judges |
| "judge-input schema invalid" | Missing required envelope fields | Ensure required fields: evaluation_type, task, primary_artifact, supporting_artifacts, source_of_truth, fallback_mode, metadata |
| "plan-context.json not found" | plan context manager did not produce output | Run @judges:context-manager-for-judges with artifact_type=plan; if still missing, activate one-run compatibility fallback to plan.json + prd.md |
| "Preamble file not found" | Missing common or artifact preamble .md file | Verify both skills/artifact-type-tailored-context/preambles/common_input_preamble.md and skills/artifact-type-tailored-context/preambles/{artifact_type}_preamble.md exist |
| "pre-explorer unavailable" | @code:pre-explorer not installed/resolvable | Log warning and use internal fallback investigation to create investigation-log.md |
| "investigation-log.md missing after fallback" | Both pre-explorer and internal fallback failed | Log warning and continue; do not block context preparation |
| "investigation-log.md missing in code mode" | pre-explorer unavailable or generation failed during code preflight | Log warning and continue with code-context.json only (non-blocking) |
| "Invalid --artifact-type value" | Unsupported artifact type | Use only 'plan', 'code', or 'prd' |
| "prd.md not found" | PRD document missing from workdir | Emit WARNING and exit gracefully (code 0); do not fail the parent workflow |
| "report_id should end with '-prd-judges'" | Incorrect ID format for prd | Use pattern: {RUN_ID}-prd-judges for PRD artifacts |
If the --artifact-type value is not 'plan', 'code', or 'prd', treat it as an error: report the invalid value and do not launch judges.
If the context-manager-for-judges agent exceeds 5 minutes:
- Emit error CaseScores with `final_status=3`, `justification="Context preparation timeout"`

If the context-manager-for-judges agent exceeds 5 minutes in plan mode:
- Activate the one-run compatibility fallback to raw `plan.json` + `prd.md`

If a single judge Task call fails during execution:
- Construct an error CaseScore for that judge and continue with the remaining judges
When --artifact-type is not specified or equals 'plan':
- Write `plan-judges.json` (not `code-judges.json`)
- Use `plan-context.json` as primary input; use the one-run compatibility fallback only if context preparation fails
- Pass the `judge-input.json` envelope to judges
- Validate with `--category plan`

This is the standard plan mode flow; orchestrators must support context-manager launch, judge-input.json construction, and preamble injection. The compatibility fallback (raw plan.json + prd.md) activates only when context preparation fails (e.g., context-manager timeout), not for orchestrators that have not been updated.
When --artifact-type prd is specified:
- Verify `$CLOSEDLOOP_WORKDIR/prd.md` exists; emit a WARNING and exit gracefully (code 0) if missing
- Build `judge-input.json` with `evaluation_type="prd"` and `primary_artifact=$CLOSEDLOOP_WORKDIR/prd.md`
- Write `prd-judges.json`
- Validate with `--category prd` (sub_step=3)