From terraphim-engineering-skills
Evaluate agent task outputs using a three-dimension rubric (Semantic, Pragmatic, Syntactic) derived from the KLS quality framework. Use when: (1) a task has been completed and needs quality assessment before acceptance, (2) automated post-task quality checks are required, (3) multi-model consensus verdicts are needed for agent outputs, (4) documentation, code, or specification quality must be scored with structured JSON verdicts, or (5) a human fallback decision is needed after model disagreement. Produces JSONL verdict records compatible with the verdict schema in automation/judge/.
```shell
npx claudepluginhub terraphim/terraphim-skills --plugin terraphim-engineering-skills
```

This skill uses the workspace's default tool permissions.
Evaluate agent task outputs against a three-dimension rubric and produce structured verdict records. The judge operates as a quality gate at the task completion boundary, scoring outputs on Semantic accuracy, Pragmatic usefulness, and Syntactic consistency.
The rubric reuses three dimensions from the KLS (Krogstie-Lindland-Sindre) quality
framework defined in disciplined-quality-evaluation:
| Dimension | Question | Criteria |
|---|---|---|
| Semantic | Does it accurately represent the domain? | Factual correctness, domain terminology, no contradictions |
| Pragmatic | Does it enable the intended decisions/actions? | Actionable, useful, addresses the task goal |
| Syntactic | Is it internally consistent and well-structured? | Format compliance, structural completeness, no broken references |
| Score | Meaning |
|---|---|
| 1 | Poor -- major issues, blocks use |
| 2 | Below Standard -- significant gaps |
| 3 | Adequate -- meets minimum bar |
| 4 | Good -- clear, useful, few issues |
| 5 | Excellent -- exemplary, no issues |
| Condition | Verdict |
|---|---|
| All dimensions >= 3 AND average >= 3.5 | accept |
| Any dimension < 3 OR average < 3.5, but all >= 2 | improve |
| Any dimension < 2 | reject |
| Models disagree on accept/reject | escalate |
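The single-model decision rules in the table above can be sketched as a small function. This is a minimal illustration, not the actual runner implementation; the function name and dict shape are assumptions based on the verdict record format shown below.

```python
from statistics import mean

def decide_verdict(scores: dict) -> str:
    """Map the three rubric scores (each 1-5) to a single-model verdict,
    following the decision table: reject on any score < 2, accept when all
    scores are >= 3 and the average is >= 3.5, otherwise improve.
    (The cross-model "escalate" case is decided outside this function.)"""
    values = [scores["semantic"], scores["pragmatic"], scores["syntactic"]]
    if any(v < 2 for v in values):
        return "reject"
    if all(v >= 3 for v in values) and mean(values) >= 3.5:
        return "accept"
    return "improve"

print(decide_verdict({"semantic": 4, "pragmatic": 4, "syntactic": 5}))  # accept
```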
Every judge evaluation produces a JSON verdict record. See automation/judge/verdict-schema.json
for the full schema. Minimal structure:
```json
{
  "task_id": "issue-18",
  "model": "opencode/gpt-5-nano",
  "mode": "quick",
  "verdict": "accept",
  "scores": {
    "semantic": 4,
    "pragmatic": 4,
    "syntactic": 5
  },
  "average": 4.33,
  "reasoning": "Brief justification for scores",
  "improvements": [],
  "timestamp": "2026-02-17T14:30:00Z"
}
```
Output MUST be valid JSON only -- no markdown fencing, no preamble, no trailing text.
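A consumer can enforce the JSON-only rule strictly: any markdown fencing or preamble makes the payload unparseable and should fail loudly. The sketch below is illustrative; the field set is taken from the minimal structure above, and the authoritative definition is automation/judge/verdict-schema.json.

```python
import json

# Fields from the minimal verdict structure (see verdict-schema.json for the full schema).
REQUIRED_FIELDS = {
    "task_id", "model", "mode", "verdict", "scores",
    "average", "reasoning", "improvements", "timestamp",
}

def parse_verdict(raw: str) -> dict:
    """Parse a judge response, rejecting anything that is not bare JSON."""
    record = json.loads(raw)  # raises ValueError on fences, preamble, or trailing text
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"verdict missing fields: {sorted(missing)}")
    for dim in ("semantic", "pragmatic", "syntactic"):
        if not 1 <= record["scores"][dim] <= 5:
            raise ValueError(f"{dim} score out of range: {record['scores'][dim]}")
    return record
```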
```
Run quick judge
  |
  +-- accept  --> DONE (accept)
  +-- reject  --> DONE (reject, log improvements)
  +-- improve --> Run deep judge
                    |
                    +-- accept  --> DONE (accept)
                    +-- reject  --> DONE (reject, log improvements)
                    +-- improve --> Human fallback
```
When quick and deep disagree on accept vs reject:

- Quick: accept + Deep: reject --> run tiebreaker
- Quick: reject + Deep: accept --> run tiebreaker
- The tiebreaker verdict is final (no further iteration)
Maximum 3 model calls before human fallback.
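The escalation chain above can be sketched as follows. The `run_model(task, mode)` callable is a hypothetical interface standing in for a judge invocation; the real entry point is automation/judge/run-judge.sh.

```python
def escalation_chain(run_model, task) -> str:
    """Sketch of the quick -> deep -> human fallback chain.

    run_model(task, mode) is a hypothetical callable returning one of
    "accept", "reject", or "improve". At most two model calls are made
    here; a tiebreaker (the third allowed call) only applies when quick
    and deep disagree on accept vs reject in a configuration that runs
    both modes unconditionally.
    """
    quick = run_model(task, "quick")
    if quick != "improve":
        return quick                  # accept or reject: done
    deep = run_model(task, "deep")
    if deep != "improve":
        return deep                   # deep resolved it
    return "escalate"                 # both said "improve": human fallback
```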
Escalate to human review when the automated chain cannot reach a final verdict -- for example, when both the quick and deep judges return "improve".
The judge is invoked by automation/judge/run-judge.sh (v2), which passes the prompt via --file <tempfile> for reliable prompt delivery.

All verdicts are appended to automation/judge/verdicts.jsonl as one JSON object per line, so downstream tooling can process the log line by line.
The judge can be invoked as a pre-push hook via automation/judge/pre-push-judge.sh,
blocking pushes that receive a "reject" verdict.
When terraphim-cli is available, the judge runner uses knowledge graph-based term normalization to identify rubric dimensions in model reasoning.
```shell
# Install judge KG files and configure role
bash automation/judge/setup-judge-kg.sh

# Verify installation
terraphim-cli thesaurus --limit 50
terraphim-cli find "factual correctness and actionability"
```
Located in automation/judge/kg/:
| File | Normalized Term | Purpose |
|---|---|---|
| judge-semantic.md | judge-semantic | Synonyms for semantic quality dimension |
| judge-pragmatic.md | judge-pragmatic | Synonyms for pragmatic quality dimension |
| judge-syntactic.md | judge-syntactic | Synonyms for syntactic quality dimension |
| judge-verdicts.md | judge-verdicts | Verdict vocabulary normalization |
| judge-checklist.md | judge-checklist | Required verdict elements |
If terraphim-cli is not installed, the judge falls back to direct JSON extraction without term normalization. All core functionality works without terraphim -- the KG integration is an enrichment layer.
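A naive version of that fallback could be a direct extraction of the first JSON object from the model's reply. This is a sketch under assumptions (the function name and regex strategy are not from the source; the greedy match assumes exactly one JSON object in the reply):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Fallback: pull a JSON object out of raw model output without any
    KG-based term normalization. Greedy match from the first '{' to the
    last '}' handles nested objects but assumes a single JSON payload."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```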