From ouroboros
Runs a three-stage verification pipeline (mechanical, semantic, multi-model consensus) to evaluate execution sessions. Useful for structured quality checks on code or task outputs.
Slash command
/ouroboros:evaluate
Evaluate an execution session using the three-stage verification pipeline.
/ouroboros:evaluate <session_id> [artifact]
Trigger keywords: "evaluate this", "3-stage check"
The evaluation pipeline runs three progressive stages:
Stage 1: Mechanical Verification ($0 cost)
Stage 2: Semantic Evaluation (Standard tier)
Stage 3: Multi-Model Consensus (Frontier tier, optional)
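The staged flow above can be sketched as a short-circuiting loop. This is a minimal illustration, not the Ouroboros implementation: it assumes "progressive" means a later stage runs only when the earlier one passes, and `StageResult`, `run_pipeline`, and the stand-in checks are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageResult:
    name: str
    passed: bool


def run_pipeline(
    stages: List[Tuple[str, Callable[[], bool]]]
) -> Tuple[bool, List[StageResult]]:
    """Run stages in order and stop at the first failure (assumed behavior)."""
    results: List[StageResult] = []
    for name, check in stages:
        passed = check()
        results.append(StageResult(name, passed))
        if not passed:
            # Cheaper stages gate the more expensive ones that follow.
            return False, results
    return True, results


# Hypothetical stand-in checks; real stages would run lint/build/test,
# a model-based review, and a multi-model vote.
stages = [
    ("mechanical", lambda: True),
    ("semantic", lambda: False),
    ("consensus", lambda: True),
]
approved, results = run_pipeline(stages)
print(approved, [r.name for r in results])
```

Ordering the stages from cheapest to most expensive means a lint or build failure never incurs the cost of the model-based stages.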
When the user invokes this skill:
The Ouroboros MCP tools are often registered as deferred tools that must be explicitly loaded before use. You MUST perform this step before proceeding.
Use the ToolSearch tool to find and load the evaluate MCP tool:
ToolSearch query: "+ouroboros evaluate"
The tool is typically named mcp__plugin_ouroboros_ouroboros__ouroboros_evaluate (with a plugin prefix). After ToolSearch returns, the tool becomes callable.
IMPORTANT: Do NOT skip this step. Do NOT assume MCP tools are unavailable just because they don't appear in your immediate tool list. They are almost always available as deferred tools that need to be loaded first.
CRITICAL — deferred-schema guard (prevents "Invalid tool parameters"):
This skill can call ouroboros_evaluate after a fresh turn. A deferred tool's
schema loaded on one turn is NOT guaranteed to still be loaded on the next. If
you call it while its schema is not loaded in the current turn, the runtime
rejects the call with "Invalid tool parameters" before it reaches the server.
Therefore: immediately before EVERY ouroboros_evaluate call in this skill,
re-run ToolSearch query: "+ouroboros evaluate" (idempotent — a no-op when
already loaded). If the load returns no matching tool, switch to the documented
fallback instead of retrying the failing call.
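The guard described above amounts to a "reload, then call, else fall back" pattern. The sketch below models it in Python for illustration only: ToolSearch is an agent-side tool and not a Python function, so `load_tool`, `fallback`, and `guarded_call` are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, Optional


def guarded_call(
    load_tool: Callable[[str], Optional[Callable[..., Any]]],
    fallback: Callable[[Dict[str, Any]], Any],
    arguments: Dict[str, Any],
    query: str = "+ouroboros evaluate",
) -> Any:
    """Reload the tool right before every call; fall back if it is missing."""
    tool = load_tool(query)  # assumed idempotent: a no-op when already loaded
    if tool is None:
        # No matching tool: switch to the fallback rather than retrying.
        return fallback(arguments)
    return tool(**arguments)


# Hypothetical fakes standing in for ToolSearch and the evaluator agent.
calls = []


def fake_loader(query: str):
    calls.append(query)
    return lambda **kw: {"via": "mcp", **kw}


def fake_fallback(args: Dict[str, Any]):
    return {"via": "agent", **args}


first = guarded_call(fake_loader, fake_fallback, {"session_id": "sess-abc-123"})
second = guarded_call(lambda q: None, fake_fallback, {"session_id": "sess-abc-123"})
print(first["via"], second["via"])
```

Putting the reload inside the call path, and not in a one-time setup step, is what keeps the guard effective across turns.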
Determine what to evaluate:
session_id provided: Use it directly
Gather the artifact to evaluate:
Call the ouroboros_evaluate MCP tool:
Tool: ouroboros_evaluate
Arguments:
session_id: <session ID>
artifact: <the code/output to evaluate>
seed_content: <original seed YAML, if available>
acceptance_criterion: <specific AC to check, optional>
artifact_type: "code" (or "docs", "config")
trigger_consensus: false (true if user requests Stage 3)
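As an illustration, the argument list above could be assembled like this. The field names come from the list; the helper `build_evaluate_args`, the choice to omit unset optional fields, and the treatment of "code", "docs", and "config" as the only valid artifact types are assumptions, not documented behavior of the tool.

```python
from typing import Any, Dict, Optional

# Assumed to be exhaustive; the source only lists these three values.
ALLOWED_ARTIFACT_TYPES = {"code", "docs", "config"}


def build_evaluate_args(
    session_id: str,
    artifact: str,
    seed_content: Optional[str] = None,
    acceptance_criterion: Optional[str] = None,
    artifact_type: str = "code",
    trigger_consensus: bool = False,
) -> Dict[str, Any]:
    """Build the argument payload for an ouroboros_evaluate call."""
    if artifact_type not in ALLOWED_ARTIFACT_TYPES:
        raise ValueError(f"unsupported artifact_type: {artifact_type!r}")
    args: Dict[str, Any] = {
        "session_id": session_id,
        "artifact": artifact,
        "artifact_type": artifact_type,
        "trigger_consensus": trigger_consensus,  # True only when Stage 3 is requested
    }
    # Optional fields are left out when unset (an assumption about the schema).
    if seed_content is not None:
        args["seed_content"] = seed_content
    if acceptance_criterion is not None:
        args["acceptance_criterion"] = acceptance_criterion
    return args


args = build_evaluate_args("sess-abc-123", "def add(a, b):\n    return a + b\n")
print(sorted(args))
```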
Present results clearly:
📍 Done! Your implementation passes all checks. Optional: ooo evolve to iteratively refine
(code_changes_detected: true): 📍 Next: Fix the build/test failures above, then ooo evaluate — or ooo ralph for automated fix loop
(code_changes_detected: false): 📍 Next: Run ooo run first to produce code, then ooo evaluate
📍 Next: ooo run to re-execute with fixes — or ooo evolve for iterative refinement
📍 Next: ooo interview to re-examine requirements — or ooo unstuck to challenge assumptions
If the MCP server is not available, use the ouroboros:evaluator agent to perform a prompt-based evaluation.
User: /ouroboros:evaluate sess-abc-123
Evaluation Results
============================================================
Final Approval: APPROVED
Highest Stage Completed: 2
Stage 1: Mechanical Verification
[PASS] lint: No issues found
[PASS] build: Build successful
[PASS] test: 12/12 tests passing
Stage 2: Semantic Evaluation
Score: 0.85
AC Compliance: YES
Goal Alignment: 0.90
Drift Score: 0.08
📍 Done! Your implementation passes all checks. Optional: `ooo evolve` to iteratively refine
npx claudepluginhub q00/ouroboros --plugin ouroboros