Harness - Quality-Gated SOP Execution
An MCP server and Claude Code plugin that enforces quality gates on any multi-step workflow. Every step submission is scored by a separate, independent evaluator (never self-evaluation) using a 6-dimension rubric with calibration anchors and AI slop detection.
Why This Exists
Anthropic's own research observes that when LLMs evaluate their own work, they "confidently praise their own mediocre work." A standalone evaluator tuned for skepticism is far more tractable than making a generator self-critical.
This harness implements that insight: the entity doing the work never judges its own output.
How It Works
┌─────────────────────────────────────────────────────────────┐
│ 1. Define SOP (YAML) │
│ phases → steps → acceptance_criteria │
│ │
│ 2. Start Session │
│ harness_start("feature-dev") → session_id + step 1 │
│ │
│ 3. For each step: │
│ a. Agent does the work │
│ b. Agent submits via harness_submit_step │
│ c. Separate evaluator scores on 6 dimensions │
│ d. PASS → next step | FAIL → retry with feedback │
│ e. 3 fails → escalate to human │
│ │
│ 4. All steps pass → session complete │
└─────────────────────────────────────────────────────────────┘
Features
Evaluator System
- 3 backends: Subagent (default, no API key), Anthropic API (Haiku), OpenAI-compatible (vLLM/Ollama)
- 6 scoring dimensions: Completeness (25%), Specificity (20%), Correctness (20%), Coherence (10%), Actionability (15%), Format Compliance (10%)
- 3 profiles: default (threshold 3.5), strict (4.0), lenient (3.0)
- AI slop detection: 15+ flagged patterns; 3+ flags auto-penalize specificity score
- Calibration anchors: Scored examples prevent evaluator drift
- Prompt injection defense: User content wrapped in structural delimiters
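The weighted rubric above can be sketched in a few lines. The dimension weights and the 3.5 default threshold come straight from this README; the function names and the 1-5 score scale are illustrative assumptions, not the harness's actual API.

```python
# Illustrative sketch of the 6-dimension weighted score. Weights are the
# ones documented above; function names and the 1-5 scale are assumptions.
WEIGHTS = {
    "completeness": 0.25,
    "specificity": 0.20,
    "correctness": 0.20,
    "coherence": 0.10,
    "actionability": 0.15,
    "format_compliance": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed 1-5 scale) into one number."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def passes(scores: dict[str, float], threshold: float = 3.5) -> bool:
    """PASS if the weighted score meets the profile's threshold."""
    return weighted_score(scores) >= threshold
```

Under the strict profile you would call `passes(scores, threshold=4.0)` instead.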
Workflow Engine
- YAML SOP definitions with phases, steps, dependencies (topological sort)
- Per-step evaluator profiles and optional timeouts
- Session management: Resume blocked sessions, skip steps, list all sessions
- Event sourcing: Automatic recovery from corrupted state files
- Atomic writes: Crash-safe state persistence
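Crash-safe state persistence of this kind is usually implemented as write-to-temp-file, fsync, then atomic rename. A minimal sketch, assuming JSON state files; the function name is hypothetical and this is not the harness's actual implementation:

```python
import json
import os
import tempfile

def atomic_write_json(path: str, state: dict) -> None:
    """Write state to a temp file in the same directory, then atomically
    rename it over the target. A crash mid-write leaves the old file intact,
    because os.replace is atomic on POSIX filesystems."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)     # atomic swap; readers see old or new, never partial
    except BaseException:
        os.unlink(tmp)            # clean up the temp file on failure
        raise
```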
MCP Tools (9)
| Tool | Purpose |
|---|---|
| harness_start | Begin a session with an SOP |
| harness_submit_step | Submit output for current step |
| harness_report_evaluation | Report subagent evaluation results |
| harness_get_status | Get session status |
| harness_get_feedback | Get evaluation feedback history |
| harness_list_sops | Discover available SOPs |
| harness_resume | Resume a paused/blocked session |
| harness_skip_step | Skip current step |
| harness_list_sessions | List all sessions |
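The per-step loop from "How It Works" can be driven with these tools. A hedged sketch: the tool names are from the table above, but `call_tool` stands in for your MCP client, and the argument and response fields shown (`session_id`, `state`, `current_step`) are assumptions about the schema, not the harness's exact contract.

```python
# Illustrative driver loop. `call_tool(name, args)` is a stand-in for an MCP
# client call; field names in the responses are assumed, not documented here.
def run_sop(call_tool, sop_name: str, do_work) -> str:
    session = call_tool("harness_start", {"sop": sop_name})
    session_id = session["session_id"]
    while True:
        status = call_tool("harness_get_status", {"session_id": session_id})
        if status["state"] == "complete":
            return session_id                      # all steps passed
        if status["state"] == "blocked":
            # 3 failed evaluations -> escalated to a human
            raise RuntimeError("escalated; see harness_get_feedback")
        output = do_work(status["current_step"])   # agent performs the step
        call_tool("harness_submit_step",
                  {"session_id": session_id, "output": output})
```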
Built-in SOP Templates
| Template | Phases | Steps | Use Case |
|---|---|---|---|
| feature-dev | 4 | 12 | Feature development lifecycle |
| investigation | 3 | 7 | Security/research investigation |
| code-review | 3 | 7 | Multi-perspective code review |
| _TEMPLATE | - | - | Commented template for new SOPs |
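The phases → steps → acceptance_criteria shape from the diagram might look like this in a custom SOP. Field names here follow the snippets elsewhere in this README (id, evaluator_profile); depends_on is a guess at how dependencies are spelled, so treat the _TEMPLATE file as the authoritative schema.

```yaml
name: my-sop
phases:
  - id: plan
    steps:
      - id: draft-outline
        evaluator_profile: lenient
        acceptance_criteria:
          - Outline covers every requirement in the ticket
  - id: build
    steps:
      - id: implement
        depends_on: [draft-outline]   # hypothetical field name for step dependencies
        evaluator_profile: strict
        acceptance_criteria:
          - All acceptance tests pass
```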
Installation
As a standalone MCP server
git clone https://github.com/p-vbordei/harness.git
cd harness
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
As a Claude Code plugin
Copy the harness/ directory to your Claude Code plugins location, or symlink it:
ln -s /path/to/harness ~/.claude/plugins/harness
The .mcp.json at the plugin root registers the MCP server automatically.
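For orientation, a .mcp.json registering a stdio MCP server conventionally has the shape below. The command and module name here are guesses at this project's entry point, not taken from the repo; check the shipped .mcp.json for the real values.

```json
{
  "mcpServers": {
    "harness": {
      "command": "python3",
      "args": ["-m", "harness.server"]
    }
  }
}
```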
Configuration
Evaluator Backends
| Backend | Env Vars | Cost | Notes |
|---|---|---|---|
| subagent (default) | None | Free | Uses Claude Code reviewer agent |
| anthropic | HARNESS_EVAL_BACKEND=anthropic, ANTHROPIC_API_KEY | ~$0.001/eval | Haiku by default |
| openai | HARNESS_EVAL_BACKEND=openai, HARNESS_EVAL_BASE_URL, HARNESS_EVAL_MODEL | Varies | vLLM, Ollama, etc. |
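Selecting the OpenAI-compatible backend then comes down to exporting the variables from the table, for example against a local endpoint. The variable names are from the table above; the URL and model values are placeholders, not defaults.

```shell
# Point the evaluator at an OpenAI-compatible endpoint (values are examples).
export HARNESS_EVAL_BACKEND=openai
export HARNESS_EVAL_BASE_URL=http://localhost:11434/v1   # e.g. a local Ollama server
export HARNESS_EVAL_MODEL=llama3.1:8b
```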
Evaluator Profiles
Profiles control scoring thresholds per step. Set in SOP YAML:
```yaml
steps:
  - id: security-check
    evaluator_profile: strict   # threshold 4.0, correctness weighted 25%
  - id: draft-outline
    evaluator_profile: lenient  # threshold 3.0, no slop penalty
```
Usage
From Claude Code CLI
/harness feature-dev