Harness - Quality-Gated SOP Execution
An MCP server and Claude Code plugin that enforces quality gates on any multi-step workflow. Every step submission is scored by a separate, independent evaluator (never self-evaluation) using a 6-dimension rubric with calibration anchors and AI slop detection.
Why This Exists
Anthropic's own research observes that when LLMs evaluate their own work, they "confidently praise their own mediocre work." A standalone evaluator tuned for skepticism is far more tractable than making a generator self-critical.
This harness implements that insight: the entity doing the work never judges its own output.
How It Works
┌─────────────────────────────────────────────────────────────┐
│ 1. Define SOP (YAML) │
│ phases → steps → acceptance_criteria │
│ │
│ 2. Start Session │
│ harness_start("feature-dev") → session_id + step 1 │
│ │
│ 3. For each step: │
│ a. Agent does the work │
│ b. Agent submits via harness_submit_step │
│ c. Separate evaluator scores on 6 dimensions │
│ d. PASS → next step | FAIL → retry with feedback │
│ e. 3 fails → escalate to human │
│ │
│ 4. All steps pass → session complete │
└─────────────────────────────────────────────────────────────┘
Features
Evaluator System
- 3 backends: Subagent (default, no API key), Anthropic API (Haiku), OpenAI-compatible (vLLM/Ollama)
- 6 scoring dimensions: Completeness (25%), Specificity (20%), Correctness (20%), Coherence (10%), Actionability (15%), Format Compliance (10%)
- 3 profiles: default (threshold 3.5), strict (4.0), lenient (3.0)
- AI slop detection: 15+ flagged patterns; 3+ flags auto-penalize specificity score
- Calibration anchors: Scored examples prevent evaluator drift
- Prompt injection defense: User content wrapped in structural delimiters
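The weighted rubric above can be sketched in a few lines. The dimension weights and the 3.5 default threshold come straight from this README; the function names and the 1-5 score scale are illustrative assumptions, not the harness's actual API.

```python
# Illustrative sketch of the 6-dimension weighted score. Weights are the
# ones documented above; function names and the 1-5 scale are assumptions.
WEIGHTS = {
    "completeness": 0.25,
    "specificity": 0.20,
    "correctness": 0.20,
    "coherence": 0.10,
    "actionability": 0.15,
    "format_compliance": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed 1-5 scale) into one number."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def passes(scores: dict[str, float], threshold: float = 3.5) -> bool:
    """PASS if the weighted score meets the profile's threshold."""
    return weighted_score(scores) >= threshold
```

Under the strict profile you would call `passes(scores, threshold=4.0)` instead.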
Workflow Engine
- YAML SOP definitions with phases, steps, dependencies (topological sort)
- Per-step evaluator profiles and optional timeouts
- Session management: Resume blocked sessions, skip steps, list all sessions
- Event sourcing: Automatic recovery from corrupted state files
- Atomic writes: Crash-safe state persistence
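Crash-safe state persistence of this kind is usually implemented as write-to-temp-file, fsync, then atomic rename. A minimal sketch, assuming JSON state files; the function name is hypothetical and this is not the harness's actual implementation:

```python
import json
import os
import tempfile

def atomic_write_json(path: str, state: dict) -> None:
    """Write state to a temp file in the same directory, then atomically
    rename it over the target. A crash mid-write leaves the old file intact,
    because os.replace is atomic on POSIX filesystems."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)     # atomic swap; readers see old or new, never partial
    except BaseException:
        os.unlink(tmp)            # clean up the temp file on failure
        raise
```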
MCP Tools (9)
| Tool | Purpose |
|---|---|
| harness_start | Begin a session with an SOP |
| harness_submit_step | Submit output for current step |
| harness_report_evaluation | Report subagent evaluation results |
| harness_get_status | Get session status |
| harness_get_feedback | Get evaluation feedback history |
| harness_list_sops | Discover available SOPs |
| harness_resume | Resume a paused/blocked session |
| harness_skip_step | Skip current step |
| harness_list_sessions | List all sessions |
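The per-step loop from "How It Works" can be driven with these tools. A hedged sketch: the tool names are from the table above, but `call_tool` stands in for your MCP client, and the argument and response fields shown (`session_id`, `state`, `current_step`) are assumptions about the schema, not the harness's exact contract.

```python
# Illustrative driver loop. `call_tool(name, args)` is a stand-in for an MCP
# client call; field names in the responses are assumed, not documented here.
def run_sop(call_tool, sop_name: str, do_work) -> str:
    session = call_tool("harness_start", {"sop": sop_name})
    session_id = session["session_id"]
    while True:
        status = call_tool("harness_get_status", {"session_id": session_id})
        if status["state"] == "complete":
            return session_id                      # all steps passed
        if status["state"] == "blocked":
            # 3 failed evaluations -> escalated to a human
            raise RuntimeError("escalated; see harness_get_feedback")
        output = do_work(status["current_step"])   # agent performs the step
        call_tool("harness_submit_step",
                  {"session_id": session_id, "output": output})
```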
Built-in SOP Templates
| Template | Phases | Steps | Use Case |
|---|---|---|---|
| feature-dev | 4 | 12 | Feature development lifecycle |
| investigation | 3 | 7 | Security/research investigation |
| code-review | 3 | 7 | Multi-perspective code review |
| _TEMPLATE | - | - | Commented template for new SOPs |
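The phases → steps → acceptance_criteria shape from the diagram might look like this in a custom SOP. Field names here follow the snippets elsewhere in this README (id, evaluator_profile); depends_on is a guess at how dependencies are spelled, so treat the _TEMPLATE file as the authoritative schema.

```yaml
name: my-sop
phases:
  - id: plan
    steps:
      - id: draft-outline
        evaluator_profile: lenient
        acceptance_criteria:
          - Outline covers every requirement in the ticket
  - id: build
    steps:
      - id: implement
        depends_on: [draft-outline]   # hypothetical field name for step dependencies
        evaluator_profile: strict
        acceptance_criteria:
          - All acceptance tests pass
```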
Installation
As a standalone MCP server
git clone https://github.com/p-vbordei/harness.git
cd harness
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
As a Claude Code plugin
Copy the harness/ directory to your Claude Code plugins location, or symlink it:
ln -s /path/to/harness ~/.claude/plugins/harness
The .mcp.json at the plugin root registers the MCP server automatically.
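For orientation, a .mcp.json registering a stdio MCP server conventionally has the shape below. The command and module name here are guesses at this project's entry point, not taken from the repo; check the shipped .mcp.json for the real values.

```json
{
  "mcpServers": {
    "harness": {
      "command": "python3",
      "args": ["-m", "harness.server"]
    }
  }
}
```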
Configuration
Evaluator Backends
| Backend | Env Vars | Cost | Notes |
|---|---|---|---|
| subagent (default) | None | Free | Uses Claude Code reviewer agent |
| anthropic | HARNESS_EVAL_BACKEND=anthropic, ANTHROPIC_API_KEY | ~$0.001/eval | Haiku by default |
| openai | HARNESS_EVAL_BACKEND=openai, HARNESS_EVAL_BASE_URL, HARNESS_EVAL_MODEL | Varies | vLLM, Ollama, etc. |
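Selecting the OpenAI-compatible backend then comes down to exporting the variables from the table, for example against a local endpoint. The variable names are from the table above; the URL and model values are placeholders, not defaults.

```shell
# Point the evaluator at an OpenAI-compatible endpoint (values are examples).
export HARNESS_EVAL_BACKEND=openai
export HARNESS_EVAL_BASE_URL=http://localhost:11434/v1   # e.g. a local Ollama server
export HARNESS_EVAL_MODEL=llama3.1:8b
```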
Evaluator Profiles
Profiles control scoring thresholds per step. Set in SOP YAML:
```yaml
steps:
  - id: security-check
    evaluator_profile: strict   # threshold 4.0, correctness weighted 25%
  - id: draft-outline
    evaluator_profile: lenient  # threshold 3.0, no slop penalty
```
Usage
From Claude Code CLI
/harness feature-dev