Frozen evaluation harness that gates all self-improvement changes. Contains benchmark suites, regression tests, and human approval gates. CRITICAL: This skill does NOT self-improve; it is only expanded manually.
/plugin marketplace add DNYoussef/context-cascade
/plugin install dnyoussef-context-cascade@DNYoussef/context-cascade
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Before writing ANY code, you MUST check:
.claude/library/catalog.json
.claude/docs/inventories/LIBRARY-PATTERNS-GUIDE.md
D:\Projects\*

| Match | Action |
|---|---|
| Library >90% | REUSE directly |
| Library 70-90% | ADAPT minimally |
| Pattern exists | FOLLOW pattern |
| In project | EXTRACT |
| No match | BUILD (add to library after) |
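A minimal sketch of this decision table in code, assuming match scores are normalized to 0..1 (the function and flag names are illustrative, not a real API):

```javascript
// Illustrative only: maps a library match score (0..1) to the reuse
// action from the table above.
function reuseAction(matchScore, patternExists, inProject) {
  if (matchScore > 0.9) return "REUSE";       // Library >90%: reuse directly
  if (matchScore >= 0.7) return "ADAPT";      // Library 70-90%: adapt minimally
  if (patternExists) return "FOLLOW_PATTERN"; // A documented pattern exists
  if (inProject) return "EXTRACT";            // Code already lives in the project
  return "BUILD";                             // No match: build, then add to library
}
```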
Gate ALL self-improvement changes with objective evaluation.
CRITICAL: This harness does NOT self-improve. It is manually maintained and expanded. This prevents Goodhart's Law (optimizing the metric instead of the outcome).
"A self-improvement loop is only as good as its evaluation harness."
Without frozen evaluation, the loop can quietly optimize its own metrics (Goodhart's Law again), regressions slip through unnoticed, and there is no stable baseline against which an "improvement" can actually be verified.
ID: prompt-generation-benchmark-v1
Purpose: Evaluate quality of generated prompts
benchmark:
id: prompt-generation-benchmark-v1
version: 1.0.0
last_modified: "2025-12-15"
frozen: true
tasks:
- id: "pg-001"
name: "Simple Task Prompt"
input: "Create a prompt for file reading"
expected_qualities:
- has_clear_action_verb
- has_input_specification
- has_output_specification
- has_error_handling
scoring:
clarity: 0.0-1.0
completeness: 0.0-1.0
precision: 0.0-1.0
- id: "pg-002"
name: "Complex Workflow Prompt"
input: "Create a prompt for multi-step deployment"
expected_qualities:
- has_plan_and_solve_structure
- has_validation_gates
- has_rollback_instructions
- has_success_criteria
scoring:
clarity: 0.0-1.0
completeness: 0.0-1.0
precision: 0.0-1.0
- id: "pg-003"
name: "Analytical Task Prompt"
input: "Create a prompt for code review"
expected_qualities:
- has_self_consistency_mechanism
- has_multiple_perspectives
- has_confidence_scoring
- has_uncertainty_handling
scoring:
clarity: 0.0-1.0
completeness: 0.0-1.0
precision: 0.0-1.0
minimum_passing:
average_clarity: 0.7
average_completeness: 0.7
average_precision: 0.7
required_qualities_hit_rate: 0.8
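A sketch of how a suite's results might be checked against minimum_passing, assuming each task result carries per-dimension scores and hypothetical qualitiesHit/qualitiesExpected counts (the helper names are assumptions, not part of the spec):

```javascript
// Hypothetical scorer for the suite above: averages each scoring
// dimension across tasks, computes the expected-qualities hit rate,
// and compares both against the minimum_passing thresholds.
function meetsMinimum(results, minimums) {
  const avg = (key) =>
    results.reduce((s, r) => s + r.scores[key], 0) / results.length;
  const hitRate =
    results.reduce((s, r) => s + r.qualitiesHit / r.qualitiesExpected, 0) /
    results.length;
  return (
    avg("clarity") >= minimums.average_clarity &&
    avg("completeness") >= minimums.average_completeness &&
    avg("precision") >= minimums.average_precision &&
    hitRate >= minimums.required_qualities_hit_rate
  );
}
```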
ID: skill-generation-benchmark-v1
Purpose: Evaluate quality of generated skills
benchmark:
id: skill-generation-benchmark-v1
version: 1.0.0
frozen: true
tasks:
- id: "sg-001"
name: "Micro-Skill Generation"
input: "Create skill for JSON validation"
expected_qualities:
- has_single_responsibility
- has_input_output_contract
- has_error_handling
- has_test_cases
scoring:
functionality: 0.0-1.0
contract_compliance: 0.0-1.0
error_coverage: 0.0-1.0
- id: "sg-002"
name: "Complex Skill Generation"
input: "Create skill for API integration"
expected_qualities:
- has_phase_structure
- has_validation_gates
- has_logging
- has_rollback
scoring:
functionality: 0.0-1.0
structure_compliance: 0.0-1.0
safety_coverage: 0.0-1.0
minimum_passing:
average_functionality: 0.75
average_compliance: 0.8
required_qualities_hit_rate: 0.85
ID: expertise-generation-benchmark-v1
Purpose: Evaluate quality of expertise files
benchmark:
id: expertise-generation-benchmark-v1
version: 1.0.0
frozen: true
tasks:
- id: "eg-001"
name: "Domain Expertise Generation"
input: "Create expertise for authentication domain"
expected_qualities:
- has_file_locations
- has_falsifiable_patterns
- has_validation_rules
- has_known_issues_section
scoring:
falsifiability_coverage: 0.0-1.0
pattern_precision: 0.0-1.0
validation_completeness: 0.0-1.0
minimum_passing:
falsifiability_coverage: 0.8
pattern_precision: 0.7
validation_completeness: 0.75
ID: prompt-forge-regression-v1
regression_suite:
id: prompt-forge-regression-v1
version: 1.0.0
frozen: true
tests:
- id: "pfr-001"
name: "Basic prompt improvement preserved"
action: "Generate improvement for simple prompt"
expected: "Produces valid improvement proposal"
must_pass: true
- id: "pfr-002"
name: "Self-consistency technique applied"
action: "Improve prompt for analytical task"
expected: "Output includes self-consistency mechanism"
must_pass: true
- id: "pfr-003"
name: "Uncertainty handling present"
action: "Improve prompt with ambiguous input"
expected: "Output includes uncertainty pathway"
must_pass: true
- id: "pfr-004"
name: "No forced coherence"
action: "Improve prompt where best answer is uncertain"
expected: "Output does NOT force a confident answer"
must_pass: true
- id: "pfr-005"
name: "Rollback instructions included"
action: "Generate improvement proposal"
expected: "Proposal includes rollback plan"
must_pass: true
failure_threshold: 0
# ANY regression = REJECT
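With failure_threshold set to 0, a single failed must_pass test fails the whole suite. A minimal sketch of a conforming runner, matching the runRegressions call in the pipeline below (runTest is an assumed helper, not a real API):

```javascript
// Zero-tolerance regression run: any failed must_pass test exceeds the
// failure_threshold of 0 and fails the suite outright.
async function runRegressions(suite, proposal) {
  const failures = [];
  for (const test of suite.tests) {
    const passed = await runTest(test, proposal); // assumed test runner
    if (!passed && test.must_pass) failures.push(test.id);
  }
  return {
    status: failures.length <= suite.failure_threshold ? "PASS" : "FAIL",
    passed: suite.tests.length - failures.length,
    failed: failures.length,
    details: failures,
  };
}
```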
ID: skill-forge-regression-v1
regression_suite:
id: skill-forge-regression-v1
version: 1.0.0
frozen: true
tests:
- id: "sfr-001"
name: "Phase structure preserved"
action: "Generate skill from prompt"
expected: "Output has 7-phase structure"
must_pass: true
- id: "sfr-002"
name: "Contract specification present"
action: "Generate skill"
expected: "Output has input/output contract"
must_pass: true
- id: "sfr-003"
name: "Error handling included"
action: "Generate skill"
expected: "Output has error handling section"
must_pass: true
- id: "sfr-004"
name: "Test cases generated"
action: "Generate skill"
expected: "Output includes test cases"
must_pass: true
failure_threshold: 0
Automatic approval is NOT sufficient for the following classes of change. Each gate below forces a human into the loop:
gate:
id: "breaking-change-gate"
trigger: "Interface modification detected"
action: "Require human approval"
approvers: 1
timeout: "24 hours"
on_timeout: "REJECT"
gate:
id: "high-risk-gate"
trigger: "Security-related OR core logic change"
action: "Require human approval"
approvers: 2
timeout: "48 hours"
on_timeout: "REJECT"
gate:
id: "disagreement-gate"
trigger: "3+ auditors disagree on change"
action: "Require human review"
approvers: 1
timeout: "24 hours"
on_timeout: "REJECT"
gate:
id: "novel-pattern-gate"
trigger: "First-time change type detected"
action: "Require human approval"
approvers: 1
timeout: "12 hours"
on_timeout: "REJECT"
gate:
id: "threshold-gate"
trigger: "Metric movement > 10% (positive or negative)"
action: "Require human review"
approvers: 1
timeout: "24 hours"
on_timeout: "Manual review required"
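These suites and gates feed a single evaluation pipeline. The orchestrator below runs benchmarks, then regressions, then gate checks; helpers such as getRelevantBenchmarks, runBenchmark, and checkHumanGates are assumed to exist in the harness runtime: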
async function runEvaluation(proposal) {
const results = {
proposal_id: proposal.id,
timestamp: new Date().toISOString(),
benchmarks: {},
regressions: {},
human_gates: [],
verdict: null
};
// 1. Run benchmark suites
for (const suite of getRelevantBenchmarks(proposal)) {
results.benchmarks[suite.id] = await runBenchmark(suite, proposal);
}
// 2. Run regression tests
for (const suite of getRelevantRegressions(proposal)) {
results.regressions[suite.id] = await runRegressions(suite, proposal);
}
// 3. Check human gates
results.human_gates = checkHumanGates(proposal, results);
// 4. Determine verdict
if (anyRegressionFailed(results.regressions)) {
results.verdict = "REJECT";
results.reason = "Regression test failed";
} else if (anyBenchmarkBelowMinimum(results.benchmarks)) {
results.verdict = "REJECT";
results.reason = "Benchmark below minimum threshold";
} else if (results.human_gates.length > 0) {
results.verdict = "PENDING_HUMAN_REVIEW";
results.reason = `Requires approval: ${results.human_gates.join(', ')}`;
} else {
results.verdict = "ACCEPT";
results.reason = "All checks passed";
}
return results;
}
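checkHumanGates, referenced above, might evaluate each gate's trigger against the proposal. A sketch, assuming the trigger conditions were precomputed as flags on the proposal by earlier analysis (none of these field names are a real API):

```javascript
// Hypothetical gate check: returns the ids of gates whose trigger
// condition matches this proposal. Flags like modifiesInterface are
// assumed to be set upstream.
function checkHumanGates(proposal, results) {
  const triggered = [];
  if (proposal.modifiesInterface) triggered.push("breaking-change-gate");
  if (proposal.securityRelated || proposal.touchesCoreLogic)
    triggered.push("high-risk-gate");
  if (proposal.auditorDisagreements >= 3) triggered.push("disagreement-gate");
  if (proposal.isNovelChangeType) triggered.push("novel-pattern-gate");
  if (Math.abs(proposal.metricDelta) > 0.10) triggered.push("threshold-gate");
  return triggered;
}
```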
evaluation_result:
proposal_id: "prop-123"
timestamp: "2025-12-15T10:30:00Z"
benchmarks:
prompt-generation-benchmark-v1:
status: "PASS"
scores:
clarity: 0.85
completeness: 0.82
precision: 0.79
minimum_met: true
regressions:
prompt-forge-regression-v1:
status: "PASS"
passed: 5
failed: 0
details: []
human_gates:
triggered: []
pending: []
verdict: "ACCEPT"
reason: "All benchmarks passed, no regressions, no human gates triggered"
improvement_delta:
baseline: 0.78
candidate: 0.82
delta: +0.04
significant: true
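The significant flag above could come from a simple cutoff on the delta; the 0.02 minimum below is illustrative, not part of the harness spec:

```javascript
// Illustrative significance check for improvement_delta. The minimum
// meaningful delta (0.02 here) is an assumption.
function improvementDelta(baseline, candidate, minDelta = 0.02) {
  const delta = candidate - baseline;
  return { baseline, candidate, delta, significant: delta >= minDelta };
}
```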
The eval harness can ONLY be expanded through this protocol:
expansion_request:
type: "new_benchmark|new_regression|new_gate"
justification: "Why this addition is needed"
proposed_addition: {...}
approval_required:
- "Human review"
- "Does not invalidate existing tests"
- "Does not lower standards"
process:
  - "Submit expansion request"
  - "Human reviews justification"
  - "Verify addition doesn't conflict"
  - "Add to harness"
  - "Increment harness version"
  - "Document in CHANGELOG"
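A sketch of the structural checks that could run before an expansion request reaches human review (the request and harness field names are assumptions about the data shape; approval itself stays human):

```javascript
// Hypothetical pre-review validation for an expansion_request. Only
// mechanical checks run automatically; the approval decision does not.
function validateExpansionRequest(req, harness) {
  const errors = [];
  const validTypes = ["new_benchmark", "new_regression", "new_gate"];
  if (!validTypes.includes(req.type)) errors.push("unknown expansion type");
  if (!req.justification) errors.push("missing justification");
  const existingIds = harness.suites.map((s) => s.id);
  if (existingIds.includes(req.proposed_addition?.id))
    errors.push("id collides with an existing suite"); // would invalidate tests
  return { ok: errors.length === 0, errors };
}
```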
Status: Production-Ready (FROZEN)
Version: 1.0.0
Key Constraint: This skill does NOT self-improve
Expansion: Manual only, with human approval