From antigravity-skills
This skill should be used when designing autonomous agent harnesses: research loops, evaluation scaffolds, locked and editable surfaces, durable logs, novelty gates, pruning, rollback, PR preparation, and human approval boundaries.
npx claudepluginhub guanyang/antigravity-skills
Slash command
/antigravity-skills:harness-engineering
Harness engineering designs the control system around an agent: what it may edit, how it receives feedback, where it writes state, how failures recover, and who can approve irreversible actions. The harness is the difference between a helpful agent session and an autonomous loop that can run for days without corrupting its objective.
Activate this skill when:
Do not activate this skill for adjacent work owned by other skills:
- evaluation
- tool-design
- project-development
- hosted-agents

Separate the agent from the environment it operates inside. The agent proposes actions; the harness defines allowed surfaces, feedback, persistence, and promotion rules.
Use four surface classes:
| Surface | Examples | Rule |
|---|---|---|
| Locked | Eval metric, rubric, validation script, merge policy | Agent may read and propose changes, but cannot score itself with modified rules |
| Editable | Skill draft, experiment file, prompt, config under test | Agent may mutate during the loop |
| Append-only | Results log, research thread, rejected ideas | Agent may append, not rewrite |
| Human-controlled | Merge, production deploy, credentials, destructive operations | Requires explicit human approval |
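The table above can be read as a permission check. The sketch below is one possible encoding, not part of the skill itself: the names `Surface`, `ALLOWED_OPS`, and `is_allowed`, and the operation strings `"read"`, `"write"`, and `"append"`, are assumptions chosen for illustration.

```python
from enum import Enum


class Surface(Enum):
    LOCKED = "locked"
    EDITABLE = "editable"
    APPEND_ONLY = "append_only"
    HUMAN_CONTROLLED = "human_controlled"


# Operations the agent may perform on each surface class without approval.
# Proposing a change to a locked surface is modeled as a read plus a
# separate proposal artifact, never as a direct write.
ALLOWED_OPS = {
    Surface.LOCKED: {"read"},
    Surface.EDITABLE: {"read", "write"},
    Surface.APPEND_ONLY: {"read", "append"},
    Surface.HUMAN_CONTROLLED: set(),
}


def is_allowed(surface: Surface, op: str, human_approved: bool = False) -> bool:
    """Return True if the agent may perform `op` on `surface`."""
    if surface is Surface.HUMAN_CONTROLLED:
        # Nothing on this surface proceeds without explicit approval.
        return human_approved
    return op in ALLOWED_OPS[surface]
```

A harness would call `is_allowed` before every tool invocation and refuse the call when it returns `False`.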
Autonomy works when feedback is fast, unambiguous, and hard to game. Karpathy's autoresearch is the minimal pattern: one editable file, one locked evaluation file, fixed wall-clock budget, one scalar metric, git rollback, and a durable results log. The lesson is not that every harness needs one metric; it is that ambiguous feedback creates ambiguous autonomy.
For open-ended research-to-skill work, replace the scalar metric with locked rubrics, deterministic structure checks, source traceability, and human review thresholds.
Long-running agents must externalize state. Store plans, source queues, results, failures, and handoffs in files so future agents can resume without relying on chat history. Prime Intellect's autonomous nanoGPT work showed the value of durable scratchpads and THREAD.md-style logs for recovery, monitoring, and audit.
Use append-only logs for:
Agents tend to exploit the nearest surface, pile on complexity, and prune too little. Add explicit search rules:
For research-to-skill systems, track accepted mechanisms separately from prose. A mechanism record should include a stable mechanism_id, owning_skill, status, activation scenario, behavior change, evidence, and failure modes. Novelty gates should compare against this registry before using broader corpus overlap, because keyword overlap catches stale phrasing while mechanism comparison catches real duplication.
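A mechanism record with the fields listed above might look like the following sketch. The field names come from the text; the `"accepted"` status value, the `find_duplicate` helper, and the normalized exact-match comparison are assumptions. Exact matching is a deliberately crude stand-in for real mechanism comparison.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MechanismRecord:
    mechanism_id: str
    owning_skill: str
    status: str  # e.g. "accepted" (assumed value)
    activation_scenario: str
    behavior_change: str
    evidence: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)


def _norm(text: str) -> str:
    # Lowercase and collapse whitespace so trivial rewording does not hide a match.
    return " ".join(text.lower().split())


def find_duplicate(
    candidate: MechanismRecord, registry: list[MechanismRecord]
) -> Optional[MechanismRecord]:
    """Return the accepted record that describes the same mechanism, if any."""
    key = (_norm(candidate.activation_scenario), _norm(candidate.behavior_change))
    for record in registry:
        if record.status != "accepted":
            continue
        if (_norm(record.activation_scenario), _norm(record.behavior_change)) == key:
            return record
    return None
```

The gate compares what the mechanism does (scenario and behavior change) rather than how the prose is worded, which is the distinction the paragraph above draws.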
Autonomous agents may prepare PRs, but governance must be explicit. They can draft changes, run checks, and write PR summaries. They should not merge, deploy, or push without human approval unless the user has explicitly granted that permission for the specific action.
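The per-action permission rule above could be enforced with a small guard. This is a sketch under assumptions: `HUMAN_GATED`, `ApprovalRequired`, and `authorize` are hypothetical names, and the set of granted actions is assumed to come from an explicit user instruction recorded by the harness.

```python
# Actions the text says require human approval.
HUMAN_GATED = {"merge", "deploy", "push"}


class ApprovalRequired(Exception):
    """Raised when the agent attempts a gated action without a grant."""


def authorize(action: str, granted: set[str]) -> None:
    """Raise unless `action` is ungated or explicitly granted by name."""
    if action in HUMAN_GATED and action not in granted:
        raise ApprovalRequired(f"'{action}' needs explicit human approval")
```

Grants are checked per action name, so permission to merge does not imply permission to deploy.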
Use this pattern when optimizing an artifact against a stable evaluator:
read locked context -> choose hypothesis -> edit allowed surface -> commit/checkpoint
-> run evaluator -> log result -> keep if better -> discard or rollback if worse
-> repeat
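The loop above can be sketched in a few lines. This version makes several assumptions: the artifact is a string, a higher score is better, `propose` and `evaluate` are caller-supplied callables, and the checkpoint is held in memory rather than in git.

```python
from typing import Callable


def optimize(
    artifact: str,
    propose: Callable[[str], str],
    evaluate: Callable[[str], float],
    steps: int,
    log: list[dict],
) -> str:
    """Keep a candidate only if the locked evaluator scores it higher."""
    best = artifact
    best_score = evaluate(best)
    for step in range(steps):
        candidate = propose(best)    # edit the allowed surface
        score = evaluate(candidate)  # evaluator is not editable by `propose`
        kept = score > best_score
        # Log every attempt, including the ones that are discarded.
        log.append({"step": step, "score": score, "kept": kept})
        if kept:
            best, best_score = candidate, score
        # Otherwise `best` is left untouched, which acts as the rollback.
    return best
```

In a real harness the checkpoint and rollback would typically be git commits and resets, and `log` would be a durable file rather than a list.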
Required properties:
Use this pattern when sources become skill changes:
discover -> retrieve -> gate -> score -> extract mechanism
-> map to existing or new skill -> draft proposal -> validate structure
-> prepare PR -> human review
The locked evaluator is a combination of source rubrics, skill-change rubrics, structure checks, and reviewer approval. The editable artifact is the proposed skill delta.
Assume an optimizing agent will learn the harness. Guard against:
Mitigation: lock rubrics per run, report per-dimension scores, require source retrieval evidence, preserve rejected attempts, and route governance changes to human review.
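One way to lock a rubric per run is to fingerprint it at the start and verify the fingerprint before every scoring call. The sketch below assumes the rubric is available as text; `fingerprint` and `check_rubric_locked` are hypothetical helper names.

```python
import hashlib


def fingerprint(rubric_text: str) -> str:
    """SHA-256 hex digest of the rubric, recorded once when the run starts."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()


def check_rubric_locked(rubric_text: str, locked_fingerprint: str) -> None:
    """Raise if the rubric no longer matches the fingerprint taken at run start."""
    if fingerprint(rubric_text) != locked_fingerprint:
        raise RuntimeError("rubric changed during the run; route to human review")
```

The recorded fingerprint should live somewhere the agent cannot write, otherwise the check can be defeated by updating both the rubric and the fingerprint.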
Use monitoring agents for long runs, but restrict them to read-only reporting unless explicitly tasked otherwise. Monitoring output should report:
research-run/
THREAD.md
sources/
queue.md
evaluations/
proposals/
logs/
results.tsv
rejected.md
drafts/
Use TSV or JSONL for append-only machine-readable logs. Use Markdown for handoffs and reviewer-facing summaries.
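A JSONL results log can be kept append-only by convention with two small helpers. This is a minimal sketch: `append_result` and `read_results` are hypothetical names, and opening the file in append mode does not stop another process from rewriting it, so enforcement still depends on the harness restricting write access.

```python
import json


def append_result(path: str, record: dict) -> None:
    """Append one JSON object per line; existing lines are never touched."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def read_results(path: str) -> list[dict]:
    """Load every logged record in the order it was written."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A resuming agent can call `read_results` first to recover what has already been tried, including rejected attempts.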
Example 1: Locked metric
An agent optimizes train.py, but prepare.py owns data loading and evaluation. The agent can edit the model but cannot change the metric. Failed experiments are logged and rolled back.
Example 2: Locked rubric
An agent evaluates a new Anthropic or OpenAI engineering post, but the source curation rubric is locked for the run. If the source passes, the agent drafts a skill proposal. It cannot lower the rubric threshold to admit the source.
Example 3: Auto-PR without auto-merge
An agent prepares a branch and PR body after passing source, skill, and structure checks. The PR states unresolved risks and waits for human merge approval.
This skill connects to:
Internal references:
- researcher/README.md - Read when implementing the repo-native research-to-skill operating system
- researcher/rubrics/harness-change.md - Read when evaluating changes to an agent harness
- researcher/runbooks/autonomous-research-loop.md - Read when running a source-to-skill loop

External resources:
- autoresearch - Constrained autonomous experiment loop with locked evaluation

Created: 2026-05-14
Last Updated: 2026-05-15
Author: Agent Skills for Context Engineering Contributors
Version: 1.1.0