Design, audit, and improve the agent harness — the environment, constraints, context management, evaluation gates, and feedback loops that surround and guide AI agents. Inspired by OpenAI and Anthropic harness engineering best practices. Triggers: "harness", "harness audit", "improve harness", "agent environment", "context management", "evaluation gates", "feedback loop", "harness engineering".
From superomni:
npx claudepluginhub wilder1222/superomni --plugin superomni

This skill is limited to using the following tools:
mkdir -p ~/.omni-skills/sessions
_PROACTIVE=$(~/.claude/skills/superomni/bin/config get proactive 2>/dev/null || echo "true")
_BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
_TEL_START=$(date +%s)
echo "Branch: $_BRANCH | PROACTIVE: $_PROACTIVE"
If PROACTIVE is false: do NOT proactively suggest skills. Only run skills the
user explicitly invokes. If you would have auto-invoked, say:
"I think [skill-name] might help here — want me to run it?" and wait.
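The PROACTIVE gate above can be sketched as a small shell check. This is a minimal illustration, assuming the flag was already read into `_PROACTIVE` by the setup snippet; `suggest` is a hypothetical stand-in for the suggestion step.

```shell
#!/bin/sh
# Gate proactive suggestions on the PROACTIVE flag.
# _PROACTIVE would normally come from: bin/config get proactive
_PROACTIVE="false"
suggest() { echo "I think $1 might help here — want me to run it?"; }
if [ "$_PROACTIVE" = "true" ]; then
  echo "auto-invoking skill"          # proactive mode: run it directly
else
  suggest "harness-engineering"       # ask first, then wait for the user
fi
```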
Report status using one of these at the end of every skill session:
Pipeline stage order: THINK → PLAN → REVIEW → BUILD → VERIFY → SHIP → REFLECT
REVIEW is the only human gate. All other stages auto-advance on DONE.
| Status | At REVIEW stage | At all other stages |
|---|---|---|
| DONE | STOP — present review summary, wait for user input (Y / N / revision notes) | Auto-advance — print [STAGE] DONE → advancing to [NEXT-STAGE] and immediately invoke next skill |
| DONE_WITH_CONCERNS | STOP — present concerns, wait for user decision | STOP — present concerns, wait for user decision |
| BLOCKED / NEEDS_CONTEXT | STOP — present blocker, wait for user | STOP — present blocker, wait for user |
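The status dispatch in the table can be sketched as a `case` over status and stage. This is illustrative only; `next_stage` is a hypothetical stage-lookup helper, not part of the skill.

```shell
#!/bin/sh
# Dispatch on STATUS:STAGE per the table above.
STAGE="BUILD"; STATUS="DONE"
next_stage() { echo "VERIFY"; }   # hypothetical stage → next-stage lookup
case "$STATUS:$STAGE" in
  DONE:REVIEW)  echo "STOP — present review summary, wait for Y/N/revision notes" ;;
  DONE:*)       echo "[$STAGE] DONE → advancing to $(next_stage)" ;;
  DONE_WITH_CONCERNS:*|BLOCKED:*|NEEDS_CONTEXT:*)
                echo "STOP — present concerns or blocker, wait for user" ;;
esac
```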
When auto-advancing, print:
[STAGE] DONE → advancing to [NEXT-STAGE] ([skill-name])

When the user sends a follow-up message after a completed session, before doing anything else:
ls docs/superomni/specs/spec-*.md docs/superomni/plans/plan-*.md docs/superomni/ .superomni/ 2>/dev/null | head -20
git log --oneline -3 2>/dev/null
To find the latest spec or plan:
_LATEST_SPEC=$(ls docs/superomni/specs/spec-*.md 2>/dev/null | sort | tail -1)
_LATEST_PLAN=$(ls docs/superomni/plans/plan-*.md 2>/dev/null | sort | tail -1)
Consult the workflow skill for the stage → skill mapping, then announce:
"Continuing in superomni mode — picking up at [stage] using [skill-name]."

When asking the user a question, match the confirmation requirement to the complexity of the response:
| Question type | Confirmation rule |
|---|---|
| Single-choice — user picks one option (A/B/C, 1/2/3, Yes/No) | The user's selection IS the confirmation. Do NOT ask "Are you sure?" or require a second submission. |
| Free-text input — user types a value and presses Enter | The submitted text IS the confirmation. No secondary prompt needed. |
| Multi-choice — user selects multiple items from a list | After the user lists their selections, ask once: "Confirm these selections? (Y to proceed)" before acting. |
| Complex / open-ended discussion — back-and-forth clarification | Collect all input, then present a summary and ask: "Ready to proceed with the above? (Y/N)" before acting. |
Rule: never add a redundant confirmation layer on top of a single-choice or text-input answer.
Custom Input Option Rule: Whenever you present a predefined list of choices (A/B/C, numbered options, etc.), always append a final "Other" option that lets the user describe their own idea:
[last letter/number + 1]) Other — describe your own idea: ___________
When the user selects "Other" and provides their custom text, treat that text as the chosen option and proceed exactly as you would for any other selection. If the custom text is ambiguous, ask one clarifying question before proceeding.
Load context progressively — only what is needed for the current phase:
| Phase | Load these | Defer these |
|---|---|---|
| Planning | Latest docs/superomni/specs/spec-*.md, constraints, prior decisions | Full codebase, test files |
| Implementation | Latest docs/superomni/plans/plan-*.md, relevant source files | Unrelated modules, docs |
| Review/Debug | diff, failing test output, minimal repro | Full history, specs |
If context pressure is high: summarize prior phases into 3-5 bullet points, then discard raw content.
All skill artifacts are written to docs/superomni/ (relative to project root).
See the Document Output Convention in CLAUDE.md for the full directory map.
Agent failures are harness signals — not reasons to retry the same approach:
Use the harness-engineering skill to update the harness before retrying.

It is always OK to stop and say "this is too hard for me." Escalation is expected, not penalized.
After completing any skill session, run a 3-question self-check before writing the final status:
If any answer is NO, address it before reporting DONE. If it cannot be addressed, report DONE_WITH_CONCERNS and name the gap.
For a full performance evaluation spanning the entire sprint, use the self-improvement skill.
_TEL_END=$(date +%s)
_TEL_DUR=$(( _TEL_END - _TEL_START ))
~/.claude/skills/superomni/bin/analytics-log "SKILL_NAME" "$_TEL_DUR" "OUTCOME" 2>/dev/null || true
Nothing is sent to external servers. Data is stored only in ~/.omni-skills/analytics/.
Goal: Design and maintain the agent harness — the scaffolding of environment, context, tools, constraints, evaluation gates, and feedback loops that determine how well agents perform.
"Engineers design the system. Agents execute." — OpenAI Harness Engineering
THE HARNESS IS THE PRODUCT. CODE IS ITS OUTPUT.
A well-designed harness produces reliable, high-quality agent output without requiring manual intervention on every task. When agents fail repeatedly, the correct response is to improve the harness — not to keep retrying the same prompt.
| Principle | What it means in superomni |
|---|---|
| Context is everything | Agents can only work with what they can see — keep docs, specs, and constraints in-repo and up-to-date |
| Fewer, more expressive tools | Prefer composable skills over sprawling tool menus |
| Evaluate relentlessly | Judgment gates must exist at every major transition point |
| Signal-driven iteration | Agent failures are design signals — update the harness, not just the prompt |
| Boring > clever | Prefer simple, composable patterns over novel abstractions |
| Garbage collection | Periodically audit for drift, stale docs, and architectural decay |
Take stock of the current harness state:
# Skill count + structure
ls skills/ | wc -l
ls skills/
# Agent count
ls agents/
# Command count
ls commands/
# Preamble size (context overhead)
wc -l lib/preamble.md
# Skill template sizes (larger = more context pressure)
wc -l skills/*/SKILL.md.tmpl | sort -n | tail -10
# Validation status
npm test 2>/dev/null || bash lib/validate-skills.sh 2>/dev/null
# Recent harness changes
git log --oneline -10 -- lib/ skills/ agents/ commands/
# Any stale/out-of-date docs (not modified in the last 30 days)
find docs/ -name "*.md" -mtime +30 2>/dev/null | head -10
Document findings:
Context window pressure is one of the most common causes of agent degradation. Audit the harness context load:
Review lib/preamble.md:
Target preamble size: < 150 lines. Flag if > 200 lines.
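The size check can be automated with a small classifier. This is a sketch using the thresholds stated above; `classify` is a hypothetical helper name.

```shell
#!/bin/sh
# Classify preamble size: < 150 lines OK, 150-200 warn, > 200 bloated.
classify() {
  if   [ "$1" -gt 200 ]; then echo "BLOATED"
  elif [ "$1" -ge 150 ]; then echo "WARN"
  else                        echo "OK"
  fi
}
# Usage against the real file:
#   classify "$(wc -l < lib/preamble.md)"
classify 120
```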
For each skill > 200 lines, ask:
Does the framework expose only necessary context at each stage?
| Stage | Context needed | Currently loaded |
|---|---|---|
| Planning | spec, constraints | |
| Implementation | plan, code context | |
| Review | diff, standards | |
| Debug | error, minimal repro | |
Good harnesses load context on demand, not all at once.
Per Anthropic's principle: fewer, more expressive tools outperform large menus of narrow ones.
Review the agent's tool access:
# Check allowed-tools across all skills
grep "allowed-tools" skills/*/SKILL.md.tmpl
For each skill, evaluate:
Recommended tool sets by role:
| Role | Minimal tool set |
|---|---|
| Planning / Brainstorming | Read, Write, Glob |
| Implementation | Bash, Read, Write, Edit, Grep, Glob |
| Review / Audit | Read, Grep, Glob |
| Debug | Bash, Read, Grep, Glob |
Flag any skill whose tool set exceeds its role's minimum.
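The comparison against the role minimums can be sketched as a small POSIX-shell function. The allowed-tools value here is illustrative; in practice it would be parsed from each SKILL.md.tmpl.

```shell
#!/bin/sh
# Flag tools in a skill's allowed-tools value that exceed the role minimum.
excess_tools() {  # $1 = role minimum (space-separated), $2 = allowed list (comma-separated)
  for t in $(echo "$2" | tr -d ','); do
    case " $1 " in
      *" $t "*) : ;;                 # tool is within the role minimum
      *)        echo "EXCESS: $t" ;; # tool exceeds the role minimum
    esac
  done
}
excess_tools "Read Grep Glob" "Read, Grep, Glob, Bash"
```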
"Evaluation is the load-bearing part of agent harness design." — OpenAI/Anthropic harness engineering principles
Map every major workflow transition and verify an evaluation gate exists:
| Transition | Evaluation gate | Present? |
|---|---|---|
| Spec → Plan | plan-review skill or planner agent | |
| Plan → Execution | dependency analysis wave plan | |
| Execution Wave → Next Wave | wave verification step | |
| Implementation → Review | code-review skill or code-reviewer agent | |
| Review → Ship | production-readiness skill | |
| Ship → Done | verification skill | |
| Sprint → Next Sprint | self-improvement skill | |
Any gap = harness deficiency. Add missing gates.
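Gate presence can be checked mechanically. This sketch assumes each gate skill lives in a `skills/<name>` directory, matching the layout probed in Phase 1; adjust the names to your repo.

```shell
#!/bin/sh
# Check that each gate skill from the table exists on disk.
check_gates() {
  for gate in plan-review code-review production-readiness verification self-improvement; do
    if [ -d "skills/$gate" ]; then
      echo "PRESENT: $gate"
    else
      echo "MISSING: $gate — harness deficiency"
    fi
  done
}
check_gates
```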
A healthy harness converts agent failures into harness improvements:
Agent fails → Signal captured → Harness updated → Agent retries → Improvement
     ↑                                                                 |
     └─────────────────────────────────────────────────────────────────┘
Check the current feedback paths:
When an agent fails a task repeatedly (3+ attempts), is there a defined process to:
Does the self-improvement skill output get consumed?
ls docs/superomni/improvements/ 2>/dev/null | head -5
Is there a regular cadence for cleaning up:
Recommended: Schedule a harness GC pass after every 5 sprints.
Score the harness on each dimension (1-5):
| Dimension | Score | Key Finding |
|---|---|---|
| Context efficiency | /5 | |
| Tool space minimalism | /5 | |
| Evaluation gate coverage | /5 | |
| Feedback loop completeness | /5 | |
| Documentation freshness | /5 |
Total: __ / 25
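The total is a plain sum of the five dimension scores. A minimal sketch, with placeholder scores:

```shell
#!/bin/sh
# Sum the five dimension scores into the /25 health total.
CTX=4 TOOLS=3 EVAL=5 FEEDBACK=3 DOCS=4   # example scores, one per dimension
TOTAL=$(( CTX + TOOLS + EVAL + FEEDBACK + DOCS ))
echo "Total: $TOTAL / 25"
```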
Scoring guide:
For each finding from Phases 2-5 with a score < 4:
HARNESS IMPROVEMENT [N]: [TITLE]
Dimension: [context | tools | evaluation | feedback | docs]
Finding: [specific issue identified]
Impact: [how this degrades agent performance]
Fix: [concrete change to harness — specific file, section, or process]
Priority: [P0 — blocks agent / P1 — degrades quality / P2 — nice to have]
Generate a prioritized backlog. P0 items must be fixed before the next sprint.
HARNESS_DIR="docs/superomni/harness-audits"
mkdir -p "$HARNESS_DIR"
BRANCH=$(git branch --show-current 2>/dev/null | tr '/' '-')
BRANCH=${BRANCH:-main}   # fall back when not in a git repo or on a detached HEAD
TIMESTAMP=$(date +%Y-%m-%d-%H%M%S)
REPORT_FILE="$HARNESS_DIR/harness-audit-${BRANCH}-${TIMESTAMP}.md"
echo "Saving harness audit to $REPORT_FILE"
Save the full audit report including all scores, findings, and improvement backlog.
HARNESS AUDIT REPORT
════════════════════════════════════════
Branch: [branch]
Date: [date]
Skills / Agents: [N] skills, [N] agents, [N] commands
Preamble size: [N] lines ([OK / BLOATED])
Validation: [PASS / FAIL]
Health score: [N]/25 ([rating])
Top finding: [single most important issue]
P0 improvements: [N]
P1 improvements: [N]
P2 improvements: [N]
Report saved: [docs/superomni/harness-audits/...]
Status: DONE | DONE_WITH_CONCERNS | BLOCKED
════════════════════════════════════════
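The report header above can be written with a heredoc, reusing the variables from the save step. A sketch only: SCORE and TOP_FINDING are placeholder values, and REPORT_FILE defaults to a temp path when the save step has not run.

```shell
#!/bin/sh
# Write the audit summary header, interpolating the audit variables.
BRANCH="main"; SCORE=19; TOP_FINDING="preamble bloat"
REPORT_FILE="${REPORT_FILE:-/tmp/harness-audit-example.md}"
cat > "$REPORT_FILE" <<EOF
HARNESS AUDIT REPORT
Branch: $BRANCH
Health score: $SCORE/25
Top finding: $TOP_FINDING
EOF
echo "Saved: $REPORT_FILE"
```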