# harness-engineer
Expert guide for designing autonomous AI agent systems using harness engineering principles — the full environment of scaffolding, constraints, alignment, and evaluation that makes AI agents production-reliable.

Use this skill whenever the user asks about:
- building AI agent systems or pipelines
- designing context/prompt/scaffolding architecture
- setting up evaluation frameworks or benchmarks for AI agents
- aligning AI behavior with constitutional principles
- measuring AI agent performance (accuracy, latency, cost, safety)
- debugging or improving agent reliability
- any request involving "agents", "harness", "evals", "scaffolding", "LangSmith", "CI/CD for AI", "agentic workflows", "multi-agent systems", or similar
Install:

```shell
npx claudepluginhub lauraflorentin/skills-marketplace --plugin harness-engineer
```

This skill uses the workspace's default tool permissions.
Harness engineering is the discipline of designing the full environment surrounding an AI agent — the scaffolding, constraints, feedback loops, and evaluation pipelines — that makes agentic behavior reliable, observable, and aligned in production.
The model provides the intelligence. The harness provides the control.
- Read references/context-engineering.md for deep guidance on CLAUDE.md / AGENTS.md patterns.
- Read references/alignment.md for Constitutional AI and value hierarchy implementation.
- Read references/evaluation-framework.md for evaluation harness design and benchmarking.
This skill provides the theory and research foundation that informs the plugin's four operational skills:
| Operational Skill | Theory Connection |
|---|---|
| harness-init | Context Engineering — the doc hierarchy and CLAUDE.md patterns scaffolded during init come from these principles |
| harness-doctor | Alignment & Anti-Patterns — doctor diagnoses failures that often stem from architectural drift and constraint violations described here |
| harness-gc | Entropy Management — gc implements the garbage collection agent pattern theorized in context engineering research |
| harness-onboard | Observability — onboard orients new sessions using the tracing and documentation principles from this skill |
Use this skill when you need to understand why the operational skills work the way they do, or when designing new harness patterns from scratch.
Context engineering: ensuring the agent has the right information at the right time.
Core principle: Anything inaccessible to the agent in-context does not exist. Repository = single source of truth.
Layered documentation structure (avoid monolithic instruction files):

- CLAUDE.md / AGENTS.md → Top-level map: project structure, entry points, core beliefs
- design-docs/ → API contracts, state management patterns, UI standards
- exec-plans/ → Feature steps, tech debt trackers, task-specific logic
- .cursorrules → Local constraints: naming conventions, linter configs

Each layer answers "when to use which file": broad orientation lives at the top of the hierarchy, task-specific detail at the bottom.
Constraint architecture: enforcing structural boundaries that prevent agentic drift and help the agent converge on correct solutions faster.
Dependency layering (enforce via pre-commit hooks + structural tests):
Types → Config → Repo → Service → Runtime → UI
Each domain imports only from layers to its left. This is enforced mechanically, not suggested.
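A structural test for this layering can be sketched in a few lines of stdlib Python. The package names and the assumption that each layer lives in a top-level package matching its name are illustrative; adapt both to your repository layout.

```python
"""Minimal sketch of a structural test enforcing dependency layering
(Types -> Config -> Repo -> Service -> Runtime -> UI).
Package names and layout are assumptions, not a prescribed convention."""
import ast

# Left-to-right layer order: each package may import only from layers to its left.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

def layer_of(module: str):
    """Return the layer rank of a dotted module path, or None if unlayered."""
    return RANK.get(module.split(".")[0])

def violations(source: str, current_layer: str) -> list[str]:
    """Collect imports that reach rightward, into a higher layer."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            rank = layer_of(name)
            if rank is not None and rank > RANK[current_layer]:
                bad.append(name)
    return bad

# Example: service code importing from ui is flagged, repo is allowed.
src = "import ui.widgets\nfrom repo import users\n"
print(violations(src, "service"))  # -> ['ui.widgets']
```

Wiring a check like this into a pre-commit hook is what turns the layering from a suggestion into a mechanical guarantee.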
Entropy management: schedule periodic "garbage collection" agents to prune stale documentation, consolidate duplicated instructions, and remove dead context before it accumulates.
LLM-based auditors: Use a second agent (with no generation context) to review the primary agent's outputs for architectural violations before committing.
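The auditor pattern amounts to a gate in front of the commit. In this sketch, `ask_judge` is a hypothetical placeholder for a call to a second, context-free model, and the banned-pattern list stands in for real architectural rules.

```python
"""Sketch of an LLM-based auditor gate: a second agent with no generation
context reviews a diff for architectural violations before commit.
`ask_judge` is a hypothetical stand-in for a real model call."""
from dataclasses import dataclass

@dataclass
class Verdict:
    approved: bool
    violations: list[str]

def ask_judge(diff: str) -> Verdict:
    # Placeholder: in production this would prompt a separate model with
    # only the diff and the architectural rules, never the generation context.
    banned = ["from ui import", "eval("]
    found = [pattern for pattern in banned if pattern in diff]
    return Verdict(approved=not found, violations=found)

def audit_gate(diff: str) -> bool:
    """Return True if the diff may be committed."""
    verdict = ask_judge(diff)
    if not verdict.approved:
        print("Audit failed:", verdict.violations)
    return verdict.approved
```

Keeping the judge free of the generation context is the point: it cannot be anchored by the primary agent's reasoning, only by the artifact itself.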
Evaluation: continuously measuring agent behavior across accuracy, cost, latency, safety, and consistency.
Structure evaluation as a five-stage pipeline, culminating in pass@k for code. Never evaluate on accuracy alone. Use the CLASS framework:
| Metric | What to Measure | Why It Matters |
|---|---|---|
| Cost | Token usage, API expense per query | Economic sustainability |
| Latency | P50 and P99 response times | User experience |
| Accuracy | Exact match, semantic similarity, pass@k | Core utility |
| Stability | Paraphrase consistency, deterministic output | Behavioral reliability |
| Safety | Refusal rates, toxicity, jailbreak resistance | Compliance and risk |
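A release gate over the five CLASS dimensions might look like the following sketch; every threshold value here is illustrative, not a recommendation.

```python
"""Sketch of a CLASS release gate: a run is promotable only if every
dimension clears its threshold. All thresholds are illustrative."""
THRESHOLDS = {
    "cost_usd_per_query": ("max", 0.05),
    "latency_p99_ms": ("max", 4000),
    "accuracy_pass_at_1": ("min", 0.85),
    "stability_paraphrase_consistency": ("min", 0.90),
    "safety_jailbreak_resistance": ("min", 0.99),
}

def class_gate(metrics: dict[str, float]) -> list[str]:
    """Return the failed dimensions; an empty list means promotable."""
    failures = []
    for name, (direction, bound) in THRESHOLDS.items():
        value = metrics[name]
        ok = value <= bound if direction == "max" else value >= bound
        if not ok:
            failures.append(name)
    return failures

run = {
    "cost_usd_per_query": 0.03,
    "latency_p99_ms": 5200,   # too slow: fails the gate
    "accuracy_pass_at_1": 0.91,
    "stability_paraphrase_consistency": 0.93,
    "safety_jailbreak_resistance": 0.995,
}
print(class_gate(run))  # -> ['latency_p99_ms']
```

The all-dimensions-must-pass shape is the operational form of "never evaluate on accuracy alone": one strong metric cannot compensate for a failing one.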
A system is considered production-ready not because it succeeds once, but because it behaves predictably across different inputs, environments, and conditions.
For coding tasks, prefer pass@k over single-attempt accuracy:
$$pass@k = E\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
Where n = total samples, c = correct samples, k = attempts allowed.
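The estimator can be computed exactly per problem with binomial coefficients and averaged across the benchmark; a minimal implementation:

```python
"""Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), averaged over problems.
n = total samples per problem, c = correct samples, k = attempts allowed."""
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts is correct."""
    if n - c < k:  # too few failures to fill all k attempts: success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-problem estimates are averaged across the benchmark suite:
problems = [(10, 3, 5), (10, 0, 5), (10, 10, 5)]  # (n, c, k) triples
score = sum(pass_at_k(n, c, k) for n, c, k in problems) / len(problems)
```

Drawing n larger than k and estimating, rather than sampling exactly k times, keeps the estimate unbiased while reusing the same generations for several values of k.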
For agents operating in regulated or high-stakes domains, implement a value priority hierarchy:
| Priority | Value | Practical Application |
|---|---|---|
| Tier 1 | Safety & Oversight | Human control preserved; no autonomous irreversible actions |
| Tier 2 | Ethical Behavior | Honest, harm-avoiding, transparent about reasoning |
| Tier 3 | Compliance | Follows org guidelines, legal constraints, domain regulations |
| Tier 4 | Helpfulness | Meets user goals effectively within above constraints |
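One way to operationalize the hierarchy is to run checks in strict tier order, so a lower tier can never override a higher one. The check functions below are hypothetical placeholders for real policy logic.

```python
"""Sketch of value-hierarchy resolution: checks run in strict tier order,
and the first failing tier blocks the action. Check logic is illustrative."""

def safety_check(action: dict) -> bool:
    # Tier 1: no autonomous irreversible actions without human approval.
    return not (action.get("irreversible") and not action.get("human_approved"))

def ethics_check(action: dict) -> bool:
    # Tier 2: honest and harm-avoiding (placeholder rule).
    return not action.get("deceptive", False)

def compliance_check(action: dict) -> bool:
    # Tier 3: org guidelines and legal constraints (placeholder rule).
    return action.get("policy_ok", True)

TIERS = [("safety", safety_check), ("ethics", ethics_check),
         ("compliance", compliance_check)]

def authorize(action: dict) -> tuple[bool, str]:
    """Return (allowed, reason). Helpfulness (Tier 4) applies only if all pass."""
    for name, check in TIERS:
        if not check(action):
            return False, f"blocked by {name} tier"
    return True, "allowed"
```

Because the loop short-circuits, helpfulness never gets a vote when a higher tier objects, which is exactly the ordering the table prescribes.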
Self-alignment loop (RLAIF pattern): the agent critiques its own outputs against the constitution, revises them, and the revised responses train a preference model that drives reinforcement learning from AI feedback.
Constitutional AI produces a Pareto improvement: the agent becomes simultaneously more helpful and more harmless, rather than trading one for the other.
Use nested span tracing (e.g., LangSmith) to visualize the full trajectory of multi-turn agentic workflows:
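Platforms like LangSmith record these spans automatically; the underlying idea fits in a few lines of stdlib Python, shown here as an illustrative toy tracer rather than any real SDK.

```python
"""Toy nested-span tracer illustrating trajectory visualization for
multi-turn workflows. Production systems would use a platform such as
LangSmith or OpenTelemetry instead of this sketch."""
import time
from contextlib import contextmanager

SPANS: list[dict] = []  # records appear in chronological start order
_depth = 0

@contextmanager
def span(name: str):
    """Record a named span; nesting depth reflects the call hierarchy."""
    global _depth
    record = {"depth": _depth, "name": name, "ms": None}
    SPANS.append(record)
    _depth += 1
    start = time.perf_counter()
    try:
        yield
    finally:
        _depth -= 1
        record["ms"] = (time.perf_counter() - start) * 1000

# One agent turn with two nested sub-steps:
with span("agent_turn"):
    with span("retrieve_context"):
        pass  # e.g. vector-store lookup
    with span("model_call"):
        pass  # e.g. LLM completion

for rec in SPANS:
    print("  " * rec["depth"] + f"{rec['name']} ({rec['ms']:.1f} ms)")
```

Indenting by depth reproduces the tree view a tracing UI gives you: one glance shows which sub-step of which turn dominated the latency budget.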
Treat AI updates with the same rigor as software releases:
| Mode | Context | Purpose |
|---|---|---|
| Offline | Pre-release / dev | Regression testing against golden sets |
| Online | Production | Real-time quality drift detection |
| Simulation | Historical data | Forecast on thousands of real tickets |
CLAUDE.md template:

```markdown
# [Project Name]
## What this is
[One paragraph: purpose, stack, key constraints]
## Entry points
- Main: [path]
- Tests: [path]
- Config: [path]
## Core beliefs
- [Principle 1: e.g., "Service layer never calls UI layer"]
- [Principle 2: e.g., "All external calls are wrapped in retry logic"]
## Design docs
> See design-docs/ for API contracts and architecture decisions
> See exec-plans/ for current feature implementation steps
```
Example evaluation harness configuration:

```json
{
  "task_name": "customer-support-agent",
  "model": "claude-sonnet-4-20250514",
  "metrics": ["exact_match", "semantic_f1", "latency_p99", "refusal_rate"],
  "evaluation_modes": ["offline", "online"],
  "golden_set": "evals/golden-set-v2.jsonl",
  "judge_model": "claude-opus-4-20250514",
  "positional_swap": true,
  "verbosity_penalty": true
}
```
See also:
- harness-init, harness-doctor, harness-gc, harness-onboard for applying these principles.
- ../../agents/orchestrator.md for autonomous multi-skill workflows.