Help us improve
Share bugs, ideas, or general feedback.
From hatch3r
Eval-driven development workflow for shipping AI features: write eval before prompt, measure, iterate, ship with caching, cost telemetry, model fallback, and hallucination SLI.
npx claudepluginhub hatch3r/hatch3rHow this skill is triggered — by the user, by Claude, or both
Slash command
/hatch3r:hatch3r-ai-featureThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Run this skill before shipping any LLM-driven feature. It defines the canonical eval-driven loop (write eval, write prompt, measure, iterate) and the production-readiness gates. Skipping any of the 9 steps = the feature is not done.
Audits pre-launch AI features across 6 dimensions—model selection, data quality, cost, monitoring, failure UX, optimization—grading readiness and blocking shipment of broken products.
Use this skill when the user asks to "design an eval suite", "build evals for my AI feature", "create an evaluation framework", "how do I evaluate my AI", "what evals should I run", "build an eval system", or wants to create a systematic evaluation framework for an AI-powered product feature. Typically run after error-analysis has identified the failure categories to prioritize.
Share bugs, ideas, or general feedback.
Run this skill before shipping any LLM-driven feature. It defines the canonical eval-driven loop (write eval, write prompt, measure, iterate) and the production-readiness gates. Skipping any of the 9 steps = the feature is not done.
This skill is the implementation counterpart to rules/hatch3r-ai-evals.md (backend governance) and rules/hatch3r-ai-ux-patterns.md (UI governance). The rules define the bar; this skill defines the route to clearing the bar.
Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per agents/shared/user-question-protocol.md. Do not proceed under silent assumption. Default path, not an exception. Triggers for THIS skill: task class (classification vs open-ended vs RAG vs agentic), model pin (Sonnet vs Opus vs Haiku), eval threshold values, budget per request (cost cap), and fallback policy (graceful degrade vs hard fail).
This skill delegates per task size:
Never under-fan-out to save tokens. Token cost is dominated by quality and completeness gains. Emit sub_agents_spawned: { count, rationale } in your output.
evals/<feature>/golden.jsonl with input + expected_output (or a graded rubric when the task is open-ended).evals/<feature>/thresholds.json. Without an explicit threshold, "passing the eval" is undefined.rules/hatch3r-ai-evals.md Golden Dataset Versioning for filename and refresh policy.Match the task class to the tool:
Pin the choice in evals/README.md so the next agent run picks the same tool.
prompts/<feature>/v1.md with frontmatter { id, version: 1, model_pinned, eval_set }.evals/<feature>/thresholds.json.cache_control breakpoints (or rely on OpenAI's automatic prefix cache for ≥1024-token deterministic prefixes). Longest-TTL block first.npx promptfoo eval (or the chosen tool's CLI) against the golden set.v2.md, re-hash, re-run.thresholds.json and the pairwise win-rate vs the prior version is >=55%.model, tokens_in, tokens_out, cache_hit, cached_tokens, cost_usd, latency_ms, prompt_version, prompt_hash, cost_center.gen_ai.* attributes).skills/hatch3r-observability-verify for the per-feature dashboard checklist.rules/hatch3r-resilience-patterns.md (Slice 8) for the primitives.**/prompts/**, **/rag/**, **/ai/**, **/llm/**.evals/<feature>/thresholds.json.First 24 hours after deploy, monitor:
ai.hallucination_rate — SLO <5% on golden set; alert if 7-day rolling rate >5%.ai.refusal_rate — track false-positive refusal rate separately.ai.cost_per_request_usd — p50/p95/p99 vs feature budget; alert at 50%/75%/90% of monthly budget.ai.latency_ms — first-token-latency p95 + total-response-latency p99.ai.cache_hit_ratio — should match the dev-environment baseline within 10%; a drop indicates prefix drift.ai.tokens_per_request — p95 should be within 20% of the eval-time distribution; a spike signals retrieval growth or prompt drift.Cross-reference skills/hatch3r-observability-verify.
evals/<feature>/edge.jsonl.All 9 steps complete = the AI feature is "done". Anything less = not done. The orchestrator running this skill emits a single-line verdict per step (STEP_N: PASS|FAIL <evidence-path>) and aggregates them. One FAIL on any step blocks release.
Evidence paths point at concrete artifacts: the golden set (evals/<feature>/golden.jsonl), the prompt version (prompts/<feature>/v<N>.md), the eval report (evals/<feature>/report-<run-id>.json), and the dashboard URL for production SLI verification. Verdicts without evidence paths are not accepted by the gate.
hatch3r-implementer finishes the surrounding non-AI feature code, before hatch3r-qa-validation.rules/hatch3r-ai-evals.md — backend governance (eval, cost, caching, fallback, SLI).rules/hatch3r-ai-ux-patterns.md — frontend UX patterns (streaming, tool-call cards, citations).skills/hatch3r-ui-ux-verify/SKILL.md — UI verification gate for AI surfaces.skills/hatch3r-observability-verify — observability wiring checklist.rules/hatch3r-resilience-patterns.md (Slice 8) — circuit-breaker + retry primitives reused in the fallback chain.promptfoo.devgithub.com/confident-ai/deepevaldocs.ragas.iogithub.com/UKGovernmentBEIS/inspect_aidocs.anthropic.com/en/docs/build-with-claude/prompt-cachingopentelemetry.io/docs/specs/semconv/gen-ai/gorilla.cs.berkeley.edu/leaderboard.html