From Harness Base
Builds and iterates evals for LLM/ML output. Use whenever changing a prompt, model, retrieval config, or tool surface, or when an output's correctness is judgmental rather than deterministic. Arm the Stop gate with eval = true in .claude/HARNESS.toml.
npx claudepluginhub camilool8/harness-engineering-templates --plugin harness-base
Slash command: /harness-base:methodology-eval-driven
Evals are the unit test for judgmental output. They are how you know a prompt or model change improved rather than regressed behavior.
To arm the gate, add this to .claude/HARNESS.toml (create the file if needed):
[harness]
eval = true
While armed, the eval-gate.sh Stop hook runs the fast eval subset before "done"
is allowed and blocks on failure. It is a no-op until evals/run.sh exists, so
arming it early does not punish you.
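The gate behavior described above can be sketched as a small decision function. This is only an illustration in Python, not the actual eval-gate.sh hook: the name `gate_allows_done` and its parameters are hypothetical, and the caller is assumed to have already read `eval = true` from .claude/HARNESS.toml.

```python
import os
import subprocess
from pathlib import Path


def gate_allows_done(armed: bool, run_script: Path) -> bool:
    """Return True when "done" may proceed (hypothetical helper, not the real hook)."""
    if not armed:
        # eval = true is not set, so the gate does nothing.
        return True
    if not run_script.exists():
        # No-op until evals/run.sh exists.
        return True
    # Run the fast subset, signalling it through both the flag and the env var.
    env = {**os.environ, "EVAL_FAST": "1"}
    result = subprocess.run(["sh", str(run_script), "--fast"], env=env)
    # Exit 0 means pass; any non-zero exit blocks "done".
    return result.returncode == 0
```

The two early returns are what make arming the gate early harmless: nothing runs, and nothing blocks, until a run script is present.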
Do not invent evaluators up front. LLM failure modes are open-ended; you cannot enumerate them in advance.
When a real failure shows up, add it to evals/golden.jsonl with source: observed-failure:<ref>, and write the cheapest evaluator that would have caught it.
Assertions (must_contain / must_not_contain / regex). Run on every change.
Model-graded rubric. Use for open-ended quality that assertions cannot express.
Escalate a check to the next tier only when the cheaper tier genuinely cannot judge it.
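As a rough illustration of the cheapest tier, the assertion checks named above could look like the following. This is a minimal sketch: the function name `check_assertions` is hypothetical, and storing `must_contain`, `must_not_contain`, and `regex` as keys on the row is an assumption, not something the dataset format prescribes.

```python
import re


def check_assertions(output: str, row: dict) -> list[str]:
    """Return a list of failure messages; an empty list means the row passed."""
    failures = []
    # Assumed row keys: lists of literal substrings.
    for needle in row.get("must_contain", []):
        if needle not in output:
            failures.append(f"missing required text: {needle!r}")
    for needle in row.get("must_not_contain", []):
        if needle in output:
            failures.append(f"found forbidden text: {needle!r}")
    # Assumed row key: a single regular expression that must match somewhere.
    pattern = row.get("regex")
    if pattern is not None and re.search(pattern, output) is None:
        failures.append(f"no match for regex: {pattern!r}")
    return failures
```

Checks like these are deterministic and cost nothing to run, which is why they suit the run-on-every-change tier.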
The model-graded judge MUST be a different model family than the generator under test. Same-family judging inflates scores and shares blind spots. If the generator is Claude, the judge is GPT / Gemini / Llama / etc. Verify this before trusting any model-graded number.
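One way to verify the cross-family rule before trusting a score is a guard like the one below. It is a sketch under assumptions: the prefix-to-family table and both function names are hypothetical, and real model identifiers may need a different mapping.

```python
# Hypothetical mapping from model-name prefix to model family.
FAMILY_PREFIXES = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
    "llama": "meta",
}


def model_family(model_name: str) -> str:
    """Look up the family for a model name by its prefix."""
    name = model_name.lower()
    for prefix, family in FAMILY_PREFIXES.items():
        if name.startswith(prefix):
            return family
    # Refuse to guess: an unknown model cannot be shown to be cross-family.
    raise ValueError(f"unknown model family for {model_name!r}")


def assert_cross_family(generator: str, judge: str) -> None:
    """Raise if the judge shares a family with the generator under test."""
    if model_family(generator) == model_family(judge):
        raise ValueError(
            f"judge {judge!r} is in the same family as generator {generator!r}"
        )
```

Calling `assert_cross_family` at the start of a model-graded run fails loudly instead of producing an inflated number.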
The gate expects an evals/run.sh with this contract: exit 0 = pass, non-zero =
fail; honor --fast / EVAL_FAST=1 to run a quick subset. The dataset is
evals/golden.jsonl, one JSON object per line:
{"input": "...", "expected": "...", "rubric": "...", "source": "observed-failure:#123"}
Fast subset: EVAL_FAST=1 evals/run.sh --fast. The eval-gate.sh Stop hook runs this automatically and blocks "done" on failure.
Full suite: evals/run.sh. Run in CI and before shipping a model/prompt change.
golden.jsonl rows: the dataset is the durable asset.
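The run contract above (exit code, fast mode, one JSON object per line) could be honored by a small Python runner that evals/run.sh delegates to. This is a sketch, not the plugin's implementation: `FAST_LIMIT`, the `generate` callable, and treating `expected` as a required substring are all assumptions made for illustration.

```python
import json
import sys

FAST_LIMIT = 20  # hypothetical size of the quick subset


def load_rows(path: str) -> list[dict]:
    """Read golden.jsonl: one JSON object per non-empty line."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def is_fast(argv: list[str], env: dict) -> bool:
    """Fast mode is requested by either --fast or EVAL_FAST=1."""
    return "--fast" in argv or env.get("EVAL_FAST") == "1"


def run(rows: list[dict], generate, fast: bool) -> int:
    """Return the process exit code: 0 = pass, 1 = at least one failure."""
    selected = rows[:FAST_LIMIT] if fast else rows
    failed = 0
    for row in selected:
        output = generate(row["input"])
        # Assumption: "expected" is checked as a plain substring.
        if row["expected"] not in output:
            failed += 1
            print(f"FAIL {row.get('source', '?')}", file=sys.stderr)
    return 0 if failed == 0 else 1
```

A wrapper script would then end with something like `sys.exit(run(load_rows("evals/golden.jsonl"), generate, is_fast(sys.argv[1:], os.environ)))`, where `generate` calls the system under test.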