From foundry
Calibration testing for agents and skills. Generates synthetic problems with known outcomes (quasi-ground-truth), runs targets against them, and measures recall, precision, and confidence calibration — revealing whether self-reported confidence scores track actual quality.
```shell
npx claudepluginhub borda/ai-rig --plugin foundry
```
<objective>
Validate agents and skills by measuring their outputs against synthetic problems with defined ground truth. The primary signal is calibration bias — the gap between self-reported confidence and actual recall. A well-calibrated agent reports 0.9 confidence when it genuinely finds ~90% of issues. A miscalibrated one may report 0.9 while only finding 60%.
Calibration data drives the improvement loop: systematic gaps become instruction updates; persistent overconfidence adjusts effective re-run thresholds stored in MEMORY.md.
$ARGUMENTS: {all|agents|skills|routing|communication|<name>} [fast|full] [ab] [apply]
Target (default all):
- all — all agents + relevant skills + routing + communication + all rules (i.e., everything)
- agents — all agents only
- skills — calibratable skills only (/audit, /oss:review (requires oss plugin))
- routing — routing accuracy test: measures how accurately a general-purpose orchestrator selects the correct subagent_type for synthetic task prompts (not a per-agent quality benchmark; included in all)
- communication — handover + team protocol compliance: runs self-mentor against synthetic agent responses and team transcripts with injected protocol violations (missing JSON envelope, missing summary, AgentSpeak v2 breaches); included in all
- rules — rule adherence test: for each global rule file (no paths:) and each path-scoped rule when a matching file is in context, generates synthetic tasks that should trigger the rule's key directives and measures whether a general-purpose agent with the rule loaded correctly applies them; reports rules that are ignored, misapplied, or redundant with each other; included in all
- <agent-name> — single agent (e.g., sw-engineer)
- /audit or /oss:review — single skill

Mode (default fast):
- fast — 3 problems per target
- full — 10 problems per target

Flags:
- ab (optional): also run a general-purpose baseline and report delta metrics
- apply (optional):
  - with fast or full: run the calibration benchmark, then immediately apply the new proposals at the end
  - without fast/full: skip the benchmark; apply proposals from the most recent past run

Every invocation surfaces a report: benchmark runs print the new results; bare apply prints the saved report from the last run before applying any changes.
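A few representative invocations (illustrative only — the agent name is an example):

```
/calibrate fast                  # all targets, 3 problems each
/calibrate agents full           # all agents, 10 problems each
/calibrate sw-engineer fast ab   # one agent, with a general-purpose baseline
/calibrate skills full apply     # benchmark skills, then auto-apply proposals
/calibrate apply                 # apply proposals from the most recent past run
```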
Results log: .claude/logs/calibrations.jsonl. Domain tables per mode: see modes/agents.md, modes/skills.md, modes/routing.md, modes/communication.md, modes/rules.md.
Task hygiene: Before creating tasks, call TaskList. Mark each found task:
- completed if the work is clearly done
- deleted if orphaned / no longer relevant
- in_progress only if genuinely continuing

Task tracking: create tasks at the start of execution (Step 1) for each phase that will run:
From $ARGUMENTS, determine:
- all or omitted → all agents + /audit + /oss:review + routing + communication + all rules
- agents → all agents only (the full agent list in modes/agents.md)
- skills → /audit and /oss:review only
- routing → routing accuracy test only
- communication → handover + team protocol compliance only
- rules → rule adherence test (all rule files in .claude/rules/) only
- fast or full in remaining tokens — default fast
- ab present → also spawn a general-purpose baseline per problem
- apply without fast/full → pure apply mode: skip Steps 2–5; go directly to Step 6
- apply with fast/full → benchmark + auto-apply: run Steps 2–5, then continue to Step 6

If the benchmark will run (i.e., fast or full is present, with or without apply): generate a timestamp YYYY-MM-DDTHH-MM-SSZ (UTC, e.g. 2026-03-03T13-44-48Z) via date -u +%Y-%m-%dT%H-%M-%SZ. All run dirs use this timestamp.
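The token resolution above can be sketched in shell (a minimal sketch — the skill performs this parsing natively; variable names here are assumptions, not part of the spec):

```shell
# Hypothetical sketch of $ARGUMENTS token resolution
ARGUMENTS="skills full ab apply"        # example input
TARGET=all; MODE=fast; AB=0; APPLY=0    # defaults
for TOK in $ARGUMENTS; do
  case "$TOK" in
    all|agents|skills|routing|communication|rules) TARGET=$TOK ;;
    fast|full) MODE=$TOK ;;
    ab)    AB=1 ;;
    apply) APPLY=1 ;;
    *)     TARGET=$TOK ;;  # single agent or skill name, e.g. sw-engineer
  esac
done
echo "target=$TARGET mode=$MODE ab=$AB apply=$APPLY"
```

With the example input, this resolves to benchmark + auto-apply mode, since apply appears alongside full.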
Create tasks before proceeding:
- fast/full (no apply): TaskCreate "Calibrate agents" (if target includes agents), TaskCreate "Calibrate skills" (if target includes skills), TaskCreate "Calibrate routing" (if target includes routing), TaskCreate "Calibrate communication" (if target includes communication), TaskCreate "Calibrate rules" (if target includes rules), TaskCreate "Analyse and report"
- fast/full + apply: all of the above, plus TaskCreate "Apply findings"
- apply (no fast/full): TaskCreate "Apply findings" only

For each target mode in the resolved target list, read the corresponding mode file and execute its spawn instructions. Issue ALL spawns in a single response — modes are independent and run concurrently.
| Target mode | Mode file | Task to mark in_progress |
|---|---|---|
| agents | .claude/skills/calibrate/modes/agents.md | "Calibrate agents" |
| skills | .claude/skills/calibrate/modes/skills.md | "Calibrate skills" |
| routing | .claude/skills/calibrate/modes/routing.md | "Calibrate routing" |
| communication | .claude/skills/calibrate/modes/communication.md | "Calibrate communication" |
| rules | .claude/skills/calibrate/modes/rules.md | "Calibrate rules" |
Each mode file defines <TARGET>, <DOMAIN>, any N overrides, and extra instructions for the pipeline subagent. The pipeline template lives at .claude/skills/calibrate/templates/pipeline-prompt.md. N override: communication caps at fast=3 / full=5 (not the global FULL_N=10) to prevent pipeline context overflow — read modes/communication.md for details. rules mode spawns one general-purpose subagent per rule file (not the standard pipeline template) — read modes/rules.md for the direct-spawn approach.
Health monitoring — apply the protocol from CLAUDE.md §8. Run dir for liveness checks: .reports/calibrate/<TIMESTAMP>/<TARGET>/. Constants below tighten the global defaults for this skill:
```shell
# Initialise checkpoints after all pipeline spawns
LAUNCH_AT=$(date +%s)
for TARGET in <target-list>; do touch /tmp/calibrate-check-$TARGET; done

# Use extended timeout for dual-source runs (Codex active in agents/skills modes)
EFFECTIVE_TIMEOUT_MIN=$PIPELINE_TIMEOUT_MIN
for T in agents skills; do [ "$TARGET" = "$T" ] && EFFECTIVE_TIMEOUT_MIN=$PIPELINE_TIMEOUT_MIN_DUAL; done

# Every HEALTH_CHECK_INTERVAL_MIN (5 min): check each still-running pipeline
NEW=$(find .reports/calibrate/<TIMESTAMP>/$TARGET/ -newer /tmp/calibrate-check-$TARGET -type f 2>/dev/null | wc -l | tr -d ' ')  # tr -d strips leading spaces from wc -l on macOS
touch /tmp/calibrate-check-$TARGET
ELAPSED=$(( ($(date +%s) - LAUNCH_AT) / 60 ))
if [ "$NEW" -gt 0 ]; then
  echo "✓ $TARGET active"
elif [ "$ELAPSED" -ge "$EFFECTIVE_TIMEOUT_MIN" ]; then
  echo "⏱ $TARGET TIMED OUT (hard limit)"
elif [ "$ELAPSED" -ge "$HEALTH_CHECK_INTERVAL_MIN" ]; then
  OUTPUT_FILE=".reports/calibrate/<TIMESTAMP>/$TARGET/pipeline.jsonl"
  if tail -20 "$OUTPUT_FILE" 2>/dev/null | grep -qi 'delay\|wait\|slow'; then
    echo "⏸ $TARGET: extension granted (+5 min)"
  else
    echo "⏱ $TARGET TIMED OUT"
  fi
fi
```
On timeout: read the last 100 lines of the pipeline output (tail -100 <output_file>) to recover any partial JSON; if none is found, use: {"target":"<TARGET>","verdict":"timed_out","mean_recall":null,"gaps":["pipeline timed out — re-run individually with /calibrate <target> fast"]}. Timed-out targets appear in the report with a ⏱ prefix and null metrics.
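The salvage step above can be sketched as follows (a minimal sketch — the temp file stands in for the pipeline output, and the target name is hypothetical):

```shell
# Salvage partial JSON from a timed-out pipeline; fall back to a synthetic record
OUTPUT_FILE=$(mktemp)   # stands in for .reports/calibrate/<TIMESTAMP>/<TARGET>/pipeline.jsonl
PARTIAL=$(tail -100 "$OUTPUT_FILE" 2>/dev/null | grep -m1 '^{' || true)
if [ -z "$PARTIAL" ]; then
  # No recoverable JSON: emit the synthetic timed_out record
  PARTIAL='{"target":"agents","verdict":"timed_out","mean_recall":null,"gaps":["pipeline timed out — re-run individually with /calibrate agents fast"]}'
fi
echo "$PARTIAL"
```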
After all pipeline subagents have completed or timed out: mark "Calibrate agents", "Calibrate skills", "Calibrate routing", "Calibrate communication", and "Calibrate rules" completed (whichever ran). Mark "Analyse and report" in_progress. Parse the compact JSON summary from each.
Print the combined benchmark report:
## Calibrate — <date> — <MODE>
| Target | Recall | SevAcc | Fmt | Confidence | Bias | F1 | Scope | Verdict | Top Gap |
|------------------|--------|--------|------|------------|---------|------|-------|------------|----------------------|
| sw-engineer | 0.83 | 0.91 | 0.87 | 0.85 | +0.02 ✓ | 0.81 | 0 ✓ | calibrated | async error paths |
| ... | | | | | | | | | |
*Recall: in-scope issues found / total. SevAcc: severity match rate for found issues (±1 tier) — high recall + low SevAcc = issues found but misprioritized. Fmt: fraction of found issues with location + severity + fix (actionability). Bias: confidence − recall (+ = overconfident). Scope: FP on out-of-scope input (0 ✓).*
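As a worked example of the Bias column (scores invented for illustration), computed with awk:

```shell
# Bias = confidence − recall; |bias| > 0.15 is the flag threshold used in this report
awk 'BEGIN {
  confidence = 0.90; recall = 0.62   # hypothetical scores
  bias = confidence - recall
  flag = (bias > 0.15 || bias < -0.15) ? "overconfident" : "ok"
  printf "bias=%+.2f (%s)\n", bias, flag
}'
```

A positive bias means the target reports more confidence than its recall earns; this example would be flagged ⚠.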
If AB mode, add ΔRecall, ΔSevAcc, ΔFmt, ΔTokens, and AB Verdict columns after F1. ΔTokens = token_ratio − 1.0 (negative = specialist more concise).
| Target | Recall | SevAcc | Fmt | Bias | F1 | ΔRecall | ΔSevAcc | ΔFmt | ΔTokens | Scope | AB Verdict |
|-------------|--------|--------|------|---------|------|---------|---------|-------|---------|-------|------------|
| sw-engineer | 0.83 | 0.91 | 0.87 | +0.02 ✓ | 0.81 | +0.05 ~ | +0.12 ✓ | +0.15 ✓ | −0.18 ✓ | 0 ✓ | marginal ~ |
*ΔRecall/ΔSevAcc/ΔFmt: specialist − general (positive = specialist better). ΔTokens: token_ratio − 1.0 (negative = more focused). AB Verdict covers ΔRecall and ΔF1 only; use ΔSevAcc and ΔFmt as supplementary evidence for agents where ΔRecall ≈ 0.*
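The delta columns follow the same arithmetic; a hypothetical row (numbers invented):

```shell
# ΔRecall = specialist − general; ΔTokens = token_ratio − 1.0
awk 'BEGIN {
  spec_recall = 0.83; gen_recall = 0.78   # hypothetical recalls
  token_ratio = 0.82                      # specialist tokens / general tokens
  printf "dRecall=%+.2f dTokens=%+.2f\n", spec_recall - gen_recall, token_ratio - 1.0
}'
```

Here ΔRecall of +0.05 sits at the marginal boundary, while the negative ΔTokens shows the specialist was more concise than the baseline.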
If target is routing: read modes/routing.md "Report format" section and use that table instead of the standard table above. Mark "Calibrate routing" completed.
Flag any target where recall < 0.70 or |bias| > 0.15 with ⚠.
After the table, print the full content of each proposal.md for targets where proposed_changes > 0.
If apply was not set, print:
→ Review proposals above, then run `/calibrate <targets> [fast|full] apply` to apply them.
→ Proposals saved to: .reports/calibrate/<TIMESTAMP>/<TARGET>/proposal.md
If apply was set (benchmark + auto-apply mode), print → Auto-applying proposals now… and proceed to Step 6.
Targets with verdict calibrated and no proposed changes get a single line: ✓ <target> — no instruction changes needed.
Append each target's result line to .claude/logs/calibrations.jsonl using native tools (no Bash needed):
- Glob for */result.jsonl under .reports/calibrate/<TIMESTAMP>/ to find all result files.
- Read each result file and the existing .claude/logs/calibrations.jsonl (if it exists; use an empty string if missing).
- Write the combined lines back to .claude/logs/calibrations.jsonl.

For each flagged target (recall < 0.70 or |bias| > 0.15):
→ Update <target> <antipatterns_to_flag> for: <gaps from result>
→ Raise effective re-run threshold for <target> in MEMORY.md (default 0.70 → ~<mean_confidence>)
→ <target> is conservative; threshold can stay at default

Proposals shown in Step 3 already surface the actionable signals. If apply was not set, end with:
→ Run /calibrate <target> [fast|full] apply to run a fresh benchmark and apply proposals.
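The log-append step above can be sketched in shell (illustration only — the skill itself uses native tools, not Bash, and the temp paths stand in for the real ones):

```shell
# Concatenate every */result.jsonl from a run dir onto the calibrations log
RUN_DIR=$(mktemp -d)   # stands in for .reports/calibrate/<TIMESTAMP>/
mkdir -p "$RUN_DIR/agents" "$RUN_DIR/skills"
printf '%s\n' '{"target":"agents","verdict":"calibrated"}'    > "$RUN_DIR/agents/result.jsonl"
printf '%s\n' '{"target":"skills","verdict":"overconfident"}' > "$RUN_DIR/skills/result.jsonl"
LOG=$(mktemp)          # stands in for .claude/logs/calibrations.jsonl
find "$RUN_DIR" -name result.jsonl -exec cat {} + >> "$LOG"
wc -l < "$LOG"         # one appended line per target
```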
Mark "Analyse and report" completed. If apply was set: proceed to Step 6.
Mark "Apply findings" in_progress.
Determine run directory:
- fast/full + apply: use the TIMESTAMP already generated in Step 1 — proposals were just written by Steps 2–5.
- apply, no fast/full: find the most recent run:

```shell
LATEST=$(ls -td .reports/calibrate/*/ 2>/dev/null | head -1)
TIMESTAMP=$(basename "$LATEST")
```
For each target in the target list, check whether .reports/calibrate/<TIMESTAMP>/<target>/proposal.md exists. Collect the set of targets that have a proposal (found) and those that don't (missing).
Print ⚠ No proposal found for <target> — run /calibrate <target> [fast|full] first for each missing target.
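The found/missing partition can be sketched as follows (hypothetical target list; the temp dir with one simulated proposal stands in for the real run directory):

```shell
# Split targets by presence of proposal.md under the run directory
RUN=$(mktemp -d)   # stands in for .reports/calibrate/<TIMESTAMP>/
mkdir -p "$RUN/agents" && : > "$RUN/agents/proposal.md"   # simulate one found proposal
FOUND=""; MISSING=""
for T in agents skills routing; do
  if [ -f "$RUN/$T/proposal.md" ]; then FOUND="$FOUND $T"; else MISSING="$MISSING $T"; fi
done
echo "found:$FOUND missing:$MISSING"
```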
Print the run's report before applying: for each found target, read and print .reports/calibrate/<TIMESTAMP>/<target>/report.md verbatim so the user sees the benchmark basis before any file is changed.
Spawn one general-purpose subagent per found target. Issue ALL spawns in a single response — no waiting between spawns.
Each subagent receives this self-contained prompt (substitute <TARGET>, <PROPOSAL_PATH>, <AGENT_FILE>):
Read the proposal file at <PROPOSAL_PATH> and apply each "Change N" block to <AGENT_FILE> (or the skill file if the target is a skill).
For each change:
- Print Applying Change N to <file> [<section>].
- Edit with old_string = the Current text verbatim and new_string = the Proposed text.
- If Current is "none" (a new insertion): find the section header and insert the Proposed text after the last item in that block.
- If the Current text is not found in the file: print ⚠ Skipped — current text not found.
- If the change is already present: print ✓ Already applied — skipped.

After processing all changes return only this compact JSON:
{"status":"done","target":"<TARGET>","applied":N,"skipped":N,"file":"<AGENT_FILE>","summary":"Applied N, skipped N changes to <AGENT_FILE>"}
After all subagents complete, collect their JSON results and print the final summary:
## Fix Apply — <date>
| Target | File | Applied | Skipped |
|-------------|-------------------------------|---------|---------|
| sw-engineer | .claude/agents/sw-engineer.md | 2 | 0 |
→ Run /calibrate <targets> to verify improvement.
Mark "Apply findings" completed.
End your response with a ## Confidence block per CLAUDE.md output standards.
Notes:
- Timed-out pipelines are recorded with verdict:"timed_out"; re-run them individually with /calibrate <target> fast after the session.
- Context budget: all full ab with all targets (agents + skills + routing + communication + rules) returns ~5KB total, well within limits.
- Scorers run as general-purpose subagents. Each scorer reads response files from disk and returns ~200 bytes. Phase 3b runs Codex scorers sequentially via Bash (writes per-problem files). Phase 3c merges both into scores.json. The pipeline holds only compact JSONs regardless of N or A/B mode — no context budget concern.
- general-purpose is a built-in Claude Code agent type (no .claude/agents/general-purpose.md file needed) — no custom system prompt, all tools available.
- Adversarial problems (templates/pipeline-prompt.md Phase 1a) test false-positive discipline and reveal upper-bound limits. Remaining gap: synthetically generated adversarial problems are weaker than expert-authored ones; generator_recall_delta surfaces whether one generator's problems are systematically easier or harder. ceiling_recall (reported separately from mean_recall) is the primary signal for upper-bound performance — partial recall (0.4–0.7) on ceiling problems is expected and does not affect the calibration verdict.
- scorer_agreement measures the fraction of issues where both scorers agreed — low agreement (< SCORER_AGREEMENT_WARN=0.70) flags ambiguous ground truth or scorer blind spots. Severity disputes (scorers disagree by >1 tier) are excluded from the SevAcc aggregate.
- result.jsonl includes "scoring":"dual|single" and "source_mode":"dual|claude-only". When analyzing trends in calibrations.jsonl, filter by these fields — dual-scored results are not directly comparable to single-scored baselines.
- Remediation signals: update <antipatterns_to_flag> | consistent low recall → consider a model tier upgrade (sonnet → opus) | large calibration bias → document the adjusted threshold in MEMORY.md | re-calibrate after instruction changes to quantify improvement.
- apply (no fast/full) prints the saved report from the last run before applying, so the user always sees the basis for any changes before files are touched.
- apply semantics: fast apply / full apply = run a fresh benchmark, then auto-apply the new proposals in one go. apply alone (no fast/full) = apply proposals from the most recent past run without re-running the benchmark.
- apply uses verbatim text matching (old_string = Current from the proposal). If the agent file was edited between the benchmark run and apply, any change whose Current text no longer matches is skipped with a warning — no silent clobbering of intermediate edits.
- routing target vs /audit Check 12: /audit Check 12 performs static analysis of description overlap (finds potential confusion zones); /calibrate routing tests behavioral impact — it generates real routing decisions and measures whether descriptions actually disambiguate. Run in sequence: /audit first (fast, structural), then /calibrate routing (behavioral, slower). They are complementary, not redundant.
- routing, communication, rules in all: see the all entry in <inputs> for the authoritative definition — use explicit targets only when running a single mode in isolation.
- Improvement cycle: /calibrate <agent> fast apply → /calibrate <agent> to verify improvement — stop and escalate to the user if recall is still < 0.70 after this cycle (max 1 apply cycle per run). Run /calibrate routing after any agent description change; run /calibrate communication after any protocol or handoff change.
- Calibration measures the bare agent, not an (agent + loop) composite. If the loop were enabled, it would inflate both recall and confidence by an unknown ratio, masking real instruction gaps and making it impossible to attribute improvement to instruction changes vs. the loop self-correcting at inference time. Use /calibrate for quality and recall.
- A/B rationale: if a general-purpose subagent matches a specialist's recall and F1, the specialization adds no value. ab mode quantifies this gap per target. significant (Δ>0.10) confirms the agent's domain depth earns its cost; marginal (0.05–0.10) suggests instruction improvements may help; none (<0.05) signals the agent's current instructions add no measurable lift over a vanilla agent. Token cost is informational (logged in scores.json) but not part of the verdict — prioritize the recall/F1 delta as the primary signal. Role-specificity caveat: for agents whose domain is well covered by general training data, a none ΔRecall does NOT mean "retire the agent" — specialization shows up in ΔSevAcc, ΔFmt, and ΔTokens even when ΔRecall ≈ 0; positive ΔSevAcc/ΔFmt combined with negative ΔTokens still confirms the specialist earns its cost.
- Nesting depth: A/B mode spawns general-purpose baseline agents inside the pipeline subagent, and Phase 3 spawns general-purpose scorer agents inside the same pipeline subagent. All at 2 levels (main → pipeline → agents) — no additional depth.
- Extensibility: domain tables live in modes/agents.md, modes/skills.md, modes/routing.md, modes/communication.md, modes/rules.md. Add a new target mode by creating a new file in modes/ and adding a row to the Step 2 dispatch table.