Scores past predictions against actual sprint outcomes, creates calibration claims, and computes accuracy scorecards by evidence tier and claim type. Useful for feedback loops after implementations.
npx claudepluginhub grainulation/grainulator --plugin grainulator

This skill uses the workspace's default tool permissions.
Related skills:
Scores completed OKR sets at cycle close, with KR-level grading by type (committed, aspirational, learning, etc.), evidence quality assessment, learning synthesis, and next-cycle recommendations. Enforces rules against retroactive changes and gaming.
Analyzes sprint claims for type distributions, evidence quality tiers, claims stale for more than 7 days, velocity metrics, and prediction scoring. Generates HTML retrospective reports.
Generates dev cycle feedback reports: calculates assertiveness scores, analyzes prompt quality, aggregates metrics, runs root-cause analysis on failures, and writes output to docs/feedbacks/cycle-{date}/.
The user wants to check what actually happened after a sprint's recommendations were implemented.
$ARGUMENTS
Expected format: /calibrate --outcome "what happened" or /calibrate <claim_id> "actual result"
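For example (the claim id and outcome text below are illustrative, not real data):

/calibrate --outcome "Cache migration shipped; p95 latency dropped 35% rather than the predicted 40%"
/calibrate est012 "Migration took 4 days against the 2-day estimate"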
Parse the outcome: The user provides outcome data as free text or claim-specific results.
Match outcomes to predictions: Use wheat_search to find the original estimate, recommendation, or risk claims that predicted something. Compare prediction to actual outcome.
Create calibration claims as cal### claims with evidence tier production (these are real outcomes).
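For illustration, a matched prediction/outcome pair and the calibration claim it would produce might look like the record below. Only the cal### id convention and the production tier come from this command; the field names and values are assumptions, not the plugin's actual claim schema.

```python
# Hypothetical calibration record; the shape is illustrative only.
cal_claim = {
    "id": "cal001",                 # cal### numbering per this command
    "scores": "est012",             # hypothetical id of the original prediction
    "predicted": "migration takes 2 days",
    "actual": "migration took 4 days",
    "grade": "wrong",               # accurate | partial | wrong
    "prediction_tier": "stated",    # evidence tier the original claim carried
    "evidence_tier": "production",  # real outcomes always get the production tier
}
```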
Compute accuracy scorecard: what fraction of stated vs web vs documented vs tested claims were accurate?
Run wheat_compile.
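A minimal sketch of the scorecard arithmetic, assuming calibration claims shaped like the hypothetical record above (this helper is not part of the plugin):

```python
from collections import defaultdict

def scorecard(cal_claims: list[dict]) -> dict:
    """Aggregate grades overall and compute accuracy per original evidence tier."""
    totals = {"accurate": 0, "partial": 0, "wrong": 0}
    by_tier = defaultdict(lambda: {"total": 0, "accurate": 0})
    for c in cal_claims:
        totals[c["grade"]] += 1
        tier = by_tier[c["prediction_tier"]]  # tier of the original prediction
        tier["total"] += 1
        tier["accurate"] += c["grade"] == "accurate"
    return {
        "scored": len(cal_claims),
        **totals,
        "accuracy_by_tier": {
            t: f"{100 * v['accurate'] / v['total']:.0f}%"
            for t, v in by_tier.items()
        },
    }
```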
Print scorecard:
Calibration results:
Predictions scored: <N>
Accurate: <N> (<percent>)
Partially accurate: <N>
Wrong: <N>
Accuracy by evidence tier:
stated: <percent>
web: <percent>
documented: <percent>
tested: <percent>
Next steps:
/brief -- recompile with calibrated data
/research <topic> -- investigate where predictions went wrong