Probability calibration training skill for improving forecast accuracy and reducing overconfidence
Trains forecasters to improve prediction accuracy by assessing and correcting confidence calibration through quizzes and analysis.
The Calibration Trainer skill provides capabilities for assessing and improving forecaster calibration. It helps decision-makers align their confidence levels with actual accuracy, reducing overconfidence and improving the quality of probabilistic judgments.
```python
# Generate calibration quiz
quiz_config = {
    "type": "general_knowledge",
    "format": "confidence_interval",
    "questions": 20,
    "confidence_levels": [50, 80, 90],  # percentiles to elicit
    "difficulty": "medium",
    "domains": ["business", "economics", "technology", "geography"]
}
```
```python
# Example question
quiz_question = {
    "id": "Q001",
    "question": "In what year was Amazon founded?",
    "actual_answer": 1994,
    "format": "numeric_interval",
    "required_responses": [
        {"confidence": 50, "prompt": "Give your best estimate"},
        {"confidence": 80, "prompt": "Give a range you're 80% confident contains the answer"},
        {"confidence": 90, "prompt": "Give a range you're 90% confident contains the answer"}
    ]
}
```
```python
# Collect responses
responses = {
    "participant": "John Smith",
    "date": "2024-01-15",
    "questions": [
        {
            "question_id": "Q001",
            "responses": {
                "point_estimate": 1997,
                "interval_80": [1995, 2000],
                "interval_90": [1992, 2002]
            }
        }
        # ... more questions
    ]
}
```
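Scoring a response is a containment check: an interval "hits" when it contains the true answer. A minimal sketch (the function name is illustrative, not part of the skill's API), applied to the Q001 response above:

```python
def interval_hit(interval, truth):
    """Return True if the closed interval [lo, hi] contains the true value."""
    lo, hi = interval
    return lo <= truth <= hi

# Amazon was founded in 1994 (Q001's actual_answer)
print(interval_hit([1995, 2000], 1994))  # False: the 80% interval misses
print(interval_hit([1992, 2002], 1994))  # True: the 90% interval hits
```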
```python
# Analyze calibration
calibration_analysis = {
    "participant": "John Smith",
    "n_questions": 20,
    "by_confidence_level": {
        "80%_intervals": {
            "expected_hit_rate": 0.80,
            "actual_hit_rate": 0.55,
            "calibration_gap": -0.25,
            "interpretation": "overconfident"
        },
        "90%_intervals": {
            "expected_hit_rate": 0.90,
            "actual_hit_rate": 0.70,
            "calibration_gap": -0.20,
            "interpretation": "overconfident"
        }
    },
    "brier_score": 0.18,  # lower is better, 0 = perfect
    "overconfidence_index": 0.23,
    "recommendations": [
        "Widen confidence intervals by ~25%",
        "Practice with domain-specific questions",
        "Use reference class thinking"
    ]
}
```
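The per-level statistics above follow directly from graded responses: the actual hit rate is the fraction of intervals containing the truth, and the calibration gap is actual minus expected. A sketch under that convention (function name and input shape are assumptions for illustration):

```python
def calibration_stats(hits, confidence):
    """Aggregate interval hits (list of bools) for one stated confidence level."""
    actual = sum(hits) / len(hits)
    expected = confidence / 100
    gap = actual - expected  # negative => overconfident
    return {
        "expected_hit_rate": expected,
        "actual_hit_rate": actual,
        "calibration_gap": round(gap, 2),
        "interpretation": "overconfident" if gap < 0 else "well-calibrated or underconfident",
    }

# 11 hits out of 20 questions at the 80% level, as in the analysis above
stats = calibration_stats([True] * 11 + [False] * 9, 80)
print(stats["actual_hit_rate"])   # 0.55
print(stats["calibration_gap"])   # -0.25
```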
```python
# Calibration training program
training_program = {
    "participant": "John Smith",
    "baseline_calibration": 0.55,  # hit rate for 80% intervals
    "target_calibration": 0.75,
    "exercises": [
        {
            "week": 1,
            "focus": "interval_widening",
            "exercise": "Practice giving intervals 50% wider than instinct",
            "quiz_count": 10
        },
        {
            "week": 2,
            "focus": "reference_class",
            "exercise": "For each estimate, identify a reference class first",
            "quiz_count": 10
        },
        {
            "week": 3,
            "focus": "decomposition",
            "exercise": "Break complex estimates into components",
            "quiz_count": 10
        },
        {
            "week": 4,
            "focus": "consolidation",
            "exercise": "Apply all techniques, track improvement",
            "quiz_count": 20
        }
    ]
}
```
```python
# Track progress over time
progress_data = {
    "participant": "John Smith",
    "history": [
        {"date": "2024-01-01", "hit_rate_80": 0.55, "brier_score": 0.22},
        {"date": "2024-01-15", "hit_rate_80": 0.62, "brier_score": 0.19},
        {"date": "2024-02-01", "hit_rate_80": 0.68, "brier_score": 0.16},
        {"date": "2024-02-15", "hit_rate_80": 0.74, "brier_score": 0.13}
    ],
    "trend": "improving",
    "improvement_rate": "~6 percentage points per session"
}
```
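The improvement rate is just the mean session-over-session change in hit rate. A sketch over the same history:

```python
history = [0.55, 0.62, 0.68, 0.74]  # hit_rate_80 per session
deltas = [b - a for a, b in zip(history, history[1:])]
mean_delta = sum(deltas) / len(deltas)
print(f"{mean_delta * 100:.1f} pp per session")  # about 6.3 pp per session
```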
Input parameters:

```json
{
  "operation": "quiz|analyze|train|track",
  "quiz_config": {
    "type": "string",
    "format": "string",
    "questions": "number",
    "confidence_levels": ["number"]
  },
  "responses": {
    "participant": "string",
    "questions": ["object"]
  },
  "training_config": {
    "target_calibration": "number",
    "duration_weeks": "number"
  }
}
```
Output format:

```json
{
  "quiz": {
    "questions": ["object"],
    "total_count": "number"
  },
  "calibration_analysis": {
    "by_confidence_level": "object",
    "brier_score": "number",
    "overconfidence_index": "number",
    "calibration_curve": "object"
  },
  "recommendations": ["string"],
  "progress": {
    "history": ["object"],
    "trend": "string",
    "target_achieved": "boolean"
  }
}
```
| Metric | Formula | Interpretation |
|---|---|---|
| Hit Rate | % of intervals containing true value | Should match confidence level |
| Brier Score | Mean squared error of probabilities | Lower is better (0-1) |
| Calibration Gap | Actual − Expected hit rate | Negative = overconfident |
| Overconfidence Index | Mean absolute calibration gap | Quantifies overall miscalibration |
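The Brier score in the table is the mean squared difference between stated probabilities and outcomes (1 = correct, 0 = incorrect). A sketch with illustrative values, not the skill's internal computation:

```python
def brier_score(forecasts):
    """forecasts: list of (stated_probability, outcome) pairs, outcome in {0, 1}."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Three binary forecasts: confident and right, confident and wrong, uncertain
print(round(brier_score([(0.9, 1), (0.8, 0), (0.5, 1)]), 2))  # 0.3
```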
A well-calibrated forecaster's stated confidence matches observed accuracy: 80% intervals contain the true value about 80% of the time, and 90% intervals about 90% of the time.
The calibration curve plots stated confidence against observed accuracy; a perfectly calibrated forecaster's curve lies on the diagonal.
| Technique | Description |
|---|---|
| Widen intervals | Start wider, narrow only with strong evidence |
| Reference classes | Use base rates from similar situations |
| Decomposition | Break estimates into components |
| Devil's advocate | Actively seek reasons to be less confident |
| Pre-mortem | Imagine being wrong, identify why |