From geopol-sim
Backward-assess a past simulation run for accuracy. Use when the user wants to "grade a forecast", "score this run", "check how accurate the predictions were", "evaluate past predictions", or "see how the panel did". Fetches fresh news grounding for the elapsed horizons, runs an LLM grader against each prediction, and writes per-prediction scores plus aggregate calibration stats.
Install:

```bash
npx claudepluginhub danielrosehill/claude-code-plugins --plugin geopol-sim
```
This is the seed of a self-improving loop. Once a few runs are graded, the data can drive prompts like "this model has historically been overconfident at the 1m horizon; weight its forecasts accordingly".
Workflow:

1. Resolve the run to grade: a `reports/<timestamp>/` dir. Defaults to the runstore's "oldest run with elapsed horizons but no `grading.json`" if the runstore is configured.
2. Read `meta.json` for the run timestamp and compute which horizons (24h, 1w, 1m, etc.) have passed; a sketch of this check follows the schema below. If none have, stop and tell the user when the soonest horizon will be due.
3. Load the run's predictions (`synthesis.json` for Council, the equivalent for Forecaster).
4. Fetch fresh news grounding for the elapsed horizons, then grade each prediction as `hit` / `partial` / `miss` / `unverifiable`, plus a one-sentence justification quoting at least one grounding source.
5. Run the grader model (configured via env). Request structured JSON output with the per-prediction grade + justification.
6. Write `grading.json` next to the run dir. Schema:
```json
{
  "graded_at": "<UTC timestamp>",
  "grader_model": "<model id>",
  "horizons_graded": ["24h", "1w"],
  "predictions": [
    {
      "prediction_id": "<from synthesis>",
      "model": "<which council member made it>",
      "horizon": "1w",
      "grade": "partial",
      "justification": "...",
      "evidence_url": "..."
    }
  ]
}
```
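The horizon check in step 2 is plain date arithmetic. A minimal sketch, assuming `meta.json` stores a timezone-aware ISO-8601 `timestamp` field and that labels map to fixed durations (both the field name and the duration values are assumptions):

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Assumed label -> duration map; the skill's real horizon set may differ.
HORIZONS = {
    "24h": timedelta(hours=24),
    "1w": timedelta(weeks=1),
    "1m": timedelta(days=30),
}

def elapsed_horizons(run_dir: Path, now: datetime | None = None) -> list[str]:
    """Return horizon labels that have fully passed since the run was made."""
    meta = json.loads((run_dir / "meta.json").read_text())
    # Assumed field name; must be timezone-aware for the subtraction below.
    run_time = datetime.fromisoformat(meta["timestamp"])
    now = now or datetime.now(timezone.utc)
    return [label for label, delta in HORIZONS.items() if now - run_time >= delta]
```

An empty result means nothing is due yet, which is when the skill reports the soonest upcoming horizon instead of grading.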
Alongside `grading.json`, write a human-readable `grading-report.md`.
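One way to produce the report is a direct transform of `grading.json`. A sketch under that assumption; the table layout and wording here are illustrative, not the skill's prescribed format:

```python
import json
from collections import Counter
from pathlib import Path

def render_report(run_dir: Path) -> str:
    """Render grading.json (schema above) as a small markdown report."""
    g = json.loads((run_dir / "grading.json").read_text())
    counts = Counter(p["grade"] for p in g["predictions"])
    lines = [
        f"# Grading report ({g['graded_at']})",
        f"Grader: {g['grader_model']}. Horizons: {', '.join(g['horizons_graded'])}.",
        "",
        "| model | horizon | grade | justification |",
        "| --- | --- | --- | --- |",
    ]
    for p in g["predictions"]:
        just = p["justification"].replace("|", "\\|")  # keep the table intact
        lines.append(f"| {p['model']} | {p['horizon']} | {p['grade']} | {just} |")
    lines += ["", "Totals: " + ", ".join(f"{k}: {v}" for k, v in sorted(counts.items()))]
    return "\n".join(lines)
```

Writing it next to the grades keeps the pair self-describing: `(run_dir / "grading-report.md").write_text(render_report(run_dir))`.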
Also append to the `<runstore_root>/grading-aggregate.jsonl` file so per-model and calibration trends across all graded runs can be tracked over time.

Notes:

- `unverifiable` is a valid grade and should be preferred over guessing; predictions about private deliberations or unobservable events are common in geopolitics.
- `grading.json` and `grading-report.md` are written alongside the run's existing artifacts, never replacing them.
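Once a few runs are graded, the aggregate file supports exactly the trend queries the self-improving loop needs. A minimal sketch, assuming each JSONL line holds one graded prediction with `model` and `grade` fields and using one possible scoring (both the record shape and the scoring are assumptions, not specified here):

```python
import json
from collections import defaultdict
from pathlib import Path

def per_model_hit_rates(aggregate_path: Path) -> dict[str, float]:
    """Average a hit=1 / partial=0.5 / miss=0 score per model (one possible scoring)."""
    SCORES = {"hit": 1.0, "partial": 0.5, "miss": 0.0}
    totals: dict[str, list[float]] = defaultdict(list)
    for line in aggregate_path.read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec["grade"] in SCORES:  # 'unverifiable' is excluded rather than guessed at
            totals[rec["model"]].append(SCORES[rec["grade"]])
    return {model: sum(s) / len(s) for model, s in totals.items()}
```

A model whose 1m-horizon average sits well below its stated confidence is exactly the overconfidence signal the loop feeds back into future prompts.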