From medsci-project
Designs and reviews validity of AI-vs-human-expert benchmarks, covering rubrics, calibration probes, reviewer panels, and reliability targets. Use before data collection.
npx claudepluginhub aperivue/medsci-skills --plugin medsci-presentation
Slash command
/medsci-project:design-ai-benchmarking
This skill pressure-tests an AI-vs-human-expert benchmark before any ratings are collected, so that
the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the
reported reliability is interpretable. It is the AI-evaluation specialization of /design-study: where
/design-study reviews a study in general, this skill owns the specific machinery of comparing AI
system(s) to a panel of human experts (or to each other) on rated outputs.
Use it when:
Do not use it for: general study/validity review (use /design-study); statistical execution such
as ICC or DeLong (use /analyze-stats); reporting-guideline item audits (use /check-reporting);
or reviewing an already-written manuscript (use /peer-review or /self-review).
## AI-Benchmark Design Review
Evaluation question: ...
Arms / systems compared: ...
Reference (human-expert panel): ...
Unit of rating: (item / case / output)
### Rubric (decoupled dimensions)
- dimension -> construct -> anchors (1..k)
### Calibration probes (blinded, randomized)
- positive-control / known-bad / instability / mechanism-contradiction
### Reviewer panel
- n reviewers, metadata captured, per-reviewer randomized order
### Reliability plan
- overall IRR target + control-item IRR (reported separately)
### Judge strategy
- human-as-judge / LLM-as-judge / both + adjudication rule
### Validity risks
1. ...
### Minimal fixes
- ...
### Decision
- Ready to collect / Needs rubric revision / Needs arm or judge redesign
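The Reliability plan entry in the template asks for control-item IRR to be reported separately from the overall figure. The split can be sketched as below. This is only an illustration: the function names are hypothetical, it uses unweighted Cohen's kappa for two raters as a stand-in statistic, and it assumes the non-control figure excludes probe items, which is a design decision the skill text does not settle. The actual ICC or weighted-kappa computation belongs to /analyze-stats.

```python
from collections import Counter


def cohen_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(scores_a)
    # Share of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one identical category: undefined
    return (observed - expected) / (1.0 - expected)


def _kappa_of(pairs):
    return cohen_kappa([a for a, _ in pairs], [b for _, b in pairs])


def split_reliability(rows):
    """rows: (probe_arm, score_a, score_b); probe_arm is None for study items."""
    study = [(a, b) for arm, a, b in rows if arm is None]
    control = [(a, b) for arm, a, b in rows if arm is not None]
    # Two separate numbers, never pooled into one headline figure.
    return {"study_items": _kappa_of(study), "control_items": _kappa_of(control)}
```

Keeping the two figures apart matters because planted controls are designed to be easy to agree on, so pooling them would inflate the agreement reported for the real items.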
Pin down, in writing:
Gate: Present the reconstructed evaluation question, arms, and reference to the user and confirm before designing the rubric. A wrong reconstruction misdirects the entire benchmark.
${CLAUDE_SKILL_DIR}/references/elicitation_rubric_template.md.
Plant a small number of deliberate control items, blinded and randomized across raters (record who
received which via a probe_arm flag), to (i) anchor the scale, (ii) measure rater drift/fatigue, and
(iii) audit the rubric and pipeline itself. Four useful flavors, as named in the review template:
positive-control, known-bad, instability, and mechanism-contradiction.
Probes are planted or adjudicated, never fabricated to fit a hypothesis.
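One way to plant probes and keep the record of who received which is sketched below. Only the probe_arm flag comes from the skill text. The function names, the row layout, the probes_per_reviewer parameter, and the seeding scheme are all assumptions made for illustration. The sketch also assumes probe item ids look like ordinary item ids, since an id that reveals probe status would break blinding.

```python
import random


def plan_assignments(item_ids, probes, reviewer_ids, probes_per_reviewer, seed=0):
    """Build an assignment log with one row per (reviewer, item).

    probes maps a probe's item id to its flavor, e.g. "known-bad".
    """
    log = []
    for reviewer_id in sorted(reviewer_ids):
        # A per-reviewer generator keeps each order independent and reproducible.
        rng = random.Random(f"{seed}:{reviewer_id}")
        chosen = rng.sample(sorted(probes), probes_per_reviewer)
        queue = list(item_ids) + chosen
        rng.shuffle(queue)
        for position, item_id in enumerate(queue):
            log.append({
                "reviewer_id": reviewer_id,
                "item_id": item_id,
                "position": position,
                # None marks an ordinary item; raters never see this field.
                "probe_arm": probes.get(item_id),
            })
    return log


def blinded_queue(log, reviewer_id):
    """Return only the ordered item ids that a rater is shown."""
    rows = [row for row in log if row["reviewer_id"] == reviewer_id]
    return [row["item_id"] for row in sorted(rows, key=lambda row: row["position"])]


log = plan_assignments(
    item_ids=["case-01", "case-02", "case-03"],
    probes={"case-04": "positive-control", "case-05": "known-bad", "case-06": "instability"},
    reviewer_ids=["rev-a", "rev-b"],
    probes_per_reviewer=2,
    seed=7,
)
print(blinded_queue(log, "rev-a"))
```

The full log stays with the study coordinator, and raters only ever receive the output of blinded_queue.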
Gate: Present the panel composition, stratification, and randomization plan for user review before recruitment is finalized.
/analyze-stats).
Define the machine-readable rating record up front: per-item ratings across every rubric dimension,
free-text justifications, follow-up flags, the probe_arm flag, reviewer id and metadata, item order,
and timing. A synthetic schema lives in ${CLAUDE_SKILL_DIR}/references/benchmark_export_schema.json.
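As a rough illustration of what such a record might hold, the sketch below models the fields listed above as a Python dataclass with a small validity check. It is not the contents of benchmark_export_schema.json, which is not reproduced here. Every field name except probe_arm is an assumption, as are the validate helper and the example dimension names.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class RatingRecord:
    item_id: str
    reviewer_id: str
    ratings: Dict[str, int]          # rubric dimension -> score on the 1..k scale
    justifications: Dict[str, str]   # rubric dimension -> free-text justification
    follow_up_flag: bool
    probe_arm: Optional[str]         # None for ordinary items, else the probe flavor
    reviewer_metadata: Dict[str, str]
    item_order: int                  # position in this reviewer's randomized queue
    seconds_spent: float


def validate(record: RatingRecord, dimensions: List[str], scale_max: int) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for dim in dimensions:
        if dim not in record.ratings:
            problems.append(f"missing rating for {dim}")
    for dim, score in record.ratings.items():
        if dim not in dimensions:
            problems.append(f"unknown dimension {dim}")
        elif not 1 <= score <= scale_max:
            problems.append(f"{dim} score {score} outside 1..{scale_max}")
    if record.item_order < 0:
        problems.append("item_order must be non-negative")
    if record.seconds_spent < 0:
        problems.append("seconds_spent must be non-negative")
    return problems


record = RatingRecord(
    item_id="case-01",
    reviewer_id="rev-a",
    ratings={"accuracy": 4, "completeness": 3},
    justifications={"accuracy": "Matches the reference.", "completeness": "Omits one finding."},
    follow_up_flag=False,
    probe_arm=None,
    reviewer_metadata={"specialty": "radiology"},
    item_order=0,
    seconds_spent=92.5,
)
print(validate(record, ["accuracy", "completeness"], scale_max=5))
print(json.dumps(asdict(record), indent=2))
```

Checking each record against the locked rubric dimensions at export time catches missing or out-of-scale ratings before they reach the analysis step.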
Gate: Present the final rubric, probe set, panel plan, judge strategy, and export schema together; collect explicit user approval before any rating begins. Locking these before data collection is the whole point — changes afterward compromise the comparison.
- /analyze-stats for ICC / weighted kappa / DeLong, agreement sample size, and effect-size real-world translation of the benchmark results
- /check-reporting for STARD-AI, CLAIM, or TRIPOD+AI item-level reporting once the design is locked
- /design-study when the broader study around the benchmark (cohort logic, analysis unit, comparator) also needs review
- /peer-review or /self-review only after ratings exist and a manuscript is being assessed
/analyze-stats).
/search-lit with a confirmed DOI or PMID. Mark unverified references as [UNVERIFIED - NEEDS MANUAL CHECK].
[VERIFY] and ask the user.
- ${CLAUDE_SKILL_DIR}/references/elicitation_rubric_template.md -- a synthetic, decoupled multi-dimension rating rubric with anchors and a planted-probe column.
- ${CLAUDE_SKILL_DIR}/references/benchmark_export_schema.json -- a synthetic JSON schema for the per-item rating export (ratings, justifications, probe_arm, reviewer metadata, order, timing).