# Calibrate an LLM judge against human judgment
Calibrates LLM judges against human labels using train/dev/test splits, TPR/TNR metrics (>90% target), and bias correction. Use it after drafting a judge prompt and before relying on the judge in production.
## Step 1: Split the labeled data
Split human-labeled data into three disjoint sets:
| Split | Size | Purpose | Rules |
|---|---|---|---|
| Training | 10-20% (~10-20 examples) | Source of few-shot examples for the judge prompt | Only clear-cut Pass and Fail cases. Used directly in the prompt. |
| Dev | 40-45% (~40-45 examples) | Iterative evaluator refinement | Never include in the prompt. Evaluate against repeatedly. |
| Test | 40-45% (~40-45 examples) | Final unbiased accuracy measurement | Do NOT look at during development. Used once at the end. |
Target: 30-50 examples of each class (Pass and Fail) across dev and test combined. Use balanced splits even if real-world prevalence is skewed — you need enough Fail examples to measure TNR reliably.
```python
from sklearn.model_selection import train_test_split

# First split: hold out the test set (40%)
train_dev, test = train_test_split(
    labeled_data, test_size=0.4, stratify=labeled_data['label'], random_state=42
)

# Second split: separate few-shot training examples from the dev set
train, dev = train_test_split(
    train_dev, test_size=0.75, stratify=train_dev['label'], random_state=42
)
# Result: ~15% train, ~45% dev, ~40% test
```
## Step 2: Measure TPR and TNR on the dev set
Run the judge on every example in the dev set. Compare predictions to human labels.
TPR (True Positive Rate): When a human says Pass, how often does the judge also say Pass?
TPR = (judge says Pass AND human says Pass) / (human says Pass)
TNR (True Negative Rate): When a human says Fail, how often does the judge also say Fail?
TNR = (judge says Fail AND human says Fail) / (human says Fail)
```python
from sklearn.metrics import confusion_matrix

# With labels=['Fail', 'Pass'], ravel() unpacks in (tn, fp, fn, tp) order
tn, fp, fn, tp = confusion_matrix(
    human_labels, evaluator_labels, labels=['Fail', 'Pass']
).ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```
Use TPR/TNR, not precision/recall or raw accuracy: these two metrics map directly onto the bias-correction formula in Step 6. Use Cohen's Kappa only for measuring agreement between two human annotators, not for judge-vs-ground-truth comparisons.
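If you do need that annotator-agreement check, scikit-learn ships it directly. A minimal sketch, with illustrative labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two human annotators on the same traces
annotator_a = ['Pass', 'Fail', 'Pass', 'Pass', 'Fail']
annotator_b = ['Pass', 'Fail', 'Fail', 'Pass', 'Fail']

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```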
## Step 3: Analyze disagreements
Examine every case where the judge disagrees with the human label:
| Disagreement Type | Judge | Human | Fix |
|---|---|---|---|
| False Pass | Pass | Fail | Judge is too lenient. Strengthen Fail definitions or add edge-case examples. |
| False Fail | Fail | Pass | Judge is too strict. Clarify Pass definitions or adjust examples. |
For each disagreement, decide whether the fix belongs in the judge prompt (per the table above) or in the data: sometimes the human label or the rubric itself is wrong. A quick way to surface these cases is sketched below.
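One way to pull the disagreeing rows for review, assuming the dev results live in a pandas DataFrame with (hypothetical) `human_label`, `judge_label`, and `output` columns:

```python
# dev_results is assumed to hold one row per dev example
disagreements = dev_results[dev_results['human_label'] != dev_results['judge_label']]

for _, row in disagreements.iterrows():
    # Judge said Pass but human said Fail => False Pass; the reverse => False Fail
    kind = 'False Pass' if row['judge_label'] == 'Pass' else 'False Fail'
    print(f"[{kind}] human={row['human_label']} judge={row['judge_label']}")
    print(row['output'][:200])  # first 200 chars of the output under review
    print('---')
```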
## Step 4: Iterate
Refine the judge prompt and re-run on the dev set. Repeat until TPR and TNR stabilize.
Stopping criteria: TPR and TNR both exceed the 90% target on the dev set, or both metrics have plateaued across consecutive rounds.
If alignment stalls:
| Problem | Solution |
|---|---|
| TPR and TNR both low | Use a more capable LLM for the judge |
| One metric low, one acceptable | Inspect disagreements for the low metric specifically |
| Both plateau below target | Decompose the criterion into smaller, more atomic checks |
| Consistently wrong on certain input types | Add targeted few-shot examples from training set (see the sketch after this table) |
| Labels themselves seem inconsistent | Re-examine human labels; the rubric may need refinement |
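One of the fixes above is adding targeted few-shot examples from the training split. A minimal sketch of how that might be wired into the judge prompt; the `train` DataFrame, its `output`/`label` columns, and the prompt wording are assumptions, not a fixed API:

```python
def build_judge_prompt(criterion: str, train) -> str:
    """Assemble a judge prompt with few-shot examples drawn from the train split."""
    examples = []
    for _, row in train.iterrows():
        examples.append(f"Output: {row['output']}\nLabel: {row['label']}")
    few_shot = "\n\n".join(examples)
    return (
        f"You are evaluating outputs against this criterion: {criterion}\n\n"
        f"Labeled examples:\n\n{few_shot}\n\n"
        "Respond with exactly 'Pass' or 'Fail'."
    )
```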
## Step 5: Final test-set evaluation
Run the judge exactly once on the held-out test set. Record final TPR and TNR.
Do not iterate after seeing test-set results. Go back to Step 4 with new dev data if needed.
## Step 6: Correct for judge bias
Raw judge scores on unlabeled production data are biased. If you need an accurate aggregate pass rate, correct for known judge errors:
```
theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
```
Where:
- `p_obs`: fraction of unlabeled traces the judge scored as Pass
- `TPR`, `TNR`: from the test-set measurement
- `theta_hat`: corrected estimate of the true success rate

Clip `theta_hat` to [0, 1]. The formula is invalid when `TPR + TNR - 1` is near 0 (the judge is no better than random).
Example, with illustrative numbers (`p_obs` matches the bootstrap call below; the TPR and TNR values are assumptions):
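```
p_obs = 0.80, TPR = 0.92, TNR = 0.90
theta_hat = (0.80 + 0.90 - 1) / (0.92 + 0.90 - 1) = 0.70 / 0.82 ≈ 0.85
```

So an observed 80% pass rate corrects to roughly an 85% true success rate once the judge's error rates are accounted for.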
Compute a bootstrap confidence interval. A point estimate alone is not enough.
```python
import numpy as np

def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
    """Bootstrap 95% CI for the corrected success rate.

    p_obs (the judge's observed pass rate on production data) is treated
    as fixed; only the test-set TPR/TNR estimates are resampled.
    """
    human = np.array(human_labels)
    evals = np.array(eval_labels)
    n = len(human)
    estimates = []
    for _ in range(n_bootstrap):
        # Resample the labeled test set with replacement
        idx = np.random.choice(n, size=n, replace=True)
        h, e = human[idx], evals[idx]
        tp = ((h == 'Pass') & (e == 'Pass')).sum()
        fn = ((h == 'Pass') & (e == 'Fail')).sum()
        tn = ((h == 'Fail') & (e == 'Fail')).sum()
        fp = ((h == 'Fail') & (e == 'Pass')).sum()
        tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
        tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
        denom = tpr_b + tnr_b - 1
        if abs(denom) < 1e-6:
            continue  # judge no better than random in this resample; skip
        theta = (p_obs + tnr_b - 1) / denom
        estimates.append(np.clip(theta, 0, 1))
    return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)

lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```
Or use `judgy` (`pip install judgy`):
```python
from judgy import estimate_success_rate

result = estimate_success_rate(
    human_labels=test_human_labels,
    evaluator_labels=test_eval_labels,
    unlabeled_labels=prod_eval_labels
)
print(f"Corrected rate: {result.estimate:.2f}")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")
```
Finally, pin the judge to a dated model snapshot (e.g., gpt-4o-2024-05-13, not gpt-4o). Providers update models without notice, causing silent drift.
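A sketch of what pinning looks like in practice, assuming the OpenAI Python client and a `judge_prompt` string built earlier; other providers work the same way:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",  # dated snapshot, never the floating "gpt-4o" alias
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0,  # minimize run-to-run variance in judge labels
)
judge_label = response.choices[0].message.content.strip()
```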