From quoth
Pairwise LLM-as-judge evaluation with position randomization, active learning, and cost control. Use when building offline ground-truth labels for retrieval/recommender systems where explicit user feedback is unavailable or too sparse.
Slash command
/quoth:llm-as-judge
- No explicit user labels (or too sparse for supervised training)
Zheng et al. (NeurIPS 2023) show that absolute scoring by LLMs is noisy. Pairwise comparisons reduce that variance and are the standard production approach across major evals (MT-Bench, Chatbot Arena, AlpacaEval).
Given [trajectory/context], which of these two items was more [useful/relevant/load-bearing]?
Item A: ...
Item B: ...
Answer: A, B, or NEITHER
Position bias: LLMs prefer the first option ~60% of the time without mitigation.
Mitigation: randomize order at prompt construction; store a positionMap to decode the verdict (A or B) back to the actual item.
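A minimal Python sketch of the randomization step, assuming each pair arrives as a dict of two item ids to item texts. The names `build_pair_prompt`, `decode_verdict`, and `position_map` (standing in for positionMap) are hypothetical, and the prompt wording fills the bracketed placeholders with one arbitrary choice.

```python
import random
from dataclasses import dataclass

# Template mirrors the pairwise prompt; "useful" is one illustrative criterion.
PROMPT_TEMPLATE = (
    "Given {context}, which of these two items was more useful?\n"
    "Item A: {a}\n"
    "Item B: {b}\n"
    "Answer: A, B, or NEITHER"
)


@dataclass
class PairPrompt:
    text: str
    position_map: dict  # displayed label ("A"/"B") -> actual item id


def build_pair_prompt(context, items, rng):
    """Build a prompt for exactly two items, shuffling which one is shown first."""
    ids = list(items)
    if len(ids) != 2:
        raise ValueError("expected exactly two items")
    rng.shuffle(ids)  # randomize display order to counter position bias
    position_map = {"A": ids[0], "B": ids[1]}
    text = PROMPT_TEMPLATE.format(context=context, a=items[ids[0]], b=items[ids[1]])
    return PairPrompt(text=text, position_map=position_map)


def decode_verdict(verdict, position_map):
    """Map 'A' or 'B' back to the item id; anything else (e.g. NEITHER) gives None."""
    return position_map.get(verdict)
```

Storing the map alongside the prompt means the verdict can be decoded later without re-deriving the shuffle, which matters if judging and posterior updates run as separate steps.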
Length bias: longer options are rated higher. Mitigation: truncate both items to the same length (e.g., 200 chars).
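One way to read the truncation rule, sketched in Python. The function name `truncate_pair` and the `equalize` flag are hypothetical: with `equalize=False` both items share the same cap, and with `equalize=True` both are cut to the shorter item's length so neither looks longer to the judge.

```python
def truncate_pair(a, b, max_chars=200, equalize=False):
    """Truncate both items with one shared limit.

    equalize=False: cap each item at max_chars (a shorter item stays shorter).
    equalize=True:  cut both to min(max_chars, len(a), len(b)).
    """
    limit = max_chars
    if equalize:
        limit = min(max_chars, len(a), len(b))
    return a[:limit], b[:limit]
```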
Self-preference bias: a judge rates outputs from its own model family higher (e.g., Llama judging Llama outputs). Mitigation: judge with a different model family than the one used for generation.
Confabulation: when asked "did X cause Y?", judges construct plausible narratives even when X was irrelevant. Mitigation: offer NEITHER as a valid answer; anchor against the observable outcome.
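A strict verdict parser is one way to keep NEITHER a first-class answer. The sketch below is an assumption about how parsing might work, not the skill's actual parser: it accepts only a bare A, B, or NEITHER (optionally prefixed with "Answer:") and returns None for anything else, so a rambling narrative is never forced onto one side.

```python
VALID_VERDICTS = {"A", "B", "NEITHER"}


def parse_verdict(raw):
    """Return 'A', 'B', or 'NEITHER'; return None when the reply is not a clean verdict."""
    token = raw.strip().upper()
    token = token.removeprefix("ANSWER:").strip()  # tolerate an echoed "Answer:" label
    token = token.rstrip(".")                      # tolerate a trailing period
    return token if token in VALID_VERDICTS else None
```

Treating an unparseable reply as None (skip) rather than guessing keeps noisy judge output from leaking into the labels.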
Selection criterion — Beta credible interval width:
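width(α, β) = 2 · z · sqrt(p(1-p)/n), where p = α/(α+β) and n = α+β. This is a normal approximation to the Beta posterior; z is the standard normal quantile (e.g., 1.96 for a 95% interval).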
Judge only if width ≥ 0.3. Skip high-confidence cases (either direction).
Result: 10-20x cost reduction with minimal accuracy loss — most items don't need judging.
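The selection rule can be sketched in a few lines of Python. The function names are hypothetical, and z = 1.96 (a 95% interval) is an assumption, since the source does not fix z.

```python
import math


def ci_width(alpha, beta, z=1.96):
    """Approximate credible-interval width: 2 * z * sqrt(p(1-p)/n)."""
    n = alpha + beta
    p = alpha / n
    return 2 * z * math.sqrt(p * (1 - p) / n)


def needs_judging(alpha, beta, threshold=0.3, z=1.96):
    """Judge only when the interval is still wide (width >= threshold)."""
    return ci_width(alpha, beta, z) >= threshold
```

Under these assumptions an item at Beta(10, 10) has a width of about 0.44 and still gets judged, while Beta(50, 50) (about 0.20) and Beta(90, 10) (about 0.12) are skipped. Confidence in either direction, or in a settled 50/50 split, removes an item from the queue.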
QUOTH_JUDGE_DAILY_LIMIT=50 env var
1. Nightly: find uncertain clusters/patterns (CI width ≥ 0.3)
2. Build pair prompts, randomize positions
3. Call judge (Haiku) via CLI/API with 30s timeout
4. Parse verdict, map positions back via positionMap
5. Update Beta posteriors: winner α+0.5, loser β+0.5, NEITHER → no-op
6. Log cost per judgment for budget monitoring
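Steps 4 to 6 and the daily cap can be sketched as a small loop. Several things here are assumptions: the `judge` callable is a hypothetical stand-in for the real CLI/API call (it returns the decoded winner id or None, plus a cost), the limit counts judge calls rather than dollars, and the Beta(1, 1) prior is an illustrative default.

```python
import logging
import os
from dataclasses import dataclass

logger = logging.getLogger("judge")


@dataclass
class Posterior:
    alpha: float = 1.0  # assumed uniform prior
    beta: float = 1.0


def daily_limit(default=50):
    """Read the cap from QUOTH_JUDGE_DAILY_LIMIT, falling back to the default."""
    return int(os.environ.get("QUOTH_JUDGE_DAILY_LIMIT", default))


def run_nightly(pairs, judge, posteriors, limit):
    """pairs: iterable of (id_a, id_b); judge(id_a, id_b) -> (winner_id or None, cost)."""
    judged, total_cost = 0, 0.0
    for id_a, id_b in pairs:
        if judged >= limit:  # stop once the daily budget is spent
            break
        winner, cost = judge(id_a, id_b)
        judged += 1
        total_cost += cost
        logger.info("judged %s vs %s cost=%.4f", id_a, id_b, cost)
        if winner is None:  # NEITHER (or unparseable): no-op
            continue
        loser = id_b if winner == id_a else id_a
        posteriors[winner].alpha += 0.5  # half-weight update for the winner
        posteriors[loser].beta += 0.5    # half-weight update for the loser
    return judged, total_cost
```

The half-weight increments mean a single LLM judgment moves a posterior less than a full observed outcome would, which fits treating judge verdicts as weaker evidence.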
npx claudepluginhub montinou/quoth