This skill should be used when the user asks for "LLM-as-judge evaluation", "advanced quality assessment", "multi-dimensional scoring", "pairwise comparison", "evaluate with position bias mitigation", "judge this output against criteria", or when high-stakes outputs need a rigorous, multi-pass quality assessment. Extends the base evaluation skill with pairwise comparison, position bias mitigation, self-consistency checks, and calibrated confidence scoring.
From strategy-toolkitnpx claudepluginhub nsalvacao/nsalvacao-claude-code-plugins --plugin strategy-toolkitThis skill uses the workspace's default tool permissions.
Provides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.
Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.
Integrates PayPal payments with express checkout, subscriptions, refunds, and IPN. Includes JS SDK for frontend buttons and Python REST API for backend capture.
High-rigor evaluation framework using LLM-as-judge methodology. Applies systematic bias mitigation, pairwise comparison, and multi-pass scoring to produce reliable, calibrated quality assessments for high-stakes outputs.
Use advanced-evaluation when:
Use evaluation (base skill) for routine quality checks.
Apply the base evaluation rubric from the evaluation skill, but with enhanced evidence requirements:
Evidence Quality Standard:
When evaluating a single output, run TWO passes:
When comparing two outputs (A vs B):
When comparing N outputs:
#### Round-Robin Comparison
| Match | Winner | Margin | Key Differentiator |
|-------|--------|--------|--------------------|
| A vs B | [A/B/tie] | [0-2] | [specific reason] |
| A vs C | [A/C/tie] | [0-2] | [specific reason] |
| B vs C | [B/C/tie] | [0-2] | [specific reason] |
**Ranking**: [1st] > [2nd] > [3rd]
**Consensus**: [High/Medium/Low — based on margin consistency]
After scoring, ask: "If I saw this output cold without knowing the context, would I give the same scores?"
Run a quick adversarial probe:
Use these anchors to calibrate your scores against known examples:
| Score | Anchor |
|---|---|
| 5 | Wikipedia featured article quality; textbook explanation; Paul Graham essay |
| 4 | Good Stack Overflow accepted answer; solid technical blog post |
| 3 | Average README; generic ChatGPT output; first-draft document |
| 2 | Incomplete FAQ; bullet-point notes without synthesis |
| 1 | Wrong, misleading, or incoherent |
In addition to the base dimensions (Accuracy, Completeness, Usefulness, Clarity, Freshness), add:
| Dimension | Weight | Criteria |
|---|---|---|
| Coherence | bonus | Does the output have internal logical consistency? No contradictions? |
| Originality | bonus | Does it add genuine insight beyond summarizing? |
| Calibration | bonus | Are claims appropriately hedged vs stated with false certainty? |
These are bonus dimensions — include when relevant, skip when not applicable.
## Advanced Evaluation Report
**Date**: [today]
**Output evaluated**: [name/description]
**Methodology**: LLM-as-Judge with position bias mitigation
**Passes**: [1 / 2 / N]
---
### Pass 1 Scores (Forward)
| Dimension | Score | Evidence (Level 3) | Improvement |
|-----------|-------|-------------------|-------------|
| Accuracy | [1-5] | "[exact quote]" — [why this is an issue] | [specific fix] |
| Completeness | [1-5] | "[what's missing]" | [what to add] |
| Usefulness | [1-5] | "[does it achieve goal?]" | [how to improve] |
| Clarity | [1-5] | "[structural/language issues]" | [how to clarify] |
| Freshness | [1-5] | "[stale elements]" | [what to update] |
| **Subtotal** | **[X.X/5]** | | |
### Pass 2 Scores (Reverse / Alternative Framing)
[Same table with reverse-order evaluation]
### Position Bias Check
| Dimension | Pass 1 | Pass 2 | Delta | Bias Detected? |
|-----------|--------|--------|-------|----------------|
| Accuracy | | | | Yes/No |
| ... | | | | |
| **Overall bias impact**: [High/Medium/Low/None] |
### Adversarial Probing
**Steelman (case for higher score)**: [strongest argument]
**Devil's advocate (case for lower score)**: [strongest argument]
**Conclusion**: [revised score if warranted, with reasoning]
---
### Final Calibrated Score
| Dimension | Score | Confidence |
|-----------|-------|------------|
| Accuracy | [1-5] | [High/Medium/Low] |
| Completeness | [1-5] | [H/M/L] |
| Usefulness | [1-5] | [H/M/L] |
| Clarity | [1-5] | [H/M/L] |
| Freshness | [1-5] | [H/M/L] |
| **Weighted Total** | **[X.X/5]** | |
| **Grade** | **[A/B/C/D/F]** | |
### Prioritized Improvements
**MUST FIX** (blocks use):
1. [Specific issue with Level 3 evidence] → [exact fix]
**SHOULD FIX** (significant quality gain):
2. [Issue] → [fix]
**NICE TO HAVE**:
3. [Issue] → [fix]
### Recommendation
[Clear, unambiguous recommendation with reasoning]
An advanced evaluation MUST: