From thinking-frameworks-skills
Designs structured scoring rubrics with explicit criteria, performance scales, and descriptors for consistent quality assessment and reduced subjective bias.
npx claudepluginhub lyndonkl/claude --plugin thinking-frameworks-skills
Slash command
/thinking-frameworks-skills:evaluation-rubrics
Scenario: Evaluating technical blog posts (1-5 scale)
| Criterion | 1 (Poor) | 3 (Adequate) | 5 (Excellent) |
|---|---|---|---|
| Technical Accuracy | Multiple factual errors, misleading | Mostly correct, minor inaccuracies | Fully accurate, technically rigorous |
| Clarity | Confusing, jargon-heavy, poor structure | Clear to experts, some structure | Accessible to target audience, well-organized |
| Practical Value | No actionable guidance, theoretical only | Some examples, limited applicability | Concrete examples, immediately applicable |
| Originality | Rehashes common knowledge, no new insight | Some fresh perspective, builds on existing | Novel approach, advances understanding |
Scoring: Post A [4, 5, 3, 2] = 3.5 avg. Post B [5, 4, 5, 4] = 4.5 avg. Feedback for Post A: "Strong clarity (5) and good accuracy (4), but needs more practical examples (3) and offers less original insight (2)."
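The arithmetic above can be reproduced with a short Python sketch. The `summarize` helper and the `CRITERIA` list are names introduced here for illustration only; the scores are the ones listed for Post A and Post B.

```python
from statistics import mean

# Criterion order follows the example table above.
CRITERIA = ["Technical Accuracy", "Clarity", "Practical Value", "Originality"]

def summarize(scores):
    """Return the average score and the criteria sorted weakest-first."""
    if len(scores) != len(CRITERIA):
        raise ValueError("expected one score per criterion")
    by_criterion = dict(zip(CRITERIA, scores))
    weakest_first = sorted(by_criterion, key=by_criterion.get)
    return mean(scores), weakest_first

post_a_avg, post_a_weakest = summarize([4, 5, 3, 2])
post_b_avg, _ = summarize([5, 4, 5, 4])

print(post_a_avg, post_a_weakest[:2])  # 3.5 ['Originality', 'Practical Value']
print(post_b_avg)                      # 4.5
```

Sorting weakest-first is one simple way to decide which criteria the written feedback should focus on.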
Copy this checklist and track your progress:
Rubric Development Progress:
- [ ] Step 1: Define purpose and scope
- [ ] Step 2: Identify evaluation criteria
- [ ] Step 3: Design the scale
- [ ] Step 4: Write performance descriptors
- [ ] Step 5: Test and calibrate
- [ ] Step 6: Use and iterate
Step 1: Define purpose and scope
Clarify what you're evaluating, who evaluates, who uses results, what decisions depend on scores. See resources/template.md for scoping questions.
Step 2: Identify evaluation criteria
Brainstorm quality dimensions, prioritize most important/observable, balance coverage vs. simplicity (4-8 criteria typical). See resources/template.md for brainstorming framework.
Step 3: Design the scale
Choose number of levels (1-5, 1-4, 1-10), scale type (numeric, qualitative), anchors (what does each level mean?). See resources/methodology.md for scale selection guidance.
Step 4: Write performance descriptors
For each criterion × level, write observable description of what that performance looks like. See resources/template.md for writing guidelines.
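One way to keep track of the criterion × level grid while writing descriptors is to hold the rubric as nested data and check it for empty cells. The sketch below is a minimal illustration, not part of the skill's resources: the `missing_descriptors` helper and the `draft` rubric are hypothetical, with descriptor text borrowed from the example table.

```python
def missing_descriptors(rubric, levels):
    """List (criterion, level) cells that have no descriptor text yet."""
    return [
        (criterion, level)
        for criterion, descriptors in rubric.items()
        for level in levels
        if not descriptors.get(level, "").strip()
    ]

# Hypothetical, partially written rubric (two criteria from the example table).
draft = {
    "Technical Accuracy": {
        1: "Multiple factual errors, misleading",
        3: "Mostly correct, minor inaccuracies",
        5: "Fully accurate, technically rigorous",
    },
    "Clarity": {
        1: "Confusing, jargon-heavy, poor structure",
        3: "",  # descriptor not written yet
    },
}

print(missing_descriptors(draft, levels=[1, 3, 5]))
# [('Clarity', 3), ('Clarity', 5)]
```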
Step 5: Test and calibrate
Have multiple reviewers score sample work, compare scores, discuss discrepancies, refine rubric. See resources/methodology.md for inter-rater reliability testing.
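Comparing reviewer scores can be made quantitative. The following sketch computes raw percent agreement and unweighted Cohen's kappa for two reviewers; the function names and the sample scores are assumptions for illustration, and the procedure in resources/methodology.md may differ.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Fraction of items on which both raters gave the same score."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's score distribution.
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1:
        return 1.0  # both raters used one identical score throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical scores from two reviewers on eight sample items (1-5 scale).
reviewer_1 = [5, 4, 3, 4, 2, 5, 3, 4]
reviewer_2 = [5, 4, 3, 3, 2, 4, 3, 4]

print(percent_agreement(reviewer_1, reviewer_2))       # 0.75
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # 0.65
```

Unweighted kappa treats every disagreement the same, so a 4-versus-3 split counts as much as a 5-versus-1 split. For ordinal scales a weighted kappa or ICC is often a better fit.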
Step 6: Use and iterate
Apply rubric, collect feedback from evaluators and evaluatees, revise criteria/descriptors as needed. Validate using resources/evaluators/rubric_evaluation_rubrics.json. Minimum standard: Average score ≥ 3.5.
Pattern 1: Analytic Rubric (Most Common)
Pattern 2: Holistic Rubric
Pattern 3: Single-Point Rubric
Pattern 4: Checklist (Binary)
Pattern 5: Standards-Based Rubric
Criteria should be observable and measurable: Not "good attitude" (subjective), but "arrives on time, volunteers for tasks, helps teammates" (observable). Test: Can two independent reviewers score this criterion consistently?
Descriptors should distinguish levels clearly: Each level needs concrete differences from adjacent levels. Avoid "5 = very good, 4 = good, 3 = okay". Better: "5 = zero bugs, meets all requirements; 4 = 1-2 minor bugs, meets 90% of requirements."
Use appropriate scale granularity: 1-3 is often too coarse, and 1-10 is often too fine. Sweet spot: 1-4 (forced choice, no middle) or 1-5 (allows neutral middle). Match granularity to actual observable differences.
Balance comprehensiveness with simplicity: Aim for 4-8 criteria covering essential quality dimensions. If >10 criteria, consider grouping or prioritizing.
Calibrate for inter-rater reliability: Have multiple reviewers score the same work and measure agreement (percent agreement, Cohen's kappa, or ICC). If agreement is below 70%, refine the descriptors.
Provide examples at each level: Include concrete examples of work at each level (anchor papers, reference designs, code samples) to calibrate reviewers.
Share rubric before evaluation: If evaluatees see the rubric only after being scored, it is grading, not guidance. Share it upfront so people know the expectations and can self-assess.
Weight criteria appropriately: If "Security" matters more than "Code style", weight it (Security ×3, Style ×1). Or use thresholds (score ≥ 4 on Security to pass, regardless of other scores).
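Weighting and thresholds can be combined in a few lines. This is a minimal sketch using the Security and Code style weights mentioned above; the `weighted_score` and `passes` helpers and the `submission` scores are hypothetical.

```python
def weighted_score(scores, weights):
    """Weighted average of criterion scores."""
    total_weight = sum(weights[c] for c in scores)
    return sum(scores[c] * weights[c] for c in scores) / total_weight

def passes(scores, thresholds):
    """True only if every thresholded criterion meets its minimum score."""
    return all(scores[c] >= minimum for c, minimum in thresholds.items())

weights = {"Security": 3, "Code style": 1}
thresholds = {"Security": 4}  # must score at least 4 on Security to pass

# Hypothetical submission: polished style, weaker security.
submission = {"Security": 3, "Code style": 5}

print(weighted_score(submission, weights))  # 3.5
print(passes(submission, thresholds))       # False
```

Keeping the threshold check separate from the weighted average means a strong overall score cannot hide a failure on a must-pass criterion.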
Common pitfalls:
Key resources:
Scale Selection Guide:
| Scale | Use When | Pros | Cons |
|---|---|---|---|
| 1-3 | Need quick categorization, clear tiers | Fast, forces clear decision | Too coarse, less feedback |
| 1-4 | Want forced choice (no middle) | Avoids central tendency, clear differentiation | No neutral option, feels binary |
| 1-5 | General purpose, most common | Allows neutral, familiar, good granularity | Central tendency bias (everyone gets 3) |
| 1-10 | Need fine gradations, large sample | Maximum differentiation, statistical analysis | False precision, hard to distinguish adjacent levels |
| Qualitative (Novice/Proficient/Expert) | Educational, skill development | Intuitive, growth-oriented | Less quantitative, harder to aggregate |
| Binary (Yes/No, Pass/Fail) | Compliance, gatekeeping | Objective, simple | No gradations, misses quality differences |
Criteria Types:
Inter-Rater Reliability Benchmarks:
Typical Rubric Development Time:
When to escalate beyond rubrics:
Inputs required:
Outputs produced:
evaluation-rubrics.md: Purpose, criteria definitions, scale with descriptors, usage instructions, weighting/thresholds, calibration notes