From thinking-frameworks-skills
Designs structured evaluation rubrics with criteria, performance scales, and descriptors for consistent, transparent quality assessment of work like code or docs. Use for objective scoring, bias reduction, or comparing alternatives.
`npx claudepluginhub lyndonkl/claude --plugin thinking-frameworks-skills`

This skill uses the workspace's default tool permissions.
Scenario: Evaluating technical blog posts (1-5 scale)
| Criterion | 1 (Poor) | 3 (Adequate) | 5 (Excellent) |
|---|---|---|---|
| Technical Accuracy | Multiple factual errors, misleading | Mostly correct, minor inaccuracies | Fully accurate, technically rigorous |
| Clarity | Confusing, jargon-heavy, poor structure | Clear to experts, some structure | Accessible to target audience, well-organized |
| Practical Value | No actionable guidance, theoretical only | Some examples, limited applicability | Concrete examples, immediately applicable |
| Originality | Rehashes common knowledge, no new insight | Some fresh perspective, builds on existing | Novel approach, advances understanding |
Scoring: Post A [4, 5, 3, 2] = 3.5 avg. Post B [5, 4, 5, 4] = 4.5 avg. Feedback for Post A: "Strong clarity (5) and good accuracy (4), but needs more practical examples (3) and offers less original insight (2)."
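The scoring above can be sketched as a small script. The criterion names and scores come from the example table; the `average` and `feedback` helpers (and the high/low cutoffs) are illustrative, not part of the skill itself:

```python
# Criteria from the example rubric, in table order.
CRITERIA = ["Technical Accuracy", "Clarity", "Practical Value", "Originality"]

def average(scores):
    """Unweighted mean of per-criterion scores."""
    return sum(scores) / len(scores)

def feedback(scores, high=4, low=3):
    """Flag strongest and weakest criteria for narrative feedback."""
    strengths = [f"{c} ({s})" for c, s in zip(CRITERIA, scores) if s >= high]
    gaps = [f"{c} ({s})" for c, s in zip(CRITERIA, scores) if s <= low]
    return strengths, gaps

post_a = [4, 5, 3, 2]
post_b = [5, 4, 5, 4]
print(average(post_a))   # 3.5
print(average(post_b))   # 4.5
print(feedback(post_a))  # strengths: accuracy, clarity; gaps: value, originality
```

The strengths/gaps split is what turns numeric scores into the kind of narrative feedback shown for Post A.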
Copy this checklist and track your progress:
Rubric Development Progress:
- [ ] Step 1: Define purpose and scope
- [ ] Step 2: Identify evaluation criteria
- [ ] Step 3: Design the scale
- [ ] Step 4: Write performance descriptors
- [ ] Step 5: Test and calibrate
- [ ] Step 6: Use and iterate
Step 1: Define purpose and scope
Clarify what you're evaluating, who evaluates, who uses results, what decisions depend on scores. See resources/template.md for scoping questions.
Step 2: Identify evaluation criteria
Brainstorm quality dimensions, prioritize most important/observable, balance coverage vs. simplicity (4-8 criteria typical). See resources/template.md for brainstorming framework.
Step 3: Design the scale
Choose number of levels (1-5, 1-4, 1-10), scale type (numeric, qualitative), anchors (what does each level mean?). See resources/methodology.md for scale selection guidance.
Step 4: Write performance descriptors
For each criterion × level, write observable description of what that performance looks like. See resources/template.md for writing guidelines.
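One convenient way to keep descriptors organized while writing them is a criterion × level mapping. The sketch below is an assumption about representation, not a prescribed format; descriptors are abridged from the blog-post example above:

```python
# Rubric as a nested mapping: criterion -> level -> observable descriptor.
rubric = {
    "Technical Accuracy": {
        1: "Multiple factual errors, misleading",
        3: "Mostly correct, minor inaccuracies",
        5: "Fully accurate, technically rigorous",
    },
    "Clarity": {
        1: "Confusing, jargon-heavy, poor structure",
        3: "Clear to experts, some structure",
        5: "Accessible to target audience, well-organized",
    },
}

def descriptor(criterion, level):
    """Look up the observable description for a criterion at a level."""
    return rubric[criterion][level]

print(descriptor("Clarity", 5))  # Accessible to target audience, well-organized
```

Gaps in the mapping (here, levels 2 and 4) make it easy to see which descriptors still need writing.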
Step 5: Test and calibrate
Have multiple reviewers score sample work, compare scores, discuss discrepancies, refine rubric. See resources/methodology.md for inter-rater reliability testing.
Step 6: Use and iterate
Apply rubric, collect feedback from evaluators and evaluatees, revise criteria/descriptors as needed. Validate using resources/evaluators/rubric_evaluation_rubrics.json. Minimum standard: Average score ≥ 3.5.
Pattern 1: Analytic Rubric (Most Common)
Pattern 2: Holistic Rubric
Pattern 3: Single-Point Rubric
Pattern 4: Checklist (Binary)
Pattern 5: Standards-Based Rubric
Criteria should be observable and measurable: Not "good attitude" (subjective), but "arrives on time, volunteers for tasks, helps teammates" (observable). Test: Can two independent reviewers score this criterion consistently?
Descriptors should distinguish levels clearly: Each level needs concrete differences from adjacent levels. Avoid "5=very good, 4=good, 3=okay". Better: "5=zero bugs, meets all requirements, 4=1-2 minor bugs, meets 90% requirements."
Use appropriate scale granularity: 1-3 is often too coarse for useful feedback, and 1-10 is usually too fine to score consistently. Sweet spot: 1-4 (forced choice, no middle) or 1-5 (allows neutral middle). Match granularity to actual observable differences.
Balance comprehensiveness with simplicity: Aim for 4-8 criteria covering essential quality dimensions. If >10 criteria, consider grouping or prioritizing.
Calibrate for inter-rater reliability: Have multiple reviewers score same work, measure agreement (Kappa, ICC). If <70% agreement, refine descriptors.
Provide examples at each level: Include concrete examples of work at each level (anchor papers, reference designs, code samples) to calibrate reviewers.
Share rubric before evaluation: If evaluatees see the rubric only after being scored, it is grading not guidance. Share upfront so people know expectations and can self-assess.
Weight criteria appropriately: If "Security" matters more than "Code style", weight it (Security x3, Style x1). Or use thresholds (score >=4 on Security to pass, regardless of other scores).
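Weighting and thresholds combine naturally, as in this sketch. The criterion names, weights, and pass bar are illustrative assumptions, not values the skill prescribes:

```python
# Illustrative weights: Security counts 3x, style 1x.
WEIGHTS = {"Security": 3, "Correctness": 2, "Code style": 1}
# Hard threshold: must score >= 4 on Security regardless of the average.
THRESHOLDS = {"Security": 4}

def weighted_score(scores):
    """Weighted average of per-criterion scores."""
    total = sum(WEIGHTS[c] * s for c, s in scores.items())
    return total / sum(WEIGHTS.values())

def passes(scores, minimum=3.5):
    """Pass only if every hard threshold and the weighted minimum are met."""
    if any(scores[c] < t for c, t in THRESHOLDS.items()):
        return False
    return weighted_score(scores) >= minimum

review = {"Security": 5, "Correctness": 4, "Code style": 2}
print(round(weighted_score(review), 2))  # 4.17
print(passes(review))                    # True
print(passes({"Security": 3, "Correctness": 5, "Code style": 5}))  # False
```

The last call fails despite a high average, showing how a threshold stops a weak Security score from being averaged away.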
Common pitfalls:
Key resources:
Scale Selection Guide:
| Scale | Use When | Pros | Cons |
|---|---|---|---|
| 1-3 | Need quick categorization, clear tiers | Fast, forces clear decision | Too coarse, less feedback |
| 1-4 | Want forced choice (no middle) | Avoids central tendency, clear differentiation | No neutral option, feels binary |
| 1-5 | General purpose, most common | Allows neutral, familiar, good granularity | Central tendency bias (everyone gets 3) |
| 1-10 | Need fine gradations, large sample | Maximum differentiation, statistical analysis | False precision, hard to distinguish adjacent levels |
| Qualitative (Novice/Proficient/Expert) | Educational, skill development | Intuitive, growth-oriented | Less quantitative, harder to aggregate |
| Binary (Yes/No, Pass/Fail) | Compliance, gatekeeping | Objective, simple | No gradations, misses quality differences |
Criteria Types:
Inter-Rater Reliability Benchmarks:
Typical Rubric Development Time:
When to escalate beyond rubrics:
Inputs required:
Outputs produced:
evaluation-rubrics.md: Purpose, criteria definitions, scale with descriptors, usage instructions, weighting/thresholds, calibration notes