Collaboratively build and refine paper screening rubrics through brainstorming, test-driven development, and iterative feedback
Build and validate paper screening rubrics through collaborative brainstorming, test-driven refinement, and iterative feedback. Use when starting literature searches with 50+ papers or when existing rubrics misclassify results.
/plugin marketplace add kthorn/research-superpower
/plugin install kthorn-research-superpowers@kthorn/research-superpower

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Core principle: Build screening rubrics collaboratively through brainstorming → test → refine → automate → review → iterate.
Good rubrics come from understanding edge cases upfront and testing on real papers before bulk screening.
Use this skill when:
- Starting a literature search with 50+ candidate papers
- An existing rubric is misclassifying results and needs refinement
When NOT to use:
Ask domain-agnostic questions to understand what makes papers relevant:
- Core concepts: which terms, targets, or topics must a paper mention?
- Data types & artifacts: which measurements, datasets, or code availability signal relevance?
- Paper types: are primary research, methods papers, and reviews all acceptable?
- Relationships & context: which related or adjacent topics still count?
- Edge cases: which borderline papers (e.g., analogs, combination studies) should be included or excluded?
Document responses in screening-criteria.json
Based on brainstorming, propose scoring logic:
Scoring (0-10):
Keywords Match (0-3 pts):
- Core term 1: +1 pt
- Core term 2 OR synonym: +1 pt
- Related term: +1 pt
Data Type Match (0-4 pts):
- Measurement type (IC50, Ki, EC50, etc.): +2 pts
- Dataset/code available: +1 pt
- Methods described: +1 pt
Specificity (0-3 pts):
- Primary research: +3 pts
- Methods paper: +2 pts
- Review: +1 pt
Special Rules:
- If mentions exclusion term: score = 0
Threshold: ≥7 = relevant, 5-6 = possibly relevant, <5 = not relevant
Present to user and ask: "Does this logic match your expectations?"
Save initial rubric to screening-criteria.json:
{
"version": "1.0.0",
"created": "2025-10-11T15:30:00Z",
"keywords": {
"core_terms": ["term1", "term2"],
"synonyms": {"term1": ["alt1", "alt2"]},
"related_terms": ["related1", "related2"],
"exclusion_terms": ["exclude1", "exclude2"]
},
"data_types": {
"measurements": ["IC50", "Ki", "MIC"],
"datasets": ["GEO:", "SRA:", "PDB:"],
"methods": ["protocol", "synthesis", "assay"]
},
"scoring": {
"keywords_max": 3,
"data_type_max": 4,
"specificity_max": 3,
"relevance_threshold": 7
},
"special_rules": [
{
"name": "scaffold_analogs",
"condition": "mentions target scaffold AND (analog OR derivative)",
"action": "add 3 points"
}
]
}
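A minimal loader sketch (assuming the screening-criteria.json layout above; the helper name is illustrative) that fails fast if a required section is missing:

```python
import json

REQUIRED_KEYS = {"keywords", "data_types", "scoring", "special_rules"}

def load_rubric(path="screening-criteria.json"):
    """Load the rubric and check that all required sections are present."""
    with open(path) as fh:
        rubric = json.load(fh)
    missing = REQUIRED_KEYS - set(rubric)
    if missing:
        raise ValueError(f"Rubric is missing sections: {sorted(missing)}")
    return rubric
```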
Do a quick PubMed search to get candidate papers:
# Search for 20 papers using initial keywords
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=YOUR_QUERY&retmax=20&retmode=json"
Fetch abstracts for first 10-15 papers:
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=PMID1,PMID2,...&retmode=xml&rettype=abstract"
Present abstracts to user one at a time:
Paper 1/10:
Title: [Title]
PMID: [12345678]
DOI: [10.1234/example]
Abstract:
[Full abstract text]
Is this paper RELEVANT to your research question? (y/n/maybe)
Record user judgments in test-set.json:
{
"test_papers": [
{
"pmid": "12345678",
"doi": "10.1234/example",
"title": "Paper title",
"abstract": "Full abstract text...",
"user_judgment": "relevant",
"timestamp": "2025-10-11T15:45:00Z"
}
]
}
Continue until you have 5-10 papers with clear judgments.
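A small sketch (file name and field names as above; `judgment` comes from the y/n/maybe prompt) for appending each judgment to test-set.json:

```python
import json
from datetime import datetime, timezone

def record_judgment(pmid, doi, title, abstract, judgment, path="test-set.json"):
    """Append one user judgment to the ground-truth test set."""
    try:
        with open(path) as fh:
            data = json.load(fh)
    except FileNotFoundError:
        data = {"test_papers": []}
    data["test_papers"].append({
        "pmid": pmid,
        "doi": doi,
        "title": title,
        "abstract": abstract,
        "user_judgment": judgment,  # "relevant" / "not_relevant" / "maybe"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    with open(path, "w") as fh:
        json.dump(data, fh, indent=2)
```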
Apply rubric to each test paper:
for paper in test_papers:
    score = calculate_score(paper['abstract'], rubric)
    predicted_status = "relevant" if score >= 7 else "not_relevant"
    paper['predicted_score'] = score
    paper['predicted_status'] = predicted_status
Calculate accuracy:
correct = sum(1 for p in test_papers
if p['predicted_status'] == p['user_judgment'])
accuracy = correct / len(test_papers)
Present classification report:
RUBRIC TEST RESULTS (5 papers):
✓ PMID 12345678: Score 9 → relevant (user: relevant) ✓
✗ PMID 23456789: Score 4 → not_relevant (user: relevant) ← FALSE NEGATIVE
✓ PMID 34567890: Score 8 → relevant (user: relevant) ✓
✓ PMID 45678901: Score 3 → not_relevant (user: not_relevant) ✓
✗ PMID 56789012: Score 7 → relevant (user: not_relevant) ← FALSE POSITIVE
Accuracy: 60% (3/5 correct)
Target: ≥80%
--- FALSE NEGATIVE: PMID 23456789 ---
Title: "Novel analogs of compound X with improved potency"
Score breakdown:
- Keywords: 1 pt (matched "compound X")
- Data type: 2 pts (mentioned IC50 values)
- Specificity: 1 pt (primary research)
- Total: 4 pts → not_relevant
Why missed: Paper discusses "analogs" but didn't trigger scaffold_analogs rule
Abstract excerpt: "We synthesized 12 analogs of compound X..."
--- FALSE POSITIVE: PMID 56789012 ---
Title: "Review of kinase inhibitors"
Score breakdown:
- Keywords: 2 pts
- Data type: 3 pts
- Specificity: 2 pts (review, not primary)
- Total: 7 pts → relevant
Why wrong: Review paper, user wants primary research only
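One way to generate the summary portion of that report from the scored `test_papers` list (a sketch; field names follow the scoring loop above):

```python
def print_report(test_papers, target=0.8):
    """Summarize predicted vs. user judgments for the test set."""
    correct = 0
    print(f"RUBRIC TEST RESULTS ({len(test_papers)} papers):")
    for p in test_papers:
        ok = p["predicted_status"] == p["user_judgment"]
        correct += ok
        note = ""
        if not ok:
            note = (" ← FALSE NEGATIVE" if p["user_judgment"] == "relevant"
                    else " ← FALSE POSITIVE")
        print(f"{'✓' if ok else '✗'} PMID {p['pmid']}: Score {p['predicted_score']} → "
              f"{p['predicted_status']} (user: {p['user_judgment']}){note}")
    accuracy = correct / len(test_papers)
    print(f"Accuracy: {accuracy:.0%} ({correct}/{len(test_papers)} correct)")
    print(f"Target: ≥{target:.0%}")
    return accuracy
```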
Ask user for adjustments:
Current accuracy: 60% (below 80% threshold)
Suggestions to improve rubric:
1. Strengthen scaffold_analogs rule - should "synthesized N analogs" always trigger?
2. Lower points for review papers (currently 2 pts, maybe 0 pts?)
3. Add more synonym terms for core concepts?
What would you like to adjust?
Update screening-criteria.json based on feedback
Example update:
{
"special_rules": [
{
"name": "scaffold_analogs",
"condition": "mentions target scaffold AND (analog OR derivative OR synthesized)",
"action": "add 3 points"
}
],
"paper_types": {
"primary_research": 3,
"methods": 2,
"review": 0 // Changed from 1
}
}
Re-score test papers with updated rubric
Show new results:
UPDATED RUBRIC TEST RESULTS (5 papers):
✓ PMID 12345678: Score 9 → relevant (user: relevant) ✓
✓ PMID 23456789: Score 7 → relevant (user: relevant) ✓ (FIXED!)
✓ PMID 34567890: Score 8 → relevant (user: relevant) ✓
✓ PMID 45678901: Score 3 → not_relevant (user: not_relevant) ✓
✓ PMID 56789012: Score 5 → not_relevant (user: not_relevant) ✓ (FIXED!)
Accuracy: 100% (5/5 correct) ✓
Target: ≥80% ✓
Rubric is ready for bulk screening!
If accuracy ≥80%: proceed to bulk screening.
If accuracy <80%: continue iterating.
Once the rubric is validated on the test set, cache full abstracts in abstracts-cache.json:
{
"10.1234/example": {
"pmid": "12345678",
"title": "Paper title",
"abstract": "Full abstract text...",
"fetched": "2025-10-11T16:00:00Z"
}
}
Record screening results in papers-reviewed.json, keyed by DOI:
{
"10.1234/example": {
"pmid": "12345678",
"status": "relevant",
"score": 9,
"source": "pubmed_search",
"timestamp": "2025-10-11T16:00:00Z",
"rubric_version": "1.0.0"
}
}
Screened 127 papers using validated rubric:
- Highly relevant (≥8): 12 papers
- Relevant (7): 18 papers
- Possibly relevant (5-6): 23 papers
- Not relevant (<5): 74 papers
All abstracts cached for re-screening.
Results saved to papers-reviewed.json.
Review offline and provide feedback if you find any misclassifications.
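A hedged sketch of the bulk-screening pass, tying together the search, fetch, and scoring helpers sketched earlier (all helper names are illustrative; writing to abstracts-cache.json and papers-reviewed.json is elided):

```python
def bulk_screen(query, rubric, threshold=7):
    """Search, fetch, score, and bucket every candidate paper."""
    pmids = search_pubmed(query, retmax=200)   # sketched above
    abstracts = fetch_abstracts(pmids)         # sketched above
    counts = {"highly_relevant": 0, "relevant": 0,
              "possibly_relevant": 0, "not_relevant": 0}
    for pmid, abstract in abstracts.items():
        score = calculate_score(abstract, rubric)  # scoring pseudocode below
        if score >= 8:
            counts["highly_relevant"] += 1
        elif score >= threshold:
            counts["relevant"] += 1
        elif score >= 5:
            counts["possibly_relevant"] += 1
        else:
            counts["not_relevant"] += 1
        # also cache the abstract and record the score/status here
    return counts
```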
User reviews papers offline, identifies issues:
User: "I reviewed the results. Three papers were misclassified:
- PMID 23456789 scored 4 but is actually relevant (discusses scaffold analogs)
- PMID 34567890 scored 8 but not relevant (wrong target)
- PMID 45678901 scored 6 but is highly relevant (has key dataset)
Can we update the rubric?"
Update rubric based on feedback:
Re-screening workflow:
# Load all abstracts from abstracts-cache.json
# Apply updated rubric to each
# Generate change report
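A sketch of that re-screening pass, assuming the cache and results files described below and a `calculate_score` helper like the scoring pseudocode later in this skill:

```python
import json

def rescreen(rubric, threshold=7,
             cache_path="abstracts-cache.json",
             results_path="papers-reviewed.json"):
    """Re-score every cached abstract and report papers whose status changed."""
    with open(cache_path) as fh:
        cache = json.load(fh)
    with open(results_path) as fh:
        reviewed = json.load(fh)

    changes = []
    for doi, entry in cache.items():
        score = calculate_score(entry["abstract"], rubric)
        status = "relevant" if score >= threshold else "not_relevant"
        old = reviewed.get(doi, {})
        if old.get("status") != status:
            changes.append((doi, old.get("score"), score, status))
        reviewed[doi] = {**old, "score": score, "status": status,
                         "rubric_version": rubric.get("version")}

    with open(results_path, "w") as fh:
        json.dump(reviewed, fh, indent=2)
    return changes
```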
RUBRIC UPDATE: v1.0.0 → v1.1.0
Changes:
- Added "derivative" to scaffold_analogs rule
- Increased dataset bonus from +1 to +2 pts
Re-screening 127 cached papers...
Status changes:
not_relevant → relevant: 2 papers
- PMID 23456789 (score 4→7)
- PMID 45678901 (score 6→8)
relevant → not_relevant: 1 paper
- PMID 34567890 (score 8→6)
Updated papers-reviewed.json with new scores.
New summary:
- Highly relevant: 12 papers (one added, one removed)
- Relevant: 19 papers (+1)
research-sessions/YYYY-MM-DD-topic/
├── screening-criteria.json # Rubric definition (weights, rules, version)
├── test-set.json # Ground truth papers used for validation
├── abstracts-cache.json # Full abstracts for all screened papers
├── papers-reviewed.json # Simple tracking: DOI, score, status
└── rubric-changelog.md # History of rubric changes and why
Before evaluating-paper-relevance:
When creating helper scripts:
During answering-research-questions:
def calculate_score(abstract, rubric, paper_type="primary_research"):
    score = 0
    score += count_keyword_matches(abstract, rubric["keywords"])      # 0-3 pts
    score += count_data_type_matches(abstract, rubric["data_types"])  # 0-4 pts
    score += specificity_score(paper_type)                            # 0-3 pts

    # Apply special rules (domain-specific bonus points, e.g. +3)
    for rule in rubric["special_rules"]:
        if matches_special_rule(abstract, rule):
            score += rule['bonus_points']
    return score
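A minimal sketch of the keyword helper, assuming the core_terms/synonyms/related_terms layout from screening-criteria.json (exclusion terms, which zero the whole score, are handled separately under the special rules):

```python
def count_keyword_matches(abstract, keywords, cap=3):
    """0-3 pts: +1 per core term (or one of its synonyms), +1 if any related term appears."""
    text = abstract.lower()
    points = 0
    for term in keywords.get("core_terms", []):
        variants = [term] + keywords.get("synonyms", {}).get(term, [])
        if any(v.lower() in text for v in variants):
            points += 1
    if any(t.lower() in text for t in keywords.get("related_terms", [])):
        points += 1
    return min(points, cap)
```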
Medicinal chemistry:
{
"special_rules": [
{
"name": "scaffold_analogs",
"keywords": ["target_scaffold", "analog|derivative|series"],
"bonus": 3
},
{
"name": "sar_data",
"keywords": ["IC50|Ki|MIC", "structure-activity|SAR"],
"bonus": 2
}
]
}
Genomics:
{
"special_rules": [
{
"name": "public_data",
"keywords": ["GEO:|SRA:|ENA:", "accession"],
"bonus": 3
},
{
"name": "differential_expression",
"keywords": ["DEG|differentially expressed", "RNA-seq|microarray"],
"bonus": 2
}
]
}
Computational methods:
{
"special_rules": [
{
"name": "code_available",
"keywords": ["github|gitlab|bitbucket", "code available|software"],
"bonus": 3
},
{
"name": "benchmark",
"keywords": ["benchmark|comparison", "performance|accuracy"],
"bonus": 2
}
]
}
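The pipe-delimited keyword lists above read naturally as "every entry must match at least one of its alternatives"; a hedged sketch of that interpretation:

```python
def matches_special_rule(abstract, rule):
    """True if every keyword entry has at least one pipe-separated
    alternative present in the abstract (case-insensitive)."""
    text = abstract.lower()
    for entry in rule["keywords"]:
        alternatives = [alt.strip().lower() for alt in entry.split("|")]
        if not any(alt in text for alt in alternatives):
            return False
    return True

# Example: the sar_data rule fires only if the abstract mentions
# (IC50 OR Ki OR MIC) AND (structure-activity OR SAR).
```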
- Skipping test-driven validation: bulk screening without testing the rubric → many misclassifications, wasted time
- Not caching abstracts: re-fetching from PubMed when re-screening → slow, hits rate limits
- No ground-truth testing: can't measure rubric accuracy → you don't know if it's working
- Too few test papers: testing on 2-3 papers → the rubric overfits and doesn't generalize
- Too complex a rubric: Boolean logic with 10+ rules → hard to debug, update, and explain
- Not documenting changes: updating the rubric without tracking why → can't reproduce or learn from mistakes
- Setting the threshold too high: requiring 95% accuracy → iterating forever for diminishing returns
- Not saving the rubric with the session: the rubric is lost when the session ends → can't reproduce the screening
The rubric is ready when:
- Accuracy on the test set is ≥80%
- The test set contains 5-10 papers with clear user judgments
- The user has confirmed the scoring logic matches their expectations
- The rubric is saved and versioned in screening-criteria.json
| Task | File | Contents |
|---|---|---|
| Define rubric | screening-criteria.json | Keywords, weights, special rules |
| Test rubric | test-set.json | Ground truth papers with user judgments |
| Cache abstracts | abstracts-cache.json | Full abstract text for re-screening |
| Track results | papers-reviewed.json | DOI, score, status, rubric version |
| Document changes | rubric-changelog.md | History of updates and rationale |
Agent: "I'll help you build a screening rubric. Let's start with some questions about what makes papers relevant..."
[Brainstorming phase - ask 5 question categories]
User: "I'm looking for papers on tuberculosis drug resistance with MIC data for bedaquiline and clofazimine. Papers about analogs are also relevant."
Agent: "Got it. Here's the initial rubric:
- Keywords: tuberculosis, drug resistance, bedaquiline, clofazimine (0-3 pts)
- Data: MIC values, IC50, in vitro activity (0-4 pts)
- Specificity: Primary research (0-3 pts)
- Special rule: If mentions target drugs + 'analog/derivative' → +3 bonus pts
- Threshold: ≥7 = relevant
Does this look right?"
User: "Yes, let's test it."
Agent: "I'll search PubMed for 20 papers and show you abstracts to classify..."
[Shows 10 abstracts, user classifies them]
Agent: "Testing rubric on your classifications...
Accuracy: 70% (7/10 correct)
- 2 false negatives (missed relevant papers)
- 1 false positive (marked a review as relevant)
False negative example: a paper discussing 'derivatives of bedaquiline' scored only 5 pts because it didn't explicitly say 'analog'.
Suggestion: add 'derivative' and 'compound' to the special rule's trigger terms?"
User: "Yes, and lower review paper points."
Agent: "Updated rubric. Re-testing...
Accuracy: 90% (9/10 correct) ✓
Ready to run bulk screening on 127 papers?"
User: "Yes, go ahead."
Agent: [Screens 127 papers, saves results]
"Done! Found 18 relevant papers. All abstracts cached for re-screening.
Review offline and let me know if any misclassifications."
[Later...]
User: "Found 2 misclassifications. Can we update the rubric?"
Agent: "Sure! What did I miss?"
User: "PMID 12345 discusses pretomanid combination therapy with bedaquiline - should be relevant."
Agent: "Got it. Adding 'combination therapy' as related term with +2 bonus pts.
Re-screening all 127 cached papers...
Status changes: 3 papers now relevant (including PMID 12345).
Updated papers-reviewed.json."
After building the rubric: