From open-science-skills
Guides LLM text classification for survey data: codebook design, zero/few-shot/fine-tuning selection, model choice, human-LLM hybrids, validation, reproducibility.
npx claudepluginhub scdenney/open-science-skills --plugin open-science-skills
- Treat codebook design as the most consequential decision in the classification pipeline. LLMs struggle with loose instructions and revert to general-purpose definitions rather than following researcher-specific operationalizations (Halterman & Keith 2025).
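One way to act on this advice is to embed the codebook verbatim in every classification prompt, so the model is forced to use the study's operationalizations rather than its own general-purpose label definitions. A minimal sketch, with hypothetical codebook entries and category names (not from this skill):

```python
# Embed a researcher-specific codebook directly in the classification prompt,
# so the model follows the study's operationalization of each label rather
# than its own general-purpose understanding. Entries are illustrative.

CODEBOOK = {
    "economic": "Mentions jobs, wages, prices, taxes, or personal finances. "
                "Exclude general complaints about 'the government' with no "
                "economic content.",
    "health": "Mentions physical or mental health, healthcare access, "
              "or insurance.",
    "none_of_above": "Too vague, too short, or off-topic to assign any "
                     "substantive code.",
}

def build_prompt(text: str) -> str:
    """Assemble a zero-shot classification prompt from the codebook."""
    defs = "\n".join(f"- {label}: {rule}" for label, rule in CODEBOOK.items())
    return (
        "Classify the survey response into exactly one category.\n"
        "Use ONLY the definitions below, not your own understanding "
        "of the labels.\n\n"
        f"Categories:\n{defs}\n\n"
        f"Response: {text}\n"
        "Answer with the category name only."
    )
```

Keeping the codebook in one data structure also means the same definitions can be reused verbatim in the instructions given to human coders, which matters when comparing human and LLM labels later.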
- Always include an explicit residual category (e.g., none_of_above or uncodeable) for responses that are too vague, too short, or off-topic. Define this category as precisely as the substantive codes (Halterman & Keith 2025).
- Follow the decision framework from Chae & Davidson (2025), which maps document characteristics and available resources to the appropriate approach:
  - Zero-shot prompting: Use when classifying short documents with a large decoder model (GPT-4o, Llama3-70B+) and no labeled training data. Best for rapid prototyping and tasks where constructs are well-defined. GPT-4o achieves the best zero-shot performance across tasks (Chae & Davidson 2025).
  - Few-shot prompting: Add labeled examples to the prompt. Results are inconsistent: adding examples helps some models but degrades others (Chae & Davidson 2025). Always compare few-shot against zero-shot on a held-out sample before committing. Select diverse examples covering edge cases, not just prototypical instances.
  - Fine-tuning: Train a model on labeled data. Effective with as few as 100 hand-coded examples for smaller models (Chae & Davidson 2025). Fine-tuned smaller models (Llama3-8B, GPT-3 Davinci) can match GPT-4o zero-shot performance. Prefer this when you have labeled data and need cost-effective classification at scale.
  - Instruction-tuning: Combine detailed prompting with fine-tuning on paired instruction-output examples. The most powerful regime for complex tasks: instruction-tuned Llama3-70B surpasses GPT-4o zero-shot on stance detection (Chae & Davidson 2025). Requires more technical infrastructure but yields the highest accuracy.
- When resources permit, test multiple regimes on the same pilot sample and select based on empirical performance, not assumptions.
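Comparing regimes on the same pilot sample reduces to scoring each regime's predictions against the same gold labels. A self-contained sketch using macro-F1 (a common choice for imbalanced survey codes; the function and regime names are illustrative):

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 scores over all observed labels."""
    labels = set(gold) | set(pred)
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def compare_regimes(gold: list[str],
                    predictions_by_regime: dict[str, list[str]]) -> dict[str, float]:
    """Score each regime's predictions on the same pilot sample."""
    return {name: macro_f1(gold, preds)
            for name, preds in predictions_by_regime.items()}
```

Running each candidate regime (zero-shot, few-shot, fine-tuned) over the pilot texts once and passing the label lists into `compare_regimes` gives a like-for-like comparison on identical data, which is the condition the guidance above asks for.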
- Record the exact model version string (e.g., gpt-4o-2024-08-06), not just the model family name. Commercial models are modified or deprecated without notice; GPT-3 was withdrawn from OpenAI's API entirely (Barrie, Palmer & Spirling 2025; Chae & Davidson 2025). Archive the exact prompt template verbatim (e.g., "Code this response:\n\n{text}").
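A lightweight way to follow this advice is to serialize the pinned model version, decoding parameters, and a hash of the prompt template into a manifest archived with every run. The field names and helper below are illustrative, not a standard:

```python
import hashlib
import json

# Reproducibility sketch: log the exact dated model snapshot (not the family
# name "gpt-4o"), the decoding parameters, and a fingerprint of the prompt
# template alongside every classification run.

RUN_CONFIG = {
    "model": "gpt-4o-2024-08-06",  # exact version string, pinned
    "temperature": 0.0,            # deterministic decoding for classification
    "prompt_template": "Code this response:\n\n{text}",
}

def run_manifest(config: dict) -> str:
    """Serialize the run configuration so it can be archived with outputs."""
    manifest = dict(config)
    manifest["prompt_sha256"] = hashlib.sha256(
        config["prompt_template"].encode()
    ).hexdigest()
    return json.dumps(manifest, sort_keys=True)
```

Writing the manifest next to the labeled output file means that, even if the provider later retires the model, the paper can still report exactly which snapshot and prompt produced the classifications.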