Research Principles
There is no "success" or "failure" in research, only insights and confidence levels.
Hypothesis-Driven Exploration
- State hypotheses explicitly before running experiments
- Pre-register predictions to avoid post-hoc rationalization
- Document negative results — they're data too
Red-Team Your Results
- Define the scope: how sensitive is your finding to prompt variations, tasks, models?
- Actively seek disconfirming evidence
- A single observation is an anecdote, not a conclusion
Documentation Standards
- Log everything: commands run, parameters used, timestamps
- Verbatim outputs over paraphrasing
- Separate observations from interpretations
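A minimal sketch of what "log everything" can look like in practice — a single append-only JSONL log with one record per run, keeping the verbatim output and any interpretation in separate fields. The function name `log_run` and the field layout are illustrative choices, not a prescribed schema:

```python
import json
import time

def log_run(log_path, command, params, output, note=""):
    """Append one experiment record as a JSON line.

    Stores the raw output verbatim and keeps any interpretation
    in its own field, so observations and commentary never mix.
    """
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "command": command,
        "params": params,
        "output": output,        # verbatim, not paraphrased
        "interpretation": note,  # separate from the observation
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage
log_run("runs.jsonl", "eval.py",
        {"model": "some-model", "temperature": 0.0},
        output="score=0.81",
        note="higher than baseline; verify on more seeds")
```

JSONL is convenient here because appends are atomic enough for interrupted runs, and each line parses independently.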
Rigor Over Speed
- Quick proxies (keyword grep, eyeballing samples) are fine for early triage — deciding what's worth investigating. But any result that feeds into hypothesis updates or gets reported must be analyzed rigorously.
- Use LLM judges for subjective classification, not regex/keyword heuristics. Regex misses nuance and produces misleading stats. Reserve regex only for purely mechanical checks (e.g. "contains non-ASCII characters").
- Audit before scaling: run judges on a small batch first, verify the scores match your intuition, then scale.
- Report effect sizes with context: sample sizes, variance, whether the effect is prompt-specific or general.
- Include verbatim examples alongside aggregates — numbers without examples are uninterpretable.
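The "audit before scaling" step can be sketched as a small helper that scores a random batch with the judge and checks agreement against hand labels before you commit to a full run. `audit_judge` and the toy judge below are hypothetical stand-ins — in practice the judge would be an LLM call:

```python
import random

def audit_judge(judge, items, hand_labels, threshold=0.9, batch_size=20):
    """Score a small random batch with the judge, compare to hand labels.

    Returns (agreement, ok); only scale to the full dataset when the
    agreement rate clears the threshold.
    """
    batch = random.sample(range(len(items)), min(batch_size, len(items)))
    agree = sum(judge(items[i]) == hand_labels[i] for i in batch)
    agreement = agree / len(batch)
    return agreement, agreement >= threshold

# Toy judge standing in for a real LLM classification call:
toy_judge = lambda text: "refusal" if "can't" in text else "comply"
items = ["I can't help with that", "Sure, here you go"]
labels = ["refusal", "comply"]
agreement, ok = audit_judge(toy_judge, items, labels, batch_size=2)
```

If agreement is low, fix the judge prompt or criteria first — scaling a miscalibrated judge just produces precise-looking noise.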
Reduce Uncertainty at the Fastest Possible Rate
The goal of research is not to run experiments — it's to update your beliefs. Every decision should optimize for information gain per unit time.
- Before launching a heavy experiment, ask: is there a cheaper way to get the same signal?
- A single API call to the model can sometimes resolve a question that would otherwise take a day-long experiment.

- Prefer many small, fast experiments over one large, slow one.
The Escalation Ladder
When trying to get a model to do something, try approaches in this order. Only escalate when simpler methods fail or plateau:
- Zero-shot prompting — Try it in a chat interface. Send 10-100 messages, iterate on the prompt.
- Few-shot prompting — Add 1-10 gold examples of what you want.
- Many-shot prompting — Fill the context window with labeled examples.
- Best-of-N sampling — Sample N times, pick the best via a judge or reward model. No training needed.
- Supervised fine-tuning — Only when prompting hits a ceiling. Start with an API (e.g. OpenAI) for fast iteration.
- RL/RLHF — Last resort. Slower iteration, more complex code, harder to debug.
Each step is roughly an order of magnitude more expensive in time and complexity. Don't skip steps.
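The Best-of-N rung of the ladder fits in a few lines. In this sketch, `generate` and `score` are stand-ins for a sampling call and a judge or reward model — no training is involved:

```python
import itertools

def best_of_n(generate, score, prompt, n=8):
    """Sample n candidates for a prompt and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a "model" that returns numbered drafts and a
# judge that happens to prefer later drafts.
counter = itertools.count(1)
generate = lambda p: f"{p} draft {next(counter)}"
score = lambda s: int(s.split()[-1])
best = best_of_n(generate, score, "answer:", n=3)
```

Swapping the stand-ins for a real sampling client and an LLM judge gives you the actual technique; the selection logic doesn't change.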
Cache LLM Responses
Any experiment involving LLM API calls should cache responses to disk. This lets you:
- Kill and restart scripts without losing progress
- Tweak analysis code without re-running inference
- Resume from where you left off after errors
Use one file per response keyed by a deterministic hash of the request (model, prompt, temperature, etc.). Use hashlib.md5, not Python's built-in hash() (which is non-deterministic across runs).
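A minimal sketch of this caching scheme — one file per response, keyed by an md5 hash of the canonically serialized request. `cached_call` and `call_api` are hypothetical names; `call_api` stands in for whatever API client you use:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("llm_cache")

def cache_key(request: dict) -> str:
    """Deterministic key: md5 of the canonical JSON of the request.

    sort_keys=True makes the key independent of dict insertion order;
    hashlib.md5 is stable across runs, unlike Python's built-in hash().
    """
    canonical = json.dumps(request, sort_keys=True)
    return hashlib.md5(canonical.encode()).hexdigest()

def cached_call(request: dict, call_api):
    """Return a cached response if one exists on disk, else call the API."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(request)}.json"
    if path.exists():
        return json.loads(path.read_text())
    response = call_api(request)
    path.write_text(json.dumps(response))
    return response
```

Restarting a killed script then costs nothing for already-completed requests: the hash resolves to the same file and the response is read from disk.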
Avoid Common Pitfalls
- Confirmation bias: Actively seek disconfirming evidence
- Cherry-picking: Don't ignore results that don't fit
- Over-interpreting: Single observations are anecdotes, not conclusions
- Blind fixing: Don't try random fixes without understanding the root cause
- False precision: A regex classifier giving "43.1% human fabrication" looks precise but the methodology is sloppy — prefer proper LLM evaluation with transparent criteria