Generates a rigorous experiment design given a hypothesis. Use when asked to design experiments, plan experiments, create an experimental setup, or figure out how to test a research hypothesis. Covers controls, baselines, ablations, metrics, statistical tests, and compute estimates.
```
/plugin marketplace add GhostScientist/skills
/plugin install writing-skills@GhostScientist-skills
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Prevent the "I ran experiments for 3 months and they're meaningless" disaster through rigorous upfront design.
Before running ANY experiment, you should be able to answer:
Convert your research question into falsifiable predictions:
Template:
If [intervention/method], then [measurable outcome], because [mechanism].
Examples:
Null hypothesis: What does "no effect" look like? This is what you're trying to reject.
Independent Variables (what you manipulate):
| Variable | Levels | Rationale |
|---|---|---|
| [Var 1] | [Level A, B, C] | [Why these levels] |
Dependent Variables (what you measure):
| Metric | How Measured | Why This Metric |
|---|---|---|
| [Metric 1] | [Procedure] | [Justification] |
Control Variables (what you hold constant):
| Variable | Fixed Value | Why Fixed |
|---|---|---|
| [Var 1] | [Value] | [Prevents confound X] |
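The three tables above can be made executable. A minimal sketch (variable names and levels are hypothetical placeholders) encodes them as one spec, so every run records what was swept, what was measured, and what was held fixed:

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class ExperimentSpec:
    independent: dict  # variable name -> list of levels to sweep
    dependent: list    # metric names to measure
    controls: dict     # variable name -> fixed value (prevents confounds)

    def grid(self):
        """Enumerate every combination of independent-variable levels."""
        names = list(self.independent)
        for combo in product(*self.independent.values()):
            yield dict(zip(names, combo))

# Hypothetical example spec: substitute your own variables and levels.
spec = ExperimentSpec(
    independent={"learning_rate": [1e-4, 3e-4], "batch_size": [32, 64]},
    dependent=["val_accuracy"],
    controls={"optimizer": "AdamW", "epochs": 10},
)
print(len(list(spec.grid())))  # 2 learning rates x 2 batch sizes = 4 runs
```

Enumerating the grid up front also gives you the run count you will need for the compute budget later.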
Every experiment needs comparisons. No result is meaningful in isolation.
Baseline Hierarchy:
1. Random/Trivial Baseline (e.g., random guessing or majority-class prediction)
2. Simple Baseline (e.g., a linear model or nearest-neighbor lookup)
3. Standard Baseline (the method most prior work compares against)
4. State-of-the-Art Baseline (the best published result on the task)
5. Ablated Self (your own method with key components removed)
For each baseline, document:
Ablations answer: "Is each component necessary?"
Ablation Template:
| Variant | What's Removed/Changed | Expected Effect | If No Effect... |
|---|---|---|---|
| Full Model | Nothing | Best performance | - |
| w/o Component A | Remove A | Performance drops X% | A isn't helping |
| w/o Component B | Remove B | Performance drops Y% | B isn't helping |
| Component A only | Only A, no B | Shows A's isolated contribution | - |
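An ablation table like the one above can be driven by a loop over variant configs. In this sketch, `train_and_eval` is a toy stand-in for a real training pipeline, with made-up effect sizes:

```python
# Toy stand-in for a real training pipeline: pretend component A adds
# 3 points and component B adds 1 point over a 90-point base score.
def train_and_eval(config):
    return 90.0 + 3.0 * config["use_a"] + 1.0 * config["use_b"]

ABLATIONS = {
    "full": {"use_a": True,  "use_b": True},
    "wo_a": {"use_a": False, "use_b": True},
    "wo_b": {"use_a": True,  "use_b": False},
}

results = {name: train_and_eval(cfg) for name, cfg in ABLATIONS.items()}
full = results["full"]
for name, score in results.items():
    # A variant whose delta is ~0 means the removed component isn't helping.
    print(f"{name}: {score:.1f} (delta vs full: {score - full:+.1f})")
```

Keeping every variant in one dict makes it hard to quietly skip an ablation, and the deltas against the full model fall out of the same loop.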
Good ablations are:
Things that could explain your results OTHER than your hypothesis:
Common Confounds:
| Confound | How to Check | How to Control |
|---|---|---|
| Hyperparameter tuning advantage | Compare tuning budgets across methods | Same tuning budget for all; report the procedure |
| Compute advantage | Matched FLOPs/params | Report compute used |
| Data leakage | Check train/test overlap | Strict separation |
| Random seed luck | Multiple seeds | Report variance |
| Implementation bugs (baseline) | Verify baseline numbers | Use official implementations |
| Cherry-picked examples | Random or systematic selection | Pre-register selection criteria |
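The data-leakage row can be checked mechanically. A minimal sketch, assuming text examples: hash a whitespace-normalized form of each example and flag test items that collide with training items.

```python
import hashlib

def fingerprint(example: str) -> str:
    # Normalize whitespace so trivially reformatted duplicates still collide.
    return hashlib.sha256(" ".join(example.split()).encode()).hexdigest()

def find_leaks(train, test):
    """Return test examples whose fingerprint appears in the training set."""
    train_hashes = {fingerprint(x) for x in train}
    return [x for x in test if fingerprint(x) in train_hashes]

# Hypothetical data: the first test item is a near-duplicate of a train item.
train = ["the cat sat on the mat", "dogs bark loudly"]
test  = ["the cat  sat on the mat", "fish swim in water"]
leaked = find_leaks(train, test)
print(leaked)  # the double-spaced example leaks despite the formatting change
```

Exact-hash overlap is only a lower bound on leakage; near-duplicate detection (e.g., n-gram overlap) catches more, but even this cheap check should be run before any training.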
Sample Size:
What to Report:
Appropriate Tests:
| Comparison | Test | Assumptions |
|---|---|---|
| Two methods, normal data | t-test | Normality, equal variance |
| Two methods, unknown dist | Mann-Whitney U | Ordinal data |
| Multiple methods | ANOVA + post-hoc | Normality |
| Multiple methods, unknown | Kruskal-Wallis | Ordinal data |
| Paired comparisons | Wilcoxon signed-rank | Same test instances |
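The tests in the table are available in libraries such as scipy.stats (`ttest_rel`, `mannwhitneyu`, `kruskal`, `wilcoxon`). As a dependency-free illustration of the paired case, here is a sketch of a paired permutation test over hypothetical per-seed scores:

```python
import random
from statistics import mean

def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided p-value for mean(a - b) != 0 via random sign flips."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical per-seed scores for a method and baseline on identical splits.
method   = [0.81, 0.79, 0.83, 0.80, 0.82]
baseline = [0.76, 0.77, 0.78, 0.75, 0.77]
print(f"p = {paired_permutation_test(method, baseline):.4f}")
```

Permutation tests make no normality assumption, but note how little resolution five seeds give: with n = 5 pairs, the smallest achievable two-sided p is 2/32, so plan seed counts before running rather than after.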
Avoid:
Before running, estimate:
| Component | Estimate | Notes |
|---|---|---|
| Single training run | X GPU-hours | [Details] |
| Hyperparameter search | Y runs × X hours | [Search strategy] |
| Baselines | Z runs × W hours | [Which baselines] |
| Ablations | N variants × X hours | [Which ablations] |
| Seeds | M seeds × above | [How many seeds] |
| Total | T GPU-hours | Buffer: 1.5-2x |
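The table's arithmetic is worth scripting so the total updates when any estimate changes. All numbers below are hypothetical placeholders; substitute your own:

```python
# Hypothetical placeholder numbers matching the table's variables.
single_run_hours = 8    # X: GPU-hours for one training run
hp_search_runs   = 20   # Y: runs in the hyperparameter search
baseline_runs    = 4    # Z: baseline runs
ablation_runs    = 3    # N: ablation variants
seeds            = 5    # M: random seeds

# Following the table, seeds multiply everything above them.
runs_per_seed = hp_search_runs + baseline_runs + ablation_runs
total = runs_per_seed * single_run_hours * seeds
budget = total * 1.5    # low end of the 1.5-2x buffer

print(f"total: {total} GPU-hours, buffered budget: {budget:.0f} GPU-hours")
```

Seeing the seeded, buffered total in GPU-hours (1,620 here, from 27 nominal runs) is usually what turns a vague plan into a go/no-go decision.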
Go/No-Go Decision: Is this feasible with available resources?
Write down BEFORE running:
This prevents unconscious goal-post moving.
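One lightweight way to enforce this is to freeze the criteria in a file before the first run. A sketch, with hypothetical field names and threshold values:

```python
import datetime
import json

# Hypothetical pre-registration record: every field name and value here
# is a placeholder for your own criteria, written down before any run.
prereg = {
    "registered_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "primary_metric": "val_accuracy",
    "success_threshold": 0.80,
    "alpha": 0.05,
    "n_seeds": 5,
    "stopping_rule": "analyze after exactly n_seeds runs; no peeking",
}

with open("preregistration.json", "w") as f:
    json.dump(prereg, f, indent=2)
```

Committing the file alongside the code puts the registered criteria in version control before any results exist, which is what makes later goal-post moving visible.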
# Experiment Design: [Title]
## Hypothesis
[Precise statement]
## Variables
### Independent
[Table]
### Dependent
[Table]
### Controls
[Table]
## Baselines
1. [Baseline 1]: [Source, details]
2. [Baseline 2]: [Source, details]
## Ablations
[Table]
## Confound Mitigation
[Table]
## Statistical Plan
- Seeds: [N]
- Tests: [Which tests for which comparisons]
- Significance threshold: [α level]
## Compute Budget
[Table with total estimate]
## Success Criteria
- Primary: [What must be true]
- Secondary: [Nice to have]
## Timeline
- Phase 1: [What, when]
- Phase 2: [What, when]
## Known Risks
1. [Risk 1]: [Mitigation]
2. [Risk 2]: [Mitigation]
🚩 "We'll figure out the metrics later"
🚩 "One run should be enough"
🚩 "We don't need baselines, it's obviously better"
🚩 "Let's just see what happens"
🚩 "We can always run more if it's not significant"
🚩 No compute estimate before starting
🚩 Vague success criteria