Generate hypotheses from accumulated knowledge, create a Research PRD, convert to JSON, and prepare for Ralph Loop execution.
Installation:
/plugin marketplace add hdubey-debug/orion
/plugin install hdubey-debug-orion@hdubey-debug/orion
/hypothesis-generation [--target <score>] [--benchmark <name>]
/hypothesis-generation
/hypothesis-generation --target 75.0 --benchmark VideoMME
IMPORTANT: This command MUST use Plan Mode. Create a plan first, get user approval, then execute.
When user invokes /hypothesis-generation, follow this process:
Use EnterPlanMode, then read accumulated skills:
# Literature knowledge
cat research/skills/literature/_overview.md
# Domain knowledge
cat research/skills/domain/_overview.md
# Benchmark knowledge
cat research/skills/benchmarks/_overview.md
# Any previous learnings
cat research/skills/learned/_overview.md
# Current project state
cat research/orion.json
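Before synthesizing, it helps to confirm these knowledge files actually exist; a minimal check, assuming the standard research/skills layout created by the earlier knowledge-gathering commands:
```bash
# Flag any knowledge area that has not been captured yet (paths are the assumed defaults).
for area in literature domain benchmarks learned; do
  [ -f "research/skills/$area/_overview.md" ] || echo "Missing: research/skills/$area/_overview.md"
done
```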
Create a synthesis of all knowledge:
## Knowledge Synthesis
### From Literature
- Method A from P001: [key insight]
- Method B from P002: [key insight]
- Gap identified: [what papers don't solve]
### From Domain Expertise
- User intuition: [relevant ideas]
- Constraints: [what limits our approach]
### From Benchmarks
- Target: [benchmark] at [score]
- Current SOTA: [method] at [score]
- Gap to close: [X points]
### Promising Directions
1. [Direction 1] - combines [Method A] with [Domain insight]
2. [Direction 2] - addresses [gap] using [technique]
3. [Direction 3] - [reasoning]
Based on synthesis, propose 5-10 hypotheses:
## Proposed Hypotheses
### H-001: [Title]
- **Rationale**: [Why this might work - cite knowledge source]
- **From**: Literature (P001) + Domain intuition
- **Implementation**: [Specific changes needed]
- **Estimated Impact**: [Low/Medium/High]
- **Complexity**: [Low/Medium/High]
- **Priority**: [Score = Impact score × inverted Complexity score]
### H-002: [Title]
...
### Priority Order
| Rank | ID | Title | Impact | Complexity | Priority |
|------|-----|-------|--------|------------|----------|
| 1 | H-003 | [Title] | High | Low | 9 |
| 2 | H-001 | [Title] | High | Medium | 6 |
| 3 | H-002 | [Title] | Medium | Low | 6 |
Generated [N] hypotheses from accumulated knowledge:
1. H-001: [Title] (Priority: High)
Rationale: [Brief]
2. H-002: [Title] (Priority: Medium)
Rationale: [Brief]
...
Options:
A. Accept all hypotheses as proposed
B. I want to modify/reorder (will show editor)
C. Add my own hypothesis
D. Regenerate with different focus
Which option?
If the user wants to modify (Option B): show the hypothesis list for editing, apply the requested changes and reordering, and confirm the updated priorities.
If the user wants to add (Option C): capture the user's hypothesis, assign it the next H-ID, and score its impact and complexity alongside the others.
Use ExitPlanMode after finalizing the hypothesis list.
Create research/research-prd.md:
# Research PRD: [Project Name]
## 1. Objective
**Goal**: [Primary research objective]
**Target**: Beat [benchmark] score of [X] (current SOTA: [Y])
**Success Metric**: [metric] > [target]
## 2. Background
### Literature Summary
[Key methods and insights from papers]
### Domain Context
[Relevant domain knowledge]
### Current State
- Baseline: [method] at [score]
- Gap: [X points]
## 3. Hypotheses (User Stories)
### H-001: [Title]
**As a** researcher
**I want to** [implement hypothesis]
**So that** [expected improvement]
**Rationale**: [Why this should work]
**Source**: [Literature/Domain/Intuition]
**Acceptance Criteria**:
- [ ] Implementation complete
- [ ] Subset test shows improvement (>= baseline)
- [ ] Full benchmark run if subset promising
- [ ] Results documented
- [ ] Typecheck passes (if code changes)
**Priority**: 1
**Estimated Complexity**: [Low/Medium/High]
### H-002: [Title]
...
## 4. Evaluation Plan
### Benchmarks
| Benchmark | Metric | Subset Size | Full Size |
|-----------|--------|-------------|-----------|
| [Name] | [Metric] | [N] | [M] |
### Testing Protocol
1. Run subset test (10% data)
2. If subset >= baseline: run full benchmark
3. If subset < baseline: analyze failure, skip or iterate
### Success Criteria
- **Hypothesis passes**: Full benchmark > baseline
- **Project succeeds**: Full benchmark >= target
## 5. Codebase Setup
**Repository**: [URL or path]
**Branch Strategy**:
- `main`: Stable baseline
- `orion/hXXX-name`: Per-hypothesis experiments
## 6. Non-Goals
- [What we're NOT trying to do]
- [Scope boundaries]
## 7. Risks
- [Risk 1]: [Mitigation]
- [Risk 2]: [Mitigation]
## 8. Timeline
[Not time-based, but order of operations]
1. Test H-001
2. If pass, merge; if fail, analyze
3. Test H-002
4. Continue until target or exhausted
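The subset-gating rule in the Evaluation Plan above can be expressed as a small script. This is only a sketch: `run_eval.py` and its flags are hypothetical stand-ins for whatever evaluation entry point the codebase provides, and the comparison assumes higher scores are better.
```bash
# Hypothetical evaluation script and flags; BASELINE_SCORE is set elsewhere.
subset_score=$(python run_eval.py --subset 0.1)
if awk -v s="$subset_score" -v b="$BASELINE_SCORE" 'BEGIN { exit !(s >= b) }'; then
  python run_eval.py --full    # subset matched or beat the baseline: worth a full run
else
  echo "Subset below baseline: analyze the failure before spending a full benchmark run"
fi
```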
Create research/research-prd.json:
{
"project": "[project-name]",
"branchName": "orion/research",
"description": "[Research goal]",
"target": {
"benchmark": "[benchmark]",
"metric": "[metric]",
"score": [target-score]
},
"baseline": {
"method": "[method]",
"score": [baseline-score]
},
"userStories": [
{
"id": "H-001",
"title": "[Hypothesis title]",
"description": "As a researcher, I want to [hypothesis] so that [benefit]",
"rationale": "[Why this should work]",
"source": "[Literature/Domain/Intuition]",
"implementation": "[Specific changes]",
"acceptanceCriteria": [
"Implementation complete",
"Subset test shows improvement (>= baseline)",
"Full benchmark run if subset promising",
"Results documented in skills/learned/",
"Typecheck passes"
],
"priority": 1,
"complexity": "medium",
"status": "pending",
"subset_result": null,
"full_result": null,
"analysis": "",
"branch": "orion/h001-[slug]"
}
],
"completed": false,
"best_result": null
}
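Before handing the file to the Ralph Loop, a quick structural check can catch malformed JSON early; a minimal sketch, assuming `jq` is available:
```bash
# Fails (non-zero exit) if the file is invalid JSON or contains no hypotheses.
jq -e '.userStories | length > 0' research/research-prd.json >/dev/null \
  && echo "research-prd.json looks well-formed" \
  || echo "research-prd.json is missing, invalid, or has no user stories"
```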
Update research/orion.json to record the new phase state:
{
"phases": {
"hypothesis_generation": "complete",
"experimentation": "ready"
},
"hypotheses": [...],
"target": {
"benchmark": "[benchmark]",
"score": [target]
}
}
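One way to record that phase change without hand-editing the file; a sketch assuming `jq` is installed (the exact orion.json schema comes from the earlier Orion commands):
```bash
# Mark hypothesis generation complete and experimentation ready, writing back via a temp file.
jq '.phases.hypothesis_generation = "complete" | .phases.experimentation = "ready"' \
  research/orion.json > research/orion.json.tmp && mv research/orion.json.tmp research/orion.json
```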
Research PRD Generated!
Target: [benchmark] >= [target] (baseline: [baseline])
Hypotheses ready for testing:
| Priority | ID | Title | Complexity |
|----------|-----|-------|------------|
| 1 | H-001 | [Title] | Medium |
| 2 | H-002 | [Title] | Low |
| 3 | H-003 | [Title] | High |
Files created:
- research/research-prd.md (human readable)
- research/research-prd.json (for Ralph Loop)
Ready to start experiments!
Next steps:
1. Set up the codebase: /orion-setup <repo-url>
2. Or start directly: /ralph-loop research/research-prd.json
Recommended command:
/ralph-loop "Test hypotheses in research/research-prd.json. For each: implement, subset test, full test if promising, document learnings. Output <promise>RESEARCH_COMPLETE</promise> when target achieved or all hypotheses tested." --max-iterations 50 --completion-promise "RESEARCH_COMPLETE"
Priority = Impact score × Complexity score, with Complexity scored inversely so that simpler work ranks higher (worked example below).
Impact:
- High (3): Could achieve the target alone
- Medium (2): Meaningful improvement expected
- Low (1): Incremental improvement
Complexity (scored inversely):
- Low (3): Config change or small code edit
- Medium (2): New component or significant changes
- High (1): Major architecture change
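As a concrete check of the scoring, a minimal sketch using the assumed mapping above:
```bash
# H-003 from the example table: High impact (3) x Low complexity (inverted score 3) = 9
impact=3
complexity_inverted=3
echo "Priority: $(( impact * complexity_inverted ))"   # prints 9
```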
After this command, run:
/ralph-loop "Implement research hypotheses from research/research-prd.json..." --max-iterations 50
Ralph will: