Download, configure, and study benchmarks for evaluation. Builds benchmark skills for hypothesis testing.
/plugin marketplace add hdubey-debug/orion
/plugin install hdubey-debug-orion@hdubey-debug/orion
/benchmark-setup [benchmark-name]
/benchmark-setup # Set up benchmarks identified from literature
/benchmark-setup VideoMME # Set up a specific benchmark
/benchmark-setup --from-folder ./benchmarks # Study already-downloaded benchmarks
IMPORTANT: This command MUST use Plan Mode. Create a plan first, get user approval, then execute.
When user invokes /benchmark-setup, follow this process:
Use EnterPlanMode, then read the literature overview:
cat research/skills/literature/_overview.md
Extract benchmarks mentioned:
Benchmarks identified from literature:
1. VideoMME - Used by P001, P002
2. MMLU - Used by P001
3. Custom dataset from P002
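As a cross-check on this manual extraction, the literature skill files can also be scanned programmatically; a minimal sketch, assuming the skills are markdown files under research/skills/literature/ (the same folder as the overview above) and using a hypothetical list of candidate benchmark names:

```python
# Count benchmark mentions across literature skill files.
from pathlib import Path

CANDIDATES = ["VideoMME", "MMLU"]  # assumption: candidate names to look for

def count_mentions(skill_dir="research/skills/literature"):
    counts = {name: 0 for name in CANDIDATES}
    for md_file in Path(skill_dir).glob("*.md"):
        text = md_file.read_text(encoding="utf-8", errors="ignore").lower()
        for name in CANDIDATES:
            counts[name] += text.count(name.lower())
    return counts

if __name__ == "__main__":
    for name, n in sorted(count_mentions().items(), key=lambda kv: -kv[1]):
        print(f"{name}: {n} mention(s)")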
For each benchmark, gather information (use WebSearch if needed):
## Benchmark: VideoMME
### Basic Info
- Full name: Video Multi-Modal Evaluation
- Task: Video understanding QA
- Size: ~2000 videos, ~6000 questions
- Modality: Video + Text
### Download Options
A. Official source: [URL]
B. HuggingFace: [HF dataset path]
C. Manual download required
### Storage Requirements
- Download size: ~50GB
- Extracted size: ~100GB
### Dependencies
- ffmpeg for video processing
- Python packages: [list]
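Because the plan will ask the user to confirm storage, a quick free-space check helps validate the size estimates above; a minimal sketch, assuming benchmarks will be stored under research/benchmarks/ and using the plan's ~51GB total as a placeholder:

```python
# Check free disk space at the intended benchmark location before planning downloads.
import shutil
from pathlib import Path

def free_space_gb(path="research/benchmarks"):
    Path(path).mkdir(parents=True, exist_ok=True)
    return shutil.disk_usage(path).free / 1e9

if __name__ == "__main__":
    free = free_space_gb()
    needed = 51  # assumption: download total from the plan (VideoMME 50GB + MMLU 1GB)
    print(f"Free: {free:.1f} GB, needed: ~{needed} GB download (extraction may need more)")
    if free < needed * 1.2:
        print("Warning: consider subsets or a different storage location.")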
## Benchmark Setup Plan
### Benchmarks to Set Up
| Benchmark | Source | Size | Priority |
|-----------|--------|------|----------|
| VideoMME | HuggingFace | 50GB | High |
| MMLU | HuggingFace | 1GB | Medium |
### Download Strategy
1. VideoMME
- Source: huggingface.co/datasets/...
- Command: `huggingface-cli download ...`
- Location: research/benchmarks/videomme/
2. MMLU
- Source: huggingface.co/datasets/...
- Command: `huggingface-cli download ...`
- Location: research/benchmarks/mmlu/
### User Decisions Needed
- [ ] Confirm storage location has space
- [ ] Approve download of [X GB] total
- [ ] Prefer subset or full benchmark?
### After Download
- Study benchmark structure
- Create evaluation scripts
- Run baseline if codebase ready
Present the download plan and ask:
Benchmark Setup Plan
Benchmarks to download:
1. VideoMME (50GB) - High priority
2. MMLU (1GB) - Medium priority
Total: 51GB
Options:
A. Download all (51GB)
B. Download subsets only (~5GB)
C. I'll download manually, just study existing
D. Modify plan
Which option?
Use ExitPlanMode after getting the user's decision.
For each approved benchmark:
# Create directory
mkdir -p research/benchmarks/<name>
# Download (example for HuggingFace; dataset repos need --repo-type dataset)
huggingface-cli download <dataset-path> --repo-type dataset --local-dir research/benchmarks/<name>
# Or wget/curl for direct downloads
wget -P research/benchmarks/<name> <url>
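If scripting the downloads is preferable to the CLI calls above, the huggingface_hub Python API offers an equivalent; a minimal sketch, assuming the dataset is hosted on the Hugging Face Hub (the dataset path is a placeholder, as in the commands above):

```python
# Download a HuggingFace dataset repo into the local benchmarks folder.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<dataset-path>",            # placeholder, e.g. an org/dataset id
    repo_type="dataset",                 # required for dataset repos
    local_dir="research/benchmarks/<name>",
)
print(f"Downloaded to {local_path}")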
If user chose to download manually:
Please download benchmarks to:
- VideoMME → research/benchmarks/videomme/
- MMLU → research/benchmarks/mmlu/
Run `/benchmark-setup --from-folder research/benchmarks` when ready.
For each benchmark (downloaded or provided):
# Explore structure
ls -la research/benchmarks/<name>/
find research/benchmarks/<name> -type f | head -20
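To help fill in the skill file below, a small inspection script can summarize what was downloaded; a minimal sketch, assuming the benchmark folder mixes media files and JSON annotations:

```python
# Summarize a benchmark directory: file counts per extension and a peek at JSON keys.
import json
from collections import Counter
from pathlib import Path

def summarize(bench_dir):
    files = [p for p in Path(bench_dir).rglob("*") if p.is_file()]
    print("Files per extension:", Counter(p.suffix or "<none>" for p in files))
    for p in files:
        if p.suffix == ".json":
            try:
                data = json.loads(p.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError):
                continue
            sample = data[0] if isinstance(data, list) and data else data
            if isinstance(sample, dict):
                print(f"{p.name}: keys = {sorted(sample.keys())}")

if __name__ == "__main__":
    summarize("research/benchmarks/<name>")  # placeholder path from the commands above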
Create research/skills/benchmarks/<name>.md:
# Benchmark: [Name]
## Overview
- **Full Name**: [Name]
- **Task**: [Task description]
- **Paper**: [Citation if applicable]
- **URL**: [Official URL]
## Data Structure
```
<name>/
├── videos/           # X videos
├── questions.json    # Y questions
├── annotations/      # Ground truth
└── metadata.json
```
## Sample Format
### Input
```json
{
  "video_id": "001",
  "question": "What is happening in the video?",
  "options": ["A", "B", "C", "D"]
}
```
### Ground Truth
```json
{
  "video_id": "001",
  "answer": "A"
}
```
## Metrics
| Metric | Description | Computation |
|---|---|---|
| Accuracy | % correct answers | correct / total |
| [Other] | [Description] | [How computed] |

## Evaluation
```bash
# Example evaluation command
python eval.py --benchmark <name> --predictions pred.json
```

## Published Results
| Method | Score | Source |
|---|---|---|
| [Method] | [Score] | [Paper] |
#### Step 8: Create Evaluation Helper
Create `research/benchmarks/<name>/evaluate.py` or document how to evaluate:
```python
# research/benchmarks/<name>/evaluate.py
"""
Evaluation script for [Benchmark Name]

Usage:
    python evaluate.py --predictions pred.json --ground-truth gt.json
"""
import json
import argparse


def compute_accuracy(predictions, ground_truth):
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--predictions', required=True)
    parser.add_argument('--ground-truth', required=True)
    args = parser.parse_args()

    # Load and evaluate
    with open(args.predictions) as f:
        preds = json.load(f)
    with open(args.ground_truth) as f:
        gt = json.load(f)

    accuracy = compute_accuracy(preds, gt)
    print(f"Accuracy: {accuracy:.2%}")


if __name__ == '__main__':
    main()
```
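The script above assumes predictions and ground truth are already aligned lists. If both files are instead keyed by an id (as in the sample format, via video_id), a hedged variant aligns them first; the "prediction" key is a hypothetical field name:

```python
# Variant: align predictions and ground truth by "video_id" before scoring
# (assumes each JSON file is a list of dicts with hypothetical keys).
def compute_accuracy_by_id(predictions, ground_truth):
    pred_map = {p["video_id"]: p.get("prediction", p.get("answer")) for p in predictions}
    gt_map = {g["video_id"]: g["answer"] for g in ground_truth}
    shared = sorted(gt_map.keys() & pred_map.keys())
    correct = sum(pred_map[k] == gt_map[k] for k in shared)
    return correct / len(shared) if shared else 0.0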
Update research/skills/benchmarks/_overview.md:
# Benchmarks
## Available Benchmarks
| Name | Task | Metrics | Size | Status |
|------|------|---------|------|--------|
| VideoMME | Video QA | Accuracy | 6K | Ready |
| MMLU | Knowledge QA | Accuracy | 14K | Ready |
## Quick Reference
### VideoMME
- Location: `research/benchmarks/videomme/`
- Eval: `python research/benchmarks/videomme/evaluate.py`
- Subset: First 600 samples (10%)
### MMLU
- Location: `research/benchmarks/mmlu/`
- Eval: `python research/benchmarks/mmlu/evaluate.py`
- Subset: 1400 samples (10%)
## Baseline Results
| Benchmark | Method | Score | Date |
|-----------|--------|-------|------|
| *(to be filled after /orion-setup runs the baseline)* | | | |
---
*Run /hypothesis-generation to create hypotheses to test*
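The quick reference above mentions 10% subsets; one hedged way to generate them, assuming questions are stored as a JSON list (the questions.json filename comes from the data-structure template and may differ per benchmark):

```python
# Create a fixed-fraction subset file from a benchmark's question list.
import json
from pathlib import Path

def make_subset(bench_dir, fraction=0.10, questions_file="questions.json"):
    src = Path(bench_dir) / questions_file
    questions = json.loads(src.read_text(encoding="utf-8"))
    subset = questions[: max(1, int(len(questions) * fraction))]  # "first N" subset
    out = Path(bench_dir) / "questions_subset.json"
    out.write_text(json.dumps(subset, indent=2), encoding="utf-8")
    print(f"Wrote {len(subset)} / {len(questions)} samples to {out}")

if __name__ == "__main__":
    make_subset("research/benchmarks/videomme")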
Update the research state to record completion:
```json
{
  "phases": {
    "benchmark_setup": "complete"
  },
  "benchmarks": [
    {
      "name": "VideoMME",
      "path": "research/benchmarks/videomme",
      "status": "ready",
      "subset_size": 600
    }
  ]
}
```
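If the state JSON is kept in a file, the update can be done as a read-merge-write; a minimal sketch, where the file path is a hypothetical choice (the actual location depends on how /orion-setup stores state):

```python
# Merge benchmark-setup results into a project state file (path is hypothetical).
import json
from pathlib import Path

STATE_PATH = Path("research/state.json")  # assumption: not specified by this command

def record_benchmark(name, path, subset_size):
    state = json.loads(STATE_PATH.read_text()) if STATE_PATH.exists() else {}
    state.setdefault("phases", {})["benchmark_setup"] = "complete"
    state.setdefault("benchmarks", []).append(
        {"name": name, "path": path, "status": "ready", "subset_size": subset_size}
    )
    STATE_PATH.write_text(json.dumps(state, indent=2))

if __name__ == "__main__":
    record_benchmark("VideoMME", "research/benchmarks/videomme", 600)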
Benchmark Setup Complete!
Benchmarks ready:
├── VideoMME
│ ├── Location: research/benchmarks/videomme/
│ ├── Size: 6000 samples
│ ├── Subset: 600 (10%)
│ └── Skill: research/skills/benchmarks/videomme.md
│
├── MMLU
│ ├── Location: research/benchmarks/mmlu/
│ ├── Size: 14000 samples
│ ├── Subset: 1400 (10%)
│ └── Skill: research/skills/benchmarks/mmlu.md
View details: /knowledge benchmarks
Next step: /hypothesis-generation
Download fails: Retry, try an alternative source (direct URL instead of HuggingFace), or fall back to the manual-download path above.
Benchmark format unknown: Inspect sample files, use WebSearch for official documentation, and record findings in the benchmark's skill file.
Storage full: Ask the user whether to download subsets only, choose a different storage location, or skip lower-priority benchmarks.