Download, configure, and study benchmarks for evaluation. Builds benchmark skills for hypothesis testing.
/plugin marketplace add hdubey-debug/orion
/plugin install hdubey-debug-orion@hdubey-debug/orion
/benchmark-setup [benchmark-name]
/benchmark-setup # Set up benchmarks identified from literature
/benchmark-setup VideoMME # Set up a specific benchmark
/benchmark-setup --from-folder ./benchmarks # Study already-downloaded benchmarks
IMPORTANT: This command MUST use Plan Mode. Create a plan first, get user approval, then execute.
When user invokes /benchmark-setup, follow this process:
Use EnterPlanMode, then read the literature overview:
cat research/skills/literature/_overview.md
Extract benchmarks mentioned:
Benchmarks identified from literature:
1. VideoMME - Used by P001, P002
2. MMLU - Used by P001
3. Custom dataset from P002
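As a cross-check on this manual extraction, the literature skill files can also be scanned programmatically; a minimal sketch, assuming the skills are markdown files under research/skills/literature/ (the same folder as the overview above) and using a hypothetical list of candidate benchmark names:

```python
# Count benchmark mentions across literature skill files.
from pathlib import Path

CANDIDATES = ["VideoMME", "MMLU"]  # assumption: candidate names to look for

def count_mentions(skill_dir="research/skills/literature"):
    counts = {name: 0 for name in CANDIDATES}
    for md_file in Path(skill_dir).glob("*.md"):
        text = md_file.read_text(encoding="utf-8", errors="ignore").lower()
        for name in CANDIDATES:
            counts[name] += text.count(name.lower())
    return counts

if __name__ == "__main__":
    for name, n in sorted(count_mentions().items(), key=lambda kv: -kv[1]):
        print(f"{name}: {n} mention(s)")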
For each benchmark, gather information (use WebSearch if needed):
## Benchmark: VideoMME
### Basic Info
- Full name: Video Multi-Modal Evaluation
- Task: Video understanding QA
- Size: ~2000 videos, ~6000 questions
- Modality: Video + Text
### Download Options
A. Official source: [URL]
B. HuggingFace: [HF dataset path]
C. Manual download required
### Storage Requirements
- Download size: ~50GB
- Extracted size: ~100GB
### Dependencies
- ffmpeg for video processing
- Python packages: [list]
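Because the plan will ask the user to confirm storage, a quick free-space check helps validate the size estimates above; a minimal sketch, assuming benchmarks will be stored under research/benchmarks/ and using the plan's ~51GB total as a placeholder:

```python
# Check free disk space at the intended benchmark location before planning downloads.
import shutil
from pathlib import Path

def free_space_gb(path="research/benchmarks"):
    Path(path).mkdir(parents=True, exist_ok=True)
    return shutil.disk_usage(path).free / 1e9

if __name__ == "__main__":
    free = free_space_gb()
    needed = 51  # assumption: download total from the plan (VideoMME 50GB + MMLU 1GB)
    print(f"Free: {free:.1f} GB, needed: ~{needed} GB download (extraction may need more)")
    if free < needed * 1.2:
        print("Warning: consider subsets or a different storage location.")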
## Benchmark Setup Plan
### Benchmarks to Set Up
| Benchmark | Source | Size | Priority |
|-----------|--------|------|----------|
| VideoMME | HuggingFace | 50GB | High |
| MMLU | HuggingFace | 1GB | Medium |
### Download Strategy
1. VideoMME
- Source: huggingface.co/datasets/...
- Command: `huggingface-cli download ...`
- Location: research/benchmarks/videomme/
2. MMLU
- Source: huggingface.co/datasets/...
- Command: `huggingface-cli download ...`
- Location: research/benchmarks/mmlu/
### User Decisions Needed
- [ ] Confirm storage location has space
- [ ] Approve download of [X GB] total
- [ ] Prefer subset or full benchmark?
### After Download
- Study benchmark structure
- Create evaluation scripts
- Run baseline if codebase ready
Present the download plan and ask:
Benchmark Setup Plan
Benchmarks to download:
1. VideoMME (50GB) - High priority
2. MMLU (1GB) - Medium priority
Total: 51GB
Options:
A. Download all (51GB)
B. Download subsets only (~5GB)
C. I'll download manually, just study existing
D. Modify plan
Which option?
Use ExitPlanMode after getting the user's decision.
For each approved benchmark:
# Create directory
mkdir -p research/benchmarks/<name>
# Download (example for HuggingFace; dataset repos need --repo-type dataset)
huggingface-cli download <dataset-path> --repo-type dataset --local-dir research/benchmarks/<name>
# Or wget/curl for direct downloads
wget -P research/benchmarks/<name> <url>
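If scripting the downloads is preferable to the CLI calls above, the huggingface_hub Python API offers an equivalent; a minimal sketch, assuming the dataset is hosted on the Hugging Face Hub (the dataset path is a placeholder, as in the commands above):

```python
# Download a HuggingFace dataset repo into the local benchmarks folder.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<dataset-path>",            # placeholder, e.g. an org/dataset id
    repo_type="dataset",                 # required for dataset repos
    local_dir="research/benchmarks/<name>",
)
print(f"Downloaded to {local_path}")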
If user chose to download manually:
Please download benchmarks to:
- VideoMME → research/benchmarks/videomme/
- MMLU → research/benchmarks/mmlu/
Run `/benchmark-setup --from-folder research/benchmarks` when ready.
For each benchmark (downloaded or provided):
# Explore structure
ls -la research/benchmarks/<name>/
find research/benchmarks/<name> -type f | head -20
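To help fill in the skill file below, a small inspection script can summarize what was downloaded; a minimal sketch, assuming the benchmark folder mixes media files and JSON annotations:

```python
# Summarize a benchmark directory: file counts per extension and a peek at JSON keys.
import json
from collections import Counter
from pathlib import Path

def summarize(bench_dir):
    files = [p for p in Path(bench_dir).rglob("*") if p.is_file()]
    print("Files per extension:", Counter(p.suffix or "<none>" for p in files))
    for p in files:
        if p.suffix == ".json":
            try:
                data = json.loads(p.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError):
                continue
            sample = data[0] if isinstance(data, list) and data else data
            if isinstance(sample, dict):
                print(f"{p.name}: keys = {sorted(sample.keys())}")

if __name__ == "__main__":
    summarize("research/benchmarks/<name>")  # placeholder path from the commands above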
Create research/skills/benchmarks/<name>.md:
# Benchmark: [Name]
## Overview
- **Full Name**: [Name]
- **Task**: [Task description]
- **Paper**: [Citation if applicable]
- **URL**: [Official URL]
## Data Structure
```
<name>/
├── videos/           # X videos
├── questions.json    # Y questions
├── annotations/      # Ground truth
└── metadata.json
```
## Sample Format
### Input
```json
{
  "video_id": "001",
  "question": "What is happening in the video?",
  "options": ["A", "B", "C", "D"]
}
```
### Ground Truth
```json
{
  "video_id": "001",
  "answer": "A"
}
```
## Metrics
| Metric | Description | Computation |
|---|---|---|
| Accuracy | % correct answers | correct / total |
| [Other] | [Description] | [How computed] |

## Evaluation
```bash
# Example evaluation command
python eval.py --benchmark <name> --predictions pred.json
```

## Published Results
| Method | Score | Source |
|---|---|---|
| [Method] | [Score] | [Paper] |
#### Step 8: Create Evaluation Helper
Create `research/benchmarks/<name>/evaluate.py` or document how to evaluate:
```python
# research/benchmarks/<name>/evaluate.py
"""
Evaluation script for [Benchmark Name]

Usage:
    python evaluate.py --predictions pred.json --ground-truth gt.json
"""
import json
import argparse


def compute_accuracy(predictions, ground_truth):
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--predictions', required=True)
    parser.add_argument('--ground-truth', required=True)
    args = parser.parse_args()

    # Load and evaluate
    with open(args.predictions) as f:
        preds = json.load(f)
    with open(args.ground_truth) as f:
        gt = json.load(f)

    accuracy = compute_accuracy(preds, gt)
    print(f"Accuracy: {accuracy:.2%}")


if __name__ == '__main__':
    main()
```
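The script above assumes predictions and ground truth are already aligned lists. If both files are instead keyed by an id (as in the sample format, via video_id), a hedged variant aligns them first; the "prediction" key is a hypothetical field name:

```python
# Variant: align predictions and ground truth by "video_id" before scoring
# (assumes each JSON file is a list of dicts with hypothetical keys).
def compute_accuracy_by_id(predictions, ground_truth):
    pred_map = {p["video_id"]: p.get("prediction", p.get("answer")) for p in predictions}
    gt_map = {g["video_id"]: g["answer"] for g in ground_truth}
    shared = sorted(gt_map.keys() & pred_map.keys())
    correct = sum(pred_map[k] == gt_map[k] for k in shared)
    return correct / len(shared) if shared else 0.0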
Update research/skills/benchmarks/_overview.md:
# Benchmarks
## Available Benchmarks
| Name | Task | Metrics | Size | Status |
|------|------|---------|------|--------|
| VideoMME | Video QA | Accuracy | 6K | Ready |
| MMLU | Knowledge QA | Accuracy | 14K | Ready |
## Quick Reference
### VideoMME
- Location: `research/benchmarks/videomme/`
- Eval: `python research/benchmarks/videomme/evaluate.py`
- Subset: First 600 samples (10%)
### MMLU
- Location: `research/benchmarks/mmlu/`
- Eval: `python research/benchmarks/mmlu/evaluate.py`
- Subset: 1400 samples (10%)
## Baseline Results
| Benchmark | Method | Score | Date |
|-----------|--------|-------|------|
| *(to be filled after /orion-setup runs the baseline)* | | | |
---
*Run /hypothesis-generation to create hypotheses to test*
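The quick reference above mentions 10% subsets; one hedged way to generate them, assuming questions are stored as a JSON list (the questions.json filename comes from the data-structure template and may differ per benchmark):

```python
# Create a fixed-fraction subset file from a benchmark's question list.
import json
from pathlib import Path

def make_subset(bench_dir, fraction=0.10, questions_file="questions.json"):
    src = Path(bench_dir) / questions_file
    questions = json.loads(src.read_text(encoding="utf-8"))
    subset = questions[: max(1, int(len(questions) * fraction))]  # "first N" subset
    out = Path(bench_dir) / "questions_subset.json"
    out.write_text(json.dumps(subset, indent=2), encoding="utf-8")
    print(f"Wrote {len(subset)} / {len(questions)} samples to {out}")

if __name__ == "__main__":
    make_subset("research/benchmarks/videomme")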
Update the research state to record completion:
```json
{
  "phases": {
    "benchmark_setup": "complete"
  },
  "benchmarks": [
    {
      "name": "VideoMME",
      "path": "research/benchmarks/videomme",
      "status": "ready",
      "subset_size": 600
    }
  ]
}
```
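If the state JSON is kept in a file, the update can be done as a read-merge-write; a minimal sketch, where the file path is a hypothetical choice (the actual location depends on how /orion-setup stores state):

```python
# Merge benchmark-setup results into a project state file (path is hypothetical).
import json
from pathlib import Path

STATE_PATH = Path("research/state.json")  # assumption: not specified by this command

def record_benchmark(name, path, subset_size):
    state = json.loads(STATE_PATH.read_text()) if STATE_PATH.exists() else {}
    state.setdefault("phases", {})["benchmark_setup"] = "complete"
    state.setdefault("benchmarks", []).append(
        {"name": name, "path": path, "status": "ready", "subset_size": subset_size}
    )
    STATE_PATH.write_text(json.dumps(state, indent=2))

if __name__ == "__main__":
    record_benchmark("VideoMME", "research/benchmarks/videomme", 600)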
Benchmark Setup Complete!
Benchmarks ready:
├── VideoMME
│ ├── Location: research/benchmarks/videomme/
│ ├── Size: 6000 samples
│ ├── Subset: 600 (10%)
│ └── Skill: research/skills/benchmarks/videomme.md
│
├── MMLU
│ ├── Location: research/benchmarks/mmlu/
│ ├── Size: 14000 samples
│ ├── Subset: 1400 (10%)
│ └── Skill: research/skills/benchmarks/mmlu.md
View details: /knowledge benchmarks
Next step: /hypothesis-generation
Download fails: Retry, try an alternative source (direct URL instead of HuggingFace), or fall back to the manual-download path above.
Benchmark format unknown: Inspect sample files, use WebSearch for official documentation, and record findings in the benchmark's skill file.
Storage full: Ask the user whether to download subsets only, choose a different storage location, or skip lower-priority benchmarks.