# chrisvoncsefalvay-funsloth

Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.

Install:

```
npx claudepluginhub joshuarweaver/cascade-ai-ml-engineering --plugin chrisvoncsefalvay-funsloth
```

This skill uses the workspace's default tool permissions.
Validate datasets before fine-tuning with Unsloth.
For automated validation, use the script:

```
python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16
```
Ask for: HF dataset ID (e.g., mlabonne/FineTome-100k) or local path (e.g., ./data.jsonl)
Auto-detect format from structure. See DATA_FORMATS.md for details.
| Format | Detection | Key Fields |
|---|---|---|
| Raw | `text` only | `text` |
| Alpaca | `instruction` + `output` | `instruction`, `output` |
| ShareGPT | `conversations` array | `from`, `value` |
| ChatML | `messages` array | `role`, `content` |
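The detection rules in the table above can be sketched as a key check on a single example record; this is an illustrative helper, not the skill's actual implementation:

```python
def detect_format(example: dict) -> str:
    """Guess the dataset format from the keys of one example record."""
    if "messages" in example:
        return "chatml"      # role/content turns
    if "conversations" in example:
        return "sharegpt"    # from/value turns
    if "instruction" in example and "output" in example:
        return "alpaca"
    if "text" in example:
        return "raw"
    return "unknown"
```

Checking `messages` and `conversations` before the Alpaca fields matters, since chat datasets sometimes carry extra columns.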
Check required fields exist. Report issues with fix suggestions.
Display 2-3 examples for visual verification.
Report statistics: total tokens, min/max/mean/median sequence length.
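The statistics step amounts to summarising per-example token counts (assumed to come from the target model's tokenizer); a minimal sketch:

```python
from statistics import mean, median

def sequence_stats(token_counts: list[int]) -> dict:
    """Summarise per-example token counts for the validation report."""
    return {
        "total_tokens": sum(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "mean": mean(token_counts),
        "median": median(token_counts),
    }
```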
Flag concerns such as empty examples, extreme length outliers, or sequences exceeding the model's context window.
Ask for target model and LoRA rank, then calculate:
| Chinchilla Fraction | Interpretation |
|---|---|
| < 0.5x | Dataset may be too small |
| 0.5x - 2.0x | Good range |
| > 2.0x | Large dataset, may take longer |
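As a rough sketch of the calculation: the Chinchilla heuristic budgets about 20 training tokens per model parameter, and the fraction is the dataset's size relative to that budget. The skill's exact formula (including how it accounts for LoRA rank) isn't shown here, so treat this as an assumption:

```python
def chinchilla_fraction(dataset_tokens: int, params: int,
                        tokens_per_param: int = 20) -> float:
    """Dataset size relative to the Chinchilla-optimal budget (~20 tokens/param)."""
    return dataset_tokens / (tokens_per_param * params)
```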
Based on the analysis, suggest preprocessing steps, e.g. `standardize_sharegpt()` for ShareGPT data. Offer to upload local datasets to the Hub.
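Unsloth's `standardize_sharegpt()` maps ShareGPT `from`/`value` turns onto ChatML-style `role`/`content` messages. A minimal stand-in sketch of that mapping (the role names below are assumptions, and this is not Unsloth's implementation):

```python
# Assumed role mapping; actual datasets may use other speaker labels.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(conversations: list[dict]) -> list[dict]:
    """Convert ShareGPT from/value turns into ChatML-style role/content messages."""
    return [
        {"role": ROLE_MAP.get(t["from"], t["from"]), "content": t["value"]}
        for t in conversations
    ]
```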
Pass context to funsloth-train:

```yaml
dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2
```