Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.
/plugin marketplace add chrisvoncsefalvay/funsloth/plugin install funsloth@funslothThis skill inherits all available tools. When active, it can use any tool Claude has access to.
DATA_FORMATS.mdscripts/validate_dataset.pyValidate datasets before fine-tuning with Unsloth.
For automated validation, use the script:
python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16
Ask for: HF dataset ID (e.g., mlabonne/FineTome-100k) or local path (e.g., ./data.jsonl)
Auto-detect format from structure. See DATA_FORMATS.md for details.
| Format | Detection | Key Fields |
|---|---|---|
| Raw | text only | text |
| Alpaca | instruction + output | instruction, output |
| ShareGPT | conversations array | from, value |
| ChatML | messages array | role, content |
Check required fields exist. Report issues with fix suggestions.
Display 2-3 examples for visual verification.
Report statistics: total tokens, min/max/mean/median sequence length.
Flag concerns:
Ask for target model and LoRA rank, then calculate:
| Chinchilla Fraction | Interpretation |
|---|---|
| < 0.5x | Dataset may be too small |
| 0.5x - 2.0x | Good range |
| > 2.0x | Large dataset, may take longer |
Based on analysis, suggest:
standardize_sharegpt() for ShareGPT dataOffer to upload local datasets to Hub.
Pass context to funsloth-train:
dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2
Use when working with Payload CMS projects (payload.config.ts, collections, fields, hooks, access control, Payload API). Use when debugging validation errors, security issues, relationship queries, transactions, or hook behavior.
Applies Anthropic's official brand colors and typography to any sort of artifact that may benefit from having Anthropic's look-and-feel. Use it when brand colors or style guidelines, visual formatting, or company design standards apply.
Creating algorithmic art using p5.js with seeded randomness and interactive parameter exploration. Use this when users request creating art using code, generative art, algorithmic art, flow fields, or particle systems. Create original algorithmic art rather than copying existing artists' work to avoid copyright violations.