# chrisvoncsefalvay-funsloth

Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.

Install:

```
npx claudepluginhub joshuarweaver/cascade-ai-ml-engineering --plugin chrisvoncsefalvay-funsloth
```

This skill uses the workspace's default tool permissions.
Validate datasets before fine-tuning with Unsloth.
For automated validation, use the script:

```
python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16
```
Ask for: HF dataset ID (e.g., mlabonne/FineTome-100k) or local path (e.g., ./data.jsonl)
Auto-detect format from structure. See DATA_FORMATS.md for details.
| Format | Detection | Key Fields |
|---|---|---|
| Raw | `text` only | `text` |
| Alpaca | `instruction` + `output` | `instruction`, `output` |
| ShareGPT | `conversations` array | `from`, `value` |
| ChatML | `messages` array | `role`, `content` |
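The detection rules in the table above can be sketched as a key check on a single example record; this is an illustrative helper, not the skill's actual implementation:

```python
def detect_format(example: dict) -> str:
    """Guess the dataset format from the keys of one example record."""
    if "messages" in example:
        return "chatml"      # role/content turns
    if "conversations" in example:
        return "sharegpt"    # from/value turns
    if "instruction" in example and "output" in example:
        return "alpaca"
    if "text" in example:
        return "raw"
    return "unknown"
```

Checking `messages` and `conversations` before the Alpaca fields matters, since chat datasets sometimes carry extra columns.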
Check required fields exist. Report issues with fix suggestions.
Display 2-3 examples for visual verification.
Report statistics: total tokens, min/max/mean/median sequence length.
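The statistics step amounts to summarising per-example token counts (assumed to come from the target model's tokenizer); a minimal sketch:

```python
from statistics import mean, median

def sequence_stats(token_counts: list[int]) -> dict:
    """Summarise per-example token counts for the validation report."""
    return {
        "total_tokens": sum(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "mean": mean(token_counts),
        "median": median(token_counts),
    }
```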
Flag concerns such as empty examples, extreme length outliers, or sequences exceeding the model's context window.
Ask for target model and LoRA rank, then calculate:
| Chinchilla Fraction | Interpretation |
|---|---|
| < 0.5x | Dataset may be too small |
| 0.5x - 2.0x | Good range |
| > 2.0x | Large dataset, may take longer |
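As a rough sketch of the calculation: the Chinchilla heuristic budgets about 20 training tokens per model parameter, and the fraction is the dataset's size relative to that budget. The skill's exact formula (including how it accounts for LoRA rank) isn't shown here, so treat this as an assumption:

```python
def chinchilla_fraction(dataset_tokens: int, params: int,
                        tokens_per_param: int = 20) -> float:
    """Dataset size relative to the Chinchilla-optimal budget (~20 tokens/param)."""
    return dataset_tokens / (tokens_per_param * params)
```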
Based on the analysis, suggest preprocessing steps, e.g. `standardize_sharegpt()` for ShareGPT data. Offer to upload local datasets to the Hub.
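Unsloth's `standardize_sharegpt()` maps ShareGPT `from`/`value` turns onto ChatML-style `role`/`content` messages. A minimal stand-in sketch of that mapping (the role names below are assumptions, and this is not Unsloth's implementation):

```python
# Assumed role mapping; actual datasets may use other speaker labels.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(conversations: list[dict]) -> list[dict]:
    """Convert ShareGPT from/value turns into ChatML-style role/content messages."""
    return [
        {"role": ROLE_MAP.get(t["from"], t["from"]), "content": t["value"]}
        for t in conversations
    ]
```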
Pass context to funsloth-train:

```yaml
dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2
```