Skill

funsloth-local

Training manager for local GPU training - validate CUDA, manage GPU selection, monitor progress, handle checkpoints

Install

npx claudepluginhub joshuarweaver/cascade-ai-ml-engineering --plugin chrisvoncsefalvay-funsloth

Tool Access

This skill uses the workspace's default tool permissions.

Supporting Assets

notebooks/sft_template.ipynb
references/HARDWARE_GUIDE.md
references/TROUBLESHOOTING.md
scripts/train_sft.py

SKILL.md

Stats

Stars: 5

Forks: 0

Last Commit: Dec 18, 2025

Local GPU Training Manager

Run Unsloth training on your local GPU.

Prerequisites Check

1. Verify CUDA

import torch

print(f"CUDA available: {torch.cuda.is_available()}")
# Query the device only when CUDA is present; these calls raise otherwise
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1e9:.1f} GB")

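If CUDA is not available:

Check the NVIDIA driver: nvidia-smi
Check the CUDA toolkit: nvcc --version
Reinstall PyTorch with a CUDA build (cu121 shown here; use the index URL that matches your CUDA version): pip install torch --index-url https://download.pytorch.org/whl/cu121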

2. Check VRAM

See references/HARDWARE_GUIDE.md for requirements:

VRAM	Recommended Setup
8GB	7B, 4-bit, batch=1, LoRA r=8
12GB	7B, 4-bit, batch=2, LoRA r=16
16GB	7-13B, 4-bit, batch=2, LoRA r=16-32
24GB	7-14B, 4-bit, batch=4, LoRA r=32
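
As a rough illustration of how the table can be applied, the sketch below maps a detected VRAM figure to the matching row. The names VRAM_TIERS and recommend_setup are hypothetical helpers, not part of the skill, and the tiers are copied from the table rather than from references/HARDWARE_GUIDE.md.

```python
# Rows copied from the VRAM table above: (minimum VRAM in GB, recommended setup).
VRAM_TIERS = [
    (8, "7B, 4-bit, batch=1, LoRA r=8"),
    (12, "7B, 4-bit, batch=2, LoRA r=16"),
    (16, "7-13B, 4-bit, batch=2, LoRA r=16-32"),
    (24, "7-14B, 4-bit, batch=4, LoRA r=32"),
]


def recommend_setup(vram_gb: float):
    """Return the setup of the largest tier that fits, or None below 8 GB."""
    best = None
    for min_gb, setup in VRAM_TIERS:
        if vram_gb >= min_gb:
            best = setup
    return best


print(recommend_setup(12))  # prints: 7B, 4-bit, batch=2, LoRA r=16
```

The value passed in would typically be the VRAM figure printed by the CUDA check in step 1.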

3. Check Dependencies

pip install unsloth torch transformers trl peft datasets accelerate bitsandbytes
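
Before reinstalling everything, it may help to see which packages are actually missing. The following is a minimal sketch using only the standard library; missing_packages is a hypothetical helper, and it assumes each package's import name equals its pip name, which holds for the list above.

```python
import importlib.util

# Import names of the packages from the pip command above.
REQUIRED = ["unsloth", "torch", "transformers", "trl", "peft",
            "datasets", "accelerate", "bitsandbytes"]


def missing_packages(names):
    """Return the names that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing:", ", ".join(missing) if missing else "none")
```

find_spec only locates a package, so this check stays fast and does not initialize CUDA.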

Docker Option

Use the official Unsloth Docker image for a pre-configured environment (supports all GPUs including Blackwell/50-series):

docker run -d \
  -e JUPYTER_PASSWORD="unsloth" \
  -p 8888:8888 \
  -v "$(pwd)/work:/workspace/work" \
  --gpus all \
  unsloth/unsloth

Access Jupyter at http://localhost:8888. Example notebooks are in /workspace/unsloth-notebooks/.

Environment variables:

JUPYTER_PASSWORD - Jupyter auth (default: unsloth)
JUPYTER_PORT - Port (default: 8888)
USER_PASSWORD - User/sudo password (default: unsloth)

Run Training

Option 1: Notebook

jupyter notebook notebooks/sft_template.ipynb

Option 2: Script

# Edit configuration in script, then run
python scripts/train_sft.py

GPU Selection (Multi-GPU)

import os
# Set this before importing torch; it is read when CUDA initializes
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Expose only the first GPU
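
On a shared multi-GPU machine, hard-coding index 0 may pick a busy card. One possible approach, sketched below, is to ask nvidia-smi for free memory per GPU and select the emptiest one. The function names are hypothetical, and the sample string is illustrative rather than real output.

```python
import os
import subprocess


def parse_free_memory(csv_text: str):
    """Parse 'index, free_MiB' rows into a list of (index, free_MiB) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, free = (field.strip() for field in line.split(","))
        rows.append((int(index), int(free)))
    return rows


def pick_gpu(csv_text: str) -> int:
    """Return the index of the GPU with the most free memory."""
    return max(parse_free_memory(csv_text), key=lambda row: row[1])[0]


def query_nvidia_smi() -> str:
    """Ask nvidia-smi for per-GPU free memory (requires an NVIDIA driver)."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )


def select_best_gpu() -> None:
    """Expose only the emptiest GPU; call this before importing torch."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(pick_gpu(query_nvidia_smi()))


# Sample text for illustration only, not a real measurement.
sample = "0, 1024\n1, 20480\n"
print(pick_gpu(sample))  # prints: 1
```

select_best_gpu() is not called here because it needs a working driver. On a training machine it would run at the top of the script.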

Monitor Training

Terminal

# Watch GPU usage
watch -n 1 nvidia-smi

# Or use nvitop (more detailed)
pip install nvitop && nvitop
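
For unattended runs, the same figures can be read programmatically. The sketch below parses the CSV form of nvidia-smi output and flags GPUs at or above a temperature limit. The helper names and the sample rows are hypothetical, and the 85 C default mirrors the guideline in the Tips section.

```python
# Command assumed to produce the text parsed below:
#   nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits


def parse_gpu_stats(csv_text: str):
    """Turn each CSV row into a dict of integer readings."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, temp, util, used, total = (f.strip() for f in line.split(","))
        stats.append({
            "index": int(index),
            "temp_c": int(temp),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return stats


def overheating(stats, limit_c: int = 85):
    """Return the indices of GPUs at or above the temperature limit."""
    return [s["index"] for s in stats if s["temp_c"] >= limit_c]


# Sample rows for illustration only, not real measurements.
sample = "0, 71, 98, 20110, 24576\n1, 88, 100, 23900, 24576\n"
print(overheating(parse_gpu_stats(sample)))  # prints: [1]
```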

WandB (Optional)

export WANDB_API_KEY="your-key"
# Add report_to="wandb" in TrainingArguments
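
Since WandB is optional, one way to avoid editing the script per machine is to derive the report_to value from the environment. This is a small sketch; pick_report_to is a hypothetical helper, and "none" is the value Transformers accepts to disable reporting integrations.

```python
import os


def pick_report_to(env=os.environ) -> str:
    """Return "wandb" when an API key is configured, otherwise "none"."""
    return "wandb" if env.get("WANDB_API_KEY") else "none"


print(pick_report_to({"WANDB_API_KEY": "your-key"}))  # prints: wandb
print(pick_report_to({}))  # prints: none

# Usage (assumes transformers is installed):
# TrainingArguments(output_dir="outputs", report_to=pick_report_to())
```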

Troubleshooting

OOM Error

Try in order:

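Reduce per_device_train_batch_size (down to 1)
Increase gradient_accumulation_steps to keep the effective batch size
Reduce max_seq_length
Reduce LoRA rank (r)
Call torch.cuda.empty_cache() between runs to release cached memory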
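
The first two steps work together: the effective batch size is the per-device batch multiplied by the gradient accumulation steps, so shrinking one while growing the other keeps the optimization behaviour roughly the same at lower memory cost. A minimal sketch of that arithmetic follows, using a hypothetical helper named rebalance_for_oom.

```python
def rebalance_for_oom(per_device_batch: int, grad_accum: int, new_batch: int = 1):
    """Shrink the per-device batch while preserving the effective batch size.

    Effective batch = per_device_batch * grad_accum (single GPU).
    Returns (new per-device batch, new gradient accumulation steps).
    """
    effective = per_device_batch * grad_accum
    if new_batch < 1 or effective % new_batch != 0:
        raise ValueError("new_batch must be a positive divisor of the effective batch")
    return new_batch, effective // new_batch


print(rebalance_for_oom(4, 4))  # prints: (1, 16)
```

The two returned values would be passed as per_device_train_batch_size and gradient_accumulation_steps.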

Loss Not Decreasing

Check learning rate (try higher or lower)
Verify chat template matches model
Inspect data format

Training Too Slow

Enable bf16 if supported
Use packing=True for short sequences
Log less often by increasing logging_steps
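
Whether bf16 is supported depends on the GPU generation: native bf16 on NVIDIA hardware starts with Ampere (compute capability 8.0). The sketch below chooses between bf16 and fp16 from a capability tuple. precision_flags is a hypothetical helper; on a real machine torch.cuda.is_bf16_supported() is the more direct check.

```python
def precision_flags(compute_capability):
    """Pick mixed-precision flags from a (major, minor) compute capability.

    Assumes bf16 is available from compute capability 8.0 (Ampere) upward
    and falls back to fp16 on older GPUs.
    """
    major, _minor = compute_capability
    use_bf16 = major >= 8
    return {"bf16": use_bf16, "fp16": not use_bf16}


print(precision_flags((8, 6)))  # prints: {'bf16': True, 'fp16': False}
print(precision_flags((7, 5)))  # prints: {'bf16': False, 'fp16': True}

# Usage (assumes torch and transformers are installed):
# flags = precision_flags(torch.cuda.get_device_capability(0))
# TrainingArguments(output_dir="outputs", **flags)
```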

See references/TROUBLESHOOTING.md for more solutions.

Resume from Checkpoint

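# Pass resume_from_checkpoint to train() on your existing trainer object;
# Trainer does not act on the TrainingArguments field of the same name itself
trainer.train(resume_from_checkpoint=True)  # Auto-find latest in output_dir
# Or: trainer.train(resume_from_checkpoint="outputs/checkpoint-500")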
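
When an explicit path is preferred over auto-detection, the newest checkpoint can be found by comparing the step numbers in the checkpoint-N directory names. This is a standard-library sketch with a hypothetical helper, latest_checkpoint; Transformers also ships get_last_checkpoint in transformers.trainer_utils for the same purpose.

```python
import re
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-N directory with the largest N, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = CHECKPOINT_RE.match(child.name)
        # Compare numerically so checkpoint-1000 beats checkpoint-500
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


print(latest_checkpoint("outputs"))
```

Comparing step numbers as integers matters here, because a plain string sort would rank checkpoint-500 above checkpoint-1000.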

Save Model

The training script automatically saves:

outputs/lora_adapter/ - LoRA weights
outputs/merged_16bit/ - Merged model (optional)

Test Inference

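from unsloth import FastLanguageModel

# Load the saved LoRA adapter directory
model, tokenizer = FastLanguageModel.from_pretrained("outputs/lora_adapter")
FastLanguageModel.for_inference(model)  # Switch to Unsloth's inference mode

messages = [{"role": "user", "content": "Hello!"}]
# add_generation_prompt=True appends the assistant turn marker so the model replies
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))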

Handoff

Offer funsloth-upload for Hub upload with model card.

Tips

Close other GPU apps before training
Monitor temperatures - keep the GPU under 85 °C
Use a UPS (uninterruptible power supply) for long runs
Save frequently with save_steps

Bundled Resources

notebooks/sft_template.ipynb - Notebook template
scripts/train_sft.py - Script template
references/HARDWARE_GUIDE.md - VRAM requirements
references/TROUBLESHOOTING.md - Common issues