From skymcp
Use when selecting a training framework, comparing NeMo vs Axolotl vs torchtune vs TRL vs DeepSpeed vs Megatron, choosing between FSDP and DeepSpeed, deciding how to fine-tune or pretrain a model, or configuring LoRA/QLoRA/full fine-tuning - the definitive framework selection guide for ML training at any scale
Slash command: /skymcp:ml-training-frameworks
| Use Case | Framework | Why |
|---|---|---|
| Pretraining 100B+ params | NeMo 2.0 / Megatron-LM | Highest MFU, 5 parallelism axes (TP/PP/DP/EP/CP), FP8/FP4 |
| Pretraining 1B-70B params | torchtune / Axolotl + FSDP2 | Portable, good torch.compile support, no vendor lock |
| Fine-tuning (SFT) | Axolotl | YAML-driven, widest model support, LoRA/QLoRA/full |
| Post-training (DPO/GRPO/PPO) | TRL v0.28+ | Standard HF ecosystem, SFTTrainer/DPOTrainer/GRPOTrainer |
| Memory-constrained (huge model, small GPU) | DeepSpeed ZeRO-3 + CPU offload | Shards parameters, gradients, and optimizer states across GPUs and offloads them to CPU RAM, trading throughput for memory |
| Inference serving | vLLM | PagedAttention, 2-24x throughput vs naive, continuous batching |
| Research / custom architectures | torchtune | PyTorch-native, deep compile integration, clean recipe system |
| Multi-modal fine-tuning | Axolotl | Vision + language support, multipack, sample packing |
Use Axolotl:
# axolotl.yaml
base_model: meta-llama/Meta-Llama-3-8B
model_type: LlamaForCausalLM
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_target_linear: true
dataset_prepared_path: last_run_prepared
datasets:
  - path: dataset.jsonl
    type: alpaca
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 2e-4
bf16: auto
flash_attention: true
accelerate launch -m axolotl.cli.train axolotl.yaml
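The batch-size settings in the config above combine multiplicatively with the number of GPUs. As a rough sketch (the helper name `effective_batch` and the 8-GPU count are illustrative assumptions, not part of Axolotl), the sequences and tokens consumed per optimizer step can be estimated like this:

```python
def effective_batch(micro_batch_size: int, grad_accum_steps: int,
                    num_gpus: int, sequence_len: int) -> tuple[int, int]:
    """Return (sequences, maximum tokens) consumed per optimizer step."""
    sequences = micro_batch_size * grad_accum_steps * num_gpus
    # Upper bound: every sequence is assumed to be filled to sequence_len.
    return sequences, sequences * sequence_len


# micro_batch_size=2, gradient_accumulation_steps=4, sequence_len=4096 come
# from the config above; num_gpus=8 is an assumed single 8-GPU node.
seqs, tokens = effective_batch(2, 4, 8, 4096)
print(seqs, tokens)  # 64 262144
```

The token figure is a ceiling. Without sample packing, padded positions mean the real number of training tokens per step is lower.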
Use torchtune with FSDP2:
tune run --nproc_per_node 8 full_finetune_distributed \
--config recipes/llama3/7B_full.yaml \
model.compile=True \
training.enable_activation_checkpointing=True
Use TRL:
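from trl import DPOConfig, DPOTrainer

# `dataset` is assumed to be a preference dataset with "prompt", "chosen",
# and "rejected" columns, loaded beforehand (e.g. with datasets.load_dataset).
config = DPOConfig(
    learning_rate=5e-7,
    beta=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    bf16=True,
)
# The model is passed to the trainer, not the config. With ref_model unset,
# DPOTrainer creates the frozen reference copy itself.
trainer = DPOTrainer(
    model="meta-llama/Llama-3-8B-SFT",
    args=config,
    train_dataset=dataset,
)
trainer.train()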
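To make the role of `beta` concrete, here is a minimal sketch of the standard sigmoid DPO objective for a single preference pair. It is not TRL's implementation, which works on batched tensors and supports other loss variants. The function name `dpo_loss` and the scalar log-probability inputs are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one pair, given summed sequence log-probs."""
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy favors the chosen answer more than the reference does: lower loss.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

A larger `beta` scales the margin up, so the same shift in log-probabilities is rewarded or penalized more sharply and the policy stays closer to the reference model.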
Use NeMo 2.0:
from nemo import lightning as nl

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=8,
    pipeline_model_parallel_size=4,
    context_parallel_size=2,
    expert_model_parallel_size=1,
    sequence_parallel=True,
)
# TP 8 x PP 4 x CP 2 = 64 GPUs per model replica, e.g. 8 nodes of 8 GPUs.
trainer = nl.Trainer(
    devices=8,
    num_nodes=8,
    accelerator="gpu",
    strategy=strategy,
    max_steps=100000,
)
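The parallelism sizes above fix how many GPUs one model replica occupies. Whatever remains becomes data parallelism. A small sketch of that arithmetic follows. The helper name `data_parallel_size` is hypothetical, and expert parallelism is left out because it is set to 1 in this example.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Number of data-parallel replicas for a given total GPU count."""
    gpus_per_replica = tp * pp * cp
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world_size {world_size} is not divisible by "
            f"TP*PP*CP = {gpus_per_replica}"
        )
    return world_size // gpus_per_replica


# TP=8, PP=4, CP=2 as in the strategy above, on an assumed 256-GPU cluster.
print(data_parallel_size(256, tp=8, pp=4, cp=2))  # 4
```

With these settings the smallest valid job is 64 GPUs, which yields a single replica and no data parallelism.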
| Nodes | GPUs Total | Strategy |
|---|---|---|
| 1 | 1-2 | FSDP2 or DeepSpeed ZeRO-2 |
| 1 | 4-8 | FSDP2 with activation checkpointing |
| 2-8 | 16-64 | FSDP2 + gradient checkpointing |
| 8-32 | 64-256 | Megatron TP + FSDP DP |
| 32+ | 256+ | Megatron TP + PP + DP + CP |
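The table can be read as a simple lookup on total GPU count. The sketch below encodes it that way. The function name `pick_strategy` is hypothetical, and the exact cut-offs are assumptions, since the table's ranges leave gaps (3 GPUs, 9-15 GPUs) and overlap at 64 and 256.

```python
def pick_strategy(total_gpus: int) -> str:
    """Map a total GPU count to the strategy suggested in the table."""
    if total_gpus < 1:
        raise ValueError("total_gpus must be at least 1")
    # (inclusive upper bound, strategy). The bounds are assumptions that
    # close the gaps and overlaps between the table's ranges.
    tiers = [
        (2, "FSDP2 or DeepSpeed ZeRO-2"),
        (8, "FSDP2 with activation checkpointing"),
        (64, "FSDP2 + gradient checkpointing"),
        (255, "Megatron TP + FSDP DP"),
    ]
    for upper, strategy in tiers:
        if total_gpus <= upper:
            return strategy
    return "Megatron TP + PP + DP + CP"


print(pick_strategy(8))    # FSDP2 with activation checkpointing
print(pick_strategy(512))  # Megatron TP + PP + DP + CP
```

Treat the result as a starting point only. Model size, interconnect bandwidth, and sequence length can all justify moving up or down a tier.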
Rule-of-thumb for model memory (bf16 training):
| Component | Per-Parameter Cost |
|---|---|
| Parameters | 2 bytes (bf16) |
| Gradients | 2 bytes (bf16) |
| Optimizer (AdamW) | 8 bytes (fp32 moments) |
| Activations | ~2 bytes (varies with seq length) |
| Total | ~14 bytes per parameter |
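A 7B model therefore requires ~98 GB (7B parameters × 14 bytes) for full fine-tuning. LoRA reduces this to ~16-24 GB: the frozen base weights still occupy ~14 GB in bf16, but only the adapter parameters need gradients and optimizer states.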
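The per-parameter costs in the table translate directly into a quick estimator. This is a sketch of the rule of thumb only. The names `BYTES_PER_PARAM` and `full_finetune_gb` are illustrative, and real usage depends on batch size, sequence length, and framework overhead.

```python
# Per-parameter costs in bytes, taken from the table above.
BYTES_PER_PARAM = {
    "parameters": 2,   # bf16 weights
    "gradients": 2,    # bf16 gradients
    "optimizer": 8,    # AdamW fp32 first and second moments
    "activations": 2,  # rough average; grows with sequence length
}


def full_finetune_gb(num_params: float) -> float:
    """Estimated memory in GB (10^9 bytes) for full fine-tuning."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(full_finetune_gb(7e9))  # 98.0
```

Dividing the result by the per-GPU memory gives a lower bound on how many GPUs a fully sharded setup needs before any overhead is counted.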
| Method | Memory | Speed | Quality | When to Use |
|---|---|---|---|---|
| Full fine-tune | Highest (~14 bytes/param) | Slowest | Best | Ample compute budget and maximum quality required |
| LoRA (r=32) | ~30% of full | 2-3x faster | 95-99% of full | Default for most tasks |
| QLoRA (4-bit + LoRA) | ~15% of full | 2x faster | 93-97% of full | Limited VRAM |
| QAT (quantize-aware) | ~40% of full | Slower | Best at inference | Deploying quantized |
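The memory savings in the LoRA rows come from how few parameters are actually trained. The sketch below counts adapter parameters when every linear projection in each decoder block is adapted. The layer shapes are an assumption taken from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 8 KV heads of dimension 128, 32 layers), and the helper name is hypothetical.

```python
# (in_features, out_features) of the linear layers in one decoder block,
# assumed from the published Llama 3 8B configuration.
LLAMA3_8B_LINEARS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(rank: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_block = sum(
        rank * (d_in + d_out) for d_in, d_out in LLAMA3_8B_LINEARS.values()
    )
    return per_block * NUM_LAYERS


print(lora_trainable_params(32))  # 83886080
```

At r=32 that is roughly 84M trainable parameters, about 1% of the base model, so gradients and optimizer states shrink to around 1 GB while the frozen weights dominate the footprint.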
| Feature | NeMo 2.0 | Axolotl | torchtune | TRL | DeepSpeed |
|---|---|---|---|---|---|
| LoRA | Yes | Yes | Yes | Yes | Yes |
| QLoRA | No | Yes | Yes | Yes | Yes |
| FSDP2 | No (Megatron) | Yes | Yes | Yes | No (own sharding) |
| torch.compile | Partial | Partial | Full | Partial | No |
| Flash Attention | Yes | Yes | Yes | Yes | Yes |
| Multi-node | Yes | Yes | Yes | Yes | Yes |
| Multimodal | Yes | Yes | Partial | Partial | Yes |
| RLHF/DPO | Yes | GRPO | DPO | Full suite | With TRL |
See references/framework-details.md for configuration deep dives. See references/axolotl-config.md for complete Axolotl YAML reference. See references/deepspeed-config.md for ZeRO configuration reference.
npx claudepluginhub slapglif/skymcp --plugin skymcp