Lightning Fabric integration recipe — minimal 5-line setup that gives DDP / FSDP / mixed precision while keeping a raw PyTorch training loop. Activate when the user asks about "Lightning Fabric", "torchrun", "DDP setup", "FSDP setup", or "mixed precision", or wires up the launch script.
```bash
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
curryTrain uses **Lightning Fabric**, not the Lightning Trainer. Fabric is a minimal `Fabric` class (~5 lines of integration) that gives DDP / FSDP / mixed precision / device placement, while leaving the user in full control of the training loop.
Subagent B's research is unambiguous: the Trainer locks you into its loop abstraction, while curryTrain wants methodology recipes (pre-validate, sanity, runs-diff) to drive the loop. The Trainer fights that; Fabric supports the recipes naturally.
```python
import hydra
import lightning as L
import torch
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path="configs", config_name="config")
def main(cfg: DictConfig):
    # 1. Fabric setup
    fabric = L.Fabric(
        accelerator="gpu",
        devices=cfg.parallelism.devices,
        strategy=cfg.parallelism.strategy,   # "ddp", "fsdp", "deepspeed_stage_3"
        precision=cfg.training.precision,    # "bf16-mixed", "16-mixed", "32-true"
    )
    fabric.launch()

    # 2. Build model + optimizer (still raw PyTorch)
    model = build_model(cfg.model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.training.lr)
    scheduler = warmup_cosine_schedule(optimizer, ...)

    # 3. One Fabric call wraps both
    model, optimizer = fabric.setup(model, optimizer)
    train_loader = fabric.setup_dataloaders(build_loader(cfg.data))

    # 4. Raw training loop
    with Run(cfg) as run:
        for step, batch in enumerate(train_loader):
            optimizer.zero_grad()
            loss = loss_fn(model(batch.x), batch.y)
            fabric.backward(loss)  # replaces loss.backward()
            optimizer.step()
            scheduler.step()
            run.log_metric(step=step, loss=loss.item(), ...)
```
The full diff vs. plain PyTorch:
- `fabric.launch()` instead of manual `torch.distributed.init_process_group`.
- `fabric.setup(model, optimizer)` instead of manual DDP wrapping.
- `fabric.setup_dataloaders(loader)` instead of a manual `DistributedSampler`.
- `fabric.backward(loss)` instead of `loss.backward()`.
- `fabric.print(...)`, `fabric.save(...)`, `fabric.load(...)` for rank-aware utilities (see the sketch after the table below).

| `cfg.parallelism.strategy` | What it gives |
|---|---|
| `"auto"` | Sensible default for the visible hardware. |
| `"ddp"` | DistributedDataParallel; smallest overhead, no sharding. |
| `"ddp_find_unused_parameters_true"` | DDP allowing unused params (avoid in production). |
| `"fsdp"` | FSDP with default sharding; can be configured via dataclass. |
| `"deepspeed_stage_3"` | DeepSpeed ZeRO-3 (heavier; only when FSDP is insufficient). |
For full control of FSDP options:

```python
from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp import BackwardPrefetch, MixedPrecision

strategy = FSDPStrategy(
    auto_wrap_policy=...,
    mixed_precision=MixedPrecision(...),
    activation_checkpointing_policy=...,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # enum, not a string
)
fabric = L.Fabric(strategy=strategy, ...)
```
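As a concrete starting point, a sketch with those options filled in, assuming the model is a stack of `TransformerBlock` modules (a hypothetical class name; substitute your own repeated layer):

```python
import torch
from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp import BackwardPrefetch, MixedPrecision

strategy = FSDPStrategy(
    # One FSDP shard unit per TransformerBlock
    auto_wrap_policy={TransformerBlock},
    # bf16 compute and gradient reduction; sharded params stay in
    # the model's original dtype
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    # Recompute each block's activations in backward to save memory
    activation_checkpointing_policy={TransformerBlock},
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
)
```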
curryTrain does not ship a custom launcher. Use `torchrun`:

```bash
torchrun --nproc_per_node=8 train.py experiment.name=...
```

Multi-node:

```bash
torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv-id=run42 --rdzv-backend=c10d --rdzv-endpoint=$MASTER:29500 \
  train.py experiment.name=...
```
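For a quick single-process debug run, no launcher is needed; a sketch assuming the Hydra keys from the recipe above:

```bash
# One process, one GPU; fabric.launch() spawns nothing at devices=1
python train.py parallelism.devices=1 parallelism.strategy=auto experiment.name=...
```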
1. Confirm Lightning is installed (`pip install lightning`). The `Fabric` class lives at `lightning.Fabric`.
2. Convert their existing single-process PyTorch training script to Fabric using the four-line diff above. Do not refactor toward `LightningModule`.
3. Confirm DDP works first (`strategy="ddp"`), then escalate to FSDP only if memory requires it (see the sketch after this list).
4. For mixed precision, default to `"bf16-mixed"` on Ampere/Hopper. fp16 is brittle; use it only for older GPUs.
5. After conversion, invoke the bench skill (e.g. ask Claude to "smoke-test the runtime for 5 steps") to confirm everything works end-to-end on the chosen strategy.
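The escalation path from step 3 as Hydra overrides (a sketch; key names taken from the recipe above):

```bash
# First prove out plain DDP
torchrun --nproc_per_node=8 train.py parallelism.strategy=ddp experiment.name=...

# Escalate only if DDP runs out of memory
torchrun --nproc_per_node=8 train.py \
  parallelism.strategy=fsdp training.precision=bf16-mixed experiment.name=...
```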
Common pitfalls:

- `loss.backward()` instead of `fabric.backward(loss)` → silent bug under mixed precision and certain strategies (see the sketch at the end of this section).
- Wrong `precision` (`"32"` when intending `"32-true"`) → `bf16-mixed` is the modern default; `"32"` is full fp32.
- Skipping `fabric.setup_dataloaders(...)` → no `DistributedSampler`, so ranks see overlapping data.

Related skills:

- `skills/infra-hydra-config` — the parallelism config group.
- `skills/primitive-distributed-optimizer` — how Fabric's FSDP relates to the primitive.
- `skills/stage4-parallel-primitive-intro` — when to escalate strategy.
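Illustrating the first pitfall, a minimal before/after using the names from the training loop above:

```python
# Wrong: bypasses Fabric, so the precision plugin (e.g. fp16 loss
# scaling under "16-mixed") and strategy hooks never run
loss.backward()

# Right: Fabric routes backward through the active strategy/precision
fabric.backward(loss)
```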