Provides PyTorch 2.6–2.11 updates: torch.load weights_only=True default, FSDP2 fully_shard, torch.compile mega cache/hierarchical compilation/control flow, varlen_attn, FlexAttention FA4, TorchScript deprecation. Load this skill before writing PyTorch code.
```
npx claudepluginhub nevaberry/nevaberry-plugins --plugin pytorch-knowledge-patch
```

This skill uses the workspace's default tool permissions.
Claude's baseline knowledge covers PyTorch through 2.5. This skill provides changes from PyTorch 2.6 through 2.11 (2025-01 to 2026-03).
| Feature | API | Since |
|---|---|---|
| Safe loading default | torch.load() now weights_only=True | 2.6 |
| Compile stance control | torch.compiler.set_stance("eager_on_recompile") | 2.6 |
| Custom Triton ops | @torch.library.triton_op("lib::name", mutates_args=()) | 2.6 |
| Auto dynamic shapes | Dim.AUTO in torch.export | 2.6 |
| Mega cache (portable) | torch.compiler.save_cache_artifacts() / load_cache_artifacts() | 2.7 |
| Context parallelism | context_parallel(mesh) context manager for SDPA | 2.7 |
| Foreach map | torch._foreach_map(fn, tensors, ...) | 2.7 |
| Control flow ops | cond, while_loop, scan, associative_scan | 2.8 |
| Hierarchical compile | torch.compiler.nested_compile_region() | 2.8 |
| DCP SafeTensors | dcp.FileSystemWriter(path, format="safetensors") | 2.8 |
| FSDP1 deprecated | Use fully_shard() (FSDP2) instead | 2.8 |
| Symmetric memory | torch.ops.symm_mem for in-kernel collectives | 2.9 |
| Graph break errors | torch._dynamo.error_on_graph_break() | 2.9 |
| Variable-length attn | varlen_attn(q, k, v, cu_seqlens_q, ...) | 2.10 |
| TorchScript deprecated | Use torch.export instead | 2.10 |
| Deterministic compile | torch.use_deterministic_algorithms(True) applies to compile | 2.10 |
| DebugMode | torch.debugging.DebugMode() for numerical debugging | 2.10 |
| Differentiable collectives | Functional collectives support backprop | 2.11 |
| FlexAttention + FA4 | Auto FA4 kernels on Hopper/Blackwell | 2.11 |
| CUDA 13 default | CUDA 12.8 still available via download.pytorch.org/whl/cu128 | 2.11 |
torch.load("file.pt") now uses weights_only=True by default. Loading full nn.Module objects will fail.

```python
# Old code that breaks:
model = torch.load("model.pt")  # fails if saved with torch.save(model)

# Fix: load a state_dict (recommended)
model.load_state_dict(torch.load("model.pt", weights_only=True))

# Fix: explicitly opt into unsafe loading (trusted files only)
model = torch.load("model.pt", weights_only=False)
```

For tensor subclasses or numpy arrays, use torch.serialization.safe_globals to allowlist the classes.
FSDP1 (the FullyShardedDataParallel wrapper) is deprecated since 2.8. Use FSDP2:

```python
from torch.distributed.fsdp import fully_shard

model = Transformer()
for layer in model.layers:
    fully_shard(layer)  # shard each layer
fully_shard(model)      # shard the root module

# Parameters become DTensors, sharded on dim 0.
# Construct the optimizer AFTER fully_shard.
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
```
See references/distributed-training.md for context parallelism, symmetric memory, differentiable collectives, and SafeTensors DCP support.
Mega cache (2.7) makes compiled artifacts portable across machines:

```python
artifacts = torch.compiler.save_cache_artifacts()
# Save to disk, transfer to another machine...
torch.compiler.load_cache_artifacts(artifacts)
```
Hierarchical compilation (2.8) compiles a marked region once and reuses it:

```python
@torch.compile
def model_forward(x):
    for layer in layers:
        with torch.compiler.nested_compile_region():
            x = layer(x)  # compiled once, reused for all layers
    return x
```
Five operators: cond, while_loop, scan, associative_scan, map.

```python
from torch._higher_order_ops.cond import cond
from torch._higher_order_ops.scan import scan

result = cond(pred_tensor, true_fn, false_fn, operands)
carry, outputs = scan(combine_fn, init_carry, xs)
```
```python
with torch._dynamo.error_on_graph_break():
    # Errors on graph breaks here (unlike fullgraph, which is all-or-nothing)
    compiled_fn(x)
```
See references/torch-compile.md for set_stance and deterministic mode.
```python
from torch.export import export, Dim

ep = export(model, (x,), dynamic_shapes={"x": {0: Dim.AUTO}})
# Automatically infers min/max ranges, relations between dims, static/dynamic behavior
```
```python
@torch.library.triton_op("mylib::add", mutates_args=())
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(x)
    # launch triton kernel...
    return output
```
See references/export-and-ops.md for foreach_map and TorchScript deprecation details.
```python
from torch.nn.attention.varlen import varlen_attn

# q, k, v are packed: (total_tokens, num_heads, head_dim)
# cu_seqlens marks sequence boundaries: [0, seq1_len, seq1_len+seq2_len, ...]
output = varlen_attn(q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k)
# Supports forward + backward and torch.compile. Requires A100+ and BF16/FP16.
```
FlexAttention on Hopper/Blackwell GPUs automatically uses FA4 kernels, giving a 1.2x–3.2x speedup over the Triton backend on compute-bound workloads. No code changes are needed; dispatch happens automatically inside flex_attention().
See references/attention.md for details.
```python
from torch.debugging import DebugMode

with DebugMode():
    output = model(x)
# Logs all dispatched ops with tensor hashes.
# Compare hashes between two runs to find the divergence point.
```
- TorchScript: use torch.export instead of torch.jit.script/torch.jit.trace; use ExecuTorch for embedded runtimes.
- FSDP: use fully_shard() (FSDP2) instead of FullyShardedDataParallel.
- CUDA: CUDA 13 wheels are the default; CUDA 12.8 wheels remain at download.pytorch.org/whl/cu128.
- Determinism: torch.use_deterministic_algorithms(True) now applies to torch.compile.

See references/environment.md for details on all compatibility changes.
| File | Contents |
|---|---|
| torch-compile.md | set_stance, mega cache, hierarchical compilation, control flow ops, error_on_graph_break, deterministic mode |
| distributed-training.md | FSDP2 fully_shard, context parallelism, symmetric memory, differentiable collectives, SafeTensors DCP |
| export-and-ops.md | Dim.AUTO, triton_op, TorchScript deprecation, foreach_map |
| attention.md | varlen_attn for packed sequences, FlexAttention + FA4 backend |
| environment.md | weights_only=True breaking change, CUDA 13 default, Python 3.14, DebugMode |