Provides PyTorch 2.6–2.11 updates: torch.load weights_only=True default, FSDP2 fully_shard, torch.compile mega cache/hierarchical compilation/control flow, varlen_attn, FlexAttention FA4, TorchScript deprecation. Load this skill before writing PyTorch code.
```
npx claudepluginhub nevaberry/nevaberry-plugins --plugin pytorch-knowledge-patch
```

This skill uses the workspace's default tool permissions.
Claude's baseline knowledge covers PyTorch through 2.5. This skill provides changes from PyTorch 2.6 through 2.11 (2025-01 to 2026-03).
| Feature | API | Since |
|---|---|---|
| Safe loading default | torch.load() now weights_only=True | 2.6 |
| Compile stance control | torch.compiler.set_stance("eager_on_recompile") | 2.6 |
| Custom Triton ops | @torch.library.triton_op("lib::name", mutates_args=()) | 2.6 |
| Auto dynamic shapes | Dim.AUTO in torch.export | 2.6 |
| Mega cache (portable) | torch.compiler.save_cache_artifacts() / load_cache_artifacts() | 2.7 |
| Context parallelism | context_parallel(mesh) context manager for SDPA | 2.7 |
| Foreach map | torch._foreach_map(fn, tensors, ...) | 2.7 |
| Control flow ops | cond, while_loop, scan, associative_scan | 2.8 |
| Hierarchical compile | torch.compiler.nested_compile_region() | 2.8 |
| DCP SafeTensors | dcp.FileSystemWriter(path, format="safetensors") | 2.8 |
| FSDP1 deprecated | Use fully_shard() (FSDP2) instead | 2.8 |
| Symmetric memory | torch.ops.symm_mem for in-kernel collectives | 2.9 |
| Graph break errors | torch._dynamo.error_on_graph_break() | 2.9 |
| Variable-length attn | varlen_attn(q, k, v, cu_seqlens_q, ...) | 2.10 |
| TorchScript deprecated | Use torch.export instead | 2.10 |
| Deterministic compile | torch.use_deterministic_algorithms(True) applies to compile | 2.10 |
| DebugMode | torch.debugging.DebugMode() for numerical debugging | 2.10 |
| Differentiable collectives | Functional collectives support backprop | 2.11 |
| FlexAttention + FA4 | Auto FA4 kernels on Hopper/Blackwell | 2.11 |
| CUDA 13 default | CUDA 12.8 still available via download.pytorch.org/whl/cu128 | 2.11 |
torch.load("file.pt") now uses weights_only=True by default. Loading full nn.Module objects will fail.

```python
# Old code that breaks:
model = torch.load("model.pt")  # fails if saved with torch.save(model)

# Fix: load a state_dict (recommended)
model.load_state_dict(torch.load("model.pt", weights_only=True))

# Fix: explicitly opt into unsafe loading (trusted files only)
model = torch.load("model.pt", weights_only=False)
```

For tensor subclasses or numpy arrays, use torch.serialization.safe_globals to allowlist the classes.
FSDP1 (the FullyShardedDataParallel wrapper) is deprecated since 2.8. Use FSDP2:

```python
from torch.distributed.fsdp import fully_shard

model = Transformer()
for layer in model.layers:
    fully_shard(layer)  # shard each layer
fully_shard(model)      # shard the root module

# Parameters become DTensors, sharded on dim 0.
# Construct the optimizer AFTER fully_shard.
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
```
See references/distributed-training.md for context parallelism, symmetric memory, differentiable collectives, and SafeTensors DCP support.
Mega cache (2.7) makes compiled artifacts portable across machines:

```python
artifacts = torch.compiler.save_cache_artifacts()
# Save to disk, transfer to another machine...
torch.compiler.load_cache_artifacts(artifacts)
```
Hierarchical compilation (2.8) compiles a marked region once and reuses it:

```python
@torch.compile
def model_forward(x):
    for layer in layers:
        with torch.compiler.nested_compile_region():
            x = layer(x)  # compiled once, reused for all layers
    return x
```
Five operators: cond, while_loop, scan, associative_scan, map.

```python
from torch._higher_order_ops.cond import cond
from torch._higher_order_ops.scan import scan

result = cond(pred_tensor, true_fn, false_fn, operands)
carry, outputs = scan(combine_fn, init_carry, xs)
```
```python
with torch._dynamo.error_on_graph_break():
    # Errors on graph breaks here (unlike fullgraph, which is all-or-nothing)
    compiled_fn(x)
```
See references/torch-compile.md for set_stance and deterministic mode.
```python
from torch.export import export, Dim

ep = export(model, (x,), dynamic_shapes={"x": {0: Dim.AUTO}})
# Automatically infers min/max ranges, relations between dims, static/dynamic behavior
```
```python
@torch.library.triton_op("mylib::add", mutates_args=())
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(x)
    # launch triton kernel...
    return output
```
See references/export-and-ops.md for foreach_map and TorchScript deprecation details.
```python
from torch.nn.attention.varlen import varlen_attn

# q, k, v are packed: (total_tokens, num_heads, head_dim)
# cu_seqlens marks sequence boundaries: [0, seq1_len, seq1_len+seq2_len, ...]
output = varlen_attn(q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k)
# Supports forward + backward and torch.compile. Requires A100+ and BF16/FP16.
```
FlexAttention on Hopper/Blackwell GPUs automatically uses FA4 kernels, giving a 1.2x–3.2x speedup over the Triton backend on compute-bound workloads. No code changes are needed; dispatch happens automatically inside flex_attention().
See references/attention.md for details.
```python
from torch.debugging import DebugMode

with DebugMode():
    output = model(x)
# Logs all dispatched ops with tensor hashes.
# Compare hashes between two runs to find the divergence point.
```
- TorchScript: use torch.export instead of torch.jit.script/torch.jit.trace; use ExecuTorch for embedded runtimes.
- FSDP: use fully_shard() (FSDP2) instead of FullyShardedDataParallel.
- CUDA: CUDA 13 wheels are the default; CUDA 12.8 wheels remain at download.pytorch.org/whl/cu128.
- Determinism: torch.use_deterministic_algorithms(True) now applies to torch.compile.

See references/environment.md for details on all compatibility changes.
| File | Contents |
|---|---|
| torch-compile.md | set_stance, mega cache, hierarchical compilation, control flow ops, error_on_graph_break, deterministic mode |
| distributed-training.md | FSDP2 fully_shard, context parallelism, symmetric memory, differentiable collectives, SafeTensors DCP |
| export-and-ops.md | Dim.AUTO, triton_op, TorchScript deprecation, foreach_map |
| attention.md | varlen_attn for packed sequences, FlexAttention + FA4 backend |
| environment.md | weights_only=True breaking change, CUDA 13 default, Python 3.14, DebugMode |