Activation checkpointing — recompute forward activations during backward instead of storing them, trading compute for memory. Activates when the user asks about "activation checkpointing", "recompute", "OOM during backward", or "gradient checkpointing", or needs to fit a larger model into memory.
```sh
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
Trade ~1.3× forward compute for ~50–80% activation memory savings. The first thing to try when a model OOMs.
Wraps a module so its forward activations are not stored at forward time; on backward, the forward is re-executed to recompute them. This effectively segments activation memory into checkpoints: only the inputs at checkpoint boundaries stay live between forward and backward.
```python
import torch
import torch.nn as nn

from curry_train.primitives import Recompute

# Wrap a transformer block
block = Recompute(TransformerBlock(...), enabled=cfg.recompute)

# Or as a flag on the model itself
class MyModel(nn.Module):
    def __init__(self, ..., recompute: bool = False):
        ...
        self.recompute = recompute

    def forward(self, x):
        for layer in self.layers:
            if self.recompute and self.training:
                # Recompute this layer's activations during backward
                x = torch.utils.checkpoint.checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x
```
Notes:
- Wrap each block independently; this is the best memory/compute trade (see skills/stage4-parallel-primitive-intro).
- In .eval() mode it's a no-op (PyTorch checkpoint is forward-only).
- Disable it during stage2-grad-flow-viz runs; activation hooks fire again during recomputation and double-count.
- Pass use_reentrant=False. use_reentrant=True (PyTorch's default before 2.0) has subtle bugs with FSDP and BF16. Always use use_reentrant=False unless you know you need otherwise.
- V1: stub. The reference at template/curry_train/primitives/recompute.py exposes the interface and raises NotImplementedError until populated. PyTorch's torch.utils.checkpoint.checkpoint is the recommended implementation.
See also:
- skills/stage4-parallel-primitive-intro — the order in which to add primitives; recompute comes first.
- skills/stage2-grad-flow-viz — disable recompute when probing.
- torch.utils.checkpoint.checkpoint — PyTorch's recommended implementation.