Skill

nsa-integrator

Produces an integration plan for Native Sparse Attention (NSA) in a long-context pre-training run. Use when you need to plan an NSA integration.

npx claudepluginhub anubhavg-icpl/vibe --plugin vibe

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/vibe:nsa-integrator

SKILL.md

33 lines · ~710 tokens

Stats

Language: Python

Stars: 1

Forks: 1

Maintenance: Excellent

Last commit: May 24, 2026

Given a long-context pre-training run specification (target context, base architecture, training tokens available, GPU topology, deployment target), produce an NSA integration plan.

Produce:

Compression block size l. Pick 32, 64, or 128. Justify against target context: l = 32 for 16k-32k, l = 64 for 64k-128k, l = 128 for 256k-plus. Larger l means fewer compressed keys but coarser routing signal.
Top-k selection count. Pick between 8 and 32. The paper's default is 16. Justify against the target task mix: reasoning-heavy tasks (math, code) benefit from higher k because selection precision matters more. Retrieval-heavy tasks work at lower k.
Sliding window W. Pick 256, 512, or 1024. Default 512. Shorter for heavily structured content (code) where local context is enough; longer for prose.
Gate MLP. Specify width and initialization. Default: a linear layer from the hidden dimension to 3 outputs (one gate per branch: compression, selection, sliding window), with sigmoid or softplus activation. Warn if gate weights collapse to favor one branch; this indicates that l, k, or W is mistuned.
Kernel choice. Confirm Triton or CUDA kernel availability for the target accelerator. Reject fallback to dense attention at inference (the whole point of NSA is to cut decode cost). If only forward kernels exist and no backward kernels, refuse NSA pre-training and recommend continuing to train the existing dense-attention checkpoints instead.
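The parameter choices above can be sketched in Python as a rough sizing aid. This is a minimal sketch, not the paper's implementation: the names `NSAConfig`, `pick_block_size`, `keys_per_query`, and `gate_collapsed` are hypothetical, contexts that fall between the listed bands are assigned to the next larger band, the compression stride and the selection block size are both assumed equal to l, and the 0.9 collapse threshold is an illustrative value only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NSAConfig:
    block_size: int  # compression block size l
    top_k: int       # number of selected blocks k
    window: int      # sliding window W


def pick_block_size(target_context: int) -> int:
    """Map a target context (tokens) to l using the bands listed above.

    Assumption: contexts between the listed bands use the next larger band.
    """
    if target_context < 16 * 1024:
        raise ValueError("target context under 16k is a hard reject")
    if target_context <= 32 * 1024:
        return 32
    if target_context <= 128 * 1024:
        return 64
    return 128


def keys_per_query(cfg: NSAConfig, context: int) -> int:
    """Approximate keys touched by one query at full context.

    Assumptions: stride == l (non-overlapping compression blocks) and
    selection blocks of size l. Overlap between branches is ignored.
    """
    compressed = context // cfg.block_size
    selected = cfg.top_k * cfg.block_size
    local = min(cfg.window, context)
    return compressed + selected + local


def attended_fraction(cfg: NSAConfig, context: int) -> float:
    """Keys touched per query as a fraction of dense attention's full context."""
    return keys_per_query(cfg, context) / context


def gate_collapsed(mean_gates: Sequence[float], threshold: float = 0.9) -> bool:
    """Flag a collapse when one branch holds at least `threshold` of total gate mass.

    `mean_gates` holds the average gate value per branch
    (compression, selection, sliding window). The threshold is hypothetical.
    """
    total = sum(mean_gates)
    if total <= 0:
        return True
    return max(mean_gates) / total >= threshold


if __name__ == "__main__":
    ctx = 64 * 1024
    cfg = NSAConfig(block_size=pick_block_size(ctx), top_k=16, window=512)
    print(cfg, keys_per_query(cfg, ctx), attended_fraction(cfg, ctx))
```

The attended fraction is only a first-order estimate of the expected savings; measured kernel throughput on the target accelerator should replace it in the final plan.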

Hard rejects:

NSA on a model pre-trained with dense attention without continued pre-training. Cannot be bolted on at inference.
Target context under 16k. The three-branch overhead dominates.
Inference-only deployments on stacks without NSA kernel support. Recommend MLA or sliding-window attention instead.

Refusal rules:

If long-context evaluation data (RULER, LongBench, needle-in-haystack) is not available, refuse and request calibration data first.
If the training-data context distribution is dominated by short sequences, refuse and recommend data reweighting before integrating NSA.
If the accelerator is older than A100, refuse. NSA's kernel advantages assume the memory hierarchy of A100-class or newer accelerators (for example H100, H200, MI300).
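The hard rejects and refusal rules above lend themselves to a preflight check that runs before any plan is written. The sketch below is one possible encoding, not part of the skill itself: `RunSpec`, `preflight`, and every field name are hypothetical, "16k" is taken as 16 * 1024 tokens, "dominated by short sequences" is approximated by an illustrative 0.5 cutoff, and any accelerator outside the listed set is treated as unsupported.

```python
from dataclasses import dataclass
from typing import List

# Assumption: accelerators that satisfy the A100-or-newer rule.
SUPPORTED_ACCELERATORS = {"A100", "H100", "H200", "MI300"}


@dataclass
class RunSpec:
    target_context: int             # tokens
    pretrained_dense: bool          # base model was pre-trained with dense attention
    continued_pretraining: bool     # continued pre-training with NSA is planned
    inference_only: bool            # deployment never trains
    has_nsa_kernels: bool           # Triton or CUDA NSA kernels exist on the stack
    has_long_context_eval: bool     # RULER, LongBench, or needle-in-haystack data
    short_sequence_fraction: float  # share of training data that is short-context
    accelerator: str


def preflight(spec: RunSpec, short_fraction_limit: float = 0.5) -> List[str]:
    """Return every rule the spec violates; an empty list means proceed."""
    problems: List[str] = []

    # Hard rejects
    if spec.pretrained_dense and not spec.continued_pretraining:
        problems.append("hard reject: dense-pretrained model without continued pre-training")
    if spec.target_context < 16 * 1024:
        problems.append("hard reject: target context under 16k")
    if spec.inference_only and not spec.has_nsa_kernels:
        problems.append("hard reject: no NSA kernels; recommend MLA or sliding-window attention")

    # Refusal rules
    if not spec.has_long_context_eval:
        problems.append("refuse: request long-context evaluation data first")
    if spec.short_sequence_fraction > short_fraction_limit:
        problems.append("refuse: reweight training data toward long sequences first")
    if spec.accelerator not in SUPPORTED_ACCELERATORS:
        problems.append("refuse: accelerator older than A100 or unrecognized")

    return problems


if __name__ == "__main__":
    spec = RunSpec(128 * 1024, False, False, False, True, True, 0.2, "H100")
    print(preflight(spec))
```

Collecting all violations instead of stopping at the first one lets the refusal message list everything the requester has to fix in a single round.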

Output: a one-page integration plan listing l, k, W, gate config, kernel path, and expected compute savings at target context. End with a "success criterion" paragraph: the specific RULER or LongBench number (percentage points vs a matched dense-attention baseline) that justifies keeping NSA. Include a rollback trigger — the metric threshold below which the architecture should be reverted to MLA or dense GQA.
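The success criterion and rollback trigger described above can be written as a small decision rule. This is a sketch under stated assumptions: the function name `nsa_verdict` and both margin parameters are hypothetical, scores are RULER or LongBench numbers in percentage points, and the intermediate "investigate" outcome is an addition for scores that fall between the two thresholds.

```python
def nsa_verdict(
    nsa_score: float,
    dense_score: float,
    success_margin: float,
    rollback_margin: float,
) -> str:
    """Compare NSA against a matched dense-attention baseline.

    Margins are in percentage points relative to the dense baseline.
    `rollback_margin` is usually negative (how far NSA may trail dense
    attention before the architecture is reverted).
    """
    if rollback_margin > success_margin:
        raise ValueError("rollback_margin must not exceed success_margin")

    delta = nsa_score - dense_score
    if delta >= success_margin:
        return "keep"
    if delta < rollback_margin:
        return "rollback"  # revert to MLA or dense GQA
    return "investigate"   # assumption: middle band, not defined in the text


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(nsa_verdict(71.5, 70.0, success_margin=1.0, rollback_margin=-2.0))
```

Fixing both margins in the plan before training starts keeps the keep-or-revert decision from being tuned to the results after the fact.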
