Skill

lvsa-quickstart

Installs LVSA and generates long videos with block-sparse attention. Use when setting up LVSA from scratch, choosing SDPA vs FlashInfer backend, configuring reference latent frames per model, or verifying sparse path engagement.

Python

PyTorch

ai-ml

npx claudepluginhub jiusiserve/longvideosparseattention --plugin lvsa-quickstart

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/lvsa-quickstart:lvsa-quickstart

Similar Skills

lvsa-troubleshooting

Diagnoses LVSA failure modes: no speedup vs Dense, silent fallback, OOM at long sequences, missing mp4 in Docker, quality regression, and env var issues.

lvsa-vllm-omni

Configures and runs the LVSA vllm-omni serving plugin for video diffusion models. Set per-model env vars (LVSA_WAN_HOOK, LVSA_REFERENCE_LATENT_FRAMES), debug silent Wan fallbacks, or override geometry for non-default resolutions.

lvsa-add-model

Adds LVSA support for a new video diffusion model by implementing the ModelAdapter ABC. Covers geometry, QKV extraction, RoPE, and output projection for single-stream, dual-stream, or joint-attention DiTs.

Stats

Language: Python

Parent stars: 12

Parent forks: 3

Maintenance: Good

Last commit: Jun 1, 2026

LVSA Quickstart

Overview

LVSA (Long Video Sparse Attention) is a training-free block-sparse attention engine for video diffusion transformers (Wan 2.x, HunyuanVideo 1.5, CogVideoX). It accelerates long-video generation 1.4–3.8× and enables generation beyond the training horizon where dense attention OOMs on 80 GB GPUs. Code: https://github.com/JiusiServe/LongVideoSparseAttention.

This skill covers installation, choosing a backend, setting the per-model reference latent frame count, running a first generation, and verifying that LVSA is actually engaged (not silently falling back to dense attention).

Install

git clone https://github.com/JiusiServe/LongVideoSparseAttention
cd LongVideoSparseAttention

# Use uv (or any venv tool)
uv venv --python 3.12
source .venv/bin/activate

# Core library
uv pip install -e ".[diffusers,hunyuan,dev]"

# FlashInfer backend (optional but recommended on CUDA — fastest at long sequences)
uv pip install -e ".[flashinfer]"
# Requires nvcc + CUDA toolkit for JIT compilation.

# vllm-omni plugin (optional — only for serving)
uv pip install -e lvsa-vllm-omni/
uv pip install "vllm==0.18.0" "vllm-omni==0.18.0"   # validated pair

# VQeval (optional — for quality benchmarking)
uv pip install -e vqeval/

Verify:

pytest tests/ lvsa-vllm-omni/tests/ -v
# Expect 280+ tests passing (CPU-only, no GPU needed)

Pick a backend

Backend	When to use	Hardware
SDPA (default)	Always available; reasonable at training horizon	CUDA + Ascend NPU
FlashInfer	Fastest at extension (T_lat ≥ 49)	CUDA only (needs nvcc)

The --flashinfer flag enables FlashInfer; without it, LVSA uses SDPA. If the FlashInfer install fails on your machine (no nvcc, or the wrong CUDA version), the dispatcher does not fall back silently: it raises a clear error, and you can re-run without --flashinfer.
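Before passing --flashinfer, a quick pre-flight check can save a failed JIT compile. The sketch below is not part of LVSA; `backend_flags` is a hypothetical helper. It assumes that finding `nvcc` on `PATH` is a reasonable proxy for "the CUDA toolkit is installed". A matching CUDA version is still required and is not checked here.

```python
import shutil
from typing import List, Optional


def backend_flags(want_flashinfer: bool, nvcc_path: Optional[str]) -> List[str]:
    """Return the extra CLI flags to append to an example script invocation.

    Hypothetical helper: only emits --flashinfer when nvcc was found.
    An empty list means the run uses the default SDPA backend.
    """
    if want_flashinfer and nvcc_path is not None:
        return ["--flashinfer"]
    return []


if __name__ == "__main__":
    # shutil.which returns the full path to nvcc, or None if it is not on PATH.
    nvcc = shutil.which("nvcc")
    flags = backend_flags(want_flashinfer=True, nvcc_path=nvcc)
    print("nvcc:", nvcc or "not found", "| extra flags:", flags)
```

If the check reports that nvcc is missing, run with SDPA rather than relying on the dispatcher error.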

Pick the right reference per model

Getting this value wrong is the single most common LVSA configuration mistake. Always set it explicitly.

Model	`LVSA_REFERENCE_LATENT_FRAMES`	Video frames at 1×
Wan 2.1 / 2.2 (1.3B, 14B)	`21`	81
HunyuanVideo 1.5	`33`	129
CogVideoX 5B	`13`	49

For the standalone example scripts, the value is wired automatically by lvsa/adapters/<model>.py::reference_latent_frames(). Override it only if your fork uses non-default geometry.
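The two numeric columns in the table are consistent with a 4× temporal compression, i.e. latent frames = (video frames − 1) / 4 + 1. The sketch below encodes the table and checks that relation. The stride of 4 is inferred from the table values, not taken from the LVSA source, and the dictionary keys are illustrative names only.

```python
import os

# Values copied from the table above; the keys are illustrative labels.
REFERENCE_LATENT_FRAMES = {"wan": 21, "hunyuanvideo-1.5": 33, "cogvideox-5b": 13}
REFERENCE_VIDEO_FRAMES = {"wan": 81, "hunyuanvideo-1.5": 129, "cogvideox-5b": 49}


def video_to_latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Convert a video frame count to latent frames (assumed stride of 4)."""
    if num_frames < 1 or (num_frames - 1) % temporal_stride != 0:
        raise ValueError(f"num_frames={num_frames} is not of the form {temporal_stride}k+1")
    return (num_frames - 1) // temporal_stride + 1


# Every row of the table satisfies the assumed relation.
for model, ref in REFERENCE_LATENT_FRAMES.items():
    assert video_to_latent_frames(REFERENCE_VIDEO_FRAMES[model]) == ref

# Setting the variable here only affects this process and any child it launches.
os.environ["LVSA_REFERENCE_LATENT_FRAMES"] = str(REFERENCE_LATENT_FRAMES["wan"])
```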

Run a first generation

Single GPU, training horizon (no extension)

python examples/wan_generate.py \
    --model /path/to/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A dog running in the forest." \
    --num-frames 81 \
    --lvsa --auto-keyframes \
    --output-name dog_1x.mp4

At T_lat ≤ reference, the auto-scheduler returns kfi=1 (a key-frame interval of 1), which means fully dense attention through the LVSA path. You get the implementation-bypass speedup (~1.5–2×) but no pattern-driven sparsity.

Single GPU, 4× horizon (the headline case)

python examples/wan_generate.py \
    --model /path/to/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A dog running in the forest." \
    --num-frames 321 \
    --lvsa --flashinfer --rotate-keyframes --auto-keyframes \
    --output-name dog_4x.mp4

Add --riflex --riflex-s 4.0 to stack RIFLEx RoPE rescaling on top.
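The --num-frames values used in these examples (81, 321, 481 for Wan) fit a simple pattern: a k× horizon has k × (reference latent frames − 1) + 1 latent frames. The sketch below reproduces them. Both the pattern and the temporal stride of 4 are inferred from the numbers on this page, so treat `frames_for_horizon` as a hypothetical convenience, not LVSA's definition of "4×".

```python
from typing import Tuple


def frames_for_horizon(
    multiple: int, reference_latent_frames: int, temporal_stride: int = 4
) -> Tuple[int, int]:
    """Return (latent_frames, video_frames) for a k-times horizon.

    Assumes latent = k * (reference - 1) + 1 and video = (latent - 1) * stride + 1.
    """
    if multiple < 1:
        raise ValueError("multiple must be >= 1")
    latent = multiple * (reference_latent_frames - 1) + 1
    video = (latent - 1) * temporal_stride + 1
    return latent, video


if __name__ == "__main__":
    # Wan reference is 21 latent frames (see the table above).
    for k in (1, 4, 6):
        latent, video = frames_for_horizon(k, reference_latent_frames=21)
        print(f"{k}x: T_lat={latent}, --num-frames {video}")
```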

Multi-GPU (Ulysses context parallel)

torchrun --nproc_per_node=2 examples/wan_generate.py \
    --model /path/to/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A dog running in the forest." \
    --num-frames 481 \
    --lvsa --flashinfer --rotate-keyframes --auto-keyframes \
    --output-name dog_6x.mp4

Constraint: seq_len = T_lat × patches_per_frame must be divisible by nproc_per_node.
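The divisibility constraint can be checked before launching torchrun. This sketch assumes a temporal stride of 4 when converting --num-frames to T_lat, and takes patches_per_frame = 1560 from the num_patches value in the sample Wan log further down this page. Your value depends on model and resolution. `check_ulysses_split` is a hypothetical helper, not an LVSA function.

```python
def check_ulysses_split(
    num_frames: int, patches_per_frame: int, world_size: int, temporal_stride: int = 4
) -> int:
    """Return the per-rank sequence length, or raise if the split is uneven."""
    t_lat = (num_frames - 1) // temporal_stride + 1  # assumed stride of 4
    seq_len = t_lat * patches_per_frame
    if seq_len % world_size != 0:
        raise ValueError(
            f"seq_len={seq_len} (T_lat={t_lat} x {patches_per_frame}) "
            f"is not divisible by nproc_per_node={world_size}"
        )
    return seq_len // world_size


if __name__ == "__main__":
    # The 6x example above: 481 frames on 2 GPUs.
    print(check_ulysses_split(481, patches_per_frame=1560, world_size=2))
```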

Verify LVSA engaged

After the run, look for these [LVSA] lines (LVSA prefixes its log output with [LVSA] in both the standalone scripts and the vllm-omni plugin):

[LVSA] --rotate-keyframes: computed key_frame_interval=6 (latent frames)
[LVSA] total_lat_frames=81 local_seq=126360 rank0_frames~[0,80] window=3 n_first=1 kfi=6 global_count=14 attended_per_frame=21/81
[LVSA] installed on 30 blocks num_patches=1560 total_lat_frames=81 backend=FlashInfer CSR: MB=81 nnz=1701 density=25.9% block_size=1560 compact=81/81frames

Read it as:

attended_per_frame=N/T — N out of T frames are in each query's attention set. Sparsity = 1 − N/T. At training reference, N==T (dense).
backend=FlashInfer|SDPA — confirms the requested backend was selected.
installed on N blocks — number of attention layers LVSA wrapped.

If you see no [LVSA] lines, the --lvsa flag wasn't passed. If attended_per_frame=N/T shows N == T on a run you intended as an extension, no real sparsity is engaged: either your --num-frames is at or below the training reference, or the geometry detection failed.
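To check this automatically, you can pull attended_per_frame out of the captured log and compute the sparsity. The sketch below parses the sample line shown above; it assumes the field keeps the exact `attended_per_frame=N/T` spelling, and `parse_attended` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Sample line copied from the log excerpt above.
LOG = (
    "[LVSA] total_lat_frames=81 local_seq=126360 rank0_frames~[0,80] "
    "window=3 n_first=1 kfi=6 global_count=14 attended_per_frame=21/81"
)


def parse_attended(line: str) -> Optional[Tuple[int, int, float]]:
    """Return (N, T, sparsity) from an [LVSA] line, or None if the field is absent."""
    match = re.search(r"attended_per_frame=(\d+)/(\d+)", line)
    if match is None:
        return None
    n, t = int(match.group(1)), int(match.group(2))
    return n, t, 1.0 - n / t  # sparsity = 1 - N/T


n, t, sparsity = parse_attended(LOG)
print(f"{n}/{t} frames attended, sparsity {sparsity:.1%}")
if n == t:
    print("warning: dense attention, no sparsity engaged")
```

For the sample line this gives a sparsity of about 74.1%, which matches the density=25.9% reported on the "installed on" line.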

Adding NPU support

On Ascend NPU, the SDPA path works automatically through torch_npu. lvsa/device.py detects NPU and routes memory probes through torch.npu.*. FlashInfer is CUDA-only and won't run on NPU.

No NPU-specific custom kernels ship in v1.0 — only the SDPA path is exercised on NPU.

Common first-run gotchas

Symptom	Fix
"ModuleNotFoundError: No module named 'lvsa'"	Run `pip install -e .` from the repo root
FlashInfer JIT fails ("Could not find nvcc")	Install CUDA toolkit (nvcc) or drop `--flashinfer`
Generated mp4 missing in Docker	Pass relative paths; bind-mount the repo; see `lvsa-troubleshooting` skill
`attended_per_frame=N/T` shows N=T at 4×	`LVSA_REFERENCE_LATENT_FRAMES` is too high for the model; verify it against the reference table above

See lvsa-troubleshooting for the full failure-mode catalog.
