Help us improve
Share bugs, ideas, or general feedback.
From lvsa-add-model
Adds LVSA support for a new video diffusion model by implementing the ModelAdapter ABC. Covers geometry, QKV extraction, RoPE, and output projection for single-stream, dual-stream, or joint-attention DiTs.
npx claudepluginhub jiusiserve/longvideosparseattention --plugin lvsa-add-modelHow this skill is triggered — by the user, by Claude, or both
Slash command
/lvsa-add-model:lvsa-add-modelThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
LVSA's engine is model-agnostic; per-model behavior lives in `lvsa/adapters/<model>.py` as a subclass of `ModelAdapter` (defined in `lvsa/adapters/base.py`). Adding a new model is **one adapter file** (~200-330 lines) plus a tiny example wrapper.
Installs LVSA and generates long videos with block-sparse attention. Use when setting up LVSA from scratch, choosing SDPA vs FlashInfer backend, configuring reference latent frames per model, or verifying sparse path engagement.
Configures and runs the LVSA vllm-omni serving plugin for video diffusion models. Set per-model env vars (LVSA_WAN_HOOK, LVSA_REFERENCE_LATENT_FRAMES), debug silent Wan fallbacks, or override geometry for non-default resolutions.
Guides adaptation of Wan-series video diffusion models from NVIDIA CUDA to Huawei Ascend NPU, covering device layer, operator replacement, distributed parallelism, attention optimization, VAE parallelization, and quantization.
Share bugs, ideas, or general feedback.
LVSA's engine is model-agnostic; per-model behavior lives in lvsa/adapters/<model>.py as a subclass of ModelAdapter (defined in lvsa/adapters/base.py). Adding a new model is one adapter file (~200-330 lines) plus a tiny example wrapper.
Reference adapters:
| Model | File | Style | LoC | Use as template if |
|---|---|---|---|---|
| Wan 2.1 / 2.2 | lvsa/adapters/wan.py | Single-stream | ~220 | Your model concatenates text+video before projection |
| HunyuanVideo 1.5 | lvsa/adapters/hunyuan_video.py | Dual-stream | ~330 | Text and video have separate Q/K/V projections |
| CogVideoX 5B | lvsa/adapters/cogvideox.py | Joint-attention | ~190 | Shared Q/K/V projection over joint text+video |
The ModelAdapter ABC groups methods into 4 families.
def patches_per_frame(self, height: int, width: int, pipe: Any) -> int:
"""Spatial tokens per latent frame after VAE + patchify."""
# For 480x832, VAE factor 8, patch size 2:
# (480 / 8 / 2) * (832 / 8 / 2) = 1560
...
def latent_frames(self, num_frames: int, pipe: Any) -> int:
"""(num_frames - 1) // vae_temporal + 1. Wan/HV use 4."""
...
def reference_latent_frames(self, pipe: Any) -> int:
"""Training horizon in latent frames. CRITICAL — anchors the auto scheduler.
Wan 1.3B/14B = 21, HunyuanVideo 1.5 = 33, CogVideoX 5B = 13."""
...
def extract_qkv(self, attn, hidden, encoder) -> tuple[Tensor, Tensor, Tensor]:
"""Return Q, K, V for the joint sequence. Single-stream: concat text+video
before projection. Dual-stream: each modality has its own linear."""
...
def extract_cross_attn_kv(self, attn, encoder) -> Optional[tuple[Tensor, Tensor]]:
"""Return (K, V) for an I2V-style image-conditioned cross-attention, or None."""
...
def split_encoder_for_cross_attn(self, attn, encoder) -> tuple[Tensor, Optional[Tensor]]:
"""For dual-modal encoders that split text vs image tokens."""
...
def apply_rotary(self, q, k, rotary_emb, local_seq, rank, world) -> tuple[Tensor, Tensor]:
"""Apply RoPE. Wan uses 1D RoPE; HunyuanVideo uses 3D (cos, sin) tuples."""
...
def output_projection(self, attn, hidden, query_dtype) -> Tensor:
"""Final linear (and any layernorm) after attention."""
...
def extract_encoder_qkv(self, attn, encoder) -> Optional[tuple[Tensor, Tensor, Tensor]]:
"""Dual-stream models project encoder Q/K/V through a separate linear.
Return None for single-stream."""
...
def format_output(self, attn, hidden, encoder_output, encoder_seq_len, dtype):
"""Single-stream: return a tensor. Dual-stream: return (hidden, encoder) tuple."""
...
def cross_attention(self, attn, query, encoder_image, backend) -> Optional[Tensor]:
"""I2V cross-attention call, or None."""
...
def install_processor(self, pipe, processor) -> int:
"""Walk the transformer, replace each self-attention's processor with
`DistributedLVSAProcessor`. Return number of blocks installed on."""
...
def setup_context_parallel(self, transformer, world) -> None:
"""Attach a CP plan (which dim to split, gather strategy)."""
...
def patch_rotary_for_cp(self, rank: int, world: int) -> None:
"""Monkey-patch the model's RoPE to be rank-aware for context-parallel runs."""
...
# Single-stream model
cp lvsa/adapters/wan.py lvsa/adapters/<your_model>.py
# Dual-stream model
cp lvsa/adapters/hunyuan_video.py lvsa/adapters/<your_model>.py
def patches_per_frame(self, height: int, width: int, pipe: Any) -> int:
vae_sf = pipe.vae.config.scaling_factor # spatial compression
patch = pipe.transformer.config.patch_size # patchify factor
h = height // vae_sf // patch
w = width // vae_sf // patch
return h * w
def latent_frames(self, num_frames: int, pipe: Any) -> int:
vae_tf = pipe.vae.config.temporal_compression_ratio
return (num_frames - 1) // vae_tf + 1
def reference_latent_frames(self, pipe: Any) -> int:
return 21 # CHANGE: read from model card / paper
Identify the model's attention class (usually <ModelName>Attention or <ModelName>SelfAttn). Adapt extract_qkv and apply_rotary to match its forward signature.
For Wan-style 1D RoPE:
def apply_rotary(self, q, k, rotary_emb, local_seq, rank, world):
from .rope import apply_rotary_1d
return apply_rotary_1d(q, k, rotary_emb, local_seq, rank, world)
For HunyuanVideo-style 3D (cos, sin) RoPE:
def apply_rotary(self, q, k, rotary_emb, local_seq, rank, world):
from .rope import apply_rotary_3d_cos_sin
return apply_rotary_3d_cos_sin(q, k, rotary_emb, local_seq, rank, world)
def install_processor(self, pipe, processor) -> int:
transformer = pipe.transformer
n = 0
for block in transformer.blocks: # adapt path
block.attn1.processor = processor # adapt attribute
n += 1
return n
In lvsa/adapters/__init__.py:
def get_adapter(name: str) -> ModelAdapter:
if name == "<your_model>":
from .<your_model> import YourModelAdapter
return YourModelAdapter()
...
cp examples/wan_generate.py examples/<your_model>_generate.py
Edit:
WanAdapter references with your adapter--num-frames to the model's training horizonIn tests/test_adapters.py, add:
class TestYourModelAdapter:
def test_geometry(self):
adapter = YourModelAdapter()
assert adapter.reference_latent_frames(MockPipe()) == 21
# ... etc
Run:
pytest tests/test_adapters.py::TestYourModelAdapter -v
If your model has special pre-attention behavior in vllm-omni (sequence sharding, custom RoPE), add a hook at lvsa-vllm-omni/lvsa_vllm_omni/<your_model>_hook.py modeled on wan_hook.py or hunyuan_hook.py. Register it in register.py and document the env var (LVSA_<YOUR_MODEL>_HOOK=1) in lvsa-vllm-omni/README.md.
After your adapter compiles and tests pass on CPU, run a small GPU smoke:
python examples/<your_model>_generate.py \
--model /path/to/your_model_checkpoint \
--prompt "test" \
--num-frames <training_horizon> \
--lvsa --auto-keyframes \
--output-name smoke.mp4
Look for [LVSA] installed on N blocks in the log — that confirms the adapter found your attention layers.
reference_latent_frames — leads to silent sparsity at training horizon. Verify against the model card.patches_per_frame doesn't divide seq_len — geometry detection fails. Triple-check VAE spatial factor and patch size against the model config.install_processor — the walk path (pipe.transformer.blocks[i].attn1) differs per model. Print the model structure first.seq_len % world_size == 0. If not divisible, add a constraint check at adapter init.