Help us improve
Share bugs, ideas, or general feedback.
From lvsa-vllm-omni
Configures and runs the LVSA vllm-omni serving plugin for video diffusion models. Set per-model env vars (LVSA_WAN_HOOK, LVSA_REFERENCE_LATENT_FRAMES), debug silent Wan fallbacks, or override geometry for non-default resolutions.
npx claudepluginhub jiusiserve/longvideosparseattention --plugin lvsa-vllm-omniHow this skill is triggered — by the user, by Claude, or both
Slash command
/lvsa-vllm-omni:lvsa-vllm-omniThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
`lvsa-vllm-omni` is a separate pip package that registers itself as a vllm-omni attention backend. Zero changes required in vllm-omni core — everything plugs in via the plugin entry-point system.
Installs LVSA and generates long videos with block-sparse attention. Use when setting up LVSA from scratch, choosing SDPA vs FlashInfer backend, configuring reference latent frames per model, or verifying sparse path engagement.
Adds LVSA support for a new video diffusion model by implementing the ModelAdapter ABC. Covers geometry, QKV extraction, RoPE, and output projection for single-stream, dual-stream, or joint-attention DiTs.
Diagnoses LVSA failure modes: no speedup vs Dense, silent fallback, OOM at long sequences, missing mp4 in Docker, quality regression, and env var issues.
Share bugs, ideas, or general feedback.
lvsa-vllm-omni is a separate pip package that registers itself as a vllm-omni attention backend. Zero changes required in vllm-omni core — everything plugs in via the plugin entry-point system.
Path: lvsa-vllm-omni/
pip install -e . # LVSA core
pip install -e lvsa-vllm-omni/ # The plugin (registers entry-point)
pip install "vllm==0.18.0" "vllm-omni==0.18.0" # validated pair (mismatched minor versions emit warnings)
After install, the plugin auto-loads when vllm-omni starts.
DIFFUSION_ATTENTION_BACKEND=LVSA \
LVSA_AUTO_KEYFRAMES=1 \
LVSA_REFERENCE_LATENT_FRAMES=33 \
LVSA_ROTATE_KEYFRAMES=1 \
vllm serve --omni --model HunyuanVideo-1.5-Diffusers-480p_t2v
HunyuanVideo's hook auto-activates when DIFFUSION_ATTENTION_BACKEND=LVSA.
DIFFUSION_ATTENTION_BACKEND=LVSA \
LVSA_WAN_HOOK=1 \
LVSA_AUTO_KEYFRAMES=1 \
LVSA_REFERENCE_LATENT_FRAMES=21 \
LVSA_ROTATE_KEYFRAMES=1 \
vllm serve --omni --model Wan2.2-T2V-14B
Wan requires LVSA_WAN_HOOK=1 explicitly. Without it, Wan's _sp_plan pre-shards the sequence and geometry detection fails silently.
A shell wrapper that sets the right env vars per model family:
examples/vllm_omni_serve.sh wan /path/to/Wan2.1-T2V-1.3B-Diffusers
examples/vllm_omni_serve.sh hunyuan /path/to/HunyuanVideo-1.5-Diffusers-480p_t2v
# Override port / dtype
PORT=8200 DTYPE=float16 examples/vllm_omni_serve.sh wan ...
| Var | Default | Purpose |
|---|---|---|
DIFFUSION_ATTENTION_BACKEND | (unset) | Set to LVSA to engage |
LVSA_REFERENCE_LATENT_FRAMES | 21 | Per-model training horizon. CRITICAL. Wan=21, HV=33, Cog=13. |
LVSA_AUTO_KEYFRAMES | 1 | Auto-derive keyframe interval from frame count |
LVSA_WAN_HOOK | 0 | Required for Wan. Off for HunyuanVideo. |
LVSA_ROTATE_KEYFRAMES | 0 | Shift keyframe grid each denoising step (recommended at extension) |
| Var | Default | Purpose |
|---|---|---|
LVSA_SPARSITY_SCALE | 1.0 | Multiplier on attention budget. See lvsa-tuning skill. |
LVSA_WINDOW_SIZE | 12 | Local window half-width (video frames) |
LVSA_N_FIRST_FRAMES | 4 | Leading global frames |
LVSA_KEY_FRAME_INTERVAL | 16 | Manual keyframe interval (ignored if AUTO_KEYFRAMES) |
LVSA_EXPAND_WINDOW | 1 | Extend local window when globals overlap |
| Var | Default | Purpose |
|---|---|---|
LVSA_PATCHES_PER_FRAME | 1560 (480×832) | Tokens per latent frame |
LVSA_VIDEO_HEIGHT | (unset) | Use with VAE_SPATIAL_FACTOR + PATCH_SIZE for derivation |
LVSA_VIDEO_WIDTH | (unset) | Same |
LVSA_VAE_SPATIAL_FACTOR | 8 | VAE spatial compression |
LVSA_PATCH_SIZE | 2 | Patchify factor |
LVSA_VAE_TEMPORAL_FACTOR | 4 | VAE temporal compression |
For 720p or larger, set LVSA_VIDEO_HEIGHT/WIDTH and the plugin will derive patches_per_frame automatically.
| Var | Default | Purpose |
|---|---|---|
LVSA_BACKEND | sdpa | sdpa or flashinfer |
LVSA_TOTAL_LATENT_FRAMES | (unset) | Override auto-detected T_lat (rarely needed) |
LVSA_MEM_LOG | 0 | Per-step memory log ([LVSA-MEM] step=N alloc=… reserved=… peak=…) |
After startup + first request, look for:
[LVSA] Step counter: n_blocks=N (from env)
[LVSA-hook] Installed LVSA hook on HunyuanVideo15Attention ...
[LVSA] Geometry detected: T_lat=33 P=1560 text_tokens=512
[LVSA] reference_latent_frames=33 target_latent_frames=33 extension_ratio=1.00x
If you see [LVSA-FALLBACK] origin=forward_cuda reason=geometry_detect ..., the plugin couldn't infer the geometry from the seq_len it received. Walk through:
seq_len = T_lat * P + enc_tokens for any P in candidate_patches_per_frame()?LVSA_PATCHES_PER_FRAME set if you're at a non-default resolution?LVSA_WAN_HOOK=1?Standard Ulysses context parallel. Set --ulysses-degree N in the vllm-omni CLI; the plugin handles the global K/V gather automatically.
DIFFUSION_ATTENTION_BACKEND=LVSA \
LVSA_REFERENCE_LATENT_FRAMES=33 \
LVSA_AUTO_KEYFRAMES=1 \
vllm serve --omni --model HunyuanVideo-1.5 --ulysses-degree 2
Constraint: seq_len = T_lat × patches_per_frame must be divisible by N.
lvsa-vllm-omni/
├── lvsa_vllm_omni/
│ ├── __init__.py # Entry-point: register_lvsa_backend()
│ ├── backend.py # LVSABackend (vllm attention backend)
│ ├── attention_impl.py # LVSAAttentionImpl → sparse_windowed_attention()
│ ├── wan_hook.py # Wan-specific monkey patch
│ ├── hunyuan_hook.py # HunyuanVideo-specific monkey patch
│ ├── register.py # Hook auto-install based on env vars
│ ├── config.py # LVSAConfig dataclass (env var parsing)
│ ├── global_kv.py # Global K/V gather helpers
│ ├── step_tracker.py # Per-step state
│ └── _fallback.py # Silent-fallback warnings
└── tests/ # CPU-only integration tests (~50 tests)
| Symptom | Fix |
|---|---|
No [LVSA] log lines | Check DIFFUSION_ATTENTION_BACKEND=LVSA; for Wan also LVSA_WAN_HOOK=1 |
[LVSA-FALLBACK] reason=geometry_detect | Set LVSA_PATCHES_PER_FRAME (or HEIGHT/WIDTH) for your resolution |
| Quality regression at 1× | LVSA_REFERENCE_LATENT_FRAMES wrong for the model |
| No speedup despite engagement | At T_lat ≤ ref, kfi=1 → fully dense. Lower LVSA_SPARSITY_SCALE to see real sparsity |
See lvsa-troubleshooting for the full failure-mode catalog and docs/VLLM_OMNI_INTEGRATION.md for the architecture details.