Launches, monitors, stops, and manages LLM inference workloads on NVIDIA DGX Spark using sparkrun CLI. Covers status checks, logs, VRAM preview, benchmarks, kernel tuning, proxy management.
From sparkrunnpx claudepluginhub spark-arena/sparkrun --plugin sparkrunThis skill uses the workspace's default tool permissions.
Guides Next.js Cache Components and Partial Prerendering (PPR) with cacheComponents enabled. Implements 'use cache', cacheLife(), cacheTag(), revalidateTag(), static/dynamic optimization, and cache debugging.
<Use_When>
<Do_Not_Use_When>
# Single host
sparkrun run <recipe> --tp 1 --no-follow
sparkrun run <recipe> --hosts <ip> --no-follow
# Multi-node cluster
sparkrun run <recipe> --cluster <name> --no-follow
sparkrun run <recipe> --hosts <ip1>,<ip2>,... --no-follow
sparkrun run <recipe> --tp <N> --no-follow
# With overrides
sparkrun run <recipe> --port 9000 --gpu-mem 0.8 --no-follow
sparkrun run <recipe> -o max_model_len=8192 -o attention_backend=triton --no-follow
sparkrun run <recipe> --served-model-name my-model --no-follow
sparkrun run <recipe> --pp 2 --tp 2 --no-follow # pipeline + tensor parallelism
sparkrun run <recipe> --max-model-len 32768 --no-follow
# Idempotent launch (exit 0 if already running)
sparkrun run <recipe> --ensure --no-follow
# With Docker options
sparkrun run <recipe> --restart unless-stopped --no-follow
sparkrun run <recipe> --no-rm --no-follow # keep containers after stop
sparkrun run <recipe> --transfer-mode push --no-follow # force push-based distribution
# Dry-run (show what would happen)
sparkrun run <recipe> --dry-run
CRITICAL: Always use --no-follow when running from an agent/skill context to avoid blocking on log streaming. Then use sparkrun cluster status or sparkrun logs separately to check on the job.
# Show all sparkrun containers across cluster hosts
# NOTE: `sparkrun status` is an alias for `sparkrun cluster status` — only run one.
sparkrun cluster status
sparkrun cluster status --cluster <name>
sparkrun cluster status --hosts <ip1>,<ip2>,...
# Output shows:
# - Grouped containers by job (with recipe name if cached)
# - Container role, host, status, and image
# - Ready-to-use logs and stop commands for each job
# - Pending operations (downloads/distributions in progress)
# - Idle hosts
# Check if a specific job is running (scripting-friendly)
sparkrun cluster check-job <recipe> --cluster <name>
sparkrun cluster check-job <cluster_id> --hosts <ip1>,<ip2>
sparkrun cluster check-job <recipe> --check-http-models # also verify /v1/models endpoint
sparkrun cluster check-job <recipe> --json # JSON output
# Exit code: 0 = running, 1 = not running
# Interactive TUI (requires interactive terminal)
sparkrun cluster monitor --cluster <name>
# Plain-text output (agent-friendly)
sparkrun cluster monitor --cluster <name> --simple
# JSON streaming (automation-friendly)
sparkrun cluster monitor --cluster <name> --json
# Custom interval
sparkrun cluster monitor --cluster <name> --interval 5
# By recipe name
sparkrun logs <recipe> --cluster <name>
sparkrun logs <recipe> --hosts <ip1>,<ip2>,...
sparkrun logs <recipe> --tp <N>
# By cluster ID (from sparkrun status output)
sparkrun logs <cluster_id>
sparkrun logs <cluster_id> --cluster <name>
# Control log tail length
sparkrun logs <recipe> --tail 200
# By recipe name
sparkrun stop <recipe> --cluster <name>
sparkrun stop <recipe> --hosts <ip1>,<ip2>,...
sparkrun stop <recipe> --tp <N>
# By cluster ID (from sparkrun status output)
sparkrun stop <cluster_id>
sparkrun stop <cluster_id> --cluster <name>
# Stop all sparkrun containers (no recipe needed)
sparkrun stop --all --cluster <name>
sparkrun stop --all --hosts <ip1>,<ip2>,...
# Dry-run
sparkrun stop <recipe> --dry-run
# List all available recipes (no filter)
sparkrun list
# List with filters
sparkrun list --all # include hidden registry recipes
sparkrun list --registry <name> # filter by registry
sparkrun list --runtime vllm # filter by runtime
sparkrun list <query> # filter by name
# Search for recipes by name, model, runtime, or description (contains-match)
sparkrun recipe search <query>
sparkrun recipe search <query> --registry <name> --runtime sglang
# Inspect a specific known recipe (by exact name or file path)
sparkrun recipe show <recipe>
sparkrun recipe show <recipe> --tp <N>
# Validate or estimate VRAM
sparkrun recipe validate <recipe>
sparkrun recipe vram <recipe> --tp <N> --max-model-len 32768
# Export a recipe
sparkrun recipe export <recipe>
sparkrun recipe export <recipe> --json
sparkrun recipe export <recipe> --save out.yaml
Use sparkrun recipe search as the first attempt when looking for a particular recipe. Use sparkrun recipe show when given a specific recipe name or file -- it may not appear in search results.
# Full flow: launch inference -> benchmark -> stop
sparkrun benchmark <recipe> --tp 1
sparkrun benchmark <recipe> --cluster <name>
sparkrun benchmark <recipe> --tp 2 --profile <profile_name>
# Benchmark an already-running instance (skip launch)
sparkrun benchmark <recipe> --skip-run --tp 1
# Keep inference running after benchmark completes
sparkrun benchmark <recipe> --no-stop --tp 1
# Override benchmark args (use -b for benchmark args, -o for recipe overrides)
sparkrun benchmark <recipe> -b depth=0,2048,4096 -b tg=32,128
# Specify framework and timeout
sparkrun benchmark <recipe> --framework llama-benchy --timeout 3600
# Dry-run
sparkrun benchmark <recipe> --dry-run
# Tune SGLang fused MoE kernels
sparkrun tune sglang <recipe> --hosts <ip>
sparkrun tune sglang <recipe> --cluster <name> --tp 1 --tp 2 --tp 4
sparkrun tune sglang <recipe> -H <ip> --parallel 2
# Tune vLLM fused MoE kernels
sparkrun tune vllm <recipe> --hosts <ip>
sparkrun tune vllm <recipe> --cluster <name> --tp 4
# Start the unified OpenAI-compatible proxy
sparkrun proxy start --cluster <name>
sparkrun proxy start --port 4000
# Check proxy status and registered models
sparkrun proxy status
sparkrun proxy models
sparkrun proxy models --refresh
# Load/unload models through the proxy
sparkrun proxy load <recipe> --cluster <name>
sparkrun proxy unload <recipe> --cluster <name>
# Manage model aliases
sparkrun proxy alias add my-model "Qwen/Qwen3-1.7B"
sparkrun proxy alias remove my-model
sparkrun proxy alias list
# Stop the proxy
sparkrun proxy stop
</Steps>
<Tool_Usage> All sparkrun commands are executed via the Bash tool. No MCP tools are required.
When running workloads:
--no-follow flag with sparkrun runsparkrun cluster status to confirm containers are running--simple or --json mode (TUI requires interactive terminal)
</Tool_Usage><Key_Options>
sparkrun run options:
| Option | Description |
|---|---|
--hosts, -H | Comma-separated host list |
--hosts-file | File with hosts (one per line) |
--cluster | Use a saved cluster |
--tp, --tensor-parallel | Override tensor parallelism (= node count) |
--pp, --pipeline-parallel | Override pipeline parallelism |
--port | Override serve port |
--gpu-mem | GPU memory utilization (0.0-1.0) |
--max-model-len | Override maximum model context length |
--served-model-name | Override the served model name |
--image | Override container image |
-o KEY=VALUE | Override any recipe default |
--ensure | Only launch if not already running; exit 0 if already up |
--no-rm | Don't auto-remove containers on exit |
--restart POLICY | Docker restart policy (no, always, unless-stopped, on-failure[:N]) |
--transfer-mode | Resource transfer mode (auto, local, push, delegated) |
--dry-run, -n | Show what would be done |
--no-follow | Don't attach to logs after launch |
--foreground | Run in foreground (blocking) |
</Key_Options>
<Important_Notes>
--no-follow when running from an automated/agent context to avoid blockingsparkrun cluster status after launching to confirm containers are running--tp N must match the number of hosts (DGX Spark = 1 GPU per host)sparkrun stop and sparkrun logs accept both recipe names and cluster IDs as targets--tp, --port, or --served-model-name were used during run, pass the same values to stop and logssparkrun show <recipe> --tp N to preview VRAM estimates before runningsparkrun_{hash}_{role} where the hash is derived from runtime + model + sorted hosts + overridessparkrun stop --all to stop all sparkrun containers without specifying a recipe--solo is deprecated; use --tp 1 instead@registry/name syntax for explicit registry selectionsparkrun update upgrades sparkrun itself (if installed via uv) and updates all registries
</Important_Notes>Task: {{ARGUMENTS}}