submit-slurm-job
Generates and submits sbatch scripts for GPU compute jobs on Slurm clusters. Handles partition, GPU types (A100_40G, V100, A800), node selection, Python paths, and cluster rules.
npx claudepluginhub quantumbfs/claude-code-skills --plugin submit-slurm-job

This skill uses the workspace's default tool permissions.
Generate and submit sbatch scripts for GPU compute jobs on the cluster. Handles all cluster-specific details: partition, GPU types, node selection, python path.
Before using this skill, set the following in your project's CLAUDE.md or environment:
| Variable | Example | Description |
|---|---|---|
| PYTHON_PATH | /path/to/miniconda3/envs/myenv/bin/python3 | Full path to the Python interpreter |
| PROJECT_DIR | /home/user_xxx/private/homefile | Must be under ~/private/homefile; Slurm submission is only allowed from this path |
| PARTITION | home | Slurm partition name (the only compute partition) |
Cluster rule: scripts and submissions must live under ~/private/homefile. Data should live under ~/private/datafile. Run sbatch/srun only after cd ~/private/homefile/....
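For example, these could be recorded in CLAUDE.md (or exported in the shell) along the following lines; every path and the user name here are placeholders, not real cluster values:

```bash
# Cluster settings for submit-slurm-job (illustrative values only)
PYTHON_PATH=/home/user_xxx/miniconda3/envs/myenv/bin/python3
PROJECT_DIR=/home/user_xxx/private/homefile/myproject   # must sit under ~/private/homefile
PARTITION=home                                           # the only compute partition
```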
Ask the user (with AskUserQuestion) what they want to run. Key parameters:
| Parameter | Default | Description |
|---|---|---|
| job_name | (required) | Short job name for SBATCH |
| gpu_type | (required) | GPU model, must be explicit (e.g., A100_40G, V100, A800) |
| n_gpu | 1 | Number of GPUs |
| time | 24:00:00 | Wall time limit |
| mem | 32G | Memory |
| cpus | 4 | CPUs per task |
| script | (required) | Python script path (must live under ~/private/homefile) |
| args | (required) | Script arguments |
| output_dir | {PROJECT_DIR} | Directory for log files (keep under ~/private/homefile) |
- `--gres`: use `gpu:MODEL:N` (e.g., `gpu:A100_40G:1`, `gpu:A800:2`). A bare `gpu:1` will be rejected.
- To see which GPU models are available, use `slurm_gpustat` (cluster-provided wheel) or `scontrol show nodes -o` (example below).
- Only pin a node with `--nodelist=node` if the user explicitly asks for one.
- Important: ensure the script is written under `~/private/homefile/...` and run `sbatch` from that directory (cluster enforcement).
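For example, the available GPU models can be inspected like this (the grep filter is only one illustrative way to read the Gres field):

```bash
# One line per node; the Gres field lists GPU model and count, e.g. gpu:A100_40G:4
scontrol show nodes -o | grep -o 'Gres=[^ ]*' | sort | uniq -c

# Or, if the cluster-provided wheel is installed:
slurm_gpustat
```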
Generate the sbatch script following this template:
```bash
#!/bin/bash
#SBATCH --partition={PARTITION} # home
#SBATCH --cpus-per-task={cpus}
#SBATCH --mem={mem}
#SBATCH --gres=gpu:{gpu_type}:{n_gpu}
#SBATCH --nodes=1
#SBATCH --time={time}
#SBATCH --job-name={job_name}
#SBATCH -o {output_dir}/{job_name}_%j.out
echo The current job ID is $SLURM_JOB_ID
echo Running on $SLURM_JOB_NODELIST
echo CUDA devices: $CUDA_VISIBLE_DEVICES
echo ==== Job started at `date` ====
nvidia-smi
echo
{PYTHON_PATH} \
{script} \
{args}
echo
echo ==== Job finished at `date` ====
```
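For illustration, here is how the header and launch line might look once the placeholders are filled in; the job name, paths, and arguments are entirely made up, and the echo/nvidia-smi preamble from the template is omitted for brevity:

```bash
#!/bin/bash
#SBATCH --partition=home
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --gres=gpu:A100_40G:1          # GPU model is explicit; a bare gpu:1 is rejected
#SBATCH --nodes=1
#SBATCH --time=24:00:00
#SBATCH --job-name=train_mnist
#SBATCH -o /home/user_xxx/private/homefile/myproject/train_mnist_%j.out

/home/user_xxx/miniconda3/envs/myenv/bin/python3 \
    /home/user_xxx/private/homefile/myproject/train.py \
    --epochs 10 --lr 1e-3
```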
Key rules:
- Partition: `home` (the only compute partition).
- Location: everything lives under `~/private/homefile/...`.
- Python: invoke the full `{PYTHON_PATH}` (do NOT `conda activate`).
- Logging: use only `-o` for combined stdout+stderr.
- GPUs: always name the GPU model in `--gres`.
- Nodes: do not add `--nodelist` unless the user explicitly requests a specific node.

Write the script to `{output_dir}/{job_name}.sh` (under `~/private/homefile`), then submit with `sbatch` from the same directory.
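A minimal submission sequence under these rules, using a hypothetical project directory and job name:

```bash
cd ~/private/homefile/myproject    # hypothetical project directory under ~/private/homefile
sbatch train_mnist.sh              # submit from inside that directory (cluster enforcement)
```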
After submission, report:
- The job ID and the log file path: `{output_dir}/{job_name}_{JOBID}.out`
- How to monitor: `squeue -u $USER`, `tail -f {log_file}` (example below)
- How to cancel: `scancel {JOBID}`
- If the job is cancelled right away as if by `scancel`, verify the submission came from the correct project (`~/private/homefile`, matching the web UI project) and that the requested resources are within quota.
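A sketch of the follow-up commands, assuming a hypothetical job named train_mnist that was assigned ID 12345:

```bash
squeue -u $USER                                               # is the job pending or running?
tail -f ~/private/homefile/myproject/train_mnist_12345.out   # follow the log as it is written
scancel 12345                                                 # cancel the job if needed
```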
If the user wants to submit multiple jobs (e.g., different datasets on different GPUs):

- Create separate `.sh` scripts for each job.
- Submit them in one loop: `for f in script1.sh script2.sh ...; do sbatch $f; done` (see the sketch below).
- Alternatively, put multiple python commands in a single script, separated by echo markers.
Create separate scripts, each requesting one GPU. Submit all scripts independently.
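For example, with hypothetical per-dataset scripts already written under the project directory:

```bash
cd ~/private/homefile/myproject
# One script per dataset, each requesting its own GPU (e.g., --gres=gpu:A100_40G:1)
for f in train_cifar10.sh train_cifar100.sh train_imagenet.sh; do
    sbatch $f
done
```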
--nodelist for specific nodes: only add `#SBATCH --nodelist=n004` when the user explicitly wants a specific node. Otherwise, let Slurm schedule based on GPU type availability.