Generates correct SLURM sbatch job scripts with MPI/OpenMP layout guidance, resource validation, and conflict detection. Use when preparing cluster submissions or debugging job failures.
Generate a correct, copy-pasteable SLURM job script (`.sbatch`) for running a simulation, and surface common configuration mistakes (bad walltime format, conflicting memory flags, oversubscription hints).
| Input | Description | Example |
|---|---|---|
| Job name | Short identifier for the job | phasefield-strong-scaling |
| Walltime | SLURM time limit | 00:30:00 |
| Partition | Cluster partition/queue (if required) | compute |
| Account | Project/account (if required) | matsim |
| Nodes | Number of nodes to allocate | 2 |
| MPI tasks | Total tasks, or tasks per node | 128 or 64 per node |
| Threads | CPUs per task (OpenMP threads) | 2 |
| Memory | --mem or --mem-per-cpu (cluster policy dependent) | 32G |
| GPUs | GPUs per node (optional) | 4 |
| Working directory | Where the run should execute | $SLURM_SUBMIT_DIR |
| Modules | Environment modules to load (optional) | gcc/12, openmpi/4.1 |
| Run command | The command to launch under SLURM | ./simulate --config cfg.json |
Does the code use OpenMP / threading?
├── NO → Use MPI-only: cpus-per-task=1
└── YES → Use hybrid: set cpus-per-task = threads per MPI rank
and export OMP_NUM_THREADS = cpus-per-task
Rule of thumb: if you see diminishing strong-scaling efficiency at high MPI ranks, try fewer ranks with more threads per rank (and measure).
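The decision above can be sketched in a few lines of Python. This is a minimal illustration rather than the generator's own code; the function name `thread_layout` and its return keys are hypothetical.

```python
def thread_layout(uses_openmp: bool, threads_per_rank: int = 1) -> dict:
    """Map the MPI-only vs. hybrid decision onto SLURM settings.

    Hypothetical helper: MPI-only codes get cpus-per-task=1, hybrid codes
    get cpus-per-task equal to the OpenMP threads per MPI rank.
    """
    if threads_per_rank < 1:
        raise ValueError("threads_per_rank must be >= 1")
    cpus_per_task = threads_per_rank if uses_openmp else 1
    return {
        "cpus_per_task": cpus_per_task,
        "directive": f"#SBATCH --cpus-per-task={cpus_per_task}",
        # OMP_NUM_THREADS is kept equal to cpus-per-task in both cases.
        "export": f"export OMP_NUM_THREADS={cpus_per_task}",
    }


if __name__ == "__main__":
    print(thread_layout(uses_openmp=True, threads_per_rank=2))
```

Keeping `OMP_NUM_THREADS` tied to `cpus-per-task` means a change in thread count is made in one place only.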
- Use either `--mem` (per node) or `--mem-per-cpu` (per CPU), not both.
- `--mem` units are integer MB by default, or an integer with suffix K/M/G/T (and `--mem=0` commonly means "all memory on node").
- By default (`--launcher srun`) the generator prepends `srun --ntasks=N --cpus-per-task=T` to your run command so it inherits SLURM task placement.
- If your run command already starts with `srun`, `mpirun`, or `mpiexec`, pass `--launcher none` so the generator does not double-wrap it. Wrapping `srun` around `mpirun` launches N independent copies of `mpirun` (each spawning its own MPI world); wrapping `srun` around `srun` is malformed.
- The generator also auto-detects a leading launcher (`srun`, `mpirun`, `mpiexec`, `mpiexec.hydra`, `orterun`, `aprun`, `jsrun`) and falls back to no-wrap, emitting a warning that recommends `--launcher none`.
- Ranks per GPU is `total_ranks / (nodes * gpus_per_node)`. When `--gpus-per-node` is set the generator reports `results.derived.total_gpus` and `results.derived.ranks_per_gpu`.
- If `ntasks` is not divisible by the total number of GPUs, the generator emits a "task-to-GPU ratio is not an integer" warning; either adjust `ntasks`/GPUs or document intentional sharing (e.g. NVIDIA MPS).
- For rank-to-GPU binding, consider `--gpu-bind=closest` or `--ntasks-per-gpu` (passed via your run command / `--srun-extra`). GPU and QoS policies are site-specific; confirm with your cluster docs.

| Script | Key Outputs |
|---|---|
| `scripts/slurm_script_generator.py` | `results.script`, `results.directives`, `results.derived`, `results.warnings`, `results.run_line` |
results.derived reports ntasks, ntasks_per_node, cpus_total_requested, and (when applicable) cores_per_node, cpus_per_node_requested, total_gpus, and ranks_per_gpu. results.warnings may include CPU oversubscription, task-to-GPU ratio, and double-launcher warnings.
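As a rough illustration of how these derived values relate to the inputs, the sketch below recomputes them from a node/task layout. It is an assumption-laden approximation, not the generator's implementation: the function name `derive` and the oversubscription warning text are hypothetical, while the key names and the task-to-GPU warning text follow the document.

```python
def derive(nodes, ntasks_per_node, cpus_per_task,
           cores_per_node=None, gpus_per_node=None):
    """Recompute the documented derived fields and warnings (sketch only)."""
    ntasks = nodes * ntasks_per_node
    derived = {
        "ntasks": ntasks,
        "ntasks_per_node": ntasks_per_node,
        "cpus_total_requested": ntasks * cpus_per_task,
    }
    warnings = []

    if cores_per_node is not None:
        requested = ntasks_per_node * cpus_per_task
        derived["cores_per_node"] = cores_per_node
        derived["cpus_per_node_requested"] = requested
        if requested > cores_per_node:
            # Hypothetical wording; the real warning text may differ.
            warnings.append("CPU oversubscription: "
                            f"{requested} CPUs requested on {cores_per_node} cores")

    if gpus_per_node is not None:
        total_gpus = nodes * gpus_per_node
        derived["total_gpus"] = total_gpus
        derived["ranks_per_gpu"] = ntasks / total_gpus
        if ntasks % total_gpus != 0:
            warnings.append("task-to-GPU ratio is not an integer")

    return derived, warnings


if __name__ == "__main__":
    print(derive(nodes=2, ntasks_per_node=64, cpus_per_task=2,
                 cores_per_node=128, gpus_per_node=4))
```

For 2 nodes with 64 ranks per node, 2 CPUs per rank, and 4 GPUs per node, this gives 128 ranks, 256 CPUs in total, and 16 ranks per GPU with no warnings.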
1. Run `slurm_script_generator.py` with the job inputs.
2. Save the output as `job.sbatch`.
3. Submit with `sbatch job.sbatch` and monitor with `squeue`.

```bash
# Preview a job script (prints to stdout)
python3 skills/hpc-deployment/slurm-job-script-generator/scripts/slurm_script_generator.py \
  --job-name phasefield \
  --time 00:10:00 \
  --partition compute \
  --nodes 1 \
  --ntasks-per-node 8 \
  --cpus-per-task 2 \
  --mem 16G \
  --module gcc/12 \
  --module openmpi/4.1 \
  -- \
  ./simulate --config config.json

# Write to a file and also emit structured JSON
python3 skills/hpc-deployment/slurm-job-script-generator/scripts/slurm_script_generator.py \
  --job-name phasefield \
  --time 00:10:00 \
  --nodes 1 \
  --ntasks 16 \
  --cpus-per-task 1 \
  --out job.sbatch \
  --json \
  -- \
  /bin/echo hello
```
User: I need an sbatch script for my MPI simulation. I want 2 nodes, 64 ranks per node, 2 OpenMP threads per rank, and 2 hours.
Agent workflow:
1. Run `python3 scripts/slurm_script_generator.py --job-name run --time 02:00:00 --nodes 2 --ntasks-per-node 64 --cpus-per-task 2 -- ./simulate`.
2. Explain the resulting hybrid layout (the script exports `OMP_NUM_THREADS=2`).
3. Recommend checking for oversubscription by passing `--cores-per-node`.

| Error | Cause | Resolution |
|---|---|---|
| `time must be HH:MM:SS or D-HH:MM:SS` | Bad walltime format | Use `00:30:00` or `1-00:00:00` |
| `nodes must be positive` | Non-positive nodes | Provide `--nodes` >= 1 |
| `Provide either --mem or --mem-per-cpu, not both` | Conflicting memory directives | Choose one memory style |
| `Provide a run command after --` | Missing launch command | Add `-- ./simulate ...` |
| `--partition must match /^[A-Za-z0-9]...` | Partition/account/qos/constraint/reservation contains spaces or shell metacharacters | Use a plain identifier |
| `module must match /^[A-Za-z0-9]...` | Module name contains shell metacharacters | Use e.g. `gcc/12`, `openmpi/4.1` |
| `nodes must be <= 100000 (got ...)` | Integer request exceeds the sanity upper bound | Re-check the requested value |
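The first few errors in the table can be reproduced with a small pre-check. This is a sketch under stated assumptions: the walltime pattern is a guess at what "HH:MM:SS or D-HH:MM:SS" implies (it does not range-check hours), the function name `check_request` is hypothetical, and the memory-format message is invented for illustration. The memory pattern is the one quoted elsewhere in this document.

```python
import re

# Assumed walltime pattern: optional "D-" prefix, two-digit hours,
# minutes and seconds restricted to 00-59.
TIME_RE = re.compile(r"(?:[0-9]+-)?[0-9]{2}:[0-5][0-9]:[0-5][0-9]")
# Memory pattern as documented: integer with optional K/M/G/T suffix.
MEM_RE = re.compile(r"[0-9]+([KMGT])?")


def check_request(time, nodes, mem=None, mem_per_cpu=None):
    """Return a list of error strings; an empty list means the request passed."""
    errors = []
    if not TIME_RE.fullmatch(time):
        errors.append("time must be HH:MM:SS or D-HH:MM:SS")
    if nodes < 1:
        errors.append("nodes must be positive")
    if mem is not None and mem_per_cpu is not None:
        errors.append("Provide either --mem or --mem-per-cpu, not both")
    for label, value in (("mem", mem), ("mem-per-cpu", mem_per_cpu)):
        if value is not None and not MEM_RE.fullmatch(value):
            # Hypothetical message; the generator's wording is not shown here.
            errors.append(f"{label} has an invalid format: {value!r}")
    return errors


if __name__ == "__main__":
    print(check_request("00:30:00", 2, mem="32G"))
    print(check_request("30:00", 0, mem="16G", mem_per_cpu="2G"))
```

Collecting every error in one pass, instead of stopping at the first, lets a user fix a bad request in a single round trip.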
- Confirmed `results.script` places every `#SBATCH` directive immediately after the shebang and before `set -euo pipefail` (open the script and check the first real command line); a directive after the first command is silently ignored by SLURM.
- Read `results.warnings` and confirmed it is empty, or recorded each warning (CPU oversubscription, non-integer task-to-GPU ratio, double-launcher) with a deliberate justification for ignoring it.
- Checked `results.derived.cpus_total_requested` (= `ntasks * cpus-per-task`) and, when `--cores-per-node` was supplied, confirmed `cpus_per_node_requested <= cores_per_node` so the node is not oversubscribed.
- Confirmed the `export OMP_NUM_THREADS` value equals `results.derived` `cpus_per_task` (the generator sets them equal; confirm that matches the intended threads-per-rank).
- For GPU jobs, checked `results.derived.total_gpus` and `ranks_per_gpu` and confirmed `ranks_per_gpu` is the intended integer (or documented intentional sharing such as MPS).
- Confirmed `results.run_line` is not double-wrapped: if the run command already starts with `srun`/`mpirun`/`mpiexec`/`orterun`/`aprun`/`jsrun`, confirmed `--launcher none` was used (or the auto-detect warning fired) so SLURM does not launch N independent copies.
- Validated partition, account, QoS, memory style (`--mem` vs `--mem-per-cpu`), and GPU directive against the actual cluster's documented policy; the generator only validates internal consistency, never site policy.

| Tempting shortcut | Why it's wrong / what to do |
|---|---|
| "It generated a script with no errors, so the resources are correct." | The generator only checks internal consistency; it never queries the cluster. Validate partition/account/QoS/memory style and GPU directives against your site's actual docs before submitting. |
| "I'll keep the srun/mpirun already in my run command and let the generator wrap it." | Wrapping `srun` around `mpirun` launches N independent `mpirun` processes (each its own MPI world); `srun` around `srun` is malformed. Pass `--launcher none`, and confirm the auto-detect warning fired in `results.warnings`. |
| "I requested the nodes I want, so the job will use them all." | If any `#SBATCH` directive slips below the first command it is silently dropped and the job falls back to cluster defaults. Re-read the generated script and confirm all directives precede `set -euo pipefail`. |
| "ntasks doesn't divide the GPU count, but it ran, so it's fine." | A non-integer `ranks_per_gpu` means ranks map unevenly to devices (idle/oversubscribed GPUs). The generator emits a task-to-GPU warning; fix `ntasks`/GPUs or explicitly document MPS sharing. |
| "I set high ntasks-per-node because more ranks is faster." | Without `--cores-per-node` the generator can't catch oversubscription, and `ntasks-per-node * cpus-per-task` exceeding physical cores degrades performance. Pass `--cores-per-node` and check `cpus_per_node_requested`. |
| "I'll set both --mem and --mem-per-cpu to be safe." | These are mutually exclusive; the generator rejects supplying both. Pick the one your cluster's policy enforces. |
| "OMP_NUM_THREADS doesn't matter for an MPI-only code." | The generator always exports `OMP_NUM_THREADS=cpus-per-task`; for MPI-only runs keep `--cpus-per-task=1` so threaded libraries don't silently oversubscribe cores. |
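The double-launcher pitfall is easy to illustrate. The sketch below mimics the described behavior (prepend `srun` by default, fall back to no-wrap when the command already starts with a known launcher). It is an approximation, not the generator's code: `build_run_line` is a hypothetical name, matching on the basename of the first token is an assumption, and the warning text is invented.

```python
import os
import shlex

# Launcher names listed in this document.
KNOWN_LAUNCHERS = {"srun", "mpirun", "mpiexec", "mpiexec.hydra",
                   "orterun", "aprun", "jsrun"}


def build_run_line(command, ntasks, cpus_per_task, launcher="srun"):
    """Return (run_line, warnings) for a tokenized run command."""
    if not command:
        raise ValueError("Provide a run command after --")
    warnings = []
    # Assumption: compare the basename so /usr/bin/mpirun is also detected.
    first = os.path.basename(command[0])
    if launcher == "srun" and first in KNOWN_LAUNCHERS:
        warnings.append(f"run command already starts with {first}; "
                        "not wrapping, consider --launcher none")
        launcher = "none"
    # Quote token by token so the command cannot inject shell syntax.
    quoted = " ".join(shlex.quote(tok) for tok in command)
    if launcher == "none":
        return quoted, warnings
    prefix = f"srun --ntasks={ntasks} --cpus-per-task={cpus_per_task}"
    return f"{prefix} {quoted}", warnings


if __name__ == "__main__":
    print(build_run_line(["./simulate", "--config", "cfg.json"], 128, 2))
    print(build_run_line(["mpirun", "-np", "4", "./a.out"], 4, 1))
```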
- `--time` is validated against strict `HH:MM:SS` or `D-HH:MM:SS` format via regex (minutes/seconds in [00,59])
- `--nodes`, `--ntasks`, `--ntasks-per-node`, `--cpus-per-task`, `--gpus-per-node`, `--cores-per-node` are validated as positive integers with generous upper bounds (e.g. nodes ≤ 100000, ntasks ≤ 10000000, cpus-per-task ≤ 4096, gpus-per-node ≤ 64); the derived total `ntasks = nodes * ntasks-per-node` is also bounds-checked
- `--mem` and `--mem-per-cpu` are validated against SLURM's accepted format (`^[0-9]+([KMGT])?$`); providing both simultaneously is rejected
- `--job-name` is validated against `^[A-Za-z0-9][A-Za-z0-9._-]{0,63}$` (no spaces or shell metacharacters)
- `--partition`, `--account`, `--qos`, `--constraint`, `--reservation`, and `--gpu-type` are validated against a safe-character allowlist `^[A-Za-z0-9][A-Za-z0-9._:,+-]{0,127}$` before being emitted into `#SBATCH` directives
- `--module` values are validated against `^[A-Za-z0-9][A-Za-z0-9._/+-]{0,127}$` to prevent shell injection (no `;`, `|`, `&`, backticks, `$`, or whitespace)
- `--env` keys must be valid shell identifiers (`^[A-Za-z_][A-Za-z0-9_]*$`); values are shell-quoted in the generated `export` lines
- `--srun-extra` is tokenized with `shlex.split` and each token is re-quoted with `shlex.quote`, so it cannot inject shell syntax (`;`, `|`, `&`, `$`, ...) into the run line
- `--out` writes the generated sbatch script to a single specified file path
- The generated script is a plain shell script with `#SBATCH` directives; it contains no dynamically generated code
- The agent runs `slurm_script_generator.py` with explicit argument lists; the generated script itself is NOT executed by the agent
- Output is limited to the `.sbatch` file; writes are scoped to the user's working directory
- No `eval()`, `exec()`, or dynamic code generation
- No shell invocation (no `shell=True`)
- The run command (everything after `--`) is shell-quoted token-by-token into the generated script but is never executed by the skill itself
- Module names, `--srun-extra` tokens, and identifier-style fields are sanitized/quoted to prevent injection into `module load`, the `srun` invocation, or `#SBATCH` directives
- Generated scripts place `#SBATCH` directives immediately after the shebang (before any executable command, so SLURM does not stop parsing them) and use `set -euo pipefail` for safe shell execution on the cluster
- `references/slurm_directives.md` - Common `#SBATCH` directives and mapping tips
- Changes: fixed directive placement (`#SBATCH` lines now precede `set -euo pipefail`), avoided double-launcher wrapping when the run command already starts with `srun`/`mpirun`, added GPU task-to-GPU ratio warning and layout guidance, hardened input validation (integer upper bounds, partition/account/qos/constraint/reservation/gpu-type allowlists, module sanitization, `--srun-extra` quoting), escaped `%j` in `--help`, and corrected the Security and output documentation.
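The sanitization rules above can be sketched with the standard library alone. The regular expressions are the ones quoted in this document, and `shlex.split` / `shlex.quote` are real Python functions. The helper names (`safe_module_line`, `safe_export`, `requote_srun_extra`) and the error messages are hypothetical and are not taken from the generator.

```python
import re
import shlex

MODULE_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._/+-]{0,127}")
ENV_KEY_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_module_line(name: str) -> str:
    """Emit a `module load` line only for names that pass the allowlist."""
    if not MODULE_RE.fullmatch(name):
        raise ValueError(f"module must match {MODULE_RE.pattern}")
    return f"module load {name}"


def safe_export(key: str, value: str) -> str:
    """Validate the key as a shell identifier and shell-quote the value."""
    if not ENV_KEY_RE.fullmatch(key):
        raise ValueError(f"env key must match {ENV_KEY_RE.pattern}")
    return f"export {key}={shlex.quote(value)}"


def requote_srun_extra(raw: str) -> str:
    """Tokenize, then re-quote each token so metacharacters stay literal."""
    return " ".join(shlex.quote(tok) for tok in shlex.split(raw))


if __name__ == "__main__":
    print(safe_module_line("openmpi/4.1"))
    print(safe_export("MSG", "a b"))
    print(requote_srun_extra("--gpu-bind=closest; rm -rf x"))
```

In the last call the `;` ends up inside a single-quoted token, so the shell treats it as part of an argument to `srun` rather than as a command separator.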