What is SLURM?
SLURM stands for Simple Linux Utility for Resource Management and is used on many of the world's largest computers. The primary task of SLURM is to allocate resources within a cluster for each submitted job. When there are more jobs than available resources, SLURM queues the incoming jobs and schedules them according to a fair-share allocation policy.
You may consider SLURM as the "glue" between compute nodes and parallel jobs. Communication between the nodes of a parallel job is typically handled by MPI, a message-passing system that programs use to communicate across separate nodes. The design goal of SLURM is to make job management on a cluster simple and easy; ideally, running a program on the cluster should be as easy as running it on a PC.
For more detailed information about SLURM, please refer to the official SLURM website.
Partitions on BioHPC (Nucleus)
SLURM partitions are separate collections of nodes. BioHPC has a total of 500 compute nodes as of 2018, of which 494 are addressable by SLURM. These nodes are classified into partitions according to their hardware and memory capacity; some partitions share nodes with others. The partitions are as follows:
Partition | Nodes | Node List | CPU | Physical (Logical) Cores | Memory Capacity (GB) | GPU |
32GB | 280 | NucleusA[002-241],NucleusB[002-041] | Intel E5-2680 | 16 (32) | 32 | N/A |
128GB | 24 | Nucleus[010-033] | Intel E5-2670 | 16 (32) | 128 | N/A |
256GB | 78 | Nucleus[034-041, 050-081, 084-121] | Intel E5-2680v3 | 24 (48) | 256 | N/A |
256GBv1 | 48 | Nucleus[126-157,174-189] | Intel E5-2680v4 | 28 (56) | 256 | N/A |
384GB | 2 | Nucleus[082-083] | Intel E5-2670 | 16 (32) | 384 | N/A |
GPU | 40 | Nucleus[042-049],NucleusC[002-033] | various | various | various | Tesla K20/K40/P4/P40 |
GPUp4 | 16 | NucleusC[002-017] | Intel Gold 6140 | 36 (72) | 384 | Tesla P4 |
GPUp40 | 16 | NucleusC[018-033] | Intel Gold 6140 | 36 (72) | 384 | Tesla P40 |
GPUp100 | 12 | Nucleus[162-173] | Intel E5-2680v4 | 28 (56) | 256 | Tesla P100 (2X) |
GPUv100 | 2 | NucleusC[034-035] | Intel Gold 6140 | 36 (72) | 384 | Tesla V100 16GB (2x) |
GPUv100s | 10 | NucleusC[036-045] | Intel Gold 6140 | 36 (72) | 384 | Tesla V100 32GB (1x) |
GPU4v100 | 12 | NucleusC[070-081] | Intel Gold 6240 | 36 (72) | 376 | Tesla V100 32GB (4x) |
GPUA100 | 16 | NucleusC[086-101] | Intel Gold 6240 | 36 (72) | 1423 | Tesla A100 40GB (1x) |
GPU4A100 | 10 | NucleusC[102-111] | Intel Gold 6354 | 36 (72) | 977 | Tesla A100 80GB (4x) |
PHG | 8 | Nucleus[122-125, 158-161] | Intel E5-2680v3 | 24 (48) | 256 | N/A |
super | 432 | All non-GPU and non-PHG nodes | various | various | various | N/A |
If a partition is not explicitly specified at submission time, SLURM allocates your job to the 128GB partition by default. The PHG partition is available only to members of the PHG group.
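For example, to send a job to the 256GB partition instead of the default, you can either add a partition directive to your batch script or override it on the sbatch command line (myjob.sh below is just a placeholder script name):

#SBATCH --partition=256GB                  # inside the batch script

$ sbatch --partition=256GB myjob.sh        # or at submission time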
Node Terminology
Node: A hardware unit that contains a motherboard, CPU, random-access memory, and possibly a GPU. Each compute node contains two sockets.
Socket: A collection of physical CPU cores with direct access to random-access memory.
Core: A single processing unit within the CPU capable of performing computations. Each physical core appears as two logical cores, because a physical core can execute two threads simultaneously.
Thread: A sequence of computer instructions that can be processed by a single logical core. Each physical core can run two threads concurrently.
As an example, consider a node like one of the 32GB nodes: it has two sockets, each containing an eight-core CPU, and each physical core can run two concurrent threads. The maximum number of threads the node can process at any one time is therefore 2 sockets × 8 cores × 2 threads = 32 concurrent threads.
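If you want to check the socket, core, and thread layout of a particular node yourself, scontrol reports these counts; NucleusA002 below is used only as an example node name:

$ scontrol show node NucleusA002 | grep -E 'Sockets|CoresPerSocket|ThreadsPerCore'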
Basic SLURM commands
sinfo: View information about SLURM nodes and partitions
[s178337@Nucleus006 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
256GB        up   infinite     77  alloc  Nucleus[034-041,050-063,065-081,084-121]
256GB        up   infinite      1   idle  Nucleus064
384GB        up   infinite      2  alloc  Nucleus[082-083]
256GBv1      up   infinite     48  alloc  Nucleus[126-157,174-189]
super        up   infinite      1 drain*  NucleusA013
super        up   infinite    203  alloc  Nucleus[010-041,050-063,065-121,126-157,174-189],NucleusA[002-012,014-034,037,054-068,078-081]
super        up   infinite     28   idle  Nucleus064,NucleusA[035-036,038-053,069-077]
GPU          up   infinite     16  alloc  Nucleus[043-047,049,162-165,167-171,173]
GPU          up   infinite      4   idle  Nucleus[042,048,166,172]
128GB*       up   infinite     24  alloc  Nucleus[010-033]
PHG          up   infinite      7  alloc  Nucleus[122-125,158-160]
PHG          up   infinite      1   idle  Nucleus161
GPUv1        up   infinite     10  alloc  Nucleus[162-165,167-171,173]
GPUv1        up   infinite      2   idle  Nucleus[166,172]
32GB         up   infinite      1 drain*  NucleusA013
32GB         up   infinite     52  alloc  NucleusA[002-012,014-034,037,054-068,078-081]
32GB         up   infinite     27   idle  NucleusA[035-036,038-053,069-077]
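With this many partitions it is often convenient to limit sinfo to a single partition:

$ sinfo -p GPU          # show node states for the GPU partition only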
sbatch: Submit a script for later execution (i.e., batch mode)
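A typical submission looks like the following (serialJob.sh stands for your own batch script); sbatch prints the ID assigned to the job and returns immediately:

$ sbatch serialJob.sh
Submitted batch job <jobid>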
squeue: Report job status
[s178337@Nucleus006 ~]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
630904   256GBv1   webGUI  s159113  R    1:17:37      1 Nucleus154
628850      32GB   webGUI  s175864  R   19:32:24      1 NucleusA061
628852      32GB   webGUI  s180052  R   19:29:18      1 NucleusA054
629113      32GB   webGUI  s164085  R   18:09:51      1 NucleusA009
630277      32GB   webGUI  s178722  R    2:47:10      1 NucleusA019
630354      32GB   webGUI  s170446  R    2:37:06      1 NucleusA012
630620      32GB   webGUI  s156240  R    1:43:04      1 NucleusA033
630621      32GB   webGUI hatawang  R    1:41:29      1 NucleusA002
630876      32GB   webGUI  s171489  R    1:40:07      1 NucleusA037
630898      32GB   webGUI  s177630  R    1:21:22      1 NucleusA059
621067       GPU D2K00008 ansir_fm  R 8-03:11:57      1 Nucleus046
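On a busy cluster you will usually want to filter the listing, for example to your own jobs or to a single job ID (630904 is taken from the output above):

$ squeue -u $USER       # show only your jobs
$ squeue -j 630904      # show one job by ID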
scontrol: View/update system, job, step, partition or reservation status
[ydu@biohpcws009 ~]$ scontrol show job 28312
JobId=28312 Name=remoteGUI
   UserId=ydu(158992) GroupId=biohpc_admin(1001)
   Priority=4294876515 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=06:31:50 TimeLimit=20:00:00 TimeMin=N/A
   SubmitTime=2015-02-17T09:42:37 EligibleTime=2015-02-17T09:42:37
   StartTime=2015-02-17T09:42:37 EndTime=2015-02-18T05:42:37
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=super AllocNode:Sid=Nucleus005:14594
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=Nucleus038
   BatchHost=Nucleus038
   NumNodes=1 NumCPUs=32 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/cm/shared/examples/.remoteGUI.job
sview: Report/update system, job, step, partition or reservation status (GTK-based GUI)
sacct: Report accounting information by individual job and job step
[ydu@biohpcws009 ~]$ sacct -j 28066
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
28066        ReadCount+      super                    32  COMPLETED      0:0
28066.batch       batch                                1  COMPLETED      0:0
28066.0      count_rea+                                2  COMPLETED      0:0
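sacct also accepts a --format option when you need fields beyond the defaults, such as elapsed time and peak memory use per job step:

$ sacct -j 28066 --format=JobID,JobName,Elapsed,MaxRSS,State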
For more information on each command, type man <command>. For example, man sbatch displays the full manual page for sbatch.
Life cycle of a job
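In outline: a job submitted with sbatch first waits in its partition's queue in the PENDING (PD) state; once SLURM can allocate the requested nodes it moves to RUNNING (R), the state shown in the ST column of squeue above; when it ends it reaches a terminal state such as COMPLETED, FAILED, CANCELLED, or TIMEOUT, and its accounting record remains available through sacct.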
SLURM example scripts
#!/bin/bash
#SBATCH --job-name=serialJob                        # job name
#SBATCH --partition=super                           # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=1                                   # number of nodes requested by user
#SBATCH --time=0-00:00:30                           # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=serialJob.%j.out                   # standard output file name
#SBATCH --error=serialJob.%j.time                   # standard error output file name
#SBATCH --mail-user=username@utsouthwestern.edu     # specify an email address
#SBATCH --mail-type=ALL                             # send email when job status changes (begin, end, fail, etc.)

module add matlab/2014b                             # load software package

matlab -nodisplay -nodesktop -nosplash < script.m   # execute program
#!/bin/bash
#SBATCH --job-name=multiTaskJob                     # job name
#SBATCH --partition=super                           # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=2                                   # number of nodes requested by user
#SBATCH --ntasks=64                                 # number of total tasks
#SBATCH --time=0-00:00:30                           # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=multiTaskJob.%j.out                # standard output file name
#SBATCH --error=multiTaskJob.%j.time                # standard error output file name
#SBATCH --mail-user=username@utsouthwestern.edu     # specify an email address
#SBATCH --mail-type=ALL                             # send email when job status changes (begin, end, fail, etc.)

module add matlab/2014b                             # load software package

let "ID=$SLURM_NODEID*$SLURM_NTASKS/$SLURM_NNODES+$SLURM_LOCALID+1"   # distribute tasks to 2 nodes based on their ID

srun matlab -nodisplay -nodesktop -nosplash < script.m $ID            # execute program, passing the task ID
#!/bin/bash
#SBATCH --job-name=multiThreading                   # job name
#SBATCH --partition=super                           # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=1                                   # number of nodes requested by user
#SBATCH --ntasks=30                                 # number of total tasks
#SBATCH --time=0-10:00:00                           # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=multiThreading.%j.out              # redirect both standard output and error output to the same file
#SBATCH --mail-user=username@utsouthwestern.edu     # specify an email address
#SBATCH --mail-type=ALL                             # send email when job status changes (begin, end, fail, etc.)

module add phenix/1.9                               # load software package

phenix.den_refine model.pdb data.mtz nproc=30       # execute program with 30 CPUs
#!/bin/bash
#SBATCH --job-name=MPI                              # job name
#SBATCH --partition=super                           # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=2                                   # number of nodes requested by user
#SBATCH --ntasks=64                                 # number of total tasks
#SBATCH --time=0-00:00:10                           # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=MPI.%j.out                         # redirect both standard output and error output to the same file
#SBATCH --mail-user=username@utsouthwestern.edu     # specify an email address
#SBATCH --mail-type=ALL                             # send email when job status changes (begin, end, fail, etc.)

module add mvapich2/gcc/1.9                         # load MPI library

mpirun ./MPI_only                                   # execute 64 MPI tasks across 2 nodes
#!/bin/bash
#SBATCH --job-name=MPI_pthread                      # job name
#SBATCH --partition=super                           # select partition from 128GB, 256GB, 384GB, GPU and super
#SBATCH --nodes=4                                   # number of nodes requested by user
#SBATCH --ntasks=8                                  # number of total MPI tasks
#SBATCH --time=0-00:00:10                           # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=MPI_pthread.%j.out                 # redirect both standard output and error output to the same file
#SBATCH --mail-user=username@utsouthwestern.edu     # specify an email address
#SBATCH --mail-type=ALL                             # send email when job status changes (begin, end, fail, etc.)

module add mvapich2/gcc/1.9                         # load MPI library

let "NUM_THREADS=$SLURM_CPUS_ON_NODE/($SLURM_NTASKS/$SLURM_NNODES)"   # calculate number of threads per MPI task

mpirun ./MPI_pthread $NUM_THREADS                   # 8 MPI tasks across 4 nodes, each MPI task runs 16 threads
#!/bin/bash
#SBATCH --job-name=cuda-test                        # job name
#SBATCH --partition=GPU                             # select the GPU partition
#SBATCH --nodes=1                                   # number of nodes requested by user
#SBATCH --gres=gpu:1                                # use generic resource GPU, format: --gres=gpu:[n], n is the number of GPU cards
#SBATCH --time=0-00:00:10                           # run time, format: D-H:M:S (max wallclock time)
#SBATCH --output=cuda.%j.out                        # redirect both standard output and error output to the same file
#SBATCH --mail-user=username@utsouthwestern.edu     # specify an email address
#SBATCH --mail-type=ALL                             # send email when job status changes (begin, end, fail, etc.)

module add cuda65                                   # load cuda library

./matrixMul                                         # execute GPU program
Sample scripts can be downloaded from Slurm Job Scheduler (Demo Files).