Using the SLURM Job Scheduler

 

What is SLURM?

SLURM stands for Simple Linux Utility for Resource Management and is used on many of the world's largest computers. Its primary task is to allocate resources within a cluster to each submitted job. When there are more jobs than available resources, SLURM queues the incoming jobs and manages a fair-share allocation of resources.

You may consider SLURM the "glue" between compute nodes and parallel jobs. Communication between the nodes of a parallel job is typically managed by MPI, a message-passing interface that programs use to communicate across separate nodes. The design goal of SLURM is to make job management on a cluster simple and easy; ideally, executing a program should be as easy as it would be on a PC.

For more detailed information about SLURM, please refer to the official SLURM website.

 

Partitions on BioHPC (Nucleus)

SLURM partitions are separate collections of nodes. As of 2018, BioHPC has a total of 500 compute nodes, of which 494 are addressable by SLURM. These nodes are grouped into partitions based on their hardware and memory capacity, and some partitions share nodes with others. The partitions are as follows:

Partition | Nodes | Node List | CPU | Physical (Logical) Cores | Memory Capacity (GB) | GPU
32GB | 280 | NucleusA[002-241], NucleusB[002-041] | Intel E5-2680 | 16 (32) | 32 | N/A
128GB | 24 | Nucleus[010-033] | Intel E5-2670 | 16 (32) | 128 | N/A
256GB | 78 | Nucleus[034-041,050-081,084-121] | Intel E5-2680v3 | 24 (48) | 256 | N/A
256GBv1 | 48 | Nucleus[126-157,174-189] | Intel E5-2680v4 | 28 (56) | 256 | N/A
384GB | 2 | Nucleus[082-083] | Intel E5-2670 | 16 (32) | 384 | N/A
GPU | 40 | Nucleus[042-049], NucleusC[002-033] | various | various | various | Tesla K20/K40/P4/P40
GPUp4 | 16 | NucleusC[002-017] | Intel Gold 6140 | 36 (72) | 384 | Tesla P4
GPUp40 | 16 | NucleusC[018-033] | Intel Gold 6140 | 36 (72) | 384 | Tesla P40
GPUp100 | 12 | Nucleus[162-173] | Intel E5-2680v4 | 28 (56) | 256 | Tesla P100 (2x)
GPUv100 | 2 | NucleusC[034-035] | Intel Gold 6140 | 36 (72) | 384 | Tesla V100 16GB (2x)
GPUv100s | 10 | NucleusC[036-045] | Intel Gold 6140 | 36 (72) | 384 | Tesla V100 32GB (1x)
GPU4v100 | 12 | NucleusC[070-081] | Intel Gold 6240 | 36 (72) | 376 | Tesla V100 32GB (4x)
GPUA100 | 16 | NucleusC[086-101] | Intel Gold 6240 | 36 (72) | 1423 | Tesla A100 40GB (1x)
GPU4A100 | 10 | NucleusC[102-111] | Intel Gold 6354 | 36 (72) | 977 | Tesla A100 80GB (4x)
PHG | 8 | Nucleus[122-125,158-161] | Intel E5-2680v3 | 24 (48) | 256 | N/A
super | 432 | All non-GPU and non-PHG nodes | various | various | various | N/A

 

If a partition is not explicitly specified at job submission, SLURM allocates the job to the default 128GB partition. The PHG partition is available only to the PHG group.
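
To target a different partition, name it explicitly at submission time. A minimal sketch (myjob.sh is a placeholder script name):

  sbatch --partition=256GB myjob.sh                         # submit myjob.sh to the 256GB partition

The same choice can be made inside the script with an #SBATCH --partition directive, as the example scripts at the end of this page show.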

 

Node Terminology

Node: A hardware unit that contains a motherboard, CPU, random-access memory, and possibly a GPU. Each compute node contains two sockets.

Socket: A collection of physical CPU cores with direct access to random-access memory.

Core: A single processing unit within the CPU capable of performing computations. Each physical core appears as two logical cores because it can process two threads simultaneously.

Thread: A sequence of computer instructions that can be processed by a single logical core. Each physical core can process two threads simultaneously.

 

 

As an example, a node similar to one of the 32GB nodes has two sockets, each containing an eight-core CPU, and each physical core can run two concurrent threads. The maximum number of threads the node can process at any one time is:

  • 2 sockets * 8 physical cores * 2 threads per core = 32 threads
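
To check the layout of a specific node, scontrol reports its socket, core, and thread counts. A minimal sketch (NucleusA002 is just one node name taken from the partition table above):

  scontrol show node NucleusA002                            # look for Sockets, CoresPerSocket, ThreadsPerCore and CPUTot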

 

Basic SLURM commands

Before Job Submission

sinfo: View information about SLURM nodes and partitions

[s178337@Nucleus006 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
256GB        up   infinite     77  alloc Nucleus[034-041,050-063,065-081,084-121]
256GB        up   infinite      1   idle Nucleus064
384GB        up   infinite      2  alloc Nucleus[082-083]
256GBv1      up   infinite     48  alloc Nucleus[126-157,174-189]
super        up   infinite      1 drain* NucleusA013
super        up   infinite    203  alloc Nucleus[010-041,050-063,065-121,126-157,174-189],NucleusA[002-012,014-034,037,054-068,078-081]
super        up   infinite     28   idle Nucleus064,NucleusA[035-036,038-053,069-077]
GPU          up   infinite     16  alloc Nucleus[043-047,049,162-165,167-171,173]
GPU          up   infinite      4   idle Nucleus[042,048,166,172]
128GB*       up   infinite     24  alloc Nucleus[010-033]
PHG          up   infinite      7  alloc Nucleus[122-125,158-160]
PHG          up   infinite      1   idle Nucleus161
GPUv1        up   infinite     10  alloc Nucleus[162-165,167-171,173]
GPUv1        up   infinite      2   idle Nucleus[166,172]
32GB         up   infinite      1 drain* NucleusA013
32GB         up   infinite     52  alloc NucleusA[002-012,014-034,037,054-068,078-081]
32GB         up   infinite     27   idle NucleusA[035-036,038-053,069-077]

Submit Job

          sbatch: Submit a script for later execution (i.e., batch mode)

          srun: Create a job allocation and launch a job (i.e., typically a parallel or interactive job)
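
For example (a minimal sketch; myjob.sh and ./my_program are placeholder names):

          sbatch myjob.sh                                    # queue the batch script; SLURM replies with the assigned job ID
          srun --partition=super --ntasks=4 ./my_program     # allocate resources and immediately launch 4 copies of the program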

 

While Job is Running

          squeue: Report job status

[s178337@Nucleus006 ~]$ squeue 
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
 630904   256GBv1   webGUI  s159113   R    1:17:37      1 Nucleus154
 628850      32GB   webGUI  s175864   R   19:32:24      1 NucleusA061
 628852      32GB   webGUI  s180052   R   19:29:18      1 NucleusA054
 629113      32GB   webGUI  s164085   R   18:09:51      1 NucleusA009
 630277      32GB   webGUI  s178722   R    2:47:10      1 NucleusA019
 630354      32GB   webGUI  s170446   R    2:37:06      1 NucleusA012
 630620      32GB   webGUI  s156240   R    1:43:04      1 NucleusA033
 630621      32GB   webGUI hatawang   R    1:41:29      1 NucleusA002
 630876      32GB   webGUI  s171489   R    1:40:07      1 NucleusA037
 630898      32GB   webGUI  s177630   R    1:21:22      1 NucleusA059
 621067       GPU D2K00008 ansir_fm   R 8-03:11:57      1 Nucleus046
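
squeue can also be limited to your own jobs or to a single job (a sketch; 630904 is a job ID taken from the output above):

          squeue -u $USER                         # show only the current user's jobs
          squeue -j 630904                        # show the status of one specific job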

scontrol: View/update system, job, step, partition or reservation status

[ydu@biohpcws009 ~]$ scontrol show job 28312
JobId=28312 Name=remoteGUI
   UserId=ydu(158992) GroupId=biohpc_admin(1001)
   Priority=4294876515 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=06:31:50 TimeLimit=20:00:00 TimeMin=N/A
   SubmitTime=2015-02-17T09:42:37 EligibleTime=2015-02-17T09:42:37
   StartTime=2015-02-17T09:42:37 EndTime=2015-02-18T05:42:37
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=super AllocNode:Sid=Nucleus005:14594
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=Nucleus038
   BatchHost=Nucleus038
   NumNodes=1 NumCPUs=32 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/cm/shared/examples/.remoteGUI.job
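
Beyond inspecting a job, scontrol can show partition settings and hold or release a queued job (a sketch; <jobid> is a placeholder for a real job ID):

          scontrol show partition super           # limits and node list of the super partition
          scontrol hold <jobid>                   # keep a pending job from starting
          scontrol release <jobid>                # allow a held job to be scheduled again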

 

          sview: Report/update system, job, step, partition or reservation status (GTK-based GUI)

 

After Job is Completed

          sacct: Report accounting information by individual job and job step

[ydu@biohpcws009 ~]$ sacct -j 28066
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
28066        ReadCount+      super                    32  COMPLETED      0:0 
28066.batch       batch                                1  COMPLETED      0:0 
28066.0      count_rea+                                2  COMPLETED      0:0 
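
sacct can also report selected fields, which is handy for checking run time and memory use after a job finishes (a sketch using standard sacct column names):

          sacct -j 28066 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State,ExitCode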

 

For more information on each command, type man <command>; for example, man sbatch.

 

Life cycle of a job

 
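A submitted job normally moves from PENDING (waiting in the queue) to RUNNING, and ends in a terminal state such as COMPLETED, FAILED, CANCELLED, or TIMEOUT. A quick way to check where a job currently is (a sketch; <jobid> is a placeholder for a real job ID):

          scontrol show job <jobid> | grep JobState          # e.g. JobState=PENDING, RUNNING or COMPLETED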

 

SLURM example scripts

  • Serial job
  #!/bin/bash
  #SBATCH --job-name=serialJob                              # job name
  #SBATCH --partition=super                                 # select a partition (e.g., 128GB, 256GB, 384GB, GPU, super)
  #SBATCH --nodes=1                                         # number of nodes requested by user
  #SBATCH --time=0-00:00:30                                 # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=serialJob.%j.out                         # standard output file name
  #SBATCH --error=serialJob.%j.time                         # standard error output file name
  #SBATCH --mail-user=username@utsouthwestern.edu           # specify an email address
  #SBATCH --mail-type=ALL                                   # send email when the job status changes (begin, end, fail, etc.)

  module add matlab/2014b                                   # load software package

  matlab -nodisplay -nodesktop -nosplash < script.m         # execute program

 

  • Multi-task job
  #!/bin/bash
  #SBATCH --job-name=mutiTaskJob                                      # job name
  #SBATCH --partition=super                                           # select a partition (e.g., 128GB, 256GB, 384GB, GPU, super)
  #SBATCH --nodes=2                                                   # number of nodes requested by user
  #SBATCH --ntasks=64                                                 # number of total tasks
  #SBATCH --time=0-00:00:30                                           # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=mutiTaskJob.%j.out                                 # standard output file name
  #SBATCH --error=mutiTaskJob.%j.time                                 # standard error output file name
  #SBATCH --mail-user=username@utsouthwestern.edu                     # specify an email address
  #SBATCH --mail-type=ALL                                             # send email when the job status changes (begin, end, fail, etc.)

  module add matlab/2014b                                             # load software package
  
  let "ID=$SLURM_NODEID*$SLURM_NTASKS/$SLURM_NNODES+SLURM_LOCALID+1"  # distribute tasks to 2 nodes based on their ID

  srun matlab -nodisplay -nodesktop -nosplash < script.m ID            # execute program 

 

  • Multi-threading job
  #!/bin/bash
  #SBATCH --job-name=mutiThreading                                    # job name
  #SBATCH --partition=super                                           # select a partition (e.g., 128GB, 256GB, 384GB, GPU, super)
  #SBATCH --nodes=1                                                   # number of nodes requested by user
  #SBATCH --ntasks=30                                                 # number of total tasks
  #SBATCH --time=0-10:00:00                                           # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=mutiThreading.%j.out                               # redirect both standard output and standard error to the same file
  #SBATCH --mail-user=username@utsouthwestern.edu                     # specify an email address
  #SBATCH --mail-type=ALL                                             # send email when the job status changes (begin, end, fail, etc.)

  module add phenix/1.9                                               # load software package

  phenix.den_refine model.pdb data.mtz nproc=30                       # execute program with 30 CPUs 

 

  • Multi-core job (MPI)
  #!/bin/bash
  #SBATCH --job-name=MPI                                              # job name
  #SBATCH --partition=super                                           # select a partition (e.g., 128GB, 256GB, 384GB, GPU, super)
  #SBATCH --nodes=2                                                   # number of nodes requested by user
  #SBATCH --ntasks=64                                                 # number of total tasks
  #SBATCH --time=0-00:00:10                                           # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=MPI.%j.out                                         # redirect both standard output and standard error to the same file
  #SBATCH --mail-user=username@utsouthwestern.edu                     # specify an email address
  #SBATCH --mail-type=ALL                                             # send email when the job status changes (begin, end, fail, etc.)

  module add mvapich2/gcc/1.9                                         # load MPI library

  mpirun ./MPI_only                                                   # execute 64 MPI tasks across 2 nodes 

 

  • Hybrid multi-core/multi-threading job (MPI with pthreads)
  #!/bin/bash
  #SBATCH --job-name=MPI_pthread                                       # job name
  #SBATCH --partition=super                                            # select a partition (e.g., 128GB, 256GB, 384GB, GPU, super)
  #SBATCH --nodes=4                                                    # number of nodes requested by user
  #SBATCH --ntasks=8                                                   # number of total MPI tasks
  #SBATCH --time=0-00:00:10                                            # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=MPI_pthread.%j.out                                  # redirect both standard output and standard error to the same file
  #SBATCH --mail-user=username@utsouthwestern.edu                      # specify an email address
  #SBATCH --mail-type=ALL                                              # send email when the job status changes (begin, end, fail, etc.)

  module add mvapich2/gcc/1.9                                          # load MPI library

  let "NUM_THREADS=$SLURM_CPUS_ON_NODE/($SLURM_NTASKS/$SLURM_NNODES)"  # calculate number of threads per MPI job
  
  mpirun ./MPI_pthread $NUM_THREADS                                    # 8 MPI tasks across 4 nodes, each MPI executes with 16 threads 

 

  • GPU job
  #!/bin/bash
  #SBATCH --job-name=cuda-test                             # job name
  #SBATCH --partition=GPU                                  # select the GPU partition
  #SBATCH --nodes=1                                        # number of nodes requested by user
  #SBATCH --gres=gpu:1                                     # request GPUs, format: --gres=gpu:[n], where n is the number of GPU cards
  #SBATCH --time=0-00:00:10                                # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=cuda.%j.out                             # redirect both standard output and standard error to the same file
  #SBATCH --mail-user=username@utsouthwestern.edu          # specify an email address
  #SBATCH --mail-type=ALL                                  # send email when the job status changes (begin, end, fail, etc.)

  module add cuda65                                        # load cuda library
  
  ./matrixMul                                              # execute GPU program 
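
Inside a GPU job it can be useful to confirm which card was assigned before the main program runs. A minimal sketch (nvidia-smi is the standard NVIDIA utility, assumed to be available on the GPU nodes; SLURM commonly exports CUDA_VISIBLE_DEVICES when --gres=gpu is used):

  echo $CUDA_VISIBLE_DEVICES                               # index of the GPU(s) granted to this job (if exported by SLURM)
  nvidia-smi                                               # model and memory of the allocated GPU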

 

Sample scripts can be downloaded from Slurm Job Scheduler (Demo Files).