Using the SLURM Job Scheduler

 

What is SLURM?

SLURM stands for Simple Linux Utility for Resource Management and is used on many of the world's largest computers. Its major task is to allocate resources within a cluster to each job. When there are more jobs than resources, SLURM places incoming jobs in waiting queues and manages a fair-share allocation of resources.

You may think of SLURM as the "glue" between compute nodes and parallel jobs. Communication between the nodes of a parallel job is typically handled by MPI, a message-passing interface designed for parallel computers. The design goal of SLURM is to keep things simple and easy to use on supercomputers; ideally, executing a program should be as easy as running it on a PC.

For more detailed information about SLURM, please refer to the official SLURM website.

 

Partitions on BioHPC (nucleus)

 


 

SLURM partitions can be thought of as collections of nodes. As of mid-2015, BioHPC has a total of 74 nodes; 41 of them (Nucleus010-050) are classified into the following partitions according to their RAM size (some of the partitions overlap):

        128GB*:  Nucleus010-041

        256GB:   Nucleus042-049

        384GB:   Nucleus050

        GPU:      Nucleus048-049

        super:    Nucleus010-050

If a partition is not explicitly specified when a job is submitted, SLURM allocates the job to the default 128GB partition (marked with * in the list above and in sinfo output).
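
For example, to run in the 256GB partition instead, the partition can be chosen either on the sbatch command line or as a directive inside the batch script (myjob.sh below is a placeholder name for any of the example scripts shown later):

[ydu@biohpcws009 ~]$ sbatch --partition=256GB myjob.sh

or, inside the script:

  #SBATCH --partition=256GB                                 # select the 256GB partition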

 

Nodes on BioHPC (nucleus)

Node: refers to an entire compute node (e.g., Nucleus010). Each node contains two sockets.

Socket: refers to a collection of cores with a direct pipe to memory. Each socket contains eight cores.

Core: refers to a single processing unit capable of performing computations. Each core may contain one or two threads.

Thread: refers to a hardware thread of execution; each thread shares the resources of its core and is managed and scheduled by the OS as a single logical processor.

 


 

The maximum number of processes/threads/workers/tasks that can be executed on each node is:

        2 sockets-per-node * 8 cores-per-socket * 2 threads-per-core = 32 threads-per-node
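
This layout can be confirmed for any node with scontrol; the sketch below trims its output to the relevant fields (the values shown are the ones described above):

[ydu@biohpcws009 ~]$ scontrol show node Nucleus010 | grep -oE '(Sockets|CoresPerSocket|ThreadsPerCore)=[0-9]+'
CoresPerSocket=8
Sockets=2
ThreadsPerCore=2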

 

Basic SLURM commands

  • Before job submission:

          sinfo: View information about SLURM nodes and partitions

[ydu@biohpcws009 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
128GB*       up   infinite     23  alloc Nucleus[010-025,034-040]
128GB*       up   infinite      9   idle Nucleus[026-033,041]
256GB        up   infinite      3  alloc Nucleus[042,048-049]
256GB        up   infinite      5   idle Nucleus[043-047]
GPU          up   infinite      3  alloc Nucleus[042,048-049]
GPU          up   infinite      5   idle Nucleus[043-047]
384GB        up   infinite      1  alloc Nucleus050
super        up   infinite     27  alloc Nucleus[010-025,034-040,042,048-050]
super        up   infinite     14   idle Nucleus[026-033,041,043-047]
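
sinfo can also be limited to a single partition with -p, which is handy for checking for idle nodes before submitting a job:

[ydu@biohpcws009 ~]$ sinfo -p 256GB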

 

  • Submit a job

          sbatch: Submit a script for later execution (batch mode)

          srun: Create a job allocation (if needed) and launch a job step (typically a multi-threading/multi-core job)
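
A minimal usage sketch of both commands (serialJob.sh stands for the serial example script shown later, saved under that name; the srun flags request 1 node and 4 tasks):

[ydu@biohpcws009 ~]$ sbatch serialJob.sh
[ydu@biohpcws009 ~]$ srun --nodes=1 --ntasks=4 hostname

sbatch prints the assigned job ID and returns immediately, while srun blocks until the job step completes.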

  • During job execution

          squeue: Report job and job step status

[ydu@biohpcws009 ~]$ squeue 
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  28303     128GB 3Dclsc31   jiangq   R    8:39:37      8 Nucleus[019-025,034]
  28325     128GB CC_Un_Mu   ajones   R    3:55:54      1 Nucleus037
  28274     256GB   ERMI11   kwhite   R 1-01:46:48      1 Nucleus042
  28280     256GB   ERMI21   kwhite   R 1-01:07:53      1 Nucleus048
  28281     256GB   ERMI41   kwhite   R 1-01:05:21      1 Nucleus049
  28284     super 3Dcls_gy   gyadav   R 1-00:57:27      8 Nucleus[011-018]
  28305     super remoteGU   mdrisc   R    6:48:39      1 Nucleus010
  28306     super remoteGU   mdrisc   R    6:47:37      1 Nucleus035
  28312     super remoteGU      ydu   R    6:29:07      1 Nucleus038
  28322     super   webGUI hatawang   R    5:17:42      1 Nucleus036
  28334     super remoteGU   mdrisc   R    1:38:08      1 Nucleus039
  28335     super   webGUI    lding   R    1:05:42      1 Nucleus040
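
To list only your own jobs, give squeue a user filter:

[ydu@biohpcws009 ~]$ squeue -u ydu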

 

          scontrol: View/update system, job, step, partition or reservation status

[ydu@biohpcws009 ~]$ scontrol show job 28312
JobId=28312 Name=remoteGUI
   UserId=ydu(158992) GroupId=biohpc_admin(1001)
   Priority=4294876515 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=06:31:50 TimeLimit=20:00:00 TimeMin=N/A
   SubmitTime=2015-02-17T09:42:37 EligibleTime=2015-02-17T09:42:37
   StartTime=2015-02-17T09:42:37 EndTime=2015-02-18T05:42:37
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=super AllocNode:Sid=Nucleus005:14594
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=Nucleus038
   BatchHost=Nucleus038
   NumNodes=1 NumCPUs=32 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/cm/shared/examples/.remoteGUI.job
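
scontrol can also modify a pending or running job. For example, the time limit of the job above could be lowered (ordinary users can generally only reduce a limit, not raise it):

[ydu@biohpcws009 ~]$ scontrol update JobId=28312 TimeLimit=10:00:00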

 

          sview: Report/update system, job, step, partition or reservation status (GTK-based GUI)

  • After job completion

          sacct: Report accounting information by individual job and job step

[ydu@biohpcws009 ~]$ sacct -j 28066
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
28066        ReadCount+      super                    32  COMPLETED      0:0 
28066.batch       batch                                1  COMPLETED      0:0 
28066.0      count_rea+                                2  COMPLETED      0:0 
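
sacct output can be customized with --format; for example, adding elapsed time and peak memory for the same job (all of the field names are standard sacct columns):

[ydu@biohpcws009 ~]$ sacct -j 28066 --format=JobID,JobName,Partition,AllocCPUS,Elapsed,MaxRSS,State,ExitCode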

Man pages are available for all of these commands (e.g., man sbatch).

 

Life cycle of a job


 

 

SLURM example scripts

  • Serial job
  #!/bin/bash
  #SBATCH --job-name=serialJob                              # job name
  #SBATCH --partition=super                                 # select partition from 128GB, 256GB, 384GB, GPU and super
  #SBATCH --nodes=1                                         # number of nodes requested by user
  #SBATCH --time=0-00:00:30                                 # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=serialJob.%j.out                         # standard output file name
  #SBATCH --error=serialJob.%j.time                         # standard error output file name
  #SBATCH --mail-user=username@utsouthwestern.edu           # specify an email address
  #SBATCH --mail-type=ALL                                   # send email on job status changes (start, end, abort, etc.)

  module add matlab/2014b                                   # load software package

  matlab -nodisplay -nodesktop -nosplash < script.m         # execute program

 

  • Multi-task job
  #!/bin/bash
  #SBATCH --job-name=mutiTaskJob                                      # job name
  #SBATCH --partition=super                                           # select partition from 128GB, 256GB, 384GB, GPU and super
  #SBATCH --nodes=2                                                   # number of nodes requested by user
  #SBATCH --ntasks=64                                                 # number of total tasks
  #SBATCH --time=0-00:00:30                                           # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=mutiTaskJob.%j.out                                 # standard output file name
  #SBATCH --error=mutiTaskJob.%j.time                                 # standard error output file name
  #SBATCH --mail-user=username@utsouthwestern.edu                     # specify an email address
  #SBATCH --mail-type=ALL                                             # send email on job status changes (start, end, abort, etc.)

  module add matlab/2014b                                             # load software package
  
  let "ID=$SLURM_NODEID*$SLURM_NTASKS/$SLURM_NNODES+SLURM_LOCALID+1"  # distribute tasks to 2 nodes based on their ID

  srun matlab -nodisplay -nodesktop -nosplash < script.m $ID           # execute program, passing each task its ID

 

  • Multi-threading job
  #!/bin/bash
  #SBATCH --job-name=mutiThreading                                    # job name
  #SBATCH --partition=super                                           # select partition from 128GB, 256GB, 384GB, GPU and super
  #SBATCH --nodes=1                                                   # number of nodes requested by user
  #SBATCH --ntasks=30                                                 # number of total tasks
  #SBATCH --time=0-10:00:00                                           # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=mutiThreading.%j.out                               # redirect both standard output and error output to the same file
  #SBATCH --mail-user=username@utsouthwestern.edu                     # specify an email address
  #SBATCH --mail-type=ALL                                             # send email on job status changes (start, end, abort, etc.)

  module add phenix/1.9                                               # load software package

  phenix.den_refine model.pdb data.mtz nproc=30                       # execute program with 30 CPUs 

 

  • Multi-core job (MPI)
  #!/bin/bash
  #SBATCH --job-name=MPI                                              # job name
  #SBATCH --partition=super                                           # select partition from 128GB, 256GB, 384GB, GPU and super
  #SBATCH --nodes=2                                                   # number of nodes requested by user
  #SBATCH --ntasks=64                                                 # number of total tasks
  #SBATCH --time=0-00:00:10                                           # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=MPI.%j.out                                         # redirect both standard output and error output to the same file
  #SBATCH --mail-user=username@utsouthwestern.edu                     # specify an email address
  #SBATCH --mail-type=ALL                                             # send email on job status changes (start, end, abort, etc.)

  module add mvapich2/gcc/1.9                                         # load MPI library

  mpirun ./MPI_only                                                   # execute 64 MPI tasks across 2 nodes 

 

  • Hybrid multi-core/multi-threading job (MPI with pthreads)
  #!/bin/bash
  #SBATCH --job-name=MPI_pthread                                       # job name
  #SBATCH --partition=super                                            # select partition from 128GB, 256GB, 384GB, GPU and super
  #SBATCH --nodes=4                                                    # number of nodes requested by user
  #SBATCH --ntasks=8                                                   # number of total MPI tasks
  #SBATCH --time=0-00:00:10                                            # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=MPI_pthread.%j.out                                  # redirect both standard output and error output to the same file
  #SBATCH --mail-user=username@utsouthwestern.edu                      # specify an email address
  #SBATCH --mail-type=ALL                                              # send email on job status changes (start, end, abort, etc.)

  module add mvapich2/gcc/1.9                                          # load MPI library

  let "NUM_THREADS=$SLURM_CPUS_ON_NODE/($SLURM_NTASKS/$SLURM_NNODES)"  # calculate number of threads per MPI job
  
  mpirun ./MPI_pthread $NUM_THREADS                                    # 8 MPI tasks across 4 nodes, each MPI task runs with 16 threads

 

  • GPU job
  #!/bin/bash
  #SBATCH --job-name=cuda-test                             # job name
  #SBATCH --partition=GPU                                  # select the GPU partition
  #SBATCH --nodes=1                                        # number of nodes requested by user
  #SBATCH --gres=gpu:1                                     # use generic resource GPU, format: --gres=gpu:[n], n is the number of GPU cards
  #SBATCH --time=0-00:00:10                                # run time, format: D-H:M:S (max wallclock time)
  #SBATCH --output=cuda.%j.out                             # redirect both standard output and error output to the same file
  #SBATCH --mail-user=username@utsouthwestern.edu          # specify an email address
  #SBATCH --mail-type=ALL                                  # send email on job status changes (start, end, abort, etc.)

  module add cuda65                                        # load the CUDA library
  
  ./matrixMul                                              # execute GPU program 

 

Sample scripts can be downloaded from Slurm Job Scheduler (Demo Files).