Resource Constraints

Modified

November 9, 2023

Abstract

The hardware of each compute node dictates the resource limits of an application executed on it. These limits include, for example, the main memory and the number of processor cores. Furthermore, there are limits defined by the cluster controller, like runtime or the maximum allocatable memory per compute job.

Limits

Use the sinfo command to get an overview of resource limits for nodes in their corresponding partitions:

» sinfo -o "%9P  %6g %11L %10l %10m %5D %7X %5Y %7Z"
PARTITION  GROUPS DEFAULTTIME TIMELIMIT  MEMORY     NODES SOCKETS CORES THREADS
debug      all    5:00        30:00      257649     5     8       8     2
hpc_debug  all    5:00        30:00      257649     6     8       8     2
main*      all    2:00:00     8:00:00    257649     161   8       8     2
long       all    2:00:00     7-00:00:00 257649     38    8       8     2
grid       all    1:00:00     3-00:00:00 127653+    104   2+      8+    2
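
As an alternative, the scontrol command prints the full configuration of a single partition, including DefaultTime, MaxTime and memory limits. The sketch below assumes the debug partition from the listing above (output omitted):

# show the complete configuration of one partition
» scontrol show partition debug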

Resource constraints define the execution boundary of an application. By defining resource requirements like maximum runtime and allocatable memory, users ensure that they do not unintentionally consume more resources than planned.

This is particularly important for the runtime, since a software bug 1 or an issue with access to the input data, for instance on shared storage, can increase the execution time of your application tremendously. Keep in mind that runtime is one of the main factors accounted for by the cluster controller and is charged to your associated accounts for the fair-share priority calculation.

Runtime

Use the sinfo command to list runtime limits:

sinfo -o "%9P %6g %11L %10l %5D %20C"

The output will include the following columns:

Column       Description
TIMELIMIT    Maximum runtime of a job in a given partition.
DEFAULTTIME  Default runtime if not specified by the user.

If a user does not explicitly specify a runtime limit, the cluster controller will apply the default runtime defined for the partition the compute job is executed in.
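
This can be verified with the commands used throughout this document; the sketch below assumes the debug partition (5:00 default) and the sleep.sh script used in the examples further down:

# submit without an explicit time limit
» sbatch -p debug $LUSTRE_HOME/sleep.sh

# TimeLimit now shows the partition default (00:05:00 for debug)
» scontrol show job=$(squeue -ho %A -n sleep) | grep TimeLimit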

Specification

Command-line options to specify a maximum runtime:

Option      Description
-t, --time  Limit on the total runtime of the job allocation.

The time format to specify a runtime limit is the following:

minutes
minutes:seconds
hours:minutes:seconds
days-hours
days-hours:minutes
days-hours:minutes:seconds
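
For example, the following invocations all request the same 90-minute limit (a sketch; the partition is chosen arbitrarily from the listing above):

# equivalent ways to request a 90 minute runtime limit
» salloc -p main -t 90        # minutes
» salloc -p main -t 1:30:00   # hours:minutes:seconds
» salloc -p main -t 0-1:30    # days-hours:minutes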

If the requested time limit exceeds the partition's time limit, it will be rejected:

# time limit configuration of the debug partition
» sinfo -o "%11L %10l" -p debug
DEFAULTTIME TIMELIMIT 
5:00        30:00

# request a job with a time limit exceeding the partition's configuration
» salloc -p debug -t 02:00:00
salloc: error: Job submit/allocate failed: Requested time limit is invalid...

A runtime limit can be set by an environment variable:

Variable          Description
SBATCH_TIMELIMIT  Limit on the total runtime of the job allocation.

» SBATCH_TIMELIMIT=05:00 sbatch $LUSTRE_HOME/sleep.sh
» scontrol show job=$(squeue -ho %A -n sleep) | grep TimeLimit
   RunTime=00:00:43 TimeLimit=00:05:00 TimeMin=N/A

Limit Reached

When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL:

# submit a job with a 1 minute time limit
» sbatch -p debug -t 1 sleep.sh 360

# show the configuration of the job
» scontrol show job=$(squeue -ho %A -n sleep) | grep Time
   RunTime=00:00:53 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-08-06T11:24:00 EligibleTime=2019-08-06T11:24:00
   AccrueTime=2019-08-06T11:24:00
   StartTime=2019-08-06T11:24:01 EndTime=2019-08-06T11:25:01 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0

# job is killed after reaching its limit
» cat $LUSTRE_HOME/$(squeue -ho %A -n sleep).log 
[2019/08/06T11:24:01] START vpenso@lxbk0595:/lustre/hpc/vpenso virgo:debug sleep-104 2:2 2048
[2019/08/06T11:24:01] Sleep for 360 seconds
slurmstepd: error: *** JOB 104 ON lxbk0595 CANCELLED AT 2019-08-06T11:25:21 DUE TO TIME LIMIT ***

» sacct -j 104                                           
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
104               sleep      debug        hpc          2    TIMEOUT      0:0 
104.batch         batch                   hpc          2  CANCELLED     0:15 
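
Since SIGTERM is delivered before SIGKILL, a job script can use the remaining grace period to clean up or write a checkpoint. The following is a minimal sketch (the application name and cleanup steps are placeholders; how much can be done depends on the configured delay between the two signals):

#!/bin/bash
# react to the SIGTERM sent when the time limit is reached
cleanup() {
    echo "caught SIGTERM, saving state before SIGKILL arrives"
    # ... copy partial results, remove scratch files, etc. ...
    exit 0
}
trap cleanup SIGTERM

# run the payload in the background and wait for it, so the
# trap is executed as soon as the signal is delivered
./my_application &
wait $!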

Memory

The job submission commands support the following options for users to specify the maximum amount of real memory. It is important to specify enough memory, since Slurm will not allow the application to use more than the requested amount of real memory. Jobs will be stopped by the out-of-memory handler if they use more than the requested memory.

Option         Description
--mem          Specify memory required per node.
--mem-per-cpu  Minimum memory required per allocated CPU.

Memory is specified in the format size[units] with the unit suffix [K|M|G|T]. If the final amount of memory requested by a job can’t be satisfied by any of the nodes configured in the partition, then the job will be rejected.
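
The same options can be used as directives in a batch script; a minimal sketch, with the application name as a placeholder (--mem and --mem-per-cpu are alternatives and should not be combined):

#!/bin/bash
#SBATCH --mem=8G              # request 8 GB of real memory per node
##SBATCH --mem-per-cpu=2G     # alternative: per allocated CPU (use only one of the two)

./my_application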

Memory/CPU Ratio

Global configuration for memory per CPU:

» scontrol show config | grep MemPer
DefMemPerCPU            = 2048
MaxMemPerCPU            = 4096

Requesting memory beyond DefMemPerCPU will automatically allocate additional CPUs to compensate. In the first example below, --mem=64G combined with MaxMemPerCPU=4096M yields at least 64G/4G = 16 cores (MinCPUsNode=16). In the second, --mem-per-cpu=8G exceeds MaxMemPerCPU and is therefore converted into two CPUs per task with 4G each:

» sbatch --ntasks=1 --mem=64G -- $LUSTRE_HOME/sleep.sh

» scontrol show job=$(squeue -ho %A -n sleep) | grep -e CPUs -e Memory
   NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   MinCPUsNode=16 MinMemoryNode=64G MinTmpDiskNode=0

» sbatch --ntasks=1 --mem-per-cpu=8G -- $LUSTRE_HOME/sleep.sh

» scontrol show job=$(squeue -ho %A -n sleep) | grep -e CPUs -e Memory
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0

Out-of-Memory

Jobs allocating memory beyond their limits will be killed:

# request 128MB of RAM
» salloc -p debug --mem-per-cpu=128M

# allocate RAM above the requested limit
» srun -- stress -v -t 3 -m 1 --vm-hang 3 --vm-bytes 256M
...
stress: dbug: [57638] allocating 268435456 bytes ...
stress: dbug: [57638] touching bytes in strides of 4096 bytes ...
stress: FAIL: [57637] (415) <-- worker 57638 got signal 9
...
slurmstepd: error: Detected 1 oom-kill event(s) in step 106.0 cgroup. Some of
your processes may have been killed by the cgroup out-of-memory handler.
srun: error: lxbk0595: task 0: Out Of Memory
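
To right-size the memory request, the accounting database can be consulted for the peak memory usage of a finished job; a sketch using the job ID from the example above (output omitted):

# compare the requested memory with the peak resident set size
» sacct -j 106 -o JobID,State,ReqMem,MaxRSS,Elapsed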

Cores

A typical compute node is built with support for multiple CPU sockets 2 on its motherboard. Each socket hosts a Central Processing Unit (CPU) 3, which provides multiple cores 4. Each core can execute two threads in parallel. Details are described in Support for Multi-core/Multi-thread Architectures 5 in the Slurm documentation. AMD EPYC compute nodes are built with multi-chip modules (MCMs) 6, where each chiplet represents its own socket from the perspective of the cluster controller.

Print the number of sockets, cores and threads with sinfo. The -e, --exact option lists all available node configurations explicitly:

sinfo -e -o '%9P %4c %8z %8X %8Y %8Z %5D %N'
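
The topology of an individual node can also be confirmed from within an allocation, for example with lscpu (output omitted):

# print socket, core and thread counts of an allocated node
» srun -p debug -- lscpu | grep -E 'Socket|Core|Thread'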

Tasks

Within a compute job, a task represents a unit of work (your application instance) and the corresponding resources required to execute it. The number of tasks per job is configurable by the user. The following list contains a subset of options for the salloc, srun and sbatch commands to request a specific number of tasks:

Option                Description
-n, --ntasks=         Number of tasks to start (default=1)
--ntasks-per-node=    Number of tasks to invoke on each node (default=1)
--ntasks-per-socket=  Number of tasks to invoke on each socket
--ntasks-per-core=    Number of tasks to invoke on each core

Each task is distributed to only one node, but more than one task may be distributed to each node. The number of tasks distributed to a node is constrained by the number of CPUs allocated on the node and the number of CPUs per task.

From the perspective of Linux (the host operating system), each execution thread of a CPU core is represented as an individual CPU. Therefore it is important to evaluate the term CPU depending on its context: a physical CPU (hardware core) is not necessarily the same as a logical CPU.

The cluster controller is very flexible in the distribution of tasks to nodes depending on the command-line options provided by the user. For example, the following command executes a single job with 12 tasks:

» srun --partition=debug --ntasks=12 \
        -- hostname | sort | uniq -c 
     12 lxbk0595

A user can control how tasks are distributed over nodes, over the CPU sockets on a node, and the number of tasks per core (physical CPU). The following example starts a single job with four tasks distributed over two nodes:

» srun --partition=debug --ntasks-per-node=2 --ntasks=4 \
        -- hostname | sort | uniq -c
      2 lxbk0595
      2 lxbk0596
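
The same task layout can be requested non-interactively with sbatch directives; a minimal sketch (the payload is kept as hostname, and the sorting of the output is omitted):

#!/bin/bash
#SBATCH --partition=debug
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2

# srun inherits the allocation and launches one process per task
srun -- hostname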

Logical CPUs

By default, a task allocates two logical CPUs (on a single physical CPU core). This is due to the support of modern CPUs for multi-threading 7. The locality of the memory hierarchy on modern compute hardware, as well as the limited capabilities of a second thread on a single core, makes it inefficient to execute two completely independent programs in parallel on the same core. Therefore users get the full performance of a physical CPU core, and have the option to execute two threads in parallel if desired.

# two logical CPUs by default
» srun --partition=debug --ntasks=1 \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:  1,65

# the same as specifically requesting two logical CPUs
» srun --partition=debug --ntasks=1 --cpus-per-task=2 \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:  1,65

Users can configure the number of logical CPUs per task with a command-line option:

Option                Description
-c, --cpus-per-task=  Number of CPUs per process (default=2)

For example, a single job with a single task requesting 4 logical CPUs, executed on two physical CPU cores:

» srun --partition=debug --ntasks=1 --cpus-per-task=4 \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:  1-2,65-66

Task Affinity

Task affinity is a mechanism for the user to control how tasks are distributed over the physical CPU cores. The command-line option --hint controls distribution methods for the allocation of logical CPUs:

» srun --hint=help
Application hint options:
    --hint=             Bind tasks according to application hints
        compute_bound   use all cores in each socket
        memory_bound    use only one core in each socket
        [no]multithread [don't] use extra threads with in-core multi-threading
        help            show this help message

The default corresponds to compute_bound, which allocates complete physical cores:

» srun -p debug -n 1 -c 1 --hint=compute_bound \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:      1,65

» srun -p debug -n 1 -c 2 --hint=compute_bound \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:      1-2,65-66

» srun -p debug -n 1 -c 3 --hint=compute_bound \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:      1-3,65-67

Users have the option to use --hint=multithread to explicitly control how execution threads are allocated:

# single task with a single logical CPU
» srun -p debug -n 1 -c 1 --hint=multithread \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:      1

# single task with two logical CPUs
» srun -p debug -n 1 -c 2 --hint=multithread \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:      1,65

# single task with three logical CPUs
» srun -p debug -n 1 -c 3 --hint=multithread \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:      1-2,65

# two tasks with a single logical CPU each
» srun -p debug -n 2 -c 1 --hint=multithread \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:  1
Cpus_allowed_list:  65

# three tasks with a single logical CPU each
» srun -p debug -n 3 -c 1 --hint=multithread \
        -- cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:  65
Cpus_allowed_list:  1
Cpus_allowed_list:  8

Features

Nodes can have features assigned to them by the Slurm administrator. Users can specify which of these features are required by their job using the constraint option.

Print a list of available features with the sinfo command:

» date ; sinfo -o "%20N %f"
Fri Mar  5 08:48:33 CET 2021
NODELIST             AVAIL_FEATURES
lxbk[0553-0723]      amd,epic,7551
lxbk[0724-1033]      intel,xeon,gold6248r
lxbk[0501-0552]      intel,xeon,e52680

Use an argument to the salloc, srun or sbatch commands to request specific features:

Option            Description
-C, --constraint  Multiple constraints may be specified with the AND operator & or the OR operator |. (Further details are described in the corresponding manual pages.)

The following examples request specific CPU types:

# require a specific CPU type
» srun --constraint=intel \
       -- cat /proc/cpuinfo | grep 'model name' | sort | uniq
model name  : Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz

» srun --constraint=amd \
       -- cat /proc/cpuinfo | grep 'model name' | sort | uniq
model name  : AMD EPYC 7551 32-Core Processor

# request nodes with multiple features (AND operator)
» srun --constraint='intel&e52680' \
       -- cat /proc/cpuinfo | grep 'model name' | sort | uniq
model name  : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz

When srun is executed from within salloc or sbatch, the constraint value can only contain a single feature name. None of the other operators are currently supported for job steps.
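
At the job level, however, the operators remain available, and constraints can also be set as a directive in a batch script. A minimal sketch accepting either CPU vendor (the application name is a placeholder):

#!/bin/bash
#SBATCH --constraint="intel|amd"

srun -- ./my_application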

Footnotes

  1. Software bug, Wikipedia
     https://en.wikipedia.org/wiki/Software_bug

  2. CPU socket, Wikipedia
     https://en.wikipedia.org/wiki/CPU_socket

  3. Central processing unit, Wikipedia
     https://en.wikipedia.org/wiki/Central_processing_unit

  4. Multi-core processor, Wikipedia
     https://en.wikipedia.org/wiki/Multi-core_processor

  5. Support for Multi-core/Multi-thread Architectures, SchedMD
     https://slurm.schedmd.com/mc_support.html

  6. Multi-chip module, Wikipedia
     https://en.wikipedia.org/wiki/Multi-chip_module

  7. CPU Management User and Administrator Guide, SchedMD
     https://slurm.schedmd.com/cpu_management.html