Resource Allocation

Modified: November 1, 2024

Abstract

The following sections describe how users request resources like CPUs, memory, or GPUs to be allocated for a compute job. Furthermore, the differences between interactive jobs and batch jobs are explained.

Allocations

Users request the allocation of computing resources on behalf of their associated accounts using the salloc, srun or sbatch commands:

Command  Interactive  Blocking  Description
salloc   yes          yes       Allocates resources and launches a shell.
srun     yes          yes       Allocates resources and starts an application.
sbatch   no           no        Queues an application for later execution.

A resource allocation specifies a set of resources, e.g. nodes, CPUs, RAM, etc., possibly with some set of constraints, e.g. number of processors per node, maximum runtime and so on. All three commands accept the same set of parameters for resource allocation.
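For example, a typical allocation request combining several such options might look like the following (a sketch; the partition name and the specific limits are placeholder assumptions, not site defaults):

# request 1 node, 4 tasks with 2 CPUs each, 8 GB of memory, for at most one hour
» salloc --partition=debug --nodes=1 --ntasks=4 --cpus-per-task=2 --mem=8G --time=01:00:00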

A significant difference is that salloc and srun are interactive & blocking. This means that both are linked to your terminal session, hence bound to your connection to the submit node. Output is directed to your interactive shell running in the terminal session. Losing the connection to the submit node might kill the job.

The sbatch command, in contrast, transfers complete control to the cluster controller and allows you to disconnect from the submit node.

Interactive

The following example uses salloc to request a set of default resources from one of the partitions. The command blocks until resources are allocated:

# start an interactive command interpreter
» salloc --partition=debug
salloc: Granted job allocation 2964352

salloc launches an interactive shell after resources have been granted by the cluster controller:

# execute an arbitrary command
» cat /proc/cpuinfo | grep 'model name' | sort | uniq
model name      : AMD EPYC 7551 32-Core Processor

# investigate the job configuration
» scontrol show job $SLURM_JOB_ID | grep Time
   RunTime=00:00:51 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2020-08-27T07:23:14 EligibleTime=2020-08-27T07:23:14
   AccrueTime=Unknown
   StartTime=2020-08-27T07:23:14 EndTime=2020-08-27T07:28:14 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0

Use the exit command to stop the interactive shell and release all allocated resources:

» exit
exit
salloc: Relinquishing job allocation 2964352

srun within a job allocation launches a command in parallel across all allocated nodes. The execution environment is inherited by all launched processes:

» salloc --partition=debug --nodes=2 --chdir=/tmp -- bash
salloc: Granted job allocation 2964433
# run a command on the node hosting the interactive shell
» hostname
lxbk0595
# run a command on all allocated nodes (in parallel)
» srun hostname
lxbk0595
lxbk0596
# run another command
» srun uptime
 07:37:04 up 56 days, 14:26, 27 users,  load average: 0.28, 0.18, 0.15
 07:37:04 up 56 days, 14:26,  4 users,  load average: 0.03, 0.07, 0.13
# release the resources
» exit
exit
salloc: Relinquishing job allocation 2964433

Each invocation of srun within a job is known as a step. A (compute) job consists of one or more steps, each consisting of one or more tasks, each using one or more processors.
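The steps of a job can later be inspected in the accounting database, for example with sacct (a sketch, assuming job accounting is enabled on the cluster; the job ID is the one from the allocation above):

# list the job and its steps (one step per srun invocation)
» sacct --jobs 2964433 --format=JobID,JobName,State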

Real Time

Using srun outside of a job allocation from salloc requests the resources specified by the command options and waits until they have been allocated. As soon as the resources are available, it automatically launches the specified application.

» srun --partition=debug --nodes=5 --chdir=/tmp -- hostname
lxbk0596
lxbk0597
lxbk0599
lxbk0598
lxbk0600

Once the specified application has finished, the resources are relinquished automatically.

The command above is basically a short notation for:

» salloc --partition=debug --nodes=5 \
        srun --chdir=/tmp -- hostname
salloc: Granted job allocation 138
lxbk0596
lxbk0599
lxbk0597
lxbk0598
lxbk0600
salloc: Relinquishing job allocation 138

Batch Jobs

sbatch is used to submit a compute job to the scheduler queue for later execution. The command exits immediately after the controller has assigned a unique JOBID and the job has been queued by the scheduler. Batch jobs wait in the queue of pending jobs until resources become available.

The compute job is copied to a compute node as soon as resources have been granted by the scheduler. The job's application is typically launched by a batch script, with the resource allocation and resource constraints specified by meta-commands.

A compute job submitted with sbatch has at least one implicit job step, the start of the executable provided as argument to the command.

The following is a very simple example of a batch job executing a small shell script:

# simple script executing a couple of commands
cat > $LUSTRE_HOME/sleep.sh <<EOF
#!/usr/bin/env bash
hostname ; uptime ; sleep 180 ; uname -a
EOF
# submit the script above, with the job name "sleep"
sbatch --output='%j.log' --chdir=$LUSTRE_HOME \
       --job-name=sleep -- $LUSTRE_HOME/sleep.sh
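
Alternatively, the resource options can be embedded in the script itself as #SBATCH meta-commands (a sketch of an equivalent script; the literal path matches the $LUSTRE_HOME used above):

#!/usr/bin/env bash
#SBATCH --job-name=sleep
#SBATCH --output=%j.log
#SBATCH --chdir=/lustre/hpc/vpenso
hostname ; uptime ; sleep 180 ; uname -a

Such a script can then be submitted without additional command-line options, e.g. sbatch $LUSTRE_HOME/sleep.sh.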

Check the state of the job using the squeue command:

» squeue --format='%6A %8T %8N %9L %o' --name=sleep
JOBID  STATE    NODELIST TIME_LEFT COMMAND
18     RUNNING  lxbk0596 1:57:40   /lustre/hpc/vpenso/sleep.sh

# read the stdout of the job
» cat $LUSTRE_HOME/$(squeue -ho %A -n sleep).log
lxbk0596
 10:08:37 up 1 day,  2:05,  0 users,  load average: 0.00, 0.01, 0.05

Attach to Job

From a submit node you may attach an interactive debugging shell to your running job with the following command:

srun --jobid <running-jobid> [-w <hostname>] -O --pty bash

Option                     Description
--jobid=<jobid>            Initiate a job step under an already allocated job with the given ID.
-O, --overcommit           The instantiated job step and task for the debugging shell do not demand additional resources from the existing allocation (which is usually already used up).
--pty                      Execute task zero in pseudo terminal mode. Implicitly sets --unbuffered. Implicitly sets --error and --output to /dev/null for all tasks except task zero, which may cause those tasks to exit immediately (e.g. shells will typically exit immediately in that situation). This option applies to step allocations.
-w, --nodelist=<hostname>  Request a specific host. Useful if the job allocation spans multiple nodes.
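For example, to open a debugging shell inside a hypothetical running job 1234567 on node lxbk0596 (both the job ID and the hostname are placeholders):

srun --jobid=1234567 -w lxbk0596 -O --pty bash
# inspect the processes of the job on this node, then leave the shell
ps -u $USER -f
exit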

Recurring Jobs

scrontab schedules recurring jobs on the cluster. It provides a cluster-based equivalent to crontab (short for “cron table”), a system that specifies scheduled tasks to be run by the cron daemon on Unix-like systems. scrontab is used to configure Slurm to execute commands at specified intervals, allowing users to automate repetitive tasks.

All users can have their own scrontab file, allowing for personalized job scheduling without interfering with other users. Users can define jobs directly in the scrontab file, specifying the command to run, the schedule, and any Slurm options (like resource requests).

Format

The scrontab configuration format works similarly to the traditional cron format, allowing users to specify when and how often jobs should be executed. The configuration can contain several crontab entries (jobs).

# create a simple example for scrontab
>>> cat > sleep.scrontab <<EOF
#SCRON --time=00:02:00
#SCRON --job-name=sleep-scrontab
#SCRON --chdir=/lustre/hpc/vpenso
#SCRON --output=sleep-scrontab-%j.log
#SCRON --open-mode=append
*/10 * * * * date && sleep 30
EOF

# install a new scrontab from a file
>>> scrontab sleep.scrontab

# check the queue
>>> squeue --me -O Jobid,EligibleTime,Name,State
JOBID               ELIGIBLE_TIME       NAME                STATE               
14938318            2024-10-31T10:20:00 sleep-scrontab      PENDING  

Time Fields

The first five fields specify the schedule for the job, and they represent from left to right:

Field                    Description
Minute (0-59)            The minute of the hour when the job should be scheduled.
Hour (0-23)              The hour of the day when the job should be scheduled.
Day of the Month (1-31)  The specific day of the month when the job should run.
Month (1-12)             The month when the job should run.
Day of the Week (0-7)    The day of the week when the job should run (0 and 7 both represent Sunday).

Special characters are used to define more complex schedules:

Character     Description
Asterisk (*)  Represents “every” unit of time. For example, an asterisk in the minute field means the job will run every minute.
Comma (,)     Used to specify multiple values. For example, 1,15 in the minute field means the job will run at the 1st and 15th minute of the hour.
Dash (-)      Specifies a range of values. For example, 1-5 in the day of the week field means the job will run from Monday to Friday.
Slash (/)     Specifies increments. For example, */5 in the minute field means the job will run every 5 minutes.
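
A few illustrative schedules combining these fields and characters (examples only, not taken from the configuration above):

# every 15 minutes
*/15 * * * *
# at 06:30 on weekdays (Monday to Friday)
30 6 * * 1-5
# at midnight on the 1st and 15th day of each month
0 0 1,15 * *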

Some users may find it convenient to use a web-based crontab generator to prepare a custom configuration.

Shortcuts

Shortcuts to specify some common time intervals:

Shortcut   Description
@annually  Job will become eligible at 00:00 Jan 01 each year.
@monthly   Job will become eligible at 00:00 on the first day of each month.
@weekly    Job will become eligible at 00:00 Sunday of each week.
@daily     Job will become eligible at 00:00 each day.
@hourly    Job will become eligible at the first minute of each hour.

Meta-Commands

Lines starting with #SCRON allow users to set Slurm options for the single following crontab entry. This means each crontab entry needs its own list of #SCRON meta-commands, for example:

#SCRON --job-name=sleep-scrontab
#SCRON --chdir /lustre/hpc/vpenso
@daily path/to/sleep.sh > sleep-$(date +%Y%m%dT%H%M).log

Options include most of those available to the sbatch command (make sure to read the manual pages for more details). In order to write the output of a recurring job into a single file, use the following option:

Option       Description
--open-mode  Appends output to an existing log file (instead of overwriting it).

#SCRON --job-name=sleep-scrontab
#SCRON --chdir /lustre/hpc/vpenso
#SCRON --output=sleep-scrontab-%j.log
#SCRON --open-mode=append
0 8 * * * path/to/sleep.sh

Usage

Users can configure their scrontab in multiple ways:

# modify the configuration with your preferred text editor (1)
EDITOR=vim scrontab -e

# read the configuration from a file (2)
scrontab path/to/file

# print the configuration (3)
scrontab -l

# clear the configuration (4)
scrontab -r

1. Modify the configuration with a text editor using option -e.
2. Apply a configuration by passing a file as argument.
3. Option -l prints the configuration to the terminal.
4. Option -r removes the entire configuration (jobs continue to run, but will no longer recur).

Jobs have the same Job ID for every run (until the next time the configuration is modified).

# list jobs with their next eligible time (1)
squeue --me -O Jobid,EligibleTime,Name,State

# list all recurring runs of a job in the past (2)
sacct --duplicates --jobs $job_id

# skip the next run (3)
scontrol requeue $job_id

# disable a cron job (4)
scancel --cron $job_id

1. List when cron jobs will be eligible for their next execution. Note that jobs are not guaranteed to execute at the preferred time.
2. List all recurring executions of a cron job from the accounting.
3. Skip the next execution of a cron job with scontrol and reschedule it to the upcoming available time.
4. Request to cancel a job submitted by scrontab with scancel. The job in the scrontab will be preceded by the comment #DISABLED.