Architecture & Technology

Modified

November 14, 2023

Compute Cluster

The idea for cluster computers 1 originated as a cost-effective alternative to traditional High Performance Computing (HPC) 2, with the goal of achieving similar performance by orchestrating low-cost off-the-shelf computer hardware. The architecture depends on the network interconnect, in terms of bandwidth and latency, to enable efficient communication between the nodes as well as between the nodes and the storage. Thus, many compute clusters use network hardware other than Ethernet, for example InfiniBand.

In layman’s terms: “A cluster is a set of interconnected computers”.

Cluster computers are collections of tightly connected computers integrated into a single system by a workload management system. Each compute node runs its own instance of an operating system (OS) and typically provides a limited amount of local storage, which can be used as a temporary cache. The majority of compute clusters are connected to a large distributed storage system, often on the scale of petabytes, that is accessible by all compute nodes as shared storage.

Slurm Architecture

SLURM - Simple Linux Utility for Resource Management - is a widely used cluster resource management system running on 80% of the supercomputers listed in the TOP500 3. It is an actively developed Open Source (GPLv2) project hosted on GitHub 4 with over 180 contributors from academia and industry. It is highly scalable and used on the largest compute clusters, like the Tianhe-2 with 16,000 nodes and more than 3 million cores.

Daemons

Slurm uses an architecture of multiple interconnected service daemons that form the communication infrastructure for managing resources in a cluster of distributed compute nodes. It consists of the following components; a quick way to check that these daemons are running is sketched after the list:

slurmctld – Central controller (typically one per cluster)

Monitors state of resources, manages job queues, allocates resources.

slurmdbd – Database daemon (typically one per enterprise)

Collects accounting information, uploads configuration information (limits, fair-share, etc.) to slurmctld.

slurmd – Compute node daemon (typically one per compute node)

Launches and manages slurmstepd (see below), small and very light-weight, quiescent after launch except for optional accounting. Supports hierarchical communications with configurable fanout.

slurmstepd – Job step shepherd

Launched for each job step, launches user application tasks, manages application I/O, signals, etc.
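
The sketch below shows a quick check of these daemons from a login node; it assumes a systemd-based installation for the slurmd check and that you have access to a compute node:

# verify that the primary slurmctld controller (and its backup, if configured) responds
» scontrol ping
# on a compute node managed by systemd, inspect the state of the local slurmd service
» systemctl status slurmd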

Nodes

Slurm's responsibility is to manage all connected compute nodes as a single shared resource for all users. Nodes can be allocated by requesting specific resources, like the number of CPUs or the required RAM, using the salloc, srun or sbatch commands. The terms compute node and execution node are used interchangeably, since the compute node executes the user applications. The detailed hardware configuration of a node can be queried with the scontrol command. The command substitution $(hostname) in the example below assumes you are logged in to a submit node and want to query information about the local host:

# show information about a submit node
» scontrol show node $(hostname)
NodeName=lxbk0597 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUTot=128 CPULoad=0.01
   AvailableFeatures=amd,epic7551
   ActiveFeatures=amd,epic7551
   Gres=(null)
   NodeAddr=lxbk0597 NodeHostName=lxbk0597 Version=18.08
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020
   RealMemory=257653 AllocMem=0 FreeMem=241736 Sockets=8 Boards=1
   CoreSpecCount=1 CPUSpecList=0-1 MemSpecLimit=2048
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-04-27T07:53:28 SlurmdStartTime=2020-04-27T07:54:34
   CfgTRES=cpu=128,mem=257653M,billing=128
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
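
Resources are allocated with the commands mentioned above. The following is a minimal sketch; the CPU and memory values are illustrative, the partition name debug is taken from the node output above, and job.sh stands for a user-provided batch script:

# request an interactive allocation of 4 CPUs and 8 GB of memory in the debug partition
» salloc --partition=debug --cpus-per-task=4 --mem=8G
# run a command inside the allocation to confirm which node executes it
» srun hostname
# submit the same resource request as a batch job (job.sh is a placeholder script)
» sbatch --partition=debug --cpus-per-task=4 --mem=8G job.sh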

Partitions

Slurm partitions allow administrators to establish different job limits or access controls for various groups of nodes. Nodes may be in more than one partition, so partitions can serve as general-purpose job queues. For example, one may put the same set of nodes into two different partitions, each with different constraints, e.g. time limit, job size, or the groups allowed to use the partition. Jobs are allocated resources within a single partition. The partition configuration is queried with the sinfo command described in the partitions section.
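
As a minimal illustration, the commands below query the debug partition seen in the node output above:

# list the nodes and their state in a given partition
» sinfo --partition=debug
# show the full partition configuration, e.g. time limit and allowed groups
» scontrol show partition debug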

Footnotes

  1. Compute Cluster, Wikipedia
    https://en.wikipedia.org/wiki/Computer_cluster

  2. Supercomputer, Wikipedia
    https://en.wikipedia.org/wiki/Supercomputer

  3. TOP500, ranks the 500 most powerful HPC systems in the world
    https://www.top500.org/lists/top500

  4. Slurm Source Code Repository, GitHub
    https://github.com/SchedMD/slurm