Architecture & Technology

Modified

November 14, 2023

Compute Cluster

The idea of cluster computers 1 originated as a cost-effective alternative to traditional High Performance Computing (HPC) 2, with the goal of achieving similar performance by orchestrating low-cost, off-the-shelf computer hardware. The architecture depends on the network interconnect, in terms of bandwidth and latency, to enable efficient communication among nodes and between nodes and the storage. Thus, many compute clusters use network hardware other than Ethernet, for example InfiniBand.

In layman’s terms: “A cluster is a set of interconnected computers”.

Cluster computers are collections of tightly connected computers integrated into a single system by a workload management system. Each compute node runs its own instance of an operating system (OS) and typically provides a limited capacity of local storage, which can be used as a temporary cache. The majority of compute clusters are connected to a large distributed storage system, e.g. on the scale of petabytes, that is accessible from all compute nodes as shared storage.

Slurm Architecture

SLURM - Simple Linux Utility for Resource Management - is a widely used cluster resource management system running on 80% of the supercomputers listed in the TOP500 3. It is an actively developed Open Source (GPLv2) project hosted on GitHub 4 with over 180 contributors from academia and industry. It is highly scalable and used on the largest compute clusters, like the Tianhe-2 with 16,000 nodes and more than 3 million cores.

Daemons

Slurm uses an architecture of multiple interconnected service daemons that form the communication infrastructure for managing resources in a cluster of distributed compute nodes. It consists of the following components:

slurmctld – Central controller (typically one per cluster)

Monitors state of resources, manages job queues, allocates resources.

slurmdbd – Database daemon (typically one per enterprise)

Collects accounting information, uploads configuration information (limits, fair-share, etc.) to slurmctld.

slurmd – Compute node daemon (typically one per compute node)

Launches and manages slurmstepd (see below), small and very light-weight, quiescent after launch except for optional accounting. Supports hierarchical communications with configurable fanout.

slurmstepd – Job step shepherd

Launched for each job step, launches user application tasks, manages application I/O, signals, etc.
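
As a regular user you interact with these daemons only indirectly through the client commands. A minimal sketch for checking that the central controller and the accounting back-end respond, assuming the Slurm client commands are available on your submit node:

# check that the central controller (slurmctld) responds
» scontrol ping

# check whether accounting storage (slurmdbd) is configured
» scontrol show config | grep -i accountingstorage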

Nodes

Slurm's responsibility is to manage all connected compute nodes as a single shared resource for all users. Nodes can be allocated by requesting specific resources, like the number of CPUs or the required RAM, using the salloc, srun or sbatch commands. The terms compute node and execution node are used interchangeably, since the compute node executes the user applications. The detailed hardware configuration of a node can be queried with the scontrol command. The example below uses $(hostname), which assumes you are logged in on a submit node and want to query information about the local host:

# show information about a submit node
» scontrol show node $(hostname)
NodeName=lxbk0597 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUTot=128 CPULoad=0.01
   AvailableFeatures=amd,epic7551
   ActiveFeatures=amd,epic7551
   Gres=(null)
   NodeAddr=lxbk0597 NodeHostName=lxbk0597 Version=18.08
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020
   RealMemory=257653 AllocMem=0 FreeMem=241736 Sockets=8 Boards=1
   CoreSpecCount=1 CPUSpecList=0-1 MemSpecLimit=2048
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-04-27T07:53:28 SlurmdStartTime=2020-04-27T07:54:34
   CfgTRES=cpu=128,mem=257653M,billing=128
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
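
The resources reported above (CPUs, memory, features) correspond to what can be requested at allocation time. A brief sketch of an interactive allocation follows; the requested values are illustrative, not a site recommendation:

# request an interactive allocation with 4 CPUs and 8 GB of memory for one hour
» salloc --cpus-per-task=4 --mem=8G --time=01:00:00

# launch a task on the allocated resources
» srun hostname

# leave the allocation shell to release the resources
» exit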

Partitions

Slurm partitions allow administrators to establish different job limits or access controls for various groups of nodes. Nodes may be in more than one partition, making partitions serve as general-purpose queues. For example, one may put the same set of nodes into two different partitions, each with different constraints, e.g. time limits, job sizes, groups allowed to use the partition, and so on. Jobs are allocated resources within a single partition. The partition configuration can be queried with the sinfo command, described in the partitions section.
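
As a sketch of how such a query might look (the partition name debug is taken from the node output above; the partitions available on a cluster may differ):

# summary of all partitions with node counts and states
» sinfo --summarize

# list the nodes of a single partition in long format
» sinfo --partition=debug --Node --long

# show the full configuration of a partition
» scontrol show partition debug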

Linux Containers

It is very difficult to fit all application dependencies for all possible use-cases of a compute cluster into a single environment. Linux container technology 5 allows the infrastructure provider to overcome this limitation. Containers are in many ways the next logical progression from virtual machines and implement operating-system-level virtualization 6 to encapsulate a software environment, or so-called software stack. They enable multiple users/groups to build and execute their own instance of an application environment.

Containers also allow better decoupling of user application software from the host operating system. This gives cluster administrators more freedom to migrate the underlying platform running on the hardware. This is not only relevant for security patch management, but even more important for supporting a broad scope of hardware, including accelerators like GPUs, whose software drivers are bound to specific Linux kernel versions.

In the context of containers, a custom environment to execute a program is called an application (run-time) environment. These application environments are executed by a container run-time engine (CRE) like Apptainer 7. Typically, CREs load the application environment from a container image, which is a single binary file storing the file-system tree including all application dependencies.
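
To make this concrete, a hedged sketch of how an application environment stored in an image file might be used with Apptainer (the image path is illustrative):

# open an interactive shell inside the application environment
» apptainer shell /path/to/environment.sif

# execute a single command inside the application environment
» apptainer exec /path/to/environment.sif cat /etc/os-release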

Limitations

Linux containers have limitations concerning portability and abstraction:

  1. Containers are built for a specific machine architecture in binary format. For instance, a container built for Intel x86_64 will not run on a different platform.
  2. Containers rely on the Linux kernel Application Binary Interface (ABI), which is not necessarily uniform across all Linux distributions. In particular, older kernels may not have all required capabilities.
  3. Containers need to support the hardware used by applications. This includes network interconnects like InfiniBand and hardware accelerator devices like GPUs (see the sketch after this list).
  4. Multiple different formats for container images are available. Some of them are interchangeable or have means to convert between different formats.
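
As an example for the third point, Apptainer can bind the host's NVIDIA driver stack into a container at run-time; the image name below is illustrative and this assumes the job runs on a node with NVIDIA GPUs:

# make the host GPUs and NVIDIA libraries visible inside the container
» apptainer exec --nv image.sif nvidia-smi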

The Virtual Application Environments (VAEs) provided by the IT department are built and tested specifically for the Virgo cluster and relieve users of the need to deal with hardware support and optimization.

Container Images

Basically all HPC container run-time engines are in some way compatible with Docker 8. Container images built with Docker follow the container format specification of the Open Container Initiative (OCI) 9, which makes them run-time agnostic and, to a certain extent, future-proof.

However, using a standardized container image format does not necessarily imply independence from a run-time infrastructure, as described in the previous section about container limitations. Apptainer uses SIF (Singularity Image Format) as its container image format to improve support for scientific computing applications on HPC infrastructure. If you are interested in the relationship and compatibility with Docker, read the Support for Docker and OCI 10 section in the Apptainer User Guide.
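
As a sketch of this compatibility, a Docker/OCI image can be converted into a SIF file and inspected like this (the image reference is illustrative):

# build a SIF image directly from a Docker/OCI image reference
» apptainer build python.sif docker://python:3.12-slim

# show the metadata of the resulting image
» apptainer inspect python.sif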

Footnotes

  1. Compute Cluster, Wikipedia
    https://en.wikipedia.org/wiki/Computer_cluster↩︎

  2. Supercomputer, Wikipedia
    https://en.wikipedia.org/wiki/Supercomputer↩︎

  3. TOP500, ranks the 500 most powerful HPC systems in the world
    https://www.top500.org/lists/top500↩︎

  4. Slurm Source Code Repository, GitHub
    https://github.com/SchedMD/slurm↩︎

  5. List of Linux Containers, Wikipedia
    https://en.wikipedia.org/wiki/List_of_Linux_containers↩︎

  6. OS Level Virtualization, Wikipedia
    https://en.wikipedia.org/wiki/OS-level_virtualization↩︎

  7. Apptainer Project, Linux Foundation
    https://apptainer.org/↩︎

  8. Docker Software, Wikipedia
    https://en.wikipedia.org/wiki/Docker_(software)↩︎

  9. Open Container Initiative
    https://opencontainers.org↩︎

  10. Support for Docker and OCI Containers, Apptainer User Guide
    https://apptainer.org/docs/user/latest/docker_and_oci.html↩︎