Hardware Environment

Modified

November 17, 2023

All CPUs on the cluster are a x86-64 compatible ¹ processors architecture like AMD EPYC ² and Intel XEON ³. A typical compute node hosts two multi-core Processors in a dual socket configuration. Nodes may have a variation in core count, available main memory or the local storage device capacity. Use Slurm commands to print an overview of the hardware resources like available CPU cores, and main memory (RAM).

>>> date ; sinfo -o "%20N %4c %6z %6m %5D %f"     
Mon Sep 16 11:46:27 CEST 2024
NODELIST             CPUS S:C:T  MEMORY NODES AVAIL_FEATURES
lxbk[0724-1033]      96   2:24:2 191000 310   intel,xeon,gold6248r
lxbk[1034-1079]      256  2:64:2 103100 46    amd,epyc,7662
lxbk[1080-1129]      96   2:24:2 515000 50    amd,epyc,7413,mi100
lxbk[1130-1265]      256  2:64:2 514000 136   amd,epyc,7713

Further details are described in the resource-allocation section. Nodes typically have a list of associated features to distinguish different CPU types. Some nodes support a GPU indicated in the feature list as well.

InfiniBand Network

The network connecting the cluster with its compute and storage nodes is build with an InfiniBand ⁴ interconnect. InfiniBand is a high-bandwidth and low-latency optimized High Performance Interconnect (HPI). It spreads network traffic across multiple parallel physical links (multipath) to scale the available bandwidth. The network topology is build like a fat-tree ⁵, with a 2 to 1 blocking factor on Virgo (over-subscription). This topology including its fault-tolerant routing allows optimal utilization of redundant network connections with high-throughput.

The application interface to InfiniBand is developed by the OpenFabrics Alliance (OFA) ⁶ and distributed as the open-source software OpenFabrics Enterprise Distribution (OFED). OFED includes a Linux kernel-level driver for channel-oriented RDMA and send/receive operations. This drive is accompanied by the kernel and an user-level Application Programming Interface (API).

In June 2023 the original network fabric based on InfiniBand FDR has been replaced with newer generation InfiniBand HDR equipment. The network is built from a combination of active and passive cables, both copper wires and optical fibers. Compute nodes are connected with so called splitter cables, optical breakout cables with a single port on the switch and to individual cables to edge nodes. Effectively this split the HDR 200Gbps bandwidth of a switch port into two 100Gbps uplinks for compute nodes.

GPU Accelerators

The Virgo cluster hosts a selection of nodes with GPUs for scientific computing. The majority of GPUs are AMD Radeon Instinct ⁷ cards, and a small set of Nvidia Tesla ⁸ for testing. Please refer to the GPUs example section.

GPUs (graphics processing units) in HPC offer a parallel architecture to speed up certain computing processes, especially those related to artificial intelligence (AI) and machine learning (ML) models. For suitable applications GPUs enable processing with higher efficiency and less power consumption, therefore at a lower cost. GPUs performance improves faster then CPUs driven by demand for the video game marked. Furthermore the specialised architectures simplifies scaling of transistors counts in GPUs.

Footnotes

x86-64 (also known as AMD64), Wikipedia
https://en.wikipedia.org/wiki/X86-64 ↩︎
AMD EPYC, Wikipedia
https://en.wikipedia.org/wiki/Epyc ↩︎
Intel XEON, Wikipedia
https://en.wikipedia.org/wiki/Xeon ↩︎
Infiniband, Wikipedia
https://en.wikipedia.org/wiki/InfiniBand ↩︎
Fat Tree, Wikipedia
https://en.wikipedia.org/wiki/Fat_tree ↩︎
OpenFabrics Alliance
https://www.openfabrics.org ↩︎
AMD Radeon Instinct, Wikipedia
https://en.wikipedia.org/wiki/Radeon_Instinct ↩︎
Nvidia Tesla, Wikipedia
https://en.wikipedia.org/wiki/Nvidia_Tesla ↩︎

--- title: Hardware Environment date-modified: 2023/11/17 --- All CPUs on the cluster are a x86-64 compatible [^x86] processors architecture like AMD EPYC [^epyc] and Intel XEON [^xeon]. A typical compute node hosts two **multi-core Processors in a dual socket configuration**. Nodes may have a variation in core count, available main memory or the local storage device capacity. Use [Slurm commands][d948p] to print an overview of the hardware resources like available CPU cores, and main memory (RAM). [^epyc]: AMD EPYC, Wikipedia <https://en.wikipedia.org/wiki/Epyc> [^x86]: x86-64 (also known as AMD64), Wikipedia <https://en.wikipedia.org/wiki/X86-64> [^xeon]: Intel XEON, Wikipedia <https://en.wikipedia.org/wiki/Xeon> [d948p]: ../cluster/environment.html ```sh >>> date ; sinfo -o "%20N %4c %6z %6m %5D %f" Mon Sep 16 11:46:27 CEST 2024 NODELIST CPUS S:C:T MEMORY NODES AVAIL_FEATURES lxbk[0724-1033] 96 2:24:2 191000 310 intel,xeon,gold6248r lxbk[1034-1079] 256 2:64:2 103100 46 amd,epyc,7662 lxbk[1080-1129] 96 2:24:2 515000 50 amd,epyc,7413,mi100 lxbk[1130-1265] 256 2:64:2 514000 136 amd,epyc,7713 ``` Further details are described in the [resource-allocation][6RMrx] section. Nodes typically have a list of associated [features][2G31X] to distinguish different CPU types. Some nodes support a GPU indicated in the feature list as well. [6RMrx]: ../cluster/resource-allocation.html [2G31X]: ../cluster/resource-constrains.html#features ### InfiniBand Network The network connecting the cluster with its compute and storage nodes is build with an **InfiniBand [^wpib] interconnect**. InfiniBand is a high-bandwidth and low-latency optimized High Performance Interconnect (HPI). It spreads network traffic across multiple parallel physical links (multipath) to scale the available bandwidth. The network topology is build like a **fat-tree [^wpft], with a 2 to 1 blocking factor** on Virgo (over-subscription). This topology including its fault-tolerant routing allows optimal utilization of redundant network connections with high-throughput. The application interface to InfiniBand is developed by the OpenFabrics Alliance (OFA) [^ofbs] and distributed as the open-source software **OpenFabrics Enterprise Distribution** (OFED). OFED includes a Linux kernel-level driver for channel-oriented RDMA and send/receive operations. This drive is accompanied by the kernel and an user-level Application Programming Interface (API). [^wpib]: Infiniband, Wikipedia <https://en.wikipedia.org/wiki/InfiniBand> [^wpft]: Fat Tree, Wikipedia <https://en.wikipedia.org/wiki/Fat_tree> [^ofbs]: OpenFabrics Alliance <https://www.openfabrics.org> In June 2023 the original network fabric based on InfiniBand FDR has been replaced with newer generation InfiniBand HDR equipment. The network is built from a combination of active and passive cables, both copper wires and optical fibers. Compute nodes are connected with so called splitter cables, optical breakout cables with a single port on the switch and to individual cables to edge nodes. Effectively this split the HDR 200Gbps bandwidth of a switch port into two 100Gbps uplinks for compute nodes. ### GPU Accelerators The Virgo cluster hosts a selection of nodes with GPUs for scientific computing. The majority of GPUs are **AMD Radeon Instinct** [^insti] cards, and a small set of **Nvidia Tesla** [^tesla] for testing. Please refer to the [GPUs][B2rQ5] example section. GPUs (graphics processing units) in HPC offer a parallel architecture to speed up certain computing processes, especially those related to artificial intelligence (AI) and machine learning (ML) models. For suitable applications GPUs enable processing with higher efficiency and less power consumption, therefore at a lower cost. GPUs performance improves faster then CPUs driven by demand for the video game marked. Furthermore the specialised architectures simplifies scaling of transistors counts in GPUs. [B2rQ5]: ../examples/gpus.html [^tesla]: Nvidia Tesla, Wikipedia <https://en.wikipedia.org/wiki/Nvidia_Tesla> [^insti]: AMD Radeon Instinct, Wikipedia <https://en.wikipedia.org/wiki/Radeon_Instinct>