GPUs

Modified

September 5, 2024

Resources

Nodes	GPU	Memory	Slurm Feature	GRES Type	Notes
lxbk0[717-718]	NVIDIA Tesla V100 ¹	32 GB	`v100`	`tesla`	Not available
lxbk[1080-1129]	AMD Radeon Instinct MI100 ²	32 GB	`mi100`	`cdna`

Compute nodes equipped with GPUs have the required drivers and runtime libraries installed. The table above lists nodes by GPU type and corresponding Slurm feature. Generic Resource (GRES) ³ options for the sbatch and srun commands allow allocation of GPUs for a job:

Option	Description
`--constraint=<list>`	Select the variant of GPU hardware to use, for example `mi100` (see table above).
`--gres=gpu[[:type]:count]`	The `count` is the number of GPUS with a default value of 1. The specified resources will be allocated to the job on each node.`type` is optional and defines the GRES type of the resources.

Compute jobs usually use an option like --mem to specify the amount of real memory required (per node). This affects only RAM not the dedicated memory of the GPU. If necessary set a memory requirement (per allocated GPU) with option --mem-per-gpu. Note that --mem, --mem-per-cpu and --mem-per-gpu options are mutually exclusive.

AMD ROCm

AMD provides ready to use container images for ROCm available on AMD Infinity Hub ⁴ and a corresponding ROCm page on DockerHub ⁵.

GPU Partition

AMD MI100 GPUs are accessible via a dedicated partition called gpu. Use following options allocate a node with eight GPUs:

sbatch --gres=gpu:8 --constraint=mi100 ...

AMD Radeon Instinct GPU hardware uses the ROCm ⁶ (programming-language) runtime environment. A simple test will be based on a Singularity container equipped with a TensorFlow installation.

# prepare the container:
mkdir $LUSTRE_HOME/rocm
cd $LUSTRE_HOME/rocm
apptainer pull docker://rocm/tensorflow:latest

At the end of the pull request, a file named tensorflow_latest.sif should be available. Following python script will be used as an Hello World example:

#!/usr/bin/env ipython

from __future__ import print_function

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Simple hello world using TensorFlow

# Create a Constant op
# The op is added as a node to the default graph.
#
# The value returned by the constructor represents the output
# of the Constant op.
hello = tf.constant('Hello, TensorFlow!')

# Start tf session
sess = tf.Session()

# Run the op
print(sess.run(hello))

Following example shows a batch-script used to allocate a GPU:

>> cat submit_rocm.sh
#!/bin/bash
#SBATCH -J apptainer_rocm_test
#SBATCH --gres=gpu:1
#SBATCH -t 0-00:30
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=8000

apptainer exec \
      --bind /cvmfs \
      --rocm $LUSTRE_HOME/rocm/tensorflow_latest.sif \
      $LUSTRE_HOME/rocm/ts_hworld.py

Submit your ROCm container to the gpu partition on Virgo:

sbatch -p gpu submit_rocm.sh

Submit your ROCm container based on a Slurm features:

sbatch --constraint="mi100" submit_rocm.sh

Nvidia CUDA

This is important

At the moment no Nvidia GPUs are available in the cluster. Please contact the support in case of further questions.

Currently we do not provide a dedicated submit node with support for Nvidia GPUs. User are required to build and execute a custom container supporting the run-time environment for Nvidia. Note that it is mandatory to login to a submit node with access to the container run-time. The CUDA environment installed is based on version 11.4, please refer to the official CUDA documentation ⁷. The official CUDA container images ⁸ are available in three variants:

base includes the CUDA runtime (cudart).
runtime builds on the base image and includes the CUDA math libraries, NCCL, and cuDNN.
devel builds on the runtime images and includes headers and development tools for building CUDA images (These images are particularly useful for multi-stage builds)

The Dockerfile for the images are open-source and licensed under 3-clause BSD.

# grab a CUDA development environment and run it through Apptainer:
mkdir $LUSTRE_HOME/cuda
cd $LUSTRE_HOME/cuda
apptainer pull docker://nvidia/cuda:11.4.3-devel-rockylinux8

At the end of the pull request, a file named cuda_11.4.3-devel-rockylinux8.sif should be available.

# test the CUDA development environment on localhost
cd $HOME/my_src
git clone https://github.com/NVIDIA/cuda-samples.git
apptainer shell $LUSTRE_HOME/cuda/cuda_11.4.3-devel-rockylinux8.sif
cd cuda-samples
make
exit

The binaries will be saved under $HOME/my_src/cuda-samples/bin/x86_64/linux/release. Then just move or copy the compiled files under $LUSTRE_HOME/cuda/samples or any other directory as you saw fit. Create a batch job:

>> cat submit_nvidia.sh
#!/bin/bash
#SBATCH -J apptainer_nv_test
#SBATCH --gres=gpu:1
#SBATCH --reservation=nvidia_gpu
#SBATCH -p long
#SBATCH -t 0-00:30
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=8000

apptainer exec --bind /cvmfs \
      --nv $SINGULARITY_CONTAINERS/nvidia/cuda/cuda_11.4.3-devel-rockylinux8.sif \
      $LUSTRE_HOME/cuda/cuda-samples/bin/x86_64/linux/release/deviceQuery

Submit your CUDA container to Virgo:

sbatch --constraint="v100" submit_nvidia.sh

Footnotes

Nvidia V100 Tensor Core GPU
https://www.nvidia.com/en-us/data-center/v100/↩︎
AMD Radeon Instinct™ MI100 Accelerator
https://www.amd.com/en/products/server-accelerators/instinct-mi100 ↩︎
Slurm Generic Resource (GRES) Scheduling
https://slurm.schedmd.com/gres.html ↩︎
AMD Infinity Hub
https://www.amd.com/en/technologies/infinity-hub ↩︎
AMD ROCm, DockerHub
https://hub.docker.com/u/rocm ↩︎
AMD ROCm™ Information Portal
https://rocmdocs.amd.com/en/latest/↩︎
CUDA Documentation, Nvidia
https://docs.nvidia.com/cuda ↩︎
CUDA Container Images, DockerHub
https://hub.docker.com/r/nvidia/cuda ↩︎

--- title: GPUs date-modified: 2024/09/05 --- ## Resources Nodes | GPU | Memory | Slurm Feature | GRES Type | Notes | ----------------|------------------------------------|--------|-----------------|-----------|-------------------| lxbk0[717-718] | NVIDIA Tesla V100 [^v100] | 32 GB | `v100` | `tesla` | **Not available** | lxbk[1080-1129] | AMD Radeon Instinct MI100 [^mi100] | 32 GB | `mi100` | `cdna` | | Compute nodes equipped with GPUs have the required drivers and runtime libraries installed. The table above lists nodes by GPU type and corresponding Slurm [feature](features). Generic Resource (GRES) [^gres] options for the `sbatch` and `srun` commands allow allocation of GPUs for a job: Option | Description -|--------------------------- `--constraint=<list>` | Select the variant of GPU hardware to use, for example `mi100` (see table above). `--gres=gpu[[:type]:count]` | The `count` is the number of GPUS with a default value of 1. The specified resources will be allocated to the job on each node.`type` is optional and defines the GRES type of the resources. Compute jobs usually use an option like `--mem` to specify the amount of real memory required (per node). This affects only RAM not the dedicated memory of the GPU. If necessary set a memory requirement (per allocated GPU) with option `--mem-per-gpu`. Note that `--mem`, `--mem-per-cpu` and `--mem-per-gpu` options are **mutually exclusive**. ## AMD ROCm AMD provides ready to use container images for ROCm available on AMD Infinity Hub [^N4Apa] and a corresponding ROCm page on DockerHub [^4Ph9X]. [^N4Apa]: AMD Infinity Hub <https://www.amd.com/en/technologies/infinity-hub> [^4Ph9X]: AMD ROCm, DockerHub <https://hub.docker.com/u/rocm> ### GPU Partition AMD MI100 GPUs are accessible via a dedicated [partition][vNYD7] called `gpu`. Use following options allocate a node with eight GPUs: [vNYD7]: cluster/partitions.html ```sh sbatch --gres=gpu:8 --constraint=mi100 ... ``` AMD Radeon Instinct GPU hardware uses the ROCm [^rocm] (programming-language) runtime environment. A simple test will be based on a Singularity container equipped with a _TensorFlow_ installation. ```sh # prepare the container: mkdir $LUSTRE_HOME/rocm cd $LUSTRE_HOME/rocm apptainer pull docker://rocm/tensorflow:latest ``` At the end of the _pull_ request, a file named ``tensorflow_latest.sif`` should be available. Following python script will be used as an _Hello World_ example: ```python #!/usr/bin/env ipython from __future__ import print_function import tensorflow.compat.v1 as tf tf.disable_v2_behavior() # Simple hello world using TensorFlow # Create a Constant op # The op is added as a node to the default graph. # # The value returned by the constructor represents the output # of the Constant op. hello = tf.constant('Hello, TensorFlow!') # Start tf session sess = tf.Session() # Run the op print(sess.run(hello)) ``` Following example shows a batch-script used to allocate a GPU: ```sh >> cat submit_rocm.sh #!/bin/bash #SBATCH -J apptainer_rocm_test #SBATCH --gres=gpu:1 #SBATCH -t 0-00:30 #SBATCH -N 1 #SBATCH -c 1 #SBATCH --mem=8000 apptainer exec \ --bind /cvmfs \ --rocm $LUSTRE_HOME/rocm/tensorflow_latest.sif \ $LUSTRE_HOME/rocm/ts_hworld.py ``` Submit your ROCm container to the `gpu` partition on Virgo: ```sh sbatch -p gpu submit_rocm.sh ``` Submit your ROCm container based on a Slurm _features_: ```sh sbatch --constraint="mi100" submit_rocm.sh ``` ## Nvidia CUDA ::: {.callout-important title="This is important"} At the moment no Nvidia GPUs are available in the cluster. Please contact the support in case of further questions. ::: Currently we do not provide a dedicated submit node with support for Nvidia GPUs. User are required to build and [execute a custom container][ILEaj] supporting the run-time environment for Nvidia. Note that it is mandatory to login to a submit node with access to the [container run-time][aEJt5]. The CUDA environment installed is based on version **11.4**, please refer to the official CUDA documentation [^gCaWS]. The official CUDA container images [^d8X5n] are available in three variants: [ILEaj]: ../containers/execution.html [aEJt5]: ../containers/run-time.html [^gCaWS]: CUDA Documentation, Nvidia <https://docs.nvidia.com/cuda> [^d8X5n]: CUDA Container Images, DockerHub <https://hub.docker.com/r/nvidia/cuda> 1. `base` includes the CUDA runtime (cudart). 2. `runtime` builds on the base image and includes the CUDA math libraries, NCCL, and cuDNN. 3. `devel` builds on the runtime images and includes headers and development tools for building CUDA images (These images are particularly useful for multi-stage builds) The `Dockerfile` for the images are open-source and licensed under _3-clause BSD_. ```sh # grab a CUDA development environment and run it through Apptainer: mkdir $LUSTRE_HOME/cuda cd $LUSTRE_HOME/cuda apptainer pull docker://nvidia/cuda:11.4.3-devel-rockylinux8 ``` At the end of the _pull_ request, a file named ``cuda_11.4.3-devel-rockylinux8.sif`` should be available. ```sh # test the CUDA development environment on localhost cd $HOME/my_src git clone https://github.com/NVIDIA/cuda-samples.git apptainer shell $LUSTRE_HOME/cuda/cuda_11.4.3-devel-rockylinux8.sif cd cuda-samples make exit ``` The binaries will be saved under `$HOME/my_src/cuda-samples/bin/x86_64/linux/release`. Then just move or copy the compiled files under `$LUSTRE_HOME/cuda/samples` or any other directory as you saw fit. Create a [batch job][8EjjE]: [8EjjE]: ../cluster/resource-allocation.html#batch-jobs ```sh >> cat submit_nvidia.sh #!/bin/bash #SBATCH -J apptainer_nv_test #SBATCH --gres=gpu:1 #SBATCH --reservation=nvidia_gpu #SBATCH -p long #SBATCH -t 0-00:30 #SBATCH -N 1 #SBATCH -c 1 #SBATCH --mem=8000 apptainer exec --bind /cvmfs \ --nv $SINGULARITY_CONTAINERS/nvidia/cuda/cuda_11.4.3-devel-rockylinux8.sif \ $LUSTRE_HOME/cuda/cuda-samples/bin/x86_64/linux/release/deviceQuery ``` Submit your CUDA container to Virgo: ```sh sbatch --constraint="v100" submit_nvidia.sh ``` [^rocm]: AMD ROCm™ Information Portal <https://rocmdocs.amd.com/en/latest/> [^gres]: Slurm Generic Resource (GRES) Scheduling <https://slurm.schedmd.com/gres.html> [^v100]: Nvidia V100 Tensor Core GPU <https://www.nvidia.com/en-us/data-center/v100/> [^mi100]: AMD Radeon Instinct™ MI100 Accelerator <https://www.amd.com/en/products/server-accelerators/instinct-mi100>