GPUs

Modified

September 5, 2024

Resources

Nodes             GPU                            Memory  Slurm Feature  GRES Type  Notes
lxbk0[717-718]    NVIDIA Tesla V100 [1]          32 GB   v100           tesla      Not available
lxbk[1080-1129]   AMD Radeon Instinct MI100 [2]  32 GB   mi100          cdna

Compute nodes equipped with GPUs have the required drivers and runtime libraries installed. The table above lists nodes by GPU type and corresponding Slurm feature. Generic Resource (GRES) [3] options for the sbatch and srun commands allow allocation of GPUs for a job:

Option                      Description
--constraint=<list>         Select the variant of GPU hardware to use, for example mi100 (see table above).
--gres=gpu[[:type]:count]   count is the number of GPUs, with a default value of 1; the specified resources are allocated to the job on each node. type is optional and defines the GRES type of the resources.

Compute jobs usually use an option like --mem to specify the amount of real memory required (per node). This affects only RAM, not the dedicated memory of the GPU. If necessary, set a memory requirement per allocated GPU with the --mem-per-gpu option. Note that the --mem, --mem-per-cpu and --mem-per-gpu options are mutually exclusive.
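
As a sketch, these options can be combined in a single submission; the GPU count and memory value below are purely illustrative:

# illustrative request: two MI100 GPUs on one node, 8 GB of RAM per allocated GPU
sbatch --gres=gpu:2 --constraint=mi100 --mem-per-gpu=8000 ...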

AMD ROCm

AMD provides ready-to-use container images for ROCm on AMD Infinity Hub [4] and on the corresponding ROCm page on DockerHub [5].
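
Such images can be pulled directly with Apptainer; the rocm/pytorch image and the latest tag below are only an assumed example, other images from the ROCm DockerHub namespace are pulled the same way:

# pull another ready-to-use ROCm image with Apptainer (image name and tag are illustrative)
apptainer pull docker://rocm/pytorch:latest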

GPU Partition

AMD MI100 GPUs are accessible via a dedicated partition called gpu. Use the following options to allocate a node with eight GPUs:

sbatch --gres=gpu:8 --constraint=mi100 ...
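
To verify an allocation interactively, a short srun job on the gpu partition can list the allocated devices; this sketch assumes the rocm-smi utility from the ROCm driver stack is available on the node:

# illustrative check: list the GPUs visible inside the allocation
srun -p gpu --gres=gpu:8 --constraint=mi100 rocm-smi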

AMD Radeon Instinct GPU hardware uses the ROCm [6] runtime environment. A simple test is based on an Apptainer (Singularity) container equipped with a TensorFlow installation.

# prepare the container:
mkdir $LUSTRE_HOME/rocm
cd $LUSTRE_HOME/rocm
apptainer pull docker://rocm/tensorflow:latest

At the end of the pull, a file named tensorflow_latest.sif should be available. The following Python script will be used as a Hello World example:

#!/usr/bin/env ipython

from __future__ import print_function

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Simple hello world using TensorFlow

# Create a Constant op
# The op is added as a node to the default graph.
#
# The value returned by the constructor represents the output
# of the Constant op.
hello = tf.constant('Hello, TensorFlow!')

# Start tf session
sess = tf.Session()

# Run the op
print(sess.run(hello))
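
Before submitting a batch job, the container can be checked from within an interactive allocation on a GPU node; the sketch below assumes the image ships Python 3 with TensorFlow 2 and uses the tf.config.list_physical_devices API to list the devices TensorFlow can see:

# illustrative check, run inside an interactive GPU allocation (e.g. srun -p gpu --gres=gpu:1 --pty bash)
apptainer exec --rocm $LUSTRE_HOME/rocm/tensorflow_latest.sif \
      python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"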

The following example shows a batch script used to allocate a GPU. Save the Python script above as ts_hworld.py in $LUSTRE_HOME/rocm and make it executable (chmod +x ts_hworld.py), since apptainer exec invokes it directly:

>> cat submit_rocm.sh
#!/bin/bash
#SBATCH -J apptainer_rocm_test
#SBATCH --gres=gpu:1
#SBATCH -t 0-00:30
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=8000

apptainer exec \
      --bind /cvmfs \
      --rocm $LUSTRE_HOME/rocm/tensorflow_latest.sif \
      $LUSTRE_HOME/rocm/ts_hworld.py

Submit your ROCm container to the gpu partition on Virgo:

sbatch -p gpu submit_rocm.sh

Submit your ROCm container based on a Slurm feature:

sbatch --constraint="mi100" submit_rocm.sh
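
As a sketch of how to verify the result: by default Slurm writes the job output to slurm-<jobid>.out in the submission directory, and a successful run of the hello-world script should print the TensorFlow greeting (shown as a bytes literal, since the script runs under Python 3):

# check the job state and, after completion, inspect the default Slurm output file
squeue -u $USER
cat slurm-<jobid>.out    # a successful run should print: b'Hello, TensorFlow!'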

Nvidia CUDA

Important

At the moment no Nvidia GPUs are available in the cluster. Please contact support in case of further questions.

Currently we do not provide a dedicated submit node with support for Nvidia GPUs. Users are required to build and execute a custom container supporting the Nvidia run-time environment. Note that it is mandatory to log in to a submit node with access to the container run-time. The installed CUDA environment is based on version 11.4; please refer to the official CUDA documentation [7]. The official CUDA container images [8] are available in three variants:

  1. base includes the CUDA runtime (cudart).
  2. runtime builds on the base image and includes the CUDA math libraries, NCCL, and cuDNN.
  3. devel builds on the runtime image and includes headers and development tools for building CUDA images (these images are particularly useful for multi-stage builds).

The Dockerfiles for the images are open source and licensed under the 3-clause BSD license.

# grab a CUDA development environment and run it through Apptainer:
mkdir $LUSTRE_HOME/cuda
cd $LUSTRE_HOME/cuda
apptainer pull docker://nvidia/cuda:11.4.3-devel-rockylinux8

At the end of the pull, a file named cuda_11.4.3-devel-rockylinux8.sif should be available.

# test the CUDA development environment on localhost
mkdir -p $HOME/my_src
cd $HOME/my_src
git clone https://github.com/NVIDIA/cuda-samples.git
# enter the container; the following commands run inside the container shell
apptainer shell $LUSTRE_HOME/cuda/cuda_11.4.3-devel-rockylinux8.sif
cd cuda-samples
# note: a samples release matching the CUDA 11.4 toolkit may be required to build
make
exit

The binaries will be saved under $HOME/my_src/cuda-samples/bin/x86_64/linux/release. Then move or copy the compiled files to $LUSTRE_HOME/cuda (or any other directory you see fit), keeping the path consistent with the one used in the batch script below.
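
A minimal sketch of that staging step, assuming the clone and build locations from the previous example:

# stage the compiled samples on Lustre so that the path in the batch script resolves
cp -r $HOME/my_src/cuda-samples $LUSTRE_HOME/cuda/
ls $LUSTRE_HOME/cuda/cuda-samples/bin/x86_64/linux/release/deviceQuery

With the binaries staged, create a batch job: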

>> cat submit_nvidia.sh
#!/bin/bash
#SBATCH -J apptainer_nv_test
#SBATCH --gres=gpu:1
#SBATCH --reservation=nvidia_gpu
#SBATCH -p long
#SBATCH -t 0-00:30
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=8000

apptainer exec --bind /cvmfs \
      --nv $LUSTRE_HOME/cuda/cuda_11.4.3-devel-rockylinux8.sif \
      $LUSTRE_HOME/cuda/cuda-samples/bin/x86_64/linux/release/deviceQuery

Submit your CUDA container to Virgo:

sbatch --constraint="v100" submit_nvidia.sh

Footnotes

  1. Nvidia V100 Tensor Core GPU
    https://www.nvidia.com/en-us/data-center/v100/

  2. AMD Radeon Instinct™ MI100 Accelerator
    https://www.amd.com/en/products/server-accelerators/instinct-mi100

  3. Slurm Generic Resource (GRES) Scheduling
    https://slurm.schedmd.com/gres.html

  4. AMD Infinity Hub
    https://www.amd.com/en/technologies/infinity-hub

  5. AMD ROCm, DockerHub
    https://hub.docker.com/u/rocm

  6. AMD ROCm™ Information Portal
    https://rocmdocs.amd.com/en/latest/

  7. CUDA Documentation, Nvidia
    https://docs.nvidia.com/cuda

  8. CUDA Container Images, DockerHub
    https://hub.docker.com/r/nvidia/cuda