GPUs
Resources
Nodes | GPU | Memory | Slurm Feature | GRES Type | Notes |
---|---|---|---|---|---|
lxbk0[717-718] | NVIDIA Tesla V100 1 | 32 GB | v100 | tesla | Not available |
lxbk[1080-1129] | AMD Radeon Instinct MI100 2 | 32 GB | mi100 | cdna | |
Compute nodes equipped with GPUs have the required drivers and runtime libraries installed. The table above lists nodes by GPU type and corresponding Slurm feature. Generic Resource (GRES) 3 options for the sbatch and srun commands allow allocation of GPUs for a job:
Option | Description |
---|---|
--constraint=<list> | Select the variant of GPU hardware to use, for example mi100 (see table above). |
--gres=gpu[[:type]:count] | The count is the number of GPUs, with a default value of 1. The specified resources will be allocated to the job on each node. The type is optional and defines the GRES type of the resources. |
Compute jobs usually use an option like --mem to specify the amount of real memory required (per node). This affects only RAM, not the dedicated memory of the GPU. If necessary, set a memory requirement (per allocated GPU) with the option --mem-per-gpu. Note that the --mem, --mem-per-cpu and --mem-per-gpu options are mutually exclusive.
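For example, a job that needs host memory proportional to the number of allocated GPUs could request it per GPU instead of per node (values and script name are illustrative):
# one GPU plus 16 GB of host RAM per allocated GPU (used instead of --mem)
sbatch --gres=gpu:1 --mem-per-gpu=16G myjob.sh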
AMD ROCm
AMD provides ready-to-use container images for ROCm, available on the AMD Infinity Hub 4 and on a corresponding ROCm page on DockerHub 5.
GPU Partition
AMD MI100 GPUs are accessible via a dedicated partition called gpu. Use the following options to allocate a node with eight GPUs:
sbatch --gres=gpu:8 --constraint=mi100 ...
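To check which devices a job actually sees, the allocation can be verified interactively; the sketch below assumes the ROCm command-line tool rocm-smi is installed on the compute node:
# run rocm-smi in a one-off job step on a GPU node of the gpu partition
srun -p gpu --constraint=mi100 --gres=gpu:8 rocm-smi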
AMD Radeon Instinct GPU hardware uses the ROCm 6 runtime environment. A simple test is based on an Apptainer (formerly Singularity) container equipped with a TensorFlow installation.
# prepare the container:
mkdir $LUSTRE_HOME/rocm
cd $LUSTRE_HOME/rocm
apptainer pull docker://rocm/tensorflow:latest
When the pull completes, a file named tensorflow_latest.sif should be available. The following Python script will be used as a Hello World example:
#!/usr/bin/env ipython
from __future__ import print_function
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
# Simple hello world using TensorFlow
# Create a Constant op
# The op is added as a node to the default graph.
#
# The value returned by the constructor represents the output
# of the Constant op.
hello = tf.constant('Hello, TensorFlow!')
# Start tf session
sess = tf.Session()
# Run the op
print(sess.run(hello))
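The batch script below runs the script directly inside the container, so it has to be stored under the path referenced there and marked executable; a minimal preparation step might look like this:
# save the script as ts_hworld.py next to the container image and make it executable
chmod +x $LUSTRE_HOME/rocm/ts_hworld.py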
The following example shows a batch script used to allocate a GPU:
>> cat submit_rocm.sh
#!/bin/bash
#SBATCH -J apptainer_rocm_test
#SBATCH --gres=gpu:1
#SBATCH -t 0-00:30
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=8000
apptainer exec \
--bind /cvmfs \
--rocm $LUSTRE_HOME/rocm/tensorflow_latest.sif \
$LUSTRE_HOME/rocm/ts_hworld.py
Submit your ROCm container to the gpu partition on Virgo:
sbatch -p gpu submit_rocm.sh
Alternatively, submit your ROCm container based on a Slurm feature:
sbatch --constraint="mi100" submit_rocm.sh
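Once the job has finished, the greeting printed by the script ends up in the Slurm output file; the default file name slurm-<jobid>.out is assumed here:
# monitor the job and inspect its output after completion
squeue -u $USER
cat slurm-<jobid>.out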
Nvidia CUDA
Currently we do not provide a dedicated submit node with support for Nvidia GPUs. Users are required to build and execute a custom container supporting the runtime environment for Nvidia. Note that it is mandatory to log in to a submit node with access to the container runtime. The CUDA environment installed is based on version 11.4; please refer to the official CUDA documentation 7. The official CUDA container images 8 are available in three variants:
- base includes the CUDA runtime (cudart).
- runtime builds on the base image and includes the CUDA math libraries, NCCL, and cuDNN.
- devel builds on the runtime image and includes headers and development tools for building CUDA images (these images are particularly useful for multi-stage builds).
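The variant is selected through the image tag, which follows a <version>-<variant>-<os> naming pattern on DockerHub; the base and runtime tags below are assumptions derived from that pattern (the devel variant is pulled in the walk-through further down):
# pull the smaller base and runtime variants of the same CUDA release (tags assumed)
apptainer pull docker://nvidia/cuda:11.4.3-base-rockylinux8
apptainer pull docker://nvidia/cuda:11.4.3-runtime-rockylinux8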
The Dockerfiles for the images are open source and licensed under the 3-clause BSD license.
# grab a CUDA development environment and run it through Apptainer:
mkdir $LUSTRE_HOME/cuda
cd $LUSTRE_HOME/cuda
apptainer pull docker://nvidia/cuda:11.4.3-devel-rockylinux8
When the pull completes, a file named cuda_11.4.3-devel-rockylinux8.sif should be available.
# test the CUDA development environment on localhost
mkdir -p $HOME/my_src
cd $HOME/my_src
git clone https://github.com/NVIDIA/cuda-samples.git
# open a shell inside the container and build the samples
apptainer shell $LUSTRE_HOME/cuda/cuda_11.4.3-devel-rockylinux8.sif
cd cuda-samples
make
exit
The binaries will be saved under $HOME/my_src/cuda-samples/bin/x86_64/linux/release. Move or copy the compiled files under $LUSTRE_HOME/cuda or any other directory you see fit (the batch script below expects them under $LUSTRE_HOME/cuda/cuda-samples), for example as sketched below.
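A possible way to stage the compiled samples on the Lustre file system so that compute nodes can reach them (paths assumed to match the batch script below):
# copy the compiled sample binaries to Lustre, preserving the bin/x86_64/linux/release layout
mkdir -p $LUSTRE_HOME/cuda/cuda-samples
cp -r $HOME/my_src/cuda-samples/bin $LUSTRE_HOME/cuda/cuda-samples/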
Create a batch job:
>> cat submit_nvidia.sh
#!/bin/bash
#SBATCH -J apptainer_nv_test
#SBATCH --gres=gpu:1
#SBATCH --reservation=nvidia_gpu
#SBATCH -p long
#SBATCH -t 0-00:30
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=8000
apptainer exec --bind /cvmfs \
--nv $SINGULARITY_CONTAINERS/nvidia/cuda/cuda_11.4.3-devel-rockylinux8.sif \
$LUSTRE_HOME/cuda/cuda-samples/bin/x86_64/linux/release/deviceQuery
Submit your CUDA container to Virgo:
sbatch --constraint="v100" submit_nvidia.sh
Footnotes
1. Nvidia V100 Tensor Core GPU, https://www.nvidia.com/en-us/data-center/v100/
2. AMD Radeon Instinct™ MI100 Accelerator, https://www.amd.com/en/products/server-accelerators/instinct-mi100
3. Slurm Generic Resource (GRES) Scheduling, https://slurm.schedmd.com/gres.html
4. AMD Infinity Hub, https://www.amd.com/en/technologies/infinity-hub
5. AMD ROCm, DockerHub, https://hub.docker.com/u/rocm
6. AMD ROCm™ Information Portal, https://rocmdocs.amd.com/en/latest/
7. CUDA Documentation, Nvidia, https://docs.nvidia.com/cuda
8. CUDA Container Images, DockerHub, https://hub.docker.com/r/nvidia/cuda