Plasma PIC with Apptainer Containers

EPOCH and WarpX Simulation on CPU and GPU

Container
Authors

Denis Bertini

Victor Penso

Published

November 14, 2023

Modified

November 15, 2023

Abstract

This article gives an overview of how to run EPOCH and WarpX simulations with Apptainer containers. The examples include best practices to optimise MPI intra-node communication by forcing all MPI processes to share a single container instance. The use of GPUs with containers is covered in a dedicated section.

Keywords

Apptainer, EPOCH, WarpX, GPU

Software Stack

Container images are all built with RockyLinux 8.8 1 in alignment with the host platform of the compute cluster. Each container image provides a self-consistent environment to perform PIC simulations using either EPOCH 2 or WarpX 3. The environments have been tested and guarantee full-speed InfiniBand interconnect communication and Lustre file-system optimised MPI I/O. The containers provide the latest versions of the PIC codes and their dependencies, which include (a quick verification sketch follows the list):

  • Linux RockyLinux 8.8 with Lustre Client v2.15.3 4
  • OpenMPI v5.0.0 5, PMix v4.2.7 6, UCX v1.15.0 7
  • I/O Libraries HDF5 v1.14 8 and ADIOS2 v2.9.1 9
  • EPOCH v4.19.0 10 and WarpX v23.10 11
  • Python v3.6.8 with scientific packages (NumPy, Matplotlib, etc.) and SDFUtils bindings for EPOCH
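
To confirm which toolchain a given image actually ships, the stack can be inspected directly from the host. A minimal sketch, assuming the dev image path listed later in this article and that the usual version flags are available inside the container:

# ...inspect the toolchain shipped in an image (image path is an example)
apptainer exec /cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx.sif \
    bash -lc 'gcc --version; mpirun --version; python3 --version'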

All Apptainer container definition files 12 are available to users in a GitLab repository in the defs/ sub-directory. Use the definition files to build the containers (a build sketch follows the repository reference below) and provide feedback using GitLab issues:

Repository URL
pp-containers https://git.gsi.de/d.bertini/pp-containers
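
A minimal build sketch, assuming a local clone of the repository and a definition file named rlx8_ompi5_ucx.def in defs/ (adjust the file name to the image you want to build):

# ...clone the repository containing the definition files
git clone https://git.gsi.de/d.bertini/pp-containers.git
cd pp-containers

# ...build an image from a definition file (non-setuid mode, no root required)
apptainer build rlx8_ompi5_ucx.sif defs/rlx8_ompi5_ucx.def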

Container Images

Ready-to-use container images are provided to the cluster on CVMFS in the directory /cvmfs/phelix.gsi.de/sifs/. The latest images are stored in the development dev/ sub-directory for validation by the user community. After validation, images are moved to the production prod/ sub-directory, replacing the previous versions, which are preserved for reference in the old/ sub-directory. The directory structure looks similar to the following example:

>>> tree /cvmfs/phelix.gsi.de/sifs
.
├── cpu
│   ├── dev
│   │   ├── rlx8_ompi5_ucx_cma.sif
│   │   ├── rlx8_ompi5_ucx_dask.sif
│   │   ├── rlx8_ompi5_ucx_gcc13_cma.sif
│   │   ├── rlx8_ompi5_ucx_gcc13.sif
│   │   └── rlx8_ompi5_ucx.sif
│   ├── old
│   └── prod
│       ├── rlx8_ompi_ucx_gcc12.sif
│       └── rlx8_ompi_ucx.sif
└── gpu
    ├── dev
    │   └── rlx8_rocm-5.7.1_warpx.sif
    ├── old
    └── prod
        ├── rlx8_rocm-5.4.6.def
        ├── rlx8_rocm-5.4.6.sif
        ├── rlx8_rocm-5.4.6_warpx.def
        ├── rlx8_rocm-5.4.6_warpx.sif
        ├── ubuntu-20.04_rocm-5.4.2_picongpu.def
        ├── ubuntu-20.04_rocm-5.4.2_picongpu.sif
        ├── ubuntu-20.04_rocm-5.4.2_warpx.def
        └── ubuntu-20.04_rocm-5.4.2_warpx.sif

The file names of the container images reflect the base Linux distribution as well as the compiler and library versions. For example, RockyLinux 8 with the default compiler GCC 8.5:

  • /cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx.sif
  • /cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx_cma.sif
  • /cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx_dask.sif

The same images, but built with the more recent GCC 13.2:

  • /cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx_gcc13.sif
  • /cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx_gcc13_cma.sif

Getting Started

In order to use a container image of your choice, log in to the (non-VAE) cluster submit nodes. These nodes grant access to the bare-metal environment, enabling users to launch custom container instances:

# ...login to a (bare-metal) submit node
ssh $USER@virgo.hpc.gsi.de

# ...selecting a RockyLinux 8 container with GCC 8.5.0
export APPTAINER_CONTAINER=/cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx.sif

# ...bind storage directories to the container
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs
export OMPI_MCA_io=romio341

# ...run an application within a container instance
echo "." | srun --export=ALL -- $APPTAINER_CONTAINER epoch3d

MPI Communication

The cluster utilizes Apptainer as its container engine in non-setuid mode 13 by default. Non-setuid mode offers several advantages; in particular, it enables unprivileged users to build their own containers without requiring any configuration from system administrators. However, in non-setuid mode each process spawned by MPI runs in a dedicated user namespace, which prevents intra-node optimizations (for example POSIX shared memory or cross memory attach).
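
The effect can be observed directly, since every containerized process reports its own user namespace. A minimal sketch, assuming the environment variables from the previous section; the namespace IDs printed by ranks running on the same node will differ:

# ...print the user namespace of each containerized MPI rank
srun --export=ALL -- apptainer exec $APPTAINER_CONTAINER readlink /proc/self/ns/user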

A possible workaround is a container-as-a-service approach 14, running a single container instance per node. All MPI processes “join” this instance, sharing one user namespace. The following example illustrates how to adapt a typical EPOCH submit script to use a single Apptainer instance per node. The application launch is built from two scripts. The batch script run-file-cma.sh:

#!/bin/bash

echo "." | srun  --export=ALL  ./epoch3d.sh

A second script, epoch3d.sh, first checks whether a container instance already exists on the node and starts one otherwise. All subsequent MPI processes created on the node then “join” the same container instance before execution:

#!/bin/bash

export APPTAINER_CONTAINER=/cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx_cma.sif
export APPTAINER_CONFIGDIR=/lustre/rz/$USER/apptainer

export OMP_NUM_THREADS=1
export OMPI_MCA_io=romio341

export EPOCH_EXE=epoch3d

# Count container instances named after the job ID; the advisory lock on /tmp
# serialises MPI ranks arriving on the same node.
nb_instances=$(flock -x /tmp apptainer instance list $SLURM_JOB_ID | grep $SLURM_JOB_ID | wc -l)

if [ $nb_instances -eq 1 ]
then
        # An instance already runs on this node: join it
        echo "Instance already created:  join ..." $nb_instances $SLURMD_NODENAME
        apptainer exec instance://$SLURM_JOB_ID $EPOCH_EXE
else
        # First process on this node: start the instance, then run within it
        echo "Instance not created:  create it ..." $nb_instances $SLURMD_NODENAME
        flock -o -x /tmp \
              apptainer instance start -B /lustre -B /cvmfs $APPTAINER_CONTAINER $SLURM_JOB_ID
        apptainer exec instance://$SLURM_JOB_ID $EPOCH_EXE
fi

The scripts above are available in the scripts/cma/ sub-directory of the pp-containers repository. Note that race conditions when querying for a container instance are prevented by acquiring an advisory lock via the Linux flock command.
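
The batch script is submitted like any other Slurm job. A minimal sketch, with hypothetical resource values:

# ...submit the two-script launcher (resource values are examples)
sbatch --nodes=4 --ntasks-per-node=96 --time=02:00:00 run-file-cma.sh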

Interactive Usage

Once your data is produced, you can perform the analysis with the same containerized environment, since it also provides the necessary Python libraries:

>>> apptainer exec /cvmfs/phelix.gsi.de/sifs/cpu/prod/rlx8_ompi_ucx.sif bash -l
Centos system profile loaded ...
Apptainer> python3 --version
Python 3.6.8
Apptainer> python3
Python 3.6.8 (default, Feb 21 2023, 16:57:46) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> import matplotlib
>>> import sdf
...
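
The same analysis can also be run non-interactively by passing a script to the containerized Python interpreter. A minimal sketch, assuming a hypothetical analysis script stored on Lustre:

# ...run an analysis script inside the container (script path is an example)
apptainer exec -B /lustre /cvmfs/phelix.gsi.de/sifs/cpu/prod/rlx8_ompi_ucx.sif \
    python3 /lustre/rz/$USER/analysis/plot_fields.py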

GPU Containers

A dedicated container image provides the latest WarpX 23.11 for use with the cluster gpu partition. The image is built on top of the standard RockyLinux container, adding the latest Radeon Open Compute (ROCm) 15 stack from AMD. The image has been tested on the AMD MI100 GPUs available on the gpu partition. The gpu_scripts/ sub-directory in the pp-containers repository provides example batch scripts to launch a batch job on the GPUs:

#!/bin/bash

# Define container image and working directory
export CONT=/cvmfs/phelix.gsi.de/sifs/gpu/dev/rlx8_rocm-5.7.1_warpx.sif
export WDIR=/lustre/rz/dbertini/gpu/warpx

# Define I/O module for OpenMPI
export OMPI_MCA_io=romio341

# Define Apptainer external filesystem bindings
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs

# Executable with dimensionality and corresponding input deck file
srun --export=ALL -- $CONT warpx_2d  $WDIR/scripts/inputs/warpx_opmd_deck
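
Such a script is submitted to the gpu partition with the GPU resources requested explicitly. A minimal sketch, with hypothetical partition settings, GPU count and script name:

# ...request GPUs and submit (values and script name are examples)
sbatch --partition=gpu --gres=gpu:4 --nodes=1 --ntasks-per-node=4 \
       --time=04:00:00 run-warpx-gpu.sh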

Install Software

Install additional packages and software using the provided container images as a foundation. Build software within a container environment by starting a container instance for an interactive session. The Apptainer> prompt indicates the containerized environment. Load your Bash profile by running the bash command (note that your home directory is automatically mounted in the container):

# ...set an environment variable with the path for the container image to use
export APPTAINER_CONTAINER=/cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx_cma.sif

# ...start the container mounting shared storage (if required)
apptainer exec -B /lustre -B /cvmfs $APPTAINER_CONTAINER bash -l
Apptainer> 

# ...load your Bash profile
Apptainer> bash
[dbertini@lxbk1130 /lustre/rz/dbertini]$

You can check, for example, which versions are available within the environment:

>>> g++ --version
g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

>>> ompi_info | grep ucx
  Configure command line: '--prefix=/usr/local' '--with-pmix=/usr/local' '--with-libevent=/usr' '--with-ompi-pmix-rte' '--with-orte=no' '--disable-oshmem' '--enable-mpirun-prefix-by-default' '--enable-shared' '--without-verbs' '--with-hwloc' '--with-ucx=/usr/local/ucx' '--with-lustre' '--with-slurm' '--enable-mca-no-build=btl-uct' '--with-cma=no'
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.0.0)
                 MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.0.0)

Install additional software to the Lustre shared storage and use it with a batch script similar to the following:

export APPTAINER_CONTAINER=/cvmfs/phelix.gsi.de/sifs/cpu/dev/rlx8_ompi5_ucx_cma.sif
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs
export SLURM_WORKING_DIR=/lustre/rz/dbertini/warpx
export OMPI_MCA_io=romio341

srun --export=ALL \
      -- apptainer exec -B /lustre -B /cvmfs $APPTAINER_CONTAINER \
      $my_executable $options
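
As an illustration, Python packages can be installed to a user-writable directory on Lustre from inside the container and exposed to batch jobs via PYTHONPATH. A minimal sketch, with a hypothetical target directory and package name:

# ...install a Python package to Lustre from inside the container (examples only)
apptainer exec $APPTAINER_CONTAINER \
    python3 -m pip install --target=/lustre/rz/$USER/pylibs openpmd-api

# ...make the installation visible to subsequent batch jobs
export PYTHONPATH=/lustre/rz/$USER/pylibs:$PYTHONPATH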