The ALICE Analysis Facility at GSI

Resources and Services

Container
Author

Raffaele Grosso

Published

November 15, 2023

Modified

November 24, 2023

Abstract

This text describes the role of the GSI computing infrastructure within the ALICE experiment at CERN: which computing resources at GSI are integrated in the ALICE Grid and how, which services are required, and which technical solutions are used.

Keywords

ALICE Grid, GSI Analysis Facility

The context

High Energy Physics experiments at big accelerators have to search for specific signals in a “haystack” of background. The whole of the RAW data (the data read out from the detectors’ electronics) is first processed by complex algorithms in order to be interpreted as tracks of primary particles or of their decay products; the actual physics analysis then consists in applying clever cuts to select interesting candidates, for example combinations of two tracks, out of the huge combinatorial background of all track pairs of a collision event (proton-proton, lead-lead, proton-ion, …). For the results to be significant the search has to be carried out over a huge number of collisions, which is why accelerators are challenged to deliver ever higher collision rates and the computing infrastructure is challenged to process ever bigger volumes of data per unit of time.

To efficiently store and process such huge amounts of data, computing resources distributed worldwide have been organized into a hierarchy of sites playing different roles, according to the computing model of the respective experiment, with the aim of finding the best trade-off between centralization and distribution of resources to minimize the costs of data storage and data processing.

Moving from Run 2 to Run 3 (started in summer 2022) the ALICE experiment 1 at the CERN LHC faces the challenge of processing a raw data throughput from the detector of more than 1.1 TB/s, twenty times bigger than in Run 2, due to the increased luminosity – a measure of the collision rate per unit of cross section – provided by the accelerator and to the continuous readout, that is taking data without a trigger upstream of the detector. The peak data rate of 1.1 TB/s is reached in lead-lead collisions at a 50 kHz interaction rate. This is the setting for the big changes in the ALICE computing model for Run 3, driven by the need to compress the data volume read out from the detectors as much as possible synchronously with data taking, that is to perform a first calibration and reconstruction pass online in order to drastically reduce the amount of data to be stored (down to 50 GB/s).

The GSI contribution to ALICE computing in Run 3

GSI is one of the ALICE Grid sites and provides computing and storage resources (several PB and several thousand cores) to the ALICE experiment. The amount of these resources and the way they are integrated into the global framework of the experiment follow from the computing requirements 2 and are reviewed on a yearly basis to account for the current needs and possibilities. As anticipated above, the data throughput from the detector has drastically increased from Run 2 to Run 3, leading to big changes in the ALICE data processing model, in particular moving calibration and reconstruction as early as possible, right after readout and before the data are stored and archived. This has also implied a reorganization of the roles played by the Grid sites, in particular to minimize the movement of data and to maximize the proximity of computing jobs to their input data. In Run 2 the GSI site was a Tier 2 in the ALICE Grid, that is a site at the second level of the Grid hierarchy, where the first level – Tier 1 – comprises a few big computing facilities (GridKa in the case of Germany) and the top level – Tier 0 – consists of the central facilities very close to the detector itself. As a Tier 2 site, GSI resources were used in the context of the Run 2 computing model for central simulations, central analysis and user analysis jobs. In the data processing of Run 3 the roles of the different sites are more specialized and GSI becomes an Analysis Facility (AF), whose focus is on processing a large amount of data in a short time (up to 5 PB in 12 hours, or equivalently about 100 GB/s), providing fast analysis validation and cut tuning.

How GSI resources are integrated in the ALICE Grid

Like the other big High Energy Physics experiments at the LHC, ALICE generates data volumes whose storage and processing have to be distributed over many sites. The ALICE Grid middleware JAliEn 3 presents these worldwide distributed resources under a single interface, transparently to the user, who sees a single namespace for files and a single resource manager for computing jobs. For this to work, each Grid site maintains dedicated services and servers which make the storage and computing resources at the site (Storage Element and Computing Element) centrally accessible.

Local and global storage

The following analogy can help to visualize the software stack: a file system local to a machine represents bits on the storage device as logical entities (files) occupying a place in a logical structure (the file tree). One level above, a parallel file system manages data distributed over several storage devices, transparently presenting a single global namespace to the user. At the next level of abstraction, a Grid file system reads and writes data on Storage Elements distributed worldwide, again presenting the underlying files in a single global namespace to the user, who does not need to know anything about the physical location of the files. This upper level of abstraction is based on centrally maintained databases which map a file’s name and logical position in the global namespace, the Logical File Name (LFN), to the Physical File Names (PFN) of one or more replicas at one or more Storage Elements, and on local services at each SE which “bridge” the local distributed file system to the central services. User queries for LFNs with replicas at GSI translate into remote queries which are served by locally maintained XRootD 4 redirectors and data servers, which in turn query the local parallel file system Lustre 5.

Global namespace for ALICE files
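As a minimal illustration of this mapping, the sketch below resolves an LFN to the PFN of one of its replicas, preferring a replica at the local site. The catalogue entries, Storage Element names and host names are invented for the example; the real mapping lives in the centrally maintained catalogue databases.

```python
# Minimal sketch of the LFN -> PFN mapping idea (hypothetical data, not the real catalogue).
# The central file catalogue maps each Logical File Name to the Physical File Names
# of its replicas at one or more Storage Elements.
catalogue = {
    "/alice/data/2023/LHC23x/000123456/AO2D.root": [
        ("ALICE::GSI::SE",   "root://xrootd-alice.example.gsi.de//lustre/alice/grid/AO2D.root"),
        ("ALICE::CERN::EOS", "root://eos-alice.example.cern.ch//eos/alice/grid/AO2D.root"),
    ],
}

def resolve(lfn: str, preferred_site: str = "GSI") -> str:
    """Return the PFN of a replica for the given LFN, preferring the local site."""
    replicas = catalogue[lfn]
    for se, pfn in replicas:
        if preferred_site in se:
            return pfn          # local replica: avoid a wide-area transfer
    return replicas[0][1]       # otherwise fall back to any replica

print(resolve("/alice/data/2023/LHC23x/000123456/AO2D.root"))
```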

Local and global jobs

With respect to jobs and compute resources a similar analogy holds: the job scheduler in the operating system of a machine allocates resources (RAM and CPU) to many different processes organized in prioritized queues. One level above, a resource manager does the same with jobs submitted by users on the resources of a cluster. And one level above that, the Grid services of the experiment queue, distribute and monitor jobs submitted by users who do not need to know on which CPUs at which site their jobs have run.

ALICE jobs queue

This additional level of abstraction of course requires services to communicate requests and availability between the central ALICE Grid instance and the individual computing sites. It also requires users to submit jobs via a small file describing the job’s needs in a standardized way, for example the input files to be processed, the required packages, single or multi-core execution, the required memory and the expected time to live. The central services interact with servers at the sites called VoBoxes, which in turn spawn ALICE job agents on the local workload manager, in the form of Slurm 6 jobs in our case. Finally, the job agents communicate with the central services, in particular to pull ALICE jobs.

VoBox and job submission
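To give an idea of what such a job description contains, here is a purely schematic sketch assembling one from the needs listed above; the field names are modeled on the AliEn JDL style, but the values, paths and package tag are invented for the example.

```python
# Schematic sketch of a job description of the kind submitted to the ALICE Grid.
# Field names and values are illustrative only; the real syntax is the AliEn JDL.
job = {
    "Executable": "/alice/cern.ch/user/r/ruser/bin/myAnalysis.sh",   # hypothetical path
    "InputFile":  ["LF:/alice/data/2023/LHC23x/000123456/AO2D.root"],
    "Packages":   ["VO_ALICE@O2PDPSuite::example-tag"],              # hypothetical package tag
    "CPUCores":   8,            # single or multi-core execution
    "Memory":     "16GB",       # required memory
    "TTL":        36000,        # expected time to live, in seconds
}

def render_jdl(job: dict) -> str:
    """Render the dictionary as a JDL-like text block."""
    def fmt(v):
        if isinstance(v, list):
            return "{" + ", ".join(f'"{x}"' for x in v) + "}"
        return f'"{v}"' if isinstance(v, str) else str(v)
    return "\n".join(f"{k} = {fmt(v)};" for k, v in job.items())

print(render_jdl(job))
```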

The detailed workflow of how AliEn jobs are run on the worker nodes is described in the second part of this presentation 7. There one can see that the JobAgent is actually launched by a JobRunner, to account for multi-core execution. The JobAgent in turn launches a JobWrapper, which finally executes the actual AliEn job. The main duty of the JobAgent is to ask the Task Queue for a matching job and to launch a JobWrapper, in a container if possible, according to the “payload” (job ID, JDL and job token) received from the central services. If you are not yet lost in the many layers of abstraction presented so far, it is worth mentioning that the startup script executed on the worker node is encrypted and embedded in the slurm-submit script received by the VoBox and submitted to Slurm. At the VoBox this script is modified before being sent, in order to adapt it to the network configuration specific to our cluster by unsetting the proxy-related environment variables.
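The sketch below only illustrates the kind of local adjustment mentioned at the end of the previous paragraph, under the assumption that it boils down to injecting unset statements into the submit script before it goes to Slurm; the variable list and the example script are placeholders, not the actual VoBox code.

```python
# Illustrative sketch (not the actual VoBox code): prepend "unset" statements for
# proxy-related variables to the slurm-submit script before it is handed to Slurm,
# so that the embedded job-agent startup follows our cluster's network configuration.
PROXY_VARS = ["http_proxy", "https_proxy", "no_proxy",
              "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"]

def adapt_submit_script(script: str) -> str:
    """Insert 'unset' lines for the proxy variables right after the shebang line."""
    lines = script.splitlines(keepends=True)
    unset_block = [f"unset {v}\n" for v in PROXY_VARS]
    return "".join(lines[:1] + unset_block + lines[1:])

# Tiny stand-in for the slurm-submit script received from the central services.
example = "#!/bin/bash\n# encrypted startup script would follow here\n"
print(adapt_submit_script(example))
```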

Some technical challenges and our solutions

The aim of the middleware is to link local resources and central services transparently for the users and painlessly for the local and central admins. Nevertheless a few issues had to be solved, mostly related to the peculiarity of providing resources on shared, non-dedicated systems (our Lustre storage instance and our cluster Virgo). Hereafter we mention a couple of these issues and describe how we solved them.

The VoBox is not in the cluster

An assumption of the experiment’s middleware is that the VoBox is internal to the cluster, in particular that it runs on a machine where the Slurm client is installed. The VoBox, however, runs services which must be reachable from the Internet, which is not the case for machines inside the GSI cluster’s network. Therefore the VoBoxes of the GSI site run on machines outside of the cluster’s network, on which the Slurm client is not installed. The Slurm commands issued by the VoBox (squeue, sbatch, …) must somehow be forwarded to a node which is able to query the Slurm servers. Our solution is to “wrap” these commands in corresponding executables which prepend them with an ssh to a submit node of the cluster. These executables are located in /usr/local/bin so that they are found first in the $PATH.
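A minimal sketch of such a wrapper is shown below, written in Python for illustration although a small shell script serves equally well; the submit node’s hostname is a placeholder.

```python
#!/usr/bin/env python3
# Minimal sketch of a Slurm command wrapper on the VoBox (illustrative only).
# Installed under the name of a Slurm command, e.g. /usr/local/bin/sbatch, it forwards
# the call over ssh to a submit node inside the cluster, where the real client runs.
import os
import shlex
import sys

SUBMIT_NODE = "virgo-submit.example.gsi.de"   # placeholder hostname of a cluster submit node
command = os.path.basename(sys.argv[0])       # sbatch, squeue, scancel, ... (name we were called by)
args = " ".join(shlex.quote(a) for a in sys.argv[1:])

# Replace this process with ssh, passing the Slurm command and its arguments through.
os.execvp("ssh", ["ssh", SUBMIT_NODE, f"{command} {args}"])
```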

Remaining space on Lustre

The space for the ALICE SE at GSI is made available on the local shared Lustre storage and reserved by means of its quota administration tools (lfs setquota). The middleware, however, assumes that it is querying the space of a dedicated storage. To overcome this problem an XRootD plug-in was implemented and installed which overrides the XRootD implementation of space usage queries with queries to the Lustre quota API. In this way the available space on Lustre is reported correctly, accounting for the quota reserved to ALICE.
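The plug-in itself talks to the Lustre quota API; the sketch below only illustrates the underlying idea using the equivalent command-line query (lfs quota), with the group name and mount point as placeholders.

```python
import subprocess

def lustre_group_quota_kbytes(group: str = "alice", mountpoint: str = "/lustre"):
    """Illustrative only: query used space and hard limit for a group via `lfs quota`.

    The Storage Element itself uses an XRootD plug-in calling the Lustre quota API;
    this sketch merely shows the kind of information that plug-in retrieves.
    """
    out = subprocess.run(["lfs", "quota", "-q", "-g", group, mountpoint],
                         capture_output=True, text=True, check=True).stdout.split()
    # with -q the output is a single data line:
    # <filesystem> <kbytes used> <soft limit> <hard limit> <grace> <files> ...
    used_kb = int(out[1].rstrip("*"))
    hard_limit_kb = int(out[3].rstrip("*"))
    return used_kb, hard_limit_kb

used, limit = lustre_group_quota_kbytes()
print(f"space still available for ALICE: {(limit - used) / 2**20:.1f} GiB")
```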

The worker nodes are firewalled

ALICE jobs running on the worker nodes need to access the Internet to communicate with the central services and to fetch files from remote storage elements. These queries, however, would be blocked by the firewall, so this outgoing traffic has to be redirected around it. Our solution makes use of policy-based routing and NAT, as described in one of the next blog contributions.

Optimize file access at the local Storage Element

Access to the local storage resources by jobs on the Grid is provided by local services, in particular by the XRootD redirectors and data servers. When a read or write request from a client for a file located at the Storage Element at GSI reaches the redirector, the redirector queries the data servers, gets a reply from at least one of them (if the file is available there) and forwards that data server’s address to the client. The client then communicates directly with the data server, which gets the file from the underlying storage (Lustre) and ships it to the client. In this data flow there is room for improvement when the client and the SE storing the file are at the same site. This important optimization has been implemented in an XRootD plug-in, as described in another blog contribution.
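Purely as an illustration of the general idea (the actual mechanism is described in that contribution), the sketch below prefers a direct read from Lustre when the client runs on site and falls back to access via the data server otherwise; the domain, path prefix and redirector host are placeholders.

```python
import os

LOCAL_DOMAIN = ".gsi.de"            # placeholder: how an on-site client could be recognized
LUSTRE_PREFIX = "/lustre/alice/"    # placeholder: local Lustre path of the SE namespace

def choose_access(client_host: str, pfn_path: str) -> str:
    """Illustrative only: prefer direct POSIX access to Lustre for on-site clients,
    otherwise go through the XRootD data server."""
    local_path = LUSTRE_PREFIX + pfn_path
    if client_host.endswith(LOCAL_DOMAIN) and os.path.exists(local_path):
        return local_path                                     # read directly from Lustre
    return "root://xrootd-alice.example.gsi.de/" + pfn_path   # placeholder redirector URL

print(choose_access("lxbk0123.gsi.de", "data/2023/LHC23x/AO2D.root"))
```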