Lustre Shared Storage

Modified

November 9, 2023

Shared Storage

Access shared storage in the directory /lustre on all nodes.

Lustre ¹ is a distributed and parallel file system. The GSI Lustre installation manages hundreds of millions of files and petabytes of data, certain common practices in handling I/O on smaller network file systems do not apply to Lustre. In order to understand these boundary conditions for Lustre, users have to understand how data access is established.

Storage in /lustre has no automatic backup!

Risk of data loss is reduced by storing data on RAID arrays. However, this does not guarantee the safety of data and it does not protect against errors of RAID controllers, silent data corruption and other sources of data loss. The namespace metadata is copied in regular intervals. This may help in case of total destruction of a Meta Data Server (MDS). Success of recovering metadata on the production scale has not been shown. Any file operations more recent than the last backup copy of the namespace metadata would be lost.

Why should You care?

It would be much more convenient if it were possible to hide all the particularities of a highly complex system like Lustre from users. However, applications performance and to some degree Lustre’s serviceability depend on basic understanding of Lustre architecture. Simple changes to your application code or runtime environment may have a big impact on reliability and performance of Lustre file systems.

The difficulty in implementing parallel I/O workloads for scientists is based on the difference between data representation from the scientific point of view, e.g. particles, cells, etc. and the conditions of a physical data storage. Furthermore, software layers like Lustre in between applications and physical devices are complex and varying. Nevertheless, simple insides as described in “Best Practices” here will optimize application performance with a moderate amount of work.

How does Lustre I/O work?

With reference to Luster architecture following paragraph describes the basic IO mechanism of Lustre with clients on compute nodes that run jobs and process files by first sending a request to the MDS. Using its local metadata, the MDS determines the location and structural distribution of a file. Second, the MDS redirects the client to the Object Store Server (OSS) that manages multiple Object Store Targets (OST), where OST(s) store the requested file objects. Once a client has established a network connection to the target OSS, the MDS is no longer involved in the I/O process. Third, the actual data access of clients goes directly over the OSS(s) to the OST(s) performing operations like file locking, disk allocation and so forth.

Where are the limits?

Since Lustre enforces coherence in case of multiple clients accessing the same file at the same time, a distributed file lock management is required to guarantee consistency for all clients. A Lustre system is limited in the number of operations it can do per second. Therefore, it is strongly recommended to keep the amount of file-open and file-lock operations in parallel as small as possible to reduce contention.

Architecture

Communication may refer to certain Lustre machine types without giving explanations, so here they are:

The Metadata Server (MDS) stores namespace metadata such as file and directory names or access permissions. This single service directs each file request to a corresponding server storing data. Once the file is opened, the MDS is not longer involved.
The Object Storage Servers (OSS) typically manage several Object Storage Targets (OST) by controlling I/O accesses and handling network requests. OSTs are storage devices that store your files and consist of physical disks in a RAID configuration. File data is stored in one or more objects, where each object is distributed to a different OST. The capacity of a Lustre file system is the sum of the capacities provided by all OSTs.
Clients normally used as compute nodes in a cluster access file data. Lustre presents a single namespace to users, visible like a usual network file system via a mounted path in the local directory tree, which is standard POSIX compliant. It allows concurrent and coherent read and write access to the distributed file system.
The Configuration Management Server (MGS) is the central point of contact providing configuration information about Lustre file systems and the entry point for Lustre communication, e.g. for a client intending to mount a Lustre file system. The MGS is usually co-located to the MDT on one machine.

Best Practices

Lustre is a system shared among all users of the compute cluster:

Due to the architecture of Lustre, users have a significant impact on the overall performance depending on their methods of working with data stored on the file system.
Optimizing the I/O performance for each application utilizing distributed storage will decrease the overall load on Lustre and hence improve the user experience for everyone. Following are some general advices for I/O activities on Lustre.
Keep in mind that misused access on Lustre multiplies with the number of jobs submitted to the compute cluster. Many of the topics discussed further down will unfold with severity depending on the scale of cluster applications.

Avoid flooding the MDS

Commands that access file metadata like ls, find, du, or df can flood the MDS with huge number of requests, especially if they descend into a deep directory tree structure with many files.

Metadata information such as ownership or permissions are stored in the MDS, whereas a file size is only available from the respective OST. For example, ls -lR issues a request to the MDS and to an OST for each file or directory. Similarly, recursive scanning of directory trees with find is very expensive and takes a long time to complete.
Avoid as far as possible searching for input/output files in jobs. If you need to check for the existence of a file, for instance with ls, omit unnecessary command options reading irrelevant metadata, e.g. -l or --color. Should you not need sorted output for ls, use option -U which will improve the response time for listing.
Likewise, if you use rsync to move data between Lustre and execution nodes make sure to copy only absolutely relevant files and do not use --exclude PATTERN options. Generally, wild-cards for commands like tar or rm for huge lists of files should be avoided. For example, executing rm -rf /path/to/files/* with millions of files will never finish, since the expansion of the wild-card * will have highly negative impact on the responsiveness of Lustre overall.

Don’t store too many files in a directory

Concurrent access to several files in the same directory creates contention, because Lustre has to maintain a lock to the directory. Therefore, make sure to use subdirectories and keep the number of files per directory within thousands.

Avoid many small files

The optimal access strategy for handling I/O activities of a single job would be to have exactly one file containing input data and to write exactly one output file. Neither input nor output files should be shared with another jobs respective process.

Naturally, in many applications this is impossible or very difficult due to the nature of the executed computation itself. Nevertheless, the number of files accessed by your application should be as small as possible.
For this reason we recommend file sizes bigger than 1 GB for input data if feasible. If your experiment data consists of many small files, consider to merge the data once before your execute several processing applications on significantly bigger data files. The number of output files needs to be very limited too. Remember that log files are output files as well. For example, separating standard out and standard error should be avoided, as well as writing additional log streams from child processes.

Don’t write one file from many processes

Logical concurrence of file access is a burden for Lustre. In case of massively parallel computations with a large number of processes or threads contention has to be taken into account. Instead of allowing all processes to do the I/O operations, choose just a few processes to do this. For writes, a couple of processes should collect the data from other processes and merge it before writing to storage.

Keep files open and buffer data

Each file-open is a metadata request to the MDS followed by a redirection to an OSS. Keep file handles open during the execution of your application. Make sure to buffer output as long as possible, e.g. 1MB or more, before you flush it to the storage. Generally, aggregate small read and write operations into the larger ones for instance with MPI-IO Collective Buffering.

Don’t install software in Lustre

The IT department provides a dedicated [infrastructure for software deployment][D2jRt] on the compute cluster. Please contact the software coordinator of your experiment/working group for support.

Avoid installing software frameworks and libraries in Lustre, because they usually contain lots of very small files. When many jobs start to initialize applications with library dependencies inside a Lustre file system, the MDS will be flooded with requests.
Do not compile software in Lustre! Since the build process generates plenty of compile artifacts and temporary files, it floods the MDS with requests. If you absolutely need to deploy binaries on Lustre, compile them in a temporary directory on the interactive machines and install them afterwards.

Beware of executables in Lustre

Lustre clients can block on I/O operations in case of high load in Lustre, since Lustre comes with a “strong” client/server coupling to enable connection recovery after infrastructure failures. Usually, in such cases programs crash when instructions are loaded from an inaccessible executable into memory. It is possible to submit binaries as jobs with the cluster management system, but we recommend to copy the executable into the temporary scratch space local at the execution node before executing it. Furthermore, executables suffer from a performance penalty due to the network latency when executed from Lustre.

Footnotes

Lustre File System, Wikipedia
https://en.wikipedia.org/wiki/Lustre_(file_system)↩︎

--- title: Lustre Shared Storage date-modified: 2023/11/09 --- ## Shared Storage **Access shared storage in the directory `/lustre` on all nodes.** Lustre [^lustre] is a distributed and parallel file system. The GSI Lustre installation manages hundreds of millions of files and petabytes of data, certain common practices in handling I/O on smaller network file systems do not apply to Lustre. In order to understand these boundary conditions for Lustre, users have to understand how data access is established. ::: {.callout-warning appearance="simple"} Storage in `/lustre` has no automatic backup! ::: Risk of data loss is reduced by storing data on RAID arrays. However, this does not guarantee the safety of data and it does not protect against errors of RAID controllers, silent data corruption and other sources of data loss. The namespace metadata is copied in regular intervals. This may help in case of total destruction of a Meta Data Server (MDS). Success of recovering metadata on the production scale has not been shown. Any file operations more recent than the last backup copy of the namespace metadata would be lost. #### Why should You care? It would be much more convenient if it were possible to hide all the particularities of a highly complex system like Lustre from users. However, **applications performance and to some degree Lustre's serviceability depend on basic understanding of Lustre architecture**. Simple changes to your application code or runtime environment may have a big impact on reliability and performance of Lustre file systems. The difficulty in implementing parallel I/O workloads for scientists is based on the difference between data representation from the scientific point of view, e.g. particles, cells, etc. and the conditions of a physical data storage. Furthermore, software layers like Lustre in between applications and physical devices are complex and varying. Nevertheless, simple insides as described in **"Best Practices" here will optimize application performance with a moderate amount of work.** #### How does Lustre I/O work? With reference to [Luster architecture](#architecture) following paragraph describes the basic IO mechanism of Lustre with clients on compute nodes that run jobs and process files by **first sending a request to the MDS**. Using its local metadata, the MDS determines the location and structural distribution of a file. **Second, the MDS redirects the client to the Object Store Server (OSS)** that manages multiple Object Store Targets (OST), where OST(s) store the requested file objects. Once a client has established a network connection to the target OSS, the MDS is no longer involved in the I/O process. **Third, the actual data access of clients goes directly over the OSS(s) to the OST(s)** performing operations like file locking, disk allocation and so forth. #### Where are the limits? Since Lustre enforces coherence in case of multiple clients accessing the same file at the same time, a distributed file lock management is required to guarantee consistency for all clients. A Lustre system is limited in the number of operations it can do per second. Therefore, it is strongly recommended to **keep the amount of file-open and file-lock operations in parallel as small as possible to reduce contention.** [^lustre]: Lustre File System, Wikipedia <https://en.wikipedia.org/wiki/Lustre_(file_system)> ## Architecture Communication may refer to certain Lustre machine types without giving explanations, so here they are: 1. The **Metadata Server (MDS)** stores namespace metadata such as file and directory names or access permissions. This single service directs each file request to a corresponding server storing data. Once the file is opened, the MDS is not longer involved. 2. The **Object Storage Servers (OSS)** typically manage several Object Storage Targets (OST) by controlling I/O accesses and handling network requests. OSTs are storage devices that store your files and consist of physical disks in a RAID configuration. File data is stored in one or more objects, where each object is distributed to a different OST. The capacity of a Lustre file system is the sum of the capacities provided by all OSTs. 3. **Clients** normally used as compute nodes in a cluster access file data. Lustre presents a single namespace to users, visible like a usual network file system via a mounted path in the local directory tree, which is standard POSIX compliant. It allows concurrent and coherent read and write access to the distributed file system. 4. The **Configuration Management Server (MGS)** is the central point of contact providing configuration information about Lustre file systems and the entry point for Lustre communication, e.g. for a client intending to mount a Lustre file system. The MGS is usually co-located to the MDT on one machine. ## Best Practices Lustre is a system shared among all users of the compute cluster: * Due to the architecture of Lustre, **users have a significant impact on the overall performance** depending on their methods of working with data stored on the file system. * Optimizing the I/O performance for each application utilizing distributed storage will decrease the overall load on Lustre and hence improve the user experience for everyone. Following are some general advices for I/O activities on Lustre. * Keep in mind that **misused access on Lustre multiplies with the number of jobs** submitted to the compute cluster. Many of the topics discussed further down will unfold with severity depending on the scale of cluster applications. #### Avoid flooding the MDS Commands that **access file metadata like `ls`, `find`, `du`, or `df` can flood the MDS** with huge number of requests, especially if they descend into a deep directory tree structure with many files. * Metadata information such as ownership or permissions are stored in the MDS, whereas a file size is only available from the respective OST. For example, `ls -lR` issues a request to the MDS and to an OST for each file or directory. Similarly, recursive scanning of directory trees with `find` is very expensive and takes a long time to complete. * Avoid as far as possible searching for input/output files in jobs. If you need to check for the existence of a file, for instance with `ls`, omit unnecessary command options reading irrelevant metadata, e.g. `-l` or `--color`. Should you not need sorted output for ls, use option -U which will improve the response time for listing. * Likewise, if you use `rsync` to move data between Lustre and execution nodes make sure to copy only absolutely relevant files and do not use `--exclude PATTERN` options. Generally, wild-cards for commands like `tar` or `rm` for huge lists of files should be avoided. For example, executing `rm -rf /path/to/files/*` with millions of files will never finish, since the expansion of the wild-card `*` will have highly negative impact on the responsiveness of Lustre overall. #### Don't store too many files in a directory Concurrent access to several files in the same directory creates contention, because Lustre has to maintain a lock to the directory. Therefore, make sure to use subdirectories and **keep the number of files per directory within thousands**. #### Avoid many small files The optimal access strategy for handling I/O activities of a single job would be to have exactly one file containing input data and to write exactly one output file. Neither input nor output files should be shared with another jobs respective process. * Naturally, in many applications this is impossible or very difficult due to the nature of the executed computation itself. Nevertheless, **the number of files accessed by your application should be as small as possible**. * For this reason we recommend file sizes bigger than 1 GB for input data if feasible. If your experiment data consists of many small files, consider to merge the data once before your execute several processing applications on significantly bigger data files. The number of output files needs to be very limited too. Remember that log files are output files as well. For example, separating standard out and standard error should be avoided, as well as writing additional log streams from child processes. #### Don't write one file from many processes Logical concurrence of file access is a burden for Lustre. In case of massively parallel computations with a large number of processes or threads contention has to be taken into account. Instead of allowing all processes to do the I/O operations, choose just a few processes to do this. For writes, **a couple of processes should collect the data from other processes and merge it before writing to storage**. #### Keep files open and buffer data Each file-open is a metadata request to the MDS followed by a redirection to an OSS. **Keep file handles open during the execution of your application**. Make sure to **buffer output as long as possible**, e.g. 1MB or more, before you flush it to the storage. Generally, aggregate small read and write operations into the larger ones for instance with MPI-IO Collective Buffering. #### Don't install software in Lustre The IT department provides a dedicated [infrastructure for software deployment][D2jRt] on the compute cluster. Please contact the software coordinator of your experiment/working group for support. * **Avoid installing software frameworks and libraries** in Lustre, because they usually contain lots of very small files. When many jobs start to initialize applications with library dependencies inside a Lustre file system, the MDS will be flooded with requests. * **Do not compile software in Lustre!** Since the build process generates plenty of compile artifacts and temporary files, it floods the MDS with requests. If you absolutely need to deploy binaries on Lustre, compile them in a temporary directory on the interactive machines and install them afterwards. #### Beware of executables in Lustre Lustre clients can block on I/O operations in case of high load in Lustre, since Lustre comes with a "strong" client/server coupling to enable connection recovery after infrastructure failures. Usually, in such cases programs crash when instructions are loaded from an inaccessible executable into memory. It is possible to submit binaries as jobs with the cluster management system, but we recommend to **copy the executable into the temporary scratch space local at the execution node** before executing it. Furthermore, executables suffer from a performance penalty due to the network latency when executed from Lustre.