Lustre Shared Storage
Architecture
Communication may refer to certain Lustre machine types without giving explanations, so here they are:
- The Metadata Server (MDS) stores namespace metadata such as file and directory names or access permissions. This single service directs each file request to the corresponding server storing the data. Once the file is opened, the MDS is no longer involved.
- The Object Storage Servers (OSS) typically manage several Object Storage Targets (OST) by controlling I/O accesses and handling network requests. OSTs are storage devices that store your files and consist of physical disks in a RAID configuration. File data is stored in one or more objects, where each object is distributed to a different OST. The capacity of a Lustre file system is the sum of the capacities provided by all OSTs.
- Clients, normally the compute nodes of a cluster, access the file data. Lustre presents a single namespace to users, visible like a usual network file system through a mounted path in the local directory tree, and it is POSIX compliant. It allows concurrent and coherent read and write access to the distributed file system.
- The Configuration Management Server (MGS) is the central point of contact that provides configuration information about Lustre file systems. It is also the entry point for Lustre communication, e.g. for a client intending to mount a Lustre file system. The MGS is usually co-located with the MDS on one machine.
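The OSS/OST item above says that file data is split into objects, each placed on a different OST. A minimal sketch of such a round-robin layout follows, assuming a fixed stripe size and stripe count. The function name and both default values are illustrative assumptions, not the settings of any particular file system.

```python
def locate_offset(offset, stripe_size=1024 * 1024, stripe_count=4):
    """Return (object_index, offset_within_object) for a byte offset in a striped file.

    stripe_size and stripe_count are illustrative values chosen for this sketch.
    """
    stripe_number = offset // stripe_size          # which stripe the byte falls into
    object_index = stripe_number % stripe_count    # stripes rotate over the objects
    # Each object holds every stripe_count-th stripe, packed back to back.
    offset_in_object = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, offset_in_object


if __name__ == "__main__":
    for offset in (0, 1024 * 1024, 4 * 1024 * 1024):
        print(offset, locate_offset(offset))
```

With such a layout, large sequential reads and writes are spread over several OSTs at once, which is one reason why a few large files tend to perform better than many small ones.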
Best Practices
Lustre is a system shared among all users of the compute cluster:
- Due to the architecture of Lustre, users have a significant impact on the overall performance depending on their methods of working with data stored on the file system.
- Optimizing the I/O performance of each application that uses the distributed storage decreases the overall load on Lustre and hence improves the user experience for everyone. The following sections give some general advice for I/O activities on Lustre.
- Keep in mind that the effect of inefficient access to Lustre multiplies with the number of jobs submitted to the compute cluster. Many of the issues discussed further down become more severe as cluster applications scale up.
Avoid flooding the MDS
Commands that access file metadata, such as `ls`, `find`, `du`, or `df`, can flood the MDS with a huge number of requests, especially if they descend into a deep directory tree with many files.
- Metadata such as ownership or permissions is stored on the MDS, whereas the file size is only available from the respective OST. For example, `ls -lR` issues a request to the MDS and to an OST for each file or directory. Similarly, recursively scanning directory trees with `find` is very expensive and takes a long time to complete.
- As far as possible, avoid searching for input/output files in jobs. If you need to check for the existence of a file, for instance with `ls`, omit unnecessary command options that read irrelevant metadata, e.g. `-l` or `--color`. If you do not need sorted output from `ls`, use the option `-U`, which improves the response time of the listing.
- Likewise, if you use `rsync` to move data between Lustre and execution nodes, make sure to copy only the files that are absolutely relevant, and do not use `--exclude PATTERN` options. Generally, avoid wildcards with commands like `tar` or `rm` on huge lists of files. For example, executing `rm -rf /path/to/files/*` with millions of files will never finish, since the expansion of the wildcard `*` will have a highly negative impact on the overall responsiveness of Lustre.
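When a job script is written in Python rather than shell, the same ideas can be applied there. The following is a minimal sketch with hypothetical helper names: check a single known path instead of listing a directory, iterate over entries unsorted and without reading extra attributes, and delete files while streaming the directory instead of expanding a wildcard first.

```python
import os


def file_exists(path):
    """Check one known path directly instead of listing its directory."""
    return os.path.exists(path)


def list_names_unsorted(directory):
    """Yield entry names in directory order (comparable to `ls -U`).

    Only the names are read; sizes and timestamps are never requested.
    """
    with os.scandir(directory) as entries:
        for entry in entries:
            yield entry.name


def remove_files(directory):
    """Delete regular files one by one while streaming the directory.

    No complete file list is built up front, unlike a shell wildcard.
    Returns the number of files removed.
    """
    removed = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                os.unlink(entry.path)
                removed += 1
    return removed
```

Each deletion is still a metadata operation, so this does not make removing millions of files cheap. It only avoids the up-front expansion and sorting of the full file list.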
Don’t store too many files in a directory
Concurrent access to several files in the same directory creates contention, because Lustre has to maintain a lock on the directory. Therefore, use subdirectories and keep the number of files per directory to a few thousand at most.
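One possible way to keep directories small is to derive a subdirectory from each file name. The sketch below assumes a checksum-based scheme with 256 buckets. The function name, the bucket count, and the example path are illustrative assumptions.

```python
import os
import zlib


def sharded_path(root, filename, buckets=256):
    """Map a file name to root/<bucket>/<filename>.

    The bucket is a checksum of the name modulo `buckets`, so files spread
    roughly evenly and the same name always maps to the same subdirectory.
    """
    bucket = zlib.crc32(filename.encode("utf-8")) % buckets
    return os.path.join(root, f"{bucket:03d}", filename)


if __name__ == "__main__":
    # Hypothetical path, shown only to illustrate the resulting layout.
    print(sharded_path("/lustre/project/output", "event_000123.dat"))
```

With 256 buckets, one million files end up at roughly four thousand files per directory. The caller still has to create the bucket directory, e.g. with `os.makedirs(..., exist_ok=True)`, before writing.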
Avoid many small files
The optimal access strategy for the I/O activities of a single job is to read exactly one file containing the input data and to write exactly one output file. Neither input nor output files should be shared with the processes of another job.
- Naturally, in many applications this is impossible or very difficult due to the nature of the executed computation itself. Nevertheless, the number of files accessed by your application should be as small as possible.
- For this reason we recommend file sizes bigger than 1 GB for input data if feasible. If your experiment data consists of many small files, consider merging the data once, before you execute several processing applications on the significantly bigger data files. The number of output files needs to be very limited too. Remember that log files are output files as well. For example, avoid separating standard output and standard error into different files, as well as writing additional log streams from child processes.
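Merging many small input files once, as suggested above, could look like the following sketch. It uses an uncompressed tar archive as the container, which is an assumption made for illustration (any format your applications can read sequentially would do), and the function names are hypothetical.

```python
import os
import tarfile


def pack_directory(source_dir, archive_path):
    """Bundle everything under source_dir into one uncompressed tar archive."""
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w") as archive:
        archive.add(source_dir, arcname=top_level)
    return archive_path


def iter_members(archive_path):
    """Yield (name, content) pairs by reading the single archive sequentially."""
    with tarfile.open(archive_path, "r") as archive:
        for member in archive:
            if member.isfile():
                with archive.extractfile(member) as handle:
                    yield member.name, handle.read()
```

The packing step touches every small file once. Afterwards, each processing job opens only the archive, so the MDS sees one open per job instead of one per small file.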
Don’t write one file from many processes
Logically concurrent access to the same file is a burden for Lustre. In massively parallel computations with a large number of processes or threads, contention has to be taken into account. Instead of allowing all processes to perform I/O operations, choose just a few of them to do so. For writes, a couple of processes should collect the data from the other processes and merge it before writing to storage.
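The collect-then-write pattern can be sketched as follows. Threads stand in for the parallel workers here purely to keep the example self-contained; in an MPI application the same role would be played by ranks sending their data to a designated writer. `compute_chunk` is a hypothetical placeholder for the real computation.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_chunk(worker_id):
    """Hypothetical per-worker computation; returns that worker's output lines."""
    return [f"worker={worker_id} value={worker_id * worker_id}\n"]


def run_and_write(output_path, n_workers=8):
    """Workers only compute; the calling thread alone opens and writes the file."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(compute_chunk, range(n_workers))  # yielded in worker order
        with open(output_path, "w") as out:
            for lines in results:
                out.writelines(lines)
```

Only one file handle is ever open for writing, so Lustre sees a single writer regardless of how many workers take part in the computation.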
Keep files open and buffer data
Each file open is a metadata request to the MDS followed by a redirection to an OSS. Keep file handles open during the execution of your application. Buffer output as long as possible, e.g. 1 MB or more, before you flush it to storage. Generally, aggregate small read and write operations into larger ones, for instance with MPI-IO collective buffering.
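In application code, this usually means opening the output once and letting a large buffer absorb the small writes. A minimal sketch, assuming binary records and a 1 MiB buffer (the function name and the buffer size are illustrative):

```python
def write_records(path, records, buffer_size=1024 * 1024):
    """Write all records through one file handle with a large write buffer.

    Small writes accumulate in memory and reach storage in large chunks.
    Returns the total number of bytes written.
    """
    total = 0
    with open(path, "wb", buffering=buffer_size) as out:  # opened exactly once
        for record in records:
            total += out.write(record)  # goes to the buffer, not straight to storage
    # Leaving the `with` block flushes the remaining buffer and closes the file.
    return total
```

The pattern to avoid is the opposite one: reopening the file in append mode for every record, which costs one MDS request per record.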
Don’t install software in Lustre
The IT department provides a dedicated [infrastructure for software deployment][D2jRt] on the compute cluster. Please contact the software coordinator of your experiment/working group for support.
- Avoid installing software frameworks and libraries in Lustre, because they usually contain lots of very small files. When many jobs start to initialize applications with library dependencies inside a Lustre file system, the MDS will be flooded with requests.
- Do not compile software in Lustre! Since the build process generates plenty of compile artifacts and temporary files, it floods the MDS with requests. If you absolutely need to deploy binaries on Lustre, compile them in a temporary directory on the interactive machines and install them afterwards.
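If binaries really have to end up on Lustre, the build-elsewhere-then-install approach from the last item could be scripted roughly as below. This is a sketch under stated assumptions: the function name is hypothetical, the build command and artifact names are supplied by the caller, and the system temporary directory (as chosen by `tempfile`) is assumed to be outside Lustre.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_off_lustre(source_dir, build_command, artifacts, install_dir):
    """Build in a temporary directory and copy only the finished artifacts.

    build_command is an argument list run inside the temporary copy of the
    sources; artifacts are file names relative to that build directory.
    """
    install_dir = Path(install_dir)
    install_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as scratch:
        work = Path(scratch) / "build"
        shutil.copytree(source_dir, work)                    # sources leave Lustre once
        subprocess.run(build_command, cwd=work, check=True)  # temporary files stay in scratch
        for name in artifacts:
            shutil.copy2(work / name, install_dir / name)    # only final files are installed
    return install_dir
```

Object files and other intermediate build products are discarded together with the temporary directory, so they never reach the MDS.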
Beware of executables in Lustre
Lustre clients can block on I/O operations when Lustre is under high load, since Lustre comes with a “strong” client/server coupling to enable connection recovery after infrastructure failures. In such cases, programs usually crash when instructions are loaded into memory from an inaccessible executable. It is possible to submit binaries as jobs with the cluster management system, but we recommend copying the executable into the temporary scratch space local to the execution node before executing it. Furthermore, executables run from Lustre suffer a performance penalty due to network latency.
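Staging an executable to node-local scratch before running it might look like this sketch. It assumes that the temporary directory chosen by `tempfile` (typically `$TMPDIR` or `/tmp`) is local to the execution node and permits execution; the function name is hypothetical.

```python
import shutil
import subprocess
import tempfile


def run_from_local_scratch(executable, args=(), **run_kwargs):
    """Copy the executable to local scratch, then run the local copy.

    Extra keyword arguments are passed through to subprocess.run.
    """
    with tempfile.TemporaryDirectory() as scratch:
        local_copy = shutil.copy2(executable, scratch)  # keeps the executable bit
        return subprocess.run([local_copy, *args], check=True, **run_kwargs)
```

The program is read from Lustre once during the copy. After that, loading its instructions no longer depends on Lustre being responsive.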
Footnotes
Lustre File System, Wikipedia
https://en.wikipedia.org/wiki/Lustre_(file_system)