Lustre (file system)

From Wikipedia, the free encyclopedia

Lustre
Developer(s): Oracle Corporation
Stable release: 1.8.5 / October 29, 2010
Preview release: 2.0.0 / August 30, 2010
Operating system: Linux
Type: Distributed file system
License: GPL
Website: http://www.lustre.org (Oracle's Lustre website)

Lustre is a massively parallel distributed file system, generally used for large scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster.[1] Available under the GNU GPL, the project provides a high performance file system for clusters of tens of thousands of nodes with petabytes of storage capacity.

Lustre was initially designed, developed, and maintained by Cluster File Systems, Inc. from 2003 until the company's acquisition by Sun Microsystems in 2007.[2][3] Sun sold Lustre with its HPC hardware offerings, with the intent of bringing the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system. In 2010, Oracle Corporation, by way of its acquisition of Sun, began to manage and release Lustre.

In April 2010, Oracle announced that it would limit paid support for new Lustre 2.0 deployments to Oracle hardware or hardware provided by approved third-party vendors. Lustre remained available to all users under the GPL license, and existing Lustre 1.8 customers would continue to receive support from Oracle.[4]

In December 2010, Oracle ceased development of Lustre and placed the 1.8 release into maintenance-only mode,[5] leaving future development to the Lustre community, including Whamcloud,[6] Xyratex,[7] OpenSFS, HPCFS, European Open Filesystems (OFS) SCE, and others.

Lustre file systems are used in computer clusters ranging from small workgroup clusters to large-scale, multi-site clusters. Fifteen of the top 30 supercomputers in the world use Lustre file systems, including the world's fastest supercomputer (as of October 2010[8]), the Tianhe-1A at the National Supercomputing Center in Tianjin, China. Other supercomputers that use the Lustre file system include the second-fastest system, Jaguar, at Oak Ridge National Laboratory (ORNL), as well as systems at the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory (LBNL), Lawrence Livermore National Laboratory (LLNL), Pacific Northwest National Laboratory, the Texas Advanced Computing Center, and NASA[9] in North America, the largest system in Asia at the Tokyo Institute of Technology,[10] and one of the largest systems in Europe at CEA.[11]

Lustre file systems can support tens of thousands of client systems, tens of petabytes (PBs) of storage, and hundreds of gigabytes per second (GB/s) of I/O throughput. Due to Lustre's high scalability, Internet service providers, financial institutions, and companies in the oil and gas industry deploy Lustre file systems in their data centers.[12]

History

The Lustre file system architecture was developed as a research project in 1999 by Peter Braam, who was a Senior Systems Scientist at Carnegie Mellon University at the time. Braam went on to found his own company, Cluster File Systems, which released Lustre 1.0 in 2003. In November 2008, Braam left Sun Microsystems to work on another filesystem, leaving Eric Barton and Andreas Dilger in charge of Lustre architecture and development. In 2010, Barton and Dilger left Oracle for the Lustre-centric startup Whamcloud,[13] where they continue to enhance Lustre for the next generation of supercomputers.

The Lustre file system was first installed for production use in March 2003 on the MCR Linux Cluster at LLNL,[14] one of the largest supercomputers at the time.[15]

Lustre 1.2.0, released in March 2004, provided Linux kernel 2.6 support, a "size glimpse" feature to avoid lock revocation on files undergoing write, and client side data write-back cache accounting (grant).

Lustre 1.4.0, released in November 2004, provided protocol compatibility between versions, InfiniBand network support, and support for extents/mballoc in the ldiskfs on-disk filesystem.

Lustre 1.6.0, released in April 2007, supported mount configuration (“mountconf”) allowing servers to be configured with "mkfs" and "mount", supported dynamic addition of object storage targets (OSTs), enabled Lustre distributed lock manager (LDLM) scalability on symmetric multiprocessing (SMP) servers, and supported free space management for object allocations.

Lustre 1.8.0, released in May 2009, provided OSS Read Cache, improved recovery in the face of multiple failures, added basic heterogeneous storage management via OST Pools, and introduced adaptive network timeouts and version-based recovery. It also served as a transition release, being interoperable with both Lustre 1.6 and Lustre 2.0.[16]

Lustre 2.0.0, released in August 2010, provided a rewritten metadata server stack to provide a basis for Clustered Metadata (CMD) to allow distribution of the Lustre metadata across multiple metadata servers, a new Client IO stack (CLIO) for portability to other client operating systems such as Mac OS, Windows, and Solaris, and an abstracted Object Storage Device (OSD) back-end for portability to other filesystems such as ZFS.

The Lustre file system and associated open source software have been adopted by many partners. Both Red Hat and SUSE (Novell) offer Linux kernels that work without patches on the client, easing deployment.

Architecture

A Lustre file system has three major functional units:

  • A single metadata server (MDS) that has a single metadata target (MDT) per Lustre filesystem that stores namespace metadata, such as filenames, directories, access permissions, and file layout. The MDT data is stored in a single local disk filesystem.
  • One or more object storage servers (OSSes) that store file data on one or more object storage targets (OSTs). Depending on the server’s hardware, an OSS typically serves between two and eight OSTs, with each OST managing a single local disk filesystem. The capacity of a Lustre file system is the sum of the capacities provided by the OSTs.
  • Client(s) that access and use the data. Lustre presents all clients with a unified namespace for all of the files and data in the filesystem, using standard POSIX semantics, and allows concurrent and coherent read and write access to the files in the filesystem.

The MDT, OST, and client can be on the same node, but in typical installations these functions are on separate nodes communicating over a network. The Lustre Network (LNET) layer supports several network interconnects, including native InfiniBand verbs, TCP/IP on Ethernet and other networks, Myrinet, Quadrics, and other proprietary network technologies. Lustre takes advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.

The storage used for the MDT and OST backing filesystems is partitioned, optionally organized with logical volume management (LVM) and/or RAID, and normally formatted as ext4 file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.
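
As an illustration of how server targets are prepared, Lustre ships a mkfs.lustre utility that formats a block device as an MDT or OST and registers it with the management service. The Python sketch below merely wraps these commands for illustration; the device paths, filesystem name, and MGS node identifier are placeholders, and exact options vary between Lustre releases.

  import subprocess

  # Placeholders: block devices, filesystem name, and MGS network identifier.
  FSNAME = "demofs"
  MGS_NID = "mds1@tcp0"

  # Format the metadata target (here co-located with the management service).
  subprocess.run(["mkfs.lustre", "--fsname=" + FSNAME, "--mgs", "--mdt", "/dev/sdb"],
                 check=True)

  # Format an object storage target, pointing it at the MGS node.
  subprocess.run(["mkfs.lustre", "--fsname=" + FSNAME, "--ost",
                  "--mgsnode=" + MGS_NID, "/dev/sdc"], check=True)

  # The formatted targets are then mounted with "mount -t lustre" to bring the
  # MDS and OSS services online.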

An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use an enhanced version of ext4 called ldiskfs to store data. Work started in 2008 at Sun to port Lustre to Sun's ZFS/DMU for back-end data storage[17] and continues as an open source project.[18]

When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then interprets the layout in the logical object volume (LOV) layer, which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
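
The read path described above can be sketched in a few lines of Python. The sketch is a simplified model rather than the actual client code: mds_lookup() and ost_read() are hypothetical stand-ins for the MDS lookup and the direct client-to-OSS read RPCs, and the layout maps stripes to objects round-robin in the RAID 0 style described in the striping section below.

  from dataclasses import dataclass

  @dataclass
  class Layout:
      stripe_size: int                # bytes per stripe
      objects: list[tuple[int, int]]  # (ost_index, object_id) pairs

  def mds_lookup(path: str) -> Layout:
      # Stand-in for the MDS lookup RPC: returns the file's striping layout
      # and object pointers (placeholder values).
      return Layout(stripe_size=1 << 20, objects=[(0, 101), (1, 102), (2, 103)])

  def ost_read(ost_index: int, object_id: int, offset: int, length: int) -> bytes:
      # Stand-in for a direct, extent-locked read from the OSS holding the object.
      return bytes(length)

  def client_read(path: str, offset: int, length: int) -> bytes:
      layout = mds_lookup(path)          # one metadata round trip
      data = bytearray()
      pos, end = offset, offset + length
      while pos < end:
          stripe = pos // layout.stripe_size
          ost_index, object_id = layout.objects[stripe % len(layout.objects)]
          # Offset of this stripe within its object plus the intra-stripe offset.
          obj_off = (stripe // len(layout.objects)) * layout.stripe_size \
                    + pos % layout.stripe_size
          chunk = min(end, (stripe + 1) * layout.stripe_size) - pos
          data += ost_read(ost_index, object_id, obj_off, chunk)  # may run in parallel
          pos += chunk
      return bytes(data)

  print(len(client_read("/mnt/demofs/file", offset=3 << 20, length=2 << 20)))  # 2097152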

Clients do not directly modify the objects on the OST filesystems, but instead delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem, increasing the risk of filesystem corruption from misbehaving or defective clients.

Implementation

In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.
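
As an illustration, once the Lustre client modules are loaded, mounting names the management server's network identifier (NID) and the filesystem name, after which applications use ordinary POSIX calls under the mount point. The values in the sketch below are placeholders.

  import subprocess

  # Placeholders: MGS network identifier, filesystem name, and local mount point.
  MGS_NID = "mds1@tcp0"
  FSNAME = "demofs"
  MOUNTPOINT = "/mnt/demofs"

  # Mount the Lustre filesystem much like any other network filesystem.
  subprocess.run(["mount", "-t", "lustre", f"{MGS_NID}:/{FSNAME}", MOUNTPOINT],
                 check=True)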

On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach is used in the LLNL Blue Gene installation.[19]

Another approach, used in the past, was the liblustre library, which provided userspace applications with direct filesystem access. Liblustre was a user-level library that allowed computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors could access a Lustre file system even if the service node on which the job was launched was not a Lustre client. Liblustre allowed data movement directly between application space and the Lustre OSSes without requiring an intervening data copy through the kernel, thus providing low-latency, high-bandwidth access from the computational processors to the Lustre file system.

Data objects and file striping

In a traditional UNIX disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDTs point to one or more OST objects associated with the file rather than to data blocks. These objects are implemented as files on the OSTs. When a client opens a file, the file open operation transfers a set of object pointers and their layout from the MDS to the client, so that the client can directly interact with the OSS node where the object is stored, allowing the client to perform I/O on the file without further communication with the MDS.

If only one OST object is associated with an MDT inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is “striped” across the objects similar to RAID 0. Striping a file over multiple objects provides significant performance benefits. When striping is used, the maximum file size is not limited by the size of a single target. Capacity and aggregate I/O bandwidth scale with the number of OSTs a file is striped over. Also, since the locking of each object is managed independently for each OST, adding more stripes (OSTs) scales the file IO locking capability of the filesystem proportionately. Each file in the filesystem can have a different striping layout, so that performance and capacity can be tuned optimally for each file.
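
As an illustration of per-file layouts, the lfs utility distributed with Lustre sets and reports striping. The sketch below uses placeholder paths; only the stripe-count option is shown, because the spelling of other options (such as the stripe-size flag) has varied between Lustre releases.

  import subprocess

  # Placeholder directory and file under a mounted Lustre filesystem.
  DIRECTORY = "/mnt/demofs/results"

  # Request that new files created in this directory be striped over 4 OSTs.
  subprocess.run(["lfs", "setstripe", "-c", "4", DIRECTORY], check=True)

  # Report the layout (stripe count, stripe size, OST objects) of one file.
  subprocess.run(["lfs", "getstripe", DIRECTORY + "/output.dat"], check=True)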

Locking

Lustre has a distributed lock manager in the VMS style to protect the integrity of each file's data and metadata. Access and modification of a Lustre file is completely cache coherent among all of the clients. Metadata locks are managed by the MDT that stores the inode for the file, using the 128-bit Lustre File Identifier (FID, composed of the sequence number and object ID) as the resource name. The metadata locks are split into multiple bits that protect the lookup of the file (file owner and group, permission and mode, and access control list (ACL)), the state of the inode (directory size, directory contents, link count, timestamps), and the layout (file striping). A client can fetch multiple metadata lock bits for a single inode with a single RPC request, but currently clients are only ever granted a read lock for the inode. The MDS manages all modifications to the inode in order to avoid lock resource contention and is currently the only node that gets write locks on inodes.
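
The grouping of metadata lock bits can be pictured as a small bitmask, as in the illustrative Python sketch below; the names mirror the three categories listed above and are not the actual Lustre constants.

  from enum import IntFlag

  # Hypothetical lock bits, one per category of metadata protected.
  class MetadataLockBits(IntFlag):
      LOOKUP = 0x1  # owner and group, permission and mode, ACL
      UPDATE = 0x2  # directory size and contents, link count, timestamps
      LAYOUT = 0x4  # file striping

  # A single RPC may request several bits at once; the MDS can grant a subset.
  requested = MetadataLockBits.LOOKUP | MetadataLockBits.LAYOUT
  granted = requested & ~MetadataLockBits.LAYOUT  # e.g. the layout bit is withheld
  print(granted)  # MetadataLockBits.LOOKUP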

File data locks are managed by the OST on which each object of the file is striped, using byte-range extent locks. Clients can be granted overlapping read extent locks for part or all of the file, allowing multiple concurrent readers of the same file, as well as non-overlapping write extent locks for regions of the file. This allows many Lustre clients to access a single file concurrently for both read and write, avoiding bottlenecks during file IO. In practice, because Linux clients manage their data cache in units of pages, the clients request locks that are always an integer multiple of the page size (4096 bytes on most clients). When a client requests an extent lock, the OST may grant a lock for a larger extent than requested, in order to reduce the number of lock requests that the client makes. The actual size of the granted lock depends on several factors, including the number of currently granted locks, whether there are conflicting write locks, and the number of outstanding lock requests. The granted lock is never smaller than the originally requested extent. OST extent locks use the Lustre FID as the resource name for the lock. Since the number of extent lock servers scales with the number of OSTs in the filesystem, this also scales the aggregate locking performance of the filesystem, and of a single file if it is striped over multiple OSTs.
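
The page alignment of client lock requests can be illustrated with a short calculation: a requested byte range is expanded to whole pages before the lock is requested, and the OST may then grant an even larger extent. The sketch below is illustrative only.

  PAGE_SIZE = 4096  # page size on most Linux clients

  def align_extent_to_pages(start: int, end: int) -> tuple[int, int]:
      """Expand a requested byte range [start, end) to whole pages."""
      aligned_start = (start // PAGE_SIZE) * PAGE_SIZE
      aligned_end = ((end + PAGE_SIZE - 1) // PAGE_SIZE) * PAGE_SIZE
      return aligned_start, aligned_end

  # A request for bytes 100..5000 is covered by the page-aligned range [0, 8192);
  # the OST may grow this further when it grants the lock.
  print(align_extent_to_pages(100, 5000))  # (0, 8192)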

Networking

In a cluster with a Lustre file system, the system network connecting the servers and the clients is implemented using Lustre Networking (LNET), which provides the communication infrastructure required by the Lustre file system. Disk storage is connected to the Lustre file system MDSs and OSSs using traditional storage area network (SAN) technologies.

LNET supports many commonly-used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types with routing between them. Remote Direct Memory Access (RDMA) is permitted when supported by underlying networks such as Quadrics Elan, Myrinet, and InfiniBand. High availability and recovery features enable transparent recovery in conjunction with failover servers.

LNET provides end-to-end throughput over Gigabit Ethernet (GigE) networks in excess of 100 MB/s,[20] throughput up to 3 GB/s using InfiniBand quad data rate (QDR) links, and throughput over 1 GB/s across 10GigE interfaces.

High availability

Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, merely experiencing a delay while the backup server takes over the storage.

Lustre MDSes are configured as an active/passive pair, while OSSes are typically deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system, so no nodes are idle in the cluster.

Commercial support

Commercial support for Lustre is available from a wide array of vendors. In most cases, this support is bundled with the computing system and/or storage hardware sold by the vendor. A non-exhaustive list of vendors selling bundled computing and Lustre storage systems includes Cray, Dell, Hewlett-Packard, SGI, and others. Sun Microsystems no longer sells either systems or storage that include Lustre. Major vendors selling storage hardware with bundled Lustre support include Data Direct Networks (DDN), Dell, Terascala, Xyratex, and many others.

References

  1. ^ "Lustre Home". Archived from the original on 2000-08-23.
  2. ^ "Sun Assimilates Lustre Filesystem". Linux Magazine. 2007-09-13.
  3. ^ "Sun Welcomes Cluster File Systems Customers and Partners". Sun Microsystems, Inc. 2007-10-02.
  4. ^ "Lustre 2.0 support limited to Oracle hardware customers". Infostor. 2010-04-28.
  5. ^ "Oracle has Kicked Lustre to the Curb". Inside HPC. 2011-01-10.
  6. ^ "Whamcloud aims to make sure Lustre has a future in HPC". Inside HPC. 2011-08-20.
  7. ^ "Xyratex Acquires ClusterStor, Lustre File System Expertise/". HPCwire. 2010-11-09.
  8. ^ "Tianhe-1A". TOP500.Org. 2010-11-16.
  9. ^ "Pleiades Supercomputer". www.nas.nasa.gov. 2008-08-18.
  10. ^ "TOP500 List - November 2006". TOP500.Org.
  11. ^ "TOP500 List - June 2006". TOP500.Org.
  12. ^ "Lustre File System presentation". Google Video. Retrieved 2008-01-28.
  13. ^ "Whamcloud Staffs up for Brighter Lustre". InsideHPC.
  14. ^ "Lustre Helps Power Third Fastest Supercomputer". DSStar.
  15. ^ "MCR Linux Cluster Xeon 2.4 GHz - Quadrics". Top500.Org.
  16. ^ "Lustre Roadmap and Future Plans" (PDF). Sun Microsystems. Retrieved 2008-08-21.
  17. ^ "Lustre to run on ZFS". Government Computer News. 2008-10-26.
  18. ^ "ZFS on Lustre". 2011-05-10.
  19. ^ "DataDirect Selected As Storage Tech Powering BlueGene/L". HPC Wire, October 15, 2004: Vol. 13, No. 41. {{cite web}}: Italic or bold markup not allowed in: |publisher= (help)
  20. ^ Lafoucrière, Jacques-Charles. "Lustre Experience at CEA/DIF" (PDF). HEPiX Forum, April 2007.