Btrfs: Difference between revisions

From Wikipedia, the free encyclopedia
Pwl0lwp (talk | contribs)
m fixed developer list citation
Pwl0lwp (talk | contribs)
History: added Btrfs as default FS for SLE12

Revision as of 00:51, 20 January 2016

Btrfs
Developer(s): Facebook, Fujitsu, Fusion-IO, Intel, Linux Foundation, Netgear, Oracle, Red Hat, STRATO AG, and SUSE[1]
Full name: Btrfs
Introduced (with Linux): Stable: Linux kernel 3.10, 29 Jul 2013; Unstable: Linux kernel 2.6.29, March 2009
Structures
Directory contents: B-tree
File allocation: Extents
Limits
Max volume size: 16 EiB[2][a]
Max file size: 16 EiB[2][a]
Max no. of files: 2^64
Max filename length: 255 ASCII characters (fewer for multibyte character encodings such as Unicode)
Allowed filename characters: All except '/' and NUL ('\0')
Features
Dates recorded: Creation (otime),[5] modification (mtime), attribute modification (ctime), and access (atime)
Date resolution: Nanosecond
Attributes: POSIX and extended attributes
File system permissions: POSIX and ACL
Transparent compression: Yes (zlib, LZO[6] and LZ4[7] (planned))
Transparent encryption: Planned[8]
Data deduplication: In development[9]
Other
Supported operating systems: Linux
Website: btrfs.wiki.kernel.org

Btrfs (B-tree file system, pronounced as "butter F S", "better F S",[8] "b-tree F S",[10] or simply by spelling it out) is a file system based on the copy-on-write (COW) principle, initially designed at Oracle Corporation for use in Linux. Development began in 2007; since 2014, the file system's on-disk format has been marked stable.[11]

Btrfs is intended to address the lack of pooling, snapshots, checksums, and integral multi-device spanning in Linux file systems.[8] Chris Mason, the principal Btrfs author, has stated that its goal was "to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable."[12][13] Valerie Aurora formulated the technical difference between ZFS and btrfs: "At the highest level, ZFS uses plain ol’ trees of pointers to blocks, FFS-style, and variable block sizes, inspired by the SLAB kernel memory allocator. Btrfs uses a specialized, COW-friendly form of b-trees (as presented by Ohad Rodeh at LSF ’07) and extents. Btrfs is actually slightly more exciting than this: every single piece of data or metadata in btrfs is an item in a b-tree, and items are packed together indiscriminately, without regard to their types. ZFS reduced all file system metadata and data to objects and related operations; btrfs reduced them all to items in a b-tree. Now all the interesting decisions are about how to assign keys to items and order them inside the b-tree."[13]

History

The core data structure of Btrfs‍—‌the copy-on-write B-tree‍—‌was originally proposed by IBM researcher Ohad Rodeh at a presentation at USENIX 2007.[14] Chris Mason, an engineer working on ReiserFS for SUSE at the time, joined Oracle later that year and began work on a new file system based on these B-trees.[15]

In 2008, the principal developer of the ext3 and ext4 file systems, Theodore Ts'o, stated that although ext4 has improved features, it is not a major advance; it uses old technology and is a stop-gap. Ts'o said that Btrfs is the better direction because "it offers improvements in scalability, reliability, and ease of management".[16] Btrfs also has "a number of the same design ideas that reiser3/4 had".[17]

Btrfs 1.0, with finalized on-disk format, was originally slated for a late-2008 release,[18] and was finally accepted into the Linux kernel mainline in 2009.[19] Several Linux distributions began offering Btrfs as an experimental choice of root file system during installation.[20][21][22] In summer 2012, several Linux distributions moved Btrfs from experimental to production or supported status.[23][24] In 2015, Btrfs was adopted as the default filesystem for SUSE Linux Enterprise Server 12.[25]

In 2011, defragmentation features were announced for version 3.0 of the Linux kernel.[26] Besides Mason at Oracle, Miao Xie at Fujitsu contributed performance improvements.[27] In June 2012, Chris Mason left Oracle for Fusion-io, which he left a year later with Josef Bacik to join Facebook; in both companies, Mason continued to work on Btrfs.[28][15]

Features

By version 3.14 of the Linux kernel, Btrfs implements the following features:[29][30]

Planned features include the following:[29]

In 2009, Btrfs was expected to offer a feature set comparable to ZFS, developed by Sun Microsystems.[48] After Oracle's acquisition of Sun in 2009, Mason and Oracle decided to continue with Btrfs development.[49]

Cloning

Btrfs provides a clone operation that atomically creates a copy-on-write snapshot of a file. Such cloned files are sometimes referred to as reflinks, in light of the associated Linux kernel system calls.[50]

When cloning, the file system does not create a new link pointing to an existing inode; instead, it creates a new inode that initially shares the same disk blocks with the original file. As a result, cloning works only within the boundaries of the same Btrfs file system, and since version 3.6 of the Linux kernel it may cross the boundaries of subvolumes under certain circumstances.[51][52] The actual data blocks are not duplicated; at the same time, due to the copy-on-write (CoW) nature of Btrfs, modifications to any of the cloned files are not visible in the original file and vice versa.[53]

Cloning should not be confused with hard links, which are directory entries that associate multiple file names with actual files on a file system. While hard links can be taken as different names for the same file, cloning in Btrfs provides independent files that share their disk blocks.[53][54]
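The difference between a hard link and a clone can be illustrated with a small in-memory model. This is only a sketch: the ToyFS and Inode classes are hypothetical names invented for illustration, blocks are plain Python byte strings, and nothing here touches a real file system or the kernel's reflink interface.

```python
class Inode:
    """A toy inode: an ordered list of references to data blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # copies the references, not the data


class ToyFS:
    """Hypothetical model contrasting hard links with clones."""

    def __init__(self):
        self.names = {}  # directory entries: file name -> Inode

    def create(self, name, data_blocks):
        self.names[name] = Inode(data_blocks)

    def hard_link(self, existing, new):
        # A hard link is a second name for the SAME inode.
        self.names[new] = self.names[existing]

    def clone(self, existing, new):
        # A clone is a NEW inode that initially refers to the same blocks.
        self.names[new] = Inode(self.names[existing].blocks)

    def write_block(self, name, index, data):
        # Copy-on-write: the inode is pointed at a fresh block;
        # the old block is left untouched for any other inode using it.
        self.names[name].blocks[index] = data

    def read(self, name):
        return b"".join(self.names[name].blocks)


fs = ToyFS()
fs.create("orig", [b"AAAA", b"BBBB"])
fs.hard_link("orig", "link")
fs.clone("orig", "copy")
fs.write_block("copy", 0, b"CCCC")  # not visible through "orig" or "link"
fs.write_block("link", 1, b"DDDD")  # visible through "orig", not "copy"
print(fs.read("orig"), fs.read("link"), fs.read("copy"))
```

In this model a write through the hard link changes what the original name reads, while a write to the clone only changes the clone; unmodified blocks stay shared.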

Support for this Btrfs feature was added in version 7.5 of the GNU coreutils, via the --reflink option to the cp command.[55][56]

Subvolumes and snapshots

A Btrfs subvolume can be thought of as a separate POSIX file namespace, mountable separately by passing subvol or subvolid options to the mount(8) utility. It can also be accessed by mounting the top-level subvolume, in which case subvolumes are visible and accessible as its subdirectories.[57]

Subvolumes can be created at any place within the file system hierarchy, and they can also be nested. Nested subvolumes appear as subdirectories within their parent subvolumes, similar to the way that the top-level subvolume presents its subvolumes as subdirectories. Deleting a subvolume deletes all subvolumes below it in the nesting hierarchy. For this reason, the top-level subvolume cannot be deleted.[58]

Any Btrfs file system always has a default subvolume, which is initially set to be the top-level subvolume, and is mounted by default if no subvolume selection option is passed to mount. The default subvolume can be changed as required.[58]

A Btrfs snapshot is actually a subvolume that shares its data (and metadata) with some other subvolume, using Btrfs' copy-on-write capabilities, and modifications to a snapshot are not visible in the original subvolume. Once a writable snapshot is made, it can be treated as an alternate version of the original file system. For example, to roll back to a snapshot, a modified original subvolume needs to be unmounted and the snapshot needs to be mounted in its place. At that point, the original subvolume may also be deleted.[57]

The copy-on-write (CoW) nature of Btrfs means that snapshots are quickly created, while initially consuming very little disk space. Since a snapshot is a subvolume, creating nested snapshots is also possible. Taking snapshots of a subvolume is not a recursive process; thus, if a snapshot of a subvolume is created, every subvolume or snapshot that the subvolume already contains is mapped to an empty directory of the same name inside the snapshot.[57][58]
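The non-recursive behaviour of snapshots can be sketched with a toy model in which a subvolume is a nested dictionary. The Subvolume class and snapshot function are hypothetical names for illustration only; unlike Btrfs, this sketch copies directory dictionaries eagerly instead of sharing tree nodes, and only file contents are shared by reference.

```python
class Subvolume:
    """Toy subvolume: a name plus a tree of entries.

    An entry is bytes (a file), a dict (a directory) or another
    Subvolume (a nested subvolume).
    """

    def __init__(self, name):
        self.name = name
        self.entries = {}


def _share(entries):
    shared = {}
    for key, value in entries.items():
        if isinstance(value, Subvolume):
            shared[key] = {}             # nested subvolume -> empty directory
        elif isinstance(value, dict):
            shared[key] = _share(value)  # directory: descend
        else:
            shared[key] = value          # file data shared by reference
    return shared


def snapshot(source, name):
    """Return a new subvolume sharing file data with `source`."""
    snap = Subvolume(name)
    snap.entries = _share(source.entries)
    return snap


root = Subvolume("root")
root.entries = {"etc": {"hosts": b"127.0.0.1 localhost"}, "home": Subvolume("home")}
snap = snapshot(root, "root-snap")
print(snap.entries["home"])  # the nested subvolume is not part of the snapshot
```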

Taking snapshots of a directory is not possible, as only subvolumes can have snapshots. However, there is a workaround that involves reflinks spread across subvolumes. This way, a new subvolume is created, containing cross-subvolume reflinks to the content of the targeted directory. Having that available, a snapshot of this new subvolume can be created.[51]

A subvolume in Btrfs is quite different from a traditional Logical Volume Manager (LVM) logical volume. With LVM, a logical volume is a separate block device, while a Btrfs subvolume is not and it cannot be treated or used that way.[57]

Send/receive

Given any pair of subvolumes (or snapshots), Btrfs can generate a binary diff between them (by using the btrfs send command) that can be replayed later (by using btrfs receive), possibly on a different Btrfs file system. The send/receive feature effectively creates (and applies) a set of data modifications required for converting one subvolume into another.[41][59]

The send/receive feature can be used with regularly scheduled snapshots for implementing a simple form of file system master/slave replication, or for the purpose of performing incremental backups.[41][59]
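The idea of generating a set of modifications on one side and replaying it on the other can be sketched as follows. This is a conceptual model only: snapshots are represented as dictionaries mapping paths to file contents, and the operation names ("unlink", "write") are invented for illustration and do not reflect the actual Btrfs send stream format.

```python
def send(parent, child):
    """Return the operations that turn snapshot `parent` into `child`."""
    ops = []
    for path in sorted(parent.keys() - child.keys()):
        ops.append(("unlink", path, None))
    for path in sorted(child):
        if parent.get(path) != child[path]:
            ops.append(("write", path, child[path]))
    return ops


def receive(base, stream):
    """Replay a stream of operations on top of a copy of `base`."""
    result = dict(base)
    for op, path, data in stream:
        if op == "unlink":
            del result[path]
        else:
            result[path] = data
    return result


monday = {"a.txt": b"one", "b.txt": b"two", "c.txt": b"three"}
tuesday = {"a.txt": b"one", "b.txt": b"TWO", "d.txt": b"four"}

stream = send(monday, tuesday)  # only the differences travel
backup = receive(monday, stream)  # the receiving side already holds "monday"
print(stream)
```

Because only the differences between two snapshots are transferred, repeating this for each scheduled snapshot yields an incremental backup, provided the receiving side still holds the previous snapshot.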

Quota groups

A quota group (or qgroup) imposes an upper limit to the space a subvolume or snapshot may consume. A new snapshot initially consumes no quota because its data is shared with its parent, but thereafter incurs a charge for new files and copy-on-write operations on existing files. When quotas are active, a quota group is automatically created with each new subvolume or snapshot. These initial quota groups are building blocks which can be grouped (with the btrfs qgroup command) into hierarchies to implement quota pools.[42]

Quota groups apply only to subvolumes and snapshots; quotas cannot be enforced on individual subdirectories, users, or user groups. However, workarounds are possible by using different subvolumes for all users or user groups that require a quota to be enforced.

In-place ext2/3/4 conversion

As the result of having very little metadata anchored in fixed locations, Btrfs can warp to fit unusual spatial layouts of the backend storage devices. The btrfs-convert tool exploits this ability to do an in-place conversion of any ext2/3/4 file system, by nesting the equivalent Btrfs metadata in its unallocated space — while preserving an unmodified copy of the original file system.[39]

The conversion involves creating a copy of the whole ext2/3/4 metadata, while the Btrfs files simply point to the same blocks used by the ext2/3/4 files. This makes the bulk of the blocks shared between the two filesystems before the conversion becomes permanent. Thanks to the copy-on-write nature of Btrfs, the original versions of the file data blocks are preserved during all file modifications. Until the conversion becomes permanent, only the blocks that were marked as free in ext2/3/4 are used to hold new Btrfs modifications, meaning that the conversion can be undone at any time.[39]

All converted files are available and writable in the default subvolume of the Btrfs. A sparse file holding all of the references to the original ext2/3/4 filesystem is created in a separate subvolume, which is mountable on its own as a read-only disk image, allowing both original and converted file systems to be accessed at the same time. Deleting this sparse file frees up the space and makes the conversion permanent.[39]

Union mounting / seed devices

When creating a new Btrfs, an existing Btrfs can be used as a read-only "seed" file system. The new file system will then act as a copy-on-write overlay on the seed, as a form of union mounting. The seed can be later detached from the Btrfs, at which point the rebalancer will simply copy over any seed data still referenced by the new file system before detaching. Mason has suggested this may be useful for a Live CD installer, which might boot from a read-only Btrfs seed on optical disc, rebalance itself to the target partition on the install disk in the background while the user continues to work, then eject the disc to complete the installation without rebooting.[60]

Encryption

In his 2009 interview, Chris Mason stated that support for encryption was planned for Btrfs.[61] In the meantime, a workaround for combining encryption with Btrfs is to use a full-disk encryption mechanism such as dm-crypt/LUKS on the underlying devices, and to create the Btrfs filesystem on top of that layer.

Checking and recovery

Unix systems traditionally rely on "fsck" programs to check and repair filesystems. The btrfsck program is now available but, as of May 2012, it is described by the authors as "relatively new code"[62] which has "not seen widespread testing on a large range of real-life breakage", and which "may cause additional damage in the process of repair".[62]

There is another tool, named btrfs-restore, that can be used to recover files from an unmountable filesystem, without modifying the broken filesystem itself (i.e., non-destructively).[63]

In normal use, Btrfs is mostly self-healing and can recover from broken root trees at mount time, thanks to making periodic data flushes to permanent storage every 30 seconds (which is the default period). Thus, isolated errors will cause a maximum of 30 seconds of filesystem changes to be lost at the next mount.[64] This period can be changed by specifying a desired value (in seconds) for the commit mount option.[65][66]

Design

Ohad Rodeh's original proposal at USENIX 2007 noted that B+ trees, which are widely used as on-disk data structures for databases, could not efficiently allow copy-on-write-based snapshots because their leaf nodes were linked together: if a leaf was copy-on-written, its siblings and parents would have to be as well, as would their siblings and parents and so on until the entire tree was copied. He suggested instead a modified B-tree (which has no leaf linkage), with a refcount associated with each tree node but stored in an ad-hoc free map structure, and certain relaxations to the tree's balancing algorithms to make them copy-on-write friendly. The result would be a data structure suitable for a high-performance object store that could perform copy-on-write snapshots, while maintaining good concurrency.[14]
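Why the absence of leaf linkage matters can be seen in a minimal path-copying sketch. It assumes a fixed two-level tree with immutable nodes and omits balancing and explicit refcounts (Python's own object references stand in for them); Node and cow_update are hypothetical names, not part of any Btrfs code.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    keys: Tuple[int, ...]
    children: Tuple["Node", ...] = ()  # empty for leaves
    values: Tuple[str, ...] = ()       # used only by leaves


def cow_update(node, key, value):
    """Return a new root where `key` maps to `value`; `node` is unchanged.

    Only the nodes on the path from the root to the affected leaf are
    copied. Every other subtree is shared between old and new roots.
    """
    if not node.children:  # leaf: copy it with one value replaced
        idx = node.keys.index(key)
        new_values = node.values[:idx] + (value,) + node.values[idx + 1:]
        return Node(node.keys, (), new_values)
    # Interior node: keys[i] is the smallest key stored under children[i].
    idx = bisect_right(node.keys, key) - 1
    new_child = cow_update(node.children[idx], key, value)
    new_children = node.children[:idx] + (new_child,) + node.children[idx + 1:]
    return Node(node.keys, new_children)


left = Node((1, 2), values=("a", "b"))
right = Node((5, 6), values=("c", "d"))
old_root = Node((1, 5), (left, right))

new_root = cow_update(old_root, 6, "D")
print(new_root.children[0] is left)  # the untouched leaf is shared
```

Keeping old_root around after the update is, in effect, a snapshot: both roots remain valid, and they differ only along one root-to-leaf path. If leaves pointed at their siblings, copying one leaf would force its neighbours to be copied as well.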

At Oracle later that year, Chris Mason began work on a snapshot-capable file system that would use this data structure almost exclusively—not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy-on-write, checksumming and mirroring needed to be implemented only once to benefit the entire file system.[48]

Btrfs is structured as several layers of such trees, all using the same B-tree implementation. The trees store generic items sorted on a 136-bit key. The first 64 bits of the key are a unique object id. The middle 8 bits are an item type field; its use is hardwired into code as an item filter in tree lookups. Objects can have multiple items of multiple types. The remaining right-hand 64 bits are used in type-specific ways. Therefore items for the same object end up adjacent to each other in the tree, ordered by type. By choosing certain right-hand key values, objects can further put items of the same type in a particular order.[48][67]
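The key layout described above can be sketched with Python's struct module. The field names, the little-endian packing and the numeric type codes below are assumptions made for illustration; only the 64 + 8 + 64 bit split and the resulting sort order come from the description.

```python
import struct
from typing import NamedTuple


class Key(NamedTuple):
    objectid: int  # 64 bits: which object the item belongs to
    type: int      # 8 bits: what kind of item this is
    offset: int    # 64 bits: meaning depends on the item type


KEY_FORMAT = "<QBQ"  # unpadded: 8 + 1 + 8 bytes = 17 bytes = 136 bits

# Illustrative type codes (assumed values, chosen only for this example).
INODE_ITEM, INODE_REF, DIR_INDEX, EXTENT_DATA = 1, 12, 96, 108


def pack_key(key):
    return struct.pack(KEY_FORMAT, *key)


def unpack_key(raw):
    return Key(*struct.unpack(KEY_FORMAT, raw))


items = [
    Key(257, DIR_INDEX, 3),
    Key(256, INODE_ITEM, 0),
    Key(257, INODE_ITEM, 0),
    Key(256, EXTENT_DATA, 0),
    Key(257, INODE_REF, 256),
]

# Tuples compare field by field, so sorting orders items by object id,
# then by type, then by the type-specific offset.
for key in sorted(items):
    print(key)
```

Sorting places all items of object 256 before those of object 257, and within each object orders them by type, which is the adjacency property the paragraph describes.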

Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.[48]
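The leaf layout, with item headers growing from the front and item data growing from the back, can be modelled on a fixed-size byte buffer. This is a simplified sketch: the Leaf class is hypothetical, the node header is omitted, the header field widths are assumptions, and items are assumed to be appended in key order.

```python
import struct

# Assumed item header: 17-byte key, then data offset and data size.
HEADER = struct.Struct("<QBQII")


class Leaf:
    def __init__(self, size=4096):
        self.buf = bytearray(size)
        self.nritems = 0
        self.data_start = size  # item data grows downward from the end

    def free_space(self):
        # Gap between the end of the headers and the start of the data.
        return self.data_start - self.nritems * HEADER.size

    def append(self, key, data):
        """Add an item; return False if the leaf is full."""
        if HEADER.size + len(data) > self.free_space():
            return False  # a real tree would split the leaf here
        self.data_start -= len(data)
        self.buf[self.data_start:self.data_start + len(data)] = data
        HEADER.pack_into(self.buf, self.nritems * HEADER.size,
                         *key, self.data_start, len(data))
        self.nritems += 1
        return True

    def item(self, i):
        objectid, type_, offset, start, size = HEADER.unpack_from(
            self.buf, i * HEADER.size)
        return (objectid, type_, offset), bytes(self.buf[start:start + size])


leaf = Leaf(size=100)
leaf.append((1, 1, 0), b"x" * 30)
leaf.append((1, 12, 5), b"name")
print(leaf.free_space(), leaf.item(1))
```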

Root tree

Every tree appears as an object in the root tree (or tree of tree roots). Some trees, such as file system trees and log trees, have a variable number of instances, each of which is given its own object id. Trees which are singletons (the data relocation, extent and chunk trees) are assigned special, fixed object ids ≤256. The root tree appears in itself as a tree with object id 1.

Trees refer to each other by object id. They may also refer to individual nodes in other trees as a triplet of the tree's object id, the node's level within the tree and its leftmost key value. Such references are independent of where the tree is actually stored.

File system tree

User-visible files and directories all live in a file system tree. There is one file system tree per subvolume. Subvolumes can nest, in which case they appear as a directory item (described below) whose data is a reference to the nested subvolume's file system tree.

Within the file system tree, each file and directory object has an inode item. Extended attributes and ACL entries are stored alongside in separate items.

Within each directory, directory entries appear as directory items, whose right-hand key values are a CRC32C hash of their filename. Their data is a location key, that is, the key of the inode item it points to. Directory items together can thus act as an index for path-to-inode lookups, but are not used for iteration because they are sorted by their hash, effectively randomly permuting them. User applications iterating over and opening files in a large directory would therefore generate many more disk seeks between non-adjacent files, a notable performance drain in other file systems with hash-ordered directories such as ReiserFS,[68] ext3 (with Htree-indexes enabled[69]) and ext4, all of which have TEA-hashed filenames. To avoid this, each directory entry also has a directory index item, whose right-hand key value is set to a per-directory counter that increments with each new directory entry. Iteration over these index items thus returns entries in roughly the same order as they are stored on disk.
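The two kinds of directory entries can be sketched as two dictionaries, one keyed by a name hash for lookups and one keyed by an insertion counter for iteration. The Directory class is a hypothetical model; the hash used is the standard CRC-32C, which may differ from the exact seed and finalisation Btrfs applies, and the counter's starting value is an assumption.

```python
def crc32c(data):
    """Bitwise CRC-32C (Castagnoli polynomial, reflected form)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


class Directory:
    def __init__(self):
        self.dir_items = {}  # name hash -> list of (name, inode number)
        self.dir_index = {}  # insertion counter -> (name, inode number)
        self.next_index = 0  # assumed starting value

    def add(self, name, inode):
        bucket = self.dir_items.setdefault(crc32c(name.encode()), [])
        bucket.append((name, inode))
        self.dir_index[self.next_index] = (name, inode)
        self.next_index += 1

    def lookup(self, name):
        """Path-to-inode lookup through the hash-keyed items."""
        for entry_name, inode in self.dir_items.get(crc32c(name.encode()), []):
            if entry_name == name:
                return inode
        return None

    def listdir(self):
        """Iterate through the index items, i.e. in creation order."""
        return [name for _, (name, _inode) in sorted(self.dir_index.items())]

    def hash_order(self):
        """Order iteration would have if it walked the hash-keyed items."""
        return [name for _, bucket in sorted(self.dir_items.items())
                for name, _inode in bucket]


d = Directory()
for number, name in enumerate(["alpha", "beta", "gamma", "delta"], start=257):
    d.add(name, number)
print(d.lookup("gamma"), d.listdir(), d.hash_order())
```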

Besides inode items, files and directories also have a reference item whose right-hand key value is the object id of their parent directory. The data part of the reference item is the filename that inode is known by in that directory. This allows upward traversal through the directory hierarchy by providing a way to map inodes back to paths.

Files with hard links in multiple directories have multiple reference items, one for each parent directory. Files with multiple hard links in the same directory pack all of the links' filenames into the same reference item. This was a design flaw that limited the number of same-directory hard links to however many could fit in a single tree block. (On the default block size of 4 KB, an average filename length of 8 bytes and a per-filename header of 4 bytes, this would be less than 350.) Applications which made heavy use of multiple same-directory hard links, such as git, GNUS, GMame and BackupPC were later observed to fail after hitting this limit.[70] The limit was eventually removed[71] (and as of October 2012 has been merged[72] pending release in Linux 3.7) by introducing spillover extended reference items to hold hard link filenames which could not otherwise fit.
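The estimate in parentheses can be reproduced with a short calculation. It uses only the figures quoted above and ignores the tree block's own header and the item header, which would lower the real limit further.

```python
block_size = 4096     # bytes in one tree block (the quoted default)
avg_name_len = 8      # bytes per filename (the quoted average)
per_name_header = 4   # bytes of header per filename (as quoted)

# Upper bound on same-directory hard links that fit in one reference item.
upper_bound = block_size // (avg_name_len + per_name_header)
print(upper_bound)  # 341
```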

Extents

File data are kept outside the tree in extents, which are contiguous runs of disk blocks. Extent blocks default to 4 KiB in size, do not have headers and contain only (possibly compressed) file data. In compressed extents, individual blocks are not compressed separately; rather, the compression stream spans the entire extent.

Files have extent data items to track the extents which hold their contents. The item's right-hand key value is the starting byte offset of the extent. This makes for efficient seeks in large files with many extents, because the correct extent for any given file offset can be computed with just one tree lookup.
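Because extent data items are keyed by their starting byte offset, finding the extent that covers a given file offset is a single search over sorted keys. The sketch below uses a sorted Python list and bisect in place of a B-tree lookup; the tuple layout and the find_extent name are assumptions for illustration.

```python
from bisect import bisect_right


def find_extent(extent_items, file_offset):
    """Map a file offset to a disk byte address, or None for a hole.

    extent_items: list of (start_offset, length, disk_start) tuples,
    sorted by start_offset, as the tree would keep them.
    """
    starts = [item[0] for item in extent_items]
    i = bisect_right(starts, file_offset) - 1  # last extent starting at or before
    if i < 0:
        return None
    start, length, disk_start = extent_items[i]
    if file_offset >= start + length:
        return None  # falls in a gap between extents
    return disk_start + (file_offset - start)


extents = [
    (0, 4096, 1_000_000),
    (4096, 8192, 5_000_000),
    (16384, 4096, 2_000_000),
]
print(find_extent(extents, 5000))   # 5000904
print(find_extent(extents, 12288))  # None (hole before offset 16384)
```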

Snapshots and cloned files share extents. When a small part of a large such extent is overwritten, the resulting copy-on-write may create three new extents: a small one containing the overwritten data, and two large ones with unmodified data on either side of the overwrite. To avoid having to re-write unmodified data, the copy-on-write may instead create bookend extents, or extents which are simply slices of existing extents. Extent data items allow for this by including an offset into the extent they are tracking: items for bookends are those with non-zero offsets.[67]
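A bookend split can be sketched by treating each extent data item as a slice (offset and length) of an underlying extent. The ExtentItem fields and the overwrite function are hypothetical names; extents are identified by short labels rather than disk addresses.

```python
from typing import NamedTuple


class ExtentItem(NamedTuple):
    file_offset: int    # where this slice appears in the file
    extent: str         # label of the underlying on-disk extent
    extent_offset: int  # offset of the slice within that extent
    length: int


def overwrite(items, offset, length, new_extent):
    """Overwrite [offset, offset + length) without rewriting old data.

    Overlapped items are trimmed into bookends that still point into
    the original extents; only the new data gets a new extent.
    """
    end = offset + length
    result = []
    for it in items:
        it_end = it.file_offset + it.length
        if it_end <= offset or it.file_offset >= end:
            result.append(it)  # untouched by the overwrite
            continue
        if it.file_offset < offset:  # left bookend
            result.append(it._replace(length=offset - it.file_offset))
        if it_end > end:             # right bookend, non-zero extent offset
            skipped = end - it.file_offset
            result.append(ExtentItem(end, it.extent,
                                     it.extent_offset + skipped, it_end - end))
    result.append(ExtentItem(offset, new_extent, 0, length))
    return sorted(result)


before = [ExtentItem(0, "A", 0, 1048576)]
for item in overwrite(before, 4096, 4096, "B"):
    print(item)
```

After the overwrite, the file is described by three items, but only 4 KiB of new data is written; extent "A" remains on disk in full and is still referenced by both bookends.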

If the file data is small enough to fit inside a tree node, it is instead pulled in-tree and stored inline in the extent data item. Each tree node is stored in its own tree block—a single uncompressed block with a header. The tree block is regarded as a free-standing, single-block extent.

Extent allocation tree

The extent allocation tree acts as an allocation map for the file system. Unlike other trees, items in this tree do not have object ids and represent regions of space: their left-hand and right-hand key values are the starting offsets and lengths of the regions they represent.

The file system zones its allocated space into block groups, which are variable-sized allocation regions that alternate successively between preferring metadata extents (tree nodes) and data extents (file contents). The default ratio of data to metadata block groups is 1:2. They are intended to work like the Orlov block allocator and block groups in ext3 in allocating related files together and resisting fragmentation by leaving allocation gaps between groups. (ext3 block groups, however, have fixed locations computed from the size of the file system, whereas those in Btrfs are dynamic and are created as needed.) Each block group is associated with a block group item. Inode items in the file system tree include a reference to their current block group.[67]

Extent items contain a back-reference to the tree node or file occupying that extent. There may be multiple back-references if the extent is shared between snapshots. If there are too many back-references to fit in the item, they spill out into individual extent data reference items. Tree nodes, in turn, have back-references to their containing trees. This makes it possible to find which extents or tree nodes are in any region of space by doing a B-tree range lookup on a pair of offsets bracketing that region, then following the back-references. For relocating data, this allows an efficient upwards traversal from the relocated blocks to quickly find and fix all downwards references to those blocks, without having to walk the entire file system. This, in turn, allows the file system to efficiently shrink, migrate and defragment its storage online.

The extent allocation tree, as with all other trees in the file system, is copy-on-write. Writes to the file system may thus cause a cascade whereby changed tree nodes and file data result in new extents being allocated, causing the extent tree to itself change. To avoid creating a feedback loop, extent tree nodes which are still in memory but not yet committed to disk may be updated in-place to reflect new copy-on-written extents.

In theory, the extent allocation tree makes a conventional free-space bitmap unnecessary because the extent allocation tree acts as a B-tree version of a BSP tree. In practice, however, an in-memory red-black tree of page-sized bitmaps is used to speed up allocations. These bitmaps are persisted to disk (starting in Linux 2.6.37, via the space_cache mount option[73]) as special extents that are exempt from checksumming and copy-on-write. The extent items tracking these extents are stored in the root tree.

Checksum tree and scrubbing

CRC-32C checksums are computed for both data and metadata and stored as checksum items in a checksum tree. There is room for 256 bits of metadata checksum and up to a full leaf block (roughly 4 KB or more) of data checksums. More checksum algorithm options are planned for the future.[29][74]

There is one checksum item per contiguous run of allocated blocks, with per-block checksums packed end-to-end into the item data. If there are more checksums than can fit, they spill rightwards over into another checksum item in a new leaf. If the file system detects a checksum mismatch while reading a block, it first tries to obtain (or create) a good copy of this block from another device, provided that internal mirroring or RAID techniques are in use.[75][76]
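
As a rough illustration of per-block checksums packed end-to-end, the following Python sketch computes the standard CRC-32C (Castagnoli) checksum bit by bit and concatenates one 4-byte value per block. The 4096-byte block size, the little-endian packing and the function names are assumptions made for the example; the sketch does not claim to reproduce the exact on-disk item layout or the seeding conventions Btrfs uses.

```python
import struct

BLOCK_SIZE = 4096  # assumed block size for this example


def crc32c(data: bytes) -> int:
    """Standard CRC-32C (Castagnoli), bitwise and unoptimised."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x82F63B78 is the reflected Castagnoli polynomial.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def pack_checksums(run: bytes) -> bytes:
    """Pack one 4-byte checksum per block of a contiguous run, end-to-end."""
    blocks = (run[i:i + BLOCK_SIZE] for i in range(0, len(run), BLOCK_SIZE))
    return b"".join(struct.pack("<I", crc32c(block)) for block in blocks)


def find_bad_blocks(run: bytes, item: bytes) -> list:
    """Return the indices of blocks whose checksum no longer matches."""
    stored = struct.unpack(f"<{len(item) // 4}I", item)
    return [
        i for i, expected in enumerate(stored)
        if crc32c(run[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]) != expected
    ]
```

A read path built on this idea would call `find_bad_blocks` on the data it fetched and, for each mismatching block, fall back to another copy if one exists.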

Btrfs can initiate an online check of the entire file system by triggering a scrub job that runs in the background. The scrub job scans the entire file system for integrity, reports any bad blocks it finds along the way and automatically attempts to repair them.[75][77]

Log tree

An fsync is a request to commit modified data immediately to stable storage. fsync-heavy workloads (such as databases) could potentially generate a great deal of redundant write I/O by forcing the file system to repeatedly copy-on-write and flush frequently modified parts of trees to storage. To avoid this, a temporary per-subvolume log tree is created to journal fsync-triggered copy-on-writes. Log trees are self-contained, tracking their own extents and keeping their own checksum items. Their items are replayed and deleted at the next full tree commit or (if there was a system crash) at the next remount.

Chunk and device trees

Block devices are divided into chunks of 256 MB or more. Chunks may be mirrored or striped across multiple devices. The mirroring/striping arrangement is transparent to the rest of the file system, which simply sees the single, logical address space that chunks are mapped into.

This is all tracked by the chunk tree, where each device is represented as a device item and each mapping from a logical chunk to its underlying physical chunks is stored in a chunk map item. The device tree is the inverse of the chunk tree, and contains device extent items which map byte ranges of block devices back to individual chunks. As in the extent allocation tree, this allows Btrfs to efficiently shrink or remove devices from volumes by locating the chunks they contain (and relocating their contents).
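
The two directions of this mapping can be sketched in Python as follows. The sketch assumes mirrored chunks only (no striping), uses a sorted list in place of the chunk tree, and answers the device-to-chunk question with a simple scan rather than a separate device tree; `Stripe`, `ChunkMap` and `ChunkTree` are hypothetical names, not Btrfs structures.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Stripe:
    device_id: int
    physical_offset: int   # where this copy of the chunk starts on the device


@dataclass
class ChunkMap:
    logical_start: int
    length: int
    stripes: list          # one Stripe per mirrored copy (assumption: no striping)


class ChunkTree:
    def __init__(self, chunks):
        self.chunks = sorted(chunks, key=lambda c: c.logical_start)
        self.starts = [c.logical_start for c in self.chunks]

    def resolve(self, logical):
        """Map a logical address to (device_id, physical_offset) for each copy."""
        i = bisect.bisect_right(self.starts, logical) - 1
        if i < 0:
            raise KeyError(f"logical address {logical} is not mapped")
        chunk = self.chunks[i]
        delta = logical - chunk.logical_start
        if delta >= chunk.length:
            raise KeyError(f"logical address {logical} is not mapped")
        return [(s.device_id, s.physical_offset + delta) for s in chunk.stripes]

    def chunks_on_device(self, device_id):
        """Inverse lookup: logical start of every chunk with a copy on the device."""
        return [
            c.logical_start for c in self.chunks
            if any(s.device_id == device_id for s in c.stripes)
        ]


MIB = 1 << 20
tree = ChunkTree([
    ChunkMap(0, 256 * MIB, [Stripe(1, 1 * MIB), Stripe(2, 1 * MIB)]),
    ChunkMap(256 * MIB, 256 * MIB, [Stripe(1, 257 * MIB)]),
])
```

Removing device 2 in this model would start from `tree.chunks_on_device(2)`, which lists exactly the chunks whose contents have to be relocated.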

The file system, chunks and devices are all assigned a Universally Unique Identifier (UUID). The header of every tree node contains both the UUID of its containing chunk and the UUID of the file system. The chunks containing the chunk tree, the root tree, device tree and extent tree are always mirrored—even on single-device volumes. These are all intended to improve the odds of successful data salvage in the event of media errors.

Relocation trees

Defragmentation, shrinking and rebalancing operations require extents to be relocated. However, doing a simple copy-on-write of the relocating extent will break sharing between snapshots and consume disk space. To preserve sharing, an update-and-swap algorithm is used, with a special relocation tree serving as scratch space for affected metadata. The extent to be relocated is first copied to its destination. Then, by following back-references upward through the affected subvolume's file system tree, metadata pointing to the old extent is progressively updated to point at the new one; any newly updated items are stored in the relocation tree. Once the update is complete, items in the relocation tree are swapped with their counterparts in the affected subvolume, and the relocation tree is discarded.[78]

Superblock

All the file system's trees—including the chunk tree itself—are stored in chunks, creating a potential chicken-and-egg problem when mounting the file system. To bootstrap into a mount, a list of physical addresses of chunks belonging to the chunk and root trees must be stored in the superblock.[79]

Superblock mirrors are kept at fixed locations:[80] 64 KiB into every block device, with additional copies at 64 MiB, 256 GiB and 1 PiB. When a superblock mirror is updated, its generation number is incremented. At mount time, the copy with the highest generation number is used. All superblock mirrors are updated in tandem, except in SSD mode which alternates updates among mirrors to provide some wear levelling.
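
The mount-time selection can be sketched in a few lines of Python. The offsets are the ones listed above; the 4096-byte superblock size and the function names are assumptions made for the example, and validation of each copy is taken to have happened already.

```python
KIB, MIB, GIB, PIB = 1 << 10, 1 << 20, 1 << 30, 1 << 50

# Fixed mirror locations, as listed in the text above.
SUPERBLOCK_OFFSETS = [64 * KIB, 64 * MIB, 256 * GIB, 1 * PIB]


def mirror_offsets(device_size, superblock_size=4096):
    """Offsets that fit on a device of the given size.

    The 4096-byte superblock size is an assumption of this sketch.
    """
    return [
        off for off in SUPERBLOCK_OFFSETS
        if off + superblock_size <= device_size
    ]


def pick_superblock(generations):
    """Choose the copy with the highest generation number.

    `generations` maps offset -> generation for every copy that was read
    and validated; returns an (offset, generation) pair.
    """
    return max(generations.items(), key=lambda item: item[1])
```

On a 1 TiB device, for instance, `mirror_offsets` yields only the first three locations, because the 1 PiB copy lies beyond the end of the device.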

See also

Notes

  1. ^ a b This is Btrfs's own on-disk size limit. The limit is reduced to 8 EiB on 64-bit systems and 2 EiB on 32-bit systems due to the Linux kernel's internal limits, unless the kernel's CONFIG_LBD configuration option (available since the 2.6.x kernel series) is enabled to remove these kernel limits.[3][4]

References

  1. ^ "Btrfs Contributors at kernel.org". kernel.org. 18 January 2016. Retrieved 20 January 2016.
  2. ^ a b "Suse Documentation: Storage Administration Guide – Large File Support in Linux". SUSE. Retrieved 12 August 2015.
  3. ^ Andreas Jaeger (15 February 2005). "Large File Support in Linux". users.suse.com. Retrieved 12 August 2015.
  4. ^ "Linux kernel configuration help for CONFIG_LBD in 2.6.29 on x86". kernel.xc.net. Retrieved 12 August 2015.
  5. ^ Jonathan Corbet (26 July 2010). "File creation times". LWN.net. Retrieved 15 August 2015.
  6. ^ "btrfs Wiki". kernel.org. Retrieved 19 April 2015.
  7. ^ "LZ4 For Btrfs Arrives While Its FSCK Remains M.I.A." phoronix.com. Retrieved 19 April 2015.
  8. ^ a b c d McPherson, Amanda (22 June 2009). "A Conversation with Chris Mason on BTRfs: the next generation file system for Linux". Linux Foundation. Archived from the original on 24 June 2012. Retrieved 1 September 2009.
  9. ^ "Deduplication". kernel.org. Retrieved 19 April 2015.
  10. ^ Henson, Valerie (31 January 2008). Chunkfs: Fast file system check and repair. Melbourne, Australia. Event occurs at 18m 49s. Retrieved 5 February 2008. It's called Butter FS or B-tree FS, but all the cool kids say Butter FS
  11. ^ "Official Btrfs Wiki". BTRFS Wiki. Archived from the original on 25 August 2014. Retrieved 25 August 2014.
  12. ^ Kerner, Sean Michael (30 October 2008). "A Better File System For Linux". InternetNews.com. Archived from the original on 24 June 2012. Retrieved 30 October 2008.
  13. ^ a b Layton, Jeffrey B. (14 July 2009). "File System Evangelist and Thought Leader: An Interview with Valerie Aurora". Linux Magazine. Retrieved 27 March 2012.
  14. ^ a b Rodeh, Ohad (2007). B-trees, shadowing, and clones (PDF). USENIX Linux Storage & Filesystem Workshop. Also Rodeh, Ohad (2008). "B-trees, shadowing, and clones". ACM Transactions on Storage.
  15. ^ a b "Lead Btrfs File-System Developers Join Facebook". phoronix.com. Retrieved 19 April 2015.
  16. ^ Paul, Ryan (13 April 2009). "Panelists ponder the kernel at Linux Collaboration Summit". Ars Technica. Archived from the original on 24 June 2012. Retrieved 22 August 2009.
  17. ^ Ts'o, Theodore (1 August 2008). "Re: reiser4 for 2.6.27-rc1". linux-kernel (Mailing list). Retrieved 31 December 2010.
  18. ^ "Development timeline". Btrfs wiki. 11 December 2008. Archived from the original on 20 December 2008. Retrieved 5 November 2011.
  19. ^ Wuelfing, Britta (12 January 2009). "Kernel 2.6.29: Corbet Says Btrfs Next Generation Filesystem". Linux Magazine. Retrieved 5 November 2011.
  20. ^ "Red Hat Enterprise Linux 6 documentation: Technology Previews".
  21. ^ "Fedora Weekly News Issue 276". 25 May 2011.
  22. ^ "Debian 6.0 "Squeeze" released" (Press release). Debian. 6 February 2011. Retrieved 8 February 2011. Support has also been added for the ext4 and Btrfs filesystems...
  23. ^ "SLES 11 SP2 Release Notes". 21 August 2012. Retrieved 29 August 2012.
  24. ^ "Oracle Linux Technical Information". Retrieved 14 November 2012.
  25. ^ "SUSE Linux Enterprise Server 12 Release Notes". 5 November 2015. Retrieved 20 January 2016.
  26. ^ Brown, Eric (22 July 2011). "Linux 3.0 scrubs up Btrfs, gets more Xen". Linux devices. eWeek. Retrieved 8 November 2011.
  27. ^ Leemhuis, Thorsten (21 June 2011). "Kernel Log: Coming in 3.0 (Part 2) - Filesystems". The H Open. Retrieved 8 November 2011.
  28. ^ Sam Varghese. "iTWire". itwire.com. Retrieved 19 April 2015.
  29. ^ a b c "Btrfs Wiki: Features". btrfs.wiki.kernel.org. 27 November 2013. Retrieved 27 November 2013.
  30. ^ "Btrfs Wiki: Changelog". btrfs.wiki.kernel.org. 8 November 2013. Retrieved 27 November 2013.
  31. ^ "Using Btrfs with Multiple Devices". kernel.org. 7 November 2013. Retrieved 20 November 2013.
  32. ^ Chris Mason (2 February 2013). "RAID 5/6 code merged into Btrfs". LWN.net. Retrieved 4 December 2013.
  33. ^ "How to use Btrfs RAID5/6". marc.merlins.org. 23 March 2014. Retrieved 19 December 2014.
  34. ^ "RAID5/6 (Btrfs documentation)". kernel.org. 11 April 2014. Retrieved 19 December 2014.
  35. ^ "Compression". kernel.org. 25 June 2013. Retrieved 1 April 2014.
  36. ^ "Btrfs: add support for inode properties". kernel.org. 28 January 2014. Retrieved 1 April 2014.
  37. ^ "btrfs: Readonly snapshots". Retrieved 12 December 2011.
  38. ^ "Wiki FAQ: What checksum function does Btrfs use?". Btrfs wiki. Retrieved 15 June 2009.
  39. ^ a b c d Mason, Chris (1 April 2008). "Conversion from Ext3 (Btrfs documentation)". kernel.org. Retrieved 23 May 2012.
  40. ^ Mason, Chris (12 January 2009). "Btrfs changelog". Retrieved 12 February 2012.
  41. ^ a b c Corbet, Jonathan (11 July 2012), Btrfs send/receive, LWN.net, retrieved 14 November 2012
  42. ^ a b Jansen, Arne (2011), Btrfs Subvolume Quota Groups (PDF), Strato AG, retrieved 14 November 2012
  43. ^ "Btrfs Wiki: Deduplication". 13 September 2013. Retrieved 27 November 2013.
  44. ^ Corbet, Jonathan (2 November 2011). "A btrfs update at LinuxCon Europe". Retrieved 12 February 2012.
  45. ^ Mazzoleni, Andrea. "btrfs: lib: raid: New RAID library supporting up to six parities". Retrieved 16 March 2014.
  46. ^ "Btrfs Wiki: Incremental Backup". 27 May 2013. Retrieved 27 November 2013.
  47. ^ a b "Btrfs Project ideas". 21 February 2013. Retrieved 21 February 2013.
  48. ^ a b c d Aurora, Valerie (22 July 2009). "A short history of btrfs". LWN.net. Retrieved 5 November 2011.
  49. ^ Hilzinger, Marcel (22 April 2009). "Future of Btrfs Secured". Linux Magazine. Retrieved 5 November 2011.
  50. ^ Jonathan Corbet (5 May 2009). "The two sides of reflink()". LWN.net. Retrieved 17 October 2013.
  51. ^ a b "UseCases – btrfs documentation". kernel.org. Retrieved 4 November 2013.
  52. ^ "btrfs: allow cross-subvolume file clone". github.com. Retrieved 4 November 2013.
  53. ^ "Symlinks reference names, hardlinks reference meta-data and reflinks reference data". pixelbeat.org. 27 October 2010. Retrieved 17 October 2013.
  54. ^ Meyering, Jim (20 August 2009). "GNU coreutils NEWS: Noteworthy changes in release 7.5". savannah.gnu.org. Retrieved 30 August 2009.
  55. ^ Scrivano, Giuseppe (1 August 2009). "cp: accept the --reflink option". savannah.gnu.org. Retrieved 2 November 2009.
  56. ^ a b c d "SysadminGuide – Btrfs documentation". kernel.org. Retrieved 31 October 2013.
  57. ^ a b c "5.6 Creating Subvolumes and Snapshots". oracle.com. 2013. Retrieved 31 October 2013.
  58. ^ a b "5.7 Using the Send/Receive Feature". oracle.com. 2013. Retrieved 31 October 2013.
  59. ^ Mason, Chris (5 April 2012), Btrfs Filesystem: Status and New Features, Linux Foundation, retrieved 16 November 2012
  60. ^ Amanda McPherson (22 June 2009). "A Conversation with Chris Mason on BTRfs: the next generation file system for Linux". Linux Foundation. Retrieved 9 October 2014. In future releases we plan to add online fsck, deduplication, encryption and other features that have been on admin wish lists for a long time.
  61. ^ a b "btrfsck".
  62. ^ "Restore".
  63. ^ "Problem FAQ - btrfs Wiki". kernel.org. 31 July 2013. Retrieved 16 January 2014.
  64. ^ "kernel/git/torvalds/linux.git: Documentation: filesystems: add new btrfs mount options (Linux kernel source tree)". kernel.org. 21 November 2013. Retrieved 6 February 2014.
  65. ^ "Mount options - btrfs Wiki". kernel.org. 12 November 2013. Retrieved 16 January 2014.
  66. ^ a b c Mason, Chris. "Btrfs design". Btrfs wiki. Retrieved 8 November 2011.
  67. ^ Reiser, Hans (7 December 2001). "Re: Ext2 directory index: ALS paper and benchmarks". ReiserFS developers mailing list. Retrieved 28 August 2009.
  68. ^ Mason, Chris. "Acp". Oracle personal web page. Retrieved 5 November 2011.
  69. ^ Fasheh, Mark (9 October 2012), btrfs: extended inode refs, retrieved 7 November 2012
  70. ^ Torvalds, Linus (10 October 2012), "Pull btrfs update from Chris Mason", git.kernel.org, retrieved 7 November 2012
  71. ^ Larabel, Michael (24 December 2010). "Benchmarks Of The Btrfs Space Cache Option". Phoronix. Retrieved 16 November 2012.
  72. ^ "FAQ - btrfs Wiki: What checksum function does Btrfs use?". The btrfs Project. Retrieved 19 September 2013.
  73. ^ a b Bierman, Margaret; Grimmer, Lenz (August 2012). "How I Use the Advanced Capabilities of Btrfs". Retrieved 20 September 2013.
  74. ^ Salter, Jim (15 January 2014). "Bitrot and atomic COWs: Inside "next-gen" filesystems". Ars Technica. Retrieved 15 January 2014.
  75. ^ Coekaerts, Wim (28 September 2011). "btrfs scrub - go fix corruptions with mirror copies please!". Retrieved 20 September 2013.
  76. ^ Mason, Chris; Rodeh, Ohad; Bacik, Josef (9 July 2012), BTRFS: The Linux B-tree Filesystem (PDF), IBM Research, retrieved 12 November 2012
  77. ^ Mason, Chris (30 April 2008). "Multiple device support". Btrfs wiki. Archived from the original on 20 July 2011. Retrieved 5 November 2011.
  78. ^ Bartell, Sean (20 April 2010). "Re: Restoring BTRFS partition". linux-btrfs (Mailing list).