Ext3: Difference between revisions
m →External links: + write barriers.
GregorySmith: Add more references for fsync issues. Remove the idea that a high quality component should be inherently trustworthy, since the data doesn't support that.
Revision as of 19:16, 27 January 2013
Developer(s) | Stephen Tweedie |
---|---|
Full name | Third extended file system |
Introduced | November 2001 with Linux 2.4.15 |
Partition IDs | 0x83 (MBR) EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (GPT) |
Structures | |
Directory contents | Table, hashed B-tree with dir_index enabled |
File allocation | bitmap (free space), table (metadata) |
Bad blocks | Table |
Limits | |
Max volume size | 2 TiB – 16 TiB |
Max file size | 16 GiB – 2 TiB |
Max no. of files | Variable, allocated at creation time[1] |
Max filename length | 255 bytes |
Allowed filename characters | All bytes except NULL ('\0') and '/' |
Features | |
Dates recorded | modification (mtime), attribute modification (ctime), access (atime) |
Date range | December 14, 1901 - January 18, 2038 |
Date resolution | 1s |
Attributes | allow-undelete, append-only, h-tree (directory), immutable, journal, no-atime, no-dump, secure-delete, synchronous-write, top (directory) |
File system permissions | Unix permissions, ACLs and arbitrary security attributes (Linux 2.6 and later) |
Transparent compression | No |
Transparent encryption | No (provided at the block device level) |
Data deduplication | No |
Other | |
Supported operating systems | Linux, BSD, Windows (through an IFS) |
ext3, or third extended filesystem, is a journaled file system that is commonly used by the Linux kernel. It is the default file system for many popular Linux distributions, including Debian. Stephen Tweedie first revealed that he was working on extending ext2 in the 1998 paper Journaling the Linux ext2fs Filesystem, and later in a February 1999 kernel mailing list posting. The filesystem was merged into the mainline Linux kernel in November 2001, from version 2.4.15 onward.[2][3][4] Its main advantage over ext2 is journaling, which improves reliability and eliminates the need to check the file system after an unclean shutdown. Its successor is ext4.
Advantages
The performance (speed) of ext3 is less attractive than that of competing Linux filesystems, such as ext4, JFS, ReiserFS and XFS. However, ext3 has a significant advantage in that it allows in-place upgrades from ext2 without having to back up and restore data. Benchmarks suggest that ext3 also uses less CPU power than ReiserFS and XFS.[5][6] It is also considered safer than the other Linux file systems, due to its relative simplicity and wider testing base.[7][8]
ext3 adds the following features to ext2:
- A journal
- Online file system growth
- HTree indexing for larger directories[9]
Without these features, any ext3 file system is also a valid ext2 file system. This situation has allowed well-tested and mature file system maintenance utilities for maintaining and repairing ext2 file systems to also be used with ext3 without major changes. The ext2 and ext3 file systems share the same standard set of utilities, e2fsprogs, which includes an fsck tool. The close relationship also makes conversion between the two file systems (both forward to ext3 and backward to ext2) straightforward.
ext3 lacks "modern" filesystem features, such as dynamic inode allocation and extents. This might sometimes be a disadvantage, but for recoverability it is a significant advantage. The file system metadata is all in fixed, well-known locations, and data structures have some redundancy. In cases of significant data corruption, ext2 or ext3 may be recoverable, while a tree-based file system may not.
Size limits
The max number of blocks for ext3 is 2^32. The size of a block can vary, affecting the max number of files and the max size of the file system:[10]
Block size | Maximum file size | Maximum file system size |
---|---|---|
1 KiB | 16 GiB | 2 TiB |
2 KiB | 256 GiB | 8 TiB |
4 KiB | 2 TiB | 16 TiB |
8 KiB[limits 1] | 2 TiB | 32 TiB |
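The figures in the table can be approximated with a little arithmetic. The sketch below is illustrative only and rests on assumptions that the text above does not state: an inode with 12 direct block pointers plus single, double and triple indirect blocks, 4-byte block pointers, and a per-file cap of 2^32 sectors of 512 bytes. The function names are hypothetical.

```python
# Assumptions (not stated in the article): 12 direct block pointers per inode,
# 4-byte block pointers, and a per-file cap of 2**32 sectors of 512 bytes.
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4

MAX_BLOCKS = 2 ** 32          # block-count limit quoted in the text
SECTOR_CAP = 2 ** 32 * 512    # assumed per-file cap (2 TiB)


def max_file_size(block_size: int) -> int:
    """Largest file, in bytes, reachable through the assumed block map."""
    pointers = block_size // 4            # pointers held by one indirect block
    blocks = 12 + pointers + pointers ** 2 + pointers ** 3
    return min(blocks * block_size, SECTOR_CAP)


def max_volume_bound(block_size: int) -> int:
    """Upper bound on volume size implied by the 2**32 block limit alone."""
    return MAX_BLOCKS * block_size


for kib in (1, 2, 4, 8):
    bs = kib * KIB
    print(f"{kib} KiB blocks: file <= {max_file_size(bs) // GIB} GiB, "
          f"volume <= {max_volume_bound(bs) // TIB} TiB")
```

Under these assumptions the file-size results agree with the table (16 GiB, 256 GiB, 2 TiB, 2 TiB). The volume bound agrees for 2, 4 and 8 KiB blocks; for 1 KiB blocks the simple product is 4 TiB, so the 2 TiB listed in the table presumably reflects a further limit that this sketch does not model.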
Journaling levels
There are three levels of journaling available in the Linux implementation of ext3:
- Journal (lowest risk)
- Both metadata and file contents are written to the journal before being committed to the main file system. Because the journal is relatively contiguous on disk, this can improve performance, if the journal has enough space. In other cases, performance gets worse, because the data must be written twice: once to the journal, and once to the main part of the filesystem.[11]
- Ordered (medium risk)
- Only metadata is journaled; file contents are not, but it's guaranteed that file contents are written to disk before associated metadata is marked as committed in the journal. This is the default on many Linux distributions. If there is a power outage or kernel panic while a file is being written or appended to, the journal will indicate that the new file or appended data has not been "committed", so it will be purged by the cleanup process. (Thus appends and new files have the same level of integrity protection as the "journaled" level.) However, files being overwritten can be corrupted because the original version of the file is not stored. Thus it's possible to end up with a file in an intermediate state between new and old, without enough information to restore either one or the other (the new data never made it to disk completely, and the old data is not stored anywhere). Even worse, the intermediate state might intersperse old and new data, because the order of the write is left up to the disk's hardware.[12][13] XFS uses this form of journaling.[14]
- Writeback (highest risk)
- Only metadata is journaled; file contents are not. The contents might be written before or after the journal is updated. As a result, files modified right before a crash can become corrupted. For example, a file being appended to may be marked in the journal as being larger than it actually is, causing garbage at the end. Older versions of files could also appear unexpectedly after a journal recovery. Because data and journal writes are not synchronized, this mode is faster in many cases. JFS uses this level of journaling, but ensures that any "garbage" due to unwritten data is zeroed out on reboot.
In all three modes, the internal structure of the file system is assured to be consistent even after a crash. In any case, only the data content of files or directories which were being modified when the system crashed will be affected; the rest will be intact after recovery.
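Because an in-place overwrite can leave a file in a mixed old/new state in the ordered and writeback modes, applications commonly avoid overwriting in place. The following is a minimal sketch of that application-level pattern, not something ext3 does itself: write the new contents to a temporary file, sync it, and rename it over the original. The function name is hypothetical, and the directory sync at the end assumes a POSIX system.

```python
import os
import tempfile


def replace_file_atomically(path: str, data: bytes) -> None:
    """Replace the contents of `path` so that a crash leaves old or new data."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # ask the kernel to push the new data to the device
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is recorded (POSIX systems).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The pattern only helps if `fsync()` really reaches stable storage, which depends on the write cache and barrier behavior of the underlying device.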
Disadvantages
Functionality
Since ext3 aims to be backwards compatible with the earlier ext2, many of the on-disk structures are similar to those of ext2. Consequently, ext3 lacks recent features, such as extents, dynamic allocation of inodes, and block suballocation.[15] A directory can have at most 31998 subdirectories, because an inode can have at most 32000 links: a directory already holds two of them (its entry in the parent directory and its own "." entry), and the ".." entry of each subdirectory adds one more.[16]
ext3, like most current Linux filesystems, cannot be fsck-ed while the filesystem is mounted for writing. Attempting to check a file system that is already mounted may detect bogus errors where changed data has not reached the disk yet, and corrupt the file system in an attempt to "fix" these errors.
Defragmentation
There is no online ext3 defragmentation tool that works on the filesystem level. There is an offline ext2 defragmenter, e2defrag, but it requires that the ext3 filesystem be converted back to ext2 first. Moreover, e2defrag may destroy data, depending on the feature bits turned on in the filesystem; it does not know how to treat many of the newer ext3 features.[17]
There are userspace defragmentation tools, like Shake[18] and defrag.[19][20] Shake works by allocating space for the whole file as one operation, which will generally cause the allocator to find contiguous disk space. If there are files which are used at the same time, Shake will try to write them next to one another. Defrag works by copying each file over itself. However, this strategy works only if the file system has enough free space. A true defragmentation tool does not exist for ext3.[21]
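The two userspace strategies described above (reserving space for the whole file in one operation, and copying a file over itself) can be combined in a short sketch. This is not the code of Shake or defrag; the function name is hypothetical, and `os.posix_fallocate` is only used where the platform and file system provide it. Note that the rename breaks hard links and that ownership is not preserved.

```python
import os
import shutil
import tempfile


def rewrite_file(path: str) -> None:
    """Copy `path` to a freshly allocated file and rename it over the original."""
    size = os.path.getsize(path)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            # Reserve the whole file up front where possible, so the allocator
            # can look for one contiguous run of free blocks.
            if size > 0 and hasattr(os, "posix_fallocate"):
                try:
                    os.posix_fallocate(dst.fileno(), 0, size)
                except OSError:
                    pass  # preallocation unsupported here; fall back to a plain copy
            shutil.copyfileobj(src, dst)
            dst.flush()
            os.fsync(dst.fileno())
        shutil.copystat(path, tmp_path)   # keep permission bits and timestamps
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

As the text notes, this only works when the file system has enough free space for a second copy of the file, and the result is only as contiguous as the free space the allocator can find.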
However, as the Linux System Administrator Guide states, "Modern Linux filesystem(s) keep fragmentation at a minimum by keeping all blocks in a file close together, even if they can't be stored in consecutive sectors. Some filesystems, like ext3, effectively allocate the free block that is nearest to other blocks in a file. Therefore it is not necessary to worry about fragmentation in a Linux system."[22]
While ext3 is more resistant to file fragmentation than the FAT filesystem, ext3 can get fragmented over time or for specific usage patterns, like slowly-writing large files.[23][24] Consequently, ext4, the successor to ext3, is planned to eventually include an online filesystem defragmentation utility,[25] and currently supports extents (contiguous file regions).
Undelete
ext3 does not support the recovery of deleted files. The ext3 driver actively deletes files by wiping file inodes,[26] for crash safety reasons.
There are still several techniques[27] and some free[28] and commercial[29] software for recovery of deleted or lost files using file system journal analysis; however, they do not guarantee any specific file recovery.
Compression
e3compr[30] is an unofficial patch for ext3 that does transparent compression. It is a direct port of e2compr and still needs further development. It compiles and boots well with upstream kernels[citation needed], but journaling is not implemented yet.
Lack of snapshots support
Unlike a number of modern file systems, ext3 does not have native support for snapshots—the ability to quickly capture the state of the filesystem at arbitrary times. Instead, it relies on less-space-efficient, volume-level snapshots provided by the Linux LVM. The Next3 file system is a modified version of ext3 which offers snapshots support, yet retains compatibility with the ext3 on-disk format.[31]
No checksumming in journal
ext3 does not do checksumming when writing to the journal. On a storage device with its own write cache, if barrier=1 is not enabled as a mount option (in /etc/fstab) and the hardware performs out-of-order write caching, there is a risk of severe filesystem corruption during a crash.[32][33][34] This is because storage devices with write caches report to the system that the data has been completely written, even if it was only written to the (volatile) cache.
Consider the following scenario: If hard disk writes are done out-of-order (due to modern hard disks caching writes in order to amortize write speeds), it is likely that one will write a commit block of a transaction before the other relevant blocks are written. If a power failure or unrecoverable crash should occur before the other blocks get written, the system will have to be rebooted. Upon reboot, the file system will replay the log as normal, and replay the "winners" (transactions with a commit block, including the invalid transaction above, which happened to be tagged with a valid commit block). The unfinished disk write above will thus proceed, but using corrupt journal data. The file system will thus mistakenly overwrite normal data with corrupt data while replaying the journal. There is a test program available to trigger the problematic behavior. If checksums had been used, where the blocks of the "fake winner" transaction were tagged with a mutual checksum, the file system could have known better and not replayed the corrupt data onto the disk. Journal checksumming has been added to ext4.[35]
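The scenario above can be illustrated with a toy model. The sketch below is not the ext3 or ext4 journal format: the `Transaction` class, the in-memory "disk" dictionary and the use of CRC32 are assumptions made purely for illustration. It shows how a checksum stored with the commit record lets replay reject a "fake winner" whose data blocks never reached the journal.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Transaction:
    """Toy journal transaction: (disk block number, payload) pairs plus a commit record."""
    blocks: List[Tuple[int, bytes]]
    committed: bool = False
    checksum: Optional[int] = None  # CRC32 over all payloads, kept in the commit record


def checksum_of(blocks: List[Tuple[int, bytes]]) -> int:
    crc = 0
    for _, payload in blocks:
        crc = zlib.crc32(payload, crc)
    return crc


def commit(txn: Transaction) -> None:
    txn.checksum = checksum_of(txn.blocks)
    txn.committed = True


def replay(journal: List[Transaction], disk: Dict[int, bytes], verify: bool) -> int:
    """Apply committed transactions to `disk`; return how many were replayed."""
    replayed = 0
    for txn in journal:
        if not txn.committed:
            continue  # no commit block: not a "winner"
        if verify and checksum_of(txn.blocks) != txn.checksum:
            continue  # commit block present, but the journaled data does not match it
        for block_no, payload in txn.blocks:
            disk[block_no] = payload
        replayed += 1
    return replayed


# Simulate the out-of-order write: the commit record reached the journal,
# but the data block in the journal still holds stale bytes.
txn = Transaction(blocks=[(7, b"new data")])
commit(txn)
txn.blocks[0] = (7, b"stale journal bytes")

disk_plain = {7: b"old data"}
disk_checked = {7: b"old data"}
replay([txn], disk_plain, verify=False)    # overwrites block 7 with stale bytes
replay([txn], disk_checked, verify=True)   # skips the transaction, block 7 untouched
```

Without verification the stale bytes overwrite good data; with verification the transaction is skipped and block 7 keeps its old contents.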
Filesystems going through the device mapper interface (including software RAID and LVM implementations) may not support barriers, and will issue a warning if that mount option is used.[36][37] There are also some disks that do not properly implement the write cache flushing extension necessary for barriers to work, which causes a similar warning.[38] In these situations, where barriers are not supported or practical, reliable write ordering is possible by turning off the disk's write cache and using the data=journal mount option.[32] Turning off the disk's write cache may be required even when barriers are available.
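As a rough aid for auditing the mount options discussed above, the sketch below parses one fstab-style line and reports whether an ext3 entry sets neither barrier=1 nor data=journal. The function names and the sample device path are hypothetical, and the check says nothing about whether the device actually honors barriers or whether its write cache has been turned off.

```python
from typing import Dict, List, Tuple


def parse_fstab_line(line: str) -> Tuple[str, str, str, Dict[str, str]]:
    """Split one fstab-style line into (device, mount point, type, options)."""
    fields = line.split()
    if len(fields) < 4 or fields[0].startswith("#"):
        raise ValueError("not a mount entry")
    device, mount_point, fs_type, raw_options = fields[:4]
    options: Dict[str, str] = {}
    for item in raw_options.split(","):
        key, _, value = item.partition("=")   # "barrier=1" -> ("barrier", "1")
        options[key] = value
    return device, mount_point, fs_type, options


def ordering_warnings(line: str) -> List[str]:
    """Return notes for ext3 entries that request no write-ordering protection."""
    device, mount_point, fs_type, options = parse_fstab_line(line)
    notes: List[str] = []
    if fs_type != "ext3":
        return notes
    if options.get("barrier") != "1" and options.get("data") != "journal":
        notes.append(f"{mount_point}: neither barrier=1 nor data=journal is set")
    return notes


# Hypothetical sample entries
print(ordering_warnings("/dev/sda1 / ext3 defaults,barrier=1 1 1"))
print(ordering_warnings("/dev/sda1 / ext3 defaults 1 1"))
```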
Applications like databases expect that a call to fsync() will flush pending writes to disk, and the barrier implementation doesn't always clear the drive's write cache in response to that call.[39] There is also a potential issue with the barrier implementation related to error handling during events such as a drive failure.[40] Some virtualization technologies are also known not to properly forward fsync or flush commands from a guest operating system to the underlying devices (files, volumes, disks).[41] Similarly, some hard disks or controllers implement cache flushing incorrectly or not at all, but still advertise that it is supported, and do not return any error when it is used.[42] Because there are so many ways to handle fsync and write caching incorrectly, it is safer to assume that cache flushing does not work unless it is explicitly tested, regardless of how reliable individual components are believed to be.
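One simple way to test this is to measure how many fsync() calls per second a device completes. The sketch below is a heuristic under stated assumptions, not a definitive test: it assumes a single rotating disk, where each synchronous small write should cost roughly one platter rotation, so a rate far above the rotation rate suggests that a volatile cache is absorbing the writes. SSDs and controllers with battery-backed caches can legitimately exceed this rate, and a power-loss test is more conclusive. The function names and the default of 7200 rpm are hypothetical choices.

```python
import os
import tempfile
import time


def fsync_rate(directory: str, count: int = 200) -> float:
    """Return fsync() calls completed per second for small appends in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, b"x" * 512)
            os.fsync(fd)              # should not return until the data is durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / elapsed if elapsed > 0 else float("inf")


def looks_cached(rate: float, rpm: int = 7200) -> bool:
    """Heuristic: more syncs per second than platter rotations per second."""
    return rate > rpm / 60
```

For example, `looks_cached(fsync_rate("/var/tmp"))` returning True on a 7200 rpm disk would be a reason to investigate the write cache and barrier settings before trusting the system with data that must survive a power failure.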
ext4
On June 28, 2006, Theodore Ts'o, the principal developer of ext3,[43] announced an enhanced version, called ext4. On October 11, 2008, the patches that mark ext4 as stable code were merged in the Linux 2.6.28 source code repositories, marking the end of the development phase and recommending its adoption. In 2008, Ts'o stated that although ext4 has improved features, it is not a major advance, it uses old technology, and is a stop-gap; Ts'o believes that Btrfs is the better direction, because "it offers improvements in scalability, reliability, and ease of management".[44] Btrfs also has "a number of the same design ideas that reiser3/4 had".[45]
See also
References
- ^ The maximum number of inodes (and hence the maximum number of files and directories) is set when the file system is created. If V is the volume size in bytes, then the default number of inodes is given by V/2^13 (or the number of blocks, whichever is less), and the minimum by V/2^23. The default was deemed sufficient for most applications. The max number of subdirectories in one directory is fixed to 32000.
- ^ Stephen C. Tweedie (1998). "Journaling the Linux ext2fs Filesystem" (PDF). Proceedings of the 4th Annual LinuxExpo, Durham, NC. Retrieved 2007-06-23.
- ^ Stephen C. Tweedie (February 17, 1999). "Re: fsync on large files". Linux kernel mailing list.
- ^ Rob Radez (November 23, 2001). "2.4.15-final". Linux kernel mailing list.
- ^ Justin Piszcz. "Benchmarking Filesystems Part II". Linux Gazette (122).
- ^ Hans Ivers. "Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch".
- ^ Roderick W. Smith (2003-10-09). "Introduction to Linux filesystems and files". Linux.com. [dead link]
- ^ James Trageser (2010-04-23). "Which Linux filesystem to choose for your PC? Ext2, Ext3, Ext4, ReiserFS (Reiser3), Reiser4, XFS, Btrfs".
- ^ Mingming Cao. "Directory indexing". Features found in Linux 2.6.
- ^ Matthew Wilcox. "Documentation/filesystems/ext2.txt". Linux kernel source documentation.
- ^ Daniel Robbins (2001-12-01). "Advanced filesystem implementor's guide, Part 8". IBM developerWorks.
- ^ curious onloooker: Speeding up ext3 filesystems
- ^ Common threads: Advanced filesystem implementor's guide, Part 8
- ^ "Failure Analysis of SGI XFS File System" (PDF).
- ^ Rob Radez (2005). "Extents,delayed allocation". future of ext3.
- ^ "How many sub-directories?".
- ^ Andreas Dilger. "Post to the ext3-users mailing list". ext3-users mailing list post.
- ^ Vleu.net: Shake
- ^ Defrag written in shell
- ^ http://bazaar.launchpad.net/~jdong/pyfragtools/trunk/files Defrag written in Python
- ^ RE: searching for ext3 defrag/file move program
- ^ http://www.tldp.org/LDP/sag/html/filesystems.html
- ^ http://trac.transmissionbt.com/ticket/849 "The default Ubuntu filesystem ("ext3") will fragment large (>1GB), slowly-growing files (<1MB/s)."
- ^ "We found heavily fragmented free areas on an intensively used IMAP server which stores all its emails in individual files - although more than 900 GB of the total disk space of 1.4 TB were still available." http://www.heise-online.co.uk/open/Tuning-the-Linux-file-system-Ext3--/features/110398/3
- ^ http://kernelnewbies.org/Ext4#head-38e6ac2b5f58f10989d72386e6f9cc2ef7217fb0
- ^ Linux ext3 FAQ
- ^ HOWTO recover deleted files on an ext3 file system
- ^ PhotoRec - GPL'd File Recovery
- ^ UFS Explorer Standard Recovery version 4
- ^ e3compr - ext3 compression
- ^ Corbet, Jonathan. "The Next3 filesystem". LWN.
- ^ a b Re: Frequent metadata corruption with ext3 + hard power-off
- ^ Re: Frequent metadata corruption with ext3 + hard power-off
- ^ Red Hat Enterprise Linux, Chapter 20. Write Barriers
- ^ ext4: Add the journal checksum feature
- ^ Re: write barrier over device mapper supported or not?
- ^ XFS and zeroed files
- ^ Barrier Sync
- ^ Re: Proposal for "proper" durable fsync() and fdatasync()
- ^ I/O Barriers, as of kernel version 2.6.31
- ^ Virtualization and IO Modes = Extra Complexity
- ^ SSD, XFS, LVM, fsync, write cache, barrier and lost transactions
- ^ LKML: "Theodore Ts'o": Proposal and plan for ext2/3 future development work
- ^ Paul, Ryan (2009-04-13). "Panelists ponder the kernel at Linux Collaboration Summit". Ars Technica.
- ^ Theodore Ts'o (2008-08-01). "Re: reiser4 for 2.6.27-rc1". linux-kernel (Mailing list). Retrieved 2010-12-31.
External links
- "Linux ext3 FAQ". as of 2004-10-14.
- Introducing ext3 – IBM developerWorks Advanced filesystem implementor's guide, Part 7
- Paragon ExtBrowser Free ext2/ext3 Windows driver
- Ext2 File System For Windows GPL ext2/ext3 file system driver for Windows 2000/XP/2003/VISTA/2008 (opensource, supports read & write) supports larger disks (max 256 I-nodes)
- Ext2 Installable File System For Windows ext2/ext3 file system driver for MS Windows NT/2000/XP (freeware, supports read & write on Windows NT4.0/2000/XP/2003/Vista on x86/AMD64) does not support larger disks (max. 128 bit I-nodes)
- EXT2 IFS ext2/ext3 file system driver (read only) for MS Windows NT/2000/XP (opensource), latest version in the web archive
- Explore2fs An explorer-like GUI tool for accessing ext2/ext3 filesystems under MS Windows
- "Ext2read" A windows application to read/copy ext2/ext3/ext4 files with extent and LVM2 support.
- UFS Explorer Standard Recovery version 4 Commercial data recovery and file undelete software for Ext2/Ext3 file systems.
- ext2/ext3 resizing tools
- Presentation on EXT3 Journaling Filesystem by Dr. Stephen Tweedie at the Ottawa Linux Symposium, 20 July 2000
- State of the Art: Where we are with the Ext3 filesystem by Mingming Cao, Theodore Y. Ts'o, Badari Pulavarty, Suparna Bhattacharya, IBM Linux Technology Center, 2005
- Tutorial – Determining Your EXT3 Size Limits
- HTree
- fuse-ext2 An open source ext2/ext3 file system driver for FUSE. (Supports Mac OS X 10.4 and later (Universal Binary), using MacFuse)
- Windows port of Ext2/Ext4 and other FS in CROSSMETA
- Red Hat Enterprise Linux, Chapter 20. Write Barriers.