Oracle ZFS

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Gigglesworth (talk | contribs) at 23:18, 3 September 2010 (Storage pools). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

ZFS
Developer(s): Sun Microsystems
Full name: ZFS
Introduced: November 2005 with OpenSolaris
Structures
Directory contents: Extensible hash table
Limits
Max volume size: 16 EB
Max file size: 16 EB (2^64 bytes)
Max no. of files: 2^48
Max filename length: 255 bytes
Features
Forks: Yes (called Extended Attributes)
Attributes: POSIX
File system permissions: POSIX, NFSv4 ACLs
Transparent compression: Yes
Transparent encryption: Yes (currently beta)[1]
Data deduplication: Yes
Other
Supported operating systems: Solaris, OpenSolaris, FreeBSD, Mac OS X Server 10.5, Linux via ZFS-FUSE

In computing, ZFS is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include support for high storage capacities, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z and native NFSv4 ACLs. ZFS is implemented as open-source software, licensed under the Common Development and Distribution License (CDDL). The ZFS name is a trademark of Sun.[2]

History

ZFS was designed and implemented by a team at Sun led by Jeff Bonwick. It was announced on September 14, 2004.[3] Source code for ZFS was integrated into the main trunk of Solaris development on October 31, 2005[4] and released as part of build 27 of OpenSolaris on November 16, 2005. Sun announced that ZFS was included in the 6/06 update to Solaris 10 in June 2006, one year after the opening of the OpenSolaris community.[5]

The name originally stood for "Zettabyte File System". The original name selectors happened to like the name, and a ZFS file system has the ability to store 2^58 zettabytes, where each ZB is 2^70 bytes.[6]

Version numbers

As new features are introduced, the version numbers of the zpool and of the ZFS file system are incremented to designate the on-disk format and the features available.[7][8] Notable ZFS versions include:

  • 14 - Supported by OpenSolaris 2009.06, FreeBSD 8.1
  • 15 - Supported by Solaris 10 10/09
  • 17 - Triple Parity RAID-Z
  • 21 - Deduplication

Features

Storage pools

Unlike traditional file systems, which reside on single devices and thus require a volume manager to use more than one device, ZFS filesystems are built on top of virtual storage pools called zpools. A zpool is constructed of virtual devices (vdevs), which are themselves constructed of block devices: files, hard drive partitions, or entire drives, with the last being the recommended usage.[9] Thus, a vdev can be viewed as a group of hard drives, and a zpool consists of one or more such groups. Block devices within a vdev may be configured in different ways, depending on needs and space available: non-redundantly (similar to RAID 0), as a mirror (RAID 1) of two or more devices, as a RAID-Z group of three or more devices (similar to RAID 5), or as a RAID-Z2 group of four or more devices.[10] In July 2009, triple-parity RAID-Z3 was added to OpenSolaris.[11][12]

In addition, pools can have hot spares to compensate for failing disks. ZFS also supports both read and write caching, for which special devices can be used. Solid State Devices can be used for the L2ARC, or Level 2 ARC, speeding up read operations, while NVRAM buffered SLC memory can be boosted with supercapacitors to implement a fast, non-volatile write cache, improving synchronous writes. Finally, when mirroring, block devices can be grouped according to physical chassis, so that the filesystem can continue in the case of the failure of an entire chassis.

Storage pool composition is not limited to similar devices: a pool can consist of ad hoc, heterogeneous collections of devices, which ZFS pools together, subsequently allocating space to the pool's filesystems as needed. Arbitrary storage device types can be added to existing pools to expand their size at any time.[13]

The storage capacity of all vdevs is available to all of the file system instances in the zpool. A quota can be set to limit the amount of space a file system instance can occupy, and a reservation can be set to guarantee that space will be available to a file system instance.
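The pool layout described above can be modelled roughly as follows. This is a minimal sketch for illustration only: the class names `Vdev` and `StoragePool` and the capacity formulas are assumptions made here, not ZFS interfaces, and real pools lose additional space to metadata and padding.

```python
from dataclasses import dataclass, field

# Number of parity devices per RAID-Z level (hypothetical lookup table)
PARITY = {"raidz": 1, "raidz2": 2, "raidz3": 3}


@dataclass
class Vdev:
    kind: str            # "stripe", "mirror", "raidz", "raidz2" or "raidz3"
    device_sizes: list   # size of each block device, in bytes

    def usable_bytes(self) -> int:
        """Rough usable capacity, ignoring metadata and padding overhead."""
        if self.kind == "stripe":
            return sum(self.device_sizes)
        smallest = min(self.device_sizes)
        if self.kind == "mirror":
            return smallest
        # RAID-Z variants: every device counts as the smallest, minus parity
        return smallest * (len(self.device_sizes) - PARITY[self.kind])


@dataclass
class StoragePool:
    vdevs: list = field(default_factory=list)

    def usable_bytes(self) -> int:
        # Space from all top-level vdevs is shared by every file system
        return sum(v.usable_bytes() for v in self.vdevs)


TB = 10**12
pool = StoragePool([
    Vdev("mirror", [1 * TB, 1 * TB]),
    Vdev("raidz", [2 * TB, 2 * TB, 2 * TB]),
])
print(pool.usable_bytes() // TB)  # 5
```

In this toy model the two-way mirror contributes 1 TB and the three-disk RAID-Z group contributes 4 TB, all of which is available to any file system in the pool.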

Capacity

ZFS is a 128-bit file system, so it can address 18 quintillion (1.84 × 10^19) times more data than current 64-bit systems. The limitations of ZFS are designed to be so large that they would never be encountered, given the known limits of physics (and the number of atoms in the Earth's crust to build such a storage device). Some theoretical limits in ZFS are:

  • 2^64: Number of snapshots of any file system[14]
  • 2^48: Number of entries in any individual directory[15]
  • 16 EB (2^64 bytes): Maximum size of a file system
  • 16 EB: Maximum size of a single file
  • 16 EB: Maximum size of any attribute
  • 256 ZB (2^78 bytes): Maximum size of any zpool
  • 2^56: Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
  • 2^64: Number of devices in any zpool
  • 2^64: Number of zpools in a system
  • 2^64: Number of file systems in a zpool
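The figures above can be checked with integer arithmetic. The short sketch below assumes the binary interpretation of the prefixes (1 EB taken as 2^60 bytes and 1 ZB as 2^70 bytes), which is what makes "16 EB" equal to 2^64 bytes.

```python
# Ratio of a 128-bit address space to a 64-bit one
ratio = 2**128 // 2**64
print(f"{ratio:.3e}")    # 1.845e+19, i.e. roughly 18 quintillion

EB = 2**60   # bytes per exabyte, binary interpretation (assumption)
ZB = 2**70   # bytes per zettabyte, binary interpretation (assumption)

print(2**64 // EB)       # 16  -> 16 EB maximum file and file system size
print(2**78 // ZB)       # 256 -> 256 ZB maximum zpool size
```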

Copy-on-write transactional model

ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum or 256-bit hash (currently a choice between Fletcher-2, Fletcher-4, or SHA-256)[16] of the target block, which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required. The blocks are arranged in a tree, as are their checksums (see Merkle signature scheme).
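The interaction between copy-on-write and checksummed block pointers can be illustrated with a toy model. The sketch below is not the ZFS on-disk format: `BlockStore`, `write_tree`, `read_tree`, and `update_block` are hypothetical names, the tree has only one level of indirection, and SHA-256 is used for every block.

```python
import hashlib
import json


class BlockStore:
    """Toy append-only store: a block is never overwritten in place."""

    def __init__(self):
        self.blocks = {}
        self.next_addr = 0

    def write(self, payload: bytes):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = payload
        # The returned "block pointer" carries the checksum of its target
        return (addr, hashlib.sha256(payload).hexdigest())

    def read(self, pointer) -> bytes:
        addr, checksum = pointer
        payload = self.blocks[addr]
        if hashlib.sha256(payload).hexdigest() != checksum:
            raise IOError(f"checksum mismatch at block {addr}")
        return payload


def write_tree(store, data_blocks):
    """Write the data blocks, then one indirect block holding their pointers."""
    pointers = [store.write(b) for b in data_blocks]
    return store.write(json.dumps(pointers).encode())


def read_tree(store, root):
    """Read every data block, verifying checksums from the root downwards."""
    pointers = json.loads(store.read(root))
    return [store.read(tuple(p)) for p in pointers]


def update_block(store, root, index, new_data):
    """Copy-on-write update: new data block, new indirect block, new root."""
    pointers = json.loads(store.read(root))
    pointers[index] = store.write(new_data)
    return store.write(json.dumps(pointers).encode())


store = BlockStore()
root_v1 = write_tree(store, [b"alpha", b"beta"])
root_v2 = update_block(store, root_v1, 1, b"gamma")
print(read_tree(store, root_v1))  # [b'alpha', b'beta']
print(read_tree(store, root_v2))  # [b'alpha', b'gamma']
```

Because the old root pointer still resolves to intact blocks, the previous state remains readable after the update, and any corrupted block is detected when its parent's pointer is followed.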

Snapshots and clones

An advantage of copy-on-write is that when ZFS writes new data, the blocks containing the old data can be retained, allowing a snapshot version of the file system to be maintained. ZFS snapshots are created very quickly, since all the data composing the snapshot is already stored; they are also space efficient, since any unchanged data is shared among the file system and its snapshots.

Writeable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist.
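Block sharing between a file system, its snapshot, and a clone can be sketched as follows. This is a simplified illustration with hypothetical names (`BlockPool`, `ToyFilesystem`); it tracks only which block IDs each view references and is not how ZFS represents snapshots internally.

```python
import itertools


class BlockPool:
    """Shared block storage; every write allocates a fresh block ID."""

    def __init__(self):
        self.blocks = {}
        self._ids = itertools.count()

    def allocate(self, data: bytes) -> int:
        block_id = next(self._ids)
        self.blocks[block_id] = data
        return block_id


class ToyFilesystem:
    def __init__(self, pool, block_map=None):
        self.pool = pool
        self.block_map = dict(block_map or {})   # offset -> block ID

    def write(self, offset: int, data: bytes):
        # Copy-on-write: point the offset at a new block, keep the old one
        self.block_map[offset] = self.pool.allocate(data)

    def read(self, offset: int) -> bytes:
        return self.pool.blocks[self.block_map[offset]]

    def snapshot(self) -> dict:
        # Only references are copied, which is why a snapshot is cheap
        return dict(self.block_map)


pool = BlockPool()
fs = ToyFilesystem(pool)
fs.write(0, b"a")
fs.write(1, b"b")

snap = fs.snapshot()
clone = ToyFilesystem(pool, snap)   # writable view built from the snapshot
clone.write(1, b"B")

shared = set(fs.block_map.values()) & set(clone.block_map.values())
print(fs.read(1), clone.read(1))    # b'b' b'B'
print(len(pool.blocks), shared)     # 3 {0}
```

Only one new block is allocated for the clone's change; the unchanged block at offset 0 remains shared.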

Dynamic striping

Dynamic striping across all devices to maximize throughput means that as additional devices are added to the zpool, the stripe width automatically expands to include them; thus all disks in a pool are used, which balances the write load across them.
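The effect of widening the stripe when a device is added can be sketched with a simple round-robin allocator. Round-robin placement is an assumption made for illustration, and `StripedPool` is a hypothetical name; the actual ZFS allocator takes more factors into account.

```python
class StripedPool:
    """Toy allocator that spreads new blocks over all current devices."""

    def __init__(self, devices):
        self.devices = list(devices)
        self._next = 0
        self.placement = []   # device chosen for each written block

    def add_device(self, name):
        # New devices take part in striping for all subsequent writes
        self.devices.append(name)

    def write_block(self):
        device = self.devices[self._next % len(self.devices)]
        self._next += 1
        self.placement.append(device)
        return device


striped = StripedPool(["d0", "d1"])
first = [striped.write_block() for _ in range(4)]
striped.add_device("d2")
later = [striped.write_block() for _ in range(3)]
print(first)   # ['d0', 'd1', 'd0', 'd1']
print(later)   # ['d1', 'd2', 'd0']
```

Existing blocks stay where they were written; only new writes are spread across the enlarged set of devices.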

Variable block sizes

ZFS uses variable-sized blocks of up to 128 kilobytes. The currently available code allows the administrator to tune the maximum block size used as certain workloads do not perform well with large blocks. Automatic tuning to match workload characteristics is contemplated.[citation needed]

If data compression (LZJB) is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations).
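The space saving from storing a compressed block in a smaller on-disk block can be estimated as below. The sketch makes several assumptions: `zlib` stands in for the compressor (LZJB is not in the Python standard library), the allocation unit is taken to be 512 bytes, and `on_disk_size` is a hypothetical helper.

```python
import os
import zlib

SECTOR = 512             # assumed allocation unit, in bytes
MAX_BLOCK = 128 * 1024   # maximum block size of 128 kilobytes


def on_disk_size(block: bytes) -> int:
    """Bytes allocated for a block, rounded up to whole sectors."""
    if len(block) > MAX_BLOCK:
        raise ValueError("block exceeds the maximum block size")
    compressed = zlib.compress(block)
    # Keep the compressed form only if it is actually smaller
    best = min(len(block), len(compressed))
    return -(-best // SECTOR) * SECTOR   # ceiling division


zeros = bytes(MAX_BLOCK)           # highly compressible
noise = os.urandom(MAX_BLOCK)      # effectively incompressible
print(on_disk_size(zeros), on_disk_size(noise))
```

A block of zeros shrinks to a small multiple of the sector size, while random data is stored at its full 128-kilobyte size.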

Lightweight filesystem creation

In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or resize a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.

Cache management

ZFS also uses the Adaptive Replacement Cache (ARC), a new method for cache management, instead of the traditional Solaris virtual memory page cache.

Adaptive endianness

Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders. The ZFS block pointer format stores filesystem metadata in an endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness doesn't match the endianness of the system, the metadata is byte-swapped in memory.

This does not affect the stored data itself; as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
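The write-native, swap-on-read approach can be sketched as follows. The layout is an assumption made for illustration (a one-byte flag followed by 64-bit words), and `write_metadata` and `read_metadata` are hypothetical helpers; ZFS records the byte order inside its own block pointer format.

```python
import struct
import sys


def _prefix(byteorder: str) -> str:
    return "<" if byteorder == "little" else ">"


def write_metadata(values, byteorder=sys.byteorder) -> bytes:
    """Serialize 64-bit words in the writer's byte order, tagged with a flag."""
    flag = b"\x01" if byteorder == "little" else b"\x00"
    return flag + struct.pack(f"{_prefix(byteorder)}{len(values)}Q", *values)


def read_metadata(raw: bytes, host_byteorder=sys.byteorder) -> list:
    """Load words as the host would natively, then swap if the flag differs."""
    stored = "little" if raw[0] == 1 else "big"
    count = (len(raw) - 1) // 8
    words = struct.unpack(f"{_prefix(host_byteorder)}{count}Q", raw[1:])
    if stored != host_byteorder:
        # Byte-swap each 64-bit word in memory
        words = [int.from_bytes(w.to_bytes(8, "little"), "big") for w in words]
    return list(words)


block = write_metadata([1, 2**40], byteorder="big")
print(read_metadata(block, host_byteorder="little"))  # [1, 1099511627776]
```

No conversion cost is paid when the reader and the writer share a byte order, since the flag comparison succeeds and the words are used as loaded.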

Deduplication

Deduplication support was added to the ZFS source repository at the end of October 2009.[17] The OpenSolaris ZFS development packages have been available since December 3, 2009 (build 128), and the OpenSolaris release packages were scheduled to be rolled out in the first quarter of 2010.

Additional capabilities

  • Explicit I/O priority with deadline scheduling.
  • Claimed globally optimal I/O sorting and aggregation.
  • Multiple independent prefetch streams with automatic length and stride detection.
  • Parallel, constant-time directory operations.
  • End-to-end checksumming, using a kind of "Data Integrity Field", allowing data corruption to be detected (and repaired if the pool has redundancy).
  • Transparent filesystem compression. Supports LZJB and gzip.[18]
  • Intelligent scrubbing and resilvering (resyncing).[19]
  • Load and space usage sharing between disks in the pool.[20]
  • Ditto blocks: Configurable data replication per filesystem, with zero, one or two extra copies requested per write for user data, and with that same base number of copies plus one or two for metadata (according to metadata importance).[21] If the pool has several devices, ZFS tries to replicate over different devices. Ditto blocks are primarily an additional protection against corrupted sectors, not against total disk failure.[22]
  • The ZFS design (copy-on-write plus uberblocks) is safe when using disks with write cache enabled, provided they honor write barriers. This provides safety and a performance boost compared with some other filesystems.
  • When entire disks are added to a ZFS pool, ZFS automatically enables their write cache. This is not done when ZFS manages only discrete slices of the disk, since it does not know whether other slices are managed by filesystems that are not write-cache safe, such as UFS.
  • Per-user and per-group quotas support.[23]
  • Planned features:
    • Filesystem encryption was due for integration in Q1 2010[1] but is currently still not available in production environments under Solaris or OpenSolaris.
    • The so-called Block Pointer rewrite functionality is due to be added in the same time frame, paving the way for resizing pools, defragmentation, (re-)applying compression on filesystems and so on.[24]
    • Recovery tools will be added for recovering from the loss of a pool.

Limitations

  • Capacity expansion is normally achieved by adding groups of disks as a top-level vdev: simple device, RAID-Z, RAID-Z2, RAID-Z3, or mirrored. Newly written data will dynamically start to use all available vdevs. It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself; the heal time depends on the amount of stored information, not on the disk size. The new free space will not be available until all the disks have been swapped.
  • It is currently not possible to reduce the number of top-level vdevs in a pool or otherwise reduce pool capacity.[25] This functionality was said to be in development as early as 2007.[26] It is not available as of Solaris 10 10/09 (also known as update 8).
  • It is not possible to add a disk as a column to a RAID-Z, RAID-Z2, or RAID-Z3 vdev. This feature depends on the block pointer rewrite functionality due to be added soon. It is, however, possible to create a new RAID-Z vdev and add it to the zpool.
  • Vdevs cannot be nested, so a mirror or RAID-Z top-level vdev can only contain files or disks. Mirrors of mirrors (or other combinations) are not allowed.
  • Reconfiguring the number of top-level vdevs requires copying data offline, destroying the pool, and recreating the pool with the new top-level vdev configuration.
  • ZFS is a local file system, not a native cluster, distributed, or parallel file system, and cannot provide concurrent access from multiple hosts. Sun's Lustre distributed filesystem will adopt ZFS as back-end storage for both data and metadata in version 3.0, which is scheduled to be released in 2010.[27]
  • ZFS expects a disk cache flush command to commit cached data to media. Some virtualization software is configured by default to ignore cache flush commands, and some consumer-grade hardware 'lies' about actually executing the command as well. For example, VirtualBox can be configured to respect cache flushes but does not do so by default (the procedure is described in section 11.1.3, "Responding to guest IDE flush requests", of the Sun VirtualBox User Manual[28]); consumer-grade USB disk enclosures are said to be particularly vulnerable to this problem. In the event of an outage or fault, this can lead to damage to the pool; recovery can be attempted by importing the pool as of a few transactions ago (i.e. from an older uberblock), losing seconds or minutes of data. A recovery enhancement is expected to be integrated in Q1 2010 (it is already in the latest development versions of OpenSolaris). A scrub is used to verify integrity; however, some files may still need to be restored from backups in the unlikely event that they have already been deleted and their blocks freed and then overwritten.
  • ZFS has no defragmentation utility. Copy-on-write leads to high fragmentation for files that are modified frequently.
  • "Copies" are not the same thing as "replicas" for purposes of data recovery, which can lead to permanent data loss.[29]
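The first limitation above, that space gained by swapping in larger drives appears only once every drive has been replaced, follows from each device in a group counting as the smallest one. The sketch below uses a hypothetical `raidz_usable` helper and an approximate capacity formula that ignores metadata overhead.

```python
def raidz_usable(device_sizes, parity=1):
    """Approximate usable space: every device counts as the smallest one."""
    return min(device_sizes) * (len(device_sizes) - parity)


TB = 10**12
disks = [1 * TB, 1 * TB, 1 * TB]
history = [raidz_usable(disks)]
for i in range(len(disks)):
    disks[i] = 2 * TB   # swap one drive for a larger one, then let ZFS heal
    history.append(raidz_usable(disks))

print([h // TB for h in history])  # [2, 2, 2, 4]
```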

Platforms

ZFS is part of Sun's own Solaris operating system and is thus available on both SPARC and x86-based systems. Since the code for ZFS is open source, a port to other operating systems and platforms can be produced without Sun's involvement.

OpenSolaris

OpenSolaris 2008.05 and 2009.06 use ZFS as their default filesystem. There are also about half a dozen third-party distributions.

FreeBSD

FreeBSD version 8 includes a much-updated implementation of ZFS.[30] zpool version 13 is supported in 8.0-RELEASE.[30] Support for zpool version 14 was added to the 8-STABLE branch on 11 January 2010[31] and is included in 8.1-RELEASE.

Pawel Jakub Dawidek ported ZFS to FreeBSD, and it has been part of FreeBSD since version 7.0.[32] 7-STABLE is on zpool version 13, while 8-STABLE and the current development branch use zpool version 14. Moreover, zfsboot has been implemented in both branches.[33][34] The port is fully functional; the only missing features are the kernel CIFS server and iSCSI.

GNU/kFreeBSD

GNU/kFreeBSD is a special case: because it is based on the FreeBSD kernel, it provides the kernel side of ZFS support (see above). However, whether the necessary userland tools are available depends on the GNU/kFreeBSD distribution. The only distribution of this system to date (Debian GNU/kFreeBSD) provides ZFS utilities in the zfsutils package.

NetBSD

A NetBSD port of ZFS was started as part of the 2007 Google Summer of Code, and in August 2009 the code was merged into NetBSD's source tree.[35]

Mac OS X

The first indication of Apple Inc.'s interest in ZFS was an April 2006 post on the opensolaris.org zfs-discuss mailing list where an Apple employee mentioned being interested in porting ZFS to their Mac OS X operating system.[36]

In the release version of Mac OS X 10.5, ZFS was available in read-only mode from the command line, without the ability to create zpools or write to them.[37] Before the 10.5 release, Apple released the "ZFS Beta Seed v1.1", which allowed read-write access and the creation of zpools;[38] however, the installer for the "ZFS Beta Seed v1.1" has been reported to work only on version 10.5.0 and has not been updated for version 10.5.1 and above.[39]

In August 2007, Apple opened a ZFS project on their Mac OS Forge site. On that site, Apple provided the source code and binaries of their port of ZFS which includes read-write access, but there was no installer available[40] until a third-party developer created one.[41]

In October 2009, Apple announced a shutdown of the ZFS project on Mac OS Forge. No explanation was given, just the following statement: "The ZFS project has been discontinued. The mailing list and repository will also be removed shortly." Versions of the previously released source and binaries, as well as the wiki, have been preserved and development has been adopted by a group of enthusiasts.[42][43]

Complete ZFS support was once advertised as a feature of Snow Leopard Server (Mac OS X Server 10.6). However, all references to this feature have been silently removed; it is no longer listed on the Snow Leopard Server features page.[44] Apple has not commented regarding the omission.

Linux

Porting ZFS to Linux is complicated by the fact that the GNU General Public License, which governs the Linux kernel, is incompatible with the Sun CDDL under which ZFS is distributed. According to some developers a single derived work of both projects cannot be legally distributed, as it is not possible to simultaneously meet both licenses' requirements. To include ZFS in the Linux kernel it would have to be cleanly reimplemented, and patents may hamper this.[45]

Another solution to this problem was to port ZFS to Linux's FUSE system so that the filesystem runs in userspace instead, where it is not considered a derived work of the kernel. A project to do this was sponsored by Google's Summer of Code program in 2006.[46] Sun Microsystems has stated that a Linux port is being investigated.[47] Development of ZFS on FUSE/Linux now takes place at zfs-fuse.net.

A native port of ZFS for Linux is being worked on. This ZFS on Linux port was produced at the Lawrence Livermore National Laboratory (LLNL) under Contract No. DE-AC52-07NA27344 (Contract 44) between the U.S. Department of Energy (DOE) and Lawrence Livermore National Security, LLC (LLNS) for the operation of LLNL. It has been approved for release under LLNL-CODE-403049. It does not yet implement the ZFS POSIX Layer (ZPL), so it cannot yet be used as a filesystem.

Another native port is being worked on by KQ Infotech. No code has been released yet, although closed beta testing has started.[48]

References

  1. ^ a b "OpenSolaris.org". Sun Microsystems. Retrieved 2007-10-21.
  2. ^ "Sun Trademarks - ZFS". Sun Microsystems.
  3. ^ "ZFS: the last word in file systems". Sun Microsystems. September 14, 2004. Retrieved 2006-04-30.
  4. ^ Jeff Bonwick (October 31, 2005). "ZFS: The Last Word in Filesystems". Jeff Bonwick's Blog. Retrieved 2006-04-30.
  5. ^ "Sun Celebrates Successful One-Year Anniversary of OpenSolaris". Sun Microsystems. June 20, 2006.
  6. ^ "ZFS FAQ at OpenSolaris.org". Sun Microsystems. Retrieved 2009-03-03.
  7. ^ "Solaris ZFS Administration Guide, Appendix A ZFS Version Descriptions". Sun Microsystems. 2009. Retrieved 2010-08-17.
  8. ^ "Version". Sun Microsystems. Retrieved 2010-08-17.
  9. ^ "Solaris ZFS Administration Guide". Sun Microsystems. Retrieved 2007-10-02.
  10. ^ "ZFS Best Practices Guide". Solaris Performance Wiki. Retrieved 2007-10-02.
  11. ^ Leventhal, Adam. "Bug ID: 6854612 triple-parity RAID-Z". Sun Microsystems. Retrieved 2009-07-17.
  12. ^ Leventhal, Adam (2009-07-16). "6854612 triple-parity RAID-Z". zfs-discuss (Mailing list). Retrieved 2009-07-17.
  13. ^ "Solaris ZFS Enables Hybrid Storage Pools—Shatters Economic and Performance Barriers"
  14. ^ "Solaris ZFS Administration Guide". Sun Microsystems. Retrieved 2007-10-05.
  15. ^ "Solaris ZFS Administration Guide". Sun Microsystems. Retrieved 2007-10-05.
  16. ^ "ZFS On-Disk Specification" (PDF). Sun Microsystems, Inc. 2006. See section 2.4.
  17. ^ "ZFS Deduplication".
  18. ^ "Solaris ZFS Administration Guide". Chapter 6 Managing ZFS File Systems. Retrieved 2009-03-17.
  19. ^ "Smokin' Mirrors". Jeff Bonwick's Weblog. 2006-05-02. Retrieved 2007-02-23.
  20. ^ "ZFS Block Allocation". Jeff Bonwick's Weblog. 2006-11-04. Retrieved 2007-02-23.
  21. ^ "Ditto Blocks - The Amazing Tape Repellent". Flippin' off bits Weblog. 2006-05-12. Retrieved 2007-03-01.
  22. ^ "Adding new disks and ditto block behaviour". Retrieved 2009-10-19.
  23. ^ "OpenSolaris.org". Sun Microsystems. Retrieved 2009-05-22.
  24. ^ Jeff Bonwick Keynote at Kernel Conference Australia 2009
  25. ^ "Bug ID 4852783: reduce pool capacity". OpenSolaris Project. Retrieved 2009-03-28.
  26. ^ Goebbels, Mario (2007-04-19). "Permanently removing vdevs from a pool". zfs-discuss (Mailing list).
  27. ^ "Lustre Roadmap".
  28. ^ "Sun VirtualBox User Manual version 3.0.4" (PDF).
  29. ^ "The official error message for a vdev going missing, even if it's replaced".
  30. ^ a b "FreeBSD 8.0-RELEASE Release Notes". FreeBSD. Retrieved 2009-11-27.
  31. ^ "FreeBSD 8.0-STABLE Subversion logs". FreeBSD. Retrieved 2010-02-05.
  32. ^ Dawidek, Pawel (April 6, 2007). "ZFS committed to the FreeBSD base". Retrieved 2007-04-06.
  33. ^ "Revision 192498". May 20, 2009. Retrieved 2009-05-22.
  34. ^ "ZFS v13 in 7-STABLE". May 21, 2009. Retrieved 2009-05-22.
  35. ^ "NetBSD Google Summer of Code projects: ZFS".
  36. ^ "Porting ZFS to OSX". zfs-discuss. April 27, 2006. Retrieved 2006-04-30.
  37. ^ "Apple: Leopard offers limited ZFS read-only". MacNN. June 12, 2007. Retrieved 2007-06-23.
  38. ^ "Apple delivers ZFS Read/Write Developer Preview 1.1 for Leopard". Ars Technica. October 7, 2007. Retrieved 2007-10-07.
  39. ^ Ché Kristo (November 18, 2007). "ZFS Beta Seed v1.1 will not install on Leopard.1 (10.5.1) « ideas are free". Retrieved 2007-12-30.
  40. ^ http://zfs.macosforge.org
  41. ^ http://alblue.blogspot.com/2008/11/zfs-119-on-mac-os-x.html
  42. ^ http://code.google.com/p/maczfs/
  43. ^ http://groups.google.com/group/zfs-macos/?pli=1
  44. ^ "Snow Leopard". June 9, 2009. Retrieved 2008-06-10.
  45. ^ Jeremy Andrews (April 19, 2007). "Linux: ZFS, Licenses and Patents". Retrieved 2007-04-21.
  46. ^ Ricardo Correia (March 16, 2009). "ZFS on FUSE/Linux". Retrieved 2009-03-16.
  47. ^ "Fast Track to Solaris 10 Adoption: ZFS Technology". Solaris 10 Technical Knowledge Base. Sun Microsystems. Retrieved 2006-04-24.
  48. ^ Darshin (August 24, 2010). "ZFS Port to Linux (all versions)". Retrieved 2010-08-31.

Bibliography

  • Watanabe, Scott (November 23, 2009). "Solaris ZFS Essentials". Prentice Hall. p. 256.