Data corruption
Data corruption refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Computer, transmission and storage systems use a number of measures to provide end-to-end data integrity, or lack of errors.
In general, when data corruption occurs, a file containing that data will produce unexpected results when accessed by the system or the related application; results can range from minor data loss to a system crash. For example, if a Microsoft Word file is corrupted, attempting to open it may produce an error message, or the file may open with some of its contents garbled. A corrupted JPEG image, similarly, may display with most of its visual information lost.
Some programs can offer to repair a corrupted file automatically after the error is detected, while others cannot; whether repair is possible depends on the extent of the corruption and on the application's built-in ability to handle the error. The corruption itself can have various causes.
Overview
There are two types of data corruption associated with computer systems:
- Undetected
- Also known as silent data corruption; such problems are the most dangerous errors as there is no indication that the data is incorrect.
- Detected
- Detected errors may be permanent, resulting in data loss, or they may be temporary, when some part of the system is able to detect and correct the error; in the latter case there is no data corruption.
Data corruption can occur at any level in a system, from the host to the storage medium. Modern systems attempt to detect corruption at many layers and then recover or correct it; this is almost always successful, but in rare cases the information arriving in the system's memory is corrupted and can cause unpredictable results.
Data corruption during transmission has a variety of causes. Interruption of data transmission causes information loss. Environmental conditions can interfere with data transmission, especially when dealing with wireless transmission methods. Heavy clouds can block satellite transmissions. Wireless networks are susceptible to interference from devices such as microwave ovens.
Hardware and software failure are the two main causes for data loss. Background radiation, head crashes, and aging or wear of the storage device fall into the former category, while software failure typically occurs due to bugs in the code. Cosmic rays cause most soft errors in DRAM.[1]
Silent data corruption
The worst type of errors are those that go unnoticed, and are not even detected by the disk firmware or the host operating system. This is known as silent corruption.
There are many error sources beyond the disk storage subsystem itself. For instance, cables might be slightly loose, the power supply might be unreliable,[2] external vibrations such as a loud sound might disturb the drive,[3] the network might introduce undetected corruption,[4] and cosmic radiation along with many other causes of soft memory errors may also play a role. In a study of 39,000 storage systems, firmware bugs accounted for 5–10% of storage failures.[5] All in all, the error rates observed in a CERN study on silent corruption are far higher than one in every 10^16 bits.[6] Amazon.com has reported similarly high rates of data corruption.[7]
The main problem is that hard disk capacities have increased substantially while their per-bit error rates have remained roughly constant, so modern disks are not much safer than old ones. Because old disks stored relatively little data, the probability of corruption occurring on any one of them was small, and silent data corruption was not a serious concern while storage devices remained comparatively small and slow; users of small disks very rarely encountered it, so it was not treated as a problem requiring a solution. With the advent of larger drives and very fast RAID setups, however, users can transfer 10^16 bits in a reasonably short time, easily reaching the thresholds at which corruption becomes likely.[8]
As an example, ZFS creator Jeff Bonwick stated that the fast database at Greenplum – a database software company specializing in large-scale data warehousing and analytics – faces silent corruption every 15 minutes.[9] As another example, a real-life study performed by NetApp on more than 1.5 million HDDs over 41 months found more than 400,000 silent data corruptions, out of which more than 30,000 were not detected by the hardware RAID controller. Another study, performed by CERN over six months and involving about 97 petabytes of data, found that about 128 megabytes of data became permanently corrupted.[10][11]
Silent data corruption may result in cascading failures, in which the system may run for a period of time with undetected initial error causing increasingly more problems until it is ultimately detected.[12] For example, a failure affecting file system metadata can result in multiple files being partially damaged or made completely inaccessible as the file system is used in its corrupted state.
Countermeasures
When data corruption behaves as a Poisson process, where each bit of data has an independently low probability of being changed, data corruption can generally be detected by the use of checksums, and can often be corrected by the use of error correcting codes.
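The difference between detection and correction can be illustrated with a minimal Python sketch, not tied to any particular storage product: a CRC32 checksum detects a flipped bit, and a trivial threefold-repetition code, the simplest error-correcting code, corrects one. The data values are purely illustrative.

```python
import zlib
from collections import Counter

# --- Detection with a checksum (CRC32) ---
original = b"account balance: 1024.00"
stored_crc = zlib.crc32(original)

corrupted = bytearray(original)
corrupted[8] ^= 0x01                                   # simulate a single flipped bit
assert zlib.crc32(bytes(corrupted)) != stored_crc      # corruption is detected, but not repaired

# --- Correction with a trivial error-correcting code: 3x repetition ---
def encode(bits):
    return [b for b in bits for _ in range(3)]         # store each bit three times

def decode(coded):
    # Majority vote over each group of three corrects any single flipped copy.
    return [Counter(coded[i:i + 3]).most_common(1)[0][0] for i in range(0, len(coded), 3)]

data = [1, 0, 1, 1]
coded = encode(data)
coded[4] ^= 1                   # flip one stored copy
assert decode(coded) == data    # the error is corrected by majority vote
```

The checksum only reports that something changed, whereas the repetition code restores the original bit at the cost of tripling the stored data; practical systems use far more efficient codes such as Hamming or Reed-Solomon codes.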
If an uncorrectable data corruption is detected, procedures such as automatic retransmission or restoration from backups can be applied. Certain levels of RAID disk arrays have the ability to store and evaluate parity bits for data across a set of hard disks and can reconstruct corrupted data upon the failure of a single or multiple disks, depending on the level of RAID implemented.
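The parity reconstruction used by RAID levels such as RAID 5 can be sketched with byte-wise XOR; the four-disk layout and block contents below are purely hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# One stripe across three data disks plus a parity disk (RAID 5-style).
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# If one disk fails (say d1 is lost), XOR of the survivors and the parity rebuilds it.
rebuilt_d1 = xor_blocks([d0, d2, parity])
assert rebuilt_d1 == d1
```

Real controllers rotate the parity across the disks and handle many stripes, but the reconstruction arithmetic is the same XOR shown here.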
Many errors are detected and corrected by the hard disk drives using the ECC/CRC codes[13] which are stored on disk for each sector. If the disk drive detects multiple read errors on a sector it may make a copy of the failing sector on another part of the disk, by remapping the failed sector of the disk to a spare sector without the involvement of the operating system (though this may be delayed until the next write to the sector). This "silent correction" can be monitored using S.M.A.R.T. and tools available for most operating systems to automatically check the disk drive for impending failures by watching for deteriorating SMART parameters.
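A hedged sketch of such monitoring is shown below. It assumes the smartmontools package is installed, that /dev/sda is the correct device path for the system at hand, that the command runs with sufficient privileges, and that the drive reports the Reallocated_Sector_Ct attribute in the usual tabular format; attribute tables vary between drives.

```python
import subprocess

DEVICE = "/dev/sda"   # example device path; adjust for the system being monitored

def reallocated_sectors(device=DEVICE):
    """Return the raw Reallocated_Sector_Ct value from 'smartctl -A', or None if not reported."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            # In the usual table layout the raw value is the last column of the attribute row.
            return int(line.split()[-1])
    return None

count = reallocated_sectors()
if count is not None and count > 0:
    print(f"{DEVICE}: {count} sectors already remapped silently; the drive may be deteriorating")
```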
Some file systems, such as Btrfs and ZFS, use internal data and metadata checksumming to detect silent data corruption. In addition, if a corruption is detected and the file system uses internal RAID mechanisms that provide data redundancy, such file systems can also reconstruct corrupted data in a transparent way.[14] This approach allows improved data integrity protection covering the entire data path, which is usually known as end-to-end data protection.[15]
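The following toy model sketches the detect-then-repair idea behind such self-healing; it is not ZFS or Btrfs code. Each block's checksum is kept in "metadata", reads are verified against it, and a mismatching copy is rewritten from a mirror copy that does verify.

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class MirroredBlockStore:
    """Toy model of a checksumming, self-healing block store (conceptual sketch only)."""
    def __init__(self):
        self.copies = [{}, {}]        # two mirrored copies: block_id -> bytes
        self.checksums = {}           # block_id -> expected checksum, kept with the metadata

    def write(self, block_id, data):
        self.checksums[block_id] = sha256(data)
        for copy in self.copies:
            copy[block_id] = data

    def read(self, block_id):
        expected = self.checksums[block_id]
        for copy in self.copies:
            data = copy[block_id]
            if sha256(data) == expected:               # checksum verifies: a good copy
                for other in self.copies:              # self-heal any copy that fails verification
                    if sha256(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise IOError(f"block {block_id}: all copies failed checksum verification")

store = MirroredBlockStore()
store.write("b1", b"important data")
store.copies[0]["b1"] = b"important dxta"          # silently corrupt the first copy
assert store.read("b1") == b"important data"       # bad copy detected, good mirror used
assert store.copies[0]["b1"] == b"important data"  # and the bad copy has been repaired
```

In an actual filesystem the checksums live in block pointers or checksum trees and the redundancy may be a mirror or a RAID-like profile, but the logic follows this pattern.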
"Data scrubbing" is another method to reduce the likelihood of data corruption, as disk errors are caught and recovered from before multiple errors accumulate and overwhelm the number of parity bits. Instead of parity being checked on each read, the parity is checked during a regular scan of the disk, often done as a low priority background process. Note that the "data scrubbing" operation activates a parity check. If a user simply runs a normal program that reads data from the disk, then the parity would not be checked unless parity-check-on-read was both supported and enabled on the disk subsystem.
If appropriate mechanisms are employed to detect and remedy data corruption, data integrity can be maintained. This is particularly important in commercial applications (e.g. banking), where an undetected error could either corrupt a database index or change data to drastically affect an account balance, and in the use of encrypted or compressed data, where a small error can make an extensive dataset unusable.[6]
End-to-end checksums
It is important to note that merely adding checksums everywhere does not necessarily increase data integrity, as explained by CERN[16]
- "Checksumming? -> not necessarily enough"
- "End-to-end checksumming (ZFS has a point)"
and by Amazon.[17] For instance, hardware RAID cards sometimes employ Reed-Solomon error-correcting codes and hard disks carry Data Integrity Field checksums, yet data corruption is still encountered despite these countermeasures. The problem is that while each individual layer in the storage stack may be quite safe, errors can creep in as data passes from one layer to the next. For instance, the data may be validated on the hard disk and again in the RAID controller, but not as the hard disk hands the data over to the RAID controller; corruption can therefore occur at the boundary between layers, and the more layers a stack has, the higher the risk. To counter this, end-to-end validation is needed: the data at the lowest level of the stack, the hard disk, must be compared by checksum with the data at the highest level of the stack, the server's RAM or the application. Only by comparing the data at opposite ends of the stack, that is, by using end-to-end checksums, can one be confident that no corruption has occurred. Doing so typically requires a single system that manages the data across all boundaries, from the hard disk through the volume manager and RAID layer up to RAM. ZFS is one filesystem that does this: it is monolithic and responsible for all layers of the storage stack, and initial research shows it to be safe because of its end-to-end checksums. Btrfs is also a monolithic filesystem built on the same design principles and has end-to-end checksums.[18] By contrast, a storage stack consisting of a separate filesystem, a separate volume manager and a separate RAID controller may be less robust against corruption, because the data is not validated as it crosses the different layers.
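Independent of any particular filesystem, the end-to-end principle can be sketched at the application level: the checksum is computed while the data is still in application memory and verified only after the data has travelled down through the stack and back up again. The file path below is only an example.

```python
import hashlib

def write_with_checksum(path, payload: bytes):
    """Application-level write: record a checksum of the data as it exists in application memory."""
    digest = hashlib.sha256(payload).hexdigest()
    with open(path, "wb") as f:
        f.write(payload)     # passes through page cache, filesystem, controller, disk...
    return digest

def read_and_verify(path, expected_digest):
    """Application-level read: verify the data only once it is back in application memory."""
    with open(path, "rb") as f:
        payload = f.read()
    if hashlib.sha256(payload).hexdigest() != expected_digest:
        raise IOError(f"{path}: end-to-end checksum mismatch, corruption somewhere in the stack")
    return payload

digest = write_with_checksum("/tmp/record.bin", b"payload that must survive the whole stack")
data = read_and_verify("/tmp/record.bin", digest)
```

If the checksum verified here ever disagrees with the data read back, corruption occurred somewhere between the application and the storage medium, regardless of which intermediate layer introduced it.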
References
- ^ Scientific American (2008-07-21). "Solar Storms: Fast Facts". Nature Publishing Group. Retrieved 2009-12-08.
- ^ Eric Lowe (16 November 2005). "ZFS saves the day(-ta)!" (Blog). Oracle – Core Dumps of a Kernel Hacker's Brain – Eric Lowe's Blog. Oracle. Retrieved 9 June 2012.
- ^ bcantrill (31 December 2008). "Shouting in the Datacenter" (Video file). YouTube. Google. Retrieved 9 June 2012.
- ^ jforonda (31 January 2007). "Faulty FC port meets ZFS" (Blog). Blogger – Outside the Box. Google. Retrieved 9 June 2012.
- ^ "Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics" (PDF). USENIX. Retrieved 2014-01-18.
- ^ a b Bernd Panzer-Steindel (8 April 2007). "Draft 1.3". Data integrity. CERN. Retrieved 9 June 2012.
- ^ "Observations on Errors, Corrections, & Trust of Dependent Systems".
- ^ "Silent data corruption in disk arrays: A solution" (PDF). NEC. 2009. Retrieved 2013-10-24.
- ^ "A Conversation with Jeff Bonwick and Bill Moore". Association for Computing Machinery. November 15, 2007. Retrieved December 6, 2010.
- ^ David S. H. Rosenthal (October 1, 2010). "Keeping Bits Safe: How Hard Can It Be?". ACM Queue. Retrieved 2014-01-02.
- ^ Lakshmi N. Bairavasundaram; Garth R. Goodson; Shankar Pasupathy; Jiri Schindler (June 2007). "An Analysis of Latent Sector Errors in Disk Drives". Proceedings of the International Conference on Measurements and Modeling of Computer Systems (SIGMETRICS'07). San Diego, California, United States: ACM: 289–300. doi:10.1145/1254882.1254917. Retrieved 9 June 2012.
- ^ David Fiala; Frank Mueller; Christian Engelmann; Rolf Riesen; Kurt Ferreira; Ron Brightwell (November 2012). "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" (PDF). fiala.me. IEEE. Retrieved 2015-01-26.
- ^ "Read Error Severities and Error Management Logic". Retrieved 4 April 2012.
- ^ Margaret Bierman; Lenz Grimmer (August 2012). "How I Use the Advanced Capabilities of Btrfs". Oracle Corporation. Retrieved 2014-01-02.
- ^ Yupu Zhang; Abhishek Rajimwale; Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau (2010-02-04). "End-to-end Data Integrity for File Systems: A ZFS Case Study" (PDF). Computer Sciences Department, University of Wisconsin. Retrieved 2014-08-12.
- ^ https://indico.desy.de/getFile.py/access?contribId=65&sessionId=42&resId=0&materialId=slides&confId=257
- ^ "Observations on Errors, Corrections, & Trust of Dependent Systems".
- ^ Margaret Bierman; Lenz Grimmer (August 2012). "How I Use the Advanced Capabilities of Btrfs". Oracle Corporation. Retrieved 2014-01-02.
External links
- SoftECC: A System for Software Memory Integrity Checking
- A Tunable, Software-based DRAM Error Detection and Correction Library for HPC
- Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing
- End-to-end Data Integrity for File Systems: A ZFS Case Study
- DRAM Errors in the Wild: A Large-Scale Field Study
- A study on silent corruptions, and an associated paper on data integrity (CERN, 2007)
- End-to-end Data Protection in SAS and Fibre Channel Hard Disk Drives (HGST)