{{Short description|Systems with high up-time, a.k.a. "always on"}}
{{redirect|Always-on|the software restriction|Always-on DRM}}
{{Use American English|date = March 2019}}
{{Use mdy dates|date = March 2019}}


'''High availability''' ('''HA''') is a characteristic of a system that aims to ensure an agreed level of operational performance, usually [[uptime]], for a higher than normal period.<ref>{{Cite web |last=Robert |first=Sheldon |date=April 2024 |title=high availability (HA) |url=https://www.techtarget.com/searchdatacenter/definition/high-availability |website=[[Techtarget]]}}</ref>

Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities. [[Availability]] refers to the ability of the user community to obtain a service or good and to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is – from the user's point of view – ''unavailable''.<ref>{{cite book|author=Floyd Piedad, Michael Hawkins|title=High Availability: Design, Techniques, and Processes|url=https://books.google.com/books?id=kHB0HdQ98qYC&q=high+availability+floyd+piedad+book|isbn=9780130962881|publisher=Prentice Hall|year=2001}}</ref> Generally, the term ''[[downtime]]'' is used to refer to periods when a system is unavailable.

==Resilience==
High availability is a property of [[computer network|network]] '''resilience''', the ability to "provide and maintain an acceptable level of service in the face of [[Fault (technology)|faults]] and challenges to normal operation."<ref>{{Cite web|url=https://resilinets.org/definitions.html#Resilience|title=Definitions - ResiliNetsWiki|website=resilinets.org}}</ref> Threats and challenges for services range from simple misconfiguration to large-scale natural disasters and targeted attacks.<ref>{{Cite web|url=https://webarchiv.ethz.ch/error404.html|title=Webarchiv ETHZ / Webarchive ETH|website=webarchiv.ethz.ch}}</ref> As such, network resilience touches a very wide range of topics. To increase the resilience of a given communication network, the probable challenges and risks have to be identified, and appropriate resilience metrics have to be defined for the service to be protected.<ref>{{Cite journal|url=https://ieeexplore.ieee.org/document/5936160|title=Network resilience: a systematic approach|first1=Paul|last1=Smith|first2=David|last2=Hutchison|first3=James P.G.|last3=Sterbenz|first4=Marcus|last4=Schöller|first5=Ali|last5=Fessi|first6=Merkouris|last6=Karaliopoulos|first7=Chidung|last7=Lac|first8=Bernhard|last8=Plattner|date=July 3, 2011|journal=IEEE Communications Magazine|volume=49|issue=7|pages=88–97|via=IEEE Xplore|doi=10.1109/MCOM.2011.5936160|s2cid=10246912 }}</ref>

The importance of network resilience is continuously increasing, as communication networks are becoming a fundamental component in the operation of critical infrastructures.<ref>{{Cite web |last=accesstel |title=operational resilience {{!}} telcos {{!}} accesstel {{!}} risk {{!}} crisis |url=https://accesstel.com.au/insight/how-to-achieve-true-operational-resilience-in-telcos/ |access-date=2023-05-08 |website=accesstel |date=June 9, 2022 |language=en-AU}}</ref> Consequently, recent efforts focus on interpreting and improving network and computing resilience with applications to critical infrastructures.<ref>{{Cite web|url=https://www.kth.se/ac/research/secure-control-systems/cerces/cerces-center-for-resilient-critical-infrastructures-1.609722|title=The CERCES project - Center for Resilient Critical Infrastructures at KTH Royal Institute of Technology.|access-date=August 26, 2023|archive-date=October 19, 2018|archive-url=https://web.archive.org/web/20181019205558/https://www.kth.se/ac/research/secure-control-systems/cerces/cerces-center-for-resilient-critical-infrastructures-1.609722|url-status=dead}}</ref> For example, a resilience objective can be the provisioning of services over the network, rather than the services of the network itself. This may require a coordinated response from both the network and the services running on top of it.<ref>{{Cite journal|url=https://ieeexplore.ieee.org/document/8477177|title=A Benders Decomposition Approach for Resilient Placement of Virtual Process Control Functions in Mobile Edge Clouds|first1=Peiyue|last1=Zhao|first2=György|last2=Dán|date=December 3, 2018|journal=IEEE Transactions on Network and Service Management|volume=15|issue=4|pages=1460–1472|via=IEEE Xplore|doi=10.1109/TNSM.2018.2873178|s2cid=56594760 }}</ref>

These services include:
* supporting [[distributed processing]]
* supporting [[Cloud storage|network storage]]
* maintaining communication services such as
** [[video conferencing]]
** [[instant messaging]]
** [[online collaboration]]
* providing access to applications and data as needed

Resilience and [[survivability]] are used interchangeably, depending on the specific context of a given study.<ref>[https://smartech.gatech.edu/handle/1853/24397 Castet J., Saleh J. "Survivability and Resiliency of Spacecraft and Space-Based Networks: a Framework for Characterization and Analysis", ''American Institute of Aeronautics and Astronautics, AIAA Technical Report 2008-7707. Conference on Network Protocols (ICNP 2006)'', Santa Barbara, California, USA, November 2006]</ref>

==Principles==
{{Further|Design for availability}}

There are three principles of [[systems design]] in [[reliability engineering]] that can help achieve high availability.

# Elimination of [[single points of failure]]. This means adding or building redundancy into the system so that failure of a component does not mean failure of the entire system.
# Reliable crossover. In [[Redundancy (engineering)|redundant system]]s, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover.
# Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure – but the maintenance activity must.

==Scheduled and unscheduled downtime==
{{unreferenced section|date=June 2008}}
A distinction can be made between scheduled and unscheduled downtime. Typically, [[scheduled downtime]] is a result of [[maintenance]] that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to [[system software]] that require a [[booting|reboot]] or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is usually the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed [[CPU]] or [[RAM]] components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, security breaches, or various [[Application software|application]], [[middleware]], and [[operating system]] failures.

If users can be warned away from scheduled downtimes, then the distinction is useful. But if the requirement is for true high availability, then downtime is downtime whether or not it is scheduled.

Many computing sites exclude scheduled downtime from availability calculations, assuming that it has little or no impact upon the computing user community. By doing this, they can claim to have phenomenally high availability, which might give the illusion of [[continuous availability]]. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any [[single point of failure]] and allow online hardware, network, operating system, [[middleware]], and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example, system downtime at an office building after everybody has gone home for the night.

==Percentage calculation==
Availability is usually expressed as a percentage of uptime in a given year. The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously. [[Service level agreement]]s often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable.


{| class="wikitable" style="text-align:right;"
!Availability %
!Downtime per year{{NoteTag|Using {{val|365.25}} days per year; respectively, a quarter is a ¼ of that value (i.e., {{val|91.3125}} days), and a month is a twelfth of it (i.e., {{val|30.4375}} days). For consistency, all times are rounded to two decimal digits.}}
!Downtime per quarter
!Downtime per month
!Downtime per week
!Downtime per day (24 hours)
|-
| align="left" |90% ("one nine")
|36.53 days
|9.13 days
|73.05 hours
|16.80 hours
|2.40 hours
|-
| align="left" |95% ("one nine five")
|18.26 days
|4.56 days
|36.53 hours
|8.40 hours
|1.20 hours
|-
| align="left" |97% ("one nine seven")
|10.96 days
|2.74 days
|21.92 hours
|5.04 hours
|43.20 minutes
|-
| align="left" |98% ("one nine eight")
|7.31 days
|43.86 hours
|14.61 hours
|3.36 hours
|28.80 minutes
|-
| align="left" |99% ("two nines")
|3.65 days
|21.9 hours
|7.31 hours
|1.68 hours
|14.40 minutes
|-
| align="left" |99.5% ("two nines five")
|1.83 days
|10.98 hours
|3.65 hours
|50.40 minutes
|7.20 minutes
|-
| align="left" |99.8% ("two nines eight")
|17.53 hours
|4.38 hours
|87.66 minutes
|20.16 minutes
|2.88 minutes
|-
| align="left" |99.9% ("three nines")
|8.77 hours
|2.19 hours
|43.83 minutes
|10.08 minutes
|1.44 minutes
|-
| align="left" |99.95% ("three nines five")
|4.38 hours
|65.7 minutes
|21.92 minutes
|5.04 minutes
|43.20 seconds
|-
| align="left" |99.99% ("four nines")
|52.60 minutes
|13.15 minutes
|4.38 minutes
|1.01 minutes
|8.64 seconds
|-
| align="left" |99.995% ("four nines five")
|26.30 minutes
|6.57 minutes
|2.19 minutes
|30.24 seconds
|4.32 seconds
|-
| align="left" |99.999% ("five nines")
|5.26 minutes
|1.31 minutes
|26.30 seconds
|6.05 seconds
|864.00 milliseconds
|-
| align="left" |99.9999% ("six nines")
|31.56 seconds
|7.89 seconds
|2.63 seconds
|604.80 milliseconds
|86.40 milliseconds
|-
| align="left" |99.99999% ("seven nines")
|3.16 seconds
|0.79 seconds
|262.98 milliseconds
|60.48 milliseconds
|8.64 milliseconds
|-
| align="left" |99.999999% ("eight nines")
|315.58 milliseconds
|78.89 milliseconds
|26.30 milliseconds
|6.05 milliseconds
|864.00 microseconds
|-
| align="left" |99.9999999% ("nine nines")
|31.56 milliseconds
|7.89 milliseconds
|2.63 milliseconds
|604.80 microseconds
|86.40 microseconds
|-
| align="left" |99.99999999% ("ten nines")
|3.16 milliseconds
|788.40 microseconds
|262.80 microseconds
|60.48 microseconds
|8.64 microseconds
|-
| align="left" |99.999999999% ("eleven nines")
|315.58 microseconds
|78.84 microseconds
|26.28 microseconds
|6.05 microseconds
|864.00 [[nanosecond]]s
|-
| align="left" |100%
|0 seconds
|0 seconds
|0 seconds
|0 seconds
|0 seconds
|}
</div>
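
The arithmetic behind the table can be reproduced directly. The following minimal Python sketch (an illustration only; the names used are not from any cited source) converts an availability percentage into the allowed downtime for each period, using the same {{val|365.25}}-day year as the table note:

<syntaxhighlight lang="python">
# Minimal sketch: convert an availability percentage into allowed downtime.
# Assumes a 365.25-day year, matching the table note above.
PERIOD_SECONDS = {
    "year": 365.25 * 24 * 3600,
    "quarter": 365.25 / 4 * 24 * 3600,
    "month": 365.25 / 12 * 24 * 3600,
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
}

def allowed_downtime(availability_percent: float) -> dict:
    """Return the allowed downtime in seconds for each period."""
    unavailability = 1.0 - availability_percent / 100.0
    return {period: seconds * unavailability
            for period, seconds in PERIOD_SECONDS.items()}

downtime = allowed_downtime(99.999)          # "five nines"
print(downtime["year"] / 60)                 # about 5.26 minutes per year
print(downtime["day"])                       # about 0.864 seconds per day
</syntaxhighlight>

For 99.999% availability this yields about 5.26 minutes of downtime per year and 864 milliseconds per day, matching the "five nines" row of the table.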


The terms [[uptime]] and [[availability]] are often used interchangeably but do not always refer to the same thing. For example, a system can be "up" with its services not "available" in the case of a [[network outage]]. Or a system undergoing software maintenance can be "available" to be worked on by a [[system administrator]], but its services do not appear "up" to the [[end user]] or customer. The subject of the terms is thus important here: whether the focus of a discussion is the server hardware, server OS, functional service, software service/process, or similar, uptime and availability can be used synonymously only if there is a single, consistent subject of discussion.

=== Five-by-five mnemonic ===
A simple mnemonic rule states that ''5 nines'' allows approximately 5 minutes of downtime per year. Variants can be derived by multiplying or dividing by 10: 4 nines is 50 minutes and 3 nines is 500 minutes. In the opposite direction, 6 nines is 0.5 minutes (30 sec) and 7 nines is 3 seconds.


=== "Powers of 10" trick ===
=== "Powers of 10" trick ===
Another memory trick is to calculate <math>8.64 \times 10^x</math>, for <math>n + x = 4</math>, where <math>n</math> is the number of desired nines. More concisely <math>8.64 \times 10^{(4-n)}</math> seconds/day of downtime.
Another memory trick to calculate the allowed downtime duration for an "<math>n</math>-nines" availability percentage is to use the formula <math>8.64 \times 10^{4-n}</math> seconds per day.


For example, 90% ("one nine") yields the exponent <math>4 - 1 = 3</math>, and therefore the allowed downtime is <math>8.64 \times 10^3</math> seconds per day.

Also, 99.999% ("five nines") gives the exponent <math>4 - 5 = -1</math>, and therefore the allowed downtime is <math>8.64 \times 10^{-1}</math> seconds per day.


=== "Nines" ===
=== "Nines" ===
{{Main article|Nine (purity)}}
{{Main article|Nine (purity)}}


Percentages of a particular order of magnitude are sometimes referred to by the [[List of unusual units of measurement#Nines|number of nines]] or "class of nines" in the digits. For example, electricity that is delivered without interruptions ([[Power outage|blackout]]s, [[brownout (electricity)|brownout]]s or [[Voltage spike|surge]]s) 99.999% of the time would have 5 nines reliability, or class five.<ref>[http://www.cs.kent.edu/~walker/classes/aos.s00/lectures/L25.ps Lecture Notes] M. Nesterenko, Kent State University</ref> In particular, the term is used in connection with [[mainframes]]<ref>[http://comet.lehman.cuny.edu/cocchi/CIS345/LargeComputing/05_Availability.ppt Introduction to the new mainframe: Large scale commercial computing Chapter 5 Availability] IBM (2006)</ref><ref>[https://www.youtube.com/watch?v=DPcM5UePTY0 IBM zEnterprise EC12 Business Value Video] at ''youtube.com''</ref> or enterprise computing, often as part of a [[service-level agreement]].
Percentages of a particular order of magnitude are sometimes referred to by the [[List of unusual units of measurement#Nines|number of nines]] or "class of nines" in the digits. For example, electricity that is delivered without interruptions ([[Power outage|blackout]]s, [[brownout (electricity)|brownout]]s or [[Voltage spike|surge]]s) 99.999% of the time would have 5 nines reliability, or class five.<ref>[http://www.cs.kent.edu/~walker/classes/aos.s00/lectures/L25.ps Lecture Notes] M. Nesterenko, Kent State University</ref> In particular, the term is used in connection with [[mainframes]]<ref>[http://comet.lehman.cuny.edu/cocchi/CIS345/LargeComputing/05_Availability.ppt Introduction to the new mainframe: Large scale commercial computing Chapter 5 Availability] {{Webarchive|url=https://web.archive.org/web/20160304190213/http://comet.lehman.cuny.edu/cocchi/CIS345/LargeComputing/05_Availability.ppt |date=March 4, 2016 }} IBM (2006)</ref><ref>[https://www.youtube.com/watch?v=DPcM5UePTY0 IBM zEnterprise EC12 Business Value Video] at ''youtube.com''</ref> or enterprise computing, often as part of a [[service-level agreement]].


Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then "five", so 99.95% is "three nines five", abbreviated 3N5.<ref>{{cite book |title = Precious metals, Volume 4 |publisher = Pergamon Press |year = 1981 |isbn = 9780080253695 |page = [https://books.google.com/books?id=KOMYAAAAIAAJ&q=%22four+nines+five%22 page 262] }}</ref><ref>{{cite book|title=PVD for Microelectronics: Sputter Desposition to Semiconductor Manufacturing|year=1998|page=[https://books.google.com/books?id=hmvtBE_7i00C&pg=PA387&lpg=PA387&dq=%22four+nines+five%22 387]}}<!-- NOTE: reference has typo, writing 99.9995% as "four-nines-five" while it is actually *five* nines five. --></ref> This is casually referred to as "three and a half nines",<ref>{{cite book |title=Site Reliability Engineering: How Google Runs Production Systems |first1=Niall Richard |last1=Murphy |first2=Betsy |last2=Beyer |first3=Jennifer |last3=Petoff |first4=Chris |last4=Jones |year=2016 |page=[https://books.google.com/books?id=_4rPCwAAQBAJ&pg=PA38&dq=%22three+and+a+half+nines%22 38] }}</ref> but this is incorrect: a 5 is only a factor of 2, while a 9 is a factor of 10, so a 5 is 0.3 nines (per below formula: <math>\log_{10} 2 \approx 0.3</math>):{{NoteTag|See [[Mathematical coincidence#Concerning base 2|mathematical coincidences concerning base 2]] for details on this approximation.}} 99.95% availability is 3.3 nines, not 3.5 nines.<ref>{{cite web |url = http://www.joshdeprez.com/post/67-nines-of-availability/ |title = Nines of Nines |author = Josh Deprez |date = 2016-04-23 |access-date = May 31, 2016 |archive-date = September 4, 2016 |archive-url = https://web.archive.org/web/20160904175507/http://www.joshdeprez.com/post/67-nines-of-availability/ |url-status = dead }}</ref> More simply, going from 99.9% availability to 99.95% availability is a factor of 2 (0.1% to 0.05% unavailability), but going from 99.95% to 99.99% availability is a factor of 5 (0.05% to 0.01% unavailability), over twice as much.{{NoteTag|"Twice as much" on a logarithmic scale, meaning two ''factors'' of 2: <math>\times 2 \times 2 < \times 5</math>}}
Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then "five", so 99.95% is "three nines five", abbreviated 3N5.<ref>{{cite book |title = Precious metals, Volume 4 |publisher = Pergamon Press |year = 1981 |isbn = 9780080253695 |page = [https://books.google.com/books?id=KOMYAAAAIAAJ&q=%22four+nines+five%22 page 262] }}</ref><ref>{{cite book|title=PVD for Microelectronics: Sputter Desposition to Semiconductor Manufacturing|year=1998|page=[https://books.google.com/books?id=hmvtBE_7i00C&pg=PA387&lpg=PA387&dq=%22four+nines+five%22 387]}}<!-- NOTE: reference has typo, writing 99.9995% as "four-nines-five" while it is actually *five* nines five. --></ref> This is casually referred to as "three and a half nines",<ref>{{cite book |title=Site Reliability Engineering: How Google Runs Production Systems |first1=Niall Richard |last1=Murphy |first2=Betsy |last2=Beyer |first3=Jennifer |last3=Petoff |first4=Chris |last4=Jones |year=2016 |page=[https://books.google.com/books?id=_4rPCwAAQBAJ&pg=PA38&dq=%22three+and+a+half+nines%22 38] }}</ref> but this is incorrect: a 5 is only a factor of 2, while a 9 is a factor of 10, so a 5 is 0.3 nines (per below formula: <math>\log_{10} 2 \approx 0.3</math>):{{NoteTag|See [[Mathematical coincidence#Concerning base 2|mathematical coincidences concerning base 2]] for details on this approximation.}} 99.95% availability is 3.3 nines, not 3.5 nines.<ref>{{cite web |url = http://www.joshdeprez.com/post/67-nines-of-availability/ |title = Nines of Nines |author = Josh Deprez |date = 2016-04-23 |access-date = May 31, 2016 |archive-date = September 4, 2016 |archive-url = https://web.archive.org/web/20160904175507/http://www.joshdeprez.com/post/67-nines-of-availability/ |url-status = dead }}</ref> More simply, going from 99.9% availability to 99.95% availability is a factor of 2 (0.1% to 0.05% unavailability), but going from 99.95% to 99.99% availability is a factor of 5 (0.05% to 0.01% unavailability), over twice as much.{{NoteTag|"Twice as much" on a logarithmic scale, meaning two ''factors'' of 2: <math>\times 2 \times 2 < \times 5</math>}}
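
The relationship between a trailing 5 and "0.3 nines" follows from taking the base-10 logarithm of the unavailability. A minimal Python sketch (illustrative only, using the convention that the number of nines is <math>-\log_{10}</math> of the unavailability):

<syntaxhighlight lang="python">
import math

def nines(availability: float) -> float:
    """Number of 'nines', computed as -log10 of the unavailability."""
    return -math.log10(1.0 - availability)

print(round(nines(0.999), 2))    # 3.0  ("three nines")
print(round(nines(0.9995), 2))   # 3.3  (not 3.5)
print(round(math.log10(2), 2))   # 0.3  -> a trailing 5 adds about 0.3 nines
</syntaxhighlight>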

A formulation of the class of 9s <math>c</math> based on a system's unavailability <math>x</math> would be
:<math>c := \left\lfloor -\log_{10} x \right\rfloor</math>
(cf. [[Floor and ceiling functions]]).

A similar measurement is sometimes used to describe the purity of substances.

In general, the number of nines is not often used by a network engineer when modeling and measuring availability because it is hard to apply in formulas. More often, the unavailability is quoted, expressed as a [[probability]] (like 0.00001) or as a [[downtime]] per year. Availability specified as a number of nines is often seen in [[marketing]] documents.{{Citation needed|date=August 2008}} The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence.<ref>[http://searchstorage.techtarget.com/tip/0,289483,sid5_gci921823,00.html Evan L. Marcus, ''The myth of the nines'']</ref> For large numbers of 9s, the "unavailability" index (a measure of downtime rather than uptime) is easier to handle. For example, this is why an "unavailability" rather than availability metric is used in hard disk or data link [[bit error rate]]s.


Sometimes the humorous term "nine fives" (55.5555555%) is used to contrast with "five nines" (99.999%),<ref>{{cite magazine |last1=Newman |first1=David |last2=Snyder | first2=Joel |last3=Thayer |first3=Rodney |date=2012-06-24 |title=Crying Wolf: False alarms hide attacks |url=https://books.google.com/books?id=1xgEAAAAMBAJ&dq=%22nine+fives%22&pg=PT27 |magazine=Network World |volume=19 |issue=25 |page=60 |access-date=2019-03-15|quotation="leading to crashes and uptime numbers closer to nine fives than to five nines."}}</ref><ref>{{cite web |last=Metcalfe |first=Bob |author-link=Robert Metcalfe |date=2001-04-02 |title=After 35 years of technology crusades, Bob Metcalfe rides off into the sunset |url=https://www.itworld.com/article/2797678/after-35-years-of-technology-crusades--bob-metcalfe-rides-off-into-the-sunset.html |website=ITworld |access-date=2019-03-15 |quotation="and five nines (not nine fives) of reliability" }}{{Dead link|date=July 2024 |bot=InternetArchiveBot |fix-attempted=yes }}</ref><ref>{{cite web |last=Pilgrim |first=Jim |date=2010-10-20 |title=Goodbye Five 9s |url=https://www.seeclearfield.com/newsroom/goodbye-five-9s.html |publisher=Clearfield, Inc. |access-date=2019-03-15|quotation="but it seems to me we are moving closer to 9-5s (55.5555555%) in network reliability rather than 5-9s"}}</ref> though this is not an actual goal, but rather a sarcastic reference to something totally failing to meet any reasonable target.


==Measurement and interpretation==

Availability measurement is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have been eclipsed by a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% uptime. However, given the true definition of availability, the system will be approximately 99.9% available, or three nines (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when the systems are continuing to function. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users&nbsp;– a true availability measure is holistic.

Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. If there is a lack of instrumentation, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems which experience periodic lulls in demand.

==Closely related concepts==
Recovery time (or estimated time of repair, ETR), also known as [[recovery time objective]] (RTO), is closely related to availability: it is the total time required for a planned outage or the time required to fully recover from an unplanned outage. Another metric is [[mean time to recovery]] (MTTR). Recovery time could be infinite with certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary [[IT disaster recovery|disaster recovery]] data center.

Another related concept is [[data availability]], that is, the degree to which [[database]]s and other information storage systems faithfully record and report system transactions. Information management often focuses separately on data availability, or [[Recovery Point Objective]], in order to determine acceptable (or actual) [[data loss]] with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.

A [[service level agreement]] ("SLA") formalizes an organization's availability objectives and requirements.

==System design==
On one hand, adding more components to an overall system design can undermine efforts to achieve high availability because [[complex system]]s inherently have more potential failure points and are more difficult to implement correctly. While some analysts would put forth the theory that the most highly available systems adhere to a simple architecture (a single, high-quality, multi-purpose physical system with comprehensive internal hardware redundancy), this architecture suffers from the requirement that the entire system must be brought down for patching and operating system upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see [[load balancing (computing)|load balancing]] and [[failover]]). High availability requires less human intervention to restore operation in complex systems, because the most common cause of outages is human error.<ref name="humanerror">{{Cite web |title=What is network downtime? |url=https://www.techtarget.com/searchnetworking/definition/network-downtime |access-date=2023-12-27 |website=Networking |language=en}}</ref>

=== High availability through redundancy ===
On the other hand, [[Redundancy (engineering)|redundancy]] is used to create systems with high levels of availability (e.g. popular ecommerce websites). In this case, a high level of failure detectability and the avoidance of common-cause failures are required.


If redundant parts are used [[Availability#Series vs Parallel components|in parallel and have independent failure]] (e.g. by not being within the same data center), they can exponentially increase the availability and make the overall system highly available. If there are <math>N</math> parallel components, each with availability <math>X</math>, the availability of the combination is given by the following formula:<ref>{{Cite book |title=Reliability and Availability Engineering: Modeling, Analysis, and Applications |year=2017 |isbn=978-1107099500 |last1=Trivedi |first1=Kishor S. |last2=Bobbio |first2=Andrea |publisher=Cambridge University Press }}</ref><ref>{{Cite book |title=System Sustainment: Acquisition And Engineering Processes For The Sustainment Of Critical And Legacy Systems (World Scientific Series On Emerging Technologies: Avram Bar-cohen Memorial Series) |year=2022 |publisher=World Scientific |isbn=978-9811256844}}</ref>

:<math>\text{Availability of parallel components} = 1 - (1 - X)^N</math>

[[File:System availability chart.png|alt=10 hosts, each having 50% availability. But if they are used in parallel and fail independently, they can provide high availability.|thumb|10 hosts, each having 50% availability. But if they are used in parallel and fail independently, they can provide high availability.]]
For example, if each component has only 50% availability, using 10 such components in parallel yields 99.9023% availability.
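
As a rough illustration of the formula (a minimal Python sketch; it is not taken from the cited references):

<syntaxhighlight lang="python">
def parallel_availability(component_availability: float, n_components: int) -> float:
    """Availability of N independent components in parallel: 1 - (1 - X)^N."""
    return 1.0 - (1.0 - component_availability) ** n_components

# Ten independent components with 50% availability each, as in the chart above:
print(parallel_availability(0.50, 10))   # 0.9990234375, i.e. about 99.9023%
</syntaxhighlight>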

Two kinds of redundancy are passive redundancy and active redundancy.


Passive redundancy is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers. The boat continues toward its destination despite failure of a single engine or propeller. A more complex example is multiple redundant power generation facilities within a large system involving [[electric power transmission]]. Malfunction of single components is not considered to be a failure unless the resulting performance decline exceeds the specification limits for the entire system.

Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using a voting scheme. This is used with complex computing systems that are linked. Internet [[routing]] is derived from early work by Birman and Joseph in this area.<ref>{{IETF RFC|992}}</ref>{{secondary source needed|date=October 2024}} Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic.


Zero downtime system design means that modeling and simulation indicates that the mean time between failures significantly exceeds the period of time between [[planned maintenance]], [[upgrade]] events, or system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of [[communications satellite]]s. [[Global Positioning System]] is an example of a zero downtime system.

Fault [[instrumentation]] can be used in systems with limited redundancy to achieve high availability. Maintenance actions occur during brief periods of downtime only after a fault indicator activates. Failure is only significant if this occurs during a [[mission critical]] period.

[[Modeling and simulation]] is used to evaluate the theoretical reliability for large systems. The outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves the N-x criteria. N represents the total number of components in the system. x is the number of components used to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations where one component is faulted. N-2 means the model is stressed by evaluating performance with all possible combinations where two components are faulted simultaneously.
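
The fault combinations for an N-x analysis can be enumerated mechanically. The following sketch (illustrative only; the component names are hypothetical and real reliability models are far more detailed) lists the scenarios for N-1 and N-2:

<syntaxhighlight lang="python">
from itertools import combinations

# Hypothetical component list for a small model with five components (N = 5).
components = ["psu-1", "psu-2", "router-a", "router-b", "disk-array"]

def fault_scenarios(component_list, x):
    """Return every combination of x components faulted simultaneously (N-x)."""
    return list(combinations(component_list, x))

print(len(fault_scenarios(components, 1)))   # N-1: 5 single-fault scenarios
print(len(fault_scenarios(components, 2)))   # N-2: 10 double-fault scenarios
</syntaxhighlight>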


==Reasons for unavailability==
A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems. All reasons refer to '''not following best practice''' in each of the following areas (in order of importance):<ref>Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg: Availability of enterprise IT systems – an expert-based Bayesian model, ''Proc. Fourth International Workshop on Software Quality and Maintainability'' (WSQM 2010), Madrid, [http://www.kth.se/ees/forskning/publikationer/modules/publications_polopoly/reports/2010/IR-EE-ICS_2010_047.pdf?l=en_UK] {{Webarchive|url=https://archive.today/20120804102750/http://www.kth.se/ees/forskning/publikationer/modules/publications_polopoly/reports/2010/IR-EE-ICS_2010_047.pdf?l=en_UK|date=August 4, 2012}}</ref>

# Monitoring of the relevant components


==Costs of unavailability==
In a 1998 report from [[IBM Global Services]], unavailable systems were estimated to have cost American businesses $4.54 billion in 1996, due to lost productivity and revenues.<ref>IBM Global Services, ''Improving systems availability'', IBM Global Services, 1998, [http://www.dis.uniroma1.it/~irl/docs/availabilitytutorial.pdf] {{Webarchive|url=https://web.archive.org/web/20110401093354/http://www.dis.uniroma1.it/~irl/docs/availabilitytutorial.pdf|date=April 1, 2011}}</ref>


==See also==
* [[Availability]]
* [[Fault tolerance]]
* [[High-availability cluster]]
* [[Overall equipment effectiveness]]
* [[Reliability, availability and serviceability]]
* [[Resilience (network)]]
* [[Responsiveness]]
* [[Scalability]]
* [[Ubiquitous computing]]


[[Category:Reliability engineering]]
[[Category:Measurement]]
[[Category:Computer networks engineering]]

Latest revision as of 10:59, 21 December 2024

High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.[1]

There is now more dependence on these systems as a result of modernization. For instance, in order to carry out their regular daily tasks, hospitals and data centers need their systems to be highly available. Availability refers to the ability of the user community to obtain a service or good, access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is – from the user's point of view – unavailable.[2] Generally, the term downtime is used to refer to periods when a system is unavailable.

Resilience

[edit]

High availability is a property of network resilience, the ability to "provide and maintain an acceptable level of service in the face of faults and challenges to normal operation."[3] Threats and challenges for services can range from simple misconfiguration over large scale natural disasters to targeted attacks.[4] As such, network resilience touches a very wide range of topics. In order to increase the resilience of a given communication network, the probable challenges and risks have to be identified and appropriate resilience metrics have to be defined for the service to be protected.[5]

The importance of network resilience is continuously increasing, as communication networks are becoming a fundamental component in the operation of critical infrastructures.[6] Consequently, recent efforts focus on interpreting and improving network and computing resilience with applications to critical infrastructures.[7] As an example, one can consider as a resilience objective the provisioning of services over the network, instead of the services of the network itself. This may require coordinated response from both the network and from the services running on top of the network.[8]

These services include:

Resilience and survivability are interchangeably used according to the specific context of a given study.[9]

Principles

[edit]

There are three principles of systems design in reliability engineering that can help achieve high availability.

  1. Elimination of single points of failure. This means adding or building redundancy into the system so that failure of a component does not mean failure of the entire system.
  2. Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover.
  3. Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure – but the maintenance activity must.

Scheduled and unscheduled downtime

[edit]

A distinction can be made between scheduled and unscheduled downtime. Typically, scheduled downtime is a result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to system software that require a reboot or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is usually the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, security breaches, or various application, middleware, and operating system failures.

If users can be warned away from scheduled downtimes, then the distinction is useful. But if the requirement is for true high availability, then downtime is downtime whether or not it is scheduled.

Many computing sites exclude scheduled downtime from availability calculations, assuming that it has little or no impact upon the computing user community. By doing this, they can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example, system downtime at an office building after everybody has gone home for the night.

Percentage calculation

[edit]

Availability is usually expressed as a percentage of uptime in a given year. The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously. Service level agreements often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable.

Availability %                 | Downtime per year[note 1] | Downtime per quarter | Downtime per month  | Downtime per week    | Downtime per day (24 hours)
90% ("one nine")               | 36.53 days                | 9.13 days            | 73.05 hours         | 16.80 hours          | 2.40 hours
95% ("one nine five")          | 18.26 days                | 4.56 days            | 36.53 hours         | 8.40 hours           | 1.20 hours
97% ("one nine seven")         | 10.96 days                | 2.74 days            | 21.92 hours         | 5.04 hours           | 43.20 minutes
98% ("one nine eight")         | 7.31 days                 | 43.86 hours          | 14.61 hours         | 3.36 hours           | 28.80 minutes
99% ("two nines")              | 3.65 days                 | 21.9 hours           | 7.31 hours          | 1.68 hours           | 14.40 minutes
99.5% ("two nines five")       | 1.83 days                 | 10.98 hours          | 3.65 hours          | 50.40 minutes        | 7.20 minutes
99.8% ("two nines eight")      | 17.53 hours               | 4.38 hours           | 87.66 minutes       | 20.16 minutes        | 2.88 minutes
99.9% ("three nines")          | 8.77 hours                | 2.19 hours           | 43.83 minutes       | 10.08 minutes        | 1.44 minutes
99.95% ("three nines five")    | 4.38 hours                | 65.7 minutes         | 21.92 minutes       | 5.04 minutes         | 43.20 seconds
99.99% ("four nines")          | 52.60 minutes             | 13.15 minutes        | 4.38 minutes        | 1.01 minutes         | 8.64 seconds
99.995% ("four nines five")    | 26.30 minutes             | 6.57 minutes         | 2.19 minutes        | 30.24 seconds        | 4.32 seconds
99.999% ("five nines")         | 5.26 minutes              | 1.31 minutes         | 26.30 seconds       | 6.05 seconds         | 864.00 milliseconds
99.9999% ("six nines")         | 31.56 seconds             | 7.89 seconds         | 2.63 seconds        | 604.80 milliseconds  | 86.40 milliseconds
99.99999% ("seven nines")      | 3.16 seconds              | 0.79 seconds         | 262.98 milliseconds | 60.48 milliseconds   | 8.64 milliseconds
99.999999% ("eight nines")     | 315.58 milliseconds       | 78.89 milliseconds   | 26.30 milliseconds  | 6.05 milliseconds    | 864.00 microseconds
99.9999999% ("nine nines")     | 31.56 milliseconds        | 7.89 milliseconds    | 2.63 milliseconds   | 604.80 microseconds  | 86.40 microseconds
99.99999999% ("ten nines")     | 3.16 milliseconds         | 788.40 microseconds  | 262.80 microseconds | 60.48 microseconds   | 8.64 microseconds
99.999999999% ("eleven nines") | 315.58 microseconds       | 78.84 microseconds   | 26.28 microseconds  | 6.05 microseconds    | 864.00 nanoseconds
100%                           | 0 seconds                 | 0 seconds            | 0 seconds           | 0 seconds            | 0 seconds
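
As a quick cross-check of the table above, the following short sketch (Python, offered only as an illustration; the helper name downtime_hours is hypothetical) derives downtime allowances from an availability percentage, using the 365.25-day year described in note 1.

    # Illustrative sketch: allowed downtime per period for a given availability percentage.
    def downtime_hours(availability_pct, period_hours):
        # Allowed downtime, in hours, for one period at the given availability.
        return period_hours * (1 - availability_pct / 100)

    HOURS_PER_YEAR = 365.25 * 24          # average year, as in note 1
    periods = {
        "year": HOURS_PER_YEAR,
        "quarter": HOURS_PER_YEAR / 4,    # 91.3125 days
        "month": HOURS_PER_YEAR / 12,     # 30.4375 days
        "week": 7 * 24,
        "day": 24,
    }

    for pct in (99.9, 99.99, 99.999):
        row = {name: downtime_hours(pct, hours) for name, hours in periods.items()}
        # Print allowances in minutes; 99.999% per year gives about 5.26 minutes.
        print(pct, {name: round(hours * 60, 2) for name, hours in row.items()})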

The terms uptime and availability are often used interchangeably but do not always refer to the same thing. For example, a system can be "up" while its services are not "available", as in the case of a network outage. Or a system undergoing software maintenance can be "available" to be worked on by a system administrator, while its services do not appear "up" to the end user or customer. The subject of the terms therefore matters: uptime and availability can be used synonymously only when the discussion consistently refers to a single subject, whether that is the server hardware, the server OS, a functional service, a software service or process, or something similar.

Five-by-five mnemonic

A simple mnemonic rule states that 5 nines allows approximately 5 minutes of downtime per year. Variants can be derived by multiplying or dividing by 10: 4 nines is 50 minutes and 3 nines is 500 minutes. In the opposite direction, 6 nines is 0.5 minutes (30 sec) and 7 nines is 3 seconds.
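
A brief sketch (Python, illustrative only) compares the mnemonic values with the exact downtime per year for several classes of nines:

    # Illustrative sketch: the "five nines is about five minutes per year" mnemonic
    # and its factor-of-ten variants, compared with the exact figures.
    MINUTES_PER_YEAR = 365.25 * 24 * 60   # 525,960 minutes

    for nines in (3, 4, 5, 6, 7):
        unavailability = 10 ** -nines                  # e.g. five nines -> 0.00001
        exact = MINUTES_PER_YEAR * unavailability      # exact downtime, minutes per year
        mnemonic = 5 * 10 ** (5 - nines)               # 5 minutes at five nines; multiply or divide by 10 per nine
        print(nines, round(exact, 2), mnemonic)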

"Powers of 10" trick

Another memory trick to calculate the allowed downtime duration for an "n-nines" availability percentage is to use the formula 8.64 × 10^(4 − n) seconds per day.

For example, 90% ("one nine") yields the exponent 4 − 1 = 3, and therefore the allowed downtime is 8.64 × 10^3 seconds per day.

Also, 99.999% ("five nines") gives the exponent 4 − 5 = −1, and therefore the allowed downtime is 8.64 × 10^−1 seconds per day.
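
The rule is exact, since a day has 86,400 = 8.64 × 10^4 seconds; a short sketch (Python, illustrative only) makes the comparison explicit:

    # Illustrative sketch: the 8.64 x 10^(4 - n) seconds-per-day rule, compared
    # with the allowed downtime computed directly from the number of nines n.
    SECONDS_PER_DAY = 86_400                  # 8.64 x 10^4

    for n in (1, 2, 3, 5):
        rule = 8.64 * 10 ** (4 - n)           # the memory trick
        exact = SECONDS_PER_DAY * 10 ** -n    # allowed downtime per day, computed directly
        print(n, rule, exact)                 # the two columns agree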

"Nines"

Percentages of a particular order of magnitude are sometimes referred to by the number of nines or "class of nines" in the digits. For example, electricity that is delivered without interruptions (blackouts, brownouts or surges) 99.999% of the time would have 5 nines reliability, or class five.[10] In particular, the term is used in connection with mainframes[11][12] or enterprise computing, often as part of a service-level agreement.

Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then "five", so 99.95% is "three nines five", abbreviated 3N5.[13][14] This is casually referred to as "three and a half nines",[15] but this is incorrect: a 5 is only a factor of 2, while a 9 is a factor of 10, so a 5 is 0.3 nines (per the formula below, since log₁₀ 2 ≈ 0.3):[note 2] 99.95% availability is 3.3 nines, not 3.5 nines.[16] More simply, going from 99.9% availability to 99.95% availability is a factor of 2 (0.1% to 0.05% unavailability), but going from 99.95% to 99.99% availability is a factor of 5 (0.05% to 0.01% unavailability), over twice as much.[note 3]

A formulation of the class of 9s c based on a system's unavailability x would be

    c = ⌊−log₁₀ x⌋

(cf. Floor and ceiling functions).
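
A minimal sketch of this formula (Python; the helper name class_of_nines is hypothetical), taking the unavailability x as input:

    # Minimal sketch: class of nines c = floor(-log10(x)) for unavailability x = 1 - availability.
    import math

    def class_of_nines(unavailability):
        # Number of leading nines implied by the unavailability figure.
        return math.floor(-math.log10(unavailability))

    print(class_of_nines(0.001))     # 3 - "three nines" (99.9%)
    print(class_of_nines(0.0005))    # 3 - 99.95% is still class three (about 3.3 nines)
    print(class_of_nines(0.00001))   # 5 - "five nines" (99.999%)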

A similar measurement is sometimes used to describe the purity of substances.

In general, the number of nines is not often used by a network engineer when modeling and measuring availability, because it is hard to apply in formulas. More often, unavailability is expressed as a probability (such as 0.00001), or downtime per year is quoted. Availability specified as a number of nines is often seen in marketing documents.[citation needed] The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence.[17] For large numbers of 9s, the "unavailability" index (a measure of downtime rather than uptime) is easier to handle; this is why, for example, an unavailability rather than availability metric is used for hard disk or data link bit error rates.

Sometimes the humorous term "nine fives" (55.5555555%) is used to contrast with "five nines" (99.999%),[18][19][20] though this is not an actual goal, but rather a sarcastic reference to something totally failing to meet any reasonable target.

Measurement and interpretation

Availability measurement is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have been eclipsed by a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% uptime. However, given the true definition of availability, the system will be approximately 99.9% available, or three nines (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when the systems are continuing to function. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users – a true availability measure is holistic.
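
For the example above, the measured availability can be worked out directly (a minimal Python sketch, assuming the 9-hour outage is the only downtime in the 365-day year):

    # Minimal sketch: availability seen by users when a 9-hour outage occurs in a non-leap year.
    hours_per_year = 365 * 24            # 8,760 hours in a non-leap year
    outage_hours = 9                     # the peak-period network failure described above
    availability = (hours_per_year - outage_hours) / hours_per_year
    print(f"{availability:.4%}")         # about 99.8973%, i.e. roughly three nines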

Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. If there is a lack of instrumentation, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems which experience periodic lulls in demand.

An alternative metric is mean time between failures (MTBF).

Related concepts

Recovery time (or estimated time of repair (ETR), also known as recovery time objective (RTO)) is closely related to availability: it is the total time required for a planned outage, or the time required to fully recover from an unplanned outage. Another metric is mean time to recovery (MTTR). Recovery time could be infinite with certain system designs and failures, i.e., full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery data center.

Another related concept is data availability, that is, the degree to which databases and other information storage systems faithfully record and report system transactions. Information management often focuses separately on data availability, or the recovery point objective (RPO), in order to determine acceptable (or actual) data loss under various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.

A service level agreement ("SLA") formalizes an organization's availability objectives and requirements.

Military control systems

High availability is one of the primary requirements of the control systems in unmanned vehicles and autonomous maritime vessels. If the controlling system becomes unavailable, the Ground Combat Vehicle (GCV) or ASW Continuous Trail Unmanned Vessel (ACTUV) would be lost.

System design

On one hand, adding more components to an overall system design can undermine efforts to achieve high availability, because complex systems inherently have more potential failure points and are more difficult to implement correctly. While some analysts would put forth the theory that the most highly available systems adhere to a simple architecture (a single, high-quality, multi-purpose physical system with comprehensive internal hardware redundancy), this architecture suffers from the requirement that the entire system must be brought down for patching and operating system upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see load balancing and failover). High availability requires less human intervention to restore operation in complex systems, because the most common cause of outages is human error.[21]

High availability through redundancy

On the other hand, redundancy is used to create systems with high levels of availability (e.g. popular e-commerce websites). In this case, a high level of failure detectability and the avoidance of common-cause failures are required.

If redundant parts are used in parallel and fail independently (e.g. by not being within the same data center), they can exponentially increase the availability and make the overall system highly available. For N parallel components, each having availability X, the overall availability is given by the following formula:[22][23]

Availability of parallel components = 1 − (1 − X)^N

10 hosts, each having 50% availability. But if they are used in parallel and fail independently, they can provide high availability.

So, for example, if each of your components has only 50% availability, by using 10 of these components in parallel you can achieve 99.9023% availability.
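
A minimal sketch of this calculation (Python; the helper name parallel_availability is hypothetical); the 10-host example reproduces the 99.9023% figure quoted above:

    # Minimal sketch of the parallel-redundancy formula: 1 - (1 - X)^N
    # for N independent components, each with availability X.
    def parallel_availability(x, n):
        # Availability of n independent components of availability x used in parallel.
        return 1 - (1 - x) ** n

    print(f"{parallel_availability(0.50, 10):.4%}")   # 99.9023% - the 10-host example above
    print(f"{parallel_availability(0.99, 2):.4%}")    # 99.9900% - two "two nines" components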

Two kinds of redundancy are passive redundancy and active redundancy.

Passive redundancy is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers. The boat continues toward its destination despite failure of a single engine or propeller. A more complex example is multiple redundant power generation facilities within a large system involving electric power transmission. Malfunction of single components is not considered to be a failure unless the resulting performance decline exceeds the specification limits for the entire system.

Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using a voting scheme. This is used with complex computing systems that are linked. Internet routing is derived from early work by Birman and Joseph in this area.[24][non-primary source needed] Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic.
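
As an illustration of such a voting scheme, and not a description of any particular production system, the following hypothetical Python sketch takes the majority result of three redundant replicas, so that a single faulty replica is outvoted:

    # Hypothetical sketch: 2-out-of-3 majority voting over redundant replica results.
    from collections import Counter

    def vote(results):
        # Majority value among replica results, or None if no strict majority exists.
        value, count = Counter(results).most_common(1)[0]
        return value if count > len(results) / 2 else None

    print(vote([42, 42, 42]))   # 42   - all three replicas agree
    print(vote([42, 42, 7]))    # 42   - the faulty replica is outvoted
    print(vote([42, 7, 13]))    # None - no majority; flag for reconfiguration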

Zero downtime system design means that modeling and simulation indicates mean time between failures significantly exceeds the period of time between planned maintenance, upgrade events, or system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of communications satellites. Global Positioning System is an example of a zero downtime system.

Fault instrumentation can be used in systems with limited redundancy to achieve high availability. Maintenance actions occur during brief periods of downtime only after a fault indicator activates. Failure is only significant if this occurs during a mission critical period.

Modeling and simulation is used to evaluate the theoretical reliability of large systems. The outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves the N-x criterion: N represents the total number of components in the system, and x is the number of components used to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations in which one component is faulted. N-2 means the model is stressed by evaluating performance with all possible combinations in which two components are faulted simultaneously.
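
A minimal sketch of the N-x idea (Python), under the simplifying assumption that the modeled system is available whenever the remaining capacity covers demand; the capacities, demand figure, and helper name survives_n_minus_x are hypothetical:

    # Minimal sketch of N-x stress testing: fault every combination of x components
    # and check whether the modeled system still meets its requirement.
    from itertools import combinations

    def survives_n_minus_x(capacities, demand, x):
        # True if remaining capacity covers demand for every way of faulting x components.
        for faulted in combinations(range(len(capacities)), x):
            remaining = sum(c for i, c in enumerate(capacities) if i not in faulted)
            if remaining < demand:
                return False
        return True

    plants = [100, 100, 50, 50]                          # hypothetical generating capacities
    print(survives_n_minus_x(plants, demand=150, x=1))   # True  - any single outage is covered
    print(survives_n_minus_x(plants, demand=150, x=2))   # False - losing both 100s leaves 100 < 150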

Reasons for unavailability

A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems. All reasons refer to not following best practice in each of the following areas (in order of importance):[25]

  1. Monitoring of the relevant components
  2. Requirements and procurement
  3. Operations
  4. Avoidance of network failures
  5. Avoidance of internal application failures
  6. Avoidance of external services that fail
  7. Physical environment
  8. Network redundancy
  9. Technical solution of backup
  10. Process solution of backup
  11. Physical location
  12. Infrastructure redundancy
  13. Storage architecture redundancy

A book on the factors themselves was published in 2003.[26]

Costs of unavailability

In a 1998 report from IBM Global Services, unavailable systems were estimated to have cost American businesses $4.54 billion in 1996, due to lost productivity and revenues.[27]

Notes

  1. ^ Using 365.25 days per year; respectively, a quarter is a ¼ of that value (i.e., 91.3125 days), and a month is a twelfth of it (i.e., 30.4375 days). For consistency, all times are rounded to two decimal digits.
  2. ^ See mathematical coincidences concerning base 2 for details on this approximation.
  3. ^ "Twice as much" on a logarithmic scale, meaning two factors of 2:

References

  1. ^ Robert, Sheldon (April 2024). "high availability (HA)". Techtarget.
  2. ^ Floyd Piedad, Michael Hawkins (2001). High Availability: Design, Techniques, and Processes. Prentice Hall. ISBN 9780130962881.
  3. ^ "Definitions - ResiliNetsWiki". resilinets.org.
  4. ^ "Webarchiv ETHZ / Webarchive ETH". webarchiv.ethz.ch.
  5. ^ Smith, Paul; Hutchison, David; Sterbenz, James P.G.; Schöller, Marcus; Fessi, Ali; Karaliopoulos, Merkouris; Lac, Chidung; Plattner, Bernhard (July 3, 2011). "Network resilience: a systematic approach". IEEE Communications Magazine. 49 (7): 88–97. doi:10.1109/MCOM.2011.5936160. S2CID 10246912 – via IEEE Xplore.
  6. ^ accesstel (June 9, 2022). "operational resilience | telcos | accesstel | risk | crisis". accesstel. Retrieved May 8, 2023.
  7. ^ "The CERCES project - Center for Resilient Critical Infrastructures at KTH Royal Institute of Technology". Archived from the original on October 19, 2018. Retrieved August 26, 2023.
  8. ^ Zhao, Peiyue; Dán, György (December 3, 2018). "A Benders Decomposition Approach for Resilient Placement of Virtual Process Control Functions in Mobile Edge Clouds". IEEE Transactions on Network and Service Management. 15 (4): 1460–1472. doi:10.1109/TNSM.2018.2873178. S2CID 56594760 – via IEEE Xplore.
  9. ^ Castet J., Saleh J., "Survivability and Resiliency of Spacecraft and Space-Based Networks: a Framework for Characterization and Analysis", American Institute of Aeronautics and Astronautics, AIAA Technical Report 2008-7707. Conference on Network Protocols (ICNP 2006), Santa Barbara, California, USA, November 2006
  10. ^ Lecture Notes M. Nesterenko, Kent State University
  11. ^ Introduction to the new mainframe: Large scale commercial computing Chapter 5 Availability Archived March 4, 2016, at the Wayback Machine IBM (2006)
  12. ^ IBM zEnterprise EC12 Business Value Video at youtube.com
  13. ^ Precious metals, Volume 4. Pergamon Press. 1981. p. 262. ISBN 9780080253695.
  14. ^ PVD for Microelectronics: Sputter Deposition to Semiconductor Manufacturing. 1998. p. 387.
  15. ^ Murphy, Niall Richard; Beyer, Betsy; Petoff, Jennifer; Jones, Chris (2016). Site Reliability Engineering: How Google Runs Production Systems. p. 38.
  16. ^ Josh Deprez (April 23, 2016). "Nines of Nines". Archived from the original on September 4, 2016. Retrieved May 31, 2016.
  17. ^ Evan L. Marcus, The myth of the nines
  18. ^ Newman, David; Snyder, Joel; Thayer, Rodney (June 24, 2012). "Crying Wolf: False alarms hide attacks". Network World. Vol. 19, no. 25. p. 60. Retrieved March 15, 2019. leading to crashes and uptime numbers closer to nine fives than to five nines.
  19. ^ Metcalfe, Bob (April 2, 2001). "After 35 years of technology crusades, Bob Metcalfe rides off into the sunset". ITworld. Retrieved March 15, 2019. and five nines (not nine fives) of reliability[permanent dead link]
  20. ^ Pilgrim, Jim (October 20, 2010). "Goodbye Five 9s". Clearfield, Inc. Retrieved March 15, 2019. but it seems to me we are moving closer to 9-5s (55.5555555%) in network reliability rather than 5-9s
  21. ^ "What is network downtime?". Networking. Retrieved December 27, 2023.
  22. ^ Trivedi, Kishor S.; Bobbio, Andrea (2017). Reliability and Availability Engineering: Modeling, Analysis, and Applications. Cambridge University Press. ISBN 978-1107099500.
  23. ^ System Sustainment: Acquisition And Engineering Processes For The Sustainment Of Critical And Legacy Systems (World Scientific Series On Emerging Technologies: Avram Bar-cohen Memorial Series). World Scientific. 2022. ISBN 978-9811256844.
  24. ^ RFC 992
  25. ^ Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg: Availability of enterprise IT systems – an expert-based Bayesian model, Proc. Fourth International Workshop on Software Quality and Maintainability (WSQM 2010), Madrid, [1] Archived August 4, 2012, at archive.today
  26. ^ Marcus, Evan; Stern, Hal (2003). Blueprints for high availability (Second ed.). Indianapolis, IN: John Wiley & Sons. ISBN 0-471-43026-9.
  27. ^ IBM Global Services, Improving systems availability, IBM Global Services, 1998, [2] Archived April 1, 2011, at the Wayback Machine