High availability

High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.

Users want their systems, for example wrist watches, hospitals, airplanes or computers, to be ready to serve them at all times. Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable.[1] Generally, the term downtime is used to refer to periods when a system is unavailable.

Scheduled and unscheduled downtime

A distinction can be made between scheduled and unscheduled downtime. Typically, scheduled downtime is a result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to system software that require a reboot or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or an environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature-related shutdown, logically or physically severed network connections, catastrophic security breaches, or various application, middleware, and operating system failures.

Many computing sites exclude scheduled downtime from availability calculations, assuming, correctly or incorrectly, that scheduled downtime has little or no impact upon the computing user community. By excluding scheduled downtime, many systems can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements.[citation needed] For certain systems, scheduled downtime does not matter, for example system downtime at an office building after everybody has gone home for the night.

Percentage calculation

Availability is usually expressed as a percentage of uptime in a given year. Service level agreements often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows the downtime that would be allowed for a particular availability percentage, presuming that the system is required to operate continuously, translated into the corresponding amount of time the system would be unavailable per year, month, or week.

Availability % Downtime per year Downtime per month* Downtime per week
90% ("one nine") 36.5 days 72 hours 16.8 hours
95% 18.25 days 36 hours 8.4 hours
98% 7.30 days 14.4 hours 3.36 hours
99% ("two nines") 3.65 days 7.20 hours 1.68 hours
99.5% 1.83 days 3.60 hours 50.4 minutes
99.8% 17.52 hours 86.4 minutes 20.16 minutes
99.9% ("three nines") 8.76 hours 43.2 minutes 10.1 minutes
99.95% 4.38 hours 21.6 minutes 5.04 minutes
99.99% ("four nines") 52.56 minutes 4.32 minutes 1.01 minutes
99.999% ("five nines") 5.26 minutes 25.9 seconds 6.05 seconds
99.9999% ("six nines") 31.5 seconds 2.59 seconds 0.605 seconds

* For monthly calculations, a 30-day month is used.
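
The figures in the table follow directly from multiplying the unavailability fraction by the length of the period. As an illustration (not part of the original article), the following Python sketch reproduces the conversion; the function name is arbitrary and the 30-day month matches the footnote above.

```python
# Illustrative sketch: converting an availability percentage into the
# allowed downtime per year, 30-day month, and week.

def downtime_allowed(availability_pct):
    """Return allowed downtime in hours per (year, 30-day month, week)."""
    unavailability = 1.0 - availability_pct / 100.0
    hours_per_year = 365 * 24   # non-leap year
    hours_per_month = 30 * 24   # 30-day month, per the footnote above
    hours_per_week = 7 * 24
    return (unavailability * hours_per_year,
            unavailability * hours_per_month,
            unavailability * hours_per_week)

# Example: "three nines"
year_h, month_h, week_h = downtime_allowed(99.9)
print(f"{year_h:.2f} h/year, {month_h * 60:.1f} min/month, {week_h * 60:.1f} min/week")
# -> 8.76 h/year, 43.2 min/month, 10.1 min/week, matching the table row for 99.9%
```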

Uptime and availability are not synonymous. A system can be up, but not available, as in the case of a network outage.

In general, the number of nines is not often used by network engineers when modeling and measuring availability, because it is hard to apply in formulas. More often, unavailability is expressed as a probability (such as 0.00001) or as downtime per year. Availability specified as a number of nines is often seen in marketing documents.[citation needed]

The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence.[2]

Measurement and interpretation

How availability is measured is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might nonetheless have suffered a network failure that lasted nine hours during a peak usage period; the user community will see the system as unavailable during that time, whereas the system administrator will claim 100% "uptime." However, given the true definition of availability, the system is approximately 99.9% available, or three nines (8,751 hours of available time out of 8,760 hours in a non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when they continue to function. Similarly, unavailability of selected application functions might go unnoticed by administrators yet be devastating to users; a true availability measure is holistic.
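
The 99.9% figure in the example above can be checked with a one-line calculation; the following sketch is illustrative only.

```python
# Illustrative check of the example above: nine hours of downtime
# during a non-leap year (8,760 hours).
hours_per_year = 365 * 24          # 8760
downtime_hours = 9
availability = (hours_per_year - downtime_hours) / hours_per_year
print(f"{availability:.4%}")       # -> 99.8973%, i.e. roughly "three nines"
```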

Availability can only be determined through measurement, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. In the absence of such instrumentation, systems supporting high-volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems that experience periodic lulls in demand.

Recovery time, or estimated time of repair (ETR), is closely related to availability: it is the total time required for a planned outage, or the time required to fully recover from an unplanned outage. Recovery time can be effectively infinite for certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster-recovery data center.

Another related concept is data availability, that is, the degree to which databases and other information storage systems faithfully record and report system transactions. Information management specialists often focus separately on data availability in order to determine acceptable (or actual) data loss under various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.

A service level agreement ("SLA") formalizes an organization's availability objectives and requirements.

System design for high availability

Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability, because complex systems inherently have more potential failure points and are more difficult to implement correctly. Some analysts put forth the theory that the most highly available systems adhere to a simple architecture: a single, high-quality, multi-purpose physical system with comprehensive internal hardware redundancy. However, this architecture suffers from the requirement that the entire system must be brought down for patching and operating system upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see load balancing and failover).
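
As a rough, hypothetical illustration of the failover idea (not a description of any particular product), a front end might periodically probe redundant back ends and route traffic only to one that responds. The endpoint URLs and timeout below are invented placeholders.

```python
# Hedged sketch of health-check-driven failover, assuming two redundant
# back ends behind a front end. URLs and timeout are hypothetical.
import urllib.request
import urllib.error

BACKENDS = ["http://backend-a.example/health",
            "http://backend-b.example/health"]   # hypothetical endpoints

def first_healthy(backends, timeout=2.0):
    """Return the first back end whose health check answers HTTP 200."""
    for url in backends:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue                 # treat any failure as "unavailable"
    return None                      # no healthy back end: service is down

target = first_healthy(BACKENDS)
print("route traffic to:", target)
```

With this approach, one back end can be taken down for patching while the other continues to serve requests, which is the basic mechanism that lets redundant designs avoid scheduled downtime.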

The communications and computing industry has established the Service Availability Forum to foster the creation of high availability network infrastructure products, systems and services. The same basic design principle applies beyond computing in such diverse fields as nuclear power, aeronautics, and medical care.

Reasons for unavailability

A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems, from most to least important, as follows:[3]

1. Lack of best practice change control
2. Lack of best practice monitoring of the relevant components
3. Lack of best practice requirements and procurement
4. Lack of best practice operations
5. Lack of best practice avoidance of network failures
6. Lack of best practice avoidance of internal application failures
7. Lack of best practice avoidance of external services that fail
8. Lack of best practice physical environment
9. Lack of best practice network redundancy
10. Lack of best practice technical solution of backup
11. Lack of best practice process solution of backup
12. Lack of best practice physical location
13. Lack of best practice infrastructure redundancy
14. Lack of best practice storage architecture redundancy

The factors themselves are based on the work of Evan Marcus & Hal Stern.[4]

Costs of unavailability

A 1998 report from IBM Global Services estimated that unavailable systems cost American businesses $4.54 billion in 1996 in lost productivity and revenue.[5]

References

  1. ^ Piedad, Floyd. High Availability: Design, Techniques, and Processes, [1]
  2. ^ Evan L. Marcus, The myth of the nines
  3. ^ Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg: Availability of enterprise IT systems - an expert-based Bayesian model, Proc. Fourth International Workshop on Software Quality and Maintainability (WSQM 2010), Madrid, [2]
  4. ^ E. Marcus and H. Stern, Blueprints for high availability, second edition. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.
  5. ^ IBM Global Services, Improving systems availability, IBM Global Services, 1998, [3]