'''Safety engineering''' is used to assure that a [[life-critical]] [[system]] behaves as needed even when pieces fail. Safety engineering is strongly related to [[systems engineering]].

Safety engineers distinguish different extents of defective operation: A "fault" is said to occur when some piece of equipment does not operate as designed. A "failure" only occurs if a human being (other than a repair person) has to cope with the situation. A "critical" failure endangers one or a few people. A "catastrophic" failure endangers, harms or kills a significant population of people.

Safety engineers also identify different modes of safe operation: A "probabilistically safe" system has no single point of failure, and enough redundant sensors, computers and effectors so that it is very unlikely to cause harm (usually "very unlikely" means less than one human life lost in a billion hours of operation). An "inherently safe" system is a clever mechanical arrangement that cannot be made to cause harm. This is obviously the best arrangement, but it is not always possible; "inherently safe" airplanes, for example, are not possible. A "fail-safe" system is one that cannot cause harm when it fails. A "fault-tolerant" system can continue to operate with faults, though its operation may be degraded in some fashion.

These terms combine to describe the safety needed by systems: For example, most biomedical equipment is only "critical," and often another identical piece of equipment is nearby, so it can be merely "probabilistically fail-safe." Train signals can cause "catastrophic" accidents (imagine chemical releases from tank-cars) and are usually "inherently safe." Aircraft "failures" are "catastrophic" (at least for their passengers and crew), so aircraft are usually "probabilistically fault-tolerant." Without any safety features, nuclear reactors might have "catastrophic failures," so real nuclear reactors are required to be at least "probabilistically fail-safe," and newer [[pebble-bed reactor|designs]] are "inherently fault-tolerant."

==The Process==

Ideally, safety engineers take an early design of a system, analyze it to find what faults can occur, and then propose changes to make the system safer. In an early design stage, a fail-safe system can often be made acceptably safe with a few sensors and some software to read them. Probabilistically fault-tolerant systems can often be made by using more, but smaller and less expensive, pieces of equipment.

Historically, many organizations viewed "safety engineering" as a process to produce documentation to gain regulatory approval, rather than a real asset to the engineering process. These same organizations have often made their views into a self-fulfilling prophecy by assigning less-able personnel to safety engineering.

Far too often, rather than actually helping with the design, safety engineers are assigned to prove that an existing, completed design is safe. If a competent safety engineer then discovers significant safety problems late in the design process, correcting them can be very expensive. This project management error has wasted billions of dollars in the development of commercial [[nuclear reactor]]s.

==Analysis Techniques==

The two most common fault modeling techniques are called "failure modes and effects analysis" and "fault tree analysis." These techniques are just ways of finding problems and of making plans to cope with failures.

===Failure Modes and Effects Analysis===
In the technique known as "failure modes and effects analysis", an engineer starts with a block diagram of a system. The engineer then considers what happens if each block of the diagram fails. The engineer than draws up a table in which failures are paired with their effects and an evaluation of the effects. The design of the system is then corrected, and the table adjusted until the system is not known to have unacceptable problems. Of course, the engineers may make mistakes. It's very helpful to have several engineers review the failure modes and effects analysis. |
In the technique known as "failure modes and effects analysis", an engineer starts with a block diagram of a system. The engineer then considers what happens if each block of the diagram fails. The engineer than draws up a table in which failures are paired with their effects and an evaluation of the effects. The design of the system is then corrected, and the table adjusted until the system is not known to have unacceptable problems. Of course, the engineers may make mistakes. It's very helpful to have several engineers review the failure modes and effects analysis. |
||
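
A minimal sketch of such a worksheet, written here in Python (the components, failure modes, severity scale, and threshold are all hypothetical, chosen only to illustrate the shape of the table):

<pre>
# Each row pairs a failure mode with its effect and a severity rating;
# rows rated above the acceptable threshold call for a design change.

ACCEPTABLE_SEVERITY = 2  # hypothetical scale: 1 = minor ... 4 = catastrophic

fmea_table = [
    # (block,            failure mode,   effect,                     severity)
    ("pressure sensor", "reads low",     "pump overfills the tank",  3),
    ("pressure sensor", "reads high",    "pump starves, no output",  2),
    ("relief valve",    "stuck closed",  "vessel may overpressure",  4),
    ("controller",      "loses power",   "system halts in place",    1),
]

for block, mode, effect, severity in fmea_table:
    if severity > ACCEPTABLE_SEVERITY:
        print(f"redesign needed: {block} / {mode} -> {effect} (severity {severity})")
</pre>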

===Fault Tree Analysis===
In the technique known as "fault tree analysis", an undesired effect is taken as the root of a tree of logic. Then, each situation that could cause that effect is added to the tree as a series of logic expressions. When [[fault tree]]s have real numbers about failure probabilities (often unavailable because of testing expense), [[computer program]]s can calculate failure probabilities from fault trees. The classic computer program is the Idaho National Engineering and Environmental Laboratory's [[SAPHIRE]], which is used by the U.S. government to evaluate the safety and [[reliability]] of [[nuclear reactor]]s, the [[space shuttle]], and the [[International Space Station]]. |
In the technique known as "fault tree analysis", an undesired effect is taken as the root of a tree of logic. Then, each situation that could cause that effect is added to the tree as a series of logic expressions. When [[fault tree]]s have real numbers about failure probabilities (often unavailable because of testing expense), [[computer program]]s can calculate failure probabilities from fault trees. The classic computer program is the Idaho National Engineering and Environmental Laboratory's [[SAPHIRE]], which is used by the U.S. government to evaluate the safety and [[reliability]] of [[nuclear reactor]]s, the [[space shuttle]], and the [[International Space Station]]. |
||
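
The arithmetic behind such a calculation can be sketched in a few lines of Python, under the assumption that the basic events are independent (real tools such as SAPHIRE handle far more, including common-cause failures; the probabilities below are invented):

<pre>
# Toy fault-tree evaluation with independent basic events.
# An AND gate's output fails only if all inputs fail;
# an OR gate's output fails if any input fails.

def and_gate(probs):
    result = 1.0
    for p in probs:
        result *= p
    return result

def or_gate(probs):
    survive = 1.0
    for p in probs:
        survive *= (1.0 - p)
    return 1.0 - survive

# Top event: "coolant flow lost" =
#   (power supply fails) OR (both redundant pumps fail)
p_power = 1e-4
p_pump = 1e-3
p_top = or_gate([p_power, and_gate([p_pump, p_pump])])
print(p_top)  # about 1e-4: the shared power supply dominates
</pre>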

[[Unified Modeling Language|UML]] [[activity diagram]]s have been used as graphical components in a fault tree analysis.

==Safety Certification==

Usually a failure in safety-certified systems is acceptable if less than one life per 30 years of operation (10<sup>9</sup> seconds) is lost to mechanical failure. Most Western nuclear reactors, medical equipment, and commercial aircraft are certified to this level.
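
For scale, 30 years of continuous operation is indeed on the order of 10<sup>9</sup> seconds:

<math>30 \times 365.25 \times 24 \times 3600\ \mathrm{s} \approx 9.5 \times 10^{8}\ \mathrm{s} \approx 10^{9}\ \mathrm{s}</math>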

==Preventing Failure==

===Probabilistic Fault Tolerance: Adding Redundant Equipment and Systems===

Once a failure mode is identified, it can usually be prevented entirely by adding extra equipment to the system. For example, nuclear reactors emit dangerous radiation and contain nasty poisons, and nuclear reactions can cause such high heat that no substance can contain them. Therefore reactors have emergency core cooling systems to keep the heat down, shielding to contain the radiation, and containments (usually several, nested) to prevent leakage.

Most biological organisms have extreme amounts of redundancy: multiple organs, multiple limbs, etc.

For any given failure, a fail-over or redundancy can almost always be designed and incorporated into a system.
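
The arithmetic that makes redundancy work is simple, under the strong assumption that the redundant units fail independently (common-cause failures, such as a shared power supply, can defeat it):

<pre>
# With n independent redundant units, each failing with probability p,
# the function is lost only if every unit fails.

def prob_all_fail(p_single: float, n: int) -> float:
    return p_single ** n

print(prob_all_fail(1e-3, 1))  # about 1e-3  single unit
print(prob_all_fail(1e-3, 2))  # about 1e-6  duplexed
print(prob_all_fail(1e-3, 3))  # about 1e-9  triplexed
</pre>

This is how triplication of modestly reliable parts can approach the one-in-a-billion figures mentioned above, provided the failures really are independent.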

===Inherent Fail-Safe Design===

When adding equipment is impractical (usually because of expense), the least expensive form of design is often "inherently [[fail safe|fail-safe]]". The typical approach is to arrange the system so that ordinary single failures cause the mechanism to shut down in a safe way.

One of the most common fail-safe systems is the overflow tube in most commodes. If the valve in the commode sticks open, rather than causing the tank to overflow and damage the bathroom, the tank spills into an overflow tube and the commode "leaks."

Another common example is the elevator, in which the cable supporting the car pulls spring-loaded brakes open. If the cable breaks, the brakes grab the rails and the car does not fall.

Another common inherently fail-safe system is the pilot-light sensor in most gas furnaces. If the pilot light is cold, a mechanical arrangement disengages the gas valve, so that the house cannot fill with unburned gas.

Inherent fail-safes are common in medical equipment, traffic and railway signals, communications equipment, and safety equipment.
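
The same shut-down-on-failure principle is applied in software as well. A minimal sketch of one common pattern, a watchdog that trips the system into a safe state when a periodic heartbeat stops (the names and the timeout value here are invented for illustration):

<pre>
import time

HEARTBEAT_TIMEOUT = 1.0  # seconds; hypothetical value

last_heartbeat = time.monotonic()

def heartbeat():
    # Called by the main control loop while it is healthy.
    global last_heartbeat
    last_heartbeat = time.monotonic()

def enter_safe_state():
    # Like the furnace valve, the safe state must hold without
    # any further action from the (possibly failed) system.
    print("outputs de-energized; system held in safe state")

def watchdog_check():
    # Called periodically by an independent timer.
    if time.monotonic() - last_heartbeat > HEARTBEAT_TIMEOUT:
        enter_safe_state()
</pre>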

==The safety engineer==

===Personality and role===

Oddly enough, personality issues can be paramount for a safety engineer. They must be personally pleasant, intelligent, and ruthless with themselves and their organization. In particular, they have to be able to "sell" the failures that they discover, as well as the attendant expense and time needed to correct them.

Safety engineers have to be ruthless about getting facts from other engineers. It is common for a safety engineer to consider software, chemical, electronic, electrical, mechanical, procedural, and training problems in the same day. Often the facts can be very uncomfortable.

===Teamwork===

It is important to make the safety engineers part of a team, so that safety problems cannot be discounted as due to the safety engineers' personality problems or ignored by firing a single engineer.

It is a severe safety problem if an engineering team or management discredits a safety engineer: either the manager appointed a poor engineer to the position, indicating that there may be numerous undiscovered safety issues, or the team has inverted development priorities and considers safety to be less important than upper management or government does.

==See also==

* [[life-critical]]
* [[reliability theory]]
* [[nuclear reactor]]
* [[biomedical engineering]]
* [[SAPHIRE]] (risk analysis software)
* Some of the techniques of safety engineering have been applied to the field of [[security engineering]].