Root cause analysis: Difference between revisions
Reverted 1 edit by James C. Breese: WP:OR and WP:ELNO.
Revision as of 19:23, 1 December 2012
Root cause analysis (RCA) is a method of problem solving that tries to identify the root causes of faults or problems that cause operating events.
RCA practice tries to solve problems by attempting to identify and correct the root causes of events, as opposed to simply addressing their symptoms. By focusing correction on root causes, problem recurrence can be prevented. RCFA (Root Cause Failure Analysis) recognizes that complete prevention of recurrence by one corrective action is not always possible.
Nevertheless, in the U.S. nuclear power industry the NRC requires that "In the case of significant conditions adverse to quality, the measures shall assure that the cause of the condition is determined and corrective action taken to prevent repetition." [10 CFR 50, Appendix B, Criterion XVI, Sentence 2] In practice, more than one "cause" is allowed and more than one corrective action is not forbidden.
Conversely, there may be several effective measures (methods) that address the root causes of a problem. Thus, RCA is often considered to be an iterative process, and is frequently viewed as a tool of continuous improvement.
RCA is typically used as a reactive method of identifying the causes of events, revealing problems and solving them. Analysis is done after an event has occurred. Insights gained from RCA may also make it useful as a pro-active method, in which case RCA can be used to forecast or predict probable events before they occur. Although one often follows the other, RCA is a completely separate process from incident management.
Root cause analysis is not a single, sharply defined methodology; there are many different tools, processes, and philosophies for performing RCA. However, several very broadly defined approaches or "schools" can be identified by their basic approach or field of origin: safety-based, production-based, process-based, failure-based, and systems-based.
- Safety-based RCA descends from the fields of accident analysis and occupational safety and health.
- Production-based RCA has its origins in the field of quality control for industrial manufacturing.
- Process-based RCA is basically a follow-on to production-based RCA, but with a scope that has been expanded to include business processes.
- Failure-based RCA is rooted in the practice of failure analysis as employed in engineering and maintenance.
- Systems-based RCA has emerged as an amalgamation of the preceding schools, along with ideas taken from fields such as change management, risk management, and systems analysis.
Despite the different approaches among the various schools of root cause analysis, there are some common principles. It is also possible to define several general processes for performing RCA.
General principles of root cause analysis
- The primary aim of RCA is to identify the factors that resulted in the nature, the magnitude, the location, and the timing of the harmful outcomes (consequences) of one or more past events in order to identify what behaviours, actions, inactions, or conditions need to be changed to prevent recurrence of similar harmful outcomes and to identify the lessons to be learned to promote the achievement of better consequences. ("Success" is defined as the near-certain prevention of recurrence.)
- To be effective, RCA must be performed systematically, usually as part of an investigation, with conclusions and root causes that are identified backed up by documented evidence. Usually a team effort is required.
- There may be more than one root cause for an event or a problem. The difficult part is showing the persistence and sustaining the effort required to determine all of them.
- The purpose of identifying all solutions to a problem is to prevent recurrence at lowest cost in the simplest way. If there are alternatives that are equally effective, then the simplest or lowest cost approach is preferred.
- Root causes identified depend on the way in which the problem or event is defined. Effective problem statements and event descriptions (as failures, for example) are helpful, or even required.
- To be effective, the analysis should establish a sequence of events or timeline in order to understand the relationships between contributory (causal) factors, root cause(s), and the defined problem or event that is to be prevented in the future.
- Root cause analysis can help to transform a reactive culture (that reacts to problems) into a forward-looking culture that solves problems before they occur or escalate. More importantly, it reduces the frequency of problems occurring over time within the environment where the RCA process is used.
- RCA is a threat to many cultures and environments. Threats to cultures often meet with resistance. There may be other forms of management support required to achieve RCA effectiveness and success. For example, a "non-punitive" policy toward problem identifiers may be required.
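The principle that the simplest or lowest cost approach is preferred among equally effective alternatives can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Solution` class, the `preferred_solution` function, and the example candidates and costs are all hypothetical and do not come from any RCA standard or from the example that follows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """A candidate corrective action (hypothetical structure for illustration)."""
    name: str
    prevents_recurrence: bool  # judged effective by the investigation team
    cost: float                # any consistent unit; the values below are made up


def preferred_solution(candidates):
    """Return the lowest-cost candidate among those judged effective, or None."""
    effective = [s for s in candidates if s.prevents_recurrence]
    if not effective:
        return None
    return min(effective, key=lambda s: s.cost)


# Hypothetical candidates, invented purely to exercise the function.
EXAMPLE_CANDIDATES = [
    Solution("Replace the positioning table", True, 50000.0),
    Solution("Tighten the maintenance schedule", True, 2000.0),
    Solution("Re-inspect returned units only", False, 500.0),
]

if __name__ == "__main__":
    best = preferred_solution(EXAMPLE_CANDIDATES)
    print(best.name if best else "No effective solution identified")
```

The cheapest candidate overall is deliberately one that does not prevent recurrence, to show that effectiveness is checked before cost.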
Here's an actual example of a root-cause analysis that was performed by an engineer in the disk drive industry:
"We received a disk drive back from a customer; the customer complained that the drive would no longer spin up. Upon investigation, it became clear that one of the three transistors used to spin up the spindle motor had died, so this dead transistor was the first-level cause of the disk drive failure. We carefully investigated the circuitry which supplied signals to this dead transistor, and after we satisfied ourselves that all of this surrounding circuitry was operating correctly, we removed the dead transistor and tested it; one of the three leads on the transistor was an open circuit. This open circuit was the second-level cause of the disk drive failure. We sent the transistor to an x-ray laboratory, and took an x-ray picture of the transistor’s guts. The picture was very interesting: the semi-conductor die inside the transistor was not aligned properly with respect to the bonding wires which were supposed to connect to the transistor's three leads. So this die mis-alignment was the third-level cause of the disk drive failure. We contacted the vendor of the transistor and asked him to investigate the cause of the die mis-alignment, and during his investigation the transistor vendor found that a chip-positioning table in their factory had not been maintained properly and had slop in the mechanism; this was the fourth-level cause of the disk drive failure.
"We did not pursue this failure analysis any further but I am sure that the transistor vendor carried this analysis another step or two, and probably tightened up the maintenance procedure for his die-alignment table. For our purposes we had completed a root cause failure analysis."
So a quick picture of this particular example of root-cause failure analysis shows at least 7 steps:
1. Disk drive failure (discovered by the end user)
2. Spindle motor fails to spin up (discovered by the end user)
3. Dead transistor (discovered by disk-drive vendor)
4. Open connection at one lead of the transistor (discovered by the disk-drive vendor)
5. Die mis-alignment (discovered by x-ray)
6. Slop in alignment table at transistor vendor (discovered by transistor vendor)
7. Poor maintenance procedure (at transistor-vendor’s factory)
This true story illustrates several very important principles of root cause failure analysis:
First of all, make sure that any action that you take during the analysis will not destroy any information that you might need later. In the example above, the engineer described the precautions that he took before he removed the transistor from the circuitry; he took special pains to make sure that all the circuitry around the defective transistor seemed to be working correctly, and that there were no indications of circuit or design defects which might have damaged the transistor. Only after he was satisfied that all the surrounding circuitry was O.K. did he remove the dead transistor.
Next, keep investigating until you reach a level beyond which it is impractical to proceed further; one definition of this criterion comes from a friend of mine who works for a Silicon Valley company that is known for its high quality. He tells me that the rule of thumb at his company is to ask "WHY" five times, and if further analysis does not seem practical, they consider that the root cause failure analysis is complete at that point. In the example above, of course, we were able to ask "WHY" six times before we got to a reasonable root cause.
Third, take corrective action, so the problem will not occur again. In the case above, the transistor vendor repaired his chip-positioning table (and presumably revamped his maintenance procedure) so the alignment problem would not happen again.
Fourth, publish the results of your investigation, so that you and others will not have to repeat this same exercise in the future.
<http://www.amazon.com/Famous-By-Friday-ebook/dp/B00A9IEPMW/ref=sr_1_2?s=books&ie=UTF8&qid=1354232104&sr=1-2&keywords=famous+by+friday> See Chapter 10.
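The causal chain in the disk drive example can be written down as an ordered list, which makes the count of "why" questions explicit: seven levels imply six steps from one level to the next. The following is a minimal Python sketch; `CauseLevel`, `count_whys` and `root_cause` are illustrative names chosen here, not part of any standard RCA tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CauseLevel:
    """One level in the causal chain (hypothetical structure for illustration)."""
    description: str
    source: str  # who discovered this level, or where it was found


# The seven levels listed in the example, ordered from symptom to root cause.
CHAIN = [
    CauseLevel("Disk drive failure", "end user"),
    CauseLevel("Spindle motor fails to spin up", "end user"),
    CauseLevel("Dead transistor", "disk-drive vendor"),
    CauseLevel("Open connection at one lead of the transistor", "disk-drive vendor"),
    CauseLevel("Die mis-alignment", "x-ray"),
    CauseLevel("Slop in alignment table at transistor vendor", "transistor vendor"),
    CauseLevel("Poor maintenance procedure", "transistor vendor's factory"),
]


def count_whys(chain):
    """Each step from one level to the next answers one 'why' question."""
    return max(len(chain) - 1, 0)


def root_cause(chain):
    """Treat the deepest level reached as the root cause of the chain."""
    return chain[-1]


if __name__ == "__main__":
    # Pair each effect with the cause found one level deeper.
    for effect, cause in zip(CHAIN, CHAIN[1:]):
        print(f"Why '{effect.description}'? Because: {cause.description}")
    print("Whys asked:", count_whys(CHAIN))
    print("Root cause:", root_cause(CHAIN).description)
```

Treating the last entry as the root cause mirrors the stopping rule described above: the chain ends where further analysis stopped being practical for the investigator, not necessarily where causation ends.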
General process for performing and documenting an RCA-based Corrective Action
Notice that RCA (in steps 3, 4 and 5) forms the most critical part of successful corrective action, because it directs the corrective action at the true root cause of the problem. The root cause is secondary to the goal of prevention, but without knowing the root cause, it is not possible to determine what an effective corrective action for the defined problem will be.
1. Define the problem or describe the event factually. Include the qualitative and quantitative attributes (properties) of the harmful outcomes. This usually includes specifying the nature, the magnitude, the location, and the timing of the events.
2. Gather data and evidence, classifying it along a timeline of events leading up to the final failure or crisis. For every behavior, condition, action, and inaction, specify in the timeline what should have been done when it differs from what was done.
3. Ask "why" and identify the causes associated with each step in the sequence towards the defined problem or event. "Why" is taken to mean "What were the factors that directly resulted in the effect?"
4. Classify causes into causal factors that relate to an event in the sequence, and root causes that, if eliminated, can be agreed to have interrupted that step of the sequence chain.
5. Identify all other harmful factors that have an equal or better claim to be called "root causes." If there are multiple root causes, which is often the case, reveal those clearly for later optimum selection.
6. Identify corrective action(s) that will with certainty prevent recurrence of each harmful effect, including outcomes and factors. Check that each corrective action would, if implemented before the event, have reduced or prevented specific harmful effects.
7. Identify solutions that, when effective and agreed by consensus of the group, prevent recurrence with reasonable certainty, are within the institution's control, meet its goals and objectives, and do not cause or introduce other new, unforeseen problems.
8. Implement the recommended root cause correction(s).
9. Ensure effectiveness by observing the implemented solutions.
10. Identify other methodologies for problem solving and problem avoidance that may be useful.
11. Identify and address the other instances of each harmful outcome and harmful factor.
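The timeline and classification steps of the process above can be sketched as a small data structure. This is a minimal Python illustration, assuming that each timeline entry records what was done, what should have been done, and whether the team agreed that eliminating it would have interrupted the sequence. The `TimelineEntry` class, the `classify` function, and the sample entries are hypothetical; the entries are loosely modelled on the disk drive example and are not taken from a real investigation record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimelineEntry:
    """One behavior, condition, action, or inaction on the event timeline."""
    order: int
    what_was_done: str
    what_should_have_been_done: Optional[str] = None  # None: no deviation recorded
    is_root_cause: bool = False  # team judgment: removing it interrupts the chain

    @property
    def is_deviation(self):
        """True when the expected action is recorded and differs from the actual one."""
        return (self.what_should_have_been_done is not None
                and self.what_should_have_been_done != self.what_was_done)


def classify(timeline):
    """Split the deviations, in timeline order, into (causal_factors, root_causes)."""
    ordered = sorted(timeline, key=lambda e: e.order)
    deviations = [e for e in ordered if e.is_deviation]
    root_causes = [e for e in deviations if e.is_root_cause]
    causal_factors = [e for e in deviations if not e.is_root_cause]
    return causal_factors, root_causes


# Hypothetical entries, invented for illustration only.
TIMELINE = [
    TimelineEntry(3, "Drive shipped to customer"),
    TimelineEntry(1, "Alignment table run without scheduled maintenance",
                  "Alignment table maintained on schedule", is_root_cause=True),
    TimelineEntry(2, "Die bonded while mis-aligned",
                  "Die aligned with the bonding wires before bonding"),
]

if __name__ == "__main__":
    causal_factors, root_causes = classify(TIMELINE)
    for entry in causal_factors:
        print("Causal factor:", entry.what_was_done)
    for entry in root_causes:
        print("Root cause:", entry.what_was_done)
```

The root-cause flag is set by hand because the classification is a judgment reached by the investigating team rather than something the code can derive.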
See also
- Failure mode and effects analysis
- Fault tree analysis
- Forensic engineering
- Fix it twice
- Eight Disciplines Problem Solving
- Multiple regression and multivariate linear regression