
Fault management

From Wikipedia, the free encyclopedia

Latest revision as of 17:55, 26 December 2020

In network management, fault management is the set of functions that detect, isolate, and correct malfunctions in a telecommunications network and compensate for environmental changes. These functions include maintaining and examining error logs, accepting and acting on error detection notifications, tracing and identifying faults, carrying out sequences of diagnostic tests, correcting faults, reporting error conditions, and localizing and tracing faults by examining and manipulating database information.[1]

When a fault or event occurs, a network component will often send a notification to the network operator using a protocol such as SNMP. An alarm is a persistent indication of a fault that clears only when the triggering condition has been resolved. A current list of problems occurring on the network component is often kept in the form of an active alarm list, such as the one defined in RFC 3877, the Alarm MIB. Most network management systems also maintain a list of cleared faults.[2]
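
The alarm lifecycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the data model of RFC 3877: the class names `Alarm` and `AlarmList`, the `(source, condition)` key, and the example device and condition names are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Alarm:
    source: str        # hypothetical name of the network component
    condition: str     # hypothetical identifier of the fault condition
    active: bool = True


@dataclass
class AlarmList:
    """Keeps an active alarm list plus a history of cleared alarms."""
    active: Dict[Tuple[str, str], Alarm] = field(default_factory=dict)
    cleared: List[Alarm] = field(default_factory=list)

    def raise_alarm(self, source: str, condition: str) -> Alarm:
        key = (source, condition)
        # An alarm is persistent: a repeated notification for the same
        # fault does not create a second entry.
        if key not in self.active:
            self.active[key] = Alarm(source, condition)
        return self.active[key]

    def clear_alarm(self, source: str, condition: str) -> bool:
        # The alarm leaves the active list only when its condition is resolved.
        alarm = self.active.pop((source, condition), None)
        if alarm is None:
            return False
        alarm.active = False
        self.cleared.append(alarm)
        return True
```

Keying the active list on the source and condition is one simple way to make repeated notifications idempotent; a real system would likely also record timestamps and severity.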

Fault management systems may use complex filtering systems to assign alarms to severity levels. These can range in severity from debug to emergency, as in the syslog protocol.[3] Alternatively, they could use the perceived severity field of the ITU X.733 Alarm Reporting Function, which takes the values cleared, indeterminate, critical, major, minor, or warning. A mapping between these two sets of severities is defined by the IETF in RFC 5674 (Alarms in Syslog). It is considered good practice to send a notification not only when a problem has occurred, but also when it has been resolved. The latter notification would have a severity of cleared.
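
To make the two severity scales concrete, the following Python sketch lists the eight syslog severities with their numeric codes and translates an X.733 perceived severity into one of them. The particular mapping table, and the helper names `to_syslog` and `is_clear_notification`, are assumptions of this sketch rather than a normative table; consult the relevant standards before relying on specific values.

```python
from enum import IntEnum


class SyslogSeverity(IntEnum):
    """Syslog severities; lower numbers are more severe."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7


# Illustrative mapping from X.733 perceived severity to syslog severity.
# This table is an assumption made for the example.
X733_TO_SYSLOG = {
    "critical": SyslogSeverity.ALERT,
    "major": SyslogSeverity.CRITICAL,
    "minor": SyslogSeverity.ERROR,
    "warning": SyslogSeverity.WARNING,
    "indeterminate": SyslogSeverity.NOTICE,
    "cleared": SyslogSeverity.NOTICE,
}


def to_syslog(perceived_severity: str) -> SyslogSeverity:
    """Translate an X.733 perceived severity name (case-insensitive)."""
    return X733_TO_SYSLOG[perceived_severity.lower()]


def is_clear_notification(perceived_severity: str) -> bool:
    """True when the notification reports that a problem has been resolved."""
    return perceived_severity.lower() == "cleared"
```

Because "cleared" has no direct syslog equivalent, a receiver that needs to match a clear against its original alarm has to carry the perceived severity separately instead of relying on the syslog level alone.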

A fault management console allows a network administrator or system operator to monitor events from multiple systems and perform actions based on this information. Ideally, a fault management system should be able to correctly identify events and automatically take action, either by launching a program or script to take corrective action, or by activating notification software that prompts a human to intervene (e.g., by sending an e-mail or an SMS text to a mobile phone). Some notification systems also have escalation rules that notify a chain of individuals based on their availability and the severity of the alarm.
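
The automatic-action and escalation behaviour described above might look roughly like the Python sketch below. Everything named here is hypothetical: the `Contact` fields, the `SEVERITY_RANK` ordering, and the `handle_event` function are assumptions chosen for illustration, and the `notify` callback stands in for whatever e-mail or SMS gateway a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical ordering of alarm severities, least to most severe.
SEVERITY_RANK = {"warning": 1, "minor": 2, "major": 3, "critical": 4}


@dataclass
class Contact:
    name: str
    available: bool
    min_severity: str  # lowest severity this contact should be notified about


def pick_contact(chain: List[Contact], severity: str) -> Optional[Contact]:
    """Walk the escalation chain and return the first suitable contact."""
    rank = SEVERITY_RANK[severity]
    for contact in chain:
        if contact.available and rank >= SEVERITY_RANK[contact.min_severity]:
            return contact
    return None


def handle_event(
    severity: str,
    chain: List[Contact],
    auto_fix: Optional[Callable[[], bool]] = None,
    notify: Callable[[str], None] = print,
) -> str:
    """Try an automatic corrective action first, then escalate to a human."""
    if auto_fix is not None and auto_fix():
        return "corrected"
    contact = pick_contact(chain, severity)
    if contact is None:
        return "unassigned"
    notify(f"{severity} alarm assigned to {contact.name}")
    return f"notified:{contact.name}"
```

Passing `notify` as a parameter keeps the escalation logic separate from the delivery mechanism, so the same rules can drive e-mail, SMS, or a test double.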

Types


There are two primary approaches to fault management: active and passive. Passive fault management collects alarms from devices (normally via SNMP traps) when something happens on those devices. In this mode, the fault management system learns of a problem only if the monitored device is capable of generating an error and reporting it to the management tool. If the device fails completely or locks up, it cannot raise an alarm, and the problem goes undetected. Active fault management addresses this issue by actively monitoring devices with tools such as ping to determine whether each device is active and responding. If a device stops responding, active monitoring raises an alarm showing the device as unavailable, which allows the problem to be corrected proactively.
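
A single cycle of active monitoring can be sketched in Python as follows. The `probe` callable is a stand-in for a reachability check such as ping, and the `threshold` of consecutive failures before an alarm is raised is an assumption of this sketch, added so that one lost reply does not immediately mark a device as unavailable.

```python
from typing import Callable, Dict, Iterable, List


def poll_devices(
    devices: Iterable[str],
    probe: Callable[[str], bool],
    failures: Dict[str, int],
    threshold: int = 3,
) -> List[str]:
    """Run one polling cycle and return devices that just became unavailable.

    `failures` holds the consecutive-failure count per device and is
    updated in place so that state carries over between cycles.
    """
    newly_down = []
    for device in devices:
        if probe(device):
            # The device answered, so reset its failure count.
            failures[device] = 0
            continue
        failures[device] = failures.get(device, 0) + 1
        # Raise the alarm once, at the moment the threshold is reached.
        if failures[device] == threshold:
            newly_down.append(device)
    return newly_down
```

In practice `poll_devices` would be called on a timer, with `probe` wrapping a real reachability test; injecting it as a parameter keeps the detection logic testable without network access.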

Fault management includes any tool or procedure for testing, diagnosing, or repairing the network when a failure occurs.

See also

  * Alarm management
  * Alarm fatigue

Notes

  1. "What is fault management? - Definition from WhatIs.com". http://searchnetworking.techtarget.com/definition/fault-management. Retrieved 2015-10-06.
  2. "What Is Fault Management? A Definition & Introductory Guide". XpoLog Log Analysis, Management & Viewer. 2020-04-07. https://www.xplg.com/what-is-fault-management-2/. Retrieved 2020-11-15.
  3. RFC 3164
