Search results
Results from the WOW.Com Content Network
Fault detection, isolation, and recovery (FDIR) is a subfield of control engineering which concerns itself with monitoring a system, identifying when a fault has occurred, and pinpointing the type of fault and its location. Two approaches can be distinguished: A direct pattern recognition of sensor readings that indicate a fault and an analysis ...
The construction of a failure detector is an essential, but a very difficult problem that occurred in the development of the fault-tolerant component in a distributed computer system. As a result, the failure detector was invented because of the need for detecting errors in the massive information transaction in distributed computing systems.
To support these, a computer system is typically designed so that its watchdog timer will be kicked only if the computer deems the system functional. The computer determines whether the system is functional by conducting one or more fault detection tests and will kick the watchdog only if all tests have passed. [citation needed]
This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on locality, cause, duration, and effect.
In the nominal, i.e. fault-free situation, the lower control loop operates to meet the control goals. The fault-detection (FDI) module monitors the closed-loop system to detect and isolate faults. The fault estimate is passed to the reconfiguration block, which modifies the control loop to reach the control goals in spite of the fault.
An alarm is a persistent indication of a fault that clears only when the triggering condition has been resolved. A current list of problems occurring on the network component is often kept in the form of an active alarm list such as is defined in RFC 3877, the Alarm MIB. A list of cleared faults is also maintained by most network management ...
All error-detection and correction schemes add some redundancy (i.e., some extra data) to a message, which receivers can use to check consistency of the delivered message and to recover data that has been determined to be corrupted.
Serviceability includes various methods of easily diagnosing the system when problems arise. Early detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center (without human intervention) when the system experiences a system fault.