Fault Management is accomplished by means of several processes/sub-processes such as fault detection, fault localisation, fault reporting, fault correction and fault repair. These processes/sub-processes are located over different management layers; however, most of them (such as fault detection, fault localisation and fault correction) are mainly located in the Network Element and Network Element Management layers, since this underlying network infrastructure has 'self healing' capabilities.
It is possible, however, that some faults/problems affecting the telecom services are detected within the "Network and Systems Management" layer, by correlating the alarms/events (originated by different Network Elements) and correlating network data, through network data management.
Network data management logically collects and processes performance and traffic data, as well as usage data.
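As an illustration of correlating alarms originated by different Network Elements, the following is a minimal sketch only. It assumes that each alarm carries an NE identifier, a shared resource name and a timestamp, and that alarms on the same resource from at least two NEs within a time window indicate one network-level problem. The names `Alarm`, `correlate`, `window` and `min_nes` are hypothetical and are not taken from any 3GPP or TOM definition.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    ne_id: str        # hypothetical Network Element identifier
    resource: str     # hypothetical shared resource, e.g. a transmission link
    timestamp: float  # seconds since an arbitrary reference point


def _emit(problems: List[dict], resource: str, cluster: List[Alarm], min_nes: int) -> None:
    """Record a correlated problem if enough distinct NEs reported it."""
    nes = sorted({a.ne_id for a in cluster})
    if len(nes) >= min_nes:
        problems.append(
            {"resource": resource, "nes": nes, "first_seen": cluster[0].timestamp}
        )


def correlate(alarms: List[Alarm], window: float = 30.0, min_nes: int = 2) -> List[dict]:
    """Group alarms per resource and report clusters that fall within
    `window` seconds of the cluster's first alarm (assumed rule)."""
    by_resource: Dict[str, List[Alarm]] = {}
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        by_resource.setdefault(alarm.resource, []).append(alarm)

    problems: List[dict] = []
    for resource, items in by_resource.items():
        cluster = [items[0]]
        for alarm in items[1:]:
            if alarm.timestamp - cluster[0].timestamp <= window:
                cluster.append(alarm)
            else:
                _emit(problems, resource, cluster, min_nes)
                cluster = [alarm]
        _emit(problems, resource, cluster, min_nes)
    return problems


alarms = [
    Alarm("NE-A", "link-1", 0.0),
    Alarm("NE-B", "link-1", 10.0),
    Alarm("NE-C", "link-2", 5.0),
    Alarm("NE-A", "link-1", 100.0),
]
result = correlate(alarms)
```

In this example only the two alarms on `link-1` raised within the same window by `NE-A` and `NE-B` are reported as one correlated problem. A real correlation function would typically also use topology and performance data.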
While the Fault Management triggered within the Network Element and NE management layers is primarily reactive, the Fault Management triggered within the Network and Systems Management layer is primarily proactive, meaning that it is triggered by automation rather than by the customer. This is important for improving service quality and customer perception of service, and for lowering costs.
Focusing on the Network and Systems Management layer, when a fault/problem is detected, no matter where and how, several processes are involved, as described in Figure 6.
Figure 6, taken from the Telecom Operations Map [100], shows an example of how Fault Management data can be used to drive an operator's service assurance process. Service assurance then becomes primarily proactive, i.e. triggered by automation rather than by the customer. It is argued that this approach is crucial to improving service quality and customer perception of service, and to lowering costs.
The TOM assurance activities (and their associated interfaces) shown in Figure 6 can be associated with ITU-T TMN service components from TS 32.111-series [3] according to Table 1:
The TOM assurance example shown in Figure 6 also recognises that Performance Management data can be used to detect network problems.
The TOM assurance example also adds some detail to the Service Management Layer by showing how activities such as determining and monitoring Service Level Agreements (SLAs) and trouble ticket reporting are interfaced to the Network Management layer.
A 3GPP system is composed of a multitude of Network Elements (NEs) of various types and, typically, from different vendors, which inter-operate in a co-ordinated manner in order to satisfy the network users' communication requirements.
The occurrence of failures in an NE may cause a deterioration of this NE's function and/or service quality and will, in severe cases, lead to the complete unavailability of the respective NE. In order to minimise the effects of such failures on the Quality of Service (QoS) as perceived by the network users, it is necessary to:
- detect failures in the network as soon as they occur and alert the operating personnel as fast as possible;
- isolate the failures (autonomously or through operator intervention), i.e. switch off faulty units and, if applicable, limit the effect of the failure as much as possible by reconfiguration of the faulty NE/adjacent NEs;
- if necessary, determine the cause of the failure using diagnosis and test routines; and
- repair/eliminate failures in due time through the application of maintenance procedures.
This aspect of the management environment is termed "Fault Management" (FM). The purpose of FM is to detect failures as soon as they occur and to limit their effects on the network Quality of Service (QoS) as far as possible.
The latter is achieved by bringing additional/redundant equipment into operation, reconfiguring existing equipment/NEs, or by repairing/eliminating the cause of the failure.
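The sequence of detecting, isolating, optionally diagnosing and repairing a failure can be sketched as a small state model. This is an illustrative sketch only: the strict ordering of stages, and the names `Stage`, `ALLOWED` and `Fault`, are assumptions made for the example and are not defined by this specification.

```python
from enum import Enum


class Stage(Enum):
    DETECTED = "detected"
    ISOLATED = "isolated"
    DIAGNOSED = "diagnosed"
    REPAIRED = "repaired"


# Assumed transitions: diagnosis is only performed "if necessary",
# so an isolated fault may go directly to repair.
ALLOWED = {
    Stage.DETECTED: {Stage.ISOLATED},
    Stage.ISOLATED: {Stage.DIAGNOSED, Stage.REPAIRED},
    Stage.DIAGNOSED: {Stage.REPAIRED},
    Stage.REPAIRED: set(),
}


class Fault:
    """Hypothetical record of one failure in one NE."""

    def __init__(self, ne_id: str) -> None:
        self.ne_id = ne_id
        self.stage = Stage.DETECTED
        self.history = [Stage.DETECTED]

    def advance(self, next_stage: Stage) -> None:
        """Move to the next stage, rejecting transitions outside ALLOWED."""
        if next_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot go from {self.stage.value} to {next_stage.value}"
            )
        self.stage = next_stage
        self.history.append(next_stage)


fault = Fault("NE-1")
fault.advance(Stage.ISOLATED)   # e.g. faulty unit switched off
fault.advance(Stage.REPAIRED)   # diagnosis was not necessary in this case
```

Attempting to repair a fault that has not yet been isolated raises `ValueError` in this sketch; a real system may permit other orderings.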
Fault Management (FM) encompasses all of the above functionalities except commissioning/decommissioning of NEs and potential operator-triggered reconfiguration (these are a matter of Configuration Management (CM), cf. TS 32.600).
FM also includes associated features in the Operations System (OS), such as the administration of a pending alarms list, the presentation of operational state information of physical and logical devices/resources/functions, and the provision and analysis of the alarm and state history of the network.
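The administration of a pending alarms list together with an alarm history can be sketched as follows. This is a minimal sketch under stated assumptions: the class `AlarmLog`, its methods and the alarm fields (`resource`, `severity`) are hypothetical and do not reproduce the alarm information model of TS 32.111-series [3].

```python
from typing import Dict, List, Tuple


class AlarmLog:
    """Hypothetical pending alarms list with a simple event history."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}
        self.history: List[Tuple[str, str]] = []

    def raise_alarm(self, alarm_id: str, resource: str, severity: str) -> None:
        """Add (or update) a pending alarm and record the event."""
        self._pending[alarm_id] = {"resource": resource, "severity": severity}
        self.history.append(("raised", alarm_id))

    def clear_alarm(self, alarm_id: str) -> None:
        """Remove an alarm from the pending list; unknown ids are ignored."""
        if self._pending.pop(alarm_id, None) is not None:
            self.history.append(("cleared", alarm_id))

    def pending(self) -> List[str]:
        """Return the identifiers of all alarms that are still pending."""
        return sorted(self._pending)


log = AlarmLog()
log.raise_alarm("a1", "NE-A/board-3", "major")
log.raise_alarm("a2", "NE-B/link-1", "critical")
log.clear_alarm("a1")
```

After these calls only `a2` remains pending, while the history retains all three events for later analysis.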
Fault management is further specified in TS 32.111-series [3].