Any evaluation of the health status of the NEs and of the overall network requires the detection of faults in the network and, consequently, the notification of alarms to the OS (EM and/or NM). Depending on the nature of the fault, it may be accompanied by a change of the operational state of the logical and/or physical resource(s) affected by the fault. The detection and notification of these state changes is as essential as that of the alarms. A list of active alarms in the network and operational state information, as well as alarm/state history data, are required by the system operator for further analysis. Additionally, test procedures can be used to obtain more detailed information if necessary, or to verify an alarm or state or the proper operation of NEs and their logical and physical resources.
The following clauses explain the detection of faults, the handling of alarms and state changes and the execution of tests.
Only those requirements covered by clause 5 and related IRPs shall be considered as valid requirements for compliance with the standard defined by the present document.
Faults that may occur in the network can be grouped into one of the following categories:
-
Hardware failures, i.e. the malfunction of some physical resource within a NE.
-
Software problems, e.g. software bugs, database inconsistencies.
-
Functional faults, i.e. a failure of some functional resource in a NE for which no hardware component can be identified as responsible for the problem.
-
Loss of some or all of the NE's specified capability due to overload situations.
-
Communication failures between two NEs, or between NE and OS, or between two OSs.
In any case, as a consequence of faults, appropriate alarms related to the physical or logical resource(s) affected by the fault(s) shall be generated by the network entities.
The following clauses focus on the aspects of fault detection, alarm generation and storage, fault recovery and retrieval of stored alarm information.
When any of the fault types described above occurs within a 3GPP system, the affected network entities shall be able to detect it immediately.
The network entities accomplish this task using autonomous self-check circuits/procedures, including, in the case of NEs, the observation of measurements, counters and thresholds. The threshold measurements may be predefined by the manufacturer and executed autonomously in the NE, or they may be based on performance measurements administered by the EM, cf. [4]. The fault detection mechanism as defined above shall cover both active and standby components of the network entities.
The majority of the faults should have well-defined conditions for the declaration of their presence or absence, i.e. fault occurrence and fault clearing conditions. Such faults shall be referred to in the present document as ADAC (Automatically Detected and Automatically Cleared) faults. The network entities should be able to recognize when a previously detected ADAC fault is no longer present, i.e. the clearing of the fault, using techniques similar to those used to detect the occurrence of the fault. For some faults, no clearing condition exists. For the purpose of the present document, these faults shall be referred to as ADMC (Automatically Detected and Manually Cleared) faults. An example of this is when the network entity has to restart a software process due to some inconsistencies, and normal operation can be resumed afterwards. In this case, although the inconsistencies are cleared, the cause of the problem is not yet corrected. Manual intervention by the system operator shall always be necessary to clear ADMC faults since these, by definition, cannot be cleared by the network entity itself.
For some faults there is no need for any short-term action, neither from the system operator nor from the network entity itself, since the fault condition lasted for a short period of time only and then disappeared. An example of this is when a NE detects the crossing of some observed threshold, and in the next sampling interval, the observed value stays within its limits.
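As a purely illustrative, non-normative sketch of how a NE-internal monitor might declare and clear a threshold-based ADAC fault, the following Python fragment uses hysteresis between a raise threshold and a clear threshold; all class names and threshold values are hypothetical and not defined by the present document.

	# Illustrative sketch (not normative): declaring and clearing a
	# threshold-based ADAC fault. All names and values are hypothetical.
	class ThresholdMonitor:
	    def __init__(self, raise_threshold, clear_threshold):
	        # clear_threshold < raise_threshold gives hysteresis, so that a
	        # value oscillating around the limit does not toggle the fault.
	        self.raise_threshold = raise_threshold
	        self.clear_threshold = clear_threshold
	        self.fault_active = False

	    def sample(self, value):
	        """Return 'occurred', 'cleared' or None for one sampling interval."""
	        if not self.fault_active and value > self.raise_threshold:
	            self.fault_active = True
	            return "occurred"   # NE raises an alarm for this ADAC fault
	        if self.fault_active and value < self.clear_threshold:
	            self.fault_active = False
	            return "cleared"    # NE clears the alarm autonomously
	        return None

	monitor = ThresholdMonitor(raise_threshold=90.0, clear_threshold=80.0)
	for load in (85.0, 95.0, 92.0, 75.0):
	    event = monitor.sample(load)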
For each fault, the fault detection process shall supply the following information:
-
the device/resource/file/functionality/smallest replaceable unit as follows:
-
for hardware faults, the smallest replaceable unit that is faulty;
-
for software faults, the affected software component, e.g. corrupted file(s) or databases or software code;
-
for functional faults, the affected functionality;
-
for faults caused by overload, information on the reason for the overload;
-
for all the above faults, wherever applicable, an indication of the physical and logical resources that are affected by the fault and, if applicable, a description of the loss of capability of the affected resource;
-
the type of the fault (communication, environmental, equipment, processing error, QoS) according to ITU-T Recommendation X.733 [9];
-
the severity of the fault (indeterminate, warning, minor, major, critical), as defined in ITU-T Recommendation X.733 [9];
-
the probable cause of the fault;
-
the time at which the fault was detected in the faulty network entity;
-
the nature of the fault, e.g. ADAC or ADMC;
-
any other information that helps to understand the cause and the location of the abnormal situation (system/implementation specific).
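Purely as an illustrative, non-normative sketch of the information listed above, a fault record could be represented in software as follows; the field names and example values are hypothetical and not defined by the present document.

	# Illustrative sketch (not normative): a possible in-memory representation
	# of the information supplied by the fault detection process.
	from dataclasses import dataclass, field
	from datetime import datetime
	from typing import Optional

	@dataclass
	class FaultRecord:
	    affected_unit: str            # smallest replaceable unit, file, functionality, ...
	    event_type: str               # communication, environmental, equipment, processing error, QoS
	    severity: str                 # indeterminate, warning, minor, major, critical
	    probable_cause: str
	    detection_time: datetime
	    nature: str                   # "ADAC" or "ADMC"
	    affected_resources: list = field(default_factory=list)
	    additional_info: Optional[str] = None   # system/implementation specific

	fault = FaultRecord(
	    affected_unit="board-7/port-2",
	    event_type="equipment",
	    severity="major",
	    probable_cause="replaceable unit failure",
	    detection_time=datetime.now(),
	    nature="ADAC",
	    affected_resources=["physical: board-7", "logical: cell-1234"],
	)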
For some faults, additional means, such as test and diagnosis features, may be necessary in order to obtain the required level of detail. See
clause 4.3 for details.
For each detected fault, appropriate alarms shall be generated by the faulty network entity, regardless of whether it is an ADAC or an ADMC fault. Such alarms shall contain all the information provided by the fault detection process as described in
clause 4.1.1.
Examples of criteria for setting the alarm severity to
"critical" are [18]:
-
Total disturbance of the system or significant service impact for customers
-
Performance, capacity, throughput restrictions
-
Accounting disturbed
Examples of criteria for setting the alarm severity to
"major" are [18]:
-
Outage of a redundant component (e.g. outage of a redundant power supply)
-
Introduction of remedial actions required to ensure the service availability
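As a non-normative illustration of how criteria of this kind might be applied when setting the alarm severity, the following Python sketch encodes the examples above as simple predicates; the rules and parameter names are hypothetical and do not constitute a mapping defined by [18].

	# Illustrative sketch (not normative): mapping example fault conditions to
	# an alarm severity along the lines of the criteria above.
	def classify_severity(total_system_disturbance, significant_customer_impact,
	                      accounting_disturbed, redundant_component_lost):
	    if total_system_disturbance or significant_customer_impact or accounting_disturbed:
	        return "critical"
	    if redundant_component_lost:
	        return "major"
	    return "minor"

	severity = classify_severity(False, False, False, redundant_component_lost=True)  # -> "major"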
In order to ease fault localization and repair, the faulty network entity should generate one single alarm for each single fault, even in the case where a single fault causes a degradation of the operational capabilities of more than one physical or logical resource within the network entity. An example of this is a hardware fault which affects not only a physical resource but also degrades the logical resource(s) that this hardware supports. In this case the network entity should generate one single alarm for the faulty resource (i.e. the resource which needs to be repaired) and a number of events related to state management (cf.
clause 4.2) for all the physical/logical resources affected by the fault, including the faulty one itself.
In case a network entity is not able to recognize that a single fault manifests itself in different ways, the single fault is detected as multiple faults and originates multiple alarms. In this case, however, when the fault is repaired, the network entity should be able to detect the repair of all the multiple faults and clear the related multiple alarms.
When a fault occurs on the connection media between two NEs or between a NE and an OS, and affects the communication capability between such NE/OS, each affected NE/OS shall detect the fault as described in
clause 4.1.1 and generate its own associated communication alarm toward the managing OS. In this case it is the responsibility of the OS to correlate alarms received from different NEs/OSs and localize the fault in the best possible way.
Within each NE, all alarms generated by that NE shall be input into a list of active alarms. The NEs shall be able to provide such a list of active alarms to the OS when requested.
Alarms raised as a consequence of faults need to be cleared. To clear an alarm, it is necessary to repair the corresponding fault.
Alarm maintenance manuals must contain a clear repair action for each specific malfunction. The repair action shall also be populated in the corresponding alarm field (see [18]).
Wherever possible, event-based automated repair actions to solve standard error situations without manual interaction should be implemented, if not already implemented on the Network Element level (see [18]).
The procedures to repair faults are implementation dependent and therefore out of the scope of the present document. However, in general:
-
the equipment faults are repaired by replacing the faulty units with working ones;
-
the software faults are repaired by means of partial or global system initializations, by means of software patches or by means of updated software loads;
-
the communication faults are repaired by replacing the faulty transmission equipment or, in case of excessive noise, by removing the cause of the noise;
-
the QoS faults are repaired either by removing the causes that degraded the QoS or by improving the capability of the system to react against the causes that could result in a degradation of the QoS;
-
the environmental faults (high temperature, high humidity, etc.) are repaired by solving the corresponding environmental problem.
It is also possible that an ADAC fault is spontaneously repaired, without the intervention of the operator (e.g. a threshold crossed fault). In this case the NE behaves as for the ADAC faults repaired by the operator.
In principle, the NE uses the same mechanisms to detect that a fault has been repaired, as for the detection of the occurrence of the fault. However, for ADMC faults, manual intervention by the operator is always necessary to clear the fault. Practically, various methods exist for the system to detect that a fault has been repaired and clear alarms and the faults that triggered them. For example:
-
The system operator implicitly requests the NE to clear a fault, e.g. by initializing a new device that replaces a faulty one. Once the new device has been successfully put into service, the NE shall clear the fault(s). Consequently, the NE shall clear all related alarms.
-
The system operator explicitly requests the clearing of one or more alarms. Once the alarm(s) has/have been cleared, the fault management system (within EM and/or NE) should reissue those alarms (as new alarms) in case the fault situation still persists.
-
The NE detects the replacement of a faulty device with a new one and initializes it autonomously. Once the new device has been successfully put into service, the NE shall clear the fault(s). Consequently, the NE shall clear all related alarms.
-
The NE detects that a previously reported threshold crossed alarm is no longer valid. It shall then clear the corresponding active alarm and the associated fault, without requiring any operator intervention. The details for the administration of thresholds and the exact condition for the NE to clear a threshold crossed alarm are implementation specific and depend on the definition of the threshold measurement, see also subclause 4.1.1.
-
ADMC faults/alarms can, by definition, not be cleared by the NE autonomously. Therefore, in any case, system operator functions shall be available to request the clearing of ADMC alarms/faults in the NE. Once an ADMC alarm/fault has been cleared, the NE shall clear the associated ADMC fault/alarm.
Details of these mechanisms are system/implementation specific.
Each time an alarm is cleared the NE shall generate an appropriate clear alarm event. A clear alarm is defined as an alarm, as specified in
clause 3.1, except that its severity is set to
"cleared". The relationship between the clear alarm and the active alarm is established:
-
by re-using a set of parameters that uniquely identify the active alarm (see clause 4.1.1); or
-
by including a reference to the active alarm in the clear alarm.
When a clear alarm is generated the corresponding active alarm is removed from the active alarm list.
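A minimal, non-normative sketch of this behaviour is given below; it assumes the active alarm list is keyed by a set of parameters that uniquely identify the alarm, and all field names are hypothetical.

	# Illustrative sketch (not normative): correlating a clear alarm with the
	# corresponding active alarm and removing it from the active alarm list.
	active_alarms = {}   # key: identifying parameters, value: alarm record

	def alarm_key(alarm):
	    # Hypothetical set of parameters that uniquely identifies an alarm.
	    return (alarm["managed_object"], alarm["event_type"],
	            alarm["probable_cause"], alarm["specific_problem"])

	def on_alarm(alarm):
	    if alarm["severity"] == "cleared":
	        # Clear alarm: remove the matching active alarm, if present.
	        active_alarms.pop(alarm_key(alarm), None)
	    else:
	        active_alarms[alarm_key(alarm)] = alarm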
As soon as an alarm is entered into or removed from the active alarms list, an alarm notification shall be forwarded by the NE in the form of an unsolicited notification.
If forwarding is not possible at that time, e.g. due to a communication breakdown, the notifications shall be sent as soon as the communication capability has been restored. The storage space for such delayed notifications is limited; the storage capacity is operator and implementation dependent. If the number of delayed notifications exceeds the storage space, an alarm synchronization procedure shall be run when the communication capability has been restored.
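The following non-normative Python sketch illustrates this forwarding behaviour; the buffer size, function names and the send/synchronization hooks are hypothetical and would be operator and implementation dependent.

	# Illustrative sketch (not normative): unsolicited forwarding of alarm
	# notifications with a bounded store for periods of communication breakdown.
	from collections import deque

	MAX_DELAYED = 1000                 # operator/implementation dependent
	delayed = deque(maxlen=MAX_DELAYED)
	overflowed = False                 # set when delayed notifications were lost

	def forward(notification, link_up, send):
	    global overflowed
	    if link_up:
	        send(notification)
	    else:
	        if len(delayed) == MAX_DELAYED:
	            overflowed = True      # alarm synchronization will be needed
	        delayed.append(notification)

	def on_link_restored(send, run_alarm_synchronization):
	    global overflowed
	    while delayed:
	        send(delayed.popleft())
	    if overflowed:
	        run_alarm_synchronization()
	        overflowed = False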
The OS shall detect the communication failures that prevent the reception of alarms and raise an appropriate alarm to the operator.
If the Itf-N is implemented in the NE, then the destination of the notifications is the NM, and the interface shall comply with the stipulations made in
clause 5. If the Itf-N resides in the EM, proprietary means may be employed to forward the notifications to the EM. Note that, even if the Itf-N is implemented in the NE, the EM may still also receive the notifications by one of the above mechanisms. However, the present document does not explicitly require the NEs to support the EM as a second destination.
The event report shall include all information defined for the respective event (see
clauses 4.1.1, 4.1.2 and 4.1.3), plus an identification of the NE that generated the report.
The system operator shall be able to allow or suppress alarm reporting for each NE. As a minimum, the following criteria shall be supported for alarm filtering:
-
the NE that generated the alarm, i.e. all alarm messages for that NE shall be suppressed;
-
the device/resource/function to which the alarm relates;
-
the severity of the alarm;
-
the time at which the alarm was detected, i.e. the alarm time; and,
-
any combination of the above criteria.
The result of any command to modify the forwarding criteria shall be confirmed by the NE to the requesting operator.
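A non-normative sketch of how the filtering criteria listed above might be combined is shown below; the field names and parameter structure are hypothetical.

	# Illustrative sketch (not normative): suppressing alarm forwarding based on
	# the filtering criteria listed above. Any combination of criteria may be set.
	def passes_filter(alarm, suppressed_nes=(), suppressed_resources=(),
	                  suppressed_severities=(), suppressed_time_window=None):
	    if alarm["ne"] in suppressed_nes:
	        return False
	    if alarm["resource"] in suppressed_resources:
	        return False
	    if alarm["severity"] in suppressed_severities:
	        return False
	    if suppressed_time_window and \
	            suppressed_time_window[0] <= alarm["time"] <= suppressed_time_window[1]:
	        return False
	    return True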
For Fault Management (FM) purposes, each NE shall store and retain the following information:
-
a list of all active alarms, i.e. all alarms that have not yet been cleared; and
-
alarm history information, i.e. all notifications related to the occurrence and clearing of alarms.
It shall be possible to apply filters when active alarm information is retrieved by the Manager and when the history information is stored by the NE and retrieved by the Manager.
The storage space for alarm history in the NE is limited. Therefore it shall be organized as a circular buffer, i.e. the oldest data item(s) shall be overwritten by new data if the buffer is full. Further
"buffer full" behaviours, e.g. those defined in ITU-T Recommendation X.735 [11], may be implemented as an option. The storage capacity itself, and thus the duration, for which the data can be retained, shall be Operator and implementation dependent.
After a fault has been detected and the replaceable faulty units have been identified, some management functions are necessary in order to perform system recovery and/or restoration, either automatically by the NE and/or the EM, or manually by the operator.
The fault recovery functions are used in various phases of the Fault Management (FM):
-
Once a fault has been detected, the NE shall be able to evaluate the effect of the fault on the telecommunication services and autonomously take recovery actions in order to minimize service degradation or disruption.
-
Once the faulty unit(s) has (have) been replaced or repaired, it shall be possible from the EM to put the previously faulty unit(s) back into service so that normal operation is restored. This transition should be done in such a way that the currently provided telecommunication services are not, or only minimally, disturbed.
-
At any time the NE shall be able to perform recovery actions if requested by the operator. The operator may have several reasons to require such actions; e.g. he has deduced a faulty condition by analysing and correlating alarm reports, or he wants to verify that the NE is capable of performing the recovery actions (proactive maintenance).
The recovery actions that the NE performs (autonomously or on demand) in case of faults depend on the nature and severity of the faults, on the hardware and software capabilities of the NE and on the current configuration of the NE.
Faults are distinguished in two categories: software faults and hardware faults. In the case of software faults, depending on the severity of the fault, the recovery actions may be system initializations (at different levels), activation of a backup software load, activation of a fallback software load, download of a software unit etc. In the case of hardware faults, the recovery actions depend on the existence and type of redundant (i.e. back-up) resources. Redundancy of some resources may be provided in the NE in order to achieve fault tolerance and to improve system availability.
If the faulty resource has no redundancy, the recovery actions shall be:
a)	Isolate and remove from service the faulty resource so that it cannot disturb other working resources;
b)	Remove from service the physical and functional resources (if any) which are dependent on the faulty one. This prevents the propagation of the fault effects to other fault-free resources;
c)	Perform state management related activities for the faulty resource and the other affected/dependent resources, cf. clause 4.2;
d)	Generate and forward appropriate notifications to inform the OS about all the changes performed.
If the faulty resource has redundancy, the NE shall perform actions a), c) and d) above and, in addition, the recovery sequence that is specific to that type of redundancy. Several types of redundancy exist (e.g. hot standby, cold standby, duplex, symmetric/asymmetric, N plus one or N plus K redundancy, etc.), and for each one there is a specific sequence of actions to be performed in case of failure. The present document specifies the Fault Management aspects of the redundancies, but it does not define the specific recovery sequences of the redundancy types.
In the case of a failure of a resource providing service, the recovery sequence shall start immediately. During the changeover, a temporary and limited loss of service is acceptable. In the case of a changeover triggered by a management command, the NE should perform the changeover without degradation of the telecommunication services.
The detailed definition of the management of the redundancies is out of the scope of the present document. If a fault causes the interruption of ongoing calls, then the interrupted calls shall be cleared, i.e. all resources allocated to these calls shall immediately be released by the system.
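As a purely illustrative, non-normative sketch of the recovery logic described above, the following Python fragment selects recovery actions depending on whether a redundant unit exists; the redundancy model, function and resource names are hypothetical.

	# Illustrative sketch (not normative): selecting recovery actions for a
	# faulty resource depending on whether a redundant (standby) unit exists.
	def recover(resource, has_redundancy, standby=None):
	    actions = []
	    actions.append(f"isolate and remove from service: {resource}")                 # a)
	    if not has_redundancy:
	        actions.append(f"remove dependent resources of {resource} from service")   # b)
	    actions.append(f"update states of {resource} and affected resources")          # c)
	    actions.append("notify OS of all changes")                                     # d)
	    if has_redundancy and standby:
	        actions.append(f"changeover to standby unit: {standby}")  # redundancy-type specific
	    return actions

	recover("board-7", has_redundancy=True, standby="board-8")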
It shall be possible to configure the alarm actions, thresholds and severities by means of commands, according to the following requirements:
-
the operator shall be able to configure any threshold that determines the declaration or clearing of a fault. If a series of thresholds are defined to generate alarms of various severities, then for each alarm severity the threshold values shall be configurable individually.
-
it shall be possible to modify the severity of alarms defined in the system, e.g. from major to critical. This capability should be implemented in the Manager; however, in case it is implemented in the NE, the alarms forwarded by the NE to the OS and the alarms displayed on the local MMI shall have the same severity.
The NE shall confirm such alarm configuration commands and shall notify the results to the requesting system operator.
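A non-normative sketch of such configuration data is given below; the measurement names, threshold values and override rules are hypothetical examples of what an operator command might set.

	# Illustrative sketch (not normative): per-severity threshold configuration
	# and severity reassignment, modifiable by operator commands.
	threshold_config = {
	    # measurement: {alarm severity: threshold value}
	    "cpu_load_percent": {"minor": 70, "major": 85, "critical": 95},
	}

	severity_overrides = {
	    # (probable cause, original severity) -> severity to be used instead
	    ("link degraded", "major"): "critical",
	}

	def effective_severity(probable_cause, reported_severity):
	    return severity_overrides.get((probable_cause, reported_severity), reported_severity)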
A single network fault may result in the generation of multiple alarms and events from affected entities over time and spread over a wide geographical area. If possible, the OS should indicate which alarms and events are correlated to each other.
Alarms may be correlated in view of certain rules such as alarm propagation path, specific geographical area, specific equipment, or repeated alarms from the same source. The alarms are partitioned into sets where alarms within one correlated set have a high probability of being caused by the same network fault. A correlated set may also contain events. These events are considered having a high probability of being related to the same network fault.
The correlation describes relations between network events (e.g. current alarms as those captured in AlarmList, historical alarms as those captured in NotificationLog, network configuration changes).
For a set of correlated alarms, one alarm may relate to the fault which is the root cause of all the correlated alarms and events. If possible, the OS should perform a Root Cause Analysis to identify and indicate the Root Cause Alarm.
Root Cause Analysis is a process that can determine and identify the network condition (e.g. fault, misconfiguration) causing the alarms. The determination may be based on the following (for example):
-
Information carried in alarm(s);
-
Information carried in correlated alarm sets;
-
Information carried in network notifications;
-
Network configuration information;
-
Operators' network management experience.
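As a non-normative illustration of partitioning alarms into correlated sets, the following Python sketch applies one simple hypothetical rule (same source entity within a short time window); real correlation rules, as noted above, may also use propagation paths, geographical areas or repeated alarms.

	# Illustrative sketch (not normative): partitioning alarms into correlated
	# sets using a simple rule (same source entity within a short time window).
	CORRELATION_WINDOW_SECONDS = 60     # hypothetical value

	def correlate(alarms):
	    """alarms: list of dicts with 'source' and 'time' (seconds), sorted by time."""
	    sets, current = [], []
	    for alarm in alarms:
	        if current and (alarm["source"] != current[-1]["source"]
	                        or alarm["time"] - current[-1]["time"] > CORRELATION_WINDOW_SECONDS):
	            sets.append(current)
	            current = []
	        current.append(alarm)
	    if current:
	        sets.append(current)
	    return sets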
The alarm severities set by the network elements (NEs) in a mobile system, visible across the Itf-N, are basically resource focused (e.g. severity is set to major if the available capacity of a NE is low). Vast amounts of alarms classified as critical are potentially sent to the operator's management centers, but they are rarely critical from the overall business perspective. They may not even be critical from the aspect of time to respond.
An operator's view can obviously be very different from the alarm severity defined by the NEs' resource focused views.
Operators need to enrich this information, i.e. the NE's resource-focused view, for the purpose of the alarm management processes, see ref. [23].
Figure 4.1.10 introduces the concept of Managed Alarm, the management representation of the alarm in the NM domain (above Itf-N).
Within the NM, the received resource alarms are transformed so that the alarm severity is no longer resource focused but service impact focused, to prevent or mitigate network and service outage and degradation. The transformation applies to all severity levels.
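A non-normative sketch of such a transformation in the NM domain is shown below; the service-impact rules, parameter names and threshold are hypothetical.

	# Illustrative sketch (not normative): transforming a resource-focused
	# severity into a service-impact-focused severity in the NM domain.
	def to_managed_severity(resource_severity, affected_subscribers, service_redundant):
	    if affected_subscribers > 100000 and not service_redundant:
	        return "critical"                    # real risk of service outage
	    if resource_severity == "critical" and service_redundant:
	        return "minor"                       # resource critical, service protected
	    return resource_severity

	to_managed_severity("critical", affected_subscribers=50, service_redundant=True)  # -> "minor"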
A particularly important class of managed alarm is the Highly Managed Alarm (HMA) class, introduced by the ANSI/ISA standard in ref. [23].
These HMAs are the most critical alarms, catastrophic from an operations, security, business or any other top-level point of view. These HMAs should receive special treatment, particularly when it comes to viewing their status in the Human-Machine Interface (HMI). These are the alarms that shall never be allowed to be delayed or lost and must always be given the highest attention.
Considerably higher levels of administrative requirements are applicable for the HMAs. Companies following this standard need to fulfil detailed documentation obligations and a multitude of special administrative requirements in a precise way.
These include:
-
Specific shelving requirements, such as access control with audit trail;
-
Specific "Out of Service" alarm requirements, such as interim protection, access control, and audit trail;
-
Mandatory initial and refresher training with specific content and documentation;
-
Mandatory initial and periodic testing with specific documentation;
-
Mandatory training around maintenance requirements with specific documentation;
-
Mandatory audit requirements.
The HMA classes are also subject to special requirements for operator training, frequency of testing, and archiving of alarm records for proof of regulatory compliance.
Millions of mobile customers are from time to time affected by major failures in the infrastructure of mobile systems. Service assurance management of the continuously increasing complexity of our mobile systems could benefit from concepts like HMAs. The most critical equipment should be identified and secured. The HMAs should be treated in the most thoughtful way. The HMAs should never be hidden, delayed etc. in e.g. alarm flooding.
Setup of HMAs, within the scope and responsibility of the NM, will include many of the processes identified in the alarm management lifecycle, see ref. [23].
This management function provides capabilities that can be used in different phases of the Fault Management (FM). For example:
-
when a fault has been detected and if the information provided through the alarm report is not sufficient to localize the faulty resource, tests can be executed to better localize the fault;
-
during normal operation of the NE, tests can be executed for the purpose of detecting faults;
-
once a faulty resource has been repaired or replaced, before it is restored to service, tests can be executed on that resource to be sure that it is fault free.
However, regardless of the context in which testing is used, its target is always the same: to verify whether a physical or functional resource of the system performs properly and, in case it happens to be faulty, to provide all the information needed to help the operator localize and correct the fault.
Testing is an activity that involves the operator, the managing system (the OS) and the managed system (the NE). Generally the operator requests the execution of tests from the OS and the managed NE autonomously executes the tests without any further support from the operator.
In some cases, the operator may request that only a test bed is set up (e.g. establish special internal connections, provide access test points, etc.). The operator can then perform the real tests, which may require some manual support to handle external test equipment. Since the
"local maintenance" and the
"inter NE testing" are out of the scope of the present document, this aspect of the testing is not treated any further.
The requirements for the test management service are based on ITU-T Recommendation X.745 [12], where the testing description and definitions are specified.
A 3GPP system is composed of a multitude of network elements of various types and with a variety of complexity. The purpose of FM is to detect failures as soon as they occur and to limit their effects on the network Quality of Service (QoS) as far as possible.
Alarm Surveillance of the network is the first line Network Management Assurance Activity and is often maintained in near real time. The very essence of the surveillance functionality is to alert the operating personnel when failures appear in the networks. This is emphasized by the following sentence from
TS 32.101, clause 7.5.2 Standardisation objectives:
"In order to minimise the effects of such failures on the QoS as perceived by the network users it is necessary to detect failures in the network as soon as they occur and alert the operating personnel as fast as possible;".
The operating personnel are confronted with most of the alarm notifications. It is of significant importance that the alarms are of operational relevance; otherwise, valuable time and resources will be spent identifying the irrelevant alarms.
Operator response to an alarm may consist of many different steps such as:
-
Recognizing the alarm;
-
Acknowledging the alarm;
-
Verifying that the alarm is valid and not a malfunction;
-
Getting enhanced information related to the alarm;
-
Analysing the situation in order to try to determine the cause of the alarm, potential service impact and decide upon actions on the alarm. This may include reporting/activating other people from the second line support;
-
Taking actions which may include activating reset of network elements, replacing the faulty equipment, creating trouble reports, etc;
-
Continuing the surveillance of the network element(s) to ensure the fault correction.
The alarm notifications are basically a human-machine interface, and a common expectation is that operators should never miss alarms requiring an operator action. To fulfil such an expectation, the goal is to monitor only the necessary alarms at the right time by extracting the relevant ones.
The key criterion is that alarms must require an operator response - that is, an action.
The expectation of alarm handling includes the following:
-
Few alarms;
-
Alarms are clearly prioritized and presented to the operator;
-
Each alarm requires an action;
-
Each action is taken by the operator;
-
Alarm suppression methods aid the operator in handling alarm flooding, so that saturation of the alarm management systems does not happen and control of the network is never lost.
The precondition for efficiently handling the potentially vast amount of alarms in a mobile system is that alarms exist solely as a tool for the benefit of the operator, see
clause 4.4. They are not to be configured as a miscellaneous recording tool or for the prime benefit of maintenance personnel.
The information carried in the alarm message should also be good enough to ultimately feed, and partly enable, automatic correlation engines. However, alarm response is still not an automated process involving deterministic machines; it is a complex human cognitive process involving thought and analysis. The human factors involved in alarm response are subject to many variables. The quality of the alarm notifications is of fundamental importance for the efficient management of a mobile system.
The key to securing the quality of the information presented to the operator is to present alarm notifications of high operational relevance, in a timely fashion. If, e.g., secondary logs, status or performance data are provided, it must be possible to easily separate those from the alarms.
Some of the characteristics that an alarm should have are summarized below:
-
Relevance: i.e. not spurious or of low operational value;
-
Uniqueness: i.e. not duplicating another alarm;
-
Timeliness: i.e. not long before any response is needed or too late to do anything;
-
Importance: i.e. indicating the importance that the operator deals with the problem;
-
Explicability: i.e. having a message which is clear and easy to understand;
-
Recognizance: i.e. identifying the problem that has occurred;
-
Guidance: i.e. indicative of the action to be taken;
-
Prioritization: i.e. drawing attention to the most important issues.