The solution consists of the basic solution and the following optional solution components:
- Dedicated perceived severity change notification
- Acknowledging alarms by MnS consumers
- Commenting alarms by MnS consumers
- Alarm correlation
- Reliability of alarm lists
Usage | Operations and notifications | NRM |
FM basic | notifyNewAlarm, notifyChangedAlarmGeneral, notifyClearedAlarm | AlarmList |
Dedicated perceived severity change notification | notifyChangedAlarm | |
Acknowledging alarms by MnS consumers | notifyAckStateChanged | alarmRecord.ackTime, alarmRecord.ackUserId, alarmRecord.ackSystemId, alarmRecord.ackState |
Commenting alarms by MnS consumers | notifyComments | alarmRecord:comments, datatype:alarmComment |
Alarm correlation | notifyCorrelatedNotificationChanged | alarmRecord:correlatedNotifications, alarmRecord:rootCauseIndicator |
Reliability of alarm lists | notifyPotentialFaultyAlarmList, notifyAlarmListRebuilt | AlarmList.unreliableAlarmScope |
The solution for Fault Management is based on the model driven approach.
NRM data is written to control the behaviour of the fault management.
Data provided to the fault management consumer is made available in two ways (representing the same information). MnS consumers may use a read operation to read any data. Additionally, data that should be provided as soon as it is available in the MnS producer is sent to subscribed MnS consumers in notifications (e.g. information about a new alarm).
For this reason, only an alarm model is defined. The CRUD operations defined in clause 11.1 of TS 28.532 are used for interacting with the instantiation of the model.
Since the generic provisioning notifications defined in clause 11.1 of TS 28.532 are not used in all cases, the present document also defines some specific alarm notifications to report changes in the alarm model.
Interactions with the alarm model with both operations and notifications may be subject to access control.
An alarm is described by a set of attributes. This set of attributes is referred to as alarm record. An alarm record is hence the management representation of an alarm.
The object instance attribute in an alarm record identifies the object that represents the alarmed entity in the management system. Objects are identified using their Distinguished Name (DN). Note that all that is needed is a DN. It is not required that the object really exists in the management system or that it can be accessed with CRUD operations.
The alarm type (ITU-T X.733 [8], clause 8.1.1) attribute specifies roughly in which area of the supervised system an alarm has occurred:
- If the alarm type is equal to "COMMUNICATIONS_ALARM", the alarm is principally associated with the procedures and/or processes required to convey information from one point to another.
- If the alarm type is equal to "PROCESSING_ERROR_ALARM", the alarm is principally associated with a software or processing fault.
- If the alarm type is equal to "EQUIPMENT_ALARM", the alarm is principally associated with an equipment fault.
- If the alarm type is equal to "ENVIRONMENTAL_ALARM", the alarm is principally associated with a condition relating to an enclosure in which the equipment resides.
The present document also provides the alarm type "QUALITY_OF_SERVICE_ALARM". This alarm type does not specify the area where the issue occurs but conveys that the alarm is principally associated with a degradation in the quality of a service. Also, this alarm type can be combined with any perceived severity. An alarm with this type is often generated, in addition to an alarm with one of the other types, for the same underlying fault. This allows filtering on alarms that are related to a (potential) service degradation only.
The specific problem attribute (ITU-T X.733 [8], clause 8.1.2.2) provides further refinements to the probable cause of the alarm.
The perceived severity attribute (ITU-T X.733 [8], clause 8.1.2.3) allows assessing the severity of the alarm condition as determined by the system. The values critical, major, minor and warning are provided, and the value cleared indicates that the condition leading to an alarm is not present anymore.
Alarms with the same values for the attributes object instance, alarm type, probable cause and specific problem are considered the same alarm. These four attributes are also called alarm identifying attributes. As a shortcut for the alarm identifying attributes the alarm identifier is defined. To refer to a specific alarm it is hence possible to use the four alarm identifying attributes or the alarm identifier.
The alarm records representing the current state of the system are stored in alarm lists on MnS producers. An alarm list contains the alarm records related to a certain management scope. This scope is either a managed element or a subnetwork. Historical alarm records are not stored in an alarm list. Therefore, at any point in time, there cannot be more than one alarm record in an alarm list with the same values for the alarm identifying attributes.
Alarm lists are typically created automatically upon system start up. They cannot be created or deleted by MnS consumers.
The alarm records in the alarm list are created and deleted by the system. A MnS consumer can only read the attributes of alarm records but not manipulate them (with a few exceptions).
Besides the alarm records themselves, alarm lists also contain attributes describing the alarm records, such as the total number of alarm records in the alarm list or the time when an alarm record was last updated.
A MnS consumer can retrieve the alarm records in an alarm list using the "getMOIAttributes" operation defined in clause 11.1.1.2 of TS 28.532. Often it is desired to retrieve only alarm records matching some criteria and not all alarms in an alarm list. For example, a MnS consumer might be interested only in alarms whose perceived severity is critical or in alarms from a specific managed element. This requires support for conditional data node retrieval.
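The two example criteria can be illustrated with a minimal sketch that selects records from an already retrieved set. It does not show the "getMOIAttributes" operation or any filter syntax. The record keys (`objectInstance`, `perceivedSeverity`), the value spelling `CRITICAL`, the helper names and the assumption that a DN string lists its components from the root downwards are all illustrative assumptions, not taken from the present document.

```python
from typing import Callable, Iterable

# Simplified, hypothetical representation of an alarm record.
AlarmRecord = dict[str, str]


def select_alarms(records: Iterable[AlarmRecord],
                  predicate: Callable[[AlarmRecord], bool]) -> list[AlarmRecord]:
    """Return only the alarm records that satisfy the given criterion."""
    return [record for record in records if predicate(record)]


def is_critical(record: AlarmRecord) -> bool:
    # Criterion 1: perceived severity is critical (value spelling assumed).
    return record["perceivedSeverity"] == "CRITICAL"


def from_managed_element(element_dn: str) -> Callable[[AlarmRecord], bool]:
    # Criterion 2: the alarmed object is the managed element itself or lies
    # below it, assuming DNs are written root first and comma separated.
    def predicate(record: AlarmRecord) -> bool:
        dn = record["objectInstance"]
        return dn == element_dn or dn.startswith(element_dn + ",")
    return predicate
```

A consumer that has read the complete alarm list could call, for example, `select_alarms(records, is_critical)`. With conditional data node retrieval the same selection would be evaluated by the MnS producer instead.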
An alarm is defined as a fault, an error or failure that requires attention or reaction by an operator or some machine. For that reason, alarm records should not be removed from the alarm list without prior acknowledgement by the operator or a machine. The acknowledgement state attribute is provided for that purpose in an alarm record. It can have the values acknowledged and unacknowledged and is set by the MnS consumer.
When a new alarm record is created by the system, its acknowledgement state is set to unacknowledged. To acknowledge an alarm, a MnS consumer can set the attribute to acknowledged. A MnS consumer may also set back the state of a previously acknowledged alarm to unacknowledged. The MnS consumer may provide its identity (user identifier and system identifier) to the MnS Producer when setting the acknowledgement state attribute. The MnS Producer stores this information in the corresponding alarm record.
The system automatically captures the time when the acknowledgement state attribute is updated. A dedicated acknowledgement time attribute is provided for that purpose.
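The acknowledgement handling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class `AlarmRecord`, the function `set_ack_state`, the Python field names and the value spellings `ACKNOWLEDGED` and `UNACKNOWLEDGED` are hypothetical, and storing the identity exactly as provided (including when it is absent) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AlarmRecord:
    # A new alarm record starts out unacknowledged.
    ack_state: str = "UNACKNOWLEDGED"
    ack_time: Optional[datetime] = None
    ack_user_id: Optional[str] = None
    ack_system_id: Optional[str] = None


def set_ack_state(record: AlarmRecord, new_state: str,
                  user_id: Optional[str] = None,
                  system_id: Optional[str] = None) -> None:
    """Apply a consumer request to change the acknowledgement state."""
    if new_state not in ("ACKNOWLEDGED", "UNACKNOWLEDGED"):
        raise ValueError(f"unsupported acknowledgement state: {new_state}")
    record.ack_state = new_state
    # The identity is optional; it is stored as provided (assumption).
    record.ack_user_id = user_id
    record.ack_system_id = system_id
    # The time is captured by the system, not supplied by the consumer.
    record.ack_time = datetime.now(timezone.utc)
```

Setting the state back to unacknowledged uses the same function, so the acknowledgement time is refreshed on every update of the attribute.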
For reporting changes of the acknowledgement state refer to clause 6.12.
The possibility to acknowledge alarms is an optional feature.
If the condition leading to an alarm is not prevailing or not detected anymore, the perceived severity of the alarm is set to cleared by the system. These alarms are referred to as automatically detected automatically cleared alarms (ADAC alarms). There are also alarms that are not automatically cleared. These alarms are referred to as automatically detected manually cleared alarms (ADMC alarms).
MnS consumers need to manually clear ADMC alarms by setting the perceived severity attribute of the alarm record to cleared. The MnS consumer may provide its identity (user identifier and system identifier) to the MnS producer when setting the attribute. The MnS producer stores this information in the corresponding alarm record. If the fault condition still prevails, the system will create a new alarm or change the perceived severity value back to the old value, depending on whether the alarm record was removed after clearing it.
It is out of scope of the present document how the MnS consumer can find out if an alarm is an ADAC or ADMC alarm. Furthermore, it is outside the scope of the present document how a MnS consumer can find out that the fault condition does not exist anymore.
The possibility to clear alarms is a mandatory feature in case ADMC alarms may be raised by the system.
A MnS consumer can add one or more comments, in the format of free text, to an alarm record. The MnS consumer may provide its identity (user identifier and system identifier) when adding a comment. Each comment is annotated automatically with the time it is created.
A MnS consumer cannot update or delete a comment. Comments are deleted automatically when the corresponding alarm record is deleted.
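The append-only behaviour of comments can be sketched as below. The class and field names are hypothetical and only illustrate the rules stated above: comments are free text, carry an optional identity, are timestamped automatically, and cannot be updated or deleted by the MnS consumer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)  # frozen: a comment cannot be modified once created
class AlarmComment:
    text: str
    time: datetime
    user_id: Optional[str] = None
    system_id: Optional[str] = None


class AlarmRecord:
    def __init__(self) -> None:
        self._comments: list[AlarmComment] = []

    def add_comment(self, text: str, user_id: Optional[str] = None,
                    system_id: Optional[str] = None) -> AlarmComment:
        """Append a comment; the creation time is set automatically."""
        comment = AlarmComment(text, datetime.now(timezone.utc), user_id, system_id)
        self._comments.append(comment)
        return comment

    @property
    def comments(self) -> tuple[AlarmComment, ...]:
        # Read-only view: no update or delete operation is offered.
        return tuple(self._comments)
```

Because the comments live inside the record object, they disappear together with it when the alarm record is deleted.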
For reporting the addition of a comment refer to clause 6.12.
The possibility to comment alarms is an optional feature.
Multiple errors and failures may be caused by a single fault. A single error may also result in multiple failures. The system may support identifying these relationships between faults, errors, and alarms.
To capture these relationships the correlated notifications attribute and the root cause indicator attribute are provided. Modifications of these attributes are reported using the notify correlated notification changed notification.
Alarm lists may become unreliable for numerous reasons. Because managed objects (which can be alarmed and have related alarm records in the alarm list) are organised in hierarchical object trees, it is typically the alarm records relating to a complete subtree that become unreliable. For example, consider a subnetwork manager that loses the connection to one of the managed elements it manages. In this case the alarm records relating to the complete object subtree starting at the object representing the managed element are not updated any more and are hence unreliable.
Alarm lists advertise unreliable parts by indicating the base objects of unreliable subtrees in the (multi-valued) unreliable alarm scope attribute. When the complete alarm list is unreliable, the unreliable alarm scope attribute shall specify the object instance of the MnS agent. When the unreliable part of the alarm list has been rebuilt and is up to date again, the corresponding base object of the previously unreliable subtree is removed from the unreliable alarm scope attribute. An empty attribute indicates that the complete alarm list is reliable.
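How a MnS consumer might interpret the unreliable alarm scope attribute can be sketched as follows. The function names are hypothetical, and the subtree test assumes that a DN string lists its components from the root downwards, separated by commas.

```python
def is_within_subtree(dn: str, base_dn: str) -> bool:
    """True if dn is the base object itself or an object below it."""
    return dn == base_dn or dn.startswith(base_dn + ",")


def is_record_reliable(object_instance: str,
                       unreliable_alarm_scope: list[str],
                       mns_agent_dn: str) -> bool:
    """Decide whether the alarm record for object_instance can be trusted."""
    # An empty attribute means the complete alarm list is reliable.
    if not unreliable_alarm_scope:
        return True
    # The DN of the MnS agent marks the complete alarm list as unreliable.
    if mns_agent_dn in unreliable_alarm_scope:
        return False
    # Otherwise only records below a listed base object are unreliable.
    return not any(is_within_subtree(object_instance, base)
                   for base in unreliable_alarm_scope)
```

The comparison with `base_dn + ","` avoids treating `ManagedElement=AB` as part of the subtree of `ManagedElement=A`.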
When objects are created or deleted, or when attribute values are updated, this is normally notified to MnS consumers using object creation, object deletion or attribute value change notifications. When alarm records are created, deleted or modified, these general-purpose notifications are not used. Dedicated notifications are used instead as follows:
- If a new alarm record is added to an alarm list, a notify new alarm notification is sent.
- If the acknowledgement state changes its value, the notify acknowledgement state changed notification is sent.
- If a comment is added to an alarm record, the notify comments notification is sent.
- If the correlated notifications attribute or the root cause indicator attribute changes its value, the notify correlated notification changed notification is sent.
- If the perceived severity changes its value to cleared, the notify cleared alarm notification is sent.
- In all other cases a notify changed alarm general notification is sent.
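The selection rules in the list above can be sketched as a small lookup. The notification names are those used in the present document; the function `notification_for`, the event labels, the attribute name `perceivedSeverity` and the value spelling `CLEARED` are assumptions for illustration. The sketch handles one changed attribute at a time and does not cover the optional dedicated perceived severity change notification.

```python
from typing import Optional

CLEARED = "CLEARED"  # assumed spelling of the perceived severity value

# Attributes whose modification is reported by a dedicated notification.
_DEDICATED = {
    "ackState": "notifyAckStateChanged",
    "comments": "notifyComments",
    "correlatedNotifications": "notifyCorrelatedNotificationChanged",
    "rootCauseIndicator": "notifyCorrelatedNotificationChanged",
}


def notification_for(event: str, attribute: Optional[str] = None,
                     new_value: Optional[str] = None) -> str:
    """Pick the alarm notification for a change to an alarm record."""
    if event == "created":
        return "notifyNewAlarm"
    if event != "modified":
        # Removal of a record is not notified directly.
        raise ValueError(f"no alarm notification for event: {event}")
    if attribute in _DEDICATED:
        return _DEDICATED[attribute]
    if attribute == "perceivedSeverity" and new_value == CLEARED:
        return "notifyClearedAlarm"
    # All other modifications use the general notification.
    return "notifyChangedAlarmGeneral"
```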
Alarms are identified in alarm notifications using the alarm identifier, except in the notify new alarm notification, where the four alarm identifying attributes are included as well to allow the MnS consumer receiving the notification to relate the alarm identifier to the alarm identifying attributes.
The removal of an alarm record from an alarm list is not notified directly, only indirectly through the notifications reporting the clearance and, if supported, the acknowledgement of an alarm:
- If alarm acknowledgement is not supported, the MnS consumer can deduce from the reception of a notification reporting the clearance of an alarm that the corresponding alarm record was removed from the alarm list.
- If alarm acknowledgement is supported, the MnS consumer can deduce from the reception of both a notification reporting the clearance of an alarm and a notification reporting the acknowledgement of the same alarm that the corresponding alarm record was removed from the alarm list. The order of receiving the notifications is not relevant.
A MnS consumer can maintain an exact copy of the alarm list on the MnS producer by consuming the alarm notifications, provided that the MnS consumer starts with an exact alarm list copy.
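Keeping a consumer-side copy aligned through notifications, including the indirect deduction of record removal, can be sketched as follows. This is a simplified illustration: the class `AlarmListCopy`, the way notification content is passed as a dictionary of attributes, and the attribute names and value spellings are assumptions, not the notification formats of the present document.

```python
from typing import Optional


class AlarmListCopy:
    """Consumer-side copy of an alarm list, keyed by alarm identifier."""

    def __init__(self, ack_supported: bool, initial: Optional[dict] = None):
        self.ack_supported = ack_supported
        # Start from an exact copy of the alarm list (empty if none given).
        self.records = {k: dict(v) for k, v in (initial or {}).items()}

    def on_notification(self, kind: str, alarm_id: str,
                        attributes: Optional[dict] = None) -> None:
        attributes = attributes or {}
        if kind == "notifyNewAlarm":
            record = {"ackState": "UNACKNOWLEDGED", **attributes}
            self.records[alarm_id] = record
        elif alarm_id in self.records:
            record = self.records[alarm_id]
            record.update(attributes)
            if kind == "notifyClearedAlarm":
                record["perceivedSeverity"] = "CLEARED"
        else:
            return  # unknown alarm: the copy is out of sync
        if self._removed_on_producer(record):
            del self.records[alarm_id]

    def _removed_on_producer(self, record: dict) -> bool:
        # Removal is deduced from the record state, so the order in which
        # clearance and acknowledgement arrive does not matter.
        if record.get("perceivedSeverity") != "CLEARED":
            return False
        return (not self.ack_supported) or record.get("ackState") == "ACKNOWLEDGED"
```

A notification for an unknown alarm identifier is ignored here; a real consumer would more likely treat it as a reason to re-read the alarm list.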
Modifications of the unreliable alarm scope attribute are notified using the notify potential faulty alarm list notification and the notify alarm list rebuilt notification. More specifically, when
- a new value is added to the unreliable alarm scope attribute, the notify potential faulty alarm list notification is sent. The object class and object instance parameters of the notification header specify the base object of the subtree that has become unreliable.
- a value is removed from the unreliable alarm scope attribute, the notify alarm list rebuilt notification is sent. The object class and object instance parameters of the notification header specify the base object of the subtree that has been rebuilt and is reliable again.
When (parts of) the alarm list are unreliable, the MnS producer may nevertheless send reliable alarm notifications that allow a MnS consumer to maintain an exact copy of the (unreliable) alarm list on the MnS producer. When the MnS consumer then receives a notify alarm list rebuilt notification, it knows that its alarm list copy is reliable and that no alignment with the alarm list on the MnS producer is required. To inform the MnS consumer whether reliable or unreliable alarm notifications were sent, or in other words, whether an alarm list alignment is required, the alarm list alignment required attribute is provided.
To receive the notifications described in this clause, MnS consumers need to have appropriate notification subscriptions in place.
The alarm list features the operational state and the administrative state attributes.
When an alarm list is unlocked and enabled alarm records shall be added, updated, or removed based on currently prevailing alarm conditions. The alarm list is always representing the current alarm conditions. Alarm notifications are sent.
When an alarm list is locked, the system shall not add, delete, or update alarm records. However, the MnS consumer may acknowledge, clear or comment alarms. Alarm notifications are not sent.
When the alarm list is disabled, its behaviour is undefined; however, the administrative state and operational state shall be correctly handled. Alarm records may or may not be added, deleted, or updated based on prevailing alarm conditions. Furthermore, the result of a MnS consumer acknowledging, clearing, or commenting an alarm is not predictable and may or may not fail. Alarm notifications are not sent.
When an alarm list is locked or disabled its alarm records are hence not reliable.
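The behaviour in the three cases above can be summarised in a small sketch. The names `Behaviour` and `alarm_list_behaviour` and the state value spellings are hypothetical. Treating a list that is both locked and disabled as disabled is an assumption, since that combination is not described separately.

```python
from typing import NamedTuple, Optional


class Behaviour(NamedTuple):
    system_maintains_records: Optional[bool]  # None means undefined
    consumer_changes_predictable: bool        # acknowledge, clear, comment
    alarm_notifications_sent: bool


def alarm_list_behaviour(administrative_state: str,
                         operational_state: str) -> Behaviour:
    if administrative_state not in ("LOCKED", "UNLOCKED"):
        raise ValueError(f"unknown administrative state: {administrative_state}")
    if operational_state not in ("ENABLED", "DISABLED"):
        raise ValueError(f"unknown operational state: {operational_state}")
    if operational_state == "DISABLED":
        # Behaviour is undefined; alarm notifications are not sent.
        return Behaviour(None, False, False)
    if administrative_state == "LOCKED":
        # The system leaves the records untouched; consumers may still
        # acknowledge, clear or comment; alarm notifications are not sent.
        return Behaviour(False, True, False)
    # Unlocked and enabled: records follow the prevailing alarm conditions.
    return Behaviour(True, True, True)
```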
The operational state and administrative state attributes always represent the current state, and attribute value change notifications for these state attributes are always sent, even when the alarm list is locked or disabled.
Note that when moving from a locked or disabled state to an unlocked and enabled state it may take some time until all alarm records are updated, and the alarm list represents the current state of the system. The alarm list may be unreliable even though unlocked and enabled.
The system may advertise that the alarm list is unreliable in its entirety by setting the value of the unreliable alarm scope attribute to the Distinguished Name (DN) of the MnS agent.
When the system detects a fault, an error or failure caused by a fault, the system creates an internal alarm description based on the alarm record attributes. In a second step the system needs to determine if this internal alarm is a new alarm or just an update of an already existing alarm. It does so by checking if there is already an alarm record with the same values for the four alarm identifying attributes (object instance, alarm type, probable cause, and specific problem) in the alarm list.
- If there is an alarm record with the same values for the alarm identifying attributes, then the corresponding existing alarm record in the alarm list is updated.
- If there is no alarm record with the same values for the alarm identifying attributes, then a new alarm record is added to the alarm list.
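The decision between updating an existing alarm record and adding a new one can be sketched as a lookup on the four alarm identifying attributes. The function name, the dictionary-based alarm list and the attribute key spellings are assumptions for illustration.

```python
# The four alarm identifying attributes (key spellings assumed).
IDENTIFYING_ATTRIBUTES = ("objectInstance", "alarmType",
                          "probableCause", "specificProblem")


def add_or_update(alarm_list: dict, internal_alarm: dict) -> str:
    """Merge an internal alarm description into the alarm list.

    Returns "updated" if a record with the same identifying attributes
    already existed, otherwise "added".
    """
    key = tuple(internal_alarm[name] for name in IDENTIFYING_ATTRIBUTES)
    if key in alarm_list:
        alarm_list[key].update(internal_alarm)
        return "updated"
    alarm_list[key] = dict(internal_alarm)
    return "added"
```

Using the attribute tuple as the dictionary key also guarantees that the list never holds two records with the same identifying attributes.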
If alarm acknowledgement is supported, alarm records for cleared alarms are deleted by the system only when they are acknowledged. In other words, the alarm list contains only alarm records for alarms whose:
- perceived severity is not cleared, or whose
- perceived severity is cleared, but that are not acknowledged.
If alarm acknowledgement is not supported, alarm records for cleared alarms are deleted immediately by the system.
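The retention rule for both cases can be expressed as a single predicate. The function name and the value spellings `CLEARED` and `ACKNOWLEDGED` are assumptions for illustration.

```python
def stays_in_alarm_list(perceived_severity: str, ack_state: str,
                        ack_supported: bool) -> bool:
    """True if the system keeps the alarm record in the alarm list."""
    if perceived_severity != "CLEARED":
        return True   # alarms that are not cleared are always kept
    if not ack_supported:
        return False  # cleared alarms are deleted immediately
    # Cleared alarms are kept until they are acknowledged.
    return ack_state != "ACKNOWLEDGED"
```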
The alarms represented by the alarm records in the alarm list are also referred to as active alarms.