Appendix A. Vendor-Specific Alarm Types Example
This example shows how to define alarm types in a vendor-specific module. In this case, the vendor "xyz" has chosen to define top-level identities according to X.733 event types.

   module example-xyz-alarms {
     namespace "urn:example:xyz-alarms";
     prefix xyz-al;

     import ietf-alarms {
       prefix al;
     }

     identity xyz-alarms {
       base al:alarm-type-id;
     }

     identity communications-alarm {
       base xyz-alarms;
     }
     identity quality-of-service-alarm {
       base xyz-alarms;
     }
     identity processing-error-alarm {
       base xyz-alarms;
     }
     identity equipment-alarm {
       base xyz-alarms;
     }
     identity environmental-alarm {
       base xyz-alarms;
     }

     // communications alarms
     identity link-alarm {
       base communications-alarm;
     }

     // QoS alarms
     identity high-jitter-alarm {
       base quality-of-service-alarm;
     }
   }
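The identity hierarchy above lets a management system reason about abstract alarm types. As a minimal sketch only, the following Python mirrors the identities of "example-xyz-alarms" in a dictionary and checks whether one identity is derived from another. The dictionary layout and the function name "derived_from_or_self" are assumptions made for illustration; they are not part of the module or of any YANG library.

```python
from typing import Dict, Optional

# Hypothetical in-memory copy of the identities defined in
# example-xyz-alarms. Each identity maps to its base; None marks the
# root used by this sketch.
BASES: Dict[str, Optional[str]] = {
    "al:alarm-type-id": None,
    "xyz-al:xyz-alarms": "al:alarm-type-id",
    "xyz-al:communications-alarm": "xyz-al:xyz-alarms",
    "xyz-al:quality-of-service-alarm": "xyz-al:xyz-alarms",
    "xyz-al:processing-error-alarm": "xyz-al:xyz-alarms",
    "xyz-al:equipment-alarm": "xyz-al:xyz-alarms",
    "xyz-al:environmental-alarm": "xyz-al:xyz-alarms",
    "xyz-al:link-alarm": "xyz-al:communications-alarm",
    "xyz-al:high-jitter-alarm": "xyz-al:quality-of-service-alarm",
}


def derived_from_or_self(identity: str, ancestor: str) -> bool:
    """Walk the base chain of 'identity' and report whether it reaches 'ancestor'."""
    current: Optional[str] = identity
    while current is not None:
        if current == ancestor:
            return True
        current = BASES[current]
    return False
```

With this sketch, "xyz-al:link-alarm" is reported as derived from "xyz-al:communications-alarm" but not from "xyz-al:equipment-alarm".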
Appendix B. Alarm Inventory Example
This example shows an alarm inventory: one alarm type is defined only with the identifier, and another is dynamically configured. In the latter case, a digital input has been connected to a smoke detector; therefore, the "alarm-type-qualifier" is set to "smoke-alarm" and the "alarm-type-id" to "environmental-alarm".

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms"
           xmlns:dev="urn:example:device">
     <alarm-inventory>
       <alarm-type>
         <alarm-type-id>xyz-al:link-alarm</alarm-type-id>
         <alarm-type-qualifier/>
         <resource>
           /dev:interfaces/dev:interface
         </resource>
         <will-clear>true</will-clear>
         <description>
           Link failure; operational state down but admin state up
         </description>
       </alarm-type>
       <alarm-type>
         <alarm-type-id>xyz-al:environmental-alarm</alarm-type-id>
         <alarm-type-qualifier>smoke-alarm</alarm-type-qualifier>
         <will-clear>true</will-clear>
         <description>
           Connected smoke detector to digital input
         </description>
       </alarm-type>
     </alarm-inventory>
   </alarms>

Appendix C. Alarm List Example
In this example, we show an alarm that has toggled [major, clear, major]. An operator has acknowledged the alarm.

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms"
           xmlns:dev="urn:example:device">
     <alarm-list>
       <number-of-alarms>1</number-of-alarms>
       <last-changed>2018-04-08T08:39:50.00Z</last-changed>
       <alarm>
         <resource>
           /dev:interfaces/dev:interface[name='FastEthernet1/0']
         </resource>
         <alarm-type-id>xyz-al:link-alarm</alarm-type-id>
         <alarm-type-qualifier></alarm-type-qualifier>
         <time-created>2018-04-08T08:20:10.00Z</time-created>
         <is-cleared>false</is-cleared>
         <alt-resource>1.3.6.1.2.1.2.2.1.1.17</alt-resource>
         <last-raised>2018-04-08T08:39:40.00Z</last-raised>
         <last-changed>2018-04-08T08:39:50.00Z</last-changed>
         <perceived-severity>major</perceived-severity>
         <alarm-text>
           Link operationally down but administratively up
         </alarm-text>
         <status-change>
           <time>2018-04-08T08:39:40.00Z</time>
           <perceived-severity>major</perceived-severity>
           <alarm-text>
             Link operationally down but administratively up
           </alarm-text>
         </status-change>
         <status-change>
           <time>2018-04-08T08:30:00.00Z</time>
           <perceived-severity>cleared</perceived-severity>
           <alarm-text>
             Link operationally up and administratively up
           </alarm-text>
         </status-change>
         <status-change>
           <time>2018-04-08T08:20:10.00Z</time>
           <perceived-severity>major</perceived-severity>
           <alarm-text>
             Link operationally down but administratively up
           </alarm-text>
         </status-change>
         <operator-state-change>
           <time>2018-04-08T08:39:50.00Z</time>
           <state>ack</state>
           <operator>joe</operator>
           <text>Will investigate, ticket TR764999</text>
         </operator-state-change>
       </alarm>
     </alarm-list>
   </alarms>
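The relationship between the "status-change" entries and the summary leaves ("is-cleared", "perceived-severity", and "last-raised") in the example above can be sketched as follows. This is an illustrative sketch only: the "StatusChange" type and the "summarize" function are hypothetical names, and the code assumes that all timestamps use the same format and time zone so that they sort correctly as strings.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class StatusChange(NamedTuple):
    time: str      # timestamp string, formatted as in the example
    severity: str  # "cleared" or a severity such as "major"


def summarize(
    changes: Iterable[StatusChange],
) -> Tuple[bool, Optional[str], Optional[str]]:
    """Return (is_cleared, perceived_severity, last_raised).

    Assumption of this sketch: "last-raised" is the time of the most
    recent change from cleared (or from no history) to a non-cleared
    severity, and "perceived-severity" is the last non-cleared severity.
    """
    is_cleared = True
    severity: Optional[str] = None
    last_raised: Optional[str] = None
    # Replay the history oldest first; string order equals time order
    # under the same-format, same-zone assumption stated above.
    for change in sorted(changes, key=lambda c: c.time):
        if change.severity == "cleared":
            is_cleared = True
        else:
            if is_cleared:
                last_raised = change.time
            severity = change.severity
            is_cleared = False
    return is_cleared, severity, last_raised


# The three status changes from the example, newest first as listed.
history = [
    StatusChange("2018-04-08T08:39:40.00Z", "major"),
    StatusChange("2018-04-08T08:30:00.00Z", "cleared"),
    StatusChange("2018-04-08T08:20:10.00Z", "major"),
]
summary = summarize(history)
```

For the history in the example, the sketch yields a non-cleared alarm with severity "major" that was last raised at 08:39:40, which matches the "is-cleared", "perceived-severity", and "last-raised" leaves shown above. The operator acknowledgement does not enter the computation, since operator state is kept separate from the alarm state.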
Appendix D. Alarm Shelving Example
This example shows how to shelve alarms. We shelve alarms related to the smoke detectors, since they are being installed and tested. We also shelve all alarms from FastEthernet1/0.

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms"
           xmlns:dev="urn:example:device">
     <control>
       <alarm-shelving>
         <shelf>
           <name>FE10</name>
           <resource>
             /dev:interfaces/dev:interface[name='FastEthernet1/0']
           </resource>
         </shelf>
         <shelf>
           <name>detectortest</name>
           <alarm-type>
             <alarm-type-id>
               xyz-al:environmental-alarm
             </alarm-type-id>
             <alarm-type-qualifier-match>
               smoke-alarm
             </alarm-type-qualifier-match>
           </alarm-type>
         </shelf>
       </alarm-shelving>
     </control>
   </alarms>
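How the two shelves above select alarms can be sketched in Python. This is a simplified illustration, not the matching rules of the YANG module: the "Alarm" and "Shelf" types and the "is_shelved" function are hypothetical, resources and alarm type identifiers are compared for plain equality, and the qualifier match is treated as a regular expression evaluated with Python's "re" module (assumed here to be close enough for simple patterns).

```python
import re
from typing import NamedTuple, Optional


class Alarm(NamedTuple):
    resource: str
    alarm_type_id: str
    alarm_type_qualifier: str


class Shelf(NamedTuple):
    name: str
    resource: Optional[str] = None         # None: any resource
    alarm_type_id: Optional[str] = None    # None: any alarm type
    qualifier_match: Optional[str] = None  # None: any qualifier


def is_shelved(alarm: Alarm, shelf: Shelf) -> bool:
    """Return True if every criterion set on the shelf matches the alarm."""
    if shelf.resource is not None and alarm.resource != shelf.resource:
        return False
    if (shelf.alarm_type_id is not None
            and alarm.alarm_type_id != shelf.alarm_type_id):
        return False
    if (shelf.qualifier_match is not None
            and re.fullmatch(shelf.qualifier_match,
                             alarm.alarm_type_qualifier) is None):
        return False
    return True


FE10_RESOURCE = "/dev:interfaces/dev:interface[name='FastEthernet1/0']"

# The two shelves from the example above.
fe10 = Shelf(name="FE10", resource=FE10_RESOURCE)
detectortest = Shelf(name="detectortest",
                     alarm_type_id="xyz-al:environmental-alarm",
                     qualifier_match="smoke-alarm")

link_alarm = Alarm(FE10_RESOURCE, "xyz-al:link-alarm", "")
# "/dev:digital-inputs/input[1]" is a made-up resource for illustration.
smoke_alarm = Alarm("/dev:digital-inputs/input[1]",
                    "xyz-al:environmental-alarm", "smoke-alarm")
```

In this sketch, the link alarm on FastEthernet1/0 is caught by the "FE10" shelf only, and the smoke alarm is caught by the "detectortest" shelf only.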
Appendix E. X.733 Mapping Example
This example shows how to map a dynamic alarm type (alarm-type-id=environmental-alarm, alarm-type-qualifier=smoke-alarm) to the corresponding X.733 "event-type" and "probable-cause" parameters.

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms">
     <control>
       <x733-mapping
           xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms-x733">
         <alarm-type-id>xyz-al:environmental-alarm</alarm-type-id>
         <alarm-type-qualifier-match>
           smoke-alarm
         </alarm-type-qualifier-match>
         <event-type>quality-of-service-alarm</event-type>
         <probable-cause>777</probable-cause>
       </x733-mapping>
     </control>
   </alarms>

Appendix F. Relationship to Other Alarm Standards
This section briefly describes how this alarm data model relates to other relevant standards.

F.1. Definition of "Alarm"
The table below summarizes relevant definitions of the term "alarm" in other alarm standards.

+------------+---------------------------+--------------------------+
| Standard   | Definition                | Comment                  |
+------------+---------------------------+--------------------------+
| X.733      | error: A deviation of a   | The X.733 alarm          |
| [X.733]    | system from normal        | definition is focused on |
|            | operation. fault: The     | the notification as such |
|            | physical or algorithmic   | and not the state.       |
|            | cause of a malfunction.   | X.733 defines an alarm   |
|            | Faults manifest           | as a deviation from a    |
|            | themselves as errors.     | normal condition but     |
|            | alarm: A notification, of | without the requirement  |
|            | the form defined by this  | that it needs corrective |
|            | function, of a specific   | actions.                 |
|            | event. An alarm may or    |                          |
|            | may not represent an      |                          |
|            | error.                    |                          |
|            |                           |                          |
| G.7710     | Alarms are indications    | The G.7710 definition is |
| [G.7710]   | that are automatically    | close to the original    |
|            | generated by a device as  | X.733 definition.        |
|            | a result of the           |                          |
|            | declaration of a failure. |                          |
|            |                           |                          |
| Alarm MIB  | Alarm: Persistent         | RFC 3877 defines the     |
| [RFC3877]  | indication of a fault.    | term alarm as referring  |
|            | Fault: Lasting error or   | back to "a deviation     |
|            | warning condition.        | from normal operation".  |
|            | Error: A deviation of a   | The Alarm YANG data      |
|            | system from normal        | model adds the           |
|            | operation.                | requirement that it      |
|            |                           | should require a         |
|            |                           | corrective action and    |
|            |                           | should be undesired, not |
|            |                           | only a deviation from    |
|            |                           | normal. The alarm MIB    |
|            |                           | is state oriented in the |
|            |                           | same way as the Alarm    |
|            |                           | YANG module; it focuses  |
|            |                           | on the "lasting          |
|            |                           | condition", not the      |
|            |                           | individual               |
|            |                           | notifications.           |
|            |                           |                          |
| ISA        | Alarm: An audible and/or  | The ISA standard adds an |
| [ISA182]   | visible means of          | important requirement to |
|            | indicating to the         | the "deviation from      |
|            | operator an equipment     | normal condition state": |
|            | malfunction, process      | requiring a response.    |
|            | deviation, or abnormal    |                          |
|            | condition requiring a     |                          |
|            | response.                 |                          |
|            |                           |                          |
| EEMUA      | An alarm is an event to   | This is the foundation   |
| [EEMUA]    | which an operator must    | for the definition of    |
|            | knowingly react, respond, | alarm in this document.  |
|            | and acknowledge -- not    | It focuses on the core   |
|            | simply acknowledge and    | criterion that an action |
|            | ignore.                   | is really needed.        |
|            |                           |                          |
| 3GPP Alarm | 3GPP v15: An alarm        | The latest 3GPP Alarm    |
| IRP        | signifies an undesired    | IRP version uses         |
| [ALARMIRP] | condition of a resource   | literally the same alarm |
|            | (e.g., device, link) for  | definition as this alarm |
|            | which an operator action  | data model. It is worth  |
|            | is required. It           | noting that earlier      |
|            | emphasizes a key          | versions used a          |
|            | requirement that          | definition not requiring |
|            | operators [...] should    | an operator action and   |
|            | not be informed about an  | the more-broad           |
|            | undesired condition       | definition of deviation  |
|            | unless it requires        | from normal condition.   |
|            | operator action.          | The earlier version also |
|            | 3GPP v12: alarm: abnormal | defined an alarm as a    |
|            | network entity condition, | special case of "event". |
|            | which categorizes an      |                          |
|            | event as a fault.         |                          |
|            | fault: a deviation of a   |                          |
|            | system from normal        |                          |
|            | operation, which may      |                          |
|            | result in the loss of     |                          |
|            | operational capabilities  |                          |
|            | [...]                     |                          |
+------------+---------------------------+--------------------------+

        Table 1: Definition of the Term "Alarm" in Standards

The definition of alarm has evolved from a focus on events that report a deviation from normal operation towards a definition based on an undesired *state* that *requires an operator action*.

F.2. Data Model
This section describes how this YANG alarm data model relates to other standard data models. Note well that we cover other data models for alarm interfaces but not other standards such as SDO-specific alarms.

F.2.1. X.733
X.733 has acted as a base for several alarm data models over the years. The YANG alarm data model differs in the following ways:

o  X.733 models the alarm list as a list of notifications. The YANG alarm data model defines the alarm list as the current alarm states for the resources, which is generated from the state change reporting notifications.

o  In X.733, an alarm can have the severity level "clear". In the YANG alarm data model, "clear" is not a severity level; it is a separate state of the alarm. An alarm can have the following states, for example, (major, cleared) and (minor, not cleared).

o  X.733 uses a flat, globally defined enumerated "probable-cause" to identify alarm types. This alarm data model uses a hierarchical YANG identity: "alarm-type". This enables delegation of alarm types within organizations. It also enables management to reason about abstract alarm types corresponding to base identities; see Section 3.2.

o  The YANG alarm data model has not included the majority of the X.733 alarm attributes. Rather, these are defined in an augmenting module [X.733] if "strict" X.733 compliance is needed.

F.2.2. The Alarm MIB (RFC 3877)
The MIB in RFC 3877 takes a different approach; rather than defining a concrete data model for alarms, it defines a model to map existing SNMP-managed objects and notifications into alarm states and alarm notifications. This was necessary since MIBs were already defined with both managed objects and notifications indicating alarms, for example, "linkUp" and "linkDown" notifications in combination with "ifAdminStatus" and "ifOperStatus". So, RFC 3877 cannot really be compared to the alarm YANG module in that sense.

The Alarm MIB maps existing MIB definitions into alarms by means of tables such as "alarmModelTable". The upside of that is that an SNMP Manager can, at runtime, read the possible alarm types. This corresponds to the "alarm-inventory" in the alarm YANG module.

F.2.3. 3GPP Alarm IRP
The 3GPP Alarm IRP is an evolution of X.733. The main differences between the alarm YANG module and 3GPP are as follows:

o  3GPP keeps the majority of the X.733 attributes, but the alarm YANG module does not.

o  3GPP introduced overlapping and possibly conflicting keys for alarms: "alarmId" and the tuple (managed object, event type, probable cause, specific problem). (See Example 3 in Annex C of [ALARMIRP].) In the YANG alarm data model, the key for identifying an alarm instance is clearly defined by ("resource", "alarm-type-id", "alarm-type-qualifier"). See also Section 3.4 for more information.

o  The alarm YANG module clearly separates the resource/instrumentation lifecycle from the operator lifecycle. 3GPP allows operators to set the alarm severity to clear; this is not allowed by this module. Rather, an operator closes an alarm, which does not affect the severity.

F.2.4. G.7710
G.7710 differs from the previously referenced alarm standards. It does not define a data model for alarm reporting. It defines common equipment management function requirements, including alarm instrumentation. The scope is transport networks.

The requirements in G.7710 correspond to features in the alarm YANG module in the following way:

Alarm Severity Assignment Profile (ASAP): the alarm profile "/alarms/alarm-profile/".

Alarm Reporting Control (ARC): alarm shelving "/alarms/control/alarm-shelving/" and the ability to control alarm notifications "/alarms/control/notify-status-changes".

Alarm shelving corresponds to the use case of turning off alarm reporting for a specific resource, which is the NALM (No ALarM) state in M.3100.

Appendix G. Alarm-Usability Requirements
This section defines usability requirements for alarms. Alarm usability is important for an alarm interface. A data model helps in defining the format, but if the actual alarms are of low value, the goal of alarm management has not been achieved.

Common alarm problems and their causes are summarized in Table 2. This summary is adapted to networking based on the ISA [ISA182] and Engineering Equipment Materials Users Association (EEMUA) [EEMUA] standards.
+-----------------+--------------------------------+----------------+
| Problem         | Cause                          | How this       |
|                 |                                | module         |
|                 |                                | addresses the  |
|                 |                                | cause          |
+-----------------+--------------------------------+----------------+
| Alarms are      | "Nuisance" alarms (chattering  | Strict         |
| generated, but  | alarms and fleeting alarms),   | definition of  |
| they are        | faulty hardware, redundant     | alarms         |
| ignored by the  | alarms, cascading alarms,      | requiring      |
| operator.       | incorrect alarm settings, and  | corrective     |
|                 | alarms that have not been      | response. See  |
|                 | rationalized; the alarms       | alarm          |
|                 | represent log information      | requirements   |
|                 | rather than true alarms.       | in Table 3.    |
|                 |                                |                |
| When alarms     | Insufficient alarm-response    | The alarm      |
| occur,          | procedures and not well-       | inventory      |
| operators do    | defined alarm types.           | lists all      |
| not know how to |                                | alarm types    |
| respond.        |                                | and corrective |
|                 |                                | actions. See   |
|                 |                                | alarm          |
|                 |                                | requirements   |
|                 |                                | in Table 3.    |
|                 |                                |                |
| The alarm       | Nuisance alarms, stale alarms, | The alarm      |
| display is full | and alarms from equipment not  | definition and |
| of alarms, even | in service.                    | alarm          |
| when there is   |                                | shelving.      |
| nothing wrong.  |                                |                |
|                 |                                |                |
| During a        | Incorrect prioritization of    | State-based    |
| failure,        | alarms. Not using advanced     | alarm model    |
| operators are   | alarm techniques (e.g., state- | and alarm-rate |
| flooded with so | based alarming).               | requirements;  |
| many alarms     |                                | see Tables 4   |
| that they do    |                                | and 5,         |
| not know which  |                                | respectively.  |
| ones are the    |                                |                |
| most important. |                                |                |
+-----------------+--------------------------------+----------------+

                 Table 2: Alarm Problems and Causes
Based upon the above problems, EEMUA gives the following definition of a good alarm:

+----------------+--------------------------------------------------+
| Characteristic | Explanation                                      |
+----------------+--------------------------------------------------+
| Relevant       | Not spurious or of low operational value.        |
|                |                                                  |
| Unique         | Not duplicating another alarm.                   |
|                |                                                  |
| Timely         | Not long before any response is needed or too    |
|                | late to do anything.                             |
|                |                                                  |
| Prioritized    | Indicating the importance that the operator      |
|                | deals with the problem.                          |
|                |                                                  |
| Understandable | Having a message that is clear and easy to       |
|                | understand.                                      |
|                |                                                  |
| Diagnostic     | Identifying the problem that has occurred.       |
|                |                                                  |
| Advisory       | Indicative of the action to be taken.            |
|                |                                                  |
| Focusing       | Drawing attention to the most important issues.  |
+----------------+--------------------------------------------------+

                Table 3: Definition of a Good Alarm

Vendors SHOULD rationalize all alarms according to the table above. Another crucial requirement is acceptable alarm notification rates. Vendors SHOULD make sure that they do not exceed the recommendations from EEMUA below:

+-----------------------------------+-------------------------------+
| Long-Term Alarm Rate in Steady    | Acceptability                 |
| Operation                         |                               |
+-----------------------------------+-------------------------------+
| More than one per minute          | Very likely to be             |
|                                   | unacceptable.                 |
|                                   |                               |
| One per 2 minutes                 | Likely to be overdemanding.   |
|                                   |                               |
| One per 5 minutes                 | Manageable.                   |
|                                   |                               |
| Less than one per 10 minutes      | Very likely to be acceptable. |
+-----------------------------------+-------------------------------+

           Table 4: Acceptable Alarm Rates -- Steady State
+----------------------------+--------------------------------------+
| Number of alarms displayed | Acceptability                        |
| in 10 minutes following a  |                                      |
| major network problem      |                                      |
+----------------------------+--------------------------------------+
| More than 100              | Definitely excessive and very likely |
|                            | to lead to the operator abandoning   |
|                            | the use of the alarm system.         |
|                            |                                      |
| 20-100                     | Hard to cope with.                   |
|                            |                                      |
| Under 10                   | Should be manageable, but it may be  |
|                            | difficult if several of the alarms   |
|                            | require a complex operator response. |
+----------------------------+--------------------------------------+

              Table 5: Acceptable Alarm Rates -- Burst

The numbers in Tables 4 and 5 are the sum of all alarms for a network being managed from one alarm console. So every individual system or Network Management System (NMS) contributes to these numbers.

Vendors SHOULD make sure that the following rules are used in designing the alarm interface:

1.  Rationalize the alarms in the system to ensure that every alarm is necessary, has a purpose, and follows the cardinal rule that it requires an operator response. Ensure that every alarm adheres to the rules of Table 3.

2.  Audit the quality of the alarms. Talk with the operators about how well the alarm information supports them. Do they know what to do in the event of an alarm? Are they able to quickly diagnose the problem and determine the corrective action? Does the alarm text adhere to the requirements in Table 3?

3.  Analyze and benchmark the performance of the system and compare it to the recommended metrics in Tables 4 and 5. Start by identifying nuisance alarms, as well as standing alarms at normal state and startup.
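The benchmarking step in rule 3 can be sketched as a comparison of measured alarm counts against Tables 4 and 5. The following Python is an illustration only. The function names are hypothetical, and the band boundaries for the steady-state case are an assumption of this sketch: Table 4 lists individual rates rather than ranges, so each listed rate is treated here as the upper edge of a band. Rates that fall between the last two rows of Table 4, and counts from 10 to 19 in Table 5, are reported as not covered rather than guessed.

```python
def classify_steady_rate(alarm_count: int, minutes: float) -> str:
    """Classify a long-term alarm rate against Table 4.

    Band edges are an assumption of this sketch, not part of the table.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    if alarm_count == 0:
        return "very likely to be acceptable"
    interval = minutes / alarm_count  # average minutes per alarm
    if interval < 1:
        return "very likely to be unacceptable"
    if interval <= 2:
        return "likely to be overdemanding"
    if interval <= 5:
        return "manageable"
    if interval > 10:
        return "very likely to be acceptable"
    return "not covered by Table 4"


def classify_burst(alarms_in_10_minutes: int) -> str:
    """Classify the alarm count in the 10 minutes after a major problem (Table 5)."""
    if alarms_in_10_minutes > 100:
        return "definitely excessive"
    if alarms_in_10_minutes >= 20:
        return "hard to cope with"
    if alarms_in_10_minutes < 10:
        return "should be manageable"
    return "not covered by Table 5"
```

For example, 120 alarms over one hour of steady operation average one alarm every 30 seconds and are classified as "very likely to be unacceptable", while 12 alarms over the same hour average one every 5 minutes and are classified as "manageable". As noted above, the counts passed to these functions would need to be the sum of all alarms seen at one alarm console.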
Acknowledgements
The authors wish to thank Viktor Leijon and Johan Nordlander for their valuable input on forming the alarm model.

The authors also wish to thank Nick Hancock, Joey Boyd, Tom Petch, and Balazs Lengyel for their extensive reviews and contributions to this document.

Authors' Addresses

Stefan Vallin
Stefan Vallin AB

Email: stefan@wallan.se

Martin Bjorklund
Cisco

Email: mbj@tail-f.com