RFC 8632

A YANG Data Model for Alarm Management

Pages: 82
Proposed Standard

Part 4 of 4 – Pages 70 to 82

Appendix A.  Vendor-Specific Alarm Types Example

   This example shows how to define alarm types in a vendor-specific
   module.  In this case, the vendor "xyz" has chosen to define top-
   level identities according to X.733 event types.

   module example-xyz-alarms {
     namespace "urn:example:xyz-alarms";
     prefix xyz-al;

     import ietf-alarms {
       prefix al;
     }

     identity xyz-alarms {
       base al:alarm-type-id;
     }

     identity communications-alarm {
       base xyz-alarms;
     }
     identity quality-of-service-alarm {
       base xyz-alarms;
     }
     identity processing-error-alarm {
       base xyz-alarms;
     }
     identity equipment-alarm {
       base xyz-alarms;
     }
     identity environmental-alarm {
       base xyz-alarms;
     }

     // communications alarms
     identity link-alarm {
       base communications-alarm;
     }

     // QoS alarms
     identity high-jitter-alarm {
       base quality-of-service-alarm;
     }
   }
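
   The identity hierarchy above lets a management system reason about
   abstract alarm types.  The following is a minimal Python sketch of
   that idea, not part of the YANG module: the "BASES" table and the
   "derived_from_or_self" helper are hypothetical names that mirror
   the identities of "example-xyz-alarms" by hand.

```python
# Hypothetical in-memory view of the identities in "example-xyz-alarms".
# Each identity maps to its base; "al:alarm-type-id" is the root that
# ietf-alarms defines.  This is an illustration, not a YANG parser.
BASES = {
    "xyz-al:xyz-alarms": "al:alarm-type-id",
    "xyz-al:communications-alarm": "xyz-al:xyz-alarms",
    "xyz-al:quality-of-service-alarm": "xyz-al:xyz-alarms",
    "xyz-al:processing-error-alarm": "xyz-al:xyz-alarms",
    "xyz-al:equipment-alarm": "xyz-al:xyz-alarms",
    "xyz-al:environmental-alarm": "xyz-al:xyz-alarms",
    "xyz-al:link-alarm": "xyz-al:communications-alarm",
    "xyz-al:high-jitter-alarm": "xyz-al:quality-of-service-alarm",
}


def derived_from_or_self(identity, base):
    """Return True if 'identity' equals 'base' or derives from it."""
    current = identity
    while current is not None:
        if current == base:
            return True
        # Walk one step up the hierarchy; None once the root is passed.
        current = BASES.get(current)
    return False
```

   With this table, "xyz-al:link-alarm" is recognized as a
   communications alarm, while it is not a quality-of-service alarm.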

Appendix B.  Alarm Inventory Example

   This example shows an alarm inventory: one alarm type is defined
   only with the identifier and another is dynamically configured.  In
   the latter case, a digital input has been connected to a smoke
   detector; therefore, the "alarm-type-qualifier" is set to
   "smoke-alarm" and the "alarm-type-id" to "environmental-alarm".

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms"
           xmlns:dev="urn:example:device">
     <alarm-inventory>
       <alarm-type>
         <alarm-type-id>xyz-al:link-alarm</alarm-type-id>
         <alarm-type-qualifier/>
         <resource>
           /dev:interfaces/dev:interface
         </resource>
         <will-clear>true</will-clear>
         <description>
           Link failure; operational state down but admin state up
         </description>
       </alarm-type>
       <alarm-type>
         <alarm-type-id>xyz-al:environmental-alarm</alarm-type-id>
         <alarm-type-qualifier>smoke-alarm</alarm-type-qualifier>
         <will-clear>true</will-clear>
         <description>
           Connected smoke detector to digital input
         </description>
       </alarm-type>
     </alarm-inventory>
   </alarms>
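
   A client can read such an inventory to learn which alarm types a
   system may raise.  The sketch below, which uses only the Python
   standard library, extracts the (alarm-type-id, alarm-type-qualifier)
   pairs from a trimmed copy of the example.  The function name
   "list_alarm_types" and the constant "INVENTORY" are illustrative
   assumptions; how the XML is retrieved from a server is out of scope.

```python
import xml.etree.ElementTree as ET

# Namespace of the ietf-alarms module, as used in the example above.
NS = {"al": "urn:ietf:params:xml:ns:yang:ietf-alarms"}

# Trimmed copy of the inventory example (descriptions and resources
# are left out to keep the sketch short).
INVENTORY = """\
<alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
        xmlns:xyz-al="urn:example:xyz-alarms">
  <alarm-inventory>
    <alarm-type>
      <alarm-type-id>xyz-al:link-alarm</alarm-type-id>
      <alarm-type-qualifier/>
      <will-clear>true</will-clear>
    </alarm-type>
    <alarm-type>
      <alarm-type-id>xyz-al:environmental-alarm</alarm-type-id>
      <alarm-type-qualifier>smoke-alarm</alarm-type-qualifier>
      <will-clear>true</will-clear>
    </alarm-type>
  </alarm-inventory>
</alarms>
"""


def list_alarm_types(xml_text):
    """Return (alarm-type-id, alarm-type-qualifier) pairs in document order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for entry in root.findall("al:alarm-inventory/al:alarm-type", NS):
        type_id = entry.findtext("al:alarm-type-id", default="", namespaces=NS)
        qualifier = entry.findtext(
            "al:alarm-type-qualifier", default="", namespaces=NS
        )
        # An empty qualifier element yields an empty string.
        pairs.append((type_id.strip(), qualifier.strip()))
    return pairs
```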

Appendix C.  Alarm List Example

   In this example, we show an alarm that has toggled [major, cleared,
   major].  An operator has acknowledged the alarm.

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms"
           xmlns:dev="urn:example:device">
     <alarm-list>
       <number-of-alarms>1</number-of-alarms>
       <last-changed>2018-04-08T08:39:50.00Z</last-changed>
       <alarm>
         <resource>
           /dev:interfaces/dev:interface[name='FastEthernet1/0']
         </resource>
         <alarm-type-id>xyz-al:link-alarm</alarm-type-id>
         <alarm-type-qualifier></alarm-type-qualifier>
         <time-created>2018-04-08T08:20:10.00Z</time-created>
         <is-cleared>false</is-cleared>
         <alt-resource>1.3.6.1.2.1.2.2.1.1.17</alt-resource>
         <last-raised>2018-04-08T08:39:40.00Z</last-raised>
         <last-changed>2018-04-08T08:39:50.00Z</last-changed>
         <perceived-severity>major</perceived-severity>
         <alarm-text>
           Link operationally down but administratively up
         </alarm-text>
         <status-change>
           <time>2018-04-08T08:39:40.00Z</time>
           <perceived-severity>major</perceived-severity>
           <alarm-text>
             Link operationally down but administratively up
           </alarm-text>
         </status-change>
         <status-change>
           <time>2018-04-08T08:30:00.00Z</time>
           <perceived-severity>cleared</perceived-severity>
           <alarm-text>
             Link operationally up and administratively up
           </alarm-text>
         </status-change>
         <status-change>
           <time>2018-04-08T08:20:10.00Z</time>
           <perceived-severity>major</perceived-severity>
           <alarm-text>
             Link operationally down but administratively up
           </alarm-text>
         </status-change>
         <operator-state-change>
           <time>2018-04-08T08:39:50.00Z</time>
           <state>ack</state>
           <operator>joe</operator>
           <text>Will investigate, ticket TR764999</text>
         </operator-state-change>
       </alarm>
     </alarm-list>
   </alarms>
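
   The alarm-level leafs in this example are consistent with the
   "status-change" history.  The following Python sketch shows one way
   to derive them, under the assumptions that "last-raised" is the time
   of the most recent transition from cleared (or nonexistent) to
   raised and that "perceived-severity" is the last severity other than
   "cleared".  The function name "summarize" is hypothetical, and the
   history is assumed to be non-empty.

```python
# Status changes copied from the example above as (time, severity).
STATUS_CHANGES = [
    ("2018-04-08T08:39:40.00Z", "major"),
    ("2018-04-08T08:30:00.00Z", "cleared"),
    ("2018-04-08T08:20:10.00Z", "major"),
]


def summarize(status_changes):
    """Derive alarm-level leaf values from a status-change history."""
    # All timestamps share one format and time zone, so sorting the
    # strings also sorts them chronologically.
    ordered = sorted(status_changes)
    last_raised = None
    perceived_severity = None
    previous = "cleared"  # an alarm that does not exist yet is not raised
    for time, severity in ordered:
        if severity != "cleared":
            perceived_severity = severity
            if previous == "cleared":
                last_raised = time
        previous = severity
    return {
        "time-created": ordered[0][0],
        "is-cleared": previous == "cleared",
        "last-raised": last_raised,
        "perceived-severity": perceived_severity,
    }
```

   Applied to the history above, the sketch reproduces the
   "time-created", "is-cleared", "last-raised", and
   "perceived-severity" values shown in the example.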

Appendix D.  Alarm Shelving Example

   This example shows how to shelve alarms.  We shelve alarms related to
   the smoke detectors, since they are being installed and tested.  We
   also shelve all alarms from FastEthernet1/0.

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms"
           xmlns:dev="urn:example:device">
     <control>
       <alarm-shelving>
         <shelf>
           <name>FE10</name>
           <resource>
             /dev:interfaces/dev:interface[name='FastEthernet1/0']
           </resource>
         </shelf>
         <shelf>
           <name>detectortest</name>
           <alarm-type>
             <alarm-type-id>
               xyz-al:environmental-alarm
             </alarm-type-id>
             <alarm-type-qualifier-match>
               smoke-alarm
             </alarm-type-qualifier-match>
           </alarm-type>
         </shelf>
       </alarm-shelving>
     </control>
   </alarms>
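
   The matching implied by these two shelves can be sketched in Python
   as follows.  The sketch rests on several simplifying assumptions:
   the criteria within one shelf are ANDed, an absent criterion matches
   everything, resources are compared as plain strings, and Python's
   "re" module stands in for the regular expression dialect of
   "alarm-type-qualifier-match".  All names ("SHELVES", "shelf_matches",
   "is_shelved", and the dictionary keys) are hypothetical.

```python
import re

# Base of each identity used below (see the vendor module example).
BASES = {
    "xyz-al:link-alarm": "xyz-al:communications-alarm",
    "xyz-al:communications-alarm": "xyz-al:xyz-alarms",
    "xyz-al:environmental-alarm": "xyz-al:xyz-alarms",
}

FE10 = "/dev:interfaces/dev:interface[name='FastEthernet1/0']"

# The two shelves from the example, as plain dictionaries.
SHELVES = [
    {"name": "FE10", "resources": [FE10]},
    {"name": "detectortest", "alarm_types": [
        {"id": "xyz-al:environmental-alarm", "qualifier_match": "smoke-alarm"},
    ]},
]


def derives(identity, base):
    """True if 'identity' equals 'base' or derives from it."""
    while identity is not None:
        if identity == base:
            return True
        identity = BASES.get(identity)
    return False


def shelf_matches(shelf, alarm):
    """True if the alarm satisfies every criterion present in the shelf."""
    resources = shelf.get("resources", [])
    if resources and alarm["resource"] not in resources:
        return False
    alarm_types = shelf.get("alarm_types", [])
    if alarm_types and not any(
        derives(alarm["alarm_type_id"], t["id"])
        and re.fullmatch(t["qualifier_match"], alarm["alarm_type_qualifier"])
        for t in alarm_types
    ):
        return False
    return True


def is_shelved(shelves, alarm):
    """True if at least one shelf matches the alarm."""
    return any(shelf_matches(shelf, alarm) for shelf in shelves)
```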

Appendix E.  X.733 Mapping Example

   This example shows how to map a dynamic alarm type (alarm-type-
   id=environmental-alarm, alarm-type-qualifier=smoke-alarm) to the
   corresponding X.733 "event-type" and "probable-cause" parameters.

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms">
     <control>
       <x733-mapping
          xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms-x733">
         <alarm-type-id>xyz-al:environmental-alarm</alarm-type-id>
         <alarm-type-qualifier-match>
           smoke-alarm
         </alarm-type-qualifier-match>
         <event-type>quality-of-service-alarm</event-type>
         <probable-cause>777</probable-cause>
       </x733-mapping>
     </control>
   </alarms>

Appendix F.  Relationship to Other Alarm Standards

   This section briefly describes how this alarm data model relates to
   other relevant standards.

F.1.  Definition of "Alarm"

   The table below summarizes relevant definitions of the term "alarm"
   in other alarm standards.

   +------------+---------------------------+--------------------------+
   | Standard   | Definition                | Comment                  |
   +------------+---------------------------+--------------------------+
   | X.733      | error: A deviation of a   | The X.733 alarm          |
   | [X.733]    | system from normal        | definition is focused on |
   |            | operation.  fault: The    | the notification as such |
   |            | physical or algorithmic   | and not the state.       |
   |            | cause of a malfunction.   | X.733 defines an alarm   |
   |            | Faults manifest           | as a deviation from a    |
   |            | themselves as errors.     | normal condition but     |
   |            | alarm: A notification, of | without the requirement  |
   |            | the form defined by this  | that it needs corrective |
   |            | function, of a specific   | actions.                 |
   |            | event.  An alarm may or   |                          |
   |            | may not represent an      |                          |
   |            | error.                    |                          |
   |            |                           |                          |
   | G.7710     | Alarms are indications    | The G.7710 definition is |
   | [G.7710]   | that are automatically    | close to the original    |
   |            | generated by a device as  | X.733 definition.        |
   |            | a result of the           |                          |
   |            | declaration of a failure. |                          |
   |            |                           |                          |
   | Alarm MIB  | Alarm: Persistent         | RFC 3877 defines the     |
   | [RFC3877]  | indication of a fault.    | term alarm as referring  |
   |            | Fault: Lasting error or   | back to "a deviation     |
   |            | warning condition.        | from normal operation".  |
   |            | Error: A deviation of a   | The Alarm YANG data      |
   |            | system from normal        | model adds the           |
   |            | operation.                | requirement that it      |
   |            |                           | should require a         |
   |            |                           | corrective action and    |
   |            |                           | should be undesired, not |
   |            |                           | only a deviation from    |
   |            |                           | normal.  The alarm MIB   |
   |            |                           | is state oriented in the |
   |            |                           | same way as the Alarm    |
   |            |                           | YANG module; it focuses  |
   |            |                           | on the "lasting          |
   |            |                           | condition", not the      |
   |            |                           | individual               |
   |            |                           | notifications.           |
   |            |                           |                          |
   | ISA        | Alarm: An audible and/or  | The ISA standard adds an |
   | [ISA182]   | visible means of          | important requirement to |
   |            | indicating to the         | the "deviation from      |
   |            | operator an equipment     | normal condition state": |
   |            | malfunction, process      | requiring a response.    |
   |            | deviation, or abnormal    |                          |
   |            | condition requiring a     |                          |
   |            | response.                 |                          |
   |            |                           |                          |
   | EEMUA      | An alarm is an event to   | This is the foundation   |
   | [EEMUA]    | which an operator must    | for the definition of    |
   |            | knowingly react, respond, | alarm in this document.  |
   |            | and acknowledge -- not    | It focuses on the core   |
   |            | simply acknowledge and    | criterion that an action |
   |            | ignore.                   | is really needed.        |
   |            |                           |                          |
   | 3GPP Alarm | 3GPP v15: An alarm        | The latest 3GPP Alarm    |
   | IRP        | signifies an undesired    | IRP version uses         |
   | [ALARMIRP] | condition of a resource   | literally the same alarm |
   |            | (e.g., device, link) for  | definition as this alarm |
   |            | which an operator action  | data model.  It is worth |
   |            | is required.  It          | noting that earlier      |
   |            | emphasizes a key          | versions used a          |
   |            | requirement that          | definition not requiring |
   |            | operators [...] should    | an operator action and   |
   |            | not be informed about an  | the more-broad           |
   |            | undesired condition       | the broader              |
   |            | unless it requires        | from normal condition.   |
   |            | operator action.          | The earlier version also |
   |            | 3GPP v12: alarm: abnormal | defined an alarm as a    |
   |            | network entity condition, | special case of "event". |
   |            | which categorizes an      |                          |
   |            | event as a fault.         |                          |
   |            | fault: a deviation of a   |                          |
   |            | system from normal        |                          |
   |            | operation, which may      |                          |
   |            | result in the loss of     |                          |
   |            | operational capabilities  |                          |
   |            | [...]                     |                          |
   +------------+---------------------------+--------------------------+

           Table 1: Definition of the Term "Alarm" in Standards

   The definition of alarm has evolved from a focus on events reporting
   a deviation from normal operation towards a definition based on an
   undesired *state* that *requires an operator action*.

F.2.  Data Model

   This section describes how this YANG alarm data model relates to
   other standard data models.  Note well that we cover other data
   models for alarm interfaces but not other standards such as SDO-
   specific alarms.

F.2.1.  X.733

   X.733 has acted as a base for several alarm data models over the
   years.  The YANG alarm data model differs in the following ways:

      X.733 models the alarm list as a list of notifications.  The YANG
      alarm data model defines the alarm list as the current alarm
      states for the resources, which is generated from the state change
      reporting notifications.

      In X.733, an alarm can have the severity level "clear".  In the
      YANG alarm data model, "clear" is not a severity level; it is a
      separate state of the alarm.  For example, an alarm can be in the
      state (major, cleared) or (minor, not cleared).

      X.733 uses a flat, globally defined enumerated "probable-cause" to
      identify alarm types.  This alarm data model uses a hierarchical
      YANG identity: "alarm-type-id".  This enables delegation of alarm
      types within organizations.  It also enables management to reason
      about abstract alarm types corresponding to base identities; see
      Section 3.2.

      The YANG alarm data model does not include the majority of the
      X.733 alarm attributes.  Rather, these are defined in an
      augmenting module, "ietf-alarms-x733", for use when "strict"
      X.733 [X.733] compliance is needed.

F.2.2.  The Alarm MIB (RFC 3877)

   The MIB in RFC 3877 takes a different approach; rather than defining
   a concrete data model for alarms, it defines a model to map existing
   SNMP-managed objects and notifications into alarm states and alarm
   notifications.  This was necessary since MIBs were already defined
   with both managed objects and notifications indicating alarms, for
   example, "linkUp" and "linkDown" notifications in combination with
   "ifAdminStatus" and "ifOperStatus".  So, RFC 3877 cannot really be
   compared to the alarm YANG module in that sense.

   The Alarm MIB maps existing MIB definitions into alarms by means of
   tables such as "alarmModelTable".  The upside is that an SNMP
   manager can read the possible alarm types at runtime.  This
   corresponds to the "alarm-inventory" in the alarm YANG module.

F.2.3.  3GPP Alarm IRP

   The 3GPP Alarm IRP is an evolution of X.733.  The main differences
   between the alarm YANG module and 3GPP are as follows:

      3GPP keeps the majority of the X.733 attributes, but the alarm
      YANG module does not.

      3GPP introduced overlapping and possibly conflicting keys for
      alarms: "alarmId" and the tuple (managed object, event type,
      probable cause, specific problem); see Example 3 in Annex C of
      [ALARMIRP].  In the YANG alarm data model, the key for
      identifying an alarm instance is clearly defined by ("resource",
      "alarm-type-id", "alarm-type-qualifier").  See also Section 3.4
      for more information.

      The alarm YANG module clearly separates the resource/
      instrumentation lifecycle from the operator lifecycle. 3GPP allows
      operators to set the alarm severity to clear; this is not allowed
      by this module.  Rather, an operator closes an alarm, which does
      not affect the severity.
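
   The key and the state-oriented view discussed above can be
   illustrated with a short Python sketch.  It keeps one entry per
   ("resource", "alarm-type-id", "alarm-type-qualifier") and treats
   "cleared" as a state change that leaves the severity untouched.  The
   names "AlarmKey" and "apply_notification" and the dictionary layout
   are hypothetical and chosen for illustration only.

```python
from typing import NamedTuple


class AlarmKey(NamedTuple):
    """The three leafs that identify one alarm instance."""
    resource: str
    alarm_type_id: str
    alarm_type_qualifier: str


def apply_notification(alarm_list, key, time, severity):
    """Fold one state-change report into the alarm list (a dict)."""
    alarm = alarm_list.setdefault(
        key,
        {"status_changes": [], "is_cleared": True, "perceived_severity": None},
    )
    alarm["status_changes"].append((time, severity))
    if severity == "cleared":
        # Clearing is a separate state; the last severity is kept.
        alarm["is_cleared"] = True
    else:
        alarm["is_cleared"] = False
        alarm["perceived_severity"] = severity
    return alarm
```

   Three reports for the same key therefore yield a single alarm with
   three status changes rather than three entries in the alarm list.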

F.2.4.  G.7710

   G.7710 is different from the previously referenced alarm standards.
   It does not define a data model for alarm reporting.  It defines
   common equipment management function requirements including alarm
   instrumentation.  The scope is transport networks.

   The requirements in G.7710 correspond to features in the alarm YANG
   module in the following way:

      Alarm Severity Assignment Profile (ASAP): the alarm profile
      "/alarms/alarm-profile/".

      Alarm Reporting Control (ARC): alarm shelving "/alarms/control/
      alarm-shelving/" and the ability to control alarm notifications
      "/alarms/control/notify-status-changes".  Alarm shelving
      corresponds to the use case of turning off alarm reporting for a
      specific resource, which is the NALM (No ALarM) state in M.3100.

Appendix G.  Alarm-Usability Requirements

   This section defines usability requirements for alarms.  Alarm
   usability is important for an alarm interface.  A data model will
   help in defining the format, but if the actual alarms are of low
   value, we have not achieved the goal of alarm management.

   Common alarm problems and their causes are summarized in Table 2.
   This summary is adapted to networking based on the ISA [ISA182] and
   Engineering Equipment Materials Users Association (EEMUA) [EEMUA]
   standards.

   +-----------------+--------------------------------+----------------+
   | Problem         | Cause                          | How this       |
   |                 |                                | module         |
   |                 |                                | addresses the  |
   |                 |                                | cause          |
   +-----------------+--------------------------------+----------------+
   | Alarms are      | "Nuisance" alarms (chattering  | Strict         |
   | generated, but  | alarms and fleeting alarms),   | definition of  |
   | they are        | faulty hardware, redundant     | alarms         |
   | ignored by the  | alarms, cascading alarms,      | requiring      |
   | operator.       | incorrect alarm settings, and  | corrective     |
   |                 | alarms that have not been      | response.  See |
   |                 | rationalized; the alarms       | alarm          |
   |                 | represent log information      | requirements   |
   |                 | rather than true alarms.       | in Table 3.    |
   |                 |                                |                |
   | When alarms     | Insufficient alarm-response    | The alarm      |
   | occur,          | procedures and not well-       | inventory      |
   | operators do    | defined alarm types.           | lists all      |
   | not know how to |                                | alarm types    |
   | respond.        |                                | and corrective |
   |                 |                                | actions.  See  |
   |                 |                                | alarm          |
   |                 |                                | requirements   |
   |                 |                                | in Table 3.    |
   |                 |                                |                |
   | The alarm       | Nuisance alarms, stale alarms, | The alarm      |
   | display is full | and alarms from equipment not  | definition and |
   | of alarms, even | in service.                    | alarm          |
   | when there is   |                                | shelving.      |
   | nothing wrong.  |                                |                |
   |                 |                                |                |
   | During a        | Incorrect prioritization of    | State-based    |
   | failure,        | alarms.  Not using advanced    | alarm model    |
   | operators are   | alarm techniques (e.g., state- | and alarm-rate |
   | flooded with so | based alarming).               | requirements;  |
   | many alarms     |                                | see Tables 4   |
   | that they do    |                                | and 5,         |
   | not know which  |                                | respectively.  |
   | ones are the    |                                |                |
   | most important. |                                |                |
   +-----------------+--------------------------------+----------------+

                    Table 2: Alarm Problems and Causes

   Based upon the above problems, EEMUA gives the following definition
   of a good alarm:

   +----------------+--------------------------------------------------+
   | Characteristic | Explanation                                      |
   +----------------+--------------------------------------------------+
   | Relevant       | Not spurious or of low operational value.        |
   |                |                                                  |
   | Unique         | Not duplicating another alarm.                   |
   |                |                                                  |
   | Timely         | Not long before any response is needed or too    |
   |                | late to do anything.                             |
   |                |                                                  |
   | Prioritized    | Indicating the importance that the operator      |
   |                | deals with the problem.                          |
   |                |                                                  |
   | Understandable | Having a message that is clear and easy to       |
   |                | understand.                                      |
   |                |                                                  |
   | Diagnostic     | Identifying the problem that has occurred.       |
   |                |                                                  |
   | Advisory       | Indicative of the action to be taken.            |
   |                |                                                  |
   | Focusing       | Drawing attention to the most important issues.  |
   +----------------+--------------------------------------------------+

                    Table 3: Definition of a Good Alarm

   Vendors SHOULD rationalize all alarms according to the table above.
   Another crucial requirement is acceptable alarm notification rates.
   Vendors SHOULD make sure that they do not exceed the recommendations
   from EEMUA below:

   +-----------------------------------+-------------------------------+
   | Long-Term Alarm Rate in Steady    | Acceptability                 |
   | Operation                         |                               |
   +-----------------------------------+-------------------------------+
   | More than one per minute          | Very likely to be             |
   |                                   | unacceptable.                 |
   |                                   |                               |
   | One per 2 minutes                 | Likely to be overdemanding.   |
   |                                   |                               |
   | One per 5 minutes                 | Manageable.                   |
   |                                   |                               |
   | Less than one per 10 minutes      | Very likely to be acceptable. |
   +-----------------------------------+-------------------------------+

              Table 4: Acceptable Alarm Rates -- Steady State

   +----------------------------+--------------------------------------+
   | Number of alarms displayed | Acceptability                        |
   | in 10 minutes following a  |                                      |
   | major network problem      |                                      |
   +----------------------------+--------------------------------------+
   | More than 100              | Definitely excessive and very likely |
   |                            | to lead to the operator abandoning   |
   |                            | the use of the alarm system.         |
   |                            |                                      |
   | 20-100                     | Hard to cope with.                   |
   |                            |                                      |
   | Under 10                   | Should be manageable, but it may be  |
   |                            | difficult if several of the alarms   |
   |                            | require a complex operator response. |
   +----------------------------+--------------------------------------+

                 Table 5: Acceptable Alarm Rates -- Burst

   The numbers in Tables 4 and 5 are the sum of all alarms for a network
   being managed from one alarm console.  So every individual system or
   Network Management System (NMS) contributes to these numbers.

   Vendors SHOULD make sure that the following rules are used in
   designing the alarm interface:

   1.  Rationalize the alarms in the system to ensure that every alarm
       is necessary, has a purpose, and follows the cardinal rule that
       it requires an operator response.  Adhere to the rules of
       Table 3.

   2.  Audit the quality of the alarms.  Talk with the operators about
       how well the alarm information supports them.  Do they know what
       to do in the event of an alarm?  Are they able to quickly
       diagnose the problem and determine the corrective action?  Does
       the alarm text adhere to the requirements in Table 3?

   3.  Analyze and benchmark the performance of the system and compare
       it to the recommended metrics in Tables 4 and 5.  Start by
       identifying nuisance alarms, as well as standing alarms at normal
       state and startup.
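
   The benchmarking in rule 3 can be partly automated.  The Python
   sketch below compares measured alarm counts with Tables 4 and 5.
   The tables list only a few rates, so the band boundaries used here
   are an assumption of this sketch, and values that fall between the
   listed rows (for example, 10 to 19 alarms in a burst) are reported
   as None rather than guessed.  The function names are hypothetical.

```python
def classify_steady_state(alarm_count, minutes):
    """Classify a long-term average alarm rate against Table 4."""
    if alarm_count == 0:
        return "Very likely to be acceptable."
    minutes_per_alarm = minutes / alarm_count
    if minutes_per_alarm < 1:        # more than one alarm per minute
        return "Very likely to be unacceptable."
    if minutes_per_alarm <= 2:       # up to one per 2 minutes (assumed band)
        return "Likely to be overdemanding."
    if minutes_per_alarm <= 5:       # up to one per 5 minutes (assumed band)
        return "Manageable."
    if minutes_per_alarm > 10:       # less than one per 10 minutes
        return "Very likely to be acceptable."
    return None                      # between rows; Table 4 gives no verdict


def classify_burst(alarms_in_10_minutes):
    """Classify the alarm count after a major problem against Table 5."""
    if alarms_in_10_minutes > 100:
        return "Definitely excessive."
    if alarms_in_10_minutes >= 20:
        return "Hard to cope with."
    if alarms_in_10_minutes < 10:
        return "Should be manageable."
    return None                      # 10 to 19 is not covered by Table 5
```

   As noted above, the counts passed to these functions need to be the
   sum of all alarms that reach one alarm console, not the count from a
   single system.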

Acknowledgements

   The authors wish to thank Viktor Leijon and Johan Nordlander for
   their valuable input on forming the alarm model.

   The authors also wish to thank Nick Hancock, Joey Boyd, Tom Petch,
   and Balazs Lengyel for their extensive reviews and contributions to
   this document.

Authors' Addresses

   Stefan Vallin
   Stefan Vallin AB

   Email: stefan@wallan.se


   Martin Bjorklund
   Cisco

   Email: mbj@tail-f.com