TR 28.830
Study on Fault Supervision evolution

V18.1.0, 2024/06, 12 p.

Rapporteur: Mr. WANG, CHEN
China Mobile Research Inst.

1 Scope

The present document investigates potential use cases, requirements and solutions for fault supervision evolution, the relation between performance management and fault supervision, the relation and interaction with MDAS and COSLA, and potential enhancements for MDA assisted fault management. The present document also provides conclusions and recommendations for normative work.

2 References

The following documents contain provisions which, through reference in this text, constitute provisions of the present document.

References are either specific (identified by date of publication, edition number, version number, etc.) or non-specific.
For a specific reference, subsequent revisions do not apply.
For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same Release as the present document.

[1] TR 21.905: "Vocabulary for 3GPP Specifications".
[2] ITU-T Recommendation X.731 (1992) | ISO/IEC 10164-2:1992: "Information technology - Open Systems Interconnection - Systems Management - State management function".
[3] TS 28.625: "State Management Data Definitions".
[4] ITU-T Recommendation X.733 (1992) | ISO/IEC 10164-4:1992: "Information technology - Open Systems Interconnection - Systems Management - Alarm reporting function".
[5] TS 28.532: "Generic management services".
[6] ITU-T Recommendation X.739 (1993): "Information technology - Open Systems Interconnection - Systems Management - Metric Objects and attributes".
[7] ITU-T Recommendation E.880 (1993): "Telephone network and ISDN Quality of service, network management and traffic engineering. Field data collection and evaluation on the performance of equipment, networks and services".
[8] TS 28.552: "5G performance measurements".
[9] TS 28.554: "5G end to end Key Performance Indicators (KPI)".
[10] TS 28.111: "Fault management".
[11] TS 28.104: "Management and orchestration; Management Data Analytics (MDA)".
[12] TS 28.535: "Management and orchestration; Management services for communication service assurance; Requirements".
[13] TS 28.536: "Management and orchestration; Management services for communication service assurance; Stage 2 and stage 3".

3 Definitions of terms, symbols and abbreviations

3.1 Terms

For the purposes of the present document, the terms given in TR 21.905 and the following apply. A term defined in the present document takes precedence over the definition of the same term, if any, in TR 21.905.

3.2 Symbols

None.

3.3 Abbreviations

For the purposes of the present document, the abbreviations given in TR 21.905 apply. An abbreviation defined in the present document takes precedence over the definition of the same abbreviation, if any, in TR 21.905.

4 Background

For several decades, the telecommunication management network has offered a multitude of ways to report specific states of the system [2], [3], errors and faults by means of alarms [4], [5], and performance-related indications such as counters, KPI gauges, aggregations, statistics and thresholds, e.g. [6] - [9].

The very first paragraph on the model of alarm reporting in [4], clause 7, already describes the importance of using thresholds and detecting trends in order to provide warnings to the managers. This means that managed systems are encouraged to detect abnormal conditions as early as possible and to inform the management system about the situation by standardized means. Any new proposal has to take existing solutions into account in order to avoid diverging, non-interoperable frameworks.

ITU-T Recommendation X.733 [4], clause 7, also highlights the importance of correlating multiple events. While correlation is an internal function of management systems, the interfaces support it through specific fields that associate multiple events with each other. This is also true for the corresponding 3GPP specifications, which are to a large extent based on the ITU-T specifications. Correlation in existing specifications mainly covers alarm notifications, although other types of data, e.g. normal performance measurements, KPIs and historical data, could also be considered for a more comprehensive analysis.

The combination of alarm reporting and state management can reduce the number of alarm messages very efficiently if certain best practices are followed: alarms are used to indicate that a resource requires maintenance, and states are used to inform about the well-being of a resource.

For example, if a backhaul link towards a gNB has a problem, many logical and physical interfaces of the gNB, many protocol layers, and all cells will experience abnormal conditions. If all these resources raise alarms, the management system will be flooded with alarms, although none of these alarms requires any maintenance: the problem is caused by the link, while the base station as such has no problem at all.

If, in such a situation, the resources followed the rule to issue alarms only when they require maintenance, the base station would not send any alarm, while all affected resources would set their operational state to "disabled" and their availability status to "dependency". In this case the human operator would know that the base station does not work as expected and would also know that the base station as such does not require any maintenance. However, although the mechanisms described above were standardized by ITU-T in 1992, they are not applied in current systems. Reducing the number of alarms in the network elements by simple filtering carries the risk of losing important information that might be needed by other management functions. The network elements therefore have to perform thorough correlation of notifications and state information, so that only redundant information is suppressed and information needed by higher-level management tools is retained.
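As an informal illustration of this rule, the following Python sketch lets only the resource that needs maintenance raise an alarm, while dependent resources change state. The names `Resource` and `propagate_link_failure` and the resource identifiers are hypothetical and not taken from any specification; the value "failed" is assumed as the X.731 availability status of the faulty resource, and the dependencies between resources are assumed to be known.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical managed resource with X.731-style state attributes."""
    name: str
    operational_state: str = "enabled"
    availability_status: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)  # names of required resources


def propagate_link_failure(resources, failed_name):
    """Return the names of the resources that raise an alarm.

    Only the failed resource requires maintenance, so only it raises an alarm.
    Resources depending on it, directly or transitively, change state instead.
    """
    by_name = {r.name: r for r in resources}
    by_name[failed_name].operational_state = "disabled"
    by_name[failed_name].availability_status = ["failed"]
    alarms = [failed_name]

    changed = True
    while changed:  # repeat until no further resource is affected
        changed = False
        for r in resources:
            if r.operational_state == "enabled" and any(
                by_name[d].operational_state == "disabled" for d in r.depends_on
            ):
                r.operational_state = "disabled"
                r.availability_status = ["dependency"]
                changed = True
    return alarms


resources = [
    Resource("backhaul-link"),
    Resource("gnb-ng-interface", depends_on=["backhaul-link"]),
    Resource("cell-1", depends_on=["gnb-ng-interface"]),
    Resource("cell-2", depends_on=["gnb-ng-interface"]),
]
alarms = propagate_link_failure(resources, "backhaul-link")
```

In this example a single alarm is raised for the backhaul link, and the interface and both cells report "disabled" with availability status "dependency" without raising alarms of their own.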

It is an unfortunate fact that management systems and human operators have always been flooded with alarms, although a combination of alarm reporting and state management would offer a technical means to reduce the number of alarms. Determining whether abnormal behavior is caused by an entity itself or by another entity (or subsystem) requires sophisticated correlation functions that are reliable enough to avoid erroneous correlations and thus false statements about the root cause. Implementing such functions requires high effort because it requires knowledge of all dependencies.
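The dependence on complete dependency knowledge can be sketched as follows. This is a deliberately simple rule chosen for illustration, not a standardized algorithm: an abnormal entity is treated as a root-cause candidate only if none of the entities it depends on is abnormal. The function name `find_root_causes` and the entity names are hypothetical.

```python
def find_root_causes(abnormal, depends_on):
    """Return abnormal entities none of whose dependencies are abnormal.

    abnormal:   set of entity names currently showing abnormal behavior
    depends_on: dict mapping an entity name to the names it depends on
    """
    return sorted(
        entity
        for entity in abnormal
        if not any(dep in abnormal for dep in depends_on.get(entity, ()))
    )


abnormal = {"cell-1", "cell-2", "gnb-1", "backhaul-link"}

# With complete dependency knowledge only the link is reported.
complete = {
    "cell-1": ["gnb-1"],
    "cell-2": ["gnb-1"],
    "gnb-1": ["backhaul-link"],
}
root_causes = find_root_causes(abnormal, complete)

# If the gNB-to-link dependency is unknown, the gNB is wrongly reported too.
incomplete = {"cell-1": ["gnb-1"], "cell-2": ["gnb-1"]}
misleading = find_root_causes(abnormal, incomplete)
```

The second call shows how one missing dependency already leads to a false statement about the root cause.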

An additional problem is that clause 11.2 of TS 28.532, which defines the Fault Supervision MnS, does not provide the definitions and descriptions needed to understand the current state of the art in alarm management. This is because much of the material specified and available for the IRP Framework was not moved to SBMA.

For that reason, this study investigates which definitions and descriptions need to be added to clause 11.2 of TS 28.532 to make this clause understandable without the need to consult other specifications. Besides descriptions for alarm management, the role and importance of state management are highlighted as well.

It is also in the scope of this study to look at possibilities to clarify in TS 28.532 that the internal behavior of functions is not subject to standardization. For example, the algorithm used to accomplish alarm correlation is outside the scope of standards. This implies that the question of whether AI/ML is used for correlation is also outside the scope of standards. It is a vendor decision to use AI/ML or not.

The scope of this study includes potential enhancements to MDA assisted fault management. MDA capabilities may be used or enhanced for fault related analysis. The MDA capability "failure prediction" supports predicting the running trend of the network and potential failures, so that intervention is possible in advance. More alarm/fault related analysis scenarios and capabilities will be studied. Existing alarm data is needed as one of the data sources for the analysis.

5 Issues and potential solutions

5.1 Issue 1: Missing definitions

5.1.1 Description

Clause 11.2 of TS 28.532, which defines the Fault Supervision MnS, does not provide the definitions and descriptions needed to understand the current state of the art in alarm management. This is because much of the material specified and available for the IRP Framework was not moved to SBMA.

5.1.2 Potential solutions

It is proposed to add the following definitions to an appropriate TS:

Event: Anything that occurs at a certain point in time, for example a configuration change, a threshold crossing, a transition to an error state or a transition to a failure state. Events do not have states.

Error: A state of the system different from the correct system state as defined by the service specification. An error may or may not lead to a service failure. An error has a begin and end time.

Failure: A state of inability to deliver the correct service as defined by the service specification. A service failure may be the result of an error or a poor service function design.

Fault: The (hypothesized or adjudged) cause for an error or a failure.

Alarm: An error or failure that requires attention or reaction by an operator or some machine. Alarms have state.

Root cause: The primary fault (cause), if any, leading to one or multiple errors or failures.
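The distinction between stateless events and stateful errors and alarms can be illustrated with a small data model. This is only a sketch under stated assumptions: the class layout, the attribute names and the `acknowledged` flag are hypothetical and are not defined in TS 28.532 or TS 28.111.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Something that occurs at a point in time; it carries no state."""
    occurred_at: datetime
    description: str


@dataclass
class Error:
    """A deviation from the correct system state, with a begin and end time."""
    began_at: datetime
    ended_at: Optional[datetime] = None  # None while the error persists

    def is_active(self) -> bool:
        return self.ended_at is None


@dataclass
class Alarm:
    """An error that requires attention; unlike an event it has state."""
    error: Error
    hypothesized_fault: str      # the fault is the adjudged cause of the error
    acknowledged: bool = False   # hypothetical example of alarm state

    def is_cleared(self) -> bool:
        return not self.error.is_active()


start = datetime(2024, 6, 1, 12, 0)
raised = Event(start, "transition to an error state")
error = Error(began_at=start)
alarm = Alarm(error, hypothesized_fault="backhaul link down")
cleared_before = alarm.is_cleared()

error.ended_at = datetime(2024, 6, 1, 12, 30)  # the error ends
cleared_after = alarm.is_cleared()
```

The event object is immutable because it only records that something happened, whereas the alarm changes from active to cleared when the underlying error ends.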

5.1.3 Conclusion - Impact on normative work

The proposal in clause 5.1.2 has been considered in TS 28.111 v1.0.0, therefore no further normative work is needed.

5.2 Issue 2: Potential enhancements for fault related analysis

5.2.1 Description

If a potential fault/failure is predicted and reported to the consumer, the consumer will want to know its consequences. Additional analysis information on the 3GPP system, e.g. performance degradation analysis and predictions or KPI anomaly analysis and predictions, may be provided so that the consumer can take more appropriate actions.

For example, when a few sites encounter faults in a densely populated urban area, the impact on the 3GPP system may not be significant if there is overlapping coverage. In a suburban area with little overlapping coverage, however, a fault in a single site may cause a service outage. If this kind of information is provided, the consumer can handle the two cases differently.
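The urban and suburban cases can be sketched with a simple rule. The function `coverage_impact`, the site names and the labels "limited" and "outage" are hypothetical; a real analysis would also consider capacity, load and radio conditions.

```python
def coverage_impact(failed_site, overlapping_sites, faulty_sites):
    """Classify a site fault by the overlapping coverage that remains.

    overlapping_sites: dict mapping a site to the sites whose coverage overlaps it
    faulty_sites:      set of sites currently faulty, including failed_site
    """
    healthy_neighbours = [
        site
        for site in overlapping_sites.get(failed_site, [])
        if site not in faulty_sites
    ]
    return "limited" if healthy_neighbours else "outage"


overlapping_sites = {
    "urban-A": ["urban-B", "urban-C"],
    "suburban-X": [],
}
urban = coverage_impact("urban-A", overlapping_sites, {"urban-A"})
suburban = coverage_impact("suburban-X", overlapping_sites, {"suburban-X"})
```

Under these assumptions the urban fault is classified as "limited" and the suburban fault as "outage", which is the kind of distinction the consumer could use to choose different handling.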

5.2.2 Potential solutions

5.2.2.1 Potential solution 1: Failure prediction enhancement

5.2.2.1.1 Introduction

Some potential enhancements to failure prediction are provided. The concrete potentialFailureType is defined as a standardized value of the alarmType in TS 28.532.

5.2.2.1.2 Description

Failures of services and network functions may occur during network operation. It is necessary to predict potential failures and prevent more severe impacts. The MDA capability of failure prediction is specified in TS 28.104; its analytics output is as follows:

failurePredictionObject;
potentialFailureType;
eventTime;
issueID;
perceivedSeverity.

The potentialFailureType in the failure prediction analytics output needs to be defined more concretely. The potentialFailureType may refer to the standardized AlarmType. The MDA service may coordinate with fault supervision for the identification and analysis of alarm types. The attribute is also renamed to predictedFailureType.

Table 5.2.2.1.2-1: Analytics output for failure prediction

Attribute Name	Description
predictedFailureType	Indication of type of issues that can cause the failures. (see note)
NOTE: The values can be defined as a list of values of the alarmType described in TS 28.532.
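The effect of restricting predictedFailureType to alarm type values can be sketched as follows. The enumeration lists only the five alarm types of ITU-T X.733 as an illustrative subset, and the string spellings are an assumption; the authoritative list is the alarmType definition in TS 28.532. The class `FailurePrediction`, the function `parse_prediction` and the sample values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmType(Enum):
    """Illustrative subset: the five alarm types of ITU-T X.733."""
    COMMUNICATIONS_ALARM = "COMMUNICATIONS_ALARM"
    QUALITY_OF_SERVICE_ALARM = "QUALITY_OF_SERVICE_ALARM"
    PROCESSING_ERROR_ALARM = "PROCESSING_ERROR_ALARM"
    EQUIPMENT_ALARM = "EQUIPMENT_ALARM"
    ENVIRONMENTAL_ALARM = "ENVIRONMENTAL_ALARM"


@dataclass
class FailurePrediction:
    failure_prediction_object: str     # failurePredictionObject
    predicted_failure_type: AlarmType  # predictedFailureType
    event_time: str                    # eventTime
    issue_id: str                      # issueID
    perceived_severity: str            # perceivedSeverity


def parse_prediction(raw: dict) -> FailurePrediction:
    """Build a FailurePrediction; unknown failure types raise ValueError."""
    return FailurePrediction(
        failure_prediction_object=raw["failurePredictionObject"],
        predicted_failure_type=AlarmType(raw["predictedFailureType"]),
        event_time=raw["eventTime"],
        issue_id=raw["issueID"],
        perceived_severity=raw["perceivedSeverity"],
    )


sample = {
    "failurePredictionObject": "ManagedElement=gNB-1",  # hypothetical object name
    "predictedFailureType": "EQUIPMENT_ALARM",
    "eventTime": "2024-06-01T12:00:00Z",
    "issueID": "issue-001",
    "perceivedSeverity": "MAJOR",
}
prediction = parse_prediction(sample)
```

A free-text value such as "something broke" would be rejected with a ValueError, which is the practical benefit of tying the attribute to a standardized value list.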

5.2.2.2 Potential solution 2: Fault impact analysis

5.2.2.2.1 Introduction

The MDA type of failure prediction may need to provide fault impact analysis information.

5.2.2.2.2 Description

The scope and degree of fault impacts are evaluated and provided in the MDA type of failure prediction in TS 28.104. The output may contain the following aspects:

Scope and service types which are impacted by the fault.
A list of managed objects which are impacted by the fault.

Table 5.2.2.2.2-1: MDA analytics output of fault impacts in failure prediction

Attribute Name	Description
affectScope	Coverage areas which are affected by the fault, e.g. a list of cells, a list of tracking areas (TAs), etc.
affectService	Service types which are affected by the fault, e.g. the VoNR, URLLC service types.
affectNumOfPDUSessions	Number of PDU sessions which are affected by the fault.
affectManagedObjects	The object instances which are impacted by the fault, e.g. network slice, network slice subnet, network elements, network functions, a list of gNBs, etc.
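A consumer could use these attributes to decide which predicted fault to handle first. The sketch below is one possible reading of the table: the class `FaultImpact`, the helper `prioritise` and all sample values are hypothetical, and ranking by the number of affected PDU sessions is only an example policy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FaultImpact:
    affect_scope: List[str] = field(default_factory=list)            # affectScope
    affect_service: List[str] = field(default_factory=list)          # affectService
    affect_num_of_pdu_sessions: int = 0                              # affectNumOfPDUSessions
    affect_managed_objects: List[str] = field(default_factory=list)  # affectManagedObjects

    def to_output(self) -> dict:
        """Render the impact using the attribute names of the table."""
        return {
            "affectScope": self.affect_scope,
            "affectService": self.affect_service,
            "affectNumOfPDUSessions": self.affect_num_of_pdu_sessions,
            "affectManagedObjects": self.affect_managed_objects,
        }


def prioritise(impacts):
    """Order impacts so that the one affecting most PDU sessions comes first."""
    return sorted(impacts, key=lambda i: i.affect_num_of_pdu_sessions, reverse=True)


urban = FaultImpact(["cell-11"], ["VoNR"], 40, ["gNB-1"])
suburban = FaultImpact(["cell-21", "cell-22"], ["VoNR", "URLLC"], 900, ["gNB-7"])
ordered = prioritise([urban, suburban])
report = ordered[0].to_output()
```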

5.2.2.3 Potential solution 3: Fault cause analysis enhancement

5.2.2.3.1 Introduction

The MDA may provide a capability for probableCause analysis.

5.2.2.3.2 Description

In TS 28.532, the alarm notification can provide the probableCause. The attribute probableCause uses the definitions in ITU-T Recommendation X.733 [4]. The MDA analytics report may include the probableCause, so that the fault supervision management service producer can use the probableCause from the MDA analytics output in the next alarm notification of the same alarm type. The values can refer to the Probable Causes list in Annex A (normative) of TS 28.111. The attribute probableCause may be added to the existing MDA type MDAAssistedFaultManagement.FailurePrediction.

Table 5.2.2.3.2-1: MDA analytics output of probableCause in failure prediction

Attribute Name	Description
probableCause	Provide the analysis report of the probableCause as listed in Annex A (normative) in TS 28.111.
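The reuse of an MDA-provided probableCause for the next alarm notification of the same alarm type could look like the following sketch. The class `ProbableCauseCache`, its methods, the simplified notification dictionary and the sample values are hypothetical; only the attribute names alarmType and probableCause come from the text, and the fallback value is a placeholder.

```python
class ProbableCauseCache:
    """Remembers the probableCause last reported by MDA for each alarm type."""

    def __init__(self, default_cause="indeterminate"):
        self._default_cause = default_cause  # placeholder when MDA gave no cause
        self._cause_by_alarm_type = {}

    def record_mda_output(self, alarm_type, probable_cause):
        """Store the probableCause carried in an MDA analytics report."""
        self._cause_by_alarm_type[alarm_type] = probable_cause

    def next_alarm_notification(self, alarm_type, object_instance):
        """Build a simplified alarm notification for the given alarm type."""
        return {
            "alarmType": alarm_type,
            "objectInstance": object_instance,
            "probableCause": self._cause_by_alarm_type.get(
                alarm_type, self._default_cause
            ),
        }


cache = ProbableCauseCache()
before = cache.next_alarm_notification("EQUIPMENT_ALARM", "gNB-1")
cache.record_mda_output("EQUIPMENT_ALARM", "power problem")
after = cache.next_alarm_notification("EQUIPMENT_ALARM", "gNB-1")
other = cache.next_alarm_notification("COMMUNICATIONS_ALARM", "gNB-1")
```

Only notifications of the alarm type covered by the MDA report pick up the analysed cause; other alarm types keep the fallback value.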

5.2.3 Conclusion - Impact on normative work

It is proposed to study the issue further in the Study on Management Data Analytics (MDA) - Phase 3, objective Fault management related analytics and alarm prediction.

6 Fault supervision evolution relation and interaction with MDA and closed loop control

Clause 6 of TS 28.104 describes the role of MDA in the management loop. The attribute of the MDA type specified in TS 28.104 provides the indicator in analytics outputs for a particular management capability. Clause 7.2.3 of TS 28.104 includes the use case(s) and requirements of MDA assisted fault management. Some MDA capability enhancements that provide more analytics information related to fault management are described in clause 5.2 of the present document.

Currently, the closed control loop specified in TS 28.535 and TS 28.536 only concerns communication service assurance scenarios (including network slice and network slice subnet). It is to be studied if the scope of closed control loop can be extended to cover the fault management case.

7 Conclusions and recommendations

The present document investigated missing definitions, potential enhancements of fault supervision, and its interaction with MDA. The proposed further work is described in clause 5.2.3.