The present document investigates potential use cases, requirements and solutions for the fault supervision evolution, its relation with performance management and fault supervision, relation and interaction with MDAS and COSLA, and potential enhancements for MDA assisted fault management. The present document provides conclusions and recommendations on the normative work.
The following documents contain provisions which, through reference in this text, constitute provisions of the present document.
- References are either specific (identified by date of publication, edition number, version number, etc.) or non-specific.
- For a specific reference, subsequent revisions do not apply.
- For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same Release as the present document.
[1] TR 21.905: "Vocabulary for 3GPP Specifications".
[2] ITU-T Recommendation X.731 (1992) | ISO/IEC 10164-2 (1992): "Information technology - Open Systems Interconnection - Systems Management - State management function".
[3] TS 28.625: "State Management Data Definitions".
[4] ITU-T Recommendation X.733 (1992) | ISO/IEC 10164-4 (1992): "Information technology - Open Systems Interconnection - Systems Management - Alarm reporting function".
[5]
[6] ITU-T Recommendation X.739 (1993): "Information technology - Open Systems Interconnection - Systems Management - Metric Objects and attributes".
[7] ITU-T Recommendation E.880 (1993): "Telephone network and ISDN Quality of service, network management and traffic engineering. Field data collection and evaluation on the performance of equipment, networks and services".
[8]
[9] TS 28.554: "5G end to end Key Performance Indicators (KPI)".
[10]
[11] TS 28.104: "Management and orchestration; Management Data Analytics (MDA)".
[12] TS 28.535: "Management and orchestration; Management services for communication service assurance; Requirements".
[13] TS 28.536: "Management and orchestration; Management services for communication service assurance; Stage 2 and stage 3".
For several decades, the telecommunication management network has offered a multitude of possibilities to inform about specific states of the system [2] and [3], about errors and faults by using alarms [4] and [5], and about performance related indications like counters, KPI gauges, aggregations, statistics, and thresholds, e.g. [6] - [9].
The first paragraph on the model of alarm reporting in ITU-T Recommendation X.733 [4], clause 7 already describes the importance of using thresholds and detecting trends in order to provide warnings to the managers. This means that managed systems are encouraged to detect abnormal conditions as early as possible and to inform the management system about the situation by standardized means. Any new proposal has to consider existing solutions in order to avoid diverging, non-interoperable frameworks.
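The idea of combining a threshold with trend detection to warn before a limit is reached can be illustrated with a minimal sketch. It is not taken from X.733 or any 3GPP specification: the names `EarlyWarning`, `linear_slope` and `check_gauge`, the least-squares trend estimate, and the `horizon` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class EarlyWarning:
    """Hypothetical warning record; not a standardized notification."""
    kind: str      # "threshold_crossed" or "trend"
    value: float   # latest observed gauge value


def linear_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples against their index (needs >= 2 samples)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def check_gauge(samples: Sequence[float], threshold: float,
                horizon: int) -> Optional[EarlyWarning]:
    """Warn if the gauge crossed the threshold, or is projected to cross it
    within `horizon` further sampling intervals (assumed linear trend)."""
    if not samples:
        return None
    latest = samples[-1]
    if latest >= threshold:
        return EarlyWarning("threshold_crossed", latest)
    if len(samples) >= 2:
        slope = linear_slope(samples)
        if slope > 0 and latest + slope * horizon >= threshold:
            return EarlyWarning("trend", latest)
    return None


# A gauge rising by 10 per interval is still below 90, but is projected to reach it.
print(check_gauge([50, 60, 70, 80], threshold=90, horizon=2))
```

A real managed system would likely add hysteresis and a more robust trend model; the sketch only shows that a warning can be raised before the threshold itself is crossed.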
ITU-T Recommendation X.733 [4], clause 7 also highlights the importance of correlating multiple events. While correlation is an internal function of management systems, the interfaces support it through specific fields that associate multiple events with each other. This is also true for the corresponding 3GPP specifications, which to a large extent are based on the ITU-T specifications. Correlation in existing specifications mainly covers alarm notifications, although other types of data, e.g. normal performance measurements, KPIs and historical data, could also be considered for a more comprehensive analysis.
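How interface fields can associate events with each other, while the correlation logic itself stays internal, can be sketched as follows. The record layout is hypothetical: `Notification`, `notification_id` and `correlated_ids` are illustrative names loosely inspired by the idea of a correlation field, not the standardized attribute definitions, and the grouping by connected components is just one possible consumer-side treatment.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Notification:
    """Hypothetical notification carrying references to related notifications."""
    notification_id: int
    source: str
    correlated_ids: List[int] = field(default_factory=list)


def group_correlated(notifications: List[Notification]) -> List[List[int]]:
    """Group notification ids that are linked, directly or transitively,
    through their correlation references (union-find)."""
    parent: Dict[int, int] = {n.notification_id: n.notification_id
                              for n in notifications}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for n in notifications:
        for ref in n.correlated_ids:
            if ref in parent:  # ignore references to notifications not received
                parent[find(n.notification_id)] = find(ref)

    groups: Dict[int, List[int]] = {}
    for n in notifications:
        groups.setdefault(find(n.notification_id), []).append(n.notification_id)
    return sorted(sorted(g) for g in groups.values())


received = [
    Notification(1, "backhaul-link"),
    Notification(2, "gNB", correlated_ids=[1]),
    Notification(3, "cell-1", correlated_ids=[2]),
    Notification(4, "unrelated-node"),
]
print(group_correlated(received))
```

The sketch only consumes references that the sender already filled in; deciding which events belong together in the first place remains the internal correlation function mentioned above.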
The combination of alarm reporting and state management would be able to reduce the number of alarm messages very efficiently if certain best practices are followed: alarms are used to indicate that a resource requires maintenance, and states are used to inform about the well-being of a resource.
For example, if a backhaul link towards a gNB has a problem, many logical and physical interfaces of the gNB, many protocol layers, and all cells will experience abnormal conditions. If all these resources raise alarms, the management system will be flooded with alarms, although none of these alarms requires any maintenance, since the problem is caused by the link while the base station as such has no problem at all.
If in such a situation the resources followed the rule to issue alarms only when they require maintenance, the base station would not send any alarm, while all affected resources would set their operational state to "disabled" and their availability state to "dependency". In this case the human operator would be aware that the base station does not work as expected, and would also be aware that the base station as such does not require any maintenance. However, although the mechanisms described above were standardized by ITU-T in 1992, they are not applied in current systems. Reducing the number of alarms in the network elements by simple filtering carries the risk of missing important information that might be needed by other management functions. Therefore, the network elements need to perform thorough correlation of notifications and state information in order to suppress only redundant information, and not information that is needed by higher-level management tools.
It is an unfortunate fact that management systems as well as human operators have always been flooded with alarms, although a combination of alarm reporting and state management would offer a technical means to reduce the number of alarms. Determining whether an abnormal behavior is caused by an entity itself or by another entity (or subsystem) requires sophisticated correlation functions that are reliable enough to avoid erroneous correlations resulting in false statements about the root cause. Implementing such functions requires high effort because it requires knowledge of all dependencies.
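The combination of alarm reporting and state management in the backhaul link example can be illustrated with a minimal sketch. It is not an implementation of any standardized interface: the `Resource` class, the `has_own_fault` flag, the `evaluate_all` helper and the alarm text are assumptions, and the explicit `depends_on` list stands in for the dependency knowledge that the text identifies as the costly part. The state values "disabled" and "dependency" are the ones named in the example; "failed" is used here for a resource with its own fault.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Resource:
    """Hypothetical managed resource with state attributes."""
    name: str
    depends_on: List["Resource"] = field(default_factory=list)
    has_own_fault: bool = False            # assumed result of local diagnosis
    operational_state: str = "enabled"
    availability_status: Set[str] = field(default_factory=set)

    def evaluate(self) -> Optional[str]:
        """Update the states; return an alarm only if this resource itself
        requires maintenance."""
        if self.has_own_fault:
            self.operational_state = "disabled"
            self.availability_status = {"failed"}
            return f"{self.name}: requires maintenance"
        if any(d.operational_state == "disabled" for d in self.depends_on):
            # Not working, but nothing to repair here: report via state only.
            self.operational_state = "disabled"
            self.availability_status = {"dependency"}
            return None
        self.operational_state = "enabled"
        self.availability_status = set()
        return None


def evaluate_all(resources: List[Resource]) -> List[str]:
    """Evaluate resources in the given order (dependencies first) and
    collect the alarms that are actually raised."""
    alarms = []
    for r in resources:
        alarm = r.evaluate()
        if alarm is not None:
            alarms.append(alarm)
    return alarms


link = Resource("backhaul-link", has_own_fault=True)
gnb = Resource("gNB", depends_on=[link])
cells = [Resource(f"cell-{i}", depends_on=[gnb]) for i in range(3)]
print(evaluate_all([link, gnb, *cells]))
print(gnb.operational_state, gnb.availability_status)
```

Under these assumptions only the link raises an alarm, while the gNB and its cells report their condition through state changes. The sketch presumes that every dependency is known and that local fault diagnosis is reliable, which is exactly where the implementation effort lies.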
An additional problem is that clause 11.2 of TS 28.532, which defines the Fault Supervision MnS, does not provide the definitions and descriptions required to understand the current state of the art in alarm management. This is because much of the material specified and available for the IRP Framework was not moved to SBMA.
For that reason, this study investigates which definitions and descriptions need to be added to clause 11.2 of TS 28.532 to make this clause understandable without the need to consult other specifications. Besides descriptions for alarm management, the role and importance of state management are highlighted as well.
It is also in the scope of this study to look at possibilities to clarify in TS 28.532 that the internal behavior of functions is not subject to standardization. For example, the algorithm used to accomplish alarm correlation is outside the scope of standards. This implies that the question of whether AI/ML is used for correlation is also outside the scope of standards. It is a vendor decision to use AI/ML or not.
The scope of this study includes potential enhancements to MDA assisted fault management. MDA capabilities may be used or enhanced for fault related analysis. The MDA capability "failure prediction" supports predicting the running trend of the network and potential failures, so that intervention is possible in advance. More alarm/fault related analysis scenarios and capabilities will be studied. Existing alarm data is needed as one of the data sources for the analysis.
TS 28.104 describes the role of MDA in the management loop in clause 6. The MDA type attribute specified in TS 28.104 serves as the indicator in analytics outputs for a particular management capability. Clause 7.2.3 of TS 28.104 includes the use case(s) and requirements of MDA assisted fault management. Some MDA capability enhancements to provide more analytics information related to fault management are described in clause 5.2 of the present document.
Currently, the closed control loop specified in TS 28.535 and TS 28.536 only concerns communication service assurance scenarios (including network slice and network slice subnet). It is to be studied whether the scope of the closed control loop can be extended to cover the fault management case.
The present document investigated missing definitions, potential enhancements of fault supervision, and its interaction with MDA. The work is proposed in clause 5.2.3.