A gNB-CU-CP SW failure is a failure caused by software in the gNB-CU-CP's equipment. It is mainly caused by information inconsistencies in the software, which in most cases can be resolved by restarting the system (or, in the case of a virtualized implementation, by respawning the relevant software parts).
Failure impact
The impact of the failure and the extent of the restart depend on the configuration of the equipment and where the problem occurs. In some cases, the effect may be limited to only some functions in the gNB-CU-CP, while in other cases, all functions in the gNB-CU-CP may be affected.
A gNB-CU-CP HW failure is a problem caused by the hardware of the gNB-CU-CP, mainly due to aging and deterioration of the HW, and in most cases it can be resolved by replacing some of the HW in the equipment.
Failure impact
The impact of the failure and the extent of the replacement depend on the configuration of the equipment and the part where the problem occurs. In some cases, only some functions of the gNB-CU-CP will be affected, but in general the whole gNB-CU-CP is expected to be affected, especially in vRAN scenarios.
Power outages are caused by power grid failures due to various disasters (earthquakes, lightning, tsunamis, fires, windstorms, snowstorms, etc.), as well as by malfunctions of substation equipment or cable breaks in the station building.
Failure impact
The impact of a failure depends on the part of the system where the failure occurs, but it often affects multiple gNB-CU-CPs. The higher the degree of centralization, the more gNB-CU-CPs will be affected.
Transport network failures result from cable breaks in transmission lines, network equipment failures, and misconfigurations. These can also be caused by software or hardware malfunctions, disasters, or human error.
Failure impact
The impact of a failure depends on the part of the system where the failure occurs. When the failure occurs on a shared part of the transport network, it often affects multiple gNB-CU-CPs. The higher the degree of centralization, the more gNB-CU-CPs will be affected.
Failures affecting only one gNB-CU-CP can be addressed by existing countermeasures.
For failures affecting multiple gNB-CU-CPs simultaneously, in particular when the cause of the failure is closely tied to a region, such as a disaster, a solution that takes advantage of different locations for deploying redundant infrastructure should be considered.
The most relevant scenarios with regard to the gNB-CU-CP are those that incur a failure of the entire gNB-CU-CP. Such cases would affect all the existing UE contexts under the gNB-CU-CP where the failure occurred. Moreover, the disaggregated gNB architecture allows for very large configurations: a given gNB-CU-CP can host 512 gNB-DUs, and a total of 16,384 cells with existing specifications. Furthermore, each cell may be serving hundreds of UEs. Therefore, a gNB-CU-CP failure can have a very high impact on service availability and user experience for a very large number of UEs. Further, these failure scenarios can cause additional problems by generating very high signalling loads from, e.g., re-establishment of connections and signalling interfaces.
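To make the scale concrete, a back-of-the-envelope calculation is sketched below; the per-cell UE count is an assumed figure for illustration only, since the text above states only "hundreds of UEs":

```python
# Worst-case scale of a full gNB-CU-CP failure, using the limits above.
max_gnb_dus = 512           # max gNB-DUs per gNB-CU-CP (existing specifications)
max_cells = 16_384          # max cells per gNB-CU-CP (existing specifications)
assumed_ues_per_cell = 200  # assumption: "hundreds of UEs" per cell

affected_ues = max_cells * assumed_ues_per_cell
print(f"Up to {affected_ues:,} UE contexts affected")  # Up to 3,276,800
```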
In contrast, issues affecting only limited portions of the gNB-CU-CP, for example, failures of individual hardware blades in a virtualized environment, are not the focus of this study.
This failure scenario includes cases where the gNB-CU-CP becomes completely unresponsive. This could be due to, e.g., a hardware failure, or the power source being lost and unrecoverable.
(B) Natural disaster leading to loss of the gNB-CU-CP
This failure scenario includes cases where the gNB-CU-CP becomes unrecoverable due to the destruction of the node itself or of its required connectivity. This could be the result of, e.g., an earthquake, a tsunami, or a major fire.
(C) Human-made disaster leading to loss of the gNB-CU-CP
This failure scenario is similar to that caused by natural means, but with the source of the failure being due to human involvement. This failure could be the result of, e.g., war, civil war, terrorism or social unrest.
This failure scenario includes cases where the control plane signalling link becomes unavailable. The issue may be temporary (e.g., intermittent issues at a switch/router on a given communication path) or (semi-)permanent.
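As an illustration of how the temporary and (semi-)permanent cases might be distinguished, the following is a minimal sketch in Python; the thresholds and names are assumptions, not specified behaviour:

```python
import time

# Hypothetical thresholds; a real deployment would derive these from,
# e.g., SCTP heartbeat settings or OAM configuration.
TEMPORARY_OUTAGE_S = 5.0   # short gaps: treat as an intermittent issue
PERMANENT_OUTAGE_S = 60.0  # long gaps: treat as (semi-)permanent

def classify_link(last_heartbeat: float, now: float | None = None) -> str:
    """Classify the CP signalling link based on the last heartbeat seen."""
    now = time.monotonic() if now is None else now
    silence = now - last_heartbeat
    if silence < TEMPORARY_OUTAGE_S:
        return "up"
    if silence < PERMANENT_OUTAGE_S:
        return "temporary-outage"  # e.g. intermittent switch/router issue
    return "permanent-outage"      # trigger recovery/switch-over actions
```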
This use case corresponds to maintenance events (e.g., software upgrades, new feature activations, public protests or temporary unrest). A software upgrade may result in a full reboot of the gNB-CU-CP as well.
In the legacy network architecture, a gNB-DU is only connected to one gNB-CU-CP. With the increasing number of connected gNB-CU-UPs and gNB-DUs, the gNB-CU-CP will be at risk of failure caused by the large amount of signalling processing, so this SI will introduce a mechanism to allow the gNB-CU-UP and gNB-DU to recover service. The higher-layer split between gNB-CU and gNB-DU would enable a highly centralized gNB-CU deployment with a large coverage area per gNB-CU, especially from a C-plane perspective. Likewise, the split between gNB-CU-CP and gNB-CU-UP would enable a highly centralized gNB-CU-CP deployment with a large coverage area per gNB-CU-CP.
As described in the draft SID, such a centralized gNB-CU(-CP) would be a single point of failure; hence, the resiliency of the gNB-CU(-CP) is highly important. Given the limited time allocated for gNB-CU resiliency, only gNB-CU-CP resiliency should be considered in this release; other network nodes, e.g. the AMF or the gNB-CU (non-split CU architecture), should not be considered in this release if time does not allow.
Failure condition:
The failure condition is the trigger of gNB-CU-CP resiliency. If a network node is considered failed only once it has crashed completely, it would be too late to recover. The network should have a scheme to switch part or all of the services/connections to another, resilient network node to offload it before it completely crashes. The decision to offload the service/connection can be either of the below:
Up to the implementation of the gNB-CU-CP
The configured threshold from OAM
Both options work and have no impact on RAN3. So the failure condition can be left to network implementation.
In addition, if any of the adjacent nodes of the gNB-CU-CP, e.g. the AMF, gNB-CU-UP or gNB-DU, loses its connection to the gNB-CU-CP due to an unexpected gNB-CU-CP failure, the adjacent node should be able to activate the resilient gNB-CU-CP automatically, as sketched below.
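A minimal sketch (Python; all names, the load metric, and the default threshold are illustrative assumptions) of how the two trigger options and the adjacent-node reaction could fit together:

```python
from dataclasses import dataclass

@dataclass
class CuCpHealth:
    load: float                  # fraction of signalling capacity in use
    oam_threshold: float | None  # offload threshold configured by OAM, if any

    def should_offload(self) -> bool:
        """Option 2: OAM-configured threshold; option 1: implementation default."""
        if self.oam_threshold is not None:
            return self.load >= self.oam_threshold  # configured by OAM
        return self.load >= 0.9                     # left to implementation

class AdjacentNode:
    """An adjacent node of the gNB-CU-CP, e.g. an AMF, gNB-CU-UP or gNB-DU."""
    def on_connection_lost(self, resilient_cu_cp) -> None:
        # Connection lost due to an unexpected gNB-CU-CP failure:
        # activate the resilient gNB-CU-CP automatically.
        resilient_cu_cp.activate()
```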
The most critical cases are failures that may happen if a gNB-CU-CP is deployed in a central office location with responsibility for a high number of gNB-DUs and therefore of cells. There are different failure scenarios that may happen in such a deployment scenario:
(1) SW or HW failure in a PNF-based implementation;
(2) VNF (SW) or GPP (HW) failure in a virtualized environment (e.g. cloud-based implementation);
(3) Failures of NW connections, e.g., TN or server NW card failures;
(4) Total failure of the central location, e.g., caused by a power shutdown or a disaster case.
For failure scenarios (1) and (2), the gNB-CU(-CP) functionality can be recovered by taking local redundancy in the same central location into account (e.g., based on the usage of spare HW (PNF/GPP) for redundancy purposes). A slow recovery can be initiated via OAM orchestrating a new gNB-CU(-CP) instance on the spare HW (e.g., in combination with cloud-based mechanisms in a virtualized environment). A fast recovery approach to bring the outage time down would require active/stand-by operation of a mirror gNB-CU-CP instance in parallel, which may be realized in a local environment by appropriate implementation (incl. cloud-based mechanisms in case of virtualization).
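The contrast between the two approaches can be sketched as follows (Python pseudocode; the OAM orchestration call and the other names are illustrative assumptions, not a defined API):

```python
def slow_recovery(oam, spare_hw, config):
    """Slow recovery: OAM orchestrates a brand-new gNB-CU-CP instance on
    spare HW (PNF/GPP); the outage lasts until instantiation completes."""
    new_instance = oam.instantiate_cu_cp(spare_hw, config)
    new_instance.reestablish_interfaces()  # F1-C/E1/NG-C set up from scratch
    return new_instance

def fast_recovery(standby):
    """Fast recovery: a mirror gNB-CU-CP already runs in active/stand-by
    operation; the stand-by takes over, bringing the outage time down."""
    standby.promote_to_active()  # state was mirrored in parallel
    return standby
```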
Slow recovery may also work in case of failure scenarios (3) and (4) by using a second central location for orchestrating a new gNB-CU-CP instance via OAM, but the critical issue from an operator's perspective is fast recovery in the geo-redundant case, as 3GPP specifications do not provide suitable support for it.
Therefore, the failure case where a benefit of a resiliency enhancement in 3GPP specifications is seen would be:
Outage of gNB-CU(-CP) when deployed in a central location without sufficient local redundancy, but availability of geo-redundancy (SW/HW) is given and fast recovery of gNB-CU(-CP) functionality is required by the operator.
If a gNB-CU-CP fails, the network support for the NR cells is lost and UEs cannot communicate. More specifically, the following can be observed:
CP signaling redundancy is possible in RAN, but still only one RRC-anchor (gNB-CU-CP) is defined:
SRB duplication is possible in DC (PDCP duplicates RRC messages, which are sent via different carriers to/from UE), but losing the main connection to the UE still means CP failure
RLF handling:
Rel-15: if SCG connection fails, inform the network; if MCG fails, the connection is lost
Rel-16: if the SCG/MCG connection fails, inform the network via the MCG/SCG. There will be MCG/SCG UP downtime due to DRB suspension.
CP failure leading to UP failure (the service interruption time resulting from UP failure is probably the most important KPI associated with a CU-CP failure):
Generally, an absence of RRC messages is not noted by the UE, so the UE does not know that the CP has failed
Timer expiry procedures in RRC work as long as RRC connectivity is up. In some cases, DRB suspension is caused by the expiry of timers
If the gNB-CU-CP goes down, CP connectivity is not re-established automatically until the failure is detected
UP connection could be lost if RRC reconfiguration (critical to maintain UP) is not possible
e.g. handover required; bearer reconfiguration required; reconfiguration to meet QoS or other RAN reconfiguration required
There is always a single RRC anchor even for the case of DC; furthermore, there is currently no way for the UE to detect a gNB(-CU(-CP)) failure.
A CU-CP failure is essentially a node internal failure (i.e. it may not be the whole logical node that fails).
Geographical redundancy is the distribution of mission-critical components or infrastructure, such as servers, across multiple data centers residing in different geographic locations, which ensures high availability and disaster recovery. Geographical redundancy replicates the data and stores it in other databases located at separate physical locations. Even if one of the locations fails or simply needs to be taken offline, the other location, holding the replicated data, will not be affected.
Considering geographical redundancy for the gNB-CU-CP, the backup gNB-CU-CP could be allocated at a separate physical location. In this case, even if the original gNB-CU-CP detects failures and cannot work, bringing the backup gNB-CU-CP into service can avoid the interruption of UP traffic and the disconnection of multiple UEs.
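A minimal sketch (Python; the context structure and the replication call are illustrative assumptions) of a primary gNB-CU-CP replicating state to a backup at a separate physical location:

```python
class BackupCuCp:
    """Backup gNB-CU-CP allocated at a separate physical location."""
    def __init__(self):
        self.ue_contexts: dict[int, dict] = {}

    def replicate(self, ue_id: int, context: dict) -> None:
        self.ue_contexts[ue_id] = context  # replica at the remote site

    def take_over(self) -> None:
        # On primary failure, serve the replicated UE contexts, avoiding
        # UP-traffic interruption and mass UE disconnection.
        pass

class PrimaryCuCp:
    """Primary gNB-CU-CP replicating its state to the geo-redundant backup."""
    def __init__(self, backup: BackupCuCp):
        self.backup = backup
        self.ue_contexts: dict[int, dict] = {}

    def update_ue_context(self, ue_id: int, context: dict) -> None:
        self.ue_contexts[ue_id] = context
        self.backup.replicate(ue_id, context)  # synchronize to the other site
```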
The split NG-RAN architecture consists of a single logical gNB-CU-CP connected to multiple logical gNB-CU-UPs and multiple logical gNB-DUs, and the gNB-CU-CP is connected to multiple AMFs in the 5GC. The logical gNB-CU-CP function can be implemented and deployed with Network Function Virtualization as shown in Figure A.7.1-1. The logical gNB-CU-CP function can be further split into sub-functions, e.g. a network interface sub-function, and each sub-function can be implemented as a virtual network function component (VNFC).
The server may not work properly because of a hardware failure or a virtualization platform failure. A hardware failure is caused by deterioration or by an instantaneous fault of the hardware; depending on the failure cause, the hardware needs to be replaced or can be recovered by rebooting the system. A virtualization platform failure is caused by software error, so it can also be recovered by rebooting the system or by a software upgrade.
The logical gNB-CU-CP function can be implemented with sub-functions, e.g. a network interface sub-function, which can be separately implemented as VNFCs that interact with each other. One or more of the VNFCs constituting a gNB-CU-CP may not work properly owing to a software error or another reason. In most cases, a VNFC failure can be recovered by rebooting the system or by a software upgrade.
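As an illustration, the sub-function decomposition and per-VNFC recovery could look as follows (Python sketch; the network interface sub-function comes from the text above, while the second sub-function name is invented for illustration):

```python
class Vnfc:
    """One sub-function of the gNB-CU-CP, deployed as a VNFC."""
    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def reboot(self) -> None:
        self.healthy = True  # most VNFC failures recover via reboot/upgrade

class VirtualizedCuCp:
    """Logical gNB-CU-CP composed of interacting VNFCs."""
    def __init__(self):
        self.vnfcs = [Vnfc("network-interface"), Vnfc("rrc-handling")]

    def repair(self) -> None:
        for vnfc in self.vnfcs:
            if not vnfc.healthy:
                vnfc.reboot()  # recover only the failed component
```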
Transport link failure is caused by the malfunction of transport link hardware or software. The failure can also come from a transmission line problem or a transport network equipment failure. If the transport link failure is caused within the gNB-CU-CP system, it can be recovered by rebooting the system or by a software upgrade. If the failure is caused outside the gNB-CU-CP system, the transmission line needs to be replaced, or the network equipment needs to be rebooted, upgraded or replaced.
In all of these failure scenarios, there exist solutions to recover the logical gNB-CU-CP operation, e.g. system reboot, software upgrade or hardware replacement. Depending on the recovery solution, it may take from a few minutes to some days, and it may break the service continuity of the gNB-CU-CP.
So, to minimize the interruption of UP traffic and the disconnection of UEs, the virtualized gNB-CU-CP function could be duplicated, with the data in use shared and synchronized between the duplicated gNB-CU-CP functions. If a failure is detected, the roles of the duplicated gNB-CU-CP functions are exchanged to support a fast switch-over by implementation, as sketched below.
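A minimal sketch of such 1+1 duplication with role exchange (Python; failure detection and the synchronization mechanism are deliberately left abstract):

```python
class DuplicatedCuCp:
    def __init__(self, role: str):
        self.role = role              # "active" or "standby"
        self.shared_state: dict = {}  # data shared between the two instances

    def sync_to(self, peer: "DuplicatedCuCp") -> None:
        peer.shared_state = dict(self.shared_state)  # keep the twin up to date

def switch_over(active: DuplicatedCuCp, standby: DuplicatedCuCp) -> None:
    """On failure detection, exchange the roles of the duplicated functions."""
    active.role, standby.role = "standby", "active"
    # The former standby continues with the synchronized state, minimizing
    # UP traffic interruption and UE disconnection.
```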
The purpose of this SI is to study and identify failure scenarios associated with the gNB-CU-CP, based on the current architecture for the NG-RAN as described in TS 38.401 [2], see below:
The gNB-CU-CP, as a logical node, could be deployed in different ways depending on the operator's and vendor's strategy: e.g., as software plus dedicated hardware within a physical box, or as a software instance running over a virtualized environment (generic hardware). Thus, a failure could happen at any single point at any place in Figure A.8.1-1, i.e. either in software or in hardware, which would finally lead to the unavailability of the gNB-CU-CP; in addition, power is also a main factor causing failures.