Internet Engineering Task Force (IETF)                  N. Sprecher, Ed.
Request for Comments: 6372                        Nokia Siemens Networks
Category: Informational                                   A. Farrel, Ed.
ISSN: 2070-1721                                         Juniper Networks
                                                          September 2011

        MPLS Transport Profile (MPLS-TP) Survivability Framework

Abstract
Network survivability is the ability of a network to recover traffic delivery following failure or degradation of network resources. Survivability is critical for the delivery of guaranteed network services, such as those subject to strict Service Level Agreements (SLAs) that place maximum bounds on the length of time that services may be degraded or unavailable.

The Transport Profile of Multiprotocol Label Switching (MPLS-TP) is a packet-based transport technology based on the MPLS data plane that reuses many aspects of the MPLS management and control planes.

This document comprises a framework for the provision of survivability in an MPLS-TP network; it describes recovery elements, types, methods, and topological considerations. To enable data-plane recovery, survivability may be supported by the control plane, management plane, and by Operations, Administration, and Maintenance (OAM) functions. This document describes mechanisms for recovering MPLS-TP Label Switched Paths (LSPs). A detailed description of pseudowire recovery in MPLS-TP networks is beyond the scope of this document.

This document is a product of a joint Internet Engineering Task Force (IETF) / International Telecommunication Union Telecommunication Standardization Sector (ITU-T) effort to include an MPLS Transport Profile within the IETF MPLS and Pseudowire Emulation Edge-to-Edge (PWE3) architectures to support the capabilities and functionalities of a packet-based transport network as defined by the ITU-T.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are a candidate for any level of Internet Standard; see Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc6372.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents
   1. Introduction
      1.1. Recovery Schemes
      1.2. Recovery Action Initiation
      1.3. Recovery Context
      1.4. Scope of This Framework
   2. Terminology and References
   3. Requirements for Survivability
   4. Functional Architecture
      4.1. Elements of Control
         4.1.1. Operator Control
         4.1.2. Defect-Triggered Actions
         4.1.3. OAM Signaling
         4.1.4. Control-Plane Signaling
      4.2. Recovery Scope
         4.2.1. Span Recovery
         4.2.2. Segment Recovery
         4.2.3. End-to-End Recovery
      4.3. Grades of Recovery
         4.3.1. Dedicated Protection
         4.3.2. Shared Protection
         4.3.3. Extra Traffic
         4.3.4. Restoration
         4.3.5. Reversion
      4.4. Mechanisms for Protection
         4.4.1. Link-Level Protection
         4.4.2. Alternate Paths and Segments
         4.4.3. Protection Tunnels
      4.5. Recovery Domains
      4.6. Protection in Different Topologies
      4.7. Mesh Networks
         4.7.1. 1:n Linear Protection
         4.7.2. 1+1 Linear Protection
         4.7.3. P2MP Linear Protection
         4.7.4. Triggers for the Linear Protection Switching Action
         4.7.5. Applicability of Linear Protection for LSP Segments
         4.7.6. Shared Mesh Protection
      4.8. Ring Networks
      4.9. Recovery in Layered Networks
         4.9.1. Inherited Link-Level Protection
         4.9.2. Shared Risk Groups
         4.9.3. Fault Correlation
   5. Applicability and Scope of Survivability in MPLS-TP
   6. Mechanisms for Providing Survivability for MPLS-TP LSPs
      6.1. Management Plane
         6.1.1. Configuration of Protection Operation
         6.1.2. External Manual Commands
      6.2. Fault Detection
      6.3. Fault Localization
      6.4. OAM Signaling
         6.4.1. Fault Detection
         6.4.2. Testing for Faults
         6.4.3. Fault Localization
         6.4.4. Fault Reporting
         6.4.5. Coordination of Recovery Actions
      6.5. Control Plane
         6.5.1. Fault Detection
         6.5.2. Testing for Faults
         6.5.3. Fault Localization
         6.5.4. Fault Status Reporting
         6.5.5. Coordination of Recovery Actions
         6.5.6. Establishment of Protection and Restoration LSPs
   7. Pseudowire Recovery Considerations
      7.1. Utilization of Underlying MPLS-TP Recovery
      7.2. Recovery in the Pseudowire Layer
   8. Manageability Considerations
   9. Security Considerations
   10. Acknowledgments
   11. References
      11.1. Normative References
      11.2. Informative References

1. Introduction
Network survivability is the network's ability to recover traffic delivery following the failure or degradation of traffic delivery caused by a network fault or a denial-of-service attack on the network. Survivability plays a critical role in the delivery of reliable services in transport networks. Guaranteed services in the form of Service Level Agreements (SLAs) require a resilient network that very rapidly detects facility or node degradation or failures, and immediately starts to recover network operations in accordance with the terms of the SLA.

The MPLS Transport Profile (MPLS-TP) is described in [RFC5921]. MPLS-TP is designed to be consistent with existing transport network operations and management models, while providing survivability mechanisms, such as protection and restoration. The functionality provided is intended to be similar to or better than that found in established transport networks that set a high benchmark for reliability. That is, it is intended to provide the operator with functions with which they are familiar through their experience with other transport networks, although this does not preclude additional techniques.

This document provides a framework for MPLS-TP-based survivability that meets the recovery requirements specified in [RFC5654]. It uses the recovery terminology defined in [RFC4427], which draws heavily on [G.808.1], and it refers to the requirements specified in [RFC5654].

This document is a product of a joint Internet Engineering Task Force (IETF) / International Telecommunication Union Telecommunication Standardization Sector (ITU-T) effort to include an MPLS Transport Profile within the IETF MPLS and PWE3 architectures to support the capabilities and functionalities of a packet-based transport network, as defined by the ITU-T.

1.1. Recovery Schemes
Various recovery schemes (for protection and restoration) and processes have been defined and analyzed in [RFC4427] and [RFC4428]. These schemes can also be applied in MPLS-TP networks to re-establish end-to-end traffic delivery according to the agreed service parameters, and to trigger recovery from "failed" or "degraded" transport entities. In the context of this document, transport entities are nodes, links, transport path segments, concatenated transport path segments, and entire transport paths. Recovery actions are initiated by the detection of a defect, or by an external request (e.g., an operator's request for manual control of protection switching).
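The kinds of transport entity and the two ways in which recovery may be initiated can be pictured with a small data model. The following Python sketch is illustrative only: the class names, the entity identifiers, and the function recovery_reasons are hypothetical and are not defined by this document or by [RFC4427]. It does not model any precedence between a defect and an external request.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class EntityKind(Enum):
    """Transport entity kinds listed in the paragraph above."""
    NODE = "node"
    LINK = "link"
    PATH_SEGMENT = "transport path segment"
    CONCATENATED_PATH_SEGMENT = "concatenated transport path segment"
    TRANSPORT_PATH = "entire transport path"


class Condition(Enum):
    """Assumed condition labels; the text speaks of "failed" or "degraded"."""
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"


@dataclass(frozen=True)
class TransportEntity:
    name: str  # hypothetical identifier, for illustration only
    kind: EntityKind
    condition: Condition = Condition.OK


def recovery_reasons(entity: TransportEntity,
                     external_request: bool = False) -> List[str]:
    """List the reasons (possibly none) for initiating recovery."""
    reasons = []
    if entity.condition is not Condition.OK:
        # A detected defect on the entity initiates recovery.
        reasons.append(f"defect detected ({entity.condition.value})")
    if external_request:
        # So does an external request, e.g., an operator command.
        reasons.append("external request")
    return reasons
```

For example, a link whose condition is Condition.FAILED yields a single "defect detected" reason, while a healthy transport path yields a reason only when an operator requests a switch.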
[RFC4427] makes a distinction between protection switching and restoration mechanisms.

- Protection switching uses pre-assigned capacity between nodes, where the simplest scheme has a single, dedicated protection entity for each working entity, while the most complex scheme has m protection entities shared between n working entities (m:n).

- Restoration uses any capacity available between nodes and usually involves rerouting. The resources used for restoration may be pre-planned (i.e., predetermined, but not yet allocated to the recovery path), and recovery priority may be used as a differentiation mechanism to determine which services are recovered and which are not recovered.

Both protection switching and restoration may be either unidirectional or bidirectional; unidirectional implies that protection switching is performed independently for each direction of a bidirectional transport path, while bidirectional means that both directions are switched simultaneously using appropriate coordination, even if the fault applies to only one direction of the path.

Both protection and restoration mechanisms may be either revertive or non-revertive as described in Section 4.11 of [RFC4427].

Preemption priority may be used to determine which services are sacrificed to enable the recovery of other services. Restoration may also be either unidirectional or bidirectional. In general, protection actions are completed within time frames amounting to tens of milliseconds, while automated restoration actions are normally completed within periods ranging from hundreds of milliseconds to a maximum of a few seconds. Restoration is not guaranteed (for example, because network resources may not be available at the time of the defect).

1.2. Recovery Action Initiation
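A central idea of this section is that the protection-switching decision depends only on the reported status of the transport entities, not on which mechanism reported it. The Python sketch below illustrates that idea together with the unidirectional/bidirectional distinction from Section 1.1. It is a simplified model under stated assumptions: the names ProtectionController, Status, Source, and report are hypothetical, both ends of the recovery domain are collapsed into one object (so the coordination protocol itself is not modelled), and reversion is omitted.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Status(Enum):
    OK = "ok"
    SIGNAL_DEGRADE = "signal degradation"
    SIGNAL_FAIL = "signal failure"


class Source(Enum):
    """Initiation mechanisms named in the text."""
    OAM = "oam"
    MANAGEMENT = "management plane"
    CONTROL_PLANE = "control plane"


DIRECTIONS = ("forward", "reverse")


def _all_on_working() -> Dict[str, str]:
    return {d: "working" for d in DIRECTIONS}


@dataclass
class ProtectionController:
    bidirectional: bool
    # Entity currently selected to carry traffic in each direction.
    active: Dict[str, str] = field(default_factory=_all_on_working)

    def report(self, direction: str, status: Status,
               source: Source) -> Dict[str, str]:
        """Process a status report for the working entity in one direction.

        `source` is accepted but deliberately unused: the decision is the
        same whichever mechanism delivered the report.
        """
        if status is not Status.OK:
            # Bidirectional switching moves both directions together,
            # even if only one direction is affected.
            affected = DIRECTIONS if self.bidirectional else (direction,)
            for d in affected:
                self.active[d] = "protection"
        return dict(self.active)
```

With bidirectional=False, a signal failure reported for the forward direction moves only that direction to the protection entity; with bidirectional=True, both directions move. The outcome is identical whether the report is tagged Source.OAM, Source.MANAGEMENT, or Source.CONTROL_PLANE.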
The recovery schemes described in [RFC4427] and evaluated in [RFC4428] are presented in the context of control-plane-driven actions (such as the configuration of the protection entities and functions, etc.). The presence of a distributed control plane in an MPLS-TP network is optional. However, the absence of such a control plane does not affect the operation of the network and the use of MPLS-TP forwarding, Operations, Administration, and Maintenance (OAM), and survivability capabilities. In particular, the concepts discussed in [RFC4427] and [RFC4428] refer to recovery actions effected in the data plane; they are equally applicable in MPLS-TP, with or without the use of a control plane.

Thus, some of the MPLS-TP recovery mechanisms do not depend on a control plane and use MPLS-TP OAM mechanisms or management actions to trigger recovery actions.

The principles of MPLS-TP protection-switching actions are similar to those described in [RFC4427], since the protection mechanism is based on the capability to detect certain defects in the transport entities within the recovery domain. The protection-switching controller does not care which initiation method is used, provided that it can be given information about the status of the transport entities within the recovery domain (e.g., OK, signal failure, signal degradation, etc.).

In the context of MPLS-TP, it is imperative to ensure that performing switchovers is possible, regardless of the way in which the network is configured and managed (for example, regardless of whether a control-plane, management-plane, or OAM initiation mechanism is used).

All MPLS and GMPLS protection mechanisms [RFC4428] are applicable in an MPLS-TP environment. It is also possible to provision and manage the related protection entities and functions defined in MPLS and GMPLS using the management plane [RFC5654].

Regardless of whether an OAM, management, or control plane initiation mechanism is used, the protection-switching operation is a data-plane operation.

In some recovery schemes (such as bidirectional protection switching), it is necessary to coordinate the protection state between the edges of the recovery domain to achieve initiation of recovery actions for both directions. An MPLS-TP protocol may be used as an in-band (i.e., data-plane based) control protocol in order to coordinate the protection state between the edges of the protection domain. When the MPLS-TP control plane is in use, a control-plane-based mechanism can also be used to coordinate the protection states between the edges of the protection domain.

1.3. Recovery Context
An MPLS-TP Label Switched Path (LSP) may be subject to any part of or all of MPLS-TP link recovery, path-segment recovery, or end-to-end recovery, where:
o MPLS-TP link recovery refers to the recovery of an individual link (and hence all or a subset of the LSPs routed over the link) between two MPLS-TP nodes. For example, link recovery may be provided by server-layer recovery.

o Segment recovery refers to the recovery of an LSP segment (i.e., segment and concatenated segment in the language of [RFC5654]) between two nodes and is used to recover from the failure of one or more links or nodes.

o End-to-end recovery refers to the recovery of an entire LSP, from its ingress to its egress node.

For additional resiliency, more than one of these recovery techniques may be configured concurrently for a single path.

Co-routed bidirectional MPLS-TP LSPs are defined in a way that allows both directions of the LSP to follow the same route through the network. In this scenario, the operator often requires the directions to fate-share (that is, if one direction fails, both directions should cease to operate). Associated bidirectional MPLS-TP LSPs exist where the two directions of a bidirectional LSP follow different paths through the network. An operator may also request fate-sharing for associated bidirectional LSPs.

The requirement for fate-sharing causes a direct interaction between the recovery processes affecting the two directions of an LSP, so that both directions of the bidirectional LSP are recovered at the same time. This mode of recovery is termed bidirectional recovery and may be seen as a consequence of fate-sharing.

The recovery scheme operating at the data-plane level can function in a multi-domain environment (in the wider sense of a "domain" [RFC4726]). It can also protect against a failure of a boundary node in the case of inter-domain operation.

MPLS-TP recovery schemes are intended to protect client services when they are sent across the MPLS-TP network.

1.4. Scope of This Framework
This framework introduces the architecture of the MPLS-TP recovery domain and describes the recovery schemes in MPLS-TP (based on the recovery types defined in [RFC4427]) as well as the principles of operation, recovery states, recovery triggers, and information exchanges between the different elements that support the reference model.
The framework also describes the qualitative grades of the survivability functions that can be provided, such as dedicated recovery, shared protection, restoration, etc. In the event of a network failure, the grade of recovery directly affects the service grade provided to the end-user.

The general description of the functional architecture is applicable to both LSPs and pseudowires (PWs); however, PW recovery is only introduced in Section 7, and the relevant details are beyond the scope of this document and are for further study.

This framework applies to general recovery schemes as well as to mechanisms that are optimized for specific topologies and are tailored to efficiently handle protection switching.

This document addresses the need for the coordination of protection switching across multiple layers and at sub-layers (for clarity, we use the term "layer" to refer equally to layers and sub-layers). This allows an operator to prevent race conditions and allows the protection-switching mechanism of one layer to recover from a failure before switching is invoked at another layer.

This framework also specifies the functions that must be supported by MPLS-TP to provide the recovery mechanisms. MPLS-TP introduces a tool kit to enable recovery in MPLS-TP-based networks and to ensure that affected services are recovered in the event of a failure.

Generally, network operators aim to provide the fastest, most stable, and best protection mechanism at a reasonable cost in accordance with customer requirements. The higher the grade of protection required, the more resources are consumed. It is therefore expected that network operators will offer a wide spectrum of service grades. MPLS-TP-based recovery offers the flexibility to select a recovery mechanism, define the granularity at which traffic delivery is to be protected, and choose the specific traffic types that are to be protected. With MPLS-TP-based recovery, it should be possible to provide different grades of protection for different traffic classes within the same path based on the service requirements.

2. Terminology and References
The terminology used in this document is consistent with that defined in [RFC4427]. The latter is consistent with [G.808.1]. However, certain protection concepts (such as ring protection) are not discussed in [RFC4427]; for those concepts, the terminology used in this document is drawn from [G.841].
Readers should refer to those documents for normative definitions. This document supplies brief summaries of a number of terms for reasons of clarity and to assist the reader, but it does not redefine terms.

Note, in particular, the distinction and definitions made in [RFC4427] for the following three terms:

o Protection: re-establishing end-to-end traffic delivery using pre-allocated resources.

o Restoration: re-establishing end-to-end traffic delivery using resources allocated at the time of need; sometimes referred to as "repair" of a service, LSP, or the traffic.

o Recovery: a generic term covering both Protection and Restoration.

Note that the term "survivability" is used in [RFC5654] to cover the functional elements of "protection" and "restoration", which are collectively known as "recovery".

Important background information on survivability can be found in [RFC3386], [RFC3469], [RFC4426], [RFC4427], and [RFC4428].

In this document, the following additional terminology is applied:

o "Fault Management", as defined in [RFC5950].

o The terms "defect" and "failure" are used interchangeably to indicate any defect or failure in the sense that they are defined in [G.806]. The terms also include any signal degradation event as defined in [G.806].

o A "fault" is a fault or fault cause as defined in [G.806].

o "Trigger" indicates any event that may initiate a recovery action. See Section 4.1 for a more detailed discussion of triggers.

o The acronym "OAM" is defined as Operations, Administration, and Maintenance, consistent with [RFC6291].

o A "Transport Entity" is a node, link, transport path segment, concatenated transport path segment, or entire transport path.

o A "Working Entity" is a transport entity that carries traffic during normal network operation.

o A "Protection Entity" is a transport entity that is pre-allocated and used to protect and transport traffic when the working entity fails.

o A "Recovery Entity" is a transport entity that is used to recover and transport traffic when the working entity fails.

o "Survivability Actions" are the steps that may be taken by network nodes to communicate faults and to switch traffic from faulted or degraded paths to other paths. This may include sending messages and establishing new paths.

General terminology for MPLS-TP is found in [RFC5921] and [ROSETTA]. Background information on MPLS-TP requirements can be found in [RFC5654].

3. Requirements for Survivability
MPLS-TP requirements are presented in [RFC5654] and serve as normative references for the definition of all MPLS-TP functionality, including survivability. Survivability is presented in [RFC5654] as playing a critical role in the delivery of reliable services, and the requirements for survivability are set out using the recovery terminology defined in [RFC4427].