
RFC 7325

MPLS Forwarding Compliance and Performance Requirements

Pages: 59
Informational
Part 2 of 4 – Pages 11 to 30

Top   ToC   RFC7325 - Page 11   prevText

2. Forwarding Issues

A brief review of forwarding issues is provided in the subsections that follow. This section provides some background on why some of these requirements exist. The questions to ask of suppliers are covered in Section 3. Some guidelines for testing are provided in Section 4.

2.1. Forwarding Basics

Basic MPLS architecture and MPLS encapsulation, and therefore packet forwarding, are defined in [RFC3031] and [RFC3032]. RFC 3031 and RFC 3032 are somewhat LDP centric. RSVP-TE supports traffic engineering (TE) and fast reroute, features that LDP lacks. The base document for MPLS RSVP-TE is [RFC3209].

A few RFCs update RFC 3032. Those with impact on forwarding include the following.

1.  TTL processing is clarified in [RFC3443].

2.  The use of MPLS Explicit NULL is modified in [RFC4182].

3.  Differentiated Services is supported by [RFC3270] and [RFC4124]. The "EXP" field is renamed to "Traffic Class" in [RFC5462], removing any misconception that it was available for experimentation or could be ignored.

4.  ECN is supported by [RFC5129].

5.  The MPLS G-ACh and GAL are defined in [RFC5586].

6.  [RFC5332] redefines the two data link layer codepoints for MPLS packets.

Tunneling encapsulations carrying MPLS, such as MPLS in IP [RFC4023], MPLS in GRE [RFC4023], MPLS in L2TPv3 [RFC4817], or MPLS in UDP [MPLS-IN-UDP], are out of scope.
   Other RFCs have implications to MPLS Forwarding and do not update RFC
   3032 or RFC 3209, including:

   1.  The pseudowire (PW) Associated Channel Header (ACH) is defined by
       [RFC5085] and was later generalized by the MPLS G-ACh [RFC5586].

   2.  The Entropy Label Indicator (ELI) and Entropy Label (EL) are
       defined by [RFC6790].

   A few RFCs update RFC 3209.  Those that are listed as updating RFC
   3209 generally impact only RSVP-TE signaling.  Forwarding is modified
   by major extensions built upon RFC 3209.

   RFCs that impact forwarding are discussed in the following
   subsections.

2.1.1. MPLS Special-Purpose Labels

[RFC3032] specifies that label values 0-15 are special-purpose labels with special meanings. [RFC7274] renamed these from the term "reserved labels" used in [RFC3032] to "special-purpose labels".

Three values of NULL label are defined (two of which are later updated by [RFC4182]) and a Router Alert Label is defined. The original intent was that special-purpose labels, except the NULL labels, could be sent to the routing engine CPU rather than be processed in forwarding hardware. Hardware support is required by new RFCs such as those defining Entropy Label and OAM processed as a result of receiving a GAL.

For new special-purpose labels, some accommodation is needed for LSRs that will send the labels to a general-purpose CPU or other highly programmable hardware. For example, ELI will only be sent to LSRs that have signaled support for [RFC6790], and a high OAM packet rate must be negotiated among endpoints.

[RFC3429] reserves a label for ITU-T Y.1711; however, Y.1711 does not work with multipath and its use is strongly discouraged.

The current list of special-purpose labels can be found on the "Multiprotocol Label Switching Architecture (MPLS) Label Values" registry reachable at IANA's pages at <http://www.iana.org>.

[RFC7274] introduces an IANA "Extended Special-Purpose MPLS Label Values" registry and makes use of the "extension" label, label 15, to indicate that the next label is an extended special-purpose label and requires special handling.

The range of only 16 values for special-purpose labels allows a table to be used. The range of extended special-purpose labels with 20 bits available for use may have to be handled in some other way in the unlikely event that in the future
   the range of currently reserved values 256-1048575 is used.  If only
   the Standards Action range, 16-239, and the Experimental range,
   240-255, are used, then a table of 256 entries can be used.

   Unknown special-purpose labels and unknown extended special-purpose
   labels are handled the same.  When an unknown special-purpose label
   is encountered or a special purpose label not directly handled in
   forwarding hardware is encountered, the packet should be sent to a
   general-purpose CPU by default.  If this capability is supported,
   there must be an option to either drop or rate limit such packets
   based on the value of each special-purpose label.
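
The table-driven handling described above can be sketched as follows. This is a minimal illustration, not a description of any particular implementation: the names Action, new_tables, and action_for are hypothetical, the set of labels treated as handled in hardware is an assumption for one imagined platform, and dropping a packet whose extension label has nothing after it is also an assumption.

```python
from enum import Enum

class Action(Enum):
    HARDWARE = "handled in forwarding hardware"
    PUNT = "send to general-purpose CPU"
    RATE_LIMIT = "send to CPU through a policer"
    DROP = "drop"

EXTENSION_LABEL = 15              # RFC 7274: next label is an extended special-purpose label
HARDWARE_LABELS = (0, 2, 7, 13)   # assumption: Explicit NULLs, ELI, and GAL handled in hardware

def new_tables():
    """Build per-label action tables; unknown labels default to PUNT."""
    base = [Action.PUNT] * 16         # special-purpose labels 0-15
    for label in HARDWARE_LABELS:
        base[label] = Action.HARDWARE
    extended = [Action.PUNT] * 256    # extended values up to 255 fit in one table
    return base, extended

def action_for(stack, base, extended):
    """Return the action for a stack (top label first) whose top label is 0-15."""
    top = stack[0]
    if top > 15:
        raise ValueError("top label is not a special-purpose label")
    if top != EXTENSION_LABEL:
        return base[top]
    if len(stack) < 2:
        return Action.DROP            # assumption: malformed stack is dropped
    ext = stack[1]
    # Values beyond the table are treated like any other unknown label.
    return extended[ext] if ext < len(extended) else Action.PUNT
```

An operator-facing option to drop or rate limit per label value then amounts to overwriting individual table entries, for example base[14] = Action.DROP.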

2.1.2. MPLS Differentiated Services

[RFC2474] deprecates the IP Type of Service (TOS) and IP Precedence (Prec) fields and replaces them with the Differentiated Services Field more commonly known as the Differentiated Services Code Point (DSCP) field. [RFC2475] defines the Differentiated Services architecture, which in other forums is often called a Quality of Service (QoS) architecture.

MPLS uses the Traffic Class (TC) field to support Differentiated Services [RFC5462]. There are two primary documents describing how DSCP is mapped into TC.

1.  [RFC3270] defines E-LSP and L-LSP. E-LSP uses a static mapping of DSCP into TC. L-LSP uses a per-LSP mapping of DSCP into TC, with one PHB Scheduling Class (PSC) per L-LSP. Each PSC can use multiple Per-Hop Behavior (PHB) values. For example, the Assured Forwarding service defines three PSCs, each with three PHBs [RFC2597].

2.  [RFC4124] defines assignment of a class-type (CT) to an LSP, where a per-CT static mapping of TC to PHB is used. [RFC4124] provides a means to support up to eight E-LSP-like mappings of DSCP to TC.

To meet Differentiated Services requirements specified in [RFC3270], the following forwarding requirements must be met. An ingress LER MUST be able to select an LSP and then apply a per-LSP map of DSCP into TC. A midpoint LSR MUST be able to apply a per-LSP map of TC to PHB. The number of mappings supported will be far less than the number of LSPs supported.

To meet Differentiated Services requirements specified in [RFC4124], the following forwarding requirements must be met. An ingress LER MUST be able to select an LSP and then apply a per-LSP map of DSCP into TC. A midpoint LSR MUST be able to map LSP number to Class Type
   (CT), then use a per-CT map to map TC to PHB.  Since there are only
   eight allowed values of CT, only eight maps of TC to PHB need to be
   supported.  The LSP label can be used directly to find the TC-to-PHB
   mapping, as is needed to support L-LSPs as defined by [RFC3270].

   While support for [RFC4124] and not [RFC3270] would allow support for
   only eight mappings of TC to PHB, it is common to support both and
   simply state a limit on the number of unique TC-to-PHB mappings that
   can be supported.
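
The two-step midpoint lookup for [RFC4124] (LSP to CT, then a per-CT map of TC to PHB) can be sketched as below. The function names, the label values, and the PHB names placed in the maps are hypothetical and chosen only for illustration.

```python
def build_ct_maps():
    """Eight TC-to-PHB maps, one per Class Type; contents are illustrative."""
    maps = [["BE"] * 8 for _ in range(8)]   # default: best effort for every TC
    maps[1][5] = "EF"
    maps[2][1] = "AF11"
    maps[2][2] = "AF12"
    return maps

def phb_for(label, tc, lsp_to_ct, ct_maps):
    """Midpoint lookup: incoming label -> CT, then per-CT map of TC -> PHB."""
    if not 0 <= tc <= 7:
        raise ValueError("TC is a 3-bit field")
    ct = lsp_to_ct[label]
    return ct_maps[ct][tc]

# Many LSPs share a small number of maps, as the text notes.
lsp_to_ct = {1001: 1, 1002: 2, 1003: 2}
ct_maps = build_ct_maps()
```

Supporting L-LSPs as well would replace the lsp_to_ct step with a direct label-to-map index, with a stated limit on the number of unique maps.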

2.1.3. Time Synchronization

PTP or NTP may be carried over MPLS [TIMING-OVER-MPLS]. Generally, NTP will be carried within IP, and IP will be carried in MPLS [RFC5905]. Both PTP and NTP benefit from accurate timestamping of incoming packets and the ability to insert accurate timestamps in outgoing packets. PTP correction that occurs when forwarding requires updating a timestamp compensation field based on the difference between packet arrival at an LSR and packet transmit time at that same LSR. Since the label stack depth may vary, hardware should allow a timestamp to be placed in an outgoing packet at any specified byte position. It may be necessary to modify Layer 2 checksums or frame check sequences after insertion. PTP and NTP timestamp formats differ in such a way as to require different implementations of the timestamp correction. If NTP or PTP is carried over UDP/IP or UDP/IP/MPLS, the UDP checksum will also have to be updated. Accurate time synchronization, in addition to being generally useful, is required for MPLS-TP Delay Measurement (DM) OAM. See Section 2.6.4.
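
As a rough sketch of the PTP correction described above: the residence time (transmit time minus arrival time at the same LSR) is added to the correction field, and the byte position of the field moves with the label stack depth. The function names are hypothetical. The sketch assumes the IEEE 1588 convention that the correction field counts nanoseconds scaled by 2^16, and it leaves out checksum and FCS updates.

```python
LABEL_ENTRY_BYTES = 4   # each MPLS label stack entry is 4 bytes

def update_correction(correction_field, rx_ns, tx_ns):
    """Add this LSR's residence time to a PTP correction field value."""
    residence_ns = tx_ns - rx_ns
    if residence_ns < 0:
        raise ValueError("transmit time precedes arrival time")
    return correction_field + (residence_ns << 16)

def field_offset(l2_header_bytes, label_depth, offset_in_payload):
    """Byte position of a field in the frame for a given label stack depth."""
    return l2_header_bytes + label_depth * LABEL_ENTRY_BYTES + offset_in_payload
```

Because field_offset depends on label_depth, hardware that can only write a timestamp at a fixed position would not cover all label stacks.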

2.1.4. Uses of Multiple Label Stack Entries

MPLS deployments in the early part of the prior decade (circa 2000) tended to support either LDP or RSVP-TE. LDP was favored by some for its ability to scale to a very large number of PE devices at the edge of the network, without adding deployment complexity. RSVP-TE was favored, generally in the network core, where traffic engineering and/or fast reroute were considered important. Both LDP and RSVP-TE are used simultaneously within major service provider networks using a technique known as "LDP over RSVP-TE Tunneling". This technique allows service providers to carry LDP tunnels inside RSVP-TE tunnels. This makes it possible to take advantage of the traffic engineering and fast reroute on more expensive intercity and intercontinental transport paths. The
   ingress RSVP-TE PE places many LDP tunnels on a single RSVP-TE LSP
   and carries it to the egress RSVP-TE PE.  The LDP PEs are situated
   further from the core, for example, within a metro network.  LDP over
   RSVP-TE tunneling requires a minimum of two MPLS labels: one each for
   LDP and RSVP-TE.

   The use of MPLS FRR [RFC4090] might add one more label to MPLS
   traffic but only when FRR protection is in use (active).  If LDP over
   RSVP-TE is in use, and FRR protection is in use, then at least three
   MPLS labels are present on the label stack on the links through which
   the Bypass LSP traverses.  FRR is covered in Section 2.1.7.

   LDP L2VPN, LDP IPVPN, BGP L2VPN, and BGP IPVPN added support for VPN
   services that are deployed by the vast majority of service providers.
   These VPN services added yet another label, bringing the label stack
   depth (when FRR is active) to four.

   Pseudowires and VPN are discussed in further detail in Sections 2.1.8
   and 2.1.9.

   MPLS hierarchy as described in [RFC4206] and updated by [RFC7074] can
   in principle add at least one additional label.  MPLS hierarchy is
   discussed in Section 2.1.6.

   Other features such as Entropy Label (discussed in Section 2.4.4) and
   Flow Label (discussed in Section 2.4.3) can add additional labels to
   the label stack.

   Although theoretical scenarios can easily result in eight or more
   labels, such cases are rare if they occur at all today.  For the
   purpose of forwarding, only the top label needs to be examined if PHP
   is used, and a few more if UHP is used (see Section 2.5).  For deep
   label stacks, quite a few labels may have to be examined for the
   purpose of load balancing across parallel links (see Section 2.4);
   however, this depth can be bounded by a provider through use of
   Entropy Label.

   Other creative uses of MPLS within the IETF, such as the use of MPLS
   label stack in source routing, may result in label stacks that are
   considerably deeper than those encountered today.

2.1.5. MPLS Link Bundling

MPLS Link Bundling was the first RFC to address the need for multiple parallel links between nodes [RFC4201]. MPLS Link Bundling is notable in that it tried not to change MPLS forwarding, except in
   specifying the "all-ones" component link.  MPLS Link Bundling is
   seldom if ever deployed.  Instead, multipath techniques described in
   Section 2.4 are used.

2.1.6. MPLS Hierarchy

MPLS hierarchy is defined in [RFC4206] and updated by [RFC7074]. Although RFC 4206 is considered part of GMPLS, the Packet Switching Capable (PSC) portion of the MPLS hierarchy is applicable to MPLS and may be supported in an otherwise GMPLS-free implementation. The MPLS PSC hierarchy remains the most likely means of providing further scaling in an RSVP-TE MPLS network, particularly where the network is designed to provide RSVP-TE connectivity to the edges. This is the case for envisioned MPLS-TP networks. The use of the MPLS PSC hierarchy can add at least one additional label to a label stack, though it is likely that only one layer of PSC will be used in the near future.

2.1.7. MPLS Fast Reroute (FRR)

Fast reroute is defined by [RFC4090]. Two significantly different methods are defined in RFC 4090: the "One-to-One Backup" method, which uses the "Detour LSP", and the "Facility Backup", which uses a "bypass tunnel". These are commonly referred to as the detour and bypass methods, respectively.

The detour method makes use of a presignaled LSP. Hardware assistance may be needed for detour FRR in order to accomplish local repair of a large number of LSPs within the target of tens of milliseconds. For each affected LSP, a swap operation must be reprogrammed or otherwise switched over. The use of detour FRR doubles the number of LSPs terminating at any given hop and will increase the number of LSPs within a network by a factor dependent on the average detour path length.

The bypass method makes use of a tunnel that is unused when no fault exists but may carry many LSPs when a local repair is required. There is no presignaling indicating which working LSP will be diverted into any specific bypass LSP. If interface label space is used, the bypass LSP MUST extend one hop beyond the merge point, except if the merge point is the egress and PHP is used. If the bypass LSPs are not extended in this way, then the merge LSR (egress LSR of the bypass LSP) MUST use platform label space (as defined in [RFC3031]) so that an LSP working path on any given interface can be backed up using a bypass LSP terminating on any other interface.

Hardware assistance may be needed to accomplish local repair of a large number of LSPs within the target of tens of milliseconds. For each affected LSP a swap operation must be reprogrammed or otherwise
   switched over with an additional push of the bypass LSP label.  The
   use of platform label space impacts the size of the LSR ILM for an
   LSR with a very large number of interfaces.
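
The bypass local repair described above (a swap followed by an additional push of the bypass LSP label) can be sketched as below. The function and table names are hypothetical. The sketch assumes a next-hop bypass whose merge point is the next hop and uses platform label space, so the swapped label is the same whether or not the repair is active.

```python
def forward(stack, ilm, failed_links, bypass):
    """Swap the top label; if the outgoing link has failed, push the bypass label.

    stack:  label values, top first
    ilm:    incoming label -> (outgoing label, outgoing link)
    bypass: protected link -> (bypass LSP label, link carrying the bypass LSP)
    """
    out_label, link = ilm[stack[0]]
    new_stack = [out_label] + stack[1:]          # normal swap
    if link in failed_links:
        bypass_label, link = bypass[link]
        new_stack = [bypass_label] + new_stack   # additional push for local repair
    return new_stack, link
```

Switching many LSPs at once then reduces to marking a link as failed rather than rewriting one entry per LSP, which is one way the time target might be met.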

   IP/LDP Fast Reroute (IP/LDP FRR) [RFC5714] is also applicable in MPLS
   networks.  ECMP and Loop-Free Alternates (LFAs) [RFC5286] are well-
   established IP/LDP FRR techniques and were the first methods to be
   widely deployed.  Work on IP/LDP FRR is ongoing within the IETF
   RTGWG.  Two topics actively discussed in RTGWG are microloops and
   partial coverage of the established techniques in some network
   topologies.  [RFC5715] covers the topic of IP/LDP Fast Reroute
   microloops and microloop prevention.  RTGWG has developed additional
   IP/LDP FRR techniques to handle coverage concerns.  RTGWG is
   extending LFA through the use of remote LFA [REMOTE-LFA].  Other
   techniques that require new forwarding paths to be established are
   also under consideration, including the IPFRR "not-via" technique
   defined in [RFC6981] and maximally redundant trees (MRT) [MRT].
   ECMP, LFA (but not remote LFA), and MRT swap the top label to an
   alternate MPLS label.  The other methods operate in a similar manner
   to the facility backup described in RFC 4090 and push an additional
   label.  IP/LDP FRR methods that push more than one label have been
   suggested but are in early discussion.

2.1.8. Pseudowire Encapsulation

The pseudowire (PW) architecture is defined in [RFC3985]. A pseudowire, when carried over MPLS, adds one or more additional label entries to the MPLS label stack. A PW Control Word is defined in [RFC4385] with motivation for defining the Control Word in [RFC4928]. The PW Associated Channel defined in [RFC4385] is used for OAM in [RFC5085]. The PW Flow Label is defined in [RFC6391] and is discussed further in this document in Section 2.4.3.

There are numerous pseudowire encapsulations, supporting emulation of services such as Frame Relay, ATM, Ethernet, TDM, and SONET/SDH over packet switched networks (PSNs) using IP or MPLS.

The pseudowire encapsulation is out of scope for this document. Pseudowire impact on MPLS forwarding at the midpoint LSR is within scope. The impact on ingress MPLS push and egress MPLS UHP pop is within scope. While pseudowire encapsulation is out of scope, some advice is given on Sequence Number support.

2.1.8.1. Pseudowire Sequence Number

Pseudowire (PW) Sequence Number support is most important for PW payload types with a high expectation of lossless and/or in-order delivery. Identifying lost PW packets and the exact amount of lost
   payload is critical for PW services that maintain bit timing, such as
   Time Division Multiplexing (TDM) services since these services MUST
   compensate lost payload on a bit-for-bit basis.

   With PW services that maintain bit timing, packets that have been
   received out of order also MUST be identified and MAY be either
   reordered or dropped.  Resequencing requires, in addition to sequence
   numbering, a "reorder buffer" in the egress PE, and the ability to
   reorder is limited by the depth of this buffer.  The down side of
   maintaining a large reorder buffer is added end-to-end service delay.
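
A reorder buffer of bounded depth might look like the sketch below. The class and method names are hypothetical, and the sketch ignores sequence number wraparound, which a real PW implementation would have to handle.

```python
class ReorderBuffer:
    """Release packets in sequence order, holding at most `depth` out-of-order packets."""

    def __init__(self, depth, first_seq=1):
        self.depth = depth
        self.expected = first_seq
        self.pending = {}            # seq -> payload, packets that arrived early

    def receive(self, seq, payload):
        """Return a list of (seq, payload) released in order; payload None marks a loss."""
        if seq < self.expected or seq in self.pending:
            return []                # late or duplicate packet: drop it
        self.pending[seq] = payload
        released = []
        while self.pending:
            if self.expected in self.pending:
                released.append((self.expected, self.pending.pop(self.expected)))
            elif len(self.pending) > self.depth:
                released.append((self.expected, None))   # stop waiting, count as lost
            else:
                break                # keep waiting for the missing packet
            self.expected += 1
        return released
```

A deeper buffer recovers from larger reordering but holds early packets longer, which is the added delay the text refers to.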

   For PW services that maintain bit timing or any other service where
   jitter must be bounded, a jitter buffer is always necessary.  The
   jitter buffer is needed regardless of whether reordering is done.  In
   order to be effective, a reorder buffer must often be larger than a
   jitter buffer needs to be, thus creating a tradeoff between reducing
   loss and minimizing delay.

   PW services that are not timing critical bit streams in nature are
   cell oriented or frame oriented.  Though resequencing support may be
   beneficial to PW cell- and frame-oriented payloads such as ATM, FR,
   and Ethernet, this support is desirable but not required.
   Requirements to handle out-of-order packets at all vary among
   services and deployments.  For example, for Ethernet PW, occasional
   (very rare) reordering is usually acceptable.  If the Ethernet PW is
   carrying MPLS-TP, then this reordering may be unacceptable.

   Reducing jitter is best done by an end-system, given that the
   tradeoff of loss vs. delay varies among services.  For example, with
   interactive real-time services, low delay is preferred, while with
   non-interactive (one-way) real-time services, low loss is preferred.
   The same end-site may be receiving both types of traffic.  Regardless
   of this, bounded jitter is sometimes a requirement for specific
   deployments.

   Packet reordering should be rare except in a small number of
   circumstances, most of which are due to network design or equipment
   design errors:

   1.  The most common case is where reordering is rare, occurring only
       when a network or equipment fault forces traffic on a new path
       with different delay.  The packet loss that accompanies a network
       or equipment fault is generally more disruptive than any
       reordering that may occur.
   2.  A path change can be caused by reasons other than a network or
       equipment fault, such as an administrative routing change.  This
       may result in packet reordering but generally without any packet
       loss.

   3.  If the edge is not using pseudowire Control Word (CW) and the
       core is using multipath, reordering will be far more common.  If
       this is occurring, using CW on the edge will solve the problem.
       Without CW, resequencing is not possible since the Sequence
       Number is contained in the CW.

   4.  Another avoidable case is where some core equipment has multipath
       and for some reason insists on periodically installing a new
       random number as the multipath hash seed.  If supporting MPLS-TP,
       equipment MUST provide a means to disable periodic hash
       reseeding, and deployments MUST disable periodic hash reseeding.
       Operator experience dictates that even if not supporting MPLS-TP,
       equipment SHOULD provide a means to disable periodic hash
       reseeding, and deployments SHOULD disable periodic hash
       reseeding.

   In provider networks that use multipath techniques and that may
   occasionally rebalance traffic or that may change PW paths
   occasionally for other reasons, reordering may be far more common
   than loss.  Where reordering is more common than loss, resequencing
   packets is beneficial, rather than dropping packets at egress when
   out-of-order arrival occurs.  Resequencing is most important for PW
   payload types with a high expectation of lossless delivery since in
   such cases out-of-order delivery within the network results in PW
   loss.

2.1.9. Layer 2 and Layer 3 VPN

Layer 2 VPN [RFC4664] and Layer 3 VPN [RFC4110] add one or more label entries to the MPLS label stack. VPN encapsulations are out of scope for this document. Their impact on forwarding at the midpoint LSR is within scope. Any of these services may be used on an ingress and egress that are MPLS Entropy Label enabled (see Section 2.4.4 for discussion of Entropy Label); this would add an additional two labels to the MPLS label stack. The need to provide a useful Entropy Label value impacts the requirements of the VPN ingress LER but is out of scope for this document.

2.2. MPLS Multicast

MPLS Multicast encapsulation is clarified in [RFC5332]. MPLS Multicast may be signaled using RSVP-TE [RFC4875] or LDP [RFC6388].

[RFC4875] defines a root-initiated RSVP-TE LSP setup rather than the leaf-initiated join used in IP multicast. [RFC6388] defines a leaf-initiated LDP setup. Both [RFC4875] and [RFC6388] define point-to-multipoint (P2MP) LSP setup. [RFC6388] also defines multipoint-to-multipoint (MP2MP) LSP setup.

The P2MP LSPs have a single source. An LSR may be a leaf node, an intermediate node, or a "bud" node. A bud serves as both a leaf and intermediate. At a leaf, an MPLS pop is performed. The payload may be an IP multicast packet that requires further replication. At an intermediate node, an MPLS swap operation is performed. The bud requires that both a pop operation and a swap operation be performed for the same incoming packet.

One strategy to support P2MP functionality is to pop at the LSR interface serving as ingress to the P2MP traffic and then optionally push labels at each LSR interface serving as egress to the P2MP traffic at that same LSR. A given LSR egress chip may support multiple egress interfaces, each of which requires a copy, but each with a different set of added labels and Layer 2 encapsulation. Some physical interfaces may have multiple sub-interfaces (such as Ethernet VLAN or channelized interfaces), each requiring a copy.

If packet replication is performed at LSR ingress, then the ingress interface performance may suffer. If the packet replication is performed within an LSR switching fabric and at LSR egress, congestion of egress interfaces cannot make use of backpressure to ingress interfaces using techniques such as virtual output queuing (VOQ). If buffering is primarily supported at egress, then the need for backpressure is minimized. There may be no good solution for high volumes of multicast traffic if VOQ is used.

Careful consideration should be given to the performance characteristics of high-fanout multicast for equipment that is intended to be used in such a role.

MP2MP LSPs differ in that any branch may provide an input, including a leaf. Packets must be replicated onto all other branches. This forwarding is often implemented as multiple P2MP forwarding trees, one for each potential input interface at a given LSR.
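
The leaf, intermediate, and bud behaviors for P2MP can be sketched with the pop-then-push strategy mentioned above. The function name and the "local" interface marker are hypothetical, and Layer 2 encapsulation and sub-interfaces are left out.

```python
def replicate(stack, branches, deliver_locally):
    """Return (interface, label stack) copies for one incoming P2MP packet.

    stack:           incoming label values, top first
    branches:        list of (egress interface, outgoing label); empty at a leaf
    deliver_locally: True at a leaf or bud node
    """
    rest = stack[1:]                              # pop at the ingress interface
    copies = [(intf, [label] + rest) for intf, label in branches]  # push per egress
    if deliver_locally:
        copies.append(("local", rest))            # leaf or bud: payload goes up the stack
    return copies
```

A leaf has no branches, an intermediate node has branches only, and a bud has both, so the same incoming packet yields a popped copy and swapped copies.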

2.3. Packet Rates

While average packet size of Internet traffic may be large, long sequences of small packets have both been predicted in theory and observed in practice. Traffic compression and TCP ACK compression can conspire to create long sequences of packets of 40-44 bytes in payload length. If carried over Ethernet, the 64-byte minimum payload applies, yielding a packet rate of approximately 150 Mpps (million packets per second) for the duration of the burst on a nominal 100 Gb/s link. The peak rate for other encapsulations can be as high as 250 Mpps (for example, when IP or MPLS is encapsulated using GFP over OTN ODU4).

It is possible that the packet rates achieved by a specific implementation are acceptable for a minimum payload size, such as a 64-byte (64B) payload for Ethernet, but the achieved rate declines to an unacceptable level for other packet sizes, such as a 65B payload. There are other packet rates of interest besides TCP ACK. For example, a TCP ACK carried over an Ethernet PW over MPLS over Ethernet may occupy 82B or 82B plus an increment of 4B if additional MPLS labels are present.

A graph of packet rate vs. packet size often displays a sawtooth. The sawtooth is commonly due to a memory bottleneck and memory widths, sometimes an internal cache, but often a very wide external buffer memory interface. In some cases, it may be due to a fabric transfer width. A fine packing, rounding up to the nearest 8B or 16B, will result in a fine sawtooth with small degradation for 65B, and even less for 82B packets. A coarse packing, rounding up to 64B, can yield a sharper drop in performance for 65B packets, or perhaps more important, a larger drop for 82B packets.

The loss of some TCP ACK packets is not the primary concern when such a burst occurs. When a burst occurs, any other packets, regardless of packet length and packet QoS, are dropped once on-chip input buffers prior to the decision engine are exceeded. Buffers in front of the packet decision engine are often very small or nonexistent (less than one packet of buffer), causing significant QoS-agnostic packet drop.

Internet service providers and content providers at one time specified full rate forwarding with 40-byte payload packets as a requirement. Today, this requirement often can be waived if the provider can be convinced that when long sequences of short packets occur no packets will be dropped.
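
The packet rate figure and the sawtooth effect discussed in this section can be reproduced with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 20 bytes of per-frame overhead is the Ethernet preamble plus inter-frame gap, and the packing model (each packet rounded up to a whole number of memory or fabric words) is a simplifying assumption.

```python
def ethernet_pps(link_bps, frame_bytes, overhead_bytes=20):
    """Packets per second on Ethernet; overhead is preamble plus inter-frame gap."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8)

def padded_size(packet_bytes, width_bytes):
    """Bytes of memory or fabric bandwidth used when rounding up to the width."""
    return -(-packet_bytes // width_bytes) * width_bytes

def packing_efficiency(packet_bytes, width_bytes):
    """Fraction of the transferred bytes that are real packet data."""
    return packet_bytes / padded_size(packet_bytes, width_bytes)

rate_100g = ethernet_pps(100e9, 64)   # close to the 150 Mpps quoted in the text
```

Under this model a 65B packet uses 128B of a 64B-wide path but only 80B of a 16B-wide path, which is the difference between the coarse and fine sawtooth.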
   Many equipment suppliers have pointed out that the extra cost in
   designing hardware capable of processing the minimum size packets at
   full line rate is significant for very-high-speed interfaces.  If
   hardware is not capable of processing the minimum size packets at
   full line rate, then that hardware MUST be capable of handling large
   bursts of small packets, a condition that is often observed.  This
   level of performance is necessary to meet Differentiated Services
   [RFC2475] requirements; without it, packets are lost prior to
   inspection of the IP DSCP field [RFC2474] or MPLS TC field [RFC5462].

   With adequate on-chip buffers before the packet decision engine, an
   LSR can absorb a long sequence of short packets.  Even if the output
   is slowed to the point where light congestion occurs, the packets,
   having cleared the decision process, can make use of larger VOQ or
   output side buffers and be dealt with according to configured QoS
   treatment, rather than dropped completely at random.

   The buffering before the packet decision engine should be arranged
   such that 1) it can hold a relatively large number of small packets,
   2) it can hold a small number of large packets, and 3) it can hold a
   mix of packets of different sizes.

   These on-chip buffers need not contribute significant delay since
   they are only used when the packet decision engine is unable to keep
   up, not in response to congestion, plus these buffers are quite
   small.  For example, an on-chip buffer capable of handling 4K packets
   of 64 bytes in length, or 256KB, corresponds to 200 microseconds on a
   10 Gb/s link and 20 microseconds on a 100 Gb/s link.  If the packet
   decision engine is capable of handling packets at 90% of the full
   rate for small packets, then the maximum added delay is 20
   microseconds and 2 microseconds, respectively, and this delay only
   applies if a 4K burst of short packets occurs.  When no burst of
   short packets is being processed, no delay is added.  These buffers
   are only needed on high-speed interfaces where it is difficult to
   process small packets at full line rate.
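
The delay figures in the example above follow from simple arithmetic, sketched here with hypothetical function names. The text rounds 256KB to roughly 200 microseconds at 10 Gb/s; the exact value is slightly higher. The added-delay estimate uses the same simple model as the text: the shortfall of the decision engine multiplied by the time the burst occupies the link.

```python
def burst_time_us(packets, packet_bytes, link_bps):
    """Time in microseconds that a burst of packets occupies the link."""
    return packets * packet_bytes * 8 / link_bps * 1e6

def added_delay_us(packets, packet_bytes, link_bps, engine_fraction):
    """Estimated worst-case added delay when the engine runs at a fraction of line rate."""
    return burst_time_us(packets, packet_bytes, link_bps) * (1 - engine_fraction)

delay_10g = added_delay_us(4096, 64, 10e9, 0.9)     # about 21 microseconds
delay_100g = added_delay_us(4096, 64, 100e9, 0.9)   # about 2.1 microseconds
```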

   Packet rate requirements apply regardless of which network tier the
   equipment is deployed in.  Whether deployed in the network core or
   near the network edges, one of the two conditions MUST be met if
   Differentiated Services requirements are to be met:

   1.  Packets must be processed at full line rate with minimum-sized
       packets.  -OR-

   2.  Packets must be processed at a rate well under generally accepted
       average packet sizes, with sufficient buffering prior to the
       packet decision engine to accommodate long bursts of small
       packets.

2.4. MPLS Multipath Techniques

In any large provider network, whether that of a service provider or a content provider, hash-based multipath techniques are used in the core and at the edge. In many of these providers, hash-based multipath is also used in the larger metro networks.

For good reason, the Differentiated Services requirements dictate that packets within a common microflow SHOULD NOT be reordered [RFC2474]. Service providers generally impose stronger requirements, commonly requiring that packets within a microflow MUST NOT be reordered except in rare circumstances such as load balancing across multiple links, path change for load balancing, or path change for other reasons.

The most common multipath techniques are ECMP applied at the IP forwarding level, Ethernet Link Aggregation Group (LAG) with inspection of the IP payload, and multipath on links carrying both IP and MPLS, where the IP header is inspected below the MPLS label stack. In most core networks, the vast majority of traffic is MPLS encapsulated.

In order to support an adequately balanced load distribution across multiple links, IP header information must be used. Common practice today is to reinspect the IP headers at each LSR and use the label stack and IP header information in a hash performed at each LSR. Further details are provided in Section 2.4.5.

The use of this technique is so ubiquitous in provider networks that lack of support for multipath makes any product unsuitable for use in large core networks. This will continue to be the case in the near future, even as deployment of the MPLS Entropy Label begins to relax the core LSR multipath performance requirements, given the existing deployed base of edge equipment without the ability to add an Entropy Label.

A generation of edge equipment supporting the ability to add an MPLS Entropy Label is needed before the performance requirements for core LSRs can be relaxed. However, it is likely that two generations of deployment in the future will allow core LSRs to support full packet rate only when a relatively small number of MPLS labels need to be inspected before hashing. For now, don't count on it.

Common practice today is to reinspect the packet at each LSR and use information from the packet combined with a hash seed that is selected by each LSR. Where Flow Labels or Entropy Labels are used, a hash seed must be used when creating these labels.
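
A seeded hash over the label stack and IP addresses, followed by a simple modulo allocation, can be sketched as below. The function name is hypothetical, and CRC-32 is used only because it is readily available; the text does not specify a hash function, and deployed equipment may use something quite different.

```python
import ipaddress
import struct
import zlib

def select_link(labels, src_ip, dst_ip, seed, n_links):
    """Pick a component link from the label stack, IP addresses, and a per-LSR seed."""
    key = struct.pack("!I", seed)
    key += b"".join(struct.pack("!I", label) for label in labels)
    key += ipaddress.ip_address(src_ip).packed
    key += ipaddress.ip_address(dst_ip).packed
    return zlib.crc32(key) % n_links

# All packets of a microflow carry the same inputs, so they stay on one link.
link = select_link([1001, 2002], "192.0.2.1", "198.51.100.7", seed=12345, n_links=4)
```

Keeping the seed fixed keeps each microflow on the same link; changing it moves flows and can reorder packets in flight.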

2.4.1. Pseudowire Control Word

Within the core of a network, some form of multipath is almost certain to be used. Multipath techniques deployed today are likely to be looking beneath the label stack for an opportunity to hash on IP addresses. A pseudowire encapsulated at a network edge must have a means to prevent reordering within the core if the pseudowire will be crossing a network core, or any part of a network topology where multipath is used (see [RFC4385] and [RFC4928]). Not supporting the ability to encapsulate a pseudowire with a Control Word may lock a product out from consideration. A pseudowire capability without Control Word support might be sufficient for applications that are strictly both intra-metro and low bandwidth. However, a provider with other applications will very likely not tolerate having equipment that can only support a subset of their pseudowire needs.

2.4.2. Large Microflows

Where multipath makes use of a simple hash and simple load balance such as modulo or other fixed allocation (see Section 2.4), large microflows may be present that each consume 10% of the capacity of a component link of a potentially congested composite link. One such microflow can upset the traffic balance, and more than one can reduce the effective capacity of the entire composite link by more than 10%.

When even a very small number of large microflows are present, there is a significant probability that more than one of these large microflows could fall on the same component link. If the traffic contribution from large microflows is small, the probability for three or more large microflows on the same component link drops significantly. Therefore, in a network where a significant number of parallel 10 Gb/s links exist, even a 1 Gb/s pseudowire or other large microflow that could not otherwise be subdivided into smaller flows should carry a Flow Label or Entropy Label if possible.

Active management of the hash space to better accommodate large microflows has been implemented and deployed in the past; however, such techniques are out of scope for this document.
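
The collision probability mentioned above can be estimated with a short calculation. The sketch below assumes, for illustration only, that the hash places each large microflow on a component link uniformly and independently; under that assumption this is the familiar birthday-problem computation. The function name is hypothetical.

```python
from math import perm

def p_shared_link(n_links: int, n_large_flows: int) -> float:
    """Probability that at least two large microflows hash to the same link."""
    if n_large_flows > n_links:
        return 1.0  # pigeonhole: some link must carry two or more
    # Fraction of placements in which every large flow gets its own link.
    all_distinct = perm(n_links, n_large_flows) / n_links ** n_large_flows
    return 1.0 - all_distinct

for k in (2, 3, 4):
    print(k, p_shared_link(8, k))
```

With 8 component links, the probability of a shared link is 0.125 for two large microflows, about 0.34 for three, and about 0.59 for four, which illustrates why even a very small number of large microflows matters.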

2.4.3. Pseudowire Flow Label

Unlike a pseudowire Control Word, a pseudowire Flow Label [RFC6391] is required only for pseudowires that have a relatively large capacity. There are many cases where a pseudowire Flow Label makes sense. Any service such as a VPN that carries IP traffic within a pseudowire can make use of a pseudowire Flow Label.

Any pseudowire carried over MPLS that makes use of the pseudowire Control Word and does not carry a Flow Label is in effect a single microflow (in the terms defined in [RFC2475]) and may result in the types of problems described in Section 2.4.2.

2.4.4. MPLS Entropy Label

The MPLS Entropy Label simplifies flow group identification [RFC6790] at midpoint LSRs. Prior to the MPLS Entropy Label, midpoint LSRs needed to inspect the entire label stack and often the IP headers to provide an adequate distribution of traffic when using multipath techniques (see Section 2.4.5). With the use of the MPLS Entropy Label, a hash can be performed closer to network edges, placed in the label stack, and used by midpoint LSRs without fully reinspecting the label stack and inspecting the payload.

The MPLS Entropy Label is capable of avoiding full label stack and payload inspection within the core where performance levels are most difficult to achieve (see Section 2.3). The label stack inspection can be terminated as soon as the first Entropy Label is encountered, which is generally after a small number of labels are inspected.

In order to provide these benefits in the core, an LSR closer to the edge must be capable of adding an Entropy Label. This support may not be required in the access tier, the tier closest to the customer, but is likely to be required in the edge or the border to the network core. An LSR peering with external networks will also need to be able to add an Entropy Label on incoming traffic.

2.4.5. Fields Used for Multipath Load Balance

The most common multipath techniques are based on a hash over a set of fields. Regardless of whether a hash is used or some other method is used, there is a limited set of fields that can safely be used for multipath.

2.4.5.1. MPLS Fields in Multipath

If the "outer" or "first" layer of encapsulation is MPLS, then label stack entries are used in the hash. Within a finite amount of time (and for small packets arriving at high speed, that time can be quite limited), only a finite number of label entries can be inspected. Pipelined or parallel architectures improve this, but the limit is still finite.

The following guidelines are provided for use of MPLS fields in multipath load balancing.

   1.  Only the 20-bit label field SHOULD be used.  The TTL field SHOULD
       NOT be used.  The S bit MUST NOT be used.  The TC field (formerly
       EXP) MUST NOT be used.  See text following this list for reasons.

   2.  If an ELI label is found, then if the LSR supports Entropy
       Labels, the EL label field in the next label entry (the EL)
       SHOULD be used, label entries below that label SHOULD NOT be
       used, and the MPLS payload SHOULD NOT be used.  See below this
       list for reasons.

   3.  Special-purpose labels (label values 0-15) MUST NOT be used.
       Extended special-purpose labels (any label following label 15)
       MUST NOT be used.  In particular, GAL and RA MUST NOT be used so
       that OAM traffic follows the same path as payload packets with
       the same label stack.

   4.  If a new special-purpose label or extended special-purpose label
       is defined that requires special load-balance processing, then,
       as is the case for the ELI label, a special action may be needed
       rather than skipping the special-purpose label or extended
       special-purpose label.

   5.  The most entropy is generally found in the label stack entries
       near the bottom of the label stack (innermost label, closest to
       S=1 bit).  If the entire label stack cannot be used (or entire
       stack up to an EL), then it is better to use as many labels as
       possible closest to the bottom of stack.

   6.  If no ELI is encountered, and the first nibble of payload
       contains a 4 (IPv4) or 6 (IPv6), an implementation SHOULD support
       the ability to interpret the payload as IPv4 or IPv6 and extract
       and use appropriate fields from the IP headers.  This feature is
       considered a nonnegotiable requirement by many service providers.
       If supported, there MUST be a way to disable it (if, for example,
       PW without CW are used).  This ability to disable this feature is
       considered a nonnegotiable requirement by many service providers.
       Therefore, an implementation has a very strong incentive to
       support both options.

   7.  A label that is popped at egress (UHP pop) SHOULD NOT be used.  A
       label that is popped at the penultimate hop (PHP pop) SHOULD be
       used.

   Apparently, some chips have made use of the TC (formerly EXP) bits as
   a source of entropy.  This is very harmful since it will reorder
   Assured Forwarding (AF) traffic [RFC2597] when a subset does not
   conform to the configured rates and is remarked but not dropped at a
   prior LSR.  Traffic that uses MPLS ECN [RFC5129] can also be
   reordered if TC is used for entropy.  Therefore, as stated in the
   guidelines above, the TC field (formerly EXP) MUST NOT be used in
   multipath load balancing as it violates Differentiated Services
   Ordered Aggregate (OA) requirements in these two instances.

   Use of the MPLS label entry S bit would result in putting OAM traffic
   on a different path if the addition of a GAL at the bottom of stack
   removed the S bit from the prior label.

   If an ELI label is found, then if the LSR supports Entropy Labels,
   the EL label field in the next label entry (the EL) SHOULD be used,
   and the search for additional entropy within the packet SHOULD be
   terminated.  Failure to terminate the search will impact client MPLS-
   TP LSPs carried within server MPLS LSPs.  A network operator has the
   option to use administrative attributes as a means to identify LSRs
   that do not terminate the entropy search at the first EL.
   Administrative attributes are defined in [RFC3209].  Some
   configuration is required to support this.

   If the label removed by a PHP pop is not used, then for any PW for
   which CW is used, there is no basis for multipath load split.  In
   some networks, it is infeasible to put all PW traffic on one
   component link.  Any PW that does not use CW will be improperly
   split, regardless of whether the label removed by a PHP pop is used.
   Therefore, the PHP pop label SHOULD be used as recommended above.
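
   The label selection guidelines in this section can be sketched in
   software as follows.  This is a minimal illustration, not a
   description of any forwarding hardware.  The function names and the
   max_labels parameter are hypothetical.  The label stack entry layout
   follows [RFC3032], and the ELI value of 7 is taken from [RFC6790].
   Guideline 4 (future special-purpose labels needing special handling)
   and guideline 7 (UHP versus PHP labels) are not modeled.

```python
import struct
from typing import List, Optional, Tuple

ELI = 7   # Entropy Label Indicator
XL = 15   # extension label; the entry after it is an extended special-purpose label

def encode_entry(label: int, tc: int = 0, s: int = 0, ttl: int = 64) -> bytes:
    """Pack one label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    return struct.pack("!I", (label << 12) | (tc << 9) | (s << 8) | ttl)

def parse_labels(data: bytes) -> Tuple[List[int], int]:
    """Return the 20-bit label values and the byte offset of the MPLS payload."""
    labels: List[int] = []
    offset = 0
    while True:
        if offset + 4 > len(data):
            raise ValueError("label stack truncated before bottom of stack")
        (entry,) = struct.unpack_from("!I", data, offset)
        offset += 4
        labels.append(entry >> 12)   # TC, S, and TTL never reach the hash input
        if entry & 0x100:            # S bit is only used to find the end of the stack
            return labels, offset

def select_label_fields(labels: List[int], supports_el: bool = True,
                        max_labels: Optional[int] = None) -> Tuple[List[int], bool]:
    """Return (labels to hash, whether the MPLS payload may also be inspected)."""
    def trim(fields: List[int]) -> List[int]:
        # Guideline 5: when limited, keep the labels closest to the bottom of stack.
        if max_labels is None:
            return fields
        return fields[-max_labels:] if max_labels > 0 else []

    fields: List[int] = []
    i = 0
    while i < len(labels):
        label = labels[i]
        if label == ELI and supports_el and i + 1 < len(labels):
            # Guideline 2: use the EL, then stop; no deeper labels, no payload.
            fields.append(labels[i + 1])
            return trim(fields), False
        if label == XL:
            i += 2                   # skip label 15 and the extended label after it
        elif label < 16:
            i += 1                   # guideline 3: skip special-purpose labels
        else:
            fields.append(label)
            i += 1
    return trim(fields), True

stack = encode_entry(1000) + encode_entry(ELI) + encode_entry(0xABCDE, s=1)
labels, payload_offset = parse_labels(stack)
fields, may_inspect_payload = select_label_fields(labels)
```

   In this example, fields holds the transport label and the EL, and
   may_inspect_payload is False, so a midpoint LSR would stop looking
   for entropy at that point.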

2.4.5.2. IP Fields in Multipath

Inspecting the IP payload provides the most entropy in provider networks. The practice of looking past the bottom of stack label for an IP payload is well accepted and documented in [RFC4928] and in other RFCs.

Where IP is mentioned in the document, both IPv4 and IPv6 apply. All LSRs MUST fully support IPv6.

   When information in the IP header is used, the following guidelines
   apply:

   1.  Both the IP source address and IP destination address SHOULD be
       used.  There MAY be an option to reverse the order of these
       addresses, improving the ability to provide symmetric paths in
       some cases.  Many service providers require that both addresses
       be used.

   2.  Implementations SHOULD allow inspection of the IP protocol field
       and use of the UDP or TCP port numbers.  For many service
       providers, this feature is considered mandatory, particularly for
       enterprise, data center, or edge equipment.  If this feature is
       provided, it SHOULD be possible to disable use of TCP and UDP
       ports.  Many service providers consider it a nonnegotiable
       requirement that use of UDP and TCP ports can be disabled.
       Therefore, there is a strong incentive for implementations to
       provide both options.

   3.  Equipment suppliers MUST NOT make assumptions that because the IP
       version field is equal to 4 (an IPv4 packet) that the IP protocol
       will either be TCP (IP protocol 6) or UDP (IP protocol 17) and
       blindly fetch the data at the offset where the TCP or UDP ports
       would be found.  With IPv6, TCP and UDP port numbers are not at
       fixed offsets.  With IPv4 packets carrying IP options, TCP and
       UDP port numbers are not at fixed offsets.

   4.  The IPv6 header flow field SHOULD be used.  This is the explicit
       purpose of the IPv6 flow field; however, observed flow fields
       rarely contain a non-zero value.  Some uses of the flow field
       have been defined, such as [RFC6438].  In the absence of MPLS
       encapsulation, the IPv6 flow field can serve a role equivalent to
       the Entropy Label.

   5.  Support for other protocols that share a common Layer 4 header
       such as RTP [RFC3550], UDP-Lite [RFC3828], SCTP [RFC4960], and
       DCCP [RFC4340] SHOULD be provided, particularly for edge or
       access equipment where additional entropy may be needed.
       Equipment SHOULD also use RTP, UDP-Lite, SCTP, and DCCP headers
       when creating an Entropy Label.

   6.  The following IP header fields should not or must not be used:

       A.  Similar to avoiding TC in MPLS, the IP DSCP and ECN bits
           MUST NOT be used.

       B.  The IPv4 TTL or IPv6 Hop Limit SHOULD NOT be used.
       C.  Note that the IP TOS field was deprecated.  ([RFC0791] was
           updated by [RFC2474].)  No part of the IP DSCP field can be
           used (formerly IP PREC and IP TOS bits).

   7.  Some IP encapsulations support tunneling, such as IP-in-IP, GRE,
       L2TPv3, and IPsec.  These provide a greater source of entropy
       that some provider networks carrying large amounts of tunneled
       traffic may need, for example, as used in [RFC5640] for GRE and
       L2TPv3.  The use of tunneling header information is out of scope
       for this document.

   This document makes the following recommendations.  These
   recommendations are not required to claim compliance to any existing
   RFC; therefore, implementers are free to ignore them, but due to
   service provider requirements should consider the risk of doing so.
   The use of IP addresses MUST be supported, and TCP and UDP ports
   (conditional on the protocol field and properly located) MUST be
   supported.  The ability to disable use of UDP and TCP ports MUST be
   available.

   Though potentially very useful in some networks, it is uncommon to
   support using payloads of tunneling protocols carried over IP.
   Though the use of tunneling protocol header information is out of
   scope for this document, it is not discouraged.
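
   The IP field guidelines in this section can be sketched as follows.
   This is an illustration under stated assumptions rather than a
   reference implementation.  The function name, the return layout, and
   the use_ports and symmetric options are hypothetical.  Only the
   IPv6 Hop-by-Hop, Routing, and Destination Options extension headers
   are walked; fragmentation, the other Layer 4 protocols in guideline
   5, and tunneling headers are not handled.

```python
import struct
from typing import Optional, Tuple

TCP, UDP = 6, 17
# IPv6 extension headers that share the (next header, length) layout:
# Hop-by-Hop Options, Routing, Destination Options.
IPV6_SKIPPABLE = {0, 43, 60}

def ip_fields(payload: bytes, use_ports: bool = True,
              symmetric: bool = False) -> Optional[Tuple[int, ...]]:
    """Return (addr_a, addr_b, port_a, port_b, flow_label), or None if not usable IP."""
    if not payload:
        return None
    version = payload[0] >> 4
    flow_label = 0
    if version == 4:
        header_len = (payload[0] & 0x0F) * 4   # IHL accounts for any IP options
        if header_len < 20 or len(payload) < header_len:
            return None
        proto = payload[9]
        src = int.from_bytes(payload[12:16], "big")
        dst = int.from_bytes(payload[16:20], "big")
        l4 = header_len
    elif version == 6:
        if len(payload) < 40:
            return None
        # Low 20 bits of bytes 1-3; the traffic class (DSCP/ECN) bits are masked off.
        flow_label = int.from_bytes(payload[1:4], "big") & 0xFFFFF
        proto = payload[6]
        src = int.from_bytes(payload[8:24], "big")
        dst = int.from_bytes(payload[24:40], "big")
        l4 = 40
        while proto in IPV6_SKIPPABLE and len(payload) >= l4 + 8:
            proto, l4 = payload[l4], l4 + (payload[l4 + 1] + 1) * 8
    else:
        return None
    sport = dport = 0
    # Ports are read only when the protocol really is TCP or UDP, at the computed offset.
    if use_ports and proto in (TCP, UDP) and len(payload) >= l4 + 4:
        sport, dport = struct.unpack_from("!HH", payload, l4)
    a, b = (src, sport), (dst, dport)
    if symmetric and a > b:
        a, b = b, a   # same result for both directions of a flow
    return (a[0], b[0], a[1], b[1], flow_label)
```

   The DSCP, ECN, and TTL or Hop Limit fields are never read into the
   result.  The returned tuple would then be fed to whatever hash the
   LSR uses, together with the label fields where applicable.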

2.4.5.3. Fields Used in Flow Label

The ingress to a pseudowire (PW) can extract information from the payload being encapsulated to create a Flow Label. [RFC6391] references IP carried in Ethernet as an example.

The Native Service Processing (NSP) function defined in [RFC3985] differs with pseudowire type. It is in the NSP function where information for a specific type of PW can be extracted for use in a Flow Label. Determining which fields to use for any given PW NSP is out of scope for this document.

2.4.5.4. Fields Used in Entropy Label

An Entropy Label is added at the ingress to an LSP. The payload being encapsulated is most often MPLS, a PW, or IP. The payload type is identified by the Layer 2 encapsulation (Ethernet, GFP, POS, etc.).

If the payload is MPLS, then the information used to create an Entropy Label is the same information used for local load balancing (see Section 2.4.5.1). This information MUST be extracted for use in generating an Entropy Label even if the LSR local egress interface is not a multipath.

   Of the non-MPLS payload types, only payloads that are forwarded are
   of interest.  For example, payloads using the Address Resolution
   Protocol (ARP) are not forwarded, and payloads using the
   Connectionless-mode Network Protocol (CLNP), which is used only for
   IS-IS, are not forwarded.

   The non-MPLS payload types of greatest interest are IPv4 and IPv6.
   The guidelines in Section 2.4.5.2 apply to fields used to create an
   Entropy Label.

   The IP tunneling protocols mentioned in Section 2.4.5.2 may be more
   applicable to generation of an Entropy Label at the edge or access
   where deep packet inspection is practical due to lower interface
   speeds than in the core where deep packet inspection may be
   impractical.
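
   Generating an Entropy Label from the extracted fields can be
   sketched as follows.  The keyed BLAKE2b hash, the seed, and the
   function names are assumptions made for this example.  Keeping the
   EL value out of the special-purpose range 0-15 and placing the ELI
   and EL directly below the tunnel label reflect the author of this
   sketch's reading of [RFC6790]; that document should be consulted for
   the normative rules.

```python
import hashlib
from typing import List, Sequence

ELI = 7                      # Entropy Label Indicator
FIRST_ORDINARY_LABEL = 16    # labels 0-15 are special-purpose
LABEL_SPACE = 1 << 20        # the label field is 20 bits wide

def entropy_label(fields: Sequence[int], seed: int) -> int:
    """Map the selected header fields and a local seed to a 20-bit EL value."""
    h = hashlib.blake2b(digest_size=4, key=seed.to_bytes(8, "big"))
    for value in fields:
        h.update(value.to_bytes(16, "big"))   # wide enough for an IPv6 address
    digest = int.from_bytes(h.digest(), "big")
    # Fold the digest into the range 16 .. 2**20 - 1.
    return FIRST_ORDINARY_LABEL + digest % (LABEL_SPACE - FIRST_ORDINARY_LABEL)

def labels_to_push(tunnel_label: int, fields: Sequence[int], seed: int) -> List[int]:
    """Labels the ingress adds, listed top of stack first."""
    return [tunnel_label, ELI, entropy_label(fields, seed)]

# Hypothetical IPv4 flow: (source address, destination address, ports).
pushed = labels_to_push(30001, (0xC0000201, 0xC6336402, 1234, 80), seed=42)
```

   The computation does not depend on the local egress interface, so it
   can be performed whether or not that interface is itself a multipath.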


