
RFC 8584

Framework for Ethernet VPN Designated Forwarder Election Extensibility

Pages: 32
Proposed Standard
Errata
Updates:  7432
Part 2 of 2 – Pages 19 to 32

3. The Highest Random Weight DF Election Algorithm

   The procedure discussed in this section is applicable to the DF
   election in EVPN services [RFC7432] and the EVPN Virtual Private Wire
   Service (VPWS) [RFC8214].

   Highest Random Weight (HRW), as defined in [HRW1999], was originally
   proposed in the context of Internet caching and proxy server load
   balancing.  Given an object name and a set of servers, HRW maps a
   request to a server using the object-name (object-id) and server-name
   (server-id) rather than the server states.  HRW forms a hash out of
   the server-id and the object-id and forms an ordered list of the
   servers for the particular object-id.  The server for which the hash
   value is highest serves as the primary server responsible for that
   particular object, and the server with the next-highest value in that
   hash serves as the backup server.  HRW always maps a given object
   name to the same server within a given cluster; consequently, it can
   be used at client sites to achieve global consensus on
   object-to-server mappings.  When that server goes down, the backup
   server becomes the responsible designate.

   Choosing an appropriate hash function that is statistically oblivious
   to the key distribution and imparts a good uniform distribution of
   the hash output is an important aspect of the algorithm.
   Fortunately, many such hash functions exist.  [HRW1999] provides
   pseudorandom functions based on the Unix utilities rand and srand and
   easily constructed XOR functions that satisfy the desired hashing
   properties.  HRW already finds use in multicast and ECMP [RFC2991]
   [RFC2992].

3.1. HRW and Consistent Hashing

HRW is not the only algorithm that addresses the object-to-server mapping problem with goals of fair load distribution, redundancy, and fast access. There is another family of algorithms that also addresses this problem; these fall under the umbrella of the Consistent Hashing Algorithms [CHASH]. These will not be considered here.

3.2. HRW Algorithm for EVPN DF Election

   This section describes the application of HRW to DF election.  Let
   DF(V) denote the DF and BDF(V) denote the BDF for the Ethernet Tag V;
   Si is the IP address of PE i; Es is the ESI; and Weight is a function
   of V, Si, and Es.

   Note that while the DF election algorithm provided in [RFC7432] uses
   a PE address and VLAN as inputs, this document uses an Ethernet Tag,
   PE address, and ESI as inputs.  This is because if the same set of
   PEs are multihomed to the same set of ESes, then the DF election
   algorithm used in [RFC7432] would result in the same PE being elected
   DF for the same set of BDs on each ES; this could have adverse side
   effects on both load balancing and redundancy.  Including an ESI in
   the DF election algorithm introduces additional entropy, which
   significantly reduces the probability of the same PE being elected DF
   for the same set of BDs on each ES.  Therefore, when using the HRW
   algorithm for EVPN DF election, the ESI value in the Weight function
   below SHOULD be set to that of the corresponding ES.

   In the case of a VLAN Bundle service, V denotes the lowest VLAN,
   similar to the "lowest VLAN in bundle" logic of [RFC7432].

   1.  DF(V) = Si| Weight(V, Es, Si) >= Weight(V, Es, Sj), for all j.
       In the case of a tie, choose the PE whose IP address is
       numerically the least.  Note that 0 <= i,j < number of PEs in the
       redundancy group.

   2.  BDF(V) = Sk| Weight(V, Es, Si) >= Weight(V, Es, Sk), and
       Weight(V, Es, Sk) >= Weight(V, Es, Sj).  In the case of a tie,
       choose the PE whose IP address is numerically the least.

   Where:

   o  DF(V) is defined to be the address Si (index i) for which
      Weight(V, Es, Si) is the highest; 0 <= i <= N-1.

   o  BDF(V) is defined as that PE with address Sk for which the
      computed Weight is the next highest after the Weight of the DF.
      j is the running index from 0 to N-1; i and k are selected values.

   Since the Weight is a pseudorandom function with the domain as the
   three-tuple (V, Es, S), it is an efficient and deterministic
   algorithm that is independent of the Ethernet Tag V sample space
   distribution.  Choosing a good hash function for the pseudorandom
   function is an important consideration for this algorithm to perform
   better than the default algorithm.  As mentioned previously, such
   functions are described in [HRW1999].  We take as a candidate hash
   function the first one out of the two that are listed as preferred in
   [HRW1999]:

      Wrand(V, Es, Si) = (1103515245((1103515245.Si+12345) XOR
      D(V, Es))+12345)(mod 2^31)

   Here, D(V, Es) is the 31-bit digest (CRC-32 and discarding the
   most significant bit (MSB), as noted in [HRW1999]) of the 14-octet
   stream (the 4-octet Ethernet Tag V followed by the 10-octet ESI).  It
   is mandated that the 14-octet stream be formed by the concatenation
   of the Ethernet Tag and the ESI in network byte order.  The CRC
   should proceed as if the stream is in network byte order
   (big-endian).  Si is the address of the ith server.  The server's
   IP address length does not matter, as only the low-order 31 bits are
   modulo significant.
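
   The Wrand computation and the DF/BDF selection rules above can be
   sketched as follows.  This is a minimal illustration, not a
   reference implementation.  It assumes that the CRC-32 is the variant
   computed by Python's zlib.crc32 (this section does not name a
   polynomial), and the function names digest, wrand, and elect, the
   sample ESI, and the sample PE addresses are hypothetical.

```python
import ipaddress
import zlib

A = 1103515245   # multiplier from the Wrand formula
C = 12345        # increment from the Wrand formula
MOD = 1 << 31    # Wrand is taken modulo 2^31


def digest(eth_tag: int, esi: bytes) -> int:
    """D(V, Es): CRC-32 over the 4-octet Ethernet Tag followed by the
    10-octet ESI, in network byte order, with the MSB discarded.
    Assumption: zlib's CRC-32 variant is the intended one."""
    if len(esi) != 10:
        raise ValueError("ESI must be exactly 10 octets")
    stream = eth_tag.to_bytes(4, "big") + esi
    return zlib.crc32(stream) & 0x7FFFFFFF


def wrand(eth_tag: int, esi: bytes, pe_addr: str) -> int:
    """Wrand(V, Es, Si) for one PE address (IPv4 or IPv6)."""
    si = int(ipaddress.ip_address(pe_addr))
    # Only the low-order 31 bits survive the final modulo, so the
    # address length does not matter.
    return (A * ((A * si + C) ^ digest(eth_tag, esi)) + C) % MOD


def elect(eth_tag: int, esi: bytes, pe_addrs: list[str]):
    """Return (DF, BDF) for a non-empty PE list; BDF is None when
    there is only one PE."""
    # Highest weight first; ties go to the numerically lowest address.
    ranked = sorted(
        pe_addrs,
        key=lambda a: (-wrand(eth_tag, esi, a), int(ipaddress.ip_address(a))),
    )
    return ranked[0], (ranked[1] if len(ranked) > 1 else None)


if __name__ == "__main__":
    esi = bytes.fromhex("00112233445566778899")  # hypothetical ESI
    pes = ["192.0.2.1", "192.0.2.2", "2001:db8::3"]
    for vlan in (1, 2, 3):
        print(vlan, elect(vlan, esi, pes))
```

   Because every PE evaluates the same function over the same inputs,
   each PE arrives at the same (DF, BDF) pair without further
   signaling, regardless of the order in which it learned the
   candidate addresses.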

   A point to note is that the Weight function takes into consideration
   the combination of the Ethernet Tag, the ES, and the PE IP address,
   and the actual length of the server IP address (whether IPv4 or IPv6)
   is not really relevant.  The default algorithm defined in [RFC7432]
   cannot employ both IPv4 and IPv6 PE addresses, since [RFC7432] does
   not specify how to decide on the ordering (the ordinal list) when
   both IPv4 and IPv6 PEs are present.

   HRW solves the disadvantages pointed out in Section 1.3.1 of this
   document and ensures that:

   o  With very high probability, the task of DF election for the VLANs
      configured on an ES is more or less equally distributed among the
      PEs, even in the case of two PEs (see the first fundamental
      problem listed in Section 1.3.1).

   o  If a PE that is not the DF or the BDF for that VLAN goes down or
      its connection to the ES goes down, it does not result in a DF or
      BDF reassignment.  This saves computation, especially in the case
      when the connection flaps.

   o  More importantly, it avoids the third fundamental problem listed
      in Section 1.3.1 (needless disruption) that is inherent in the
      existing default DF election.

   o  In addition to the DF, the algorithm also furnishes the BDF, which
      would be the DF if the current DF fails.
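
   The "no needless reassignment" property in the list above can be
   checked with a short experiment.  The sketch below counts the VLANs
   whose DF changes when a PE that was not their DF is removed, for
   both the modulus-based default algorithm and HRW.  It is
   illustrative only: zlib's CRC-32 is an assumed choice, and the
   helper names, the ESI, and the PE addresses are hypothetical.

```python
import ipaddress
import zlib


def ip_int(addr: str) -> int:
    return int(ipaddress.ip_address(addr))


def weight(v: int, esi: bytes, addr: str) -> int:
    """Wrand(V, Es, Si); zlib's CRC-32 is an assumed choice of CRC-32."""
    d = zlib.crc32(v.to_bytes(4, "big") + esi) & 0x7FFFFFFF
    s = ip_int(addr)
    return (1103515245 * ((1103515245 * s + 12345) ^ d) + 12345) % 2**31


def hrw_df(v: int, esi: bytes, pes: list[str]) -> str:
    # Highest weight wins; a tie goes to the numerically lowest address.
    return max(pes, key=lambda a: (weight(v, esi, a), -ip_int(a)))


def modulo_df(v: int, esi: bytes, pes: list[str]) -> str:
    # Default algorithm: ordinal (V mod N) in the address-sorted list.
    # esi is unused; it is accepted so both functions share a signature.
    return sorted(pes, key=ip_int)[v % len(pes)]


def needless_moves(elect, esi, pes, failed, vlans) -> int:
    """Count VLANs whose DF changes although the failed PE was not the DF."""
    survivors = [p for p in pes if p != failed]
    return sum(
        1
        for v in vlans
        if elect(v, esi, pes) != failed
        and elect(v, esi, pes) != elect(v, esi, survivors)
    )


if __name__ == "__main__":
    esi = bytes.fromhex("00112233445566778899")  # hypothetical ESI
    pes = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
    vlans = range(1, 101)
    print("modulo:", needless_moves(modulo_df, esi, pes, "192.0.2.3", vlans))
    print("HRW:   ", needless_moves(hrw_df, esi, pes, "192.0.2.3", vlans))
```

   The HRW count is zero by construction: removing a PE that does not
   hold the highest weight for a VLAN cannot change which PE holds the
   highest weight.  The modulus-based count is generally nonzero,
   because removing any PE changes N and therefore V mod N.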

4. The AC-Influenced DF Election Capability

   The procedure discussed in this section is applicable to the DF
   election in EVPN services [RFC7432] and EVPN VPWS [RFC8214].

   The AC-DF capability is expected to be generally applicable to any
   future DF algorithm.  It modifies the DF election procedures by
   removing from consideration any candidate PE in the ES that cannot
   forward traffic on the AC that belongs to the BD.  This section is
   applicable to VLAN-based and VLAN Bundle service interfaces.
   Section 4.1 describes the procedures for VLAN-aware Bundle service
   interfaces.

   In particular, when used with the default DF algorithm, the AC-DF
   capability modifies Step 3 in the DF election procedure described in
   [RFC7432], Section 8.5, as follows:

   3. When the timer expires, each PE builds an ordered candidate list
      of the IP addresses of all the PE nodes attached to the ES
      (including itself), in increasing numeric value.  The candidate
      list is based on the Originating Router's IP addresses of the ES
      routes but excludes any PE from whom no Ethernet A-D per ES route
      has been received or from whom the route has been withdrawn.
      Afterwards, the DF election algorithm is applied per
      <ES, Ethernet Tag>; however, the IP address for a PE will not be
      considered to be a candidate for a given <ES, Ethernet Tag> until
      the corresponding Ethernet A-D per EVI route has been received
      from that PE.  In other words, the ACS on the ES for a given PE
      must be UP so that the PE is considered to be a candidate for a
      given BD.

      If the default DF algorithm is used, every PE in the resulting
      candidate list is then given an ordinal indicating its position
      in the ordered list, starting with 0 as the ordinal for the PE with
      the numerically lowest IP address.  The ordinals are used to
      determine which PE node will be the DF for a given Ethernet Tag on
      the ES, using the following rule:

      Assuming a redundancy group of N PE nodes, for VLAN-based service,
      the PE with ordinal i is the DF for an <ES, Ethernet Tag V> when
      (V mod N) = i.  In the case of a VLAN (-aware) Bundle service,
      then the numerically lowest VLAN value in that bundle on that ES
      MUST be used in the modulo function as the Ethernet Tag.

      It should be noted that using the Originating Router's IP Address
      field [RFC7432] in the ES route to get the PE IP address needed
      for the ordered list allows for a CE to be multihomed across
      different Autonomous Systems (ASes) if such a need ever arises.

   The modified Step 3, above, differs from [RFC7432], Section 8.5,
   Step 3 in two ways:

   o  Any DF Alg can be used -- not only the described modulus-based DF
      Alg (referred to as the default DF election or "DF Alg 0" in this
      document).

   o  The candidate list is pruned based upon non-receipt of Ethernet
      A-D routes: a PE's IP address MUST be removed from the ES
      candidate list if its Ethernet A-D per ES route is withdrawn.  A
      PE's IP address MUST NOT be considered to be a candidate DF for an
      <ES, Ethernet Tag> if its Ethernet A-D per EVI route for the
      <ES, Ethernet Tag> is withdrawn.
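
   The pruning described in the modified Step 3 can be sketched as a
   set intersection followed by the modulus rule.  The data model below
   (one set of PE addresses per route type) and the names
   candidate_list and default_df are assumptions made for illustration;
   a real implementation would derive these sets from its BGP tables.

```python
from ipaddress import ip_address


def candidate_list(es_routes, ad_per_es, ad_per_evi):
    """Ordered candidate list for one <ES, Ethernet Tag>.

    Each argument is the set of PE addresses from which the named route
    is currently held (hypothetical data model):
      es_routes  - Originating Router's IP addresses of the ES routes
      ad_per_es  - PEs with an Ethernet A-D per ES route for the ES
      ad_per_evi - PEs with an Ethernet A-D per EVI route for the tag
    """
    eligible = set(es_routes) & set(ad_per_es) & set(ad_per_evi)
    return sorted(eligible, key=lambda a: int(ip_address(a)))


def default_df(candidates, eth_tag):
    """Modulus-based default DF election: the PE with ordinal V mod N."""
    if not candidates:
        return None
    return candidates[eth_tag % len(candidates)]


if __name__ == "__main__":
    pe1, pe2 = "192.0.2.1", "192.0.2.2"
    both = {pe1, pe2}
    # Both ACs up: N = 2, so VLAN 1 maps to ordinal 1 (pe2).
    print(default_df(candidate_list(both, both, both), 1))
    # pe2 withdraws its A-D per EVI route: N = 1, so pe1 becomes DF.
    print(default_df(candidate_list(both, both, {pe1}), 1))
```

   In the second call, a PE that cannot forward on the AC is no longer
   counted in N, so it cannot be elected DF for that Ethernet Tag.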

   The following example illustrates the AC-DF behavior applied to the
   default DF election algorithm, assuming the network in Figure 2:

   (a)  When PE1 and PE2 discover ES12, they advertise an ES route for
        ES12 with the associated ES-Import Extended Community and the DF
        Election Extended Community indicating AC-DF = 1; they start a
        DF Wait timer (independently).  Likewise, PE2 and PE3 advertise
        an ES route for ES23 with AC-DF = 1 and start a DF Wait timer.

   (b)  PE1 and PE2 advertise an Ethernet A-D per ES route for ES12.
        PE2 and PE3 advertise an Ethernet A-D per ES route for ES23.

   (c)  In addition, PE1, PE2, and PE3 advertise an Ethernet A-D per EVI
        route for AC1, AC2, AC3, and AC4 as soon as the ACs are enabled.
        Note that the AC can be associated with a single customer VID
        (e.g., VLAN-based service interfaces) or a bundle of customer
        VIDs (e.g., VLAN Bundle service interfaces).

   (d)  When the timer expires, each PE builds an ordered candidate list
        of the IP addresses of all the PE nodes attached to the ES
        (including itself) as explained in the modified Step 3 above.
        Any PE from which an Ethernet A-D per ES route has not been
        received is pruned from the list.

   (e)  When electing the DF for a given BD, a PE will not be considered
        to be a candidate until an Ethernet A-D per EVI route has been
        received from that PE.  In other words, the ACS on the ES for a
        given PE must be UP so that the PE is considered to be a
        candidate for a given BD.  For example, PE1 will not consider
        PE2 as a candidate for DF election for <ES12, VLAN-1> until an
        Ethernet A-D per EVI route is received from PE2 for
        <ES12, VLAN-1>.

   (f)  Once the PEs with ACS = DOWN for a given BD have been removed
        from the candidate list, the DF election can be applied for the
        remaining N candidates.

   Note that this procedure only modifies the existing EVPN control
   plane by adding and processing the DF Election Extended Community
   and by pruning the candidate list of PEs that take part in the DF
   election.

   In addition to the events defined in the FSM in Section 2.1, the
   following events SHALL modify the candidate PE list and trigger the
   DF re-election in a PE for a given <ES, Ethernet Tag>.  In the FSM
   shown in Figure 3, the events below MUST trigger a transition from
   DF_DONE to DF_CALC:

   1.  Local AC going DOWN/UP.

   2.  Reception of a new Ethernet A-D per EVI route update/withdrawal
       for the <ES, Ethernet Tag>.

   3.  Reception of a new Ethernet A-D per ES route update/withdrawal
       for the ES.
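
   The three triggers above can be expressed as a small transition
   function.  This is a sketch only: the state names DF_DONE and
   DF_CALC come from the text, while the event names and the CALCULATED
   event that returns the FSM to DF_DONE are assumptions made for the
   example, and the remaining states of the FSM are omitted.

```python
from enum import Enum, auto


class State(Enum):
    DF_CALC = auto()
    DF_DONE = auto()


class Event(Enum):
    LOCAL_AC_CHANGE = auto()    # 1. local AC going DOWN/UP
    AD_PER_EVI_CHANGE = auto()  # 2. A-D per EVI route update/withdrawal
    AD_PER_ES_CHANGE = auto()   # 3. A-D per ES route update/withdrawal
    CALCULATED = auto()         # assumed: the election has finished


# Events that change the candidate PE list and force a re-election.
RECALC_EVENTS = {
    Event.LOCAL_AC_CHANGE,
    Event.AD_PER_EVI_CHANGE,
    Event.AD_PER_ES_CHANGE,
}


def next_state(state: State, event: Event) -> State:
    """Return the next state for the subset of the FSM shown here."""
    if state is State.DF_DONE and event in RECALC_EVENTS:
        return State.DF_CALC
    if state is State.DF_CALC and event is Event.CALCULATED:
        return State.DF_DONE
    return state


if __name__ == "__main__":
    s = State.DF_DONE
    for ev in (Event.AD_PER_EVI_CHANGE, Event.CALCULATED):
        s = next_state(s, ev)
        print(ev.name, "->", s.name)
```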

4.1. AC-Influenced DF Election Capability for VLAN-Aware Bundle Services

   The procedure described in Section 4 works for VLAN-based and VLAN
   Bundle service interfaces because, for those service types, a PE
   advertises only one Ethernet A-D per EVI route per <ES, VLAN> or
   <ES, VLAN Bundle>.  In Section 4, an Ethernet Tag represents a given
   VLAN or VLAN Bundle for the purpose of DF election.  The withdrawal
   of such a route means that the PE cannot forward traffic on that
   particular <ES, VLAN> or <ES, VLAN Bundle>; therefore, the PE can be
   removed from consideration for DF election.

   According to [RFC7432], in VLAN-aware Bundle services, the PE
   advertises multiple Ethernet A-D per EVI routes per <ES, VLAN Bundle>
   (one route per Ethernet Tag), while the DF election is still
   performed per <ES, VLAN Bundle>.  The withdrawal of an individual
   route only indicates the unavailability of a specific AC and not
   necessarily all the ACs in the <ES, VLAN Bundle>.

   This document modifies the DF election for VLAN-aware Bundle services
   in the following ways:

   o  After confirming that all the PEs in the ES advertise the AC-DF
      capability, a PE will perform a DF election per <ES, VLAN>, as
      opposed to per <ES, VLAN Bundle> as described in [RFC7432].  Now,
      the withdrawal of an Ethernet A-D per EVI route for a VLAN will
      indicate that the advertising PE's ACS is DOWN and the rest of the
      PEs in the ES can remove the PE from consideration for DF election
      in the <ES, VLAN>.

   o  The PEs will now follow the procedures in Section 4.

   For example, assuming three bridge tables in PE1 for the same MAC-VRF
   (each one associated with a different Ethernet Tag, e.g., VLAN-1,
   VLAN-2, and VLAN-3), PE1 will advertise three Ethernet A-D per EVI
   routes for ES12.  Each of the three routes will indicate the status
   of each of the three ACs in ES12.  PE1 will be considered to be a
   valid candidate PE for DF election in <ES12, VLAN-1>, <ES12, VLAN-2>,
   and <ES12, VLAN-3> as long as its three routes are active.  For
   instance, if PE1 withdraws the Ethernet A-D per EVI route for
   <ES12, VLAN-1>, the PEs in ES12 will not consider PE1 as a suitable
   DF candidate for <ES12, VLAN-1>.  PE1 will still be considered for
   <ES12, VLAN-2> and <ES12, VLAN-3>, since its routes are active.
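
   The per-<ES, VLAN> candidate sets in this example can be derived
   directly from the set of active Ethernet A-D per EVI routes.  The
   sketch below assumes a hypothetical representation in which each
   active route for ES12 is a (PE, VLAN) pair, and it adds a second PE
   (PE2) with all three ACs up for contrast.

```python
from collections import defaultdict


def candidates_per_vlan(ad_per_evi_routes):
    """Map each VLAN of one ES to the set of PEs that hold an active
    Ethernet A-D per EVI route for that <ES, VLAN>."""
    per_vlan = defaultdict(set)
    for pe, vlan in ad_per_evi_routes:
        per_vlan[vlan].add(pe)
    return dict(per_vlan)


if __name__ == "__main__":
    # PE1 and PE2 each advertise one route per VLAN in the bundle.
    routes = {(pe, vlan) for pe in ("PE1", "PE2") for vlan in (1, 2, 3)}
    # PE1 withdraws its route for <ES12, VLAN-1>.
    routes.discard(("PE1", 1))
    for vlan, pes in sorted(candidates_per_vlan(routes).items()):
        print(f"<ES12, VLAN-{vlan}>:", sorted(pes))
```

   After the withdrawal, PE1 is absent from the candidate set for
   VLAN-1 only; the sets for VLAN-2 and VLAN-3 still contain both PEs.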

5. Solution Benefits

   The solution described in this document provides the following
   benefits:

   (a)  It extends the DF election as defined in [RFC7432] to address
        the unfair load balancing and potential black-holing issues with
        the default DF election algorithm.  The solution is applicable
        to the DF election in EVPN services [RFC7432] and EVPN VPWS
        [RFC8214].

   (b)  It defines a way to signal the DF election algorithm and
        capabilities intended by the advertising PE.  This is done by
        defining the DF Election Extended Community, which allows the
        advertising PE to indicate its support for the capabilities
        defined in this document as well as any subsequently defined DF
        election algorithms or capabilities.

   (c)  It is backwards compatible with the procedures defined in
        [RFC7432].  If one or more PEs in the ES do not support the new
        procedures, all PEs in the ES follow the DF election defined in
        [RFC7432].
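
   The backwards-compatibility behavior in (c) amounts to a unanimity
   check over what the PEs in the ES advertise.  The following sketch
   assumes a hypothetical representation in which each PE's
   advertisement is its DF Alg value, or None when the PE did not send
   the DF Election Extended Community; the function names are
   illustrative.

```python
DEFAULT_DF_ALG = 0  # "Default DF Election" in the DF Alg registry
HRW_DF_ALG = 1      # "HRW Algorithm" in the DF Alg registry


def agreed_df_alg(advertised):
    """Return the DF Alg to run on an ES.

    advertised holds one entry per PE in the ES: the DF Alg value, or
    None for a PE that did not advertise the extended community.  Any
    disagreement or missing advertisement falls back to the default.
    """
    algs = set(advertised)
    if len(algs) == 1 and None not in algs:
        return algs.pop()
    return DEFAULT_DF_ALG


def ac_df_enabled(ac_df_flags):
    """AC-DF applies only when every PE in the ES advertises it."""
    flags = list(ac_df_flags)
    return bool(flags) and all(flags)


if __name__ == "__main__":
    print(agreed_df_alg([HRW_DF_ALG, HRW_DF_ALG]))        # unanimous
    print(agreed_df_alg([HRW_DF_ALG, None, HRW_DF_ALG]))  # one legacy PE
    print(ac_df_enabled([True, True, False]))
```

   A single PE that lacks the new procedures, or that is configured
   differently, is enough to move the whole ES back to the default
   election.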

6. Security Considerations

   This document addresses some identified issues in the DF election
   procedures described in [RFC7432] by defining a new DF election
   framework.  In general, this framework allows the PEs that are part
   of the same ES to exchange additional information and agree on the DF
   election type and capabilities to be used.

   By following the procedures in this document, the operator will
   minimize such undesirable situations as unfair load balancing,
   service disruption, and traffic black-holing.  Because such
   situations could be purposely created by a malicious user with access
   to the configuration of one PE, this document also enhances the
   security of the network.

   Note that the network will not benefit from the new procedures if the
   DF election algorithm is not consistently configured on all the PEs
   in the ES (if there is no unanimity among all the PEs, the DF
   election algorithm falls back to the default DF election as provided
   in [RFC7432]).  This behavior could be exploited by an attacker that
   manages to modify the configuration of one PE in the ES so that the
   DF election algorithm and capabilities in all the PEs in the ES fall
   back to the default DF election.  If that is the case, the PEs will
   be exposed to the unfair load balancing, service disruption, and
   black-holing mentioned earlier.

   In addition, the new framework is extensible and allows for new
   security enhancements in the future.  Note that such enhancements are
   out of scope for this document.

   Finally, since this document extends the procedures in [RFC7432], the
   same security considerations as those described in [RFC7432] are
   valid for this document.

7. IANA Considerations

   IANA has:

   o  Allocated Sub-Type value 0x06 in the "EVPN Extended Community
      Sub-Types" registry defined in [RFC7153] as follows:

      Sub-Type Value   Name                             Reference
      --------------   ------------------------------   -------------
      0x06             DF Election Extended Community   This document

   o  Set up a registry called "DF Alg" for the DF Alg field in the
      Extended Community.  New registrations will be made through the
      "RFC Required" procedure defined in [RFC8126].  Value 31 is for
      experimental use and does not require any other RFC than this
      document.  The following initial values in that registry exist:

      Alg    Name                            Reference
      ----   -----------------------------   -------------
      0      Default DF Election             This document
      1      HRW Algorithm                   This document
      2-30   Unassigned
      31     Reserved for Experimental Use   This document

   o  Set up a registry called "DF Election Capabilities" for the
      2-octet Bitmap field in the Extended Community.  New registrations
      will be made through the "RFC Required" procedure defined in
      [RFC8126].  The following initial value in that registry exists:

      Bit    Name               Reference
      ----   ----------------   -------------
      0      Unassigned
      1      AC-DF Capability   This document
      2-15   Unassigned

8. References

8.1. Normative References

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432,
              February 2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC8214]  Boutros, S., Sajassi, A., Salam, S., Drake, J., and
              J. Rabadan, "Virtual Private Wire Service Support in
              Ethernet VPN", RFC 8214, DOI 10.17487/RFC8214,
              August 2017, <https://www.rfc-editor.org/info/rfc8214>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174,
              DOI 10.17487/RFC8174, May 2017,
              <https://www.rfc-editor.org/info/rfc8174>.

   [RFC4360]  Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended
              Communities Attribute", RFC 4360, DOI 10.17487/RFC4360,
              February 2006, <https://www.rfc-editor.org/info/rfc4360>.

   [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
              Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
              March 2014, <https://www.rfc-editor.org/info/rfc7153>.

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017,
              <https://www.rfc-editor.org/info/rfc8126>.

8.2. Informative References

   [VPLS-MH]  Kothari, B., Kompella, K., Henderickx, W., Balus, F., and
              J. Uttaro, "BGP based Multi-homing in Virtual Private LAN
              Service", Work in Progress,
              draft-ietf-bess-vpls-multihoming-03, March 2019.

   [CHASH]    Karger, D., Lehman, E., Leighton, T., Panigrahy, R.,
              Levine, M., and D. Lewin, "Consistent Hashing and Random
              Trees: Distributed Caching Protocols for Relieving Hot
              Spots on the World Wide Web", ACM Symposium on Theory of
              Computing, ACM Press, New York,
              DOI 10.1145/258533.258660, May 1997.

   [CLRS2009] Cormen, T., Leiserson, C., Rivest, R., and C. Stein,
              "Introduction to Algorithms (3rd Edition)", MIT Press,
              ISBN 0-262-03384-8, 2009.

   [RFC2991]  Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
              Multicast Next-Hop Selection", RFC 2991,
              DOI 10.17487/RFC2991, November 2000,
              <https://www.rfc-editor.org/info/rfc2991>.

   [RFC2992]  Hopps, C., "Analysis of an Equal-Cost Multi-Path
              Algorithm", RFC 2992, DOI 10.17487/RFC2992,
              November 2000, <https://www.rfc-editor.org/info/rfc2992>.

   [RFC4456]  Bates, T., Chen, E., and R. Chandra, "BGP Route
              Reflection: An Alternative to Full Mesh Internal BGP
              (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006,
              <https://www.rfc-editor.org/info/rfc4456>.

   [HRW1999]  Thaler, D. and C. Ravishankar, "Using Name-Based Mappings
              to Increase Hit Rates", IEEE/ACM Transactions on
              Networking, Volume 6, No. 1, February 1998,
              <https://www.microsoft.com/en-us/research/wp-content/uploads/2017/02/HRW98.pdf>.

   [Knuth]    Knuth, D., "The Art of Computer Programming: Volume 3:
              Sorting and Searching", 2nd Edition, Addison-Wesley,
              Page 516, 1998.

Acknowledgments

The authors want to thank Ranganathan Boovaraghavan, Sami Boutros, Luc Andre Burdet, Anoop Ghanwani, Mrinmoy Ghosh, Jakob Heitz, Leo Mermelstein, Mankamana Mishra, Tamas Mondal, Laxmi Padakanti, Samir Thoria, and Sriram Venkateswaran for their review and contributions. Special thanks to Stephane Litkowski for his thorough review and detailed contributions. They would also like to thank their working group chairs, Matthew Bocci and Stephane Litkowski, and their AD, Martin Vigoureux, for their guidance and support. Finally, they would like to thank the Directorate reviewers and the ADs for their thorough reviews and probing questions, the answers to which have substantially improved the quality of the document.

Contributors

   The following people have contributed substantially to this document
   and should be considered coauthors:

   Antoni Przygienda
   Juniper Networks, Inc.
   1194 N. Mathilda Ave.
   Sunnyvale, CA  94089
   United States of America

   Email: prz@juniper.net

   Vinod Prabhu
   Nokia

   Email: vinod.prabhu@nokia.com

   Wim Henderickx
   Nokia

   Email: wim.henderickx@nokia.com

   Wen Lin
   Juniper Networks, Inc.

   Email: wlin@juniper.net

   Patrice Brissette
   Cisco Systems

   Email: pbrisset@cisco.com

   Keyur Patel
   Arrcus, Inc.

   Email: keyur@arrcus.com

   Autumn Liu
   Ciena

   Email: hliu@ciena.com

Authors' Addresses

   Jorge Rabadan (editor)
   Nokia
   777 E. Middlefield Road
   Mountain View, CA  94043
   United States of America

   Email: jorge.rabadan@nokia.com

   Satya Mohanty (editor)
   Cisco Systems, Inc.
   225 West Tasman Drive
   San Jose, CA  95134
   United States of America

   Email: satyamoh@cisco.com

   Ali Sajassi
   Cisco Systems, Inc.
   225 West Tasman Drive
   San Jose, CA  95134
   United States of America

   Email: sajassi@cisco.com

   John Drake
   Juniper Networks, Inc.
   1194 N. Mathilda Ave.
   Sunnyvale, CA  94089
   United States of America

   Email: jdrake@juniper.net


   Kiran Nagaraj
   Nokia
   701 E. Middlefield Road
   Mountain View, CA  94043
   United States of America

   Email: kiran.nagaraj@nokia.com

   Senthil Sathappan
   Nokia
   701 E. Middlefield Road
   Mountain View, CA  94043
   United States of America

   Email: senthil.sathappan@nokia.com