
RFC 7609

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Pages: 143
Informational
Part 2 of 6 – Pages 11 to 26

2. Link Architecture

   An SMC-R link is based on reliably connected queue pairs (QPs) that
   form a "logical point-to-point link" between the two SMC-R peers over
   a RoCE fabric.  An SMC-R link extends from SMC-R peer to SMC-R peer,
   where typically each peer would be a TCP/IP stack and would reside on
   separate hosts.

                           ,,.--..,_
     +----+            _-``         `-,            +-----+
     |QP 8|           -     RoCE       ',          |QP 64|
     |    |          /      VLAN M       .         |     |
     +----+--------+/                     \+-------+-----+
     |  RNIC 1     |      SMC-R Link       |  RNIC 2     |
     |             |<--------------------->|             |
     +-------------+ ,                    /+-------------+
      MAC A (GID A)   .                 .`  MAC B (GID B)
                       `',           ,-`
                          ``''--''``

                    Figure 1: SMC-R Link Overview

   Figure 1 illustrates an overview of the basic concepts of SMC-R peer-
   to-peer connectivity; this is called the SMC-R link.  The SMC-R link
   forms a logical point-to-point connection between two SMC-R peers via
   RoCE.  The SMC-R link is defined and identified by the following
   attributes:

      SMC-R link = RC QPs
         (source VMAC GID QP + target VMAC GID QP + VLAN ID)

   The SMC-R link can optionally be associated with a VLAN ID.  If VLANs
   are in use for the associated IP (LAN) connection, then the VLAN
   attribute is carried over on the SMC-R link.  When VLANs are in use,
   each SMC-R link group is associated with a single and specific VLAN.
   The RoCE fabric is the same physical Ethernet LAN used for standard
   TCP/IP-over-Ethernet communications, with switches as described in
   Section 1.1.1.

   An SMC-R link is designed to support multiple TCP connections between
   the same two peers.  An SMC-R link is intended to be long lived,
   while the underlying TCP connections can dynamically come and go.
   The associated RMBs can also be dynamically added and removed from
   the link as needed.  The first TCP connection between the peers
   establishes the SMC-R link.  Subsequent TCP connections then use the
   previously established link.  When the last TCP connection
   terminates, the link can then be terminated, typically after an
   implementation-defined idle timeout period has elapsed.  The TCP
   server is responsible for initiating and terminating the SMC-R link.
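
   The link identity and lifetime rules above can be sketched as
   follows.  This is a minimal illustration, not part of the protocol
   definition: the names LinkKey and SmcrLink, the field names, and the
   60-second idle timeout are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass(frozen=True)
class LinkKey:
    """Attributes that define an SMC-R link (hypothetical field names)."""
    src_mac: str
    src_gid: str
    src_qp: int
    dst_mac: str
    dst_gid: str
    dst_qp: int
    vlan_id: Optional[int] = None  # set only when the IP connection uses a VLAN


@dataclass
class SmcrLink:
    key: LinkKey
    idle_timeout: float = 60.0          # implementation-defined; 60 s is assumed
    connections: Set[int] = field(default_factory=set)
    idle_since: Optional[float] = None  # when the last TCP connection ended

    def add_connection(self, conn_id: int) -> None:
        """A new TCP connection reuses the link and cancels any idle timer."""
        self.connections.add(conn_id)
        self.idle_since = None

    def remove_connection(self, conn_id: int, now: float) -> None:
        """Start the idle timer when the last TCP connection terminates."""
        self.connections.discard(conn_id)
        if not self.connections:
            self.idle_since = now

    def may_terminate(self, now: float) -> bool:
        """True once the link has carried no connections for idle_timeout."""
        return (self.idle_since is not None
                and now - self.idle_since >= self.idle_timeout)
```

   In this sketch the link outlives individual TCP connections, and only
   the side acting as TCP server would act on may_terminate().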

2.1. Remote Memory Buffers (RMBs)

   Figure 2 shows the hosts -- Hosts X and Y -- and their associated
   RMBs within each host.  With the SMC-R link, and the associated RKeys
   and RDMA virtual addresses, each SMC-R-enabled TCP/IP stack can
   remotely access its peer's RMBs using RDMA.  The RKeys and virtual
   addresses are exchanged during the rendezvous processing when the
   link is established.  The combination of the RKey and the virtual
   address is the RToken.  Note that the SMC-R link ends at the QP
   providing access to the RMB (via the link + RToken).

          Host X                                     Host Y
     +-------------------+        ,.--.,_       +-------------------+
     |                   |     .'`       '.     |                   |
     | Protection        |   ,'            `,   |    Protection     |
     | Domain X          |  /                \  |    Domain Y       |
     |            +------+ /                  \ +------+            |
     |       QP 8 |RNIC 1| |   SMC-R Link     | |RNIC 2|  QP 64     |
     |        |   |      |<-------------------->|      |   |        |
     |        |   |      ||                    ||      |   |        |
     |        |   +------+|    VLAN A          |+------+   |        |
     |        |          ||                    ||          |        |
     |        |          | |   RoCE           | |          |        |
     |        |RToken X  | \                  / |RToken Y  |        |
     |        |          |  \                /  |          |        |
     |        V          |   `.            ,'   |          V        |
     | +--------+        |     '._       ,'     |        +--------+ |
     | |        |        |        `''-'``       |        |        | |
     | | RMB    |        |                      |        | RMB    | |
     | |        |        |                      |        |        | |
     | +--------+        |                      |        +--------+ |
     +-------------------+                      +-------------------+

                       Figure 2: SMC-R Link and RMBs

   An SMC-R link can support multiple RMBs that are independently
   managed by each peer.  The number and the size of RMBs are managed by
   the peers based on the host's unique memory management requirements;
   however, the maximum number of RMBs that can be associated to a link
   group on one peer is 255.  The QP has a single protection domain, but
   each RMB has a unique RToken.  All RTokens must be exchanged with the
   peer.

   Each peer manages the RMBs in its local memory for its remote SMC-R
   peer by sharing access to the RMBs via RTokens with its peers.  The
   remote peer writes into the RMBs via RDMA, and the local peer (RMB
   owner) then reads from the RMBs.

   When two peers decide to use SMC-R for a given TCP connection, they
   each allocate a local RMB element for the TCP connection and
   communicate the location of this local RMB element during rendezvous
   processing.  To that end, RMB elements are created in pairs, with one
   RMB element allocated locally on each peer of the SMC-R link.
                  ---  +------------+---------------+
                  /\   |Eye Catcher |               |
                   |   +------------+               |
                   |   |                            |
         RMB Element 1 |                            |
                   |   |   Receive Buffer           |
                   |   |                            |
                   |   |                            |
                  \/   |                            |
                  ---  +------------+---------------+
                  /\   |Eye Catcher |               |
                   |   +------------+               |
                   |   |                            |
         RMB Element 2 |                            |
                   |   |   Receive Buffer           |
                   |   |                            |
                   |   |                            |
                  \/   |                            |
                  ---  +----------------------------+
                       |            .               |
                       |            .               |
                       |            .               |
                       |            .               |
                       |    (up to 255 elements)    |
                       +----------------------------+

                           Figure 3: RMB Format

   Figure 3 illustrates the basic format of an RMB.  The RMB is a
   virtual memory buffer whose backing real memory is pinned; it can
   support up to 255 TCP connections to exactly one remote SMC-R peer.
   Each RMB is therefore associated with the SMC-R links within a link
   group for the two peers and a specific RoCE Protection Domain.  Other
   than the two peers identified by the SMC-R link, no other SMC-R peers
   can have RDMA access to an RMB; this requires a unique Protection
   Domain for every SMC-R link.  This is critical to ensure integrity of
   SMC-R communications.

   RMBs are subdivided into multiple elements for efficiency, with each
   RMB Element (RMBE) associated with a single TCP connection.
   Therefore, multiple TCP connections across an SMC-R link group can
   share the same memory for RDMA purposes, reducing the overhead of
   having to register additional memory with the RNIC for every new TCP
   connection.  The number of elements in an RMB and the size of each
   RMBE are entirely governed by the owning peer, subject to the SMC-R
   architecture rules; however, all RMB elements within a given RMB must
   be the same size.  Each peer can decide the level of resource-sharing
   that is desirable across TCP connections based on local constraints,
   such as available system memory.  An RMB element is identified to the
   remote SMC-R peer via an RMB Element Token, which consists of the
   following:

   o  RMB RToken: The combination of the RKey and virtual address
      provided by the RNIC that identifies the start of the RMB for RDMA
      operations.

   o  RMB Index: Identifies the RMB element index in the RMB.  Used to
      locate a specific RMB element within an RMB.  Valid value range is
      1-255.

   o  RMB Element Length: The length of the RMB element's eye catcher
      plus the length of the receive buffer.  This length is equal for
      all RMB elements in a given RMB.  This length can be variable
      across different RMBs.
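
   The three parts of the RMB Element Token can be combined to locate an
   element, as in the minimal sketch below.  The class and method names
   are hypothetical, and the address calculation assumes that elements
   are laid out contiguously from the start of the RMB with index 1 at
   offset 0, as Figure 3 suggests; the text above does not state the
   layout rule explicitly.

```python
from dataclasses import dataclass

MAX_RMBE_INDEX = 255  # valid RMB index range is 1-255


@dataclass(frozen=True)
class RToken:
    rkey: int
    virtual_address: int  # start of the RMB for RDMA operations


@dataclass(frozen=True)
class RmbElementToken:
    rtoken: RToken
    index: int           # 1-255
    element_length: int  # eye catcher + receive buffer, equal within one RMB

    def __post_init__(self) -> None:
        if not 1 <= self.index <= MAX_RMBE_INDEX:
            raise ValueError("RMB index must be in the range 1-255")
        if self.element_length <= 0:
            raise ValueError("RMB element length must be positive")

    def element_address(self) -> int:
        # Assumption: contiguous elements, with index 1 at offset 0.
        offset = (self.index - 1) * self.element_length
        return self.rtoken.virtual_address + offset
```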

   Multiple RMBs can be associated to an SMC-R link group, and each peer
   in an SMC-R link group manages allocation of its RMBs.  RMB
   allocation can be asymmetric.  For example, Server X can allocate two
   RMBs to an SMC-R link group while Server Y allocates five.  This
   provides maximum implementation flexibility to allow hosts to
   optimize RMB management for their own local requirements.  The
   maximum number of RMBs that can be allocated on one peer to a link
   group is 255.  If more RMBs are required, the peer may fall back to
   IP for subsequent connections or, if the peer is the server, create a
   parallel link group.

   One use case for multiple RMBs is multiple receive buffer sizes.
   Since every element in an RMB must be the same size, multiple RMBs
   with different element sizes can be allocated if varying receive
   buffer sizes are required.

   Also, since the maximum number of TCP connections whose receive
   buffers can be allocated to an RMB is 255, multiple RMBs may be
   required to provide capacity for large numbers of TCP connections
   between two peers.
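
   The two limits above (255 elements per RMB and 255 RMBs per peer in a
   link group) give a simple sizing calculation.  The sketch below is
   illustrative only; the function name and the idea of grouping
   connections by receive buffer size are assumptions made for the
   example.

```python
from typing import Dict

MAX_CONNS_PER_RMB = 255  # RMB elements (TCP connections) per RMB
MAX_RMBS_PER_PEER = 255  # RMBs one peer can allocate to a link group


def rmbs_needed(conns_by_buffer_size: Dict[int, int]) -> Dict[int, int]:
    """Map each receive buffer size to the number of RMBs it requires."""
    plan = {}
    for size, conns in conns_by_buffer_size.items():
        if conns > 0:
            # Ceiling division: every started group of 255 needs its own RMB.
            plan[size] = (conns + MAX_CONNS_PER_RMB - 1) // MAX_CONNS_PER_RMB
    if sum(plan.values()) > MAX_RMBS_PER_PEER:
        raise ValueError("more than 255 RMBs required: fall back to IP or, "
                         "on the server, create a parallel link group")
    return plan
```

   For example, 300 connections with one buffer size and 255 with
   another need two RMBs and one RMB, respectively.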
   Separately from the RMB, the TCP/IP stack that owns each RMB
   maintains control data for each RMB element within its local control
   structures.  The control data contains flags for maintaining the
   state of the TCP data (for example, urgent data indicator) and, most
   importantly, the following two cursors, which are illustrated below
   in Figure 4:

   o  The peer producer cursor: This is a wrapping offset into the
      RMB element's receive buffer that points to the next byte of data
      to be written by the remote peer.  This cursor is provided by the
      remote peer in a Connection Data Control (CDC) message, which is
      sent using RoCE SendMsg processing, and tells the local peer how
      far it can consume data in the RMBE buffer.

   o  The peer consumer cursor: This is a wrapping offset into the
      remote peer's RMB element's receive buffer that points to the next
      byte of data to be consumed by the remote peer in its own RMBE.
      The local peer cannot write into the remote peer's RMBE beyond
      this point without causing data loss.  This cursor is also
      provided by the peer using a Connection Data Control message.

   Each TCP connection peer maintains its cursors for a TCP connection's
   RMBE in its local control structures.  In other words, the peer who
   writes into a remote peer's RMBE provides its producer cursor to the
   peer whose RMBE it has written into.  The peer who reads from its
   RMBE provides its consumer cursor to the writing peer.  In this
   manner, the reads and writes between peers are kept coordinated.

   For example, referring to Figure 4, Peer B writes the data shown as
   the hatched area into the receive buffer of Peer A's RMBE.  After
   that write completes, Peer B sends a CDC message carrying its updated
   producer cursor to Peer A, indicating how much data is available for
   Peer A to consume.  The CDC message that Peer B sends to Peer A wakes
   up Peer A and notifies it that there is data to be consumed.

   Similarly, when Peer A consumes data written by Peer B, it uses a CDC
   message to update its consumer cursor to Peer B to let Peer B know
   how much data it has consumed, so Peer B knows how much space is
   available for further writes.  If Peer B were to write enough data to
   Peer A that it would wrap the RMBE receive buffer and exceed the
   consumer cursor, data loss would result.
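
   The cursor arithmetic can be sketched as follows.  This is a
   simplified model, not the wire format: the Cursor class is
   hypothetical, and pairing each wrapping offset with a wrap counter is
   an assumption made here so that a full buffer can be told apart from
   an empty one.

```python
from dataclasses import dataclass


@dataclass
class Cursor:
    wrap: int = 0    # assumption: number of times the offset has wrapped
    offset: int = 0  # wrapping offset into the RMBE receive buffer

    def advance(self, nbytes: int, buf_len: int) -> None:
        """Move the cursor forward, wrapping at the end of the buffer."""
        total = self.offset + nbytes
        self.wrap += total // buf_len
        self.offset = total % buf_len

    def position(self, buf_len: int) -> int:
        """Absolute byte position since the connection started."""
        return self.wrap * buf_len + self.offset


def bytes_to_consume(producer: Cursor, consumer: Cursor, buf_len: int) -> int:
    """Data written by the remote peer that the RMBE owner has not read."""
    return producer.position(buf_len) - consumer.position(buf_len)


def writable_space(producer: Cursor, consumer: Cursor, buf_len: int) -> int:
    """Bytes the writer may still place without passing the consumer cursor."""
    return buf_len - bytes_to_consume(producer, consumer, buf_len)
```

   A writer that never sends more than writable_space() bytes cannot
   overwrite data the RMBE owner has not yet consumed.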

   Note that this is a simplistic description of the control flows, and
   they are optimized to minimize the number of CDC messages required,
   as described in Section 4.7 ("RMB Data Flows").
      Peer A's RMBE Control Info            Peer B's RMBE Control Info
     +--------------------------+          +--------------------------+
     |                          |          |                          |
      /----Peer producer cursor |    +-----+-Peer consumer cursor     |
    /|                          |    |     |                          |
   | +--------------------------+    |     +--------------------------+
   |  Peer A's RMBE                  |
   | +--------------------------+    |
   | |            +------------------+
   | |            |             |
   | |            \/            |
   | |             +------------|
   | |-------------+/////////// |
   | |//RDMA data written by ///|
   | |/// Peer B that is ////// |
   | |/available to be consumed/|
   | |///////////////////////// |
   | |///////// +---------------|
   | |----------+/\             |
   | |            |             |
    \|            |             |
     \           /              |
     |\---------/               |
     |                          |
     |                          |

                          Figure 4: RMBE Cursors

   Additional flags and indicators are communicated between peers.  In
   all cases, these flags and indicators are updated by the peer using
   CDC messages, which are sent using RoCE SendMsg.  More details on
   these additional flags and indicators are described in Section 4.3
   ("RMBE Control Information").

2.2. SMC-R Link Groups

   SMC-R links are logically grouped together to form an SMC-R link
   group.  The purpose of the link group is for supporting multiple
   links between the same two peers to provide for:

   o  Resilience: Provides transparent and dynamic switching of the link
      used by existing TCP connections during link failures, typically
      hardware related.  TCP traffic using the failing link can be
      switched to an active link within the link group, thereby avoiding
      disruptions to application workloads.

   o  Link utilization: Provides an active/active link usage model
      allowing TCP traffic to be balanced across the links, which
      increases bandwidth and also avoids hardware imbalances and
      bottlenecks.  Note that both adapter and switch utilization can
      become potential resource constraint issues.

   SMC-R link group support is required.  Resilience is not optional.
   However, the user can elect to provision a single RNIC (on one or
   both hosts).

   Multiple links that are formed between the same two peers fall into
   two distinct categories:

   1. Equal Links: Links providing equal access to the same RMB(s) at
      both endpoints, whereby all TCP connections associated with the
      links must have the same VLAN ID and have the same TCP server and
      TCP client roles or relationship.

   2. Unequal Links: Links providing access to unique, unrelated and
      isolated RMB(s) (i.e., for unique VLANs or unique and isolated
      application workloads, etc.) or having unique TCP server or client
      roles.

   Links that are logically grouped together forming an SMC-R link group
   must be equal links.

2.2.1. Link Group Types

   Equal links within a link group also have another "Link Group Type"
   attribute based on the link's associated underlying physical path.
   The following SMC-R link types are defined:

   1. Single link: the only active link within a link group

   2. Parallel link: not allowed -- SMC-R links having the same physical
      RNIC at both hosts

   3. Asymmetric link: links that have unique RNIC adapters at one host
      but share a single adapter at the peer host

   4. Symmetric link: links that have unique RNIC adapters at both hosts

   These link group types are further explained in the following figures
   and descriptions.

   Figure 2 above shows the single-link case.  The single link
   illustrated in Figure 2 also establishes the SMC-R link group.  Link
   groups are supposed to have multiple links, but when only one RNIC is
   available at both hosts then only a single link can be created.  This
   is expected to be a transient case.

   Figure 5 shows the symmetric-link case.  Both hosts have unique and
   redundant RNIC adapters.  This configuration meets the objectives for
   providing full RoCE redundancy required to provide the level of
   resilience required for high availability for SMC-R.  While this
   configuration is not required, it is a strongly recommended "best
   practice" for the exploitation of SMC-R.  Single and asymmetric links
   must be supported but are intended to provide for short-term
   transient conditions -- for example, during a temporary outage or
   recycle of an RNIC.

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |                   |                      |                   |
     | Protection        |                      |    Protection     |
     | Domain X          |                      |    Domain Y       |
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2|  QP 64     |
     |RToken X|   |      |<-------------------->|      |   |        |
     |        |   |      |                      |      |   |RToken Y|
     |       \/   +------+                      +------+  \/        |
     |+--------+         |                      |        +--------+ |
     ||        |         |                      |        |        | |
     || RMB    |         |                      |        | RMB    | |
     ||        |         |                      |        |        | |
     |+--------+         |                      |        +--------+ |
     |       /\   +------+                      +------+  /\        |
     |RToken Z|   |      |     SMC-R Link 2     |      |   |RToken W|
     |        |   |RNIC 3|<-------------------->|RNIC 4|   |        |
     |       QP 9 |      |                      |      |  QP 65     |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

                      Figure 5: Symmetric SMC-R Links
          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |                   |                      |                   |
     | Protection        |                      |    Protection     |
     | Domain X          |                      |    Domain Y       |
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2|  QP 64     |
     |RToken X|   |      |<-------------------->|      |   |        |
     |        |   |      |                   .->|      |   |RToken Y|
     |       \/   +------+                 .`   +------+  \/        |
     |+--------+         |               .`     |        +--------+ |
     ||        |         |             .`       |        |        | |
     || RMB    |         |           .`         |        | RMB    | |
     ||        |         |         .`SMC-R      |        |        | |
     |+--------+         |       .` Link 2      |        +--------+ |
     |       /\   +------+     .`               +------+            |
     |RToken Z|   |      |   .`                 |      |down or     |
     |        |   |RNIC 3|<-`                   |RNIC 4|unavailable |
     |       QP 9 |      |                      |      |            |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

                     Figure 6: Asymmetric SMC-R Links

   In the example provided by Figure 6, Host X has two RNICs but Host Y
   only has one RNIC because RNIC 4 is not available.  This
   configuration allows for the creation of an asymmetric link.  While
   an asymmetric link will provide some resilience (for example, when
   RNIC 1 fails), ideally each host should provide two redundant RNICs.
   This should be a transient case, and when RNIC 4 becomes available,
   this configuration must transition to a symmetric-link configuration.
   This transition is accomplished by first creating the new symmetric
   link and then deleting the asymmetric link with reason code
   "Asymmetric link no longer needed" specified in the DELETE LINK LLC
   message.
          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |                   |                      |                   |
     | Protection        |                      |    Protection     |
     | Domain X          |                      |    Domain Y       |
     |            +------+  SMC-R Link 1        +------+            |
     |       QP 8 |RNIC 1|<-------------------->|RNIC 2|  QP 64     |
     |RToken X|   |      |                      |      |   |        |
     |        |   |      |<-------------------->|      |   |RToken Y|
     |       \/   +------+  SMC-R Link 2        +------+  \/        |
     |+--------+   QP 9  |                      | QP 65  +--------+ |
     ||        |    |    |                      |  |     |        | |
     || RMB    |<-- +    |                      |  +---->| RMB    | |
     ||        |         |                      |        |        | |
     |+--------+         |                      |        +--------+ |
     |            +------+                      +------+            |
     |     down or|      |                      |      |down or     |
     | unavailable|RNIC 3|                      |RNIC 4|unavailable |
     |            |      |                      |      |            |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

              Figure 7: SMC-R Parallel Links (Not Supported)

   Figure 7 shows parallel links, which are two links in the link group
   that use the same hardware.  This configuration is not permitted.
   Because SMC-R multiplexes multiple TCP connections over an SMC-R link
   and both links are using the exact same hardware, there is no
   additional redundancy or capacity benefit obtained from this
   configuration.  In addition to providing no real benefit, this
   configuration adds the unnecessary overhead of additional queue
   pairs, generation of additional RKeys, etc.
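
   The four link types can be derived from which RNICs a candidate link
   would use at each host.  The sketch below is a simplified
   classification under stated assumptions: a link is modeled as a
   hypothetical (local RNIC, remote RNIC) pair, and the candidate is
   compared against the links already in the group.

```python
from enum import Enum
from typing import Iterable, Tuple

Link = Tuple[str, str]  # (local RNIC, remote RNIC); identifiers are made up


class LinkType(Enum):
    SINGLE = "single"
    PARALLEL = "parallel (not allowed)"
    ASYMMETRIC = "asymmetric"
    SYMMETRIC = "symmetric"


def classify(candidate: Link, existing: Iterable[Link]) -> LinkType:
    """Classify a candidate link against the links already in the group."""
    links = list(existing)
    if not links:
        return LinkType.SINGLE      # the only active link in the group
    if candidate in links:
        return LinkType.PARALLEL    # same RNIC at both hosts
    local, remote = candidate
    # Sharing an adapter at exactly one host makes the link asymmetric.
    shares_adapter = any(local == l or remote == r for l, r in links)
    return LinkType.ASYMMETRIC if shares_adapter else LinkType.SYMMETRIC
```

   Applied to the figures, with Link 1 between RNIC 1 and RNIC 2 already
   present, Link 2 of Figure 5 classifies as symmetric, Link 2 of
   Figure 6 as asymmetric, and Link 2 of Figure 7 as parallel.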

2.2.2. Maximum Number of Links in Link Group

   The SMC-R protocol defines a maximum of eight symmetric SMC-R links
   within a single SMC-R link group.  This allows for support for up to
   eight unique physical paths between peer hosts.  However, in terms of
   meeting the basic requirements for redundancy, support for at least
   two symmetric links must be implemented.  Supporting more than two
   links also simplifies implementation for practical matters relating
   to dynamically adding and removing links -- for example, starting a
   third SMC-R link prior to taking down one of the two existing links.
   Recall that all links within a link group must have equal access to
   all associated RMBs.

   The SMC-R protocol allows an implementation to assign an
   implementation-specific and appropriate value for maximum symmetric
   links.  The implementation value must not exceed the architecture
   limit of 8; also, the value must not be lower than 2, because the
   SMC-R protocol requires redundancy.  This does not mean that two
   RNICs are physically required to enable SMC-R connectivity, but at
   least two RNICs for redundancy are strongly recommended.

   The SMC-R peers exchange their implementation maximum link values
   during the link group establishment using the defined maximum link
   value in the CONFIRM LINK LLC command.  Once the initial exchange
   completes, the value is set for the life of the link group.  The
   maximum link value can be provided by both the server and client.
   The server must supply a value, whereas the client maximum link value
   is optional.  When the client does not supply a value, it indicates
   that the client accepts the server-supplied maximum value.  If the
   client provides a value, it cannot exceed the server-supplied maximum
   value.  If the client passes a lower value, this lower value then
   becomes the final negotiated maximum number of symmetric links for
   this link group.  Again, the minimum value is 2.

   During run time, the client must never request that the server add a
   symmetric link to a link group that would exceed the negotiated
   maximum link value.  Likewise, the server must never attempt to add a
   symmetric link to a link group that would exceed the negotiated
   maximum value.

   In terms of counting the number of active links within a link group,
   the initial (or the only/last) link is always counted as 1.  Then, as
   additional links are added, they are either symmetric or asymmetric
   links.

   With regard to enforcing the maximum link rules, asymmetric links
   are an exception and have a unique set of rules:

   o  At most one asymmetric link is allowed per link group.

   o  Asymmetric links must not be counted in the maximum symmetric-link
      count calculation.  When tracking the current count or enforcing
      the negotiated maximum number of links, an asymmetric link is not
      to be counted.
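
   The negotiation and counting rules in this section can be sketched as
   follows.  The function names and parameters are hypothetical; only
   the limits (minimum 2, architecture maximum 8, at most one asymmetric
   link, asymmetric links not counted) come from the text above.

```python
from typing import Optional

ARCH_MAX_LINKS = 8  # architecture limit on symmetric links
MIN_LINKS = 2       # the protocol requires redundancy


def negotiate_max_links(server_max: int,
                        client_max: Optional[int] = None) -> int:
    """Return the maximum symmetric-link count for the link group."""
    if not MIN_LINKS <= server_max <= ARCH_MAX_LINKS:
        raise ValueError("server value must be in the range 2-8")
    if client_max is None:
        return server_max  # client accepts the server-supplied value
    if not MIN_LINKS <= client_max <= server_max:
        raise ValueError("client value must be between 2 and the server value")
    return client_max      # a lower client value becomes the final maximum


def can_add_link(symmetric_count: int, has_asymmetric: bool,
                 negotiated_max: int, new_is_asymmetric: bool) -> bool:
    """Check a proposed link against the negotiated limits.

    symmetric_count includes the initial link, which always counts as 1.
    """
    if new_is_asymmetric:
        return not has_asymmetric  # at most one, and it is not counted
    return symmetric_count < negotiated_max
```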

2.2.3. Forming and Managing Link Groups

   SMC-R link groups are self-defining.  The first SMC-R link in a link
   group is created using TCP option flows on the TCP three-way
   handshake followed by CLC message flows over the TCP connection.
   Subsequent SMC-R links in the link group are created by sending LLC
   messages over an SMC-R link that already exists in the link group.
   Once an SMC-R link group is created, no additional SMC-R links in
   that group are created using TCP and CLC negotiation.  Because
   subsequent SMC-R links are created exclusively by sending LLC
   messages over an existing SMC-R link in a link group, the membership
   of SMC-R links in a link group is self-defining.

   This architecture does not define a specific identifier for an SMC-R
   link group.  This identification may be useful for network management
   and may be assigned in a platform-specific manner, or in an extension
   to this architecture.

   In each SMC-R link group, one peer is the server for all TCP
   connections and the other peer is the client.  If there are
   additional TCP connections between the peers that use SMC-R and have
   the client and server roles reversed, another SMC-R link group is set
   up between them with the opposite client-server relationship.  This
   is required because there are specific responsibilities divided
   between the client and server in the management of an SMC-R link
   group.

   In this architecture, the decision of whether to use an existing
   SMC-R link group or create a new SMC-R link group for a TCP
   connection is made exclusively by the server.

   Management of the links in an SMC-R link group is also a server
   responsibility.  The server is responsible for adding and deleting
   links in a link group.  The client may request that the server take
   certain actions, but the final responsibility is the server's.

2.2.4. SMC-R Link Identifiers

   This architecture defines multiple identifiers to identify SMC-R
   links and peers.

   o  Link number: This is a 1-byte value that identifies an SMC-R link
      within a link group.  Both the server and the client use this
      number to distinguish an SMC-R link from other links within the
      same link group.  It is only unique within a link group.  In order
      to prevent timing windows that may occur when a server creates a
      new link while the client is still cleaning up a previously
      existing link, link numbers cannot be reused until the entire link
      numbering space has been exhausted.

   o  Link user ID: This is an architecturally opaque 4-byte value that
      a peer uses to uniquely define an SMC-R link within its own space.
      This means that a link user ID is unique within one peer only.
      Each peer defines its own link user ID for a link.  The peers
      exchange this information once during link setup, and it is never
      used architecturally again.  The purpose of this identifier is for
      network management, display, and debugging.  For example, an
      operator on a client could provide the operator on the server with
      the server's link user ID if he requires the server's operator to
      check on the operation of a link that the client is having trouble
      with.

   o  Peer ID: The SMC-R peer ID uniquely identifies a specific instance
      of a specific TCP/IP stack.  It is required because in clustered
      and load-balancing environments, an IP address does not uniquely
      identify a TCP/IP stack.  An RNIC's MAC/GID also doesn't uniquely
      or reliably identify a TCP/IP stack, because RNICs can go up and
      down and even be redeployed to other TCP/IP stacks in a
      multiple-partitioned or virtualized environment.  The peer ID is
      not only unique per TCP/IP stack but is also unique per instance
      of a TCP/IP stack, meaning that if a TCP/IP stack is restarted,
      its peer ID changes.
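
   The rule that link numbers are not reused until the numbering space
   is exhausted can be sketched with a simple allocator.  This is one
   possible approach, not a prescribed one: the class name is
   hypothetical, and the sketch assumes that all 256 values of the
   1-byte field are usable, which the text above does not confirm.

```python
from typing import Set

LINK_NUMBER_SPACE = 256  # assumption: every value of the 1-byte field is valid


class LinkNumberAllocator:
    """Hands out link numbers in sequence so that a freed number is not
    reused until the whole numbering space has been cycled through."""

    def __init__(self) -> None:
        self._next = 0
        self._in_use: Set[int] = set()

    def allocate(self) -> int:
        for _ in range(LINK_NUMBER_SPACE):
            number = self._next
            self._next = (self._next + 1) % LINK_NUMBER_SPACE
            if number not in self._in_use:
                self._in_use.add(number)
                return number
        raise RuntimeError("no free link number in this link group")

    def release(self, number: int) -> None:
        """Free a number; it becomes eligible again only after wraparound."""
        self._in_use.discard(number)
```

   Because allocation always moves forward, a link number released while
   the client is still cleaning up is not handed out again until every
   other value has been tried.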

2.3. SMC-R Resilience and Load Balancing

   The SMC-R multilink architecture provides resilience for network high
   availability via failover capability to an alternate RoCE adapter.
   The SMC-R multilink architecture does not define primary, secondary,
   or alternate roles to the links.  Instead, there are multiple active
   links representing multiple redundant RoCE paths over the same LAN.

   Assignment of TCP connections to links is unidirectional and
   asymmetric.  This means that the client and server may each choose a
   separate link for their RDMA writes associated with a specific TCP
   connection.

   If a hardware failure occurs or a QP failure associated with an
   individual link occurs, then the TCP connections that were associated
   with the failing link are dynamically and transparently switched to
   use another available link.  The server or the client can detect a
   failure, immediately move their TCP connections, and then notify
   their peer via the DELETE LINK LLC command.  While the client can
   notify the server of an apparent link failure with the DELETE LINK
   LLC command, the server performs the actual link deletion.

   The movement of TCP connections to another link can be accomplished
   with minimal coordination between the peers.  The TCP connection
   movement is also transparent to, and non-disruptive to, the TCP
   socket application workloads for most failure scenarios.  After a
   failure, the surviving links and all associated hardware must handle
   the link group's workload.

   As each SMC-R peer begins to move active TCP connections to another
   link, all current RDMA write operations must be allowed to complete.
   The moving peer then sends a signal to verify receipt of the last
   successful write by its peer.  If this verification fails, the TCP
   connection must be reset.  Once this verification is complete, all
   writes that failed may then be retried, in order, over the new link.
   Any data writes or CDC messages for which the sender did not receive
   write completion must be replayed before any subsequent data or CDC
   write operations are sent.  LLC messages are not retried over the new
   link, because they are dependent on a known link configuration, which
   has just changed because of the failure.  The initiator of an LLC
   message exchange that fails will be responsible for retrying once the
   link group configuration stabilizes.
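
   The ordering rules for moving a TCP connection to another link can be
   sketched as follows.  This is an illustrative model only: the Write
   record, the sequence numbers, and the two callbacks are hypothetical
   stand-ins for implementation-specific mechanisms.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Write:
    seq: int         # order in which the write was originally issued
    kind: str        # "data", "cdc", or "llc"
    completed: bool  # True if the sender received a write completion


def writes_to_replay(outstanding: List[Write]) -> List[Write]:
    """Data and CDC writes without a completion, in original order.

    LLC messages are excluded: they depend on the old link configuration.
    """
    pending = [w for w in outstanding
               if not w.completed and w.kind in ("data", "cdc")]
    return sorted(pending, key=lambda w: w.seq)


def fail_over(outstanding: List[Write],
              peer_confirms_last_write: Callable[[], bool],
              send_on_new_link: Callable[[Write], None]) -> bool:
    """Move a connection to a new link; False means reset the connection."""
    if not peer_confirms_last_write():
        return False  # verification failed: the TCP connection must be reset
    for write in writes_to_replay(outstanding):
        send_on_new_link(write)  # replay before any new data or CDC writes
    return True
```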

   When a new link becomes available and is re-added to the link group,
   each peer is free to rebalance its current TCP connections as needed
   or only assign new TCP connections to the newly added link.  Both the
   server and client are free to manage TCP connections across the link
   group as needed.  TCP connection movement does not have to be
   stimulated by a link failure.

   The SMC-R architecture also defines orderly versus disorderly
   failover.  The type of failover is communicated in the LLC
   DELETE LINK command and is simply a means to indicate that the link
   has terminated (disorderly) or link termination is imminent
   (orderly).  The orderly link deletion could be initiated via operator
   command or programmatically to bring down an idle link.  For example,
   an operator command could initiate orderly shutdown of an adapter for
   service.  Implementation of the two types is based on implementation
   requirements and is beyond the scope of the SMC-R architecture.


