RFC 7609

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Pages: 143
Informational

Part 3 of 6 – Pages 26 to 59

RFC7609 - Page 26

3.  SMC-R Rendezvous Architecture

   "Rendezvous" is the process that SMC-R-capable peers use to
   dynamically discover each other's capabilities, negotiate SMC-R
   connections, set up SMC-R links and link groups, and manage those
   link groups.  A key aspect of SMC-R Rendezvous is that it occurs
   dynamically and automatically, without requiring SMC-R link
   configuration to be defined by an administrator.

   SMC-R Rendezvous starts with the TCP/IP three-way handshake, during
   which connection peers use TCP options to announce their SMC-R
   capabilities.  If both endpoints are SMC-R capable, then Connection
   Layer Control (CLC) messages are exchanged between the peers' SMC-R
   layers over the newly established TCP connection to negotiate SMC-R
   credentials.  The CLC message mechanism is analogous to the messages
   exchanged by SSL for its handshake processing.

   If a new SMC-R link is being set up, Link Layer Control (LLC)
   messages are used to confirm RDMA connectivity.  LLC messages are
   also used by the SMC-R layers at each peer to manage the links and
   link groups.

   Once an SMC-R link is set up or agreed to by the peers, the TCP
   sockets are passed to the peer applications, which use them as
   normal.  The SMC-R layer, which resides under the sockets layer,
   transmits the socket data between peers over RDMA using the SMC-R
   protocol, bypassing the TCP/IP stack.

3.1.  TCP Options

   During the TCP/IP three-way handshake, the client and server indicate
   their support for SMC-R by including experimental TCP option 254 on
   the three-way handshake flows, in accordance with [RFC6994] ("Shared
   Use of Experimental TCP Options").  The Experiment Identifier (ExID)
   value used is the string "SMCR" in EBCDIC (IBM-1047) encoding
   (0xE2D4C3D9).  This ExID has been registered in the "TCP Experimental
   Option Experiment Identifiers (TCP ExIDs)" registry maintained
   by IANA.
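
   As a rough illustration of the option described above, the following
   Python sketch builds and recognizes the SMC-R experimental option.
   It assumes the generic [RFC6994] layout (Kind, Length, then the
   ExID); the function names are hypothetical and are not part of this
   specification.

```python
import struct

TCP_OPT_EXPERIMENTAL = 254             # experimental option kind named in the text
SMCR_EXID = bytes.fromhex("E2D4C3D9")  # "SMCR" in EBCDIC (IBM-1047), per the text


def build_smcr_option() -> bytes:
    """Return Kind, Length, ExID; Length counts the Kind and Length bytes too."""
    length = 2 + len(SMCR_EXID)
    return struct.pack("!BB", TCP_OPT_EXPERIMENTAL, length) + SMCR_EXID


def has_smcr_option(options: bytes) -> bool:
    """Walk a raw TCP options area looking for option 254 with the SMC-R ExID."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # End of Option List
            break
        if kind == 1:            # No-Operation occupies a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                # malformed option
        if (kind == TCP_OPT_EXPERIMENTAL and length >= 6
                and options[i + 2:i + 6] == SMCR_EXID):
            return True
        i += length
    return False
```

   A stack that can inspect the SYN or SYN-ACK directly could apply a
   check of this kind to the options area of the received segment.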

   After completion of the three-way TCP handshake, each peer queries
   its peer's options.  If both peers set the TCP option on the
   three-way handshake, inline SMC-R negotiation occurs using CLC
   messages.  If neither peer, or only one peer, sets the TCP option,
   SMC-R cannot be used for the TCP connection, and the TCP connection
   completes the setup using the IP fabric.

3.2.  Connection Layer Control (CLC) Messages

   CLC messages are sent as data payload over the IP network using the
   TCP connection between SMC-R layers at the peers.  They are analogous
   to the messages used to exchange parameters for SSL.

   The use of CLC messages is detailed in the following sections.  The
   following list provides a summary of the defined CLC messages and
   their purposes:

   o  SMC Proposal: Sent from the client to propose that this TCP
      connection is eligible to be moved to SMC-R.  The client
      identifies itself and its subnet to the server and passes the
      SMC-R elements for a suggested RoCE path via the MAC and GID.

   o  SMC Accept: Sent from the server to accept the client's TCP
      connection SMC Proposal.  The server responds to the client's
      proposal by identifying itself to the client and passing the
      elements of a RoCE path that the client can use to perform RDMA
      writes to the server.  This consists of such SMC-R link elements
      as RoCE MAC, GID, and RMB information.

   o  SMC Confirm: Sent from the client to confirm the server's
      acceptance of the SMC connection.  The client responds to the
      server's acceptance by passing the elements of a RoCE path that
      the server can use to perform RDMA writes to the client.  This
      consists of such SMC-R link elements as RoCE MAC, GID, and RMB
      information.

   o  SMC Decline: Sent from either the server or the client to reject
      the SMC connection, indicating the reason the peer must decline
      the SMC Proposal and allowing the TCP connection to revert back to
      IP connectivity.
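
   The four CLC messages and their permitted senders, as summarized in
   the list above, can be captured in a small table.  This is only a
   sketch: the enum and helper names are hypothetical, and no wire
   encoding is implied (the message formats are in Appendix A).

```python
from enum import Enum


class ClcMessage(Enum):
    """CLC messages with the roles that may send them, per the summary list."""

    PROPOSAL = ("SMC Proposal", frozenset({"client"}))
    ACCEPT = ("SMC Accept", frozenset({"server"}))
    CONFIRM = ("SMC Confirm", frozenset({"client"}))
    DECLINE = ("SMC Decline", frozenset({"client", "server"}))

    def __init__(self, label, senders):
        self.label = label
        self.senders = senders


# Order of messages when neither side declines.
HAPPY_PATH = [ClcMessage.PROPOSAL, ClcMessage.ACCEPT, ClcMessage.CONFIRM]


def may_send(role: str, message: ClcMessage) -> bool:
    """True if the given role ("client" or "server") may originate the message."""
    return role in message.senders
```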

3.3.  LLC Messages

   Link Layer Control (LLC) messages are sent between peer SMC-R layers
   over an SMC-R link to manage the link or the link group.  LLC
   messages are sent using RoCE SendMsg and are 44 bytes long.  The
   44-byte size is based on what can fit into a RoCE Work Queue Element
   (WQE) without requiring the posting of receive buffers.

   LLC messages generally follow a request-reply semantic.  Each message
   has a request flavor and a reply flavor, and each request must be
   confirmed with a reply, except where otherwise noted.  The use of LLC
   messages is detailed in the following sections.  The following list
   provides a summary of the defined LLC messages and their purposes:

   o  ADD LINK: Used to add a new link to a link group.  Sent from the
      server to the client to initiate addition of a new link to the
      link group, or from the client to the server to request that the
      server initiate addition of a new link.

   o  ADD LINK CONTINUATION: A continuation of ADD LINK that allows the
      ADD LINK to span multiple commands, because all of the link
      information cannot be contained in a single ADD LINK message.

   o  CONFIRM LINK: Used to confirm that RoCE connectivity over a newly
      created SMC-R link is working correctly.  Initiated by the server.
      Both this message and its reply must flow over the SMC-R link
      being confirmed.

   o  DELETE LINK: When initiated by the server, deletes a specific link
      from the link group or deletes the entire link group.  When
      initiated by the client, requests that the server delete a
      specific link or the entire link group.

   o  CONFIRM RKEY: Informs the peer on the SMC-R link of the addition
      of an RMB to the link group.

   o  CONFIRM RKEY CONTINUATION: A continuation of CONFIRM RKEY that
      allows the CONFIRM RKEY to span multiple commands, in the event
      that all of the information cannot be contained in a single
      CONFIRM RKEY message.

   o  DELETE RKEY: Informs the peer on the SMC-R link of the deletion of
      one or more RMBs from the link group.

   o  TEST LINK: Verifies that an already-active SMC-R link is active
      and healthy.

   o  Optional LLC message: Any LLC message in which the two high-order
      bits of the opcode are b'10'.  This optional message must be
      silently discarded by a receiving peer that does not support the
      opcode.  No such messages are defined in this version of the
      architecture; however, the concept is defined to allow for
      toleration of possible advanced, optional functions.
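
   The optional-message rule in the last item above amounts to a bit
   test on the opcode.  The sketch below assumes a 1-byte opcode that
   the caller has already extracted from the message (its position is
   defined in Appendix A, not here).  Raising an error for an
   unsupported non-optional opcode is also an assumption, since this
   section does not say how that case is handled.

```python
LLC_MESSAGE_LEN = 44  # LLC message size stated in the text


def is_optional_opcode(opcode: int) -> bool:
    """True when the two high-order bits of a 1-byte opcode are b'10'."""
    return (opcode & 0xC0) == 0x80


def dispatch_llc(message: bytes, opcode: int, handlers: dict):
    """Call the handler for a supported opcode; silently drop unknown optional ones."""
    if len(message) != LLC_MESSAGE_LEN:
        raise ValueError("LLC messages are 44 bytes long")
    handler = handlers.get(opcode)
    if handler is not None:
        return handler(message)
    if is_optional_opcode(opcode):
        return None  # silently discarded, as the text requires
    raise LookupError("unsupported non-optional LLC opcode")  # assumption
```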

   CONFIRM LINK and TEST LINK are sensitive to which link they flow on
   and must flow on the link being confirmed or tested.  The other flows
   may flow over any active link in the link group.  When there are
   multiple links in a link group, a response to an LLC message must
   flow over the same link that the original message flowed over, with
   the following exceptions:

   o  ADD LINK request from a server in response to an ADD LINK from a
      client.

   o  DELETE LINK request from a server in response to a DELETE LINK
      from a client.
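
   One way to read the link-selection rules above is as a function from
   a message to the set of links it may flow on.  The sketch below is
   an interpretation, not normative text: message names are plain
   strings, links are identified by hypothetical link numbers, and the
   two exceptions are modeled simply by treating the server's ADD LINK
   or DELETE LINK as a new request rather than a reply.

```python
LINK_BOUND = {"CONFIRM LINK", "TEST LINK"}  # must use the link they act on


def permitted_links(message, is_reply, active_links,
                    subject_link=None, request_link=None):
    """Return the set of links on which the message may be sent."""
    if message in LINK_BOUND:
        return {subject_link}      # the link being confirmed or tested
    if is_reply:
        return {request_link}      # replies follow the original request
    return set(active_links)       # other requests: any active link
```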

3.4.  CDC Messages

   Connection Data Control (CDC) messages are sent over the RoCE fabric
   between peers using RoCE SendMsg and are 44 bytes long.  The 44-byte
   size is based on the size that can fit into a RoCE WQE without
   requiring the posting of receive buffers.  CDC messages are used to
   describe the socket application data passed via RDMA write
   operations, as well as TCP connection state information, including
   producer cursors and consumer cursors, RMBE state information, and
   failover data validation.

3.5.  Rendezvous Flows

   Rendezvous information for SMC-R is exchanged as TCP options on the
   TCP three-way handshake flows to indicate capability, followed by
   inline TCP negotiation messages that perform the SMC-R setup.
   Formats of all rendezvous options and messages discussed in this
   section are detailed in Appendix A.

3.5.1.  First Contact

   First contact between RoCE peers occurs when a new SMC-R link group
   is being set up.  This could be because no SMC-R links already exist
   between the peers, or the server decides to create a new SMC-R link
   group in parallel with an existing one.

3.5.1.1.  Pre-negotiation of TCP Options

   The client and server indicate their SMC-R capability to each other
   using TCP option 254 on the TCP three-way handshake flows.

   A client that wishes to use SMC-R will include TCP option 254 using
   an ExID equal to the EBCDIC (codepage IBM-1047) encoding of "SMCR" on
   its SYN flow.

   A server that supports SMC-R will include TCP option 254 with the
   ExID value of EBCDIC "SMCR" on its SYN-ACK flow.  Because the server
   is listening for connections and does not know where client
   connections will come from, the server implementation may choose to
   unconditionally include this TCP option if it supports SMC-R.  This
   may be required for server implementations where extensions to the
   TCP stack are not practical.  For server implementations that can add
   code to examine and react to packets during the three-way handshake,
   the server should only include the SMC-R TCP option on the SYN-ACK if
   the client included it on its SYN packet.

   A client who supports SMC-R and meets the three conditions outlined
   above may optionally include the TCP option for SMC-R on its ACK
   flow, regardless of whether or not the server included it on its
   SYN-ACK flow.  Some TCP/IP stacks may have to include it if the SMC-R
   layer cannot modify the options on the socket until the three-way
   handshake completes.  Proprietary servers should not include this
   option on the ACK flow, since including it on the SYN flow was
   sufficient to indicate the client's capabilities.

   Once the initial three-way TCP handshake is completed, each peer
   examines the socket options.  SMC-R implementations may do this by
   examining what was actually provided on the SYN and SYN-ACK packets
   or by performing a getsockopt() operation to determine the options
   sent by the peer.  If neither peer, or only one peer, specified the
   TCP option for SMC-R, then SMC-R cannot be used on this connection
   and it proceeds using normal IP flows and processing.

   If both peers specified the TCP option for SMC-R, application data
   does not yet flow on the TCP connection; the peers proceed to SMC-R
   negotiation using inline data flows.  The socket is not yet turned
   over to the applications; instead, the respective SMC layers exchange
   CLC messages over the newly formed TCP connection.
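
   The outcome of the option check reduces to a two-input decision.  A
   minimal sketch, with hypothetical names for the function and its
   return values:

```python
def after_handshake(client_sent_option: bool, server_sent_option: bool) -> str:
    """Decide what follows the three-way handshake."""
    if client_sent_option and server_sent_option:
        return "clc_negotiation"   # SMC layers exchange CLC messages inline
    return "tcp_ip"                # neither or only one peer: normal IP flows
```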

3.5.1.2.  Client Proposal

   If SMC-R is supported by both peers, the client sends an SMC Proposal
   CLC message to the server.  It is not immediately apparent on this
   flow from client to server whether this is a new or existing SMC-R
   link, because in clustered environments a single IP address may
   represent multiple hosts.  This type of cluster virtual IP address
   can be owned by a network-based or host-based Layer 4 load balancer
   that distributes incoming TCP connections across a cluster of
   servers/hosts.  For purposes of high availability, other clustered
   environments may also support the movement of a virtual IP address
   dynamically from one host in the cluster to another.  In summary, the
   client cannot predetermine that a connection is targeting the same
   host by simply matching the destination IP address for outgoing TCP
   connections.  Therefore, it cannot predetermine the SMC-R link that
   will be used for a new TCP connection.  This information will be
   dynamically learned, and the appropriate actions will be taken as the
   SMC-R negotiation handshake unfolds.

   In the SMC-R proposal message, the initiator (client) proposes the
   use of SMC-R by including its peer ID, GID, and MAC addresses, as
   well as the IP subnet number of the outgoing interface (if IPv4) or
   the IP prefix list for the network over which the proposal is sent
   (if IPv6).  At this point in the flow, the client makes no local
   commitments of resources for SMC-R.

   When the server receives the SMC Proposal CLC message, it uses the
   peer ID provided by the client, plus subnet or prefix information
   provided by the client, to determine if it already has a usable SMC-R
   link with this SMC-R peer.  If there are one or more existing SMC-R
   links with this SMC-R peer, the server then decides which SMC-R link
   it will use for this TCP connection.  See Sections 3.5.2 and 3.5.3
   for the cases of reusing an existing SMC-R link or creating a
   parallel SMC-R link group between SMC-R peers.

   If this is a first contact between SMC-R peers, the server must
   validate that it is on the same LAN as the client before continuing.
   For IPv4, the server does this by verifying that it has an interface
   with an IP subnet number that matches the subnet number sent by the
   client in the SMC Proposal.  For IPv6, it does this by verifying that
   it is directly attached to at least one IP prefix that was listed by
   the client in its SMC Proposal message.
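
   The same-LAN check described above can be sketched with Python's
   ipaddress module.  The representation is an assumption: both the
   server's attached networks and the client's advertised subnet (IPv4)
   or prefix list (IPv6) are given here as CIDR strings, whereas the
   actual encoding in the SMC Proposal is defined in Appendix A.

```python
import ipaddress


def server_on_same_lan(server_networks, client_networks) -> bool:
    """True if the server is attached to a subnet or prefix the client listed."""
    attached = {ipaddress.ip_network(n) for n in server_networks}
    return any(ipaddress.ip_network(n) in attached for n in client_networks)
```

   For IPv4 the client list would hold a single subnet; for IPv6 it may
   hold several prefixes, and one match is enough.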

   If the server agrees to use SMC-R, the server begins the setup of a
   new SMC-R link by allocating local QP and RMB resources (setting its
   QP state to INIT) and providing its full SMC-R information in an SMC
   Accept CLC message to the client over the TCP connection, along with
   a flag set indicating that this is a first contact flow.  While the
   SMC Accept message could flow over any IP route back to the client
   depending upon Layer 3 IP routing, the SMC-R credentials provided
   must be for the common subnet or prefix between the server and
   client, as determined above.  If the server cannot or does not want
   to do SMC-R with the client, it sends an SMC Decline CLC message to
   the client, and the connection data may begin flowing using normal
   TCP/IP flows.

3.5.1.3.  Server Acceptance

   When the client receives the SMC Accept from the server, it
   determines whether this is a new or existing SMC-R link, using the
   combination of the following: the first contact flag, its MAC/GID and
   the MAC/GID returned by the server, the VLAN over which the
   connection is being set up, and the QP number provided by the server.

   If it is an existing SMC-R link and the client agrees to use that
   link for the TCP connection, see Section 3.5.2 ("Subsequent Contact")
   below.  If it is a new SMC-R link between peers that already have an
   SMC-R link, then the server is starting a new SMC-R link group.
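
   The text lists the inputs to the client's new-versus-existing
   decision but not an algorithm.  The sketch below is one plausible
   reading, offered as an assumption: the client keeps a set of keys
   for the links it already has and treats a set first contact flag as
   decisive.  All names are hypothetical, and the "unknown" outcome is
   outside what this section covers.

```python
from typing import NamedTuple


class LinkKey(NamedTuple):
    """Values the text says the client combines to identify a link."""
    local_mac_gid: tuple
    peer_mac_gid: tuple
    vlan: int
    peer_qp: int


def classify_accept(first_contact: bool, key: LinkKey, known_links: set) -> str:
    """Classify an SMC Accept as referring to a new or an existing link."""
    if first_contact:
        return "new_link"
    if key in known_links:
        return "existing_link"
    return "unknown"  # not addressed in this section
```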

   Assuming that either (1) this is a first contact between peers or
   (2) the server is starting a new SMC-R link group, the client now
   allocates local QP and RMB resources for the SMC-R link (setting the
   QP state to RTR (ready to receive)), associates them with the server
   QP as learned from the SMC Accept CLC message, and sends an SMC
   Confirm CLC message to the server over the TCP connection with its
   SMC-R link information included.  The client also starts a timer to
   wait for the server to confirm the reliably connected queue pair, as
   described below.

3.5.1.4.  Client Confirmation

   Upon receipt of the client's SMC Confirm CLC message, the server
   associates its QP for this SMC-R link with the client's QP as learned
   from the SMC Confirm CLC message and sets its QP state to RTS (ready
   to send).  The client and the server now have reliably connected
   queue pairs.

3.5.1.5.  Link (QP) Confirmation

   Since setting up the SMC-R link and its QPs did not require any
   network flows on the RoCE fabric, the client and server must now
   confirm connectivity over the RoCE fabric.  To accomplish this, the
   server will send a CONFIRM LINK Link Layer Control (LLC) message to
   the client over the newly created SMC-R link, using the RoCE fabric.
   The CONFIRM LINK LLC message will provide the server's MAC, GID, and
   QP information for the connection, will allow each partner to
   communicate the maximum number of links it can tolerate in this link
   group (the "link limit"), and will additionally provide two link IDs:

   o  a 1-byte server-assigned link number that is used by both peers to
      identify the link within the link group and is only unique within
      a link group.

   o  a 4-byte link user ID.  This opaque value is assigned by the
      server for the server's local use and is provided to the client
      for management purposes -- for example, to use in network
      management displays and products.

   When the server sends this message, it will set a timer for receiving
   confirmation from the client.

   When the client receives the server's confirmation in the form of a
   CONFIRM LINK LLC message, it will cancel the confirmation timer it
   set when it sent the SMC Confirm message.  The client will also
   advance its QP state to RTS and respond over the RoCE fabric with a
   CONFIRM LINK response LLC message that (1) provides its MAC, GID,
   QP number, and link limit, (2) confirms the 1-byte link number sent
   by the server, and (3) provides its own 4-byte link user ID to the
   server.
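
   The QP state changes named in Sections 3.5.1.2 through 3.5.1.5 can
   be collected into a lookup table.  This is a summary sketch only:
   the event labels are hypothetical, and only the transitions stated
   in the text are listed.

```python
# (role, event) -> QP state the text says results from that event
QP_TRANSITIONS = {
    ("server", "allocate resources for SMC Accept"): "INIT",
    ("server", "receive SMC Confirm"): "RTS",
    ("client", "receive SMC Accept"): "RTR",
    ("client", "receive CONFIRM LINK"): "RTS",
}


def replay(role, events):
    """Apply events in order and return the resulting QP state (None if no QP yet)."""
    state = None
    for event in events:
        state = QP_TRANSITIONS.get((role, event), state)
    return state
```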

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|                      |RNIC 2|  QP 64     |
    |RToken X|   |MAC MA|                      |MAC MB|   |        |
    |        |   |GID GA|                      |GID GB|   |RToken Y|
    |       \/   +------+      (Subnet S1)     +------+  \/        |
    |+--------+         |                      |        +--------+ |
    || RMB    |         |                      |        | RMB    | |
    |+--------+         |                      |        +--------+ |
    |            +------+                      +------+            |
    |            |RNIC 3|                      |RNIC 4|            |
    |            |MAC MC|                      |MAC MD|            |
    |            |GID GC|                      |GID GD|            |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

                     SYN TCP options(254,"SMCR")
        <---------------------------------------------------------

                     SYN-ACK TCP options(254,"SMCR")
        --------------------------------------------------------->

                     ACK [TCP options(254,"SMCR")]
        <--------------------------------------------------------

                    SMC Proposal(PC1,MB,GB,S1)
        <--------------------------------------------------------

    SMC Accept(PS1,first contact,MA,GA,MTU,QP8,RToken=X,RMB elem index)
        --------------------------------------------------------->

         SMC Confirm(PC1,MB,GB,MTU,QP64,RToken=Y,RMB element index)
         <--------------------------------------------------------

       CONFIRM LINK(MA,GA,QP8, link lim, server link user ID, linknum)
        .........................................................>

    CONFIRM LINK rsp(MB,GB,QP64, link lim, client link user ID, linknum)
        <........................................................

                           Legend:
                    ------------   TCP/IP and CLC flows
                    ............   RoCE (LLC) flows
           Square brackets ("[ ]") indicate optional information

                 Figure 8: First Contact Rendezvous Flows

   Technically, the data for the TCP connection could now flow over the
   RoCE path.  However, if this is a first contact, there is no
   alternate for this recently established RoCE path.  Since in the
   current architecture there is no failover from RoCE to IP once
   connection data starts flowing, a failure of this path would disrupt
   the TCP connection, and the level of redundancy and failover would be
   less than that provided by IP.  If the network has alternate RoCE
   paths available, they would not be usable at this point.  This
   situation would be unacceptable.

3.5.1.6.  Second SMC-R Link Setup

   Because of the unacceptable situation described above, TCP data will
   not be allowed to flow on the newly established SMC-R link until a
   second path has been set up, or at least attempted.

   If the server has a second RNIC available on the same LAN, it
   attempts to set up the second SMC-R link over that second RNIC.  If
   it only has one RNIC available on the LAN, it will attempt to set up
   the second SMC-R link over that one RNIC.  In the latter case, the
   server is attempting to set up an asymmetric link, in case the client
   does have a second RNIC on the LAN.

   In either case, the server allocates a new QP over the RNIC it is
   attempting to use for the second link and assigns a link number to
   the new link; the server also creates an RToken for the RMB over this
   second QP (note that this means that the first and second QP each
   have their own RToken to represent the same RMB).  The server
   provides this information, as well as the MAC and GID of the RNIC
   over which it is attempting to set up the second link, in an ADD LINK
   LLC message that it sends to the client over the SMC-R link that is
   already set up.

3.5.1.6.1.  Client Processing of ADD LINK LLC Message from Server

   When the client receives the server's ADD LINK LLC message, it
   examines the GID and MAC provided by the server to determine whether
   the server is attempting to use the same server-side RNIC as the
   existing SMC-R link or a different one.

   If the server is attempting to use the same server-side RNIC as the
   existing SMC-R link, then the client verifies that it has a second
   RNIC on the same LAN.  If it does not, the client rejects the
   ADD LINK request from the server, because the resulting link would be
   a parallel link, which is not supported within a link group.  If the
   client does have a second RNIC on the same LAN, it accepts the
   request, and an asymmetric link will be set up.

   If the server is using a different server-side RNIC from the existing
   SMC-R link, then the client will accept the request and a second
   SMC-R link will be set up in this SMC-R link group.  If the client
   has a second RNIC on the same LAN, that second RNIC will be used for
   the second SMC-R link, creating symmetric links.  If the client does
   not have a second RNIC on the same LAN, it will use the same RNIC as
   was used for the initial SMC-R link, resulting in the setup of an
   asymmetric link in the SMC-R link group.

   In either case, when the client accepts the server's ADD LINK
   request, it allocates a new QP on the chosen RNIC and creates an RKey
   over that new QP for the client-side RMB for the SMC-R link group,
   then sends an ADD LINK reply LLC message to the server providing that
   information as well as echoing the link number that was sent by the
   server.

   If the client rejects the server's ADD LINK request, it sends an ADD
   LINK reply LLC message to the server with the reason code for the
   rejection.
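
   The client's handling of the server's ADD LINK request in this
   section depends on two facts: whether the server is reusing the RNIC
   of the existing link, and whether the client has a second RNIC on
   the same LAN.  A compact sketch of the four cases, with hypothetical
   names for the function and its results:

```python
def client_add_link_decision(server_reuses_rnic: bool,
                             client_has_second_rnic: bool) -> str:
    """Return the client's response to the server's ADD LINK request."""
    if server_reuses_rnic:
        if client_has_second_rnic:
            return "accept_asymmetric"
        return "reject_parallel"   # parallel links are not supported in a link group
    if client_has_second_rnic:
        return "accept_symmetric"
    return "accept_asymmetric"     # client reuses the RNIC of the first link
```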

3.5.1.6.2.  Server Processing of ADD LINK Reply LLC Message from Client

   If the client sends a negative response to the server or no reply is
   received, the server frees the RoCE resources it had allocated for
   the new link.  Having a single link in an SMC-R link group is
   undesirable.  The server's recovery is detailed in Appendix C.8
   ("Failure to Add Second SMC-R Link to a Link Group").

   If the client sends a positive reply to the server with
   MAC/GID/QP/RKey information, the server associates its QP for the new
   SMC-R link to the QP that the client provided.  Now, the new SMC-R
   link is in the same situation that the first was in after the client
   sent its ACK packet -- there is a reliably connected queue pair over
   the new RoCE path, but there have been no RoCE flows to confirm that
   it's actually usable.  So, at this point, the client and server will
   exchange CONFIRM LINK LLC messages just like they did on the first
   SMC-R link.

   If either peer receives a failure during this second CONFIRM LINK LLC
   exchange (either an immediate failure -- which implies that the
   message did not reach the partner -- or a timeout), it sends a DELETE
   LINK LLC message to the partner over the first (and now only) link in
   the link group.  This DELETE LINK LLC message must be acknowledged
   before data can flow on the single link in the link group.

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|      SMC-R Link 1    |RNIC 2|  QP 64     |
    |RToken X|   |MAC MA|<-------------------->|MAC MB|   |        |
    |        |   |GID GA|                      |GID GB|   |RToken Y|
    |       \/   +------+                      +------+  \/        |
    |+--------+         |                      |        +--------+ |
    ||        |         |                      |        |        | |
    || RMB    |         |                      |        | RMB    | |
    ||        |         |                      |        |        | |
    |+--------+         |                      |        +--------+ |
    |       /\   +------+                      +------+  /\        |
    |        |   |RNIC 3|      SMC-R Link 2    |RNIC 4|  |         |
    |RToken Z|   |MAC MC|<-------------------->|MAC MD|  |RToken W |
    |       QP 9 |GID GC|      (being added)   |GID GD| QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

                First SMC-R link setup as shown in Figure 8
            <-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.->

            ADD LINK request(QP9,MC,GC, link number = 2)
            ............................................>

            ADD LINK response(QP65,MD,GD, link number = 2)
            <............................................

            ADD LINK CONTINUATION request(RToken=Z)
            ............................................>

           ADD LINK CONTINUATION response(RToken=W)
            <............................................

         CONFIRM LINK(MC,GC,QP9, link number = 2, link user ID)
            .............................................>

      CONFIRM LINK response(MD,GD,QP65, link number = 2, link user ID)
            <.............................................

                          Legend:
                   ------------   TCP/IP and CLC flows
                   ............   RoCE (LLC) flows

                Figure 9: First Contact, Second Link Setup

3.5.1.6.3.  Exchange of RKeys on Second SMC-R Link

   Note that in the scenario described here -- first contact -- there is
   only one RMB RKey to exchange on the second SMC-R link, and it is
   exchanged in the ADD LINK CONTINUATION request and reply.  In
   scenarios other than first contact -- for example, adding a new SMC-R
   link to a longstanding link group with multiple RMBs -- additional
   flows will be required to exchange additional RMB RKeys.  See
   Section 3.5.5.2.3 ("Adding a New SMC-R Link to a Link Group with
   Multiple RMBs") for more details on these flows.

3.5.1.6.4.  Aborting SMC-R and Falling Back to IP

   Unless both partners provide the SMC-R TCP option during the
   three-way TCP handshake, the connection falls back to normal TCP/IP.
   During the SMC-R negotiation that occurs after the three-way TCP
   handshake, either partner may break off SMC-R by sending an SMC
   Decline CLC message.  The SMC Decline CLC message may be sent in
   place of any expected message and may also be sent during the CONFIRM
   LINK LLC exchange if there is a failure before any application data
   has flowed over the RoCE fabric.  For more details on exactly when an
   SMC Decline can flow during link group setup, see Appendices C.1
   ("SMC Decline during CLC Negotiation") and C.2 ("SMC Decline during
   LLC Negotiation").

   If this fallback to IP happens while setting up a new SMC-R link
   group, the RoCE resources allocated for this SMC-R link group
   relationship are torn down, and it will be retried as a new SMC-R
   link group next time a connection starts between these peers with
   SMC-R proposed.  Note that if this happens because one side doesn't
   support SMC-R, there will be very little to tear down, as the TCP
   option will have failed to flow on either the initial SYN or the
   SYN-ACK before either side had reserved any local RoCE resources.

3.5.2.  Subsequent Contact

   "Subsequent contact" means setting up a new TCP connection between
   two peers that already have an SMC-R link group between them and
   reusing the existing SMC-R link group.  In this case, it is not
   necessary to allocate new QPs.  However, it is possible that a new
   RMB has been allocated for this TCP connection, if the previous TCP
   connection used the last element available in the previously used
   RMB, or for any other implementation-dependent reason.  For this
   reason, and for convenience and error checking, the same TCP
   option 254, followed by the inline negotiation method described for
   initial contact, will be used for subsequent contact, but the
   processing differs in some ways.  That processing is described below.

3.5.2.1.  SMC-R Proposal

   When the client begins the inline negotiation with the server, it
   does not know if this is a first contact or a subsequent contact.
   The client cannot know this information until it sees the server's
   peer ID, to determine whether or not it already has an SMC-R link
   with this peer that it can use.  There are several reasons why it is
   not sufficient to use the partner IP address, subnet, VLAN, or other
   IP information to make this determination.  The most obvious reason
   is distributed systems: if the server IP address is actually a
   virtual IP address representing a distributed cluster, the actual
   host serving this TCP connection may not be the same as the host that
   served the last TCP connection to this same IP address.

   After the TCP three-way handshake, assuming that both partners
   indicate SMC-R capability, the client builds and sends the
   SMC Proposal CLC message to the server in exactly the same manner as
   it does in the "first contact" case, and in fact at this point
   doesn't know if it's a first contact or a subsequent contact.  As in
   the "first contact" case, the client sends its peer ID value,
   suggested RNIC MAC/GID, and IP subnet or prefix information.

   Upon receiving the client's proposal, the server looks up the
   provided peer ID to determine if it already has a usable SMC-R
   link group with this peer.  If it does already have a usable SMC-R
   link group, the server then needs to decide whether it will use the
   existing SMC-R link group or create a new link group.  For the case
   of the new link group, see Section 3.5.3 ("First Contact Variation:
   Creating a Parallel Link Group") below.

   For this discussion, assume that the server decides to use the
   existing SMC-R link group for the TCP connection, which is expected
   to be the most common case.  The server is responsible for making
   this decision.  The server then needs to communicate that information
   to the client, but it is not necessary to allocate, associate, and
   confirm QPs for the chosen SMC-R link.  All that remains to be done
   is to set up RMB space for this TCP connection.

   If one of the RMBs already in use for this SMC-R link group has an
   available element that uses the appropriate buffer size, the server
   merely chooses one for this TCP connection and then sends an SMC
   Accept CLC message providing the full RoCE information for the chosen
   SMC-R link to the client, using the same format as the SMC Accept CLC
   message described in Section 3.5.1 ("First Contact") above.

   The server may choose to use the SMC-R link that matches the
   suggested MAC/GID provided by the client in the SMC Proposal for its
   RDMA writes but is not obligated to do so.  The final decision on
   which specific SMC-R link to assign a TCP connection to is an
   independent server and client decision.

   It may be necessary for the server to allocate a new RMB for this
   connection.  The reasons for this are implementation dependent and
   could include the following:

   o  no available space in existing RMB or RMBs, or

   o  desire to allocate a new RMB that uses a different buffer size
      from the ones already created, or

   o  any other implementation-dependent reason

   In this case, the server will allocate the new RMB and then perform
   the flows described in Section 3.5.5.2.1 ("Adding a New RMB to an
   SMC-R Link Group").  Once that processing is complete, the server
   then provides the full RoCE information, including the new RKey, for
   this connection in an SMC Accept CLC message to the client.
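
   The server-side choice described in this section can be sketched in
   Python as follows.  This is an illustration only, not part of the
   protocol definition: the names Rmb, LinkGroup, choose_rmb_element,
   and handle_proposal are hypothetical, the default of four elements
   per RMB is arbitrary, and the sketch assumes that the server always
   reuses an existing link group when it finds one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Rmb:
    """Hypothetical model of an RMB split into equally sized elements."""
    buffer_size: int
    in_use: List[bool]

    def take_free_element(self) -> Optional[int]:
        """Mark the first free element as used and return its index."""
        for index, used in enumerate(self.in_use):
            if not used:
                self.in_use[index] = True
                return index
        return None


@dataclass
class LinkGroup:
    peer_id: str
    rmbs: List[Rmb] = field(default_factory=list)


def choose_rmb_element(group: LinkGroup, buffer_size: int,
                       elements_per_rmb: int = 4) -> Tuple[int, bool]:
    """Return (element index, whether a new RMB had to be allocated)."""
    for rmb in group.rmbs:
        if rmb.buffer_size == buffer_size:
            index = rmb.take_free_element()
            if index is not None:
                return index, False
    # No suitable element exists.  A real server would run the flows of
    # Section 3.5.5.2.1 here, before sending the SMC Accept CLC message.
    new_rmb = Rmb(buffer_size, [False] * elements_per_rmb)
    group.rmbs.append(new_rmb)
    new_rmb.in_use[0] = True
    return 0, True


def handle_proposal(groups: Dict[str, LinkGroup], peer_id: str,
                    buffer_size: int) -> Tuple[str, Optional[int], bool]:
    """Classify a proposal by peer ID and pick an RMB element if reusing."""
    group = groups.get(peer_id)
    if group is None:
        return "first contact", None, False
    index, allocated = choose_rmb_element(group, buffer_size)
    return "subsequent contact", index, allocated
```

   An unknown peer ID falls through to first contact processing;
   otherwise, the returned element index is what the server would
   report to the client in its SMC Accept CLC message.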

3.5.2.2.  SMC-R Acceptance

   Upon receiving the SMC Accept CLC message from the server, the client
   examines the RoCE information provided by the server to determine
   whether this is a first contact for a new SMC-R link group or a
   subsequent contact for an existing SMC-R link group.  It is a
   subsequent contact if the server-side peer ID, GID, MAC, and QP
   number provided in the packet match a known SMC-R link, and the first
   contact flag is not set.  If this is not the case -- for example, the
   GID and MAC match but the QP is new -- then the server is creating a
   new, parallel SMC-R link group, and this is treated as a first
   contact.
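
   The client's test can be written compactly.  In the following Python
   sketch, the LinkId tuple and the function name are hypothetical, and
   the known links are assumed to be kept in a set.

```python
from typing import NamedTuple, Set


class LinkId(NamedTuple):
    """Server-side values the client compares (hypothetical container)."""
    peer_id: str
    gid: str
    mac: str
    qp_number: int


def is_subsequent_contact(known_links: Set[LinkId], accept_link: LinkId,
                          first_contact_flag: bool) -> bool:
    """True only if the flag is clear and all four values match a link."""
    return (not first_contact_flag) and accept_link in known_links
```

   Any other outcome, such as a matching GID and MAC with an unknown QP
   number, is handled as a first contact.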

   A different RMB RToken does not indicate a first contact, as the
   server may have allocated a new RMB or may be using several RMBs for
   this SMC-R link.  The client needs the server's RMB information only
   for its RDMA writes to the server, and since there is no requirement
   for symmetric RMBs, this information is simply control information
   for the RDMA writes on this SMC-R link.

   The client must validate that the RMB element being provided by the
   server is not in use by another TCP connection on this SMC-R link
   group.  This validation must compare the new <rtoken, index> against
   all known <rtoken, index> pairs on this link group.  See
   Section 4.4.2 ("RMB Element Reuse and Conflict Resolution") for the
   case in which the server tries to use an RMB element that is already
   in use on this link group.

   Once the client has determined that this TCP connection is a
   subsequent contact over an existing SMC-R link, it performs an RMB
   allocation process similar to what the server did: it either
   (1) allocates an element from an RMB already associated with this
   SMC-R link or (2) allocates a new RMB, associates it with this SMC-R
   link, and then chooses an element out of it.

   If the client allocates a new RMB for this TCP connection, it
   performs the processing described in Section 3.5.5.2.1 ("Adding a New
   RMB to an SMC-R Link Group").  Once that processing is complete, the
   client provides its full RoCE information for this TCP connection in
   an SMC Confirm CLC message.

   Because an SMC-R link with a verified connected QP already exists and
   is being reused, there is no need for verification or alternate QP
   selection flows or timers.

3.5.2.3.  SMC-R Confirmation

   When the server receives the client's SMC Confirm CLC message on a
   subsequent contact, it verifies the following:

   o  The RMB element provided by the client is not already in use by
      another TCP connection on this SMC-R link group (see Section 4.4.2
      ("RMB Element Reuse and Conflict Resolution") for the case in
      which it is).

   o  The MAC/GID/QP information provided by the client matches an
      active link within the link group.  The client is free to select
      any valid/active link.  The client is not required to select the
      same link as the server.

   If this validation passes, the server stores the client's RMB
   information for this connection, and the RoCE setup of the TCP
   connection is complete.
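
   The two checks can be sketched in Python as shown below.  The Link
   tuple and validate_confirm are hypothetical names, and the RToken is
   modeled as an opaque integer purely for illustration.  What the
   server does when a check fails is outside the scope of this sketch.

```python
from typing import NamedTuple, Set, Tuple


class Link(NamedTuple):
    """Client-side MAC/GID/QP of one link (hypothetical container)."""
    mac: str
    gid: str
    qp_number: int


def validate_confirm(active_links: Set[Link],
                     elements_in_use: Set[Tuple[int, int]],
                     link: Link, rtoken: int, index: int) -> bool:
    """Check an SMC Confirm and, if it passes, record the RMB element."""
    if (rtoken, index) in elements_in_use:
        return False  # element conflict; see Section 4.4.2
    if link not in active_links:
        return False  # MAC/GID/QP does not match an active link
    elements_in_use.add((rtoken, index))
    return True
```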

3.5.2.4.  TCP Data Flow Race with SMC Confirm CLC Message

   On a subsequent contact TCP/IP connection, a peer may send data as
   soon as it has received the peer RMB information for the connection.
   There are no additional RoCE confirmation flows, since the QPs on the
   SMC-R link are already reliably connected and verified.

   In the majority of cases, the first data will flow from the client to
   the server.  The client must send the SMC Confirm CLC message before
   sending any connection data over the chosen SMC-R link; however, the
   client need not wait for confirmation of this message, and in fact
   there will be no such confirmation.  Since the server is required to
   have the RMB fully set up and ready to receive data from the client
   before sending an SMC Accept CLC message, the client can begin
   sending data over the SMC-R link immediately upon completing the send
   of the SMC Confirm CLC message.

   It is possible that data from the client will arrive at the
   server-side RMB before the SMC Confirm CLC message from the client
   has been processed.  In this case, the server must handle this race
   condition and not provide the arrived TCP data to the socket
   application until the SMC Confirm CLC message has been received and
   fully processed, opening the socket.

   If the server has initial data to send to the client that is not a
   response to the client (this case should be rare), it can send the
   data immediately upon receiving and processing the SMC Confirm CLC
   message from the client.  The client must have opened the TCP socket
   to the client application upon sending the SMC Confirm CLC message so
   the client will be ready to process data from the server.
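
   One way for a server to handle the race is to hold early data until
   the SMC Confirm CLC message has been processed.  The Python sketch
   below assumes a hypothetical ServerConnection class in which the
   "delivered" list stands in for the socket application.

```python
from typing import List


class ServerConnection:
    """Holds RMB data back until the SMC Confirm has been processed."""

    def __init__(self) -> None:
        self.confirm_processed = False
        self.pending: List[bytes] = []    # arrived before the Confirm
        self.delivered: List[bytes] = []  # visible to the application

    def on_rmb_data(self, data: bytes) -> None:
        if self.confirm_processed:
            self.delivered.append(data)
        else:
            self.pending.append(data)

    def on_confirm_processed(self) -> None:
        """Open the socket and release any early data, in arrival order."""
        self.confirm_processed = True
        self.delivered.extend(self.pending)
        self.pending.clear()
```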

3.5.3.  First Contact Variation: Creating a Parallel Link Group

   Recall that parallel SMC-R links within an SMC-R link group are not
   supported.  These are multiple SMC-R links within a link group that
   use the same network path.  However, multiple SMC-R link groups
   between the same peers are supported.  This means that if multiple
   SMC-R links over the same RoCE path are desired, it is necessary to
   use multiple SMC-R link groups.  While not a recommended practice,
   this could be done for platform-specific reasons, like QP separation
   of different workloads.  Only the server can drive the creation of
   multiple SMC-R link groups between peers.

   At a high level, when the server decides to create an additional
   SMC-R link group with a client with which it already has an SMC-R
   link group, the flows are basically the same as the normal
   "first contact" case described above.  The following text provides
   more detail and clarification of processing in this case.

   When the server receives the SMC Proposal CLC message from the client
   and, using the MAC/GID information, determines that it already has an
   SMC-R link group with this client, the server can either reuse the
   existing SMC-R link group (detailed in Section 3.5.2 ("Subsequent
   Contact") above) or create a new SMC-R link group in addition to the
   existing one.

   If the server decides to create a new SMC-R link group, it does the
   same processing it would have done for first contact: allocate QP and
   RMB resources as well as alternate QP resources, and communicate the
   QP and RMB information to the client in the SMC Accept CLC message
   with the first contact flag set.

   When the client receives the server's SMC Accept CLC message with the
   new QP information and the first contact flag set, it knows that the
   server is creating a new SMC-R link group even though it already has
   an SMC-R link group with the server.  In this case, the client will
   also allocate a new QP for this new SMC-R link, allocate an RMB for
   it, and generate an RKey for it.

   Note that multiple SMC-R link groups between the same peers must
   access different RMB resources, so new RMBs will be required.  Using
   the same RMBs that are in use in another SMC-R link group is not
   permitted.

   The client then associates its new QP with the server's new QP and
   sends its SMC Confirm CLC message back to the server providing the
   new QP/RMB information, and then sets its confirmation timer for the
   new SMC-R link.

   When the server receives the client's SMC Confirm CLC message, it
   associates its QP with the client's QP as learned from the SMC
   Confirm CLC message and sends a confirmation LLC message.  The rest
   of the flow, with the confirmation reply and setup of additional
   SMC-R links, unfolds just like the "first contact" case.

3.5.4.  Normal SMC-R Link Termination

   The normal socket API trigger points are used by the SMC-R layer to
   initiate SMC-R connection termination flows.  The main design point
   for normal SMC-R connection termination is to use the SMC-R protocol
   to first shut down the SMC-R connection and free up any SMC-R RDMA
   resources, and then allow the normal TCP connection termination
   protocol (i.e., FIN processing) to drive cleanup of the TCP
   connection that exists on the IP fabric.  This design point is very
   important in ensuring that RDMA resources such as the RMBEs are only
   freed and reused when both SMC-R endpoints are completely done with
   their RDMA write operations to the partner's RMBE.

   When the last TCP connection over an SMC-R link group terminates, the
   link group can be terminated.  Similar to creation of SMC-R links and
   link groups, the primary responsibility for determining that normal
   termination is needed and initiating it lies with the server.

   Implementations may opt to set timers to keep SMC-R link groups up
   for a specified time after the last TCP connection ends, to avoid
   churn in cases where TCP connections come and go regularly.
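
   Such a timer could be kept as sketched below.  The class name, the
   use of seconds, and the caller-supplied clock value are assumptions
   made for illustration; the protocol does not prescribe any of them.

```python
from typing import Optional


class LinkGroupLinger:
    """Tracks how long a link group has had no TCP connections."""

    def __init__(self, linger_seconds: float) -> None:
        self.linger_seconds = linger_seconds
        self.connections = 0
        self.idle_since: Optional[float] = None

    def connection_opened(self) -> None:
        self.connections += 1
        self.idle_since = None  # the group is in use again

    def connection_closed(self, now: float) -> None:
        self.connections -= 1
        if self.connections == 0:
            self.idle_since = now  # start the linger interval

    def should_terminate(self, now: float) -> bool:
        """True once the group has been idle for the full interval."""
        return (self.idle_since is not None
                and now - self.idle_since >= self.linger_seconds)
```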

   The link or link group may also be terminated as a result of a
   command initiated by the operator.  This command can be entered at
   either the client or the server.  If entered at the client, the
   client requests that the server perform link or link group
   termination, and the responsibility for doing so ultimately lies with
   the server.

   When the server determines that the SMC-R link group is to be
   terminated, it sends a DELETE LINK LLC message to the client, with a
   flag set indicating that all links in the link group are to be
   terminated.  After receiving confirmation from the adapter that the
   DELETE LINK LLC message has been sent, the server can clean up its
   end of the link group (QPs, RMBs, etc.).  Upon receipt of the DELETE
   LINK message from the server, the client must immediately comply and
   clean up its end of the link group.  Any TCP connections that the
   client believes to be active on the link group must be immediately
   terminated.

   The client can request that the server delete the link group as well.
   The client does this by sending a DELETE LINK message to the server,
   indicating that cleanup of all links is requested.  The server must
   comply by sending a DELETE LINK to the client and processing as
   described in the previous paragraph.  If there are TCP connections
   active on the link group when the server receives this request, they
   are immediately terminated by sending a RST flow over the IP fabric.

3.5.5.  Link Group Management Flows

3.5.5.1.  Adding and Deleting Links in an SMC-R Link Group

   The server has the lead role in managing the composition of the link
   group.  Links are added to the link group by the server.  The client
   may notify the server of new conditions that may result in the server
   adding a new link, but the server is ultimately responsible.  In
   general, links are deleted from the link group by the server;
   however, in certain error cases the client may inform the server that
   a link must be deleted and treat it as deleted without waiting for
   action from the server.  These flows are detailed in the sections
   that follow.

3.5.5.1.1.  Server-Initiated ADD LINK Processing

   As described in previous sections, the server initiates an ADD LINK
   exchange to create redundancy in a newly created link group.  Once a
   link group is established, the server may also initiate ADD LINK for
   other reasons, including:

   o  Availability of additional resources on the server host to support
      an additional SMC-R link.  This may include the provisioning of an
      additional RNIC, more storage becoming available to support
      additional QP resources, operator command, or any other
      implementation-dependent reason.  Note that in order to be
      available for an existing link group a new RNIC must be attached
      to the same RoCE LAN that the link group is using.

   o  Receipt of notification from the client that additional resources
      on the client are available to support an additional SMC-R link.
      See Section 3.5.5.1.2 ("Client-Initiated ADD LINK Processing").

   Server-initiated ADD LINK processing in an established SMC-R link
   group is the same as the ADD LINK processing described in
   Section 3.5.1.6 ("Second SMC-R Link Setup"), with the following
   changes:

   o  If an asymmetric SMC-R link already exists in the link group, a
      second asymmetric link will not be created.  Only one asymmetric
      link is permitted in a link group.

   o  TCP data flow on already-existing link(s) in the link group is not
      halted or otherwise affected during the process of setting up the
      additional link.

   The server will not initiate ADD LINK processing if the link group
   already has the maximum number of links negotiated by the partners.

3.5.5.1.2.  Client-Initiated ADD LINK Processing

   If an additional RNIC becomes available for an existing SMC-R link
   group on the client's side, the client notifies the server by sending
   an ADD LINK request LLC message to the server.  Unlike an ADD LINK
   request sent by the server to the client, this ADD LINK request
   merely informs the server that the client has a new RNIC.  If the
   link group lacks redundancy or has redundancy only on an asymmetric
   link with a single RNIC on the client side, the server must initiate
   an ADD LINK exchange in response to this message, to create or
   improve the link group's redundancy.

   If the link group already has symmetric-link redundancy but has fewer
   than the negotiated maximum number of links, the server may respond
   by initiating an ADD LINK exchange to create a new link using the
   client's new resource but is not required to do so.

   If the link group already has the negotiated maximum number of links,
   the server must ignore the client's ADD LINK request LLC message.

   Because the server is not required to respond to the client's
   ADD LINK LLC message in all cases, the client must not wait for a
   response or throw an error if one does not come.
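
   The server's three possible reactions can be summarized as a small
   decision function.  The Python sketch below uses hypothetical names
   throughout, and it assumes that the maximum-links test is evaluated
   first; this section does not state an order.

```python
from enum import Enum


class Redundancy(Enum):
    NONE = "no redundancy"
    ASYMMETRIC_CLIENT_SINGLE_RNIC = "asymmetric, one client RNIC"
    SYMMETRIC = "symmetric-link redundancy"


class Reaction(Enum):
    MUST_ADD_LINK = "must initiate ADD LINK"
    MAY_ADD_LINK = "may initiate ADD LINK"
    IGNORE = "must ignore the request"


def react_to_client_add_link(link_count: int, max_links: int,
                             redundancy: Redundancy) -> Reaction:
    """Map the link group's state to the server's obligation."""
    if link_count >= max_links:
        return Reaction.IGNORE
    if redundancy is Redundancy.SYMMETRIC:
        return Reaction.MAY_ADD_LINK
    return Reaction.MUST_ADD_LINK
```

   Because two of the three outcomes may produce no visible response,
   a client built against this logic cannot treat silence as an error.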

3.5.5.1.3.  Server-Initiated DELETE LINK Processing

   Reasons that a server may delete a link include the following:

   o  The link has not been used for TCP connections for an
      implementation-defined time interval, and deleting the link will
      not cause the link group to lack redundancy.

   o  Errors in resources supporting the link occur.  These errors may
      include, but are not limited to, RNIC errors, QP errors, and
      software errors.

   o  The RNIC supporting this SMC-R link is being taken down, either
      because of an error case or because of an operator or software
      command.

   If a link being deleted is supporting TCP connections and there are
   one or more surviving links in the link group, the TCP connections
   are moved to the surviving links.  For more information on this
   processing, see Section 2.3 ("SMC-R Resilience and Load Balancing").

   The server deletes a link from the link group by sending a
   DELETE LINK request LLC message to the client over any of the usable
   links in the link group.  Because the DELETE LINK LLC message
   specifies which link is to be deleted, it may flow over any link in
   the link group.  The server must not clean up its RoCE resources for
   the link until the client responds.

   The client responds to the server's DELETE LINK request LLC message
   by sending the server a DELETE LINK response LLC message.  The client
   must respond positively; it cannot decline to delete the link.  Once
   the server has received the client's DELETE LINK response, both sides
   may clean up their resources for the link.

   Either a positive write completion or some other indication from the
   RNIC on the client's side is sufficient to indicate to the client
   that the server has received the DELETE LINK response.

         Host X                                     Host Y
    +-------------------+                      +-------------------+
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
    |RToken X|   |Failed|<--X----X----X----X-->|      |            |
    |        |   |      |                      |      |            |
    |       \/   +------+                      +------+            |
    |+--------+         |                      |                   |
    || Deleted|         |                      |                   |
    || RMB    |         |                      |                   |
    ||        |         |                      |                   |
    |+--------+         |                      |                   |
    |       /\   +------+                      +------+            |
    |RToken Z|   |      |     SMC-R Link 2     |      |            |
    |        |   |RNIC 3|<-------------------->|RNIC 4|            |
    |       QP 64|      |                      |      | QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

          DELETE LINK(request, link number = 1,
                ................................................>
                       reason code = RNIC failure)

          DELETE LINK(response, link number = 1)
               <................................................

           (Note: Architecturally, this exchange can flow over either
                  SMC-R link but most likely flows over Link 2, since
                  the RNIC for Link 1 has failed.)

               Figure 10: Server-Initiated DELETE LINK Flow
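
   The ordering constraints of this flow can be sketched in Python as
   follows.  ServerLinkGroup and its fields are hypothetical, links are
   identified by link number, and connections are moved to the
   lowest-numbered surviving link only to keep the example small.

```python
from typing import Dict, Set


class ServerLinkGroup:
    """Server view of a link group during DELETE LINK processing."""

    def __init__(self, links: Set[int], conn_to_link: Dict[str, int]) -> None:
        self.links = set(links)
        self.conn_to_link = dict(conn_to_link)
        self.pending_delete: Set[int] = set()
        self.released: Set[int] = set()  # links whose resources are freed

    def begin_delete(self, link: int, reason: str) -> Dict[str, object]:
        """Move connections off the link and build the request message."""
        survivors = sorted(self.links - {link} - self.pending_delete)
        if survivors:
            for conn, used in list(self.conn_to_link.items()):
                if used == link:
                    self.conn_to_link[conn] = survivors[0]
        # With no surviving link, the connections cannot be moved; that
        # case is not modeled here.
        self.pending_delete.add(link)
        return {"type": "DELETE LINK", "kind": "request",
                "link": link, "reason": reason}

    def on_delete_response(self, link: int) -> None:
        """Free resources only after the client's response arrives."""
        if link in self.pending_delete:
            self.pending_delete.remove(link)
            self.links.remove(link)
            self.released.add(link)
```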

3.5.5.1.4.  Client-Initiated DELETE LINK Request

   The client may request that the server delete a link for the same
   reasons that the server may delete a link, except for inactivity
   timeout.

   Because the client depends on the server to delete links, there are
   two types of delete requests from client to server:

   o  Orderly: The client is requesting that the server delete the link
      when able.  This would result from an operator command to bring
      down the RNIC or some other nonfatal reason.  In this case, the
      server is required to delete the link but may not do it right
      away.

   o  Disorderly: The server must delete the link right away, because
      the client has experienced a fatal error with the link.

   In either case, the server responds by initiating a DELETE LINK
   exchange with the client, as described in the previous section.  The
   difference between the two is whether the server must do so
   immediately or can delay for an opportunity to gracefully delete the
   link.

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
     |RToken X|   |      |<---X--X--X--X--X--X->|Failed|            |
     |        |   |      |                      |      |            |
     |       \/   +------+                      +------+            |
     |+--------+         |                      |                   |
     || Deleted|         |                      |                   |
     || RMB    |         |                      |                   |
     ||        |         |                      |                   |
     |+--------+         |                      |                   |
     |       /\   +------+                      +------+            |
     |RToken Z|   |      |     SMC-R Link 2     |      |            |
     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
     |       QP 64|      |                      |      | QP 65      |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

           DELETE LINK(request, link number = 1, disorderly,
                <...............................................
                       reason code = RNIC failure)

           DELETE LINK(request, link number = 1,
                 ................................................>
                        reason code = RNIC failure)

           DELETE LINK(response, link number = 1)
                <................................................

           (Note: Architecturally, this exchange can flow over either
                  SMC-R link but most likely flows over Link 2, since
                  the RNIC for Link 1 has failed.)

               Figure 11: Client-Initiated DELETE LINK Flow
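
   The only difference between the two request types is timing, which
   the following Python sketch makes explicit.  DeleteScheduler is a
   hypothetical helper; "started" records the links for which the
   server has begun its own DELETE LINK exchange.

```python
from collections import deque
from typing import Deque, List


class DeleteScheduler:
    """Server-side handling of client DELETE LINK requests."""

    def __init__(self) -> None:
        self.deferred: Deque[int] = deque()  # orderly requests, waiting
        self.started: List[int] = []         # exchanges already begun

    def on_client_request(self, link: int, orderly: bool) -> None:
        if orderly:
            self.deferred.append(link)  # delete when able
        else:
            self.started.append(link)   # disorderly: delete right away

    def on_convenient_moment(self) -> None:
        """Start every deferred deletion; orderly requests are not optional."""
        while self.deferred:
            self.started.append(self.deferred.popleft())
```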

3.5.5.2.  Managing Multiple RKeys over Multiple SMC-R Links in a
          Link Group

   After the initial contact sequence completes and the number of TCP
   connections increases, it is possible that the SMC peers could add
   more RMBs to the link group.  Recall that each peer independently
   manages its RMBs.  Also recall that an RMB's RToken is specific to a
   QP, which means that when there are multiple SMC-R links in a link
   group, each RMB accessed with the link group requires a separate
   RToken for each SMC-R link in the group.

   Each RMB that is added to a link must be added to all links within
   the link group.  The set of RTokens created for an RMB, one for each
   link in the link group, is called the "RToken set".  The RTokens
   must be exchanged with the peer.  As RMBs are added and deleted, the
   RToken set must remain in sync.

3.5.5.2.1.  Adding a New RMB to an SMC-R Link Group

   A new RMB can be added to an SMC-R link group on either the client
   side or the server side.  When an additional RMB is added to an
   existing SMC-R link group, that RMB must be associated with the QPs
   for each link in the link group.  Therefore, when an RMB is added to
   an SMC-R link group, its RMB RToken for each SMC-R link's QP must be
   communicated to the peer.

   The tokens for a new RMB added to an existing SMC-R link group are
   communicated using CONFIRM RKEY LLC messages, as shown in Figure 12.
   The RToken set is specified as pairs: an SMC-R link number, paired
   with the new RMB's RToken over that SMC-R link.  To preserve failover
   capability, any TCP connection that uses a newly added RMB cannot go
   active until all RTokens for the RMB have been communicated for all
   of the links in the link group.

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
     |RToken X|   |      |<-------------------->|      |            |
     |        |   |      |                      |      |            |
     |       \/   +------+                      +------+            |
     |+--------+         |                      |                   |
     || New    |         |                      |                   |
     || RMB    |         |                      |                   |
     ||        |         |                      |                   |
     |+--------+         |                      |                   |
     |       /\   +------+                      +------+            |
     |RToken Z|   |      |     SMC-R Link 2     |      |            |
     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
     |       QP 64|      |                      |      | QP 65      |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

           CONFIRM RKEY(request, Add,
                 ................................................>
                      RToken set((Link 1,RToken X),(Link 2,RToken Z)))

           CONFIRM RKEY(response, Add,
                <................................................
                      RToken set((Link 1,RToken X),(Link 2,RToken Z)))

            (Note: This exchange can flow over either SMC-R link.)

                 Figure 12: Add RMB to Existing Link Group
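
   The RToken set and the rule about going active can be modeled in a
   few lines of Python.  The function names are hypothetical, and
   RTokens are represented as short strings matching the labels used
   in Figure 12.

```python
from typing import Dict, List, Set, Tuple


def build_rtoken_set(rtoken_by_link: Dict[int, str]) -> List[Tuple[int, str]]:
    """Return (link number, RToken) pairs for one RMB, ordered by link."""
    return sorted(rtoken_by_link.items())


def may_go_active(group_links: Set[int],
                  confirmed: List[Tuple[int, str]]) -> bool:
    """True once an RToken has been communicated for every link."""
    confirmed_links = {link for link, _ in confirmed}
    return set(group_links) <= confirmed_links
```

   For the link group in Figure 12, build_rtoken_set({1: "X", 2: "Z"})
   yields the two pairs carried in the CONFIRM RKEY request.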

   Implementations may choose to proactively add RMBs to link groups in
   anticipation of need.  For example, an implementation may add a new
   RMB when a certain usage threshold (e.g., percentage used) for all of
   its existing RMBs has been exceeded.

   A new RMB may also be added to an existing link group on an as-needed
   basis -- for example, when a new TCP connection is added to the link
   group but there are no available RMB elements.  In this case, the CLC
   exchange is paused while the peer that requires the new RMB adds it.
   An example of this is illustrated in Figure 13.

       Host X -- Server                            Host Y -- Client
    +-------------------+                      +--------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1    |
    |            +------+                      +------+             |
    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64      |
    |RToken X|   |MAC MA|<-------------------->|MAC MB|   |         |
    |        |   |GID GA|                      |GID GB|   |RToken Y2|
    |       \/   +------+                      +------+  \/         |
    |+--------+         |                      |        +--------+  |
    ||        |         |   Subnet S1          |        | New    |  |
    || RMB    |         |                      |        | RMB    |  |
    |+--------+         |                      |        +--------+  |
    |       /\   +------+                      +------+  /\         |
    |        |   |RNIC 3|    SMC-R Link 2      |RNIC 4|   |RToken W2|
    |        |   |MAC MC|<-------------------->|MAC MD|   |         |
    |       QP 9 |GID GC|                      |GID GD|  QP 65      |
    |            +------+                      +------+             |
    +-------------------+                      +--------------------+

           SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
        <--------------------------------------------------------->

                    SMC Proposal(PC1,MB,GB,S1)
        <--------------------------------------------------------

      SMC Accept(PS1,not 1st contact,MA,GA,QP8,RToken=X,RMB elem index)
        --------------------------------------------------------->

          CONFIRM RKEY(request, Add,
        <........................................................
                  RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))

          CONFIRM RKEY(response, Add,
         ........................................................>
                  RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))

          SMC Confirm(PC1,MB,GB,QP64,RToken=Y2, RMB element index)
        <--------------------------------------------------------

                         Legend:
                  ------------   TCP/IP and CLC flows
                  ............   RoCE (LLC) flows

          Figure 13: Client Adds RMB during TCP Connection Setup

3.5.5.2.2.  Deleting an RMB from an SMC-R Link Group

   Either peer can delete one or more of its RMBs as long as they are
   not being used for any TCP connections.  Ideally, an SMC-R peer would
   use a timer to avoid freeing an RMB immediately after the last TCP
   connection stops using it, to keep the RMB available for later TCP
   connections and avoid thrashing with addition and deletion of RMBs.
   Once an SMC-R peer decides to delete an RMB, it sends a DELETE RKEY
   LLC message to its peer.  It can then free the RMB once it receives
   a response from the peer.  Multiple RMBs can be deleted in a
   DELETE RKEY exchange.

   Note that in a DELETE RKEY message, it is not necessary to specify
   the full RToken for a deleted RMB.  The RMB's RKey over one link in
   the link group is sufficient to specify which RMB is being deleted.

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
     |RToken X|   |      |<-------------------->|      |            |
     |        |   |      |                      |      |            |
     |       \/   +------+                      +------+            |
     |+--------+         |                      |                   |
     || Deleted|         |                      |                   |
     || RMB    |         |                      |                   |
     ||        |         |                      |                   |
     |+--------+         |                      |                   |
     |       /\   +------+                      +------+            |
     |RToken Z|   |      |     SMC-R Link 2     |      |            |
     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
     |       QP 9 |      |                      |      |            |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

           DELETE RKEY(request, RKey list(RKey X))
                 ................................................>

           DELETE RKEY(response, RKey list(RKey X))
                <................................................

           (Note: This exchange can flow over either SMC-R link.)

                Figure 14: Delete RMB from SMC-R Link Group
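
   On the receiving side, identifying an RMB from a single RKey amounts
   to a table lookup.  The Python sketch below is one possible
   arrangement: RmbTable is hypothetical, and it assumes that the
   receiver knows which link each RKey in the list belongs to, so that
   entries are keyed by (link number, RKey).

```python
from typing import Dict, List, Tuple


class RmbTable:
    """Peer RMBs indexed by every (link number, RKey) that names them."""

    def __init__(self) -> None:
        self._rmb_by_key: Dict[Tuple[int, int], str] = {}

    def add_rmb(self, name: str, rkey_by_link: Dict[int, int]) -> None:
        for link, rkey in rkey_by_link.items():
            self._rmb_by_key[(link, rkey)] = name

    def delete_by_rkeys(self, link: int, rkeys: List[int]) -> List[str]:
        """Remove each named RMB, including its RKeys on other links."""
        deleted: List[str] = []
        for rkey in rkeys:
            name = self._rmb_by_key.get((link, rkey))
            if name is None:
                continue  # unknown or already deleted
            stale = [k for k, v in self._rmb_by_key.items() if v == name]
            for key in stale:
                del self._rmb_by_key[key]
            deleted.append(name)
        return deleted
```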

3.5.5.2.3.  Adding a New SMC-R Link to a Link Group with Multiple RMBs

   When a new SMC-R link is added to an existing link group, there could
   be multiple RMBs on each side already associated with the link group.
   There could also be a different number of RMBs on one side than on
   the other, because each peer manages its RMBs independently.  Each of
   these RMBs will require a new RToken to be used on the new SMC-R
   link, and those new RTokens must then be communicated to the peer.
   This requires two-way communication, as the server will have to
   communicate its RTokens to the client and vice versa.

   RTokens are communicated between peers in pairs.  Each RToken pair
   consists of:

   o  The RToken for the RMB, as is already known on an existing SMC-R
      link in the link group.

   o  The RToken for the same RMB, to be used on the new SMC-R link.

   These pairs are required to ensure that each peer knows which RTokens
   across QPs are equivalent.

   The ADD LINK request and response LLC messages do not have enough
   space to contain any RToken pairs.  ADD LINK CONTINUATION LLC
   messages are used to communicate these pairs, as shown in Figure 15.
   The ADD LINK CONTINUATION LLC messages are sent on the same SMC-R
   link that the ADD LINK LLC messages were sent over.  In each RToken
   pair, the first RToken is the RToken for the RMB as known on the
   SMC-R link over which the LLC message is being sent.

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64     |
    |RKey set|   |MAC MA|<-------------------->|MAC MB|   |RKey set|
    |X,Y,Z   |   |GID GA|                      |GID GB|   |Q,R,S,T |
    |       \/   +------+                      +------+  \/        |
    |+--------+         |                      |        +--------+ |
    || 3 RMBs |         |                      |        | 4 RMBs | |
    |+--------+         |                      |        +--------+ |
    |       /\   +------+                      +------+  /\        |
    |RKey set|   |RNIC 3|    SMC-R Link 2      |RNIC 4|  | RKey set|
    |U,V,W   |   |MAC MC|<-------------------->|MAC MD|  | L,M,N,P |
    |       QP 9 |GID GC|    (being added)     |GID GD| QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

            ADD LINK request (QP9,MC,GC, link number = 2)
            ............................................>

            ADD LINK response (QP65,MD,GD, link number = 2)
            <............................................

    ADD LINK CONTINUATION req(RToken pairs=((X,U),(Y,V),(Z,W)))
             ............................................>

    ADD LINK CONTINUATION rsp(RToken pairs=((Q,L),(R,M),(S,N),(T,P)))
             <.............................................

           CONFIRM LINK req/rsp exchange on Link 2
            <.............................................>


                          Legend:
                   ------------   TCP/IP and CLC flows
                   ............   RoCE (LLC) flows

   Figure 15: Exchanging RKeys when a New Link Is Added to a Link Group
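
   The pairing in Figure 15 can be sketched in Python as follows.  This
   is an illustration only, not part of the SMC-R specification.  The
   RMB identifiers ("rmb1", ...), the function names, and the
   pairs_per_msg parameter are all assumptions made for this example;
   the actual capacity of an ADD LINK CONTINUATION message is defined by
   the LLC message format, not by this sketch.

```python
from typing import Dict, List, Tuple

# (RToken on the existing link, RToken on the new link)
RTokenPair = Tuple[str, str]


def build_rtoken_pairs(existing: Dict[str, str],
                       new: Dict[str, str]) -> List[RTokenPair]:
    """Pair each RMB's RToken on the existing link with its new-link RToken.

    Both dicts are keyed by a local RMB identifier (hypothetical).
    """
    missing = set(existing) - set(new)
    if missing:
        raise ValueError(f"no new-link RToken for RMBs: {sorted(missing)}")
    # The first element is the RToken as known on the link that carries
    # the LLC message; the second is the RToken for the new link.
    return [(existing[rmb], new[rmb]) for rmb in existing]


def split_into_continuations(pairs: List[RTokenPair],
                             pairs_per_msg: int) -> List[List[RTokenPair]]:
    """Split the pairs into per-message batches (capacity is an assumption)."""
    if pairs_per_msg < 1:
        raise ValueError("pairs_per_msg must be at least 1")
    return [pairs[i:i + pairs_per_msg]
            for i in range(0, len(pairs), pairs_per_msg)]


# Values taken from Figure 15: the server has 3 RMBs, the client has 4.
server_pairs = build_rtoken_pairs(
    {"rmb1": "X", "rmb2": "Y", "rmb3": "Z"},   # RKeys on Link 1
    {"rmb1": "U", "rmb2": "V", "rmb3": "W"},   # RKeys on Link 2
)
client_pairs = build_rtoken_pairs(
    {"rmb1": "Q", "rmb2": "R", "rmb3": "S", "rmb4": "T"},
    {"rmb1": "L", "rmb2": "M", "rmb3": "N", "rmb4": "P"},
)
```

   Because each peer manages its RMBs independently, the two lists can
   differ in length, and each side may need a different number of
   continuation messages.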

3.5.5.3.  Serialization of LLC Exchanges, and Collisions

   LLC flows can be divided into two main groups for serialization
   considerations.

   The first group is LLC messages that are independent and can flow at
   any time.  These are one-time, unsolicited messages that either do
   not have a required response or have a simple response that does not
   interfere with the operations of another group of messages.  These
   messages are as follows:

   o  TEST LINK from either the client or the server: This message
      requires a TEST LINK response to be returned but does not affect
      the configuration of the link group or the RKeys.

   o  ADD LINK from the client to the server: This message is provided
      as an "FYI" to the server to let it know that the client has an
      additional RNIC available.  The server is not required to act upon
      or respond to this message.

   o  DELETE LINK from the client to the server: This message informs
      the server that either (1) the client has experienced an error or
      problem that requires a link or link group to be terminated or
      (2) an operator has commanded that a link or link group be
      terminated.  The server does not respond directly to the message;
      rather, it initiates a DELETE LINK exchange as a result of
      receiving it.

   o  DELETE LINK from the server to the client, with the "delete entire
      link group" flag set: This message informs the client that the
      entire link group is being deleted.

   The second group is LLC messages that are part of an exchange of LLC
   messages that affects link group configuration; this exchange must
   complete before another exchange of LLC messages that affects link
   group configuration can be processed.  When a peer knows that one of
   these exchanges is in progress, it must not start another exchange.
   These exchanges are as follows:

   o  ADD LINK / ADD LINK response / ADD LINK CONTINUATION / ADD LINK
      CONTINUATION response / CONFIRM LINK / CONFIRM LINK response: This
      exchange, by adding a new link, changes the configuration of the
      link group.

   o  DELETE LINK / DELETE LINK response initiated by the server,
      without the "delete entire link group" flag set: This exchange, by
      deleting a link, changes the configuration of the link group.

   o  CONFIRM RKEY / CONFIRM RKEY response or DELETE RKEY / DELETE RKEY
      response: This exchange changes the RMB configuration of the link
      group.  RKeys cannot change while links are being added or deleted
      (while an ADD LINK or DELETE LINK is in progress).  However,
      CONFIRM RKEY and DELETE RKEY are unique in that both the client
      and server can independently manage (add or remove) their own
      RMBs.  This allows each peer to concurrently change its RKeys
      and therefore concurrently send CONFIRM RKEY or DELETE RKEY
      requests.  The concurrent CONFIRM RKEY or DELETE RKEY requests can
      be independently processed and do not represent a collision.

   Because the server is in control of the configuration of the link
   group, many timing windows and collisions are avoided, but there are
   still some that must be handled.
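
   The two groups above can be summarized as a small classifier.  The
   following Python sketch is one reading of the rules in this section,
   not a normative definition.  The names Msg, is_independent, and
   may_start_exchange are hypothetical, and the treatment of concurrent
   RKey exchanges (a local RKey exchange is allowed only while the
   exchange in progress is an RKey exchange started by the peer) is an
   assumption drawn from the last bullet above.

```python
from enum import Enum, auto
from typing import Optional


class Msg(Enum):
    TEST_LINK = auto()
    ADD_LINK = auto()
    DELETE_LINK = auto()
    CONFIRM_RKEY = auto()
    DELETE_RKEY = auto()


RKEY_MSGS = {Msg.CONFIRM_RKEY, Msg.DELETE_RKEY}


def is_independent(msg: Msg, from_server: bool,
                   delete_group: bool = False) -> bool:
    """Return True if the message belongs to the first group."""
    if msg is Msg.TEST_LINK:
        return True                      # either peer, at any time
    if msg is Msg.ADD_LINK:
        return not from_server           # client's ADD LINK is an "FYI"
    if msg is Msg.DELETE_LINK:
        # Client's DELETE LINK is a notification; the server's is
        # independent only with "delete entire link group" set.
        return (not from_server) or delete_group
    return False                         # RKey messages start an exchange


def may_start_exchange(new: Msg, in_progress: Optional[Msg],
                       started_by_peer: bool = False) -> bool:
    """Return True if a configuration exchange may be started now."""
    if in_progress is None:
        return True
    # Each peer manages its own RMBs, so an RKey exchange started by
    # the peer does not block a local RKey exchange (assumption).
    return (new in RKEY_MSGS and in_progress in RKEY_MSGS
            and started_by_peer)
```

   For example, may_start_exchange(Msg.DELETE_LINK, Msg.ADD_LINK)
   returns False, because an ADD LINK exchange must complete before
   another configuration change begins.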

3.5.5.3.1.  Collisions with ADD LINK / CONFIRM LINK Exchange

   Colliding LLC message: TEST LINK

      Action to resolve: Send immediate TEST LINK response.

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Server ignores the client's ADD LINK message.
      When the client receives the server's ADD LINK, it treats that
      message as the response to its own ADD LINK, and the flow proceeds
      normally.  Because neither peer starts this exchange while an ADD
      LINK operation is already underway, this collision can occur only
      when the client's message crosses with the server's ADD LINK, in
      which case the server's ADD LINK arrives at the client immediately
      after the client sent its message.

   Colliding LLC message: DELETE LINK from client to server, specific
   link specified

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the ADD LINK exchange completes.  If it is an
      orderly link termination, it can wait until after this exchange
      completes.  If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from client to server, entire link
   group to be deleted

      Action to resolve: Immediately clean up the link group.

   Colliding LLC message: CONFIRM RKEY from client

      Action to resolve: Send a negative CONFIRM RKEY response to the
      client.  Once the current exchange finishes, client will have to
      recompute its RKey set to include the new link and then start a
      new CONFIRM RKEY exchange.
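
   The resolutions listed in this section can be sketched as a
   server-side handler that is active while an ADD LINK / CONFIRM LINK
   exchange is in progress.  This Python sketch is illustrative only:
   the class name, the string message labels, and the "entire_group"
   keyword are hypothetical and do not correspond to LLC wire-format
   fields.

```python
from collections import deque
from typing import Deque, List, Tuple


class ServerAddLinkExchange:
    """Server-side view of one in-progress ADD LINK exchange."""

    def __init__(self) -> None:
        self.deferred: Deque[Tuple[str, dict]] = deque()
        self.sent: List[str] = []     # replies the server would transmit
        self.group_alive = True

    def on_collision(self, msg: str, **info) -> str:
        """Apply the resolution for a message that arrives mid-exchange."""
        if msg == "TEST_LINK":
            self.sent.append("TEST_LINK_RESPONSE")
            return "replied"
        if msg == "ADD_LINK":
            # Client's ADD LINK crossed with the server's; ignore it.
            return "ignored"
        if msg == "DELETE_LINK":
            if info.get("entire_group"):
                # Entire link group is going away: clean up immediately.
                self.group_alive = False
                self.deferred.clear()
                return "cleaned_up"
            # Specific link: process after the exchange completes.
            self.deferred.append((msg, info))
            return "queued"
        if msg == "CONFIRM_RKEY":
            self.sent.append("CONFIRM_RKEY_RESPONSE(negative)")
            return "rejected"
        raise ValueError(f"unexpected LLC message: {msg}")

    def complete(self) -> List[Tuple[str, dict]]:
        """Exchange finished: return deferred messages in arrival order."""
        pending = list(self.deferred)
        self.deferred.clear()
        return pending


exchange = ServerAddLinkExchange()
results = [
    exchange.on_collision("TEST_LINK"),
    exchange.on_collision("ADD_LINK"),
    exchange.on_collision("DELETE_LINK", link=1),
    exchange.on_collision("CONFIRM_RKEY"),
]
pending = exchange.complete()
```

   Only the DELETE LINK for a specific link is carried past the end of
   the exchange; every other colliding message is resolved at once.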

3.5.5.3.2.  Collisions during DELETE LINK Exchange

   Colliding LLC message: TEST LINK from either peer

      Action to resolve: Send immediate TEST LINK response.

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Server queues the ADD LINK and processes it
      after the current exchange completes.

   Colliding LLC message: DELETE LINK from client to server (specific
   link)

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the current exchange completes.  If it is an
      orderly link termination, it can wait until after this exchange
      completes.  If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from either client or server,
   deleting the entire link group

      Action to resolve: Immediately clean up the link group.

   Colliding LLC message: CONFIRM RKEY from client to server

      Action to resolve: Send a negative CONFIRM RKEY response to the
      client.  Once the current exchange finishes, client will have to
      recompute its RKey set for the links that remain and then start a
      new CONFIRM RKEY exchange.

3.5.5.3.3.  Collisions during CONFIRM RKEY Exchange

   Colliding LLC message: TEST LINK

      Action to resolve: Send immediate TEST LINK response.

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Queue the ADD LINK, and process it after the
      current exchange completes.

   Colliding LLC message: ADD LINK from server to client (CONFIRM RKEY
   exchange was initiated by the client, and it crossed with the server
   initiating an ADD LINK exchange)

      Action to resolve: Process the ADD LINK.  Client will receive a
      negative CONFIRM RKEY from the server and will have to redo this
      CONFIRM RKEY exchange after the ADD LINK exchange completes.

   Colliding LLC message: DELETE LINK from client to server, specific
   link to be deleted (CONFIRM RKEY exchange was initiated by the
   server, and it crossed with the client's DELETE LINK request)

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the CONFIRM RKEY exchange completes.  If it is
      an orderly link termination, it can wait until after this exchange
      completes.  If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from server to client, specific
   link deleted (CONFIRM RKEY exchange was initiated by the client, and
   it crossed with the server's DELETE LINK)

      Action to resolve: Process the DELETE LINK.  Client will receive a
      negative CONFIRM RKEY from the server and will have to redo this
      CONFIRM RKEY exchange after the DELETE LINK exchange completes.

   Colliding LLC message: DELETE LINK from either client or server,
   entire link group deleted

      Action to resolve: Immediately clean up the link group.

   Colliding LLC message: CONFIRM RKEY from the peer that did not start
   the current CONFIRM RKEY exchange

      Action to resolve: Queue the request, and process it after the
      current exchange completes.
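
   Several of the resolutions above leave the client with a negative
   CONFIRM RKEY response and require it to redo the exchange once the
   colliding exchange has completed.  The Python sketch below shows one
   way to structure that retry.  All names (confirm_rkey_with_retry,
   FakeServer, the callback parameters, and max_attempts) are
   assumptions made for illustration; the protocol text does not define
   a retry limit.

```python
from typing import Callable, Dict, Iterable, List


def confirm_rkey_with_retry(send: Callable[[Dict[int, str]], bool],
                            links_now: Callable[[], Iterable[int]],
                            rkey_for: Callable[[int], str],
                            wait_until_idle: Callable[[], None],
                            max_attempts: int = 3) -> Dict[int, str]:
    """Send CONFIRM RKEY, redoing it after each negative response."""
    for _ in range(max_attempts):
        # Recompute the RKey set over the links that exist right now.
        rkeys = {link: rkey_for(link) for link in links_now()}
        if send(rkeys):               # True models a positive response
            return rkeys
        wait_until_idle()             # let the colliding exchange finish
    raise RuntimeError("CONFIRM RKEY was rejected on every attempt")


class FakeServer:
    """Stand-in peer that is busy with an ADD LINK exchange at first."""

    def __init__(self) -> None:
        self.links: List[int] = [1]
        self.busy = True
        self.seen: List[Dict[int, str]] = []

    def confirm_rkey(self, rkeys: Dict[int, str]) -> bool:
        self.seen.append(dict(rkeys))
        return not self.busy          # negative while an exchange runs

    def finish_add_link(self) -> None:
        self.links.append(2)          # link 2 joins the link group
        self.busy = False


server = FakeServer()
confirmed = confirm_rkey_with_retry(
    send=server.confirm_rkey,
    links_now=lambda: list(server.links),
    rkey_for=lambda link: f"rkey-link{link}",
    wait_until_idle=server.finish_add_link,
)
```

   The second attempt carries an RKey for the newly added link as well,
   which is why the RKey set is recomputed inside the loop rather than
   once up front.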
