
RFC 7609

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Pages: 143
Informational
Part 3 of 6 – Pages 26 to 59


3. SMC-R Rendezvous Architecture

"Rendezvous" is the process that SMC-R-capable peers use to dynamically discover each other's capabilities, negotiate SMC-R connections, set up SMC-R links and link groups, and manage those link groups. A key aspect of SMC-R Rendezvous is that it occurs dynamically and automatically, without requiring SMC-R link configuration to be defined by an administrator.

SMC-R Rendezvous starts with the TCP/IP three-way handshake, during which connection peers use TCP options to announce their SMC-R capabilities. If both endpoints are SMC-R capable, then Connection Layer Control (CLC) messages are exchanged between the peers' SMC-R layers over the newly established TCP connection to negotiate SMC-R credentials. The CLC message mechanism is analogous to the messages exchanged by SSL for its handshake processing.

If a new SMC-R link is being set up, Link Layer Control (LLC) messages are used to confirm RDMA connectivity. LLC messages are also used by the SMC-R layers at each peer to manage the links and link groups.

Once an SMC-R link is set up or agreed to by the peers, the TCP sockets are passed to the peer applications, which use them as normal. The SMC-R layer, which resides under the sockets layer, transmits the socket data between peers over RDMA using the SMC-R protocol, bypassing the TCP/IP stack.

3.1. TCP Options

During the TCP/IP three-way handshake, the client and server indicate their support for SMC-R by including experimental TCP option 254 on the three-way handshake flows, in accordance with [RFC6994] ("Shared Use of Experimental TCP Options"). The Experiment Identifier (ExID) value used is the string "SMCR" in EBCDIC (IBM-1047) encoding (0xE2D4C3D9). This ExID has been registered in the "TCP Experimental Option Experiment Identifiers (TCP ExIDs)" registry maintained by IANA.
   After completion of the three-way TCP handshake, each peer queries
   its peer's options.  If both peers set the TCP option on the
   three-way handshake, inline SMC-R negotiation occurs using CLC
   messages.  If neither peer, or only one peer, sets the TCP option,
   SMC-R cannot be used for the TCP connection, and the TCP connection
   completes the setup using the IP fabric.
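The option exchange above can be sketched in a few lines. This is a minimal illustration, not an implementation of any real TCP stack: the function names are hypothetical, and the layout (kind, length, then a 4-byte ExID) is assumed from the shared experimental option format of [RFC6994]. The option kind 254 and the ExID value 0xE2D4C3D9 are taken from the text.

```python
import struct

TCP_OPT_EXPERIMENTAL = 254               # experimental option kind named in the text
SMCR_EXID = bytes.fromhex("E2D4C3D9")    # "SMCR" in EBCDIC, value given in the text


def build_smcr_option() -> bytes:
    """Return kind, length, and the 4-byte ExID (assumed RFC 6994 layout)."""
    length = 2 + len(SMCR_EXID)          # kind byte + length byte + ExID
    return struct.pack("!BB", TCP_OPT_EXPERIMENTAL, length) + SMCR_EXID


def has_smcr_option(options: bytes) -> bool:
    """Scan a raw TCP options field for the SMC-R experimental option."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                    # End of Option List
            break
        if kind == 1:                    # No-Operation is a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                        # truncated option, stop scanning
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                        # malformed length, stop scanning
        value = options[i + 2:i + length]
        if kind == TCP_OPT_EXPERIMENTAL and value[:4] == SMCR_EXID:
            return True
        i += length
    return False


def smcr_negotiation_allowed(client_opts: bytes, server_opts: bytes) -> bool:
    """CLC negotiation proceeds only if both peers sent the option."""
    return has_smcr_option(client_opts) and has_smcr_option(server_opts)
```

If either side's options lack the SMC-R ExID, `smcr_negotiation_allowed` returns False and the connection stays on the IP fabric, matching the rule stated above.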

3.2. Connection Layer Control (CLC) Messages

CLC messages are sent as data payload over the IP network using the TCP connection between SMC-R layers at the peers. They are analogous to the messages used to exchange parameters for SSL.

The use of CLC messages is detailed in the following sections. The following list provides a summary of the defined CLC messages and their purposes:

   o  SMC Proposal: Sent from the client to propose that this TCP
      connection is eligible to be moved to SMC-R.  The client
      identifies itself and its subnet to the server and passes the
      SMC-R elements for a suggested RoCE path via the MAC and GID.

   o  SMC Accept: Sent from the server to accept the client's TCP
      connection SMC Proposal.  The server responds to the client's
      proposal by identifying itself to the client and passing the
      elements of a RoCE path that the client can use to perform RDMA
      writes to the server.  This consists of such SMC-R link elements
      as RoCE MAC, GID, and RMB information.

   o  SMC Confirm: Sent from the client to confirm the server's
      acceptance of the SMC connection.  The client responds to the
      server's acceptance by passing the elements of a RoCE path that
      the server can use to perform RDMA writes to the client.  This
      consists of such SMC-R link elements as RoCE MAC, GID, and RMB
      information.

   o  SMC Decline: Sent from either the server or the client to reject
      the SMC connection, indicating the reason the peer must decline
      the SMC Proposal and allowing the TCP connection to revert back
      to IP connectivity.

3.3. LLC Messages

Link Layer Control (LLC) messages are sent between peer SMC-R layers over an SMC-R link to manage the link or the link group. LLC messages are sent using RoCE SendMsg and are 44 bytes long. The 44-byte size is based on what can fit into a RoCE Work Queue Element (WQE) without requiring the posting of receive buffers.
   LLC messages generally follow a request-reply semantic.  Each message
   has a request flavor and a reply flavor, and each request must be
   confirmed with a reply, except where otherwise noted.  The use of LLC
   messages is detailed in the following sections.  The following list
   provides a summary of the defined LLC messages and their purposes:

   o  ADD LINK: Used to add a new link to a link group.  Sent from the
      server to the client to initiate addition of a new link to the
      link group, or from the client to the server to request that the
      server initiate addition of a new link.

   o  ADD LINK CONTINUATION: A continuation of ADD LINK that allows the
      ADD LINK to span multiple commands, because all of the link
      information cannot be contained in a single ADD LINK message.

   o  CONFIRM LINK: Used to confirm that RoCE connectivity over a newly
      created SMC-R link is working correctly.  Initiated by the server.
      Both this message and its reply must flow over the SMC-R link
      being confirmed.

   o  DELETE LINK: When initiated by the server, deletes a specific link
      from the link group or deletes the entire link group.  When
      initiated by the client, requests that the server delete a
      specific link or the entire link group.

   o  CONFIRM RKEY: Informs the peer on the SMC-R link of the addition
      of an RMB to the link group.

   o  CONFIRM RKEY CONTINUATION: A continuation of CONFIRM RKEY that
      allows the CONFIRM RKEY to span multiple commands, in the event
      that all of the information cannot be contained in a single
      CONFIRM RKEY message.

   o  DELETE RKEY: Informs the peer on the SMC-R link of the deletion of
      one or more RMBs from the link group.

   o  TEST LINK: Verifies that an already-active SMC-R link is active
      and healthy.

   o  Optional LLC message: Any LLC message in which the two high-order
      bits of the opcode are b'10'.  This optional message must be
      silently discarded by a receiving peer that does not support the
      opcode.  No such messages are defined in this version of the
      architecture; however, the concept is defined to allow for
      toleration of possible advanced, optional functions.
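The rule for optional LLC messages can be illustrated with a small dispatcher. This is a sketch under stated assumptions: the opcode is treated as a single byte, the `handlers` table and the function names are hypothetical, and raising an error for an unsupported non-optional opcode is an illustrative choice that the text above does not prescribe.

```python
from typing import Callable, Dict, Optional


def is_optional_llc(opcode: int) -> bool:
    """True when the two high-order bits of a 1-byte opcode are b'10'."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in one byte")
    return (opcode >> 6) == 0b10


def dispatch_llc(opcode: int,
                 handlers: Dict[int, Callable[[], str]]) -> Optional[str]:
    """Run the handler for a supported opcode; drop unsupported optional ones."""
    handler = handlers.get(opcode)
    if handler is not None:
        return handler()
    if is_optional_llc(opcode):
        return None                      # silently discarded, no reply generated
    # Assumption: what to do here is outside the text above.
    raise ValueError(f"unsupported LLC opcode 0x{opcode:02X}")
```

Opcodes 0x80 through 0xBF fall in the optional range under this reading; everything else is treated as a message the receiver is expected to understand.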
   CONFIRM LINK and TEST LINK are sensitive to which link they flow on
   and must flow on the link being confirmed or tested.  The other flows
   may flow over any active link in the link group.  When there are
   multiple links in a link group, a response to an LLC message must
   flow over the same link that the original message flowed over, with
   the following exceptions:

   o  ADD LINK request from a server in response to an ADD LINK from a
      client.

   o  DELETE LINK request from a server in response to a DELETE LINK
      from a client.

3.4. CDC Messages

Connection Data Control (CDC) messages are sent over the RoCE fabric between peers using RoCE SendMsg and are 44 bytes long. The 44-byte size is based on the size that can fit into a RoCE WQE without requiring the posting of receive buffers. CDC messages are used to describe the socket application data passed via RDMA write operations, as well as TCP connection state information, including producer cursors and consumer cursors, RMBE state information, and failover data validation.

3.5. Rendezvous Flows

Rendezvous information for SMC-R is exchanged as TCP options on the TCP three-way handshake flows to indicate capability, followed by inline TCP negotiation messages that perform the SMC-R setup. Formats of all rendezvous options and messages discussed in this section are detailed in Appendix A.

3.5.1. First Contact

First contact between RoCE peers occurs when a new SMC-R link group is being set up. This could be because no SMC-R links already exist between the peers, or the server decides to create a new SMC-R link group in parallel with an existing one.

3.5.1.1. Pre-negotiation of TCP Options

The client and server indicate their SMC-R capability to each other using TCP option 254 on the TCP three-way handshake flows. A client that wishes to use SMC-R will include TCP option 254 using an ExID equal to the EBCDIC (codepage IBM-1047) encoding of "SMCR" on its SYN flow.

   A server that supports SMC-R will include TCP option 254 with the
   ExID value of EBCDIC "SMCR" on its SYN-ACK flow.  Because the server
   is listening for connections and does not know where client
   connections will come from, the server implementation may choose to
   unconditionally include this TCP option if it supports SMC-R.  This
   may be required for server implementations where extensions to the
   TCP stack are not practical.  For server implementations that can add
   code to examine and react to packets during the three-way handshake,
   the server should only include the SMC-R TCP option on the SYN-ACK if
   the client included it on its SYN packet.

   A client who supports SMC-R and meets the three conditions outlined
   above may optionally include the TCP option for SMC-R on its ACK
   flow, regardless of whether or not the server included it on its
   SYN-ACK flow.  Some TCP/IP stacks may have to include it if the SMC-R
   layer cannot modify the options on the socket until the three-way
   handshake completes.  Proprietary servers should not include this
   option on the ACK flow, since including it on the SYN flow was
   sufficient to indicate the client's capabilities.

   Once the initial three-way TCP handshake is completed, each peer
   examines the socket options.  SMC-R implementations may do this by
   examining what was actually provided on the SYN and SYN-ACK packets
   or by performing a getsockopt() operation to determine the options
   sent by the peer.  If neither peer, or only one peer, specified the
   TCP option for SMC-R, then SMC-R cannot be used on this connection
   and it proceeds using normal IP flows and processing.

   If both peers specified the TCP option for SMC-R, then the TCP
   connection is not started yet and the peers proceed to SMC-R
   negotiation using inline data flows.  The socket is not yet turned
   over to the applications; instead, the respective SMC layers exchange
   CLC messages over the newly formed TCP connection.

3.5.1.2. Client Proposal

If SMC-R is supported by both peers, the client sends an SMC Proposal CLC message to the server. It is not immediately apparent on this flow from client to server whether this is a new or existing SMC-R link, because in clustered environments a single IP address may represent multiple hosts. This type of cluster virtual IP address can be owned by a network-based or host-based Layer 4 load balancer that distributes incoming TCP connections across a cluster of servers/hosts. For purposes of high availability, other clustered environments may also support the movement of a virtual IP address dynamically from one host in the cluster to another. In summary, the client cannot predetermine that a connection is targeting the same host by simply matching the destination IP address for outgoing TCP connections. Therefore, it cannot predetermine the SMC-R link that will be used for a new TCP connection. This information will be dynamically learned, and the appropriate actions will be taken as the SMC-R negotiation handshake unfolds.

   In the SMC-R proposal message, the initiator (client) proposes the
   use of SMC-R by including its peer ID, GID, and MAC addresses, as
   well as the IP subnet number of the outgoing interface (if IPv4) or
   the IP prefix list for the network over which the proposal is sent
   (if IPv6).  At this point in the flow, the client makes no local
   commitments of resources for SMC-R.

   When the server receives the SMC Proposal CLC message, it uses the
   peer ID provided by the client, plus subnet or prefix information
   provided by the client, to determine if it already has a usable SMC-R
   link with this SMC-R peer.  If there are one or more existing SMC-R
   links with this SMC-R peer, the server then decides which SMC-R link
   it will use for this TCP connection.  See Sections 3.5.2 and 3.5.3
   for the cases of reusing an existing SMC-R link or creating a
   parallel SMC-R link group between SMC-R peers.

   If this is a first contact between SMC-R peers, the server must
   validate that it is on the same LAN as the client before continuing.
   For IPv4, the server does this by verifying that it has an interface
   with an IP subnet number that matches the subnet number sent by the
   client in the SMC Proposal.  For IPv6, it does this by verifying that
   it is directly attached to at least one IP prefix that was listed by
   the client in its SMC Proposal message.
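The same-LAN check described above can be sketched with Python's standard `ipaddress` module. The function names and the way interfaces and prefixes are passed in (as strings) are assumptions made for illustration; a real stack would read them from its interface tables.

```python
import ipaddress
from typing import Iterable


def same_lan_ipv4(server_interfaces: Iterable[str], client_subnet: str) -> bool:
    """True if any server interface sits in the subnet the client proposed.

    server_interfaces: addresses with prefix length, e.g. "192.0.2.10/24".
    client_subnet: the subnet from the SMC Proposal, e.g. "192.0.2.0/24".
    """
    client_net = ipaddress.ip_network(client_subnet)
    return any(
        ipaddress.ip_interface(iface).network == client_net
        for iface in server_interfaces
    )


def same_lan_ipv6(server_prefixes: Iterable[str],
                  client_prefixes: Iterable[str]) -> bool:
    """True if the server is attached to at least one prefix the client listed."""
    attached = {ipaddress.ip_network(p) for p in server_prefixes}
    return any(ipaddress.ip_network(p) in attached for p in client_prefixes)
```

For example, a server with interface `192.0.2.10/24` passes the IPv4 check for a proposal carrying subnet `192.0.2.0/24` and fails it for `203.0.113.0/24`.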

   If the server agrees to use SMC-R, the server begins the setup of a
   new SMC-R link by allocating local QP and RMB resources (setting its
   QP state to INIT) and providing its full SMC-R information in an SMC
   Accept CLC message to the client over the TCP connection, along with
   a flag set indicating that this is a first contact flow.  While the
   SMC Accept message could flow over any IP route back to the client
   depending upon Layer 3 IP routing, the SMC-R credentials provided
   must be for the common subnet or prefix between the server and
   client, as determined above.  If the server cannot or does not want
   to do SMC-R with the client, it sends an SMC Decline CLC message to
   the client, and the connection data may begin flowing using normal
   TCP/IP flows.

3.5.1.3. Server Acceptance

When the client receives the SMC Accept from the server, it determines whether this is a new or existing SMC-R link, using the combination of the following: the first contact flag, its MAC/GID and the MAC/GID returned by the server, the VLAN over which the connection is setting up, and the QP number provided by the server.

If it is an existing SMC-R link and the client agrees to use that link for the TCP connection, see Section 3.5.2 ("Subsequent Contact") below. If it is a new SMC-R link between peers that already have an SMC-R link, then the server is starting a new SMC-R link group.

Assuming that either (1) this is a first contact between peers or (2) the server is starting a new SMC-R link group, the client now allocates local QP and RMB resources for the SMC-R link (setting the QP state to RTR (ready to receive)), associates them with the server QP as learned from the SMC Accept CLC message, and sends an SMC Confirm CLC message to the server over the TCP connection with its SMC-R link information included. The client also starts a timer to wait for the server to confirm the reliably connected queue pair, as described below.
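The client's new-versus-existing decision can be sketched as a lookup against the links it already knows. The `LinkKey` type, its field names, and the idea of keeping known links in a set are hypothetical; the fields themselves are the ones the text lists (MAC/GID of both sides, VLAN, server QP number), plus the server peer ID that Section 3.5.2.2 also uses.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class LinkKey:
    """Hypothetical identity of an SMC-R link as seen by the client."""
    server_peer_id: str
    server_mac: str
    server_gid: str
    server_qp: int
    client_mac: str
    client_gid: str
    vlan: int


def classify_accept(first_contact_flag: bool,
                    key: LinkKey,
                    known_links: Set[LinkKey]) -> str:
    """Classify an SMC Accept as reusing a link or starting a new link group."""
    if not first_contact_flag and key in known_links:
        return "subsequent contact"
    # Either a true first contact or the server starting a new link group;
    # both are handled with the first contact flows.
    return "first contact"
```

A new QP number with otherwise matching MAC/GID values yields "first contact" here, which corresponds to the server starting a new link group between peers that already share one.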

3.5.1.4. Client Confirmation

Upon receipt of the client's SMC Confirm CLC message, the server associates its QP for this SMC-R link with the client's QP as learned from the SMC Confirm CLC message and sets its QP state to RTS (ready to send). The client and the server now have reliably connected queue pairs.

3.5.1.5. Link (QP) Confirmation

Since setting up the SMC-R link and its QPs did not require any network flows on the RoCE fabric, the client and server must now confirm connectivity over the RoCE fabric. To accomplish this, the server will send a CONFIRM LINK Link Layer Control (LLC) message to the client over the newly created SMC-R link, using the RoCE fabric. The CONFIRM LINK LLC message will provide the server's MAC, GID, and QP information for the connection, allow each partner to communicate the maximum number of links it can tolerate in this link group (the "link limit"), and will additionally provide two link IDs:

   o  a 1-byte server-assigned link number that is used by both peers to
      identify the link within the link group and is only unique within
      a link group.

   o  a 4-byte link user ID.  This opaque value is assigned by the
      server for the server's local use and is provided to the client
      for management purposes -- for example, to use in network
      management displays and products.

   When the server sends this message, it will set a timer for receiving
   confirmation from the client.

   When the client receives the server's confirmation in the form of a
   CONFIRM LINK LLC message, it will cancel the confirmation timer it
   set when it sent the SMC Confirm message.  The client will also
   advance its QP state to RTS and respond over the RoCE fabric with a
   CONFIRM LINK response LLC message that (1) provides its MAC, GID,
   QP number, and link limit, (2) confirms the 1-byte link number sent
   by the server, and (3) provides its own 4-byte link user ID to the
   server.
       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|                      |RNIC 2|  QP 64     |
    |RToken X|   |MAC MA|                      |MAC MB|   |        |
    |        |   |GID GA|                      |GID GB|   |RToken Y|
    |       \/   +------+      (Subnet S1)     +------+  \/        |
    |+--------+         |                      |        +--------+ |
    || RMB    |         |                      |        | RMB    | |
    |+--------+         |                      |        +--------+ |
    |            +------+                      +------+            |
    |            |RNIC 3|                      |RNIC 4|            |
    |            |MAC MC|                      |MAC MD|            |
    |            |GID GC|                      |GID GD|            |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

                     SYN TCP options(254,"SMCR")
        <---------------------------------------------------------

                     SYN-ACK TCP options(254,"SMCR")
        --------------------------------------------------------->

                     ACK [TCP options(254,"SMCR")]
        <--------------------------------------------------------

                    SMC Proposal(PC1,MB,GB,S1)
        <--------------------------------------------------------

    SMC Accept(PS1,first contact,MA,GA,MTU,QP8,RToken=X,RMB elem index)
        --------------------------------------------------------->

         SMC Confirm(PC1,MB,GB,MTU,QP64,RToken=Y,RMB element index)
         <--------------------------------------------------------

       CONFIRM LINK(MA,GA,QP8, link lim, server link user ID, linknum)
        .........................................................>

    CONFIRM LINK rsp(MB,GB,QP64, link lim, client link user ID, linknum)
        <........................................................

                           Legend:
                    ------------   TCP/IP and CLC flows
                    ............   RoCE (LLC) flows
           Square brackets ("[ ]") indicate optional information

                 Figure 8: First Contact Rendezvous Flows
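The QP state changes that accompany the flows in Figure 8 can be traced with a short table-driven sketch. The state names INIT, RTR, and RTS and the events that trigger them come from the text of Sections 3.5.1.2 through 3.5.1.5; the `UNALLOCATED` placeholder, the event labels, and the function name are assumptions used only for illustration.

```python
from enum import Enum
from typing import List, Tuple


class QPState(Enum):
    UNALLOCATED = "unallocated"   # assumption: placeholder for "no QP yet"
    INIT = "INIT"
    RTR = "RTR"                   # ready to receive
    RTS = "RTS"                   # ready to send


# (event, peer whose QP changes, new state), in the order the text gives.
FIRST_CONTACT_STEPS = [
    ("server sends SMC Accept", "server", QPState.INIT),
    ("client sends SMC Confirm", "client", QPState.RTR),
    ("server receives SMC Confirm", "server", QPState.RTS),
    ("client receives CONFIRM LINK", "client", QPState.RTS),
]


def trace_first_contact() -> List[Tuple[str, QPState, QPState]]:
    """Return (event, server state, client state) after each step."""
    states = {"server": QPState.UNALLOCATED, "client": QPState.UNALLOCATED}
    history = []
    for event, peer, new_state in FIRST_CONTACT_STEPS:
        states[peer] = new_state
        history.append((event, states["server"], states["client"]))
    return history
```

Printing the result of `trace_first_contact()` shows that the server reaches RTS before the client does, which is why the server is the one to send the first CONFIRM LINK over the RoCE fabric.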
   Technically, the data for the TCP connection could now flow over the
   RoCE path.  However, if this is a first contact, there is no
   alternate for this recently established RoCE path.  Since in the
   current architecture there is no failover from RoCE to IP once
   connection data starts flowing, this means that a failure of this
   path would disrupt the TCP connection, meaning that the level of
   redundancy and failover is less than that provided by IP.  If the
   network has alternate RoCE paths available, they would not be usable
   at this point.  This situation would be unacceptable.

3.5.1.6. Second SMC-R Link Setup

Because of the unacceptable situation described above, TCP data will not be allowed to flow on the newly established SMC-R link until a second path has been set up, or at least attempted.

If the server has a second RNIC available on the same LAN, it attempts to set up the second SMC-R link over that second RNIC. If it only has one RNIC available on the LAN, it will attempt to set up the second SMC-R link over that one RNIC. In the latter case, the server is attempting to set up an asymmetric link, in case the client does have a second RNIC on the LAN.

In either case, the server allocates a new QP over the RNIC it is attempting to use for the second link and assigns a link number to the new link; the server also creates an RToken for the RMB over this second QP (note that this means that the first and second QP each have their own RToken to represent the same RMB). The server provides this information, as well as the MAC and GID of the RNIC over which it is attempting to set up the second link, in an ADD LINK LLC message that it sends to the client over the SMC-R link that is already set up.

3.5.1.6.1. Client Processing of ADD LINK LLC Message from Server

When the client receives the server's ADD LINK LLC message, it examines the GID and MAC provided by the server to determine whether the server is attempting to use the same server-side RNIC as the existing SMC-R link or a different one.

If the server is attempting to use the same server-side RNIC as the existing SMC-R link, then the client verifies that it has a second RNIC on the same LAN. If it does not, the client rejects the ADD LINK request from the server, because the resulting link would be a parallel link, which is not supported within a link group. If the client does have a second RNIC on the same LAN, it accepts the request, and an asymmetric link will be set up.

   If the server is using a different server-side RNIC from the existing
   SMC-R link, then the client will accept the request and a second
   SMC-R link will be set up in this SMC-R link group.  If the client
   has a second RNIC on the same LAN, that second RNIC will be used for
   the second SMC-R link, creating symmetric links.  If the client does
   not have a second RNIC on the same LAN, it will use the same RNIC as
   was used for the initial SMC-R link, resulting in the setup of an
   asymmetric link in the SMC-R link group.

   In either case, when the client accepts the server's ADD LINK
   request, it allocates a new QP on the chosen RNIC and creates an RKey
   over that new QP for the client-side RMB for the SMC-R link group,
   then sends an ADD LINK reply LLC message to the server providing that
   information as well as echoing the link number that was sent by the
   server.

   If the client rejects the server's ADD LINK request, it sends an ADD
   LINK reply LLC message to the server with the reason code for the
   rejection.
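The client's handling of an ADD LINK request reduces to a two-input decision. The sketch below encodes the four cases from this section; the `AddLinkDecision` type, its field names, and the string labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AddLinkDecision:
    accept: bool
    link_kind: Optional[str]     # "symmetric" or "asymmetric"; None if rejected
    client_rnic: Optional[str]   # "second" or "existing"; None if rejected


def decide_add_link(server_uses_existing_rnic: bool,
                    client_has_second_rnic: bool) -> AddLinkDecision:
    """Decide how the client answers the server's ADD LINK request."""
    if server_uses_existing_rnic:
        if not client_has_second_rnic:
            # Same RNIC on both sides would be a parallel link: reject.
            return AddLinkDecision(False, None, None)
        return AddLinkDecision(True, "asymmetric", "second")
    if client_has_second_rnic:
        return AddLinkDecision(True, "symmetric", "second")
    # Server offers a new RNIC, client reuses its only one.
    return AddLinkDecision(True, "asymmetric", "existing")
```

Only one of the four combinations leads to a rejection; in the other three the client allocates a new QP on the chosen RNIC and replies positively.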

3.5.1.6.2. Server Processing of ADD LINK Reply LLC Message from Client

If the client sends a negative response to the server or no reply is received, the server frees the RoCE resources it had allocated for the new link. Having a single link in an SMC-R link group is undesirable. The server's recovery is detailed in Appendix C.8 ("Failure to Add Second SMC-R Link to a Link Group").

If the client sends a positive reply to the server with MAC/GID/QP/RKey information, the server associates its QP for the new SMC-R link to the QP that the client provided. Now, the new SMC-R link is in the same situation that the first was in after the client sent its ACK packet -- there is a reliably connected queue pair over the new RoCE path, but there have been no RoCE flows to confirm that it's actually usable. So, at this point, the client and server will exchange CONFIRM LINK LLC messages just like they did on the first SMC-R link.

If either peer receives a failure during this second CONFIRM LINK LLC exchange (either an immediate failure -- which implies that the message did not reach the partner -- or a timeout), it sends a DELETE LINK LLC message to the partner over the first (and now only) link in the link group. This DELETE LINK LLC message must be acknowledged before data can flow on the single link in the link group.

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|      SMC-R Link 1    |RNIC 2|  QP 64     |
    |RToken X|   |MAC MA|<-------------------->|MAC MB|   |        |
    |        |   |GID GA|                      |GID GB|   |RToken Y|
    |       \/   +------+                      +------+  \/        |
    |+--------+         |                      |        +--------+ |
    ||        |         |                      |        |        | |
    || RMB    |         |                      |        | RMB    | |
    ||        |         |                      |        |        | |
    |+--------+         |                      |        +--------+ |
    |       /\   +------+                      +------+  /\        |
    |        |   |RNIC 3|      SMC-R Link 2    |RNIC 4|  |         |
    |RToken Z|   |MAC MC|<-------------------->|MAC MD|  |RToken W |
    |       QP 9 |GID GC|      (being added)   |GID GD| QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

                First SMC-R link setup as shown in Figure 8
            <-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.->

            ADD LINK request(QP9,MC,GC, link number = 2)
            ............................................>

            ADD LINK response(QP65,MD,GD, link number = 2)
            <............................................

            ADD LINK CONTINUATION request(RToken=Z)
            ............................................>

           ADD LINK CONTINUATION response(RToken=W)
            <............................................

         CONFIRM LINK(MC,GC,QP9, link number = 2, link user ID)
            .............................................>

      CONFIRM LINK response(MD,GD,QP65, link number = 2, link user ID)
            <.............................................

                          Legend:
                   ------------   TCP/IP and CLC flows
                   ............   RoCE (LLC) flows

                Figure 9: First Contact, Second Link Setup

3.5.1.6.3. Exchange of RKeys on Second SMC-R Link

Note that in the scenario described here -- first contact -- there is only one RMB RKey to exchange on the second SMC-R link, and it is exchanged in the ADD LINK CONTINUATION request and reply. In scenarios other than first contact -- for example, adding a new SMC-R link to a longstanding link group with multiple RMBs -- additional flows will be required to exchange additional RMB RKeys. See Section 3.5.5.2.3 ("Adding a New SMC-R Link to a Link Group with Multiple RMBs") for more details on these flows.

3.5.1.6.4. Aborting SMC-R and Falling Back to IP

Unless both partners provide the SMC-R TCP option during the three-way TCP handshake, the connection falls back to normal TCP/IP.

During the SMC-R negotiation that occurs after the three-way TCP handshake, either partner may break off SMC-R by sending an SMC Decline CLC message. The SMC Decline CLC message may be sent in place of any expected message and may also be sent during the CONFIRM LINK LLC exchange if there is a failure before any application data has flowed over the RoCE fabric. For more details on exactly when an SMC Decline can flow during link group setup, see Appendices C.1 ("SMC Decline during CLC Negotiation") and C.2 ("SMC Decline during LLC Negotiation").

If this fallback to IP happens while setting up a new SMC-R link group, the RoCE resources allocated for this SMC-R link group relationship are torn down, and it will be retried as a new SMC-R link group next time a connection starts between these peers with SMC-R proposed. Note that if this happens because one side doesn't support SMC-R, there will be very little to tear down, as the TCP option will have failed to flow on either the initial SYN or the SYN-ACK before either side had reserved any local RoCE resources.

3.5.2. Subsequent Contact

"Subsequent contact" means setting up a new TCP connection between two peers that already have an SMC-R link group between them and reusing the existing SMC-R link group. In this case, it is not necessary to allocate new QPs. However, it is possible that a new RMB has been allocated for this TCP connection, if the previous TCP connection used the last element available in the previously used RMB, or for any other implementation-dependent reason. For this reason, and for convenience and error checking, the same TCP option 254, followed by the inline negotiation method described for initial contact, will be used for subsequent contact, but the processing differs in some ways. That processing is described below.
Top   ToC   RFC7609 - Page 39
3.5.2.1. SMC-R Proposal
When the client begins the inline negotiation with the server, it does not know if this is a first contact or a subsequent contact. The client cannot know this information until it sees the server's peer ID, to determine whether or not it already has an SMC-R link with this peer that it can use. There are several reasons why it is not sufficient to use the partner IP address, subnet, VLAN, or other IP information to make this determination. The most obvious reason is distributed systems: if the server IP address is actually a virtual IP address representing a distributed cluster, the actual host serving this TCP connection may not be the same as the host that served the last TCP connection to this same IP address. After the TCP three-way handshake, assuming that both partners indicate SMC-R capability, the client builds and sends the SMC Proposal CLC message to the server in exactly the same manner as it does in the "first contact" case, and in fact at this point doesn't know if it's a first contact or a subsequent contact. As in the "first contact" case, the client sends its peer ID value, suggested RNIC MAC/GID, and IP subnet or prefix information. Upon receiving the client's proposal, the server looks up the provided peer ID to determine if it already has a usable SMC-R link group with this peer. If it does already have a usable SMC-R link group, the server then needs to decide whether it will use the existing SMC-R link group or create a new link group. For the case of the new link group, see Section 3.5.3 ("First Contact Variation: Creating a Parallel Link Group") below. For this discussion, assume that the server decides to use the existing SMC-R link group for the TCP connection, which is expected to be the most common case. The server is responsible for making this decision. The server then needs to communicate that information to the client, but it is not necessary to allocate, associate, and confirm QPs for the chosen SMC-R link. 
All that remains to be done is to set up RMB space for this TCP connection. If one of the RMBs already in use for this SMC-R link group has an available element that uses the appropriate buffer size, the server merely chooses one for this TCP connection and then sends an SMC Accept CLC message providing the full RoCE information for the chosen SMC-R link to the client, using the same format as the SMC Accept CLC message described in Section 3.5.1 ("First Contact") above.
   The server may choose to use the SMC-R link that matches the
   suggested MAC/GID provided by the client in the SMC Proposal for its
   RDMA writes but is not obligated to do so.  The final decision on
   which specific SMC-R link to assign a TCP connection to is an
   independent server and client decision.

   It may be necessary for the server to allocate a new RMB for this
   connection.  The reasons for this are implementation dependent and
   could include the following:

   o  no available space in existing RMB or RMBs, or

   o  desire to allocate a new RMB that uses a different buffer size
      from the ones already created, or

   o  any other implementation-dependent reason

   In this case, the server will allocate the new RMB and then perform
   the flows described in Section 3.5.5.2.1 ("Adding a New RMB to an
   SMC-R Link Group").  Once that processing is complete, the server
   then provides the full RoCE information, including the new RKey, for
   this connection in an SMC Accept CLC message to the client.
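
   The server's RMB selection on a subsequent contact can be sketched
   as follows.  This is a minimal illustration, not a definitive
   implementation: the names Rmb, LinkGroup, and choose_rmb_element,
   as well as the four-element size of a newly allocated RMB, are
   assumptions made for the example and do not appear in this
   document.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rmb:
    rkey: int            # hypothetical identifier for this RMB
    element_size: int    # buffer size of each element, in bytes
    in_use: list         # one bool per RMB element

    def free_index(self) -> Optional[int]:
        """Return the index of the first unused element, or None."""
        for i, used in enumerate(self.in_use):
            if not used:
                return i
        return None


@dataclass
class LinkGroup:
    peer_id: str
    rmbs: list = field(default_factory=list)


def choose_rmb_element(group, wanted_size, next_rkey):
    """Return (rmb, element_index, newly_allocated) for a new TCP connection."""
    # Prefer an available element in an RMB already used by this link group.
    for rmb in group.rmbs:
        if rmb.element_size == wanted_size:
            idx = rmb.free_index()
            if idx is not None:
                rmb.in_use[idx] = True
                return rmb, idx, False
    # Otherwise allocate a new RMB (element count of 4 is an assumption).
    # A real server would now run the "Adding a New RMB" flows before
    # sending its SMC Accept CLC message.
    rmb = Rmb(rkey=next_rkey, element_size=wanted_size,
              in_use=[True, False, False, False])
    group.rmbs.append(rmb)
    return rmb, 0, True
```

   The third element of the result tells the caller whether the
   CONFIRM RKEY exchange is needed before the SMC Accept CLC message
   can be sent.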

3.5.2.2. SMC-R Acceptance
Upon receiving the SMC Accept CLC message from the server, the client examines the RoCE information provided by the server to determine whether this is a first contact for a new SMC-R link group or a subsequent contact for an existing SMC-R link group. It is a subsequent contact if the server-side peer ID, GID, MAC, and QP number provided in the packet match a known SMC-R link, and the first contact flag is not set. If this is not the case -- for example, the GID and MAC match but the QP is new -- then the server is creating a new, parallel SMC-R link group, and this is treated as a first contact. A different RMB RToken does not indicate a first contact, as the server may have allocated a new RMB or may be using several RMBs for this SMC-R link. The client needs the server's RMB information only for its RDMA writes to the server, and since there is no requirement for symmetric RMBs, this information is simply control information for the RDMA writes on this SMC-R link. The client must validate that the RMB element being provided by the server is not in use by another TCP connection on this SMC-R link group. This validation must validate the new <rtoken, index> across
   all known <rtoken, index> on this link group.  See Section 4.4.2
   ("RMB Element Reuse and Conflict Resolution") for the case in which
   the server tries to use an RMB element that is already in use on this
   link group.
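
   The client's two checks on an SMC Accept CLC message (first versus
   subsequent contact, and RMB element conflict) can be sketched as
   below.  LinkId, classify_accept, and rmb_element_conflicts are
   hypothetical names chosen for this illustration, and the string
   values stand in for real GIDs, MACs, and RTokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkId:
    """Server-side identity of an SMC-R link as seen by the client."""
    peer_id: str
    gid: str
    mac: str
    qp: int


def classify_accept(known_links, accept_link, first_contact_flag):
    """Return "subsequent" only if the link is known and the flag is not set."""
    if not first_contact_flag and accept_link in known_links:
        return "subsequent"
    # Any mismatch (for example, a new QP) is treated as a first contact.
    return "first"


def rmb_element_conflicts(in_use, rtoken, index):
    """True if <rtoken, index> is already used by a connection on the group."""
    return (rtoken, index) in in_use
```

   Note that the RMB RToken is deliberately not part of LinkId, since a
   different RToken does not by itself indicate a first contact.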

   Once the client has determined that this TCP connection is a
   subsequent contact over an existing SMC-R link, it performs an RMB
   allocation process similar to what the server did: it either
   (1) allocates an element from an RMB already associated with this
   SMC-R link or (2) allocates a new RMB, associates it with this SMC-R
   link, and then chooses an element out of it.

   If the client allocates a new RMB for this TCP connection, it
   performs the processing described in Section 3.5.5.2.1 ("Adding a New
   RMB to an SMC-R Link Group").  Once that processing is complete, the
   client provides its full RoCE information for this TCP connection in
   an SMC Confirm CLC message.

   Because an SMC-R link with a verified connected QP already exists and
   is being reused, there is no need for verification or alternate QP
   selection flows or timers.

3.5.2.3. SMC-R Confirmation
When the server receives the client's SMC Confirm CLC message on a subsequent contact, it verifies the following:

   o  The RMB element provided by the client is not already in use by
      another TCP connection on this SMC-R link group (see
      Section 4.4.2 ("RMB Element Reuse and Conflict Resolution") for
      the case in which it is).

   o  The MAC/GID/QP information provided by the client matches an
      active link within the link group.  The client is free to select
      any valid/active link.  The client is not required to select the
      same link as the server.

If this validation passes, the server stores the client's RMB information for this connection, and the RoCE setup of the TCP connection is complete.

3.5.2.4. TCP Data Flow Race with SMC Confirm CLC Message
On a subsequent contact TCP/IP connection, a peer may send data as soon as it has received the peer RMB information for the connection. There are no additional RoCE confirmation flows, since the QPs on the SMC-R link are already reliably connected and verified.
   In the majority of cases, the first data will flow from the client to
   the server.  The client must send the SMC Confirm CLC message before
   sending any connection data over the chosen SMC-R link; however, the
   client need not wait for confirmation of this message, and in fact
   there will be no such confirmation.  Since the server is required to
   have the RMB fully set up and ready to receive data from the client
   before sending an SMC Accept CLC message, the client can begin
   sending data over the SMC-R link immediately upon completing the send
   of the SMC Confirm CLC message.

   It is possible that data from the client will arrive at the
   server-side RMB before the SMC Confirm CLC message from the client
   has been processed.  In this case, the server must handle this race
   condition and not provide the arrived TCP data to the socket
   application until the SMC Confirm CLC message has been received and
   fully processed, opening the socket.

   If the server has initial data to send to the client that is not a
   response to the client (this case should be rare), it can send the
   data immediately upon receiving and processing the SMC Confirm CLC
   message from the client.  The client must have opened the TCP socket
   to the client application upon sending the SMC Confirm CLC message so
   the client will be ready to process data from the server.
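
   One way for the server to handle the race described above is to
   hold early data until the SMC Confirm CLC message has been
   processed.  The following is a sketch only; ServerConnection and
   its method names are hypothetical, and a real implementation would
   track RMB cursors rather than copy data into lists.

```python
class ServerConnection:
    """Holds data that arrives in the RMB before SMC Confirm is processed."""

    def __init__(self):
        self.confirm_processed = False
        self.pending = []    # data seen before SMC Confirm was processed
        self.delivered = []  # data made visible to the socket application

    def on_rmb_data(self, data: bytes):
        # Data may arrive over RDMA before the CLC message is handled.
        if self.confirm_processed:
            self.delivered.append(data)
        else:
            self.pending.append(data)

    def on_smc_confirm(self):
        # Processing SMC Confirm opens the socket; release held data in order.
        self.confirm_processed = True
        self.delivered.extend(self.pending)
        self.pending.clear()
```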

3.5.3. First Contact Variation: Creating a Parallel Link Group

Recall that parallel SMC-R links within an SMC-R link group are not supported. These are multiple SMC-R links within a link group that use the same network path. However, multiple SMC-R link groups between the same peers are supported. This means that if multiple SMC-R links over the same RoCE path are desired, it is necessary to use multiple SMC-R link groups. While not a recommended practice, this could be done for platform-specific reasons, like QP separation of different workloads. Only the server can drive the creation of multiple SMC-R link groups between peers. At a high level, when the server decides to create an additional SMC-R link group with a client with which it already has an SMC-R link group, the flows are basically the same as the normal "first contact" case described above. The following text provides more detail and clarification of processing in this case. When the server receives the SMC Proposal CLC message from the client and, using the MAC/GID information, determines that it already has an SMC-R link group with this client, the server can either reuse the existing SMC-R link group (detailed in Section 3.5.2 ("Subsequent Contact") above) or create a new SMC-R link group in addition to the existing one.
   If the server decides to create a new SMC-R link group, it does the
   same processing it would have done for first contact: allocate QP and
   RMB resources as well as alternate QP resources, and communicate the
   QP and RMB information to the client in the SMC Accept CLC message
   with the first contact flag set.

   When the client receives the server's SMC Accept CLC message with the
   new QP information and the first contact flag set, it knows that the
   server is creating a new SMC-R link group even though it already has
   an SMC-R link group with the server.  In this case, the client will
   also allocate a new QP for this new SMC-R link, allocate an RMB for
   it, and generate an RKey for it.

   Note that multiple SMC-R link groups between the same peers must
   access different RMB resources, so new RMBs will be required.  Using
   the same RMBs that are in use in another SMC-R link group is not
   permitted.

   The client then associates its new QP with the server's new QP and
   sends its SMC Confirm CLC message back to the server providing the
   new QP/RMB information, and then sets its confirmation timer for the
   new SMC-R link.

   When the server receives the client's SMC Confirm CLC message, it
   associates its QP with the client's QP as learned from the SMC
   Confirm CLC message and sends a confirmation LLC message.  The rest
   of the flow, with the confirmation QP and setup of additional SMC-R
   links, unfolds just like the "first contact" case.

3.5.4. Normal SMC-R Link Termination

The normal socket API trigger points are used by the SMC-R layer to initiate SMC-R connection termination flows. The main design point for SMC-R normal connection flows is to use the SMC-R protocol to first shut down the SMC-R connection and free up any SMC-R RDMA resources, and then allow the normal TCP connection termination protocol (i.e., FIN processing) to drive cleanup of the TCP connection that exists on the IP fabric. This design point is very important in ensuring that RDMA resources such as the RMBEs are only freed and reused when both SMC-R endpoints are completely done with their RDMA write operations to the partner's RMBE. When the last TCP connection over an SMC-R link group terminates, the link group can be terminated. Similar to creation of SMC-R links and link groups, the primary responsibility for determining that normal termination is needed and initiating it lies with the server.
   Implementations may opt to set timers to keep SMC-R link groups up
   for a specified time after the last TCP connection ends, to avoid
   churn in cases where TCP connections come and go regularly.

   The link or link group may also be terminated as a result of a
   command initiated by the operator.  This command can be entered at
   either the client or the server.  If entered at the client, the
   client requests that the server perform link or link group
   termination, and the responsibility for doing so ultimately lies with
   the server.

   When the server determines that the SMC-R link group is to be
   terminated, it sends a DELETE LINK LLC message to the client, with a
   flag set indicating that all links in the link group are to be
   terminated.  After receiving confirmation from the adapter that the
   DELETE LINK LLC message has been sent, the server can clean up its
   end of the link group (QPs, RMBs, etc.).  Upon receipt of the DELETE
   LINK message from the server, the client must immediately comply and
   clean up its end of the link group.  Any TCP connections that the
   client believes to be active on the link group must be immediately
   terminated.

   The client can request that the server delete the link group as well.
   The client does this by sending a DELETE LINK message to the server,
   indicating that cleanup of all links is requested.  The server must
   comply by sending a DELETE LINK to the client and processing as
   described in the previous paragraph.  If there are TCP connections
   active on the link group when the server receives this request, they
   are immediately terminated by sending a RST flow over the IP fabric.

3.5.5. Link Group Management Flows

3.5.5.1. Adding and Deleting Links in an SMC-R Link Group
The server has the lead role in managing the composition of the link group. Links are added to the link group by the server. The client may notify the server of new conditions that may result in the server adding a new link, but the server is ultimately responsible. In general, links are deleted from the link group by the server; however, in certain error cases the client may inform the server that a link must be deleted and treat it as deleted without waiting for action from the server. These flows are detailed in the sections that follow.
3.5.5.1.1. Server-Initiated ADD LINK Processing
As described in previous sections, the server initiates an ADD LINK exchange to create redundancy in a newly created link group. Once a link group is established, the server may also initiate ADD LINK for other reasons, including:

   o  Availability of additional resources on the server host to
      support an additional SMC-R link.  This may include the
      provisioning of an additional RNIC, more storage becoming
      available to support additional QP resources, operator command,
      or any other implementation-dependent reason.  Note that in order
      to be available for an existing link group a new RNIC must be
      attached to the same RoCE LAN that the link group is using.

   o  Receipt of notification from the client that additional resources
      on the client are available to support an additional SMC-R link.
      See Section 3.5.5.1.2 ("Client-Initiated ADD LINK Processing").

Server-initiated ADD LINK processing in an established SMC-R link group is the same as the ADD LINK processing described in Section 3.5.1.6 ("Second SMC-R Link Setup"), with the following changes:

   o  If an asymmetric SMC-R link already exists in the link group, a
      second asymmetric link will not be created.  Only one asymmetric
      link is permitted in a link group.

   o  TCP data flow on already-existing link(s) in the link group is
      not halted or otherwise affected during the process of setting up
      the additional link.

The server will not initiate ADD LINK processing if the link group already has the maximum number of links negotiated by the partners.

3.5.5.1.2. Client-Initiated ADD LINK Processing
If an additional RNIC becomes available for an existing SMC-R link group on the client's side, the client notifies the server by sending an ADD LINK request LLC message to the server. Unlike an ADD LINK request sent by the server to the client, this ADD LINK request merely informs the server that the client has a new RNIC. If the link group lacks redundancy or has redundancy only on an asymmetric link with a single RNIC on the client side, the server must initiate an ADD LINK exchange in response to this message, to create or improve the link group's redundancy.
   If the link group already has symmetric-link redundancy but has fewer
   than the negotiated maximum number of links, the server may respond
   by initiating an ADD LINK exchange to create a new link using the
   client's new resource but is not required to do so.

   If the link group already has the negotiated maximum number of links,
   the server must ignore the client's ADD LINK request LLC message.

   Because the server is not required to respond to the client's
   ADD LINK LLC message in all cases, the client must not wait for a
   response or throw an error if one does not come.
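
   The three server outcomes described in this section can be
   summarized in a small decision function.  This is a sketch under
   stated assumptions: the function name and the string results are
   hypothetical, and checking the negotiated maximum first is a choice
   made for this example; the text above does not say which rule
   applies when a group is both at the maximum and lacking symmetric
   redundancy.

```python
def server_reaction_to_client_add_link(link_count, max_links,
                                       symmetric_redundancy):
    """Return "ignore", "must_add", or "may_add" for a client ADD LINK."""
    # Assumption: the negotiated maximum is checked before redundancy.
    if link_count >= max_links:
        return "ignore"
    # No redundancy, or redundancy only via an asymmetric link.
    if not symmetric_redundancy:
        return "must_add"
    # Symmetric redundancy exists and there is room for another link.
    return "may_add"
```

   Because "ignore" and "may_add" can both result in no reply, the
   client side of this exchange carries no response timer.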

3.5.5.1.3. Server-Initiated DELETE LINK Processing
Reasons that a server may delete a link include the following:

   o  The link has not been used for TCP connections for an
      implementation-defined time interval, and deleting the link will
      not cause the link group to lack redundancy.

   o  Errors in resources supporting the link occur.  These errors may
      include, but are not limited to, RNIC errors, QP errors, and
      software errors.

   o  The RNIC supporting this SMC-R link is being taken down, either
      because of an error case or because of an operator or software
      command.

If a link being deleted is supporting TCP connections and there are one or more surviving links in the link group, the TCP connections are moved to the surviving links. For more information on this processing, see Section 2.3 ("SMC-R Resilience and Load Balancing").

The server deletes a link from the link group by sending a DELETE LINK request LLC message to the client over any of the usable links in the link group. Because the DELETE LINK LLC message specifies which link is to be deleted, it may flow over any link in the link group. The server must not clean up its RoCE resources for the link until the client responds.

The client responds to the server's DELETE LINK request LLC message by sending the server a DELETE LINK response LLC message. The client must respond positively; it cannot decline to delete the link. Once the server has received the client's DELETE LINK response, both sides may clean up their resources for the link.

   Either a positive write completion or some other indication from the
   RNIC on the client's side is sufficient to indicate to the client
   that the server has received the DELETE LINK response.

         Host X                                     Host Y
    +-------------------+                      +-------------------+
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
    |RToken X|   |Failed|<--X----X----X----X-->|      |            |
    |        |   |      |                      |      |            |
    |       \/   +------+                      +------+            |
    |+--------+         |                      |                   |
    || Deleted|         |                      |                   |
    || RMB    |         |                      |                   |
    ||        |         |                      |                   |
    |+--------+         |                      |                   |
    |       /\   +------+                      +------+            |
    |RToken Z|   |      |     SMC-R Link 2     |      |            |
    |        |   |RNIC 3|<-------------------->|RNIC 4|            |
    |       QP 64|      |                      |      | QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

          DELETE LINK(request, link number = 1,
                ................................................>
                       reason code = RNIC failure)

          DELETE LINK(response, link number = 1)
               <................................................

           (Note: Architecturally, this exchange can flow over either
                  SMC-R link but most likely flows over Link 2, since
                  the RNIC for Link 1 has failed.)

               Figure 10: Server-Initiated DELETE LINK Flow
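
   The ordering constraints in the flow above (move connections, send
   the request, clean up only after the response) can be sketched as
   follows.  ServerLinkGroup and its methods are hypothetical names,
   and moving connections to the least-loaded surviving link is an
   assumption of this example rather than a rule stated here.

```python
class ServerLinkGroup:
    """Tracks TCP connections per link during a DELETE LINK exchange."""

    def __init__(self, connections_by_link):
        # link number -> list of TCP connection identifiers
        self.connections = {k: list(v) for k, v in connections_by_link.items()}
        self.pending_delete = set()
        self.sent = []  # LLC messages the server has sent, for inspection

    def start_delete(self, link, reason):
        survivors = [l for l in self.connections
                     if l != link and l not in self.pending_delete]
        if survivors:
            # Assumption: pick the surviving link with the fewest connections.
            target = min(survivors, key=lambda l: len(self.connections[l]))
            self.connections[target].extend(self.connections[link])
            self.connections[link] = []
        # Resources for the link are kept until the client responds.
        self.pending_delete.add(link)
        self.sent.append(("DELETE LINK", "request", link, reason))

    def on_delete_response(self, link):
        # Only now may the server clean up its end of the link.
        if link in self.pending_delete:
            self.pending_delete.remove(link)
            del self.connections[link]
```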
3.5.5.1.4. Client-Initiated DELETE LINK Request
The client may request that the server delete a link for the same reasons that the server may delete a link, except for inactivity timeout.

Because the client depends on the server to delete links, there are two types of delete requests from client to server:

   o  Orderly: The client is requesting that the server delete the link
      when able.  This would result from an operator command to bring
      down the RNIC or some other nonfatal reason.  In this case, the
      server is required to delete the link but may not do it right
      away.

   o  Disorderly: The server must delete the link right away, because
      the client has experienced a fatal error with the link.

In either case, the server responds by initiating a DELETE LINK exchange with the client, as described in the previous section. The difference between the two is whether the server must do so immediately or can delay for an opportunity to gracefully delete the link.

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
     |RToken X|   |      |<---X--X--X--X--X--X->|Failed|            |
     |        |   |      |                      |      |            |
     |       \/   +------+                      +------+            |
     |+--------+         |                      |                   |
     || Deleted|         |                      |                   |
     || RMB    |         |                      |                   |
     ||        |         |                      |                   |
     |+--------+         |                      |                   |
     |       /\   +------+                      +------+            |
     |RToken Z|   |      |     SMC-R Link 2     |      |            |
     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
     |       QP 64|      |                      |      | QP 65      |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

           DELETE LINK(request, link number = 1, disorderly,
                <...............................................
                       reason code = RNIC failure)

           DELETE LINK(request, link number = 1,
                 ................................................>
                        reason code = RNIC failure)

           DELETE LINK(response, link number = 1)
                <................................................

           (Note: Architecturally, this exchange can flow over either
                  SMC-R link but most likely flows over Link 2, since
                  the RNIC for Link 1 has failed.)

               Figure 11: Client-Initiated DELETE LINK Flow

3.5.5.2. Managing Multiple RKeys over Multiple SMC-R Links in a Link Group
After the initial contact sequence completes and the number of TCP connections increases, it is possible that the SMC peers could add more RMBs to the link group. Recall that each peer independently manages its RMBs. Also recall that an RMB's RToken is specific to a QP, which means that when there are multiple SMC-R links in a link group, each RMB accessed with the link group requires a separate RToken for each SMC-R link in the group.
   Each RMB that is added to a link must be added to all links within
   the link group.  The set of RTokens created for an RMB, one for
   each link in the link group, is called the "RToken set".  The
   RTokens must be exchanged with the peer.  As RMBs are added and
   deleted, the RToken set must remain in sync.

3.5.5.2.1. Adding a New RMB to an SMC-R Link Group
A new RMB can be added to an SMC-R link group on either the client side or the server side. When an additional RMB is added to an existing SMC-R link group, that RMB must be associated with the QPs for each link in the link group. Therefore, when an RMB is added to an SMC-R link group, its RMB RToken for each SMC-R link's QP must be communicated to the peer. The tokens for a new RMB added to an existing SMC-R link group are communicated using CONFIRM RKEY LLC messages, as shown in Figure 12. The RToken set is specified as pairs: an SMC-R link number, paired with the new RMB's RToken over that SMC-R link. To preserve failover capability, any TCP connection that uses a newly added RMB cannot go active until all RTokens for the RMB have been communicated for all of the links in the link group.
          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
     |RToken X|   |      |<-------------------->|      |            |
     |        |   |      |                      |      |            |
     |       \/   +------+                      +------+            |
     |+--------+         |                      |                   |
     || New    |         |                      |                   |
     || RMB    |         |                      |                   |
     ||        |         |                      |                   |
     |+--------+         |                      |                   |
     |       /\   +------+                      +------+            |
     |RToken Z|   |      |     SMC-R Link 2     |      |            |
     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
     |       QP 64|      |                      |      | QP 65      |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

           CONFIRM RKEY(request, Add,
                 ................................................>
                      RToken set((Link 1,RToken X),(Link 2,RToken Z)))

           CONFIRM RKEY(response, Add,
                <................................................
                      RToken set((Link 1,RToken X),(Link 2,RToken Z)))

            (Note: This exchange can flow over either SMC-R link.)

                 Figure 12: Add RMB to Existing Link Group
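
   The RToken set carried in CONFIRM RKEY, and the rule that a
   connection cannot go active on a new RMB until every link is
   covered, can be sketched as below.  The function names are
   hypothetical, and the letters stand in for real RToken values.

```python
def build_rtoken_set(rtokens_by_link):
    """Return (link number, RToken) pairs for one RMB, ordered by link."""
    return sorted(rtokens_by_link.items())


def rmb_ready(link_numbers, confirmed_set):
    """True once an RToken has been communicated for every link in the group."""
    confirmed_links = {link for link, _ in confirmed_set}
    return set(link_numbers) <= confirmed_links
```

   For the configuration in Figure 12, build_rtoken_set({1: "X",
   2: "Z"}) yields the pairs (1, "X") and (2, "Z"), matching the
   RToken set shown in the CONFIRM RKEY request.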

   Implementations may choose to proactively add RMBs to link groups in
   anticipation of need.  For example, an implementation may add a new
   RMB when a certain usage threshold (e.g., percentage used) for all of
   its existing RMBs has been exceeded.

   A new RMB may also be added to an existing link group on an as-needed
   basis -- for example, when a new TCP connection is added to the link
   group but there are no available RMB elements.  In this case, the CLC
   exchange is paused while the peer that requires the new RMB adds it.
   An example of this is illustrated in Figure 13.
       Host X -- Server                            Host Y -- Client
    +-------------------+                      +--------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1    |
    |            +------+                      +------+             |
    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64      |
    |RToken X|   |MAC MA|<-------------------->|MAC MB|   |         |
    |        |   |GID GA|                      |GID GB|   |RToken Y2|
    |       \/   +------+                      +------+  \/         |
    |+--------+         |                      |        +--------+  |
    ||        |         |   Subnet S1          |        | New    |  |
    || RMB    |         |                      |        | RMB    |  |
    |+--------+         |                      |        +--------+  |
    |       /\   +------+                      +------+  /\         |
    |        |   |RNIC 3|    SMC-R Link 2      |RNIC 4|   |RToken W2|
    |        |   |MAC MC|<-------------------->|MAC MD|   |         |
    |       QP 9 |GID GC|                      |GID GD|  QP 65      |
    |            +------+                      +------+             |
    +-------------------+                      +--------------------+

           SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
        <--------------------------------------------------------->

                    SMC Proposal(PC1,MB,GB,S1)
        <--------------------------------------------------------

      SMC Accept(PS1,not 1st contact,MA,GA,QP8,RToken=X,RMB elem index)
        --------------------------------------------------------->

          CONFIRM RKEY(request, Add,
        <........................................................
                  RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))

          CONFIRM RKEY(response, Add,
         ........................................................>
                  RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))

          SMC Confirm(PC1,MB,GB,QP64,RToken=Y2, RMB element index)
        <--------------------------------------------------------

                         Legend:
                  ------------   TCP/IP and CLC flows
                  ............   RoCE (LLC) flows

          Figure 13: Client Adds RMB during TCP Connection Setup
3.5.5.2.2. Deleting an RMB from an SMC-R Link Group
Either peer can delete one or more of its RMBs as long as they are not being used for any TCP connections. Ideally, an SMC-R peer would use a timer to avoid freeing an RMB immediately after the last TCP connection stops using it, to keep the RMB available for later TCP connections and avoid thrashing with addition and deletion of RMBs.

Once an SMC-R peer decides to delete an RMB, it sends a DELETE RKEY LLC message to its peer. It can then free the RMB once it receives a response from the peer. Multiple RMBs can be deleted in a DELETE RKEY exchange.

Note that in a DELETE RKEY message, it is not necessary to specify the full RToken for a deleted RMB. The RMB's RKey over one link in the link group is sufficient to specify which RMB is being deleted.

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
     |RToken X|   |      |<-------------------->|      |            |
     |        |   |      |                      |      |            |
     |       \/   +------+                      +------+            |
     |+--------+         |                      |                   |
     || Deleted|         |                      |                   |
     || RMB    |         |                      |                   |
     ||        |         |                      |                   |
     |+--------+         |                      |                   |
     |       /\   +------+                      +------+            |
     |RToken Z|   |      |     SMC-R Link 2     |      |            |
     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
     |       QP 9 |      |                      |      |            |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

           DELETE RKEY(request, RKey list(RKey X))
                 ................................................>

           DELETE RKEY(response, RKey list(RKey X))
                <................................................

            (Note: This exchange can flow over either SMC-R link.)

               Figure 14: Delete RMB from SMC-R Link Group

3.5.5.2.3. Adding a New SMC-R Link to a Link Group with Multiple RMBs
When a new SMC-R link is added to an existing link group, there could be multiple RMBs on each side already associated with the link group. There could also be a different number of RMBs on one side than on the other, because each peer manages its RMBs independently. Each of these RMBs will require a new RToken to be used on the new SMC-R link, and those new RTokens must then be communicated to the peer. This requires two-way communication, as the server will have to communicate its RTokens to the client and vice versa.

RTokens are communicated between peers in pairs. Each RToken pair consists of:

   o  The RToken for the RMB, as is already known on an existing SMC-R
      link in the link group.

   o  The RToken for the same RMB, to be used on the new SMC-R link.

These pairs are required to ensure that each peer knows which RTokens across QPs are equivalent.

The ADD LINK request and response LLC messages do not have enough space to contain any RToken pairs. ADD LINK CONTINUATION LLC messages are used to communicate these pairs, as shown in Figure 15. The ADD LINK CONTINUATION LLC messages are sent on the same SMC-R link that the ADD LINK LLC messages were sent over, and in both the ADD LINK and ADD LINK CONTINUATION LLC messages the first RToken in each RToken pair will be the RToken for the RMB as known on the SMC-R link over which the LLC message is being sent.

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64     |
    |RKey set|   |MAC MA|<-------------------->|MAC MB|   |RKey set|
    |X,Y,Z   |   |GID GA|                      |GID GB|   |Q,R,S,T |
    |       \/   +------+                      +------+  \/        |
    |+--------+         |                      |        +--------+ |
    || 3 RMBs |         |                      |        | 4 RMBs | |
    |+--------+         |                      |        +--------+ |
    |       /\   +------+                      +------+  /\        |
    |RKey set|   |RNIC 3|    SMC-R Link 2      |RNIC 4|  | RKey set|
    |U,V,W   |   |MAC MC|<-------------------->|MAC MD|  | L,M,N,P |
    |       QP 9 |GID GC|    (being added)     |GID GD| QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

            ADD LINK request (QP9,MC,GC, link number = 2)
            ............................................>

            ADD LINK response (QP65,MD,GD, link number = 2)
            <............................................

    ADD LINK CONTINUATION req(RToken pairs=((X,U),(Y,V),(Z,W)))
             ............................................>

    ADD LINK CONTINUATION rsp(RToken pairs=((Q,L),(R,M),(S,N),(T,P)))
             <.............................................

           CONFIRM LINK req/rsp exchange on Link 2
            <.............................................>


                          Legend:
                   ------------   TCP/IP and CLC flows
                   ............   RoCE (LLC) flows

   Figure 15: Exchanging RKeys when a New Link Is Added to a Link Group
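
   The pairing shown in Figure 15 can be sketched as follows.  This is
   a minimal illustration, not part of the protocol definition: the
   function name, the dictionary layout, and the RMB names ("rmb1" and
   so on) are assumptions made for the example, and the single letters
   stand in for real RTokens exactly as they do in the figure.

```python
from typing import Dict, List, Tuple

RTokenPair = Tuple[str, str]


def build_rtoken_pairs(known_link: Dict[str, str],
                       new_link: Dict[str, str]) -> List[RTokenPair]:
    """Pair each RMB's RToken on the sending link with its RToken on the new link.

    Both arguments map a (hypothetical) local RMB name to the RToken
    that RMB has on the given link.
    """
    if known_link.keys() != new_link.keys():
        raise ValueError("every RMB needs an RToken on both links")
    # First element: RToken as known on the link carrying the LLC message.
    # Second element: RToken for the same RMB on the link being added.
    return [(known_link[rmb], new_link[rmb]) for rmb in known_link]


# Values taken from Figure 15 (letters are placeholders, not real RTokens).
server_pairs = build_rtoken_pairs(
    {"rmb1": "X", "rmb2": "Y", "rmb3": "Z"},   # server RMBs on Link 1
    {"rmb1": "U", "rmb2": "V", "rmb3": "W"},   # same RMBs on Link 2
)
client_pairs = build_rtoken_pairs(
    {"rmb1": "Q", "rmb2": "R", "rmb3": "S", "rmb4": "T"},   # client, Link 1
    {"rmb1": "L", "rmb2": "M", "rmb3": "N", "rmb4": "P"},   # client, Link 2
)
print(server_pairs)
print(client_pairs)
```

   Each side builds its list independently, which is why the server
   sends three pairs and the client sends four in this example.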
3.5.5.3. Serialization of LLC Exchanges, and Collisions
   LLC flows can be divided into two main groups for serialization
   considerations.

   The first group is LLC messages that are independent and can flow at
   any time.  These are one-time, unsolicited messages that either do
   not have a required response or have a simple response that does not
   interfere with the operations of another group of messages.  These
   messages are as follows:

   o  TEST LINK from either the client or the server: This message
      requires a TEST LINK response to be returned but does not affect
      the configuration of the link group or the RKeys.

   o  ADD LINK from the client to the server: This message is provided
      as an "FYI" to the server to let it know that the client has an
      additional RNIC available.  The server is not required to act upon
      or respond to this message.

   o  DELETE LINK from the client to the server: This message informs
      the server that either (1) the client has experienced an error or
      problem that requires a link or link group to be terminated or
      (2) an operator has commanded that a link or link group be
      terminated.  The server does not respond directly to the message;
      rather, it initiates a DELETE LINK exchange as a result of
      receiving it.

   o  DELETE LINK from the server to the client, with the "delete entire
      link group" flag set: This message informs the client that the
      entire link group is being deleted.

   The second group is LLC messages that are part of an exchange of LLC
   messages that affects link group configuration; this exchange must
   complete before another exchange of LLC messages that affects link
   group configuration can be processed.  When a peer knows that one of
   these exchanges is in progress, it must not start another exchange.
   These exchanges are as follows:

   o  ADD LINK / ADD LINK response / ADD LINK CONTINUATION / ADD LINK
      CONTINUATION response / CONFIRM LINK / CONFIRM LINK response: This
      exchange, by adding a new link, changes the configuration of the
      link group.

   o  DELETE LINK / DELETE LINK response initiated by the server,
      without the "delete entire link group" flag set: This exchange, by
      deleting a link, changes the configuration of the link group.

   o  CONFIRM RKEY / CONFIRM RKEY response or DELETE RKEY / DELETE RKEY
      response: This exchange changes the RMB configuration of the link
      group.  RKeys cannot change while links are being added or deleted
      (while an ADD LINK or DELETE LINK is in progress).  However,
      CONFIRM RKEY and DELETE RKEY are unique in that both the client
      and server can independently manage (add or remove) their own
      RMBs.  This allows each peer to concurrently change their RKeys
      and therefore concurrently send CONFIRM RKEY or DELETE RKEY
      requests.  The concurrent CONFIRM RKEY or DELETE RKEY requests can
      be independently processed and do not represent a collision.

   Because the server is in control of the configuration of the link
   group, many timing windows and collisions are avoided, but there are
   still some that must be handled.
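
   The two groups described in this section can be expressed as a small
   classifier.  The sketch below is illustrative only: the type names,
   the string message kinds, and the "delete_entire_group" field are
   assumptions chosen for readability and do not reflect the LLC wire
   format.

```python
from enum import Enum, auto
from typing import NamedTuple


class Sender(Enum):
    CLIENT = auto()
    SERVER = auto()


class LlcMessage(NamedTuple):
    kind: str                        # e.g. "TEST LINK", "ADD LINK"
    sender: Sender
    delete_entire_group: bool = False


def is_independent(msg: LlcMessage) -> bool:
    """Return True for first-group messages that may flow at any time."""
    if msg.kind == "TEST LINK":
        return True                  # from either peer
    if msg.kind == "ADD LINK":
        # From the client this is only an "FYI"; from the server it
        # starts a configuration-changing exchange.
        return msg.sender is Sender.CLIENT
    if msg.kind == "DELETE LINK":
        # From the client it is a one-way notification; from the server
        # it is independent only when the whole link group is deleted.
        return msg.sender is Sender.CLIENT or msg.delete_entire_group
    # Everything else (CONFIRM LINK, CONFIRM RKEY, DELETE RKEY, ...) is
    # treated here as part of a second-group exchange.
    return False


print(is_independent(LlcMessage("ADD LINK", Sender.CLIENT)))
print(is_independent(LlcMessage("ADD LINK", Sender.SERVER)))
```

   Note that this classifier does not model the special case described
   above for concurrent CONFIRM RKEY or DELETE RKEY requests, which
   belong to the second group but do not collide with each other.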

3.5.5.3.1. Collisions with ADD LINK / CONFIRM LINK Exchange
   Colliding LLC message: TEST LINK

      Action to resolve: Send immediate TEST LINK reply.

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Server ignores the ADD LINK message.  When
      client receives server's ADD LINK, client will consider that
      message to be in response to its ADD LINK message and the flow
      works.  Since both client and server know not to start this
      exchange if an ADD LINK operation is already underway, this can
      only occur if the client sends this message before receiving the
      server's ADD LINK and this message crosses with the server's ADD
      LINK message; therefore, the server's ADD LINK arrives at the
      client immediately after the client sent this message.

   Colliding LLC message: DELETE LINK from client to server, specific
   link specified

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the ADD LINK exchange completes.  If it is an
      orderly link termination, it can wait until after this exchange
      completes.  If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from client to server, entire link
   group to be deleted

      Action to resolve: Immediately clean up the link group.

   Colliding LLC message: CONFIRM RKEY from client

      Action to resolve: Send a negative CONFIRM RKEY response to the
      client.  Once the current exchange finishes, client will have to
      recompute its RKey set to include the new link and then start a
      new CONFIRM RKEY exchange.
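
   The server-side resolutions listed in this section amount to a small
   lookup.  The following sketch assumes that the server has already
   determined that an ADD LINK / CONFIRM LINK exchange is in progress
   and that the colliding message came from the client (or, for TEST
   LINK, from either peer); the enum and function names are
   hypothetical.

```python
from enum import Enum


class Action(Enum):
    REPLY_NOW = "send immediate TEST LINK reply"
    IGNORE = "ignore; client treats server's ADD LINK as the answer"
    QUEUE = "queue and process after the current exchange completes"
    CLEAN_UP = "immediately clean up the link group"
    NEGATIVE_RSP = "send negative CONFIRM RKEY response"


def resolve_during_add_link(kind: str, entire_group: bool = False) -> Action:
    """Pick the server's action for a message colliding with ADD LINK."""
    if kind == "TEST LINK":
        return Action.REPLY_NOW
    if kind == "ADD LINK":
        return Action.IGNORE
    if kind == "DELETE LINK":
        # Deleting the whole group pre-empts the exchange; deleting a
        # single link waits for the exchange to finish.
        return Action.CLEAN_UP if entire_group else Action.QUEUE
    if kind == "CONFIRM RKEY":
        return Action.NEGATIVE_RSP
    raise ValueError(f"no collision rule listed for {kind!r}")


for kind in ("TEST LINK", "ADD LINK", "DELETE LINK", "CONFIRM RKEY"):
    print(f"{kind}: {resolve_during_add_link(kind).value}")
```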

3.5.5.3.2. Collisions during DELETE LINK Exchange
   Colliding LLC message: TEST LINK from either peer

      Action to resolve: Send immediate TEST LINK response.

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Server queues the ADD LINK and processes it
      after the current exchange completes.

   Colliding LLC message: DELETE LINK from client to server (specific
   link)

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the current exchange completes.  If it is an
      orderly link termination, it can wait until after this exchange
      completes.  If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from either client or server,
   deleting the entire link group

      Action to resolve: Immediately clean up the link group.

   Colliding LLC message: CONFIRM RKEY from client to server

      Action to resolve: Send a negative CONFIRM RKEY response to the
      client.  Once the current exchange finishes, client will have to
      recompute its RKey set to account for the deleted link and then
      start a new CONFIRM RKEY exchange.

3.5.5.3.3. Collisions during CONFIRM RKEY Exchange
   Colliding LLC message: TEST LINK

      Action to resolve: Send immediate TEST LINK reply.

   Colliding LLC message: ADD LINK from client to server

      Action to resolve: Queue the ADD LINK, and process it after the
      current exchange completes.

   Colliding LLC message: ADD LINK from server to client (CONFIRM RKEY
   exchange was initiated by the client, and it crossed with the server
   initiating an ADD LINK exchange)

      Action to resolve: Process the ADD LINK.  Client will receive a
      negative CONFIRM RKEY from the server and will have to redo this
      CONFIRM RKEY exchange after the ADD LINK exchange completes.

   Colliding LLC message: DELETE LINK from client to server, specific
   link to be deleted (CONFIRM RKEY exchange was initiated by the
   server, and it crossed with the client's DELETE LINK request)

      Action to resolve: Server queues the DELETE LINK message and
      processes it after the CONFIRM RKEY exchange completes.  If it is
      an orderly link termination, it can wait until after this exchange
      completes.  If it is disorderly and the link affected is the one
      that the current exchange is using, the server will discover the
      outage when a message in this exchange fails.

   Colliding LLC message: DELETE LINK from server to client, specific
   link deleted (CONFIRM RKEY exchange was initiated by the client, and
   it crossed with the server's DELETE LINK)

      Action to resolve: Process the DELETE LINK.  Client will receive a
      negative CONFIRM RKEY from the server and will have to redo this
      CONFIRM RKEY exchange after the DELETE LINK exchange completes.

   Colliding LLC message: DELETE LINK from either client or server,
   entire link group deleted

      Action to resolve: Immediately clean up the link group.

   Colliding LLC message: CONFIRM LINK from the peer that did not start
   the current CONFIRM LINK exchange

      Action to resolve: Queue the request, and process it after the
      current exchange completes.
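
   From the client's point of view, the cases above in which the server
   starts an ADD LINK or DELETE LINK exchange reduce to "remember that
   the CONFIRM RKEY was rejected, and redo it once the server's exchange
   is done".  The sketch below illustrates that bookkeeping under stated
   assumptions: the class, its method names, and the "sent" list (which
   stands in for actually transmitting LLC messages) are all
   hypothetical.

```python
from typing import List


class ClientRkeyRetry:
    """Tracks a client CONFIRM RKEY that crossed with a server exchange."""

    def __init__(self) -> None:
        self.server_exchange_active = False
        self.retry_pending = False
        self.sent: List[str] = []    # stand-in for messages put on the link

    def send_confirm_rkey(self) -> None:
        self.sent.append("CONFIRM RKEY")

    def on_server_exchange_start(self, kind: str) -> None:
        # kind is "ADD LINK" or "DELETE LINK"; the client processes it.
        self.server_exchange_active = True

    def on_confirm_rkey_response(self, negative: bool) -> None:
        if negative:
            # Do not retry yet; wait for the server's exchange to finish.
            self.retry_pending = True

    def on_server_exchange_complete(self) -> None:
        self.server_exchange_active = False
        if self.retry_pending:
            self.retry_pending = False
            self.send_confirm_rkey()  # redo the CONFIRM RKEY exchange


client = ClientRkeyRetry()
client.send_confirm_rkey()                  # crosses with server's ADD LINK
client.on_server_exchange_start("ADD LINK")
client.on_confirm_rkey_response(negative=True)
client.on_server_exchange_complete()
print(client.sent)
```

   A real implementation would also recompute the RKey set before the
   retry so that it matches the link group's new configuration.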

