
RFC 7609

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Pages: 143
Informational
Part 4 of 6 – Pages 60 to 91

4. SMC-R Memory-Sharing Architecture

4.1. RMB Element Allocation Considerations

Each TCP connection using SMC-R must be allocated an RMBE by each SMC-R peer. This allocation is performed by each endpoint independently to allow each endpoint to select an RMBE that best matches the characteristics on its TCP socket endpoint. The RMBE associated with a TCP socket endpoint must have a receive buffer that is at least as large as the TCP receive buffer size in effect for that connection. The receive buffer size can be determined by what is specified explicitly by the application using setsockopt() or implicitly via the system-configured default value. This will allow sufficient data to be RDMA-written by the SMC-R peer to fill an entire receive buffer size's worth of data on a given data flow. Given that each RMB must have fixed-length RMBEs, this implies that an SMC-R endpoint may need to maintain multiple RMBs of various sizes for SMC-R connections on a given SMC-R link and can then select an RMBE that most closely fits a connection.
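The selection rule above can be illustrated with a minimal Python sketch. It assumes a hypothetical set of RMBE size classes (this document does not prescribe any particular sizes) and picks the smallest class whose receive buffer is at least as large as the connection's TCP receive buffer. The function name and the size values are illustrative assumptions only.

```python
from typing import Optional, Sequence


def choose_rmbe_size(tcp_rcvbuf: int, size_classes: Sequence[int]) -> Optional[int]:
    """Return the smallest RMBE receive buffer size that fits tcp_rcvbuf.

    size_classes is a hypothetical list of receive buffer sizes, one per
    RMB maintained by this endpoint.  Returns None when no class is large
    enough (the caller would then have to allocate a larger RMB).
    """
    candidates = [size for size in size_classes if size >= tcp_rcvbuf]
    return min(candidates) if candidates else None


# Hypothetical size classes: 16K, 64K, and 256K receive buffers.
CLASSES = [16 * 1024, 64 * 1024, 256 * 1024]
print(choose_rmbe_size(64 * 1024, CLASSES))  # exact fit: the 64K class
print(choose_rmbe_size(70_000, CLASSES))     # too big for 64K: the 256K class
```

Choosing the closest fit keeps pinned memory use low while still letting the peer write a full receive buffer's worth of data.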

4.2. RMB and RMBE Format

An RMB is a virtual memory buffer whose backing real memory is pinned. The RMB is subdivided into a whole number of equal-sized RMB Elements (RMBEs). Each RMBE begins with a 4-byte eye catcher for diagnostic and service purposes, followed by the receive data buffer. The contents of this diagnostic eye catcher are implementation dependent and should be used by the local SMC-R peer to check for overlay errors by verifying an intact eye catcher with every RMBE access. The RMBE is a wrapping receive buffer for receiving RDMA writes from the peer. Cursors, as described below, are exchanged between peers to manage and track RDMA writes and local data reads from the RMBE for a TCP connection.

4.3. RMBE Control Information

RMBE control information consists of consumer cursors, producer cursors, wrap counts, CDC message sequence numbers, control flags such as urgent data and "writer blocked" indicators, and TCP connection information such as termination flags. This information is exchanged between SMC-R peers using CDC messages, which are passed using RoCE SendMsg. A TCP/IP stack implementing SMC-R must receive and store this information in its internal data structures, as it is used to manage the RMBE and its data buffer.
   The format and contents of the CDC message are described in detail in
   Appendix A.4 ("Connection Data Control (CDC) Message Format").  The
   following is a high-level description of what this control
   information contains.

   o  Connection state flags such as sending done, connection closed,
      failover data validation, and abnormal close.

   o  A sequence number that is managed by the sender.  This sequence
      number starts at 1, is increased each send, and wraps to 0.  This
      sequence number tracks the CDC message sent and is not related to
      the number of bytes sent.  It is used for failover data
      validation.

   o  Producer cursor: a wrapping offset into the receiver's RMBE data
      area.  Set by the peer that is writing into the RMBE, it points to
      where the writing peer will write the next byte of data into an
      RMBE.  This cursor is accompanied by a wrap sequence number to
      help the RMBE owner (the receiver) identify full window size
      wrapping writes.  Note that this cursor must account for (i.e.,
      skip over) the RMBE eye catcher that is in the beginning of the
      data area.

   o  Consumer cursor: a wrapping offset into the receiver's RMBE data
      area.  Set by the owner of the RMBE (the peer that is reading from
      it), this cursor points to the offset of the next byte of data to
      be consumed by the peer in its own RMBE.  The sender cannot write
      beyond this cursor into the receiver's RMBE without causing data
      loss.  Like the producer cursor, this is accompanied by a wrap
      count to help the writer identify full window size wrapping reads.
      Note that this cursor must account for (i.e., skip over) the RMBE
      eye catcher that is in the beginning of the data area.

   o  Data flags such as urgent data, writer blocked indicator, and
      cursor update requests.
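   As an illustration of the cursor arithmetic described above, the
   following Python sketch advances a cursor and its wrap sequence
   number while skipping the eye catcher.  It assumes that the RMBE size
   includes the 4-byte eye catcher, so data occupies offsets 4 through
   size - 1; the names Cursor and advance are hypothetical, and the
   fixed width of the wrap field in the CDC message is not modeled.

```python
from dataclasses import dataclass

EYE_CATCHER_LEN = 4  # each RMBE begins with a 4-byte eye catcher


@dataclass(frozen=True)
class Cursor:
    offset: int  # wrapping offset into the RMBE; never below EYE_CATCHER_LEN
    wrap: int    # wrap sequence number (unbounded here; an assumption)


def advance(cursor: Cursor, nbytes: int, rmbe_size: int) -> Cursor:
    """Advance a cursor by nbytes, wrapping back to just past the eye catcher.

    Assumes rmbe_size counts the eye catcher, so the data area spans
    offsets EYE_CATCHER_LEN .. rmbe_size - 1.
    """
    data_len = rmbe_size - EYE_CATCHER_LEN
    relative = (cursor.offset - EYE_CATCHER_LEN) + nbytes
    return Cursor(
        offset=EYE_CATCHER_LEN + relative % data_len,
        wrap=cursor.wrap + relative // data_len,
    )


# A 1000-byte write into an empty 10,000-byte RMBE moves the cursor to 1004.
print(advance(Cursor(offset=4, wrap=0), 1000, 10_000))
```

   Under the same assumption, writing a full 9996-byte data area from
   offset 1004 returns the cursor to 1004 with the wrap sequence number
   increased by one.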

4.4. Use of RMBEs

4.4.1. Initializing and Accessing RMBEs

The RMBE eye catcher is initialized by the RMB owner prior to assigning it to a specific TCP connection and communicating its RMB index to the SMC-R partner. After an RMBE index is communicated to the SMC-R partner, the RMBE can only be referenced in "read-only mode" by the owner, and all updates to it are performed by the remote SMC-R partner via RDMA write operations.
   Initialization of an RMBE must include the following:

   o  Zeroing out the entire RMBE receive buffer, which helps minimize
      data integrity issues (e.g., data from a previous connection
      somehow being presented to the current connection).

   o  Setting the beginning RMBE eye catcher.  This eye catcher plays an
      important role in helping detect accidental overlays of the RMBE.
      The RMB owner should always validate these eye catchers before
      each new reference to the RMBE.  If the eye catchers are found to
      be corrupted, the local host must reset the TCP connection
      associated with this RMBE and log the appropriate diagnostic
      information.
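   The two initialization steps and the validation check can be sketched
   in Python as follows.  This is only an illustration: the RMB is
   modeled as a bytearray, the eye catcher value b"RMBE" is a
   hypothetical choice (its contents are implementation dependent), and
   the function names are assumptions.

```python
EYE_CATCHER = b"RMBE"  # hypothetical 4-byte value; contents are implementation dependent


def init_rmbe(rmb: bytearray, index: int, rmbe_size: int) -> None:
    """Zero one RMBE inside the RMB, then stamp its eye catcher."""
    start = index * rmbe_size
    rmb[start:start + rmbe_size] = bytes(rmbe_size)     # zero the whole element
    rmb[start:start + len(EYE_CATCHER)] = EYE_CATCHER   # set the eye catcher


def eye_catcher_intact(rmb: bytearray, index: int, rmbe_size: int) -> bool:
    """Return True when the RMBE's eye catcher has not been overlaid."""
    start = index * rmbe_size
    return bytes(rmb[start:start + len(EYE_CATCHER)]) == EYE_CATCHER


rmb = bytearray(b"\xff" * 4096)  # stale contents from earlier use
init_rmbe(rmb, index=1, rmbe_size=1024)
print(eye_catcher_intact(rmb, 1, 1024))
```

   A caller would run the intact check before each new reference to the
   RMBE and reset the associated TCP connection when it fails.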

4.4.2. RMB Element Reuse and Conflict Resolution

RMB elements can be reused once their associated TCP and SMC-R connections are terminated. Under normal and abnormal SMC-R connection termination processing, both SMC-R peers must explicitly acknowledge that they are done using an RMBE before that element can be freed and reassigned to another SMC-R connection instance. For more details on SMC-R connection termination, refer to Section 4.8. However, there are some error scenarios where this two-way explicit acknowledgment may not be completed. In these scenarios, an RMBE owner may choose to reassign this RMBE to a new SMC-R connection instance on this SMC-R link group. When this occurs, the partner SMC-R peer must detect this condition during SMC-R Rendezvous processing when presented with an RMBE that it believes is already in use for a different SMC-R connection. In this case, the SMC-R peer must abort the existing SMC-R connection associated with this RMBE. The abort processing resets the TCP connection (if it is still active), but it must not attempt to perform any RDMA writes to this RMBE and must also ignore any data sitting in the local RMBE associated with the existing connection. It then proceeds to free up the local RMBE and notify the local application that the connection is being abnormally reset. The remote SMC-R peer then proceeds to normal processing for this new SMC-R connection.
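The conflict detection described above can be sketched as a small lookup table kept by the partner SMC-R peer. The class, the key format, and the connection identifiers below are hypothetical; a real implementation would also reset the TCP connection, free the local RMBE, and notify the application.

```python
from typing import Dict, Hashable, Optional


class PeerRmbeTracker:
    """Tracks which local connection is using each of the peer's RMBEs."""

    def __init__(self) -> None:
        self._in_use: Dict[Hashable, str] = {}

    def on_rendezvous(self, rmbe_key: Hashable, new_conn: str) -> Optional[str]:
        """Record new_conn as the user of rmbe_key.

        Returns the identifier of a stale connection that must be aborted
        (without any further RDMA writes to that RMBE), or None when the
        RMBE was not believed to be in use.
        """
        stale = self._in_use.get(rmbe_key)
        self._in_use[rmbe_key] = new_conn
        return stale

    def on_terminated(self, rmbe_key: Hashable, conn: str) -> None:
        """Forget the RMBE once conn has completed termination."""
        if self._in_use.get(rmbe_key) == conn:
            del self._in_use[rmbe_key]


tracker = PeerRmbeTracker()
tracker.on_rendezvous(("rmb-1", 3), "conn-A")
print(tracker.on_rendezvous(("rmb-1", 3), "conn-B"))  # conn-A must be aborted
```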

4.5. SMC-R Protocol Considerations

The following sections describe considerations for the SMC-R protocol as compared to TCP.

4.5.1. SMC-R Protocol Optimized Window Size Updates

   An SMC-R receiver host sends its consumer cursor information to the
   sender to convey the progress that the receiving application has made
   in consuming the sent data.  The difference between the writer's
   producer cursor and the associated receiver's consumer cursor
   indicates the window size available for the sender to write into.
   This is somewhat similar to TCP window update processing and
   therefore has some similar considerations, such as silly window
   syndrome avoidance, whereby TCP has an optimization that minimizes
   the overhead of very small, unproductive window size updates
   associated with suboptimal socket applications consuming very small
   amounts of data on every receive() invocation.

   For SMC-R, the receiver only updates its consumer cursor via a unique
   CDC message under the following conditions:

   o  The current window size (from a sender's perspective) is less than
      half of the receive buffer space, and the consumer cursor update
      will result in a minimum increase in the window size of 10% of the
      receive buffer space.  Some examples:

      a. Receive buffer size: 64K, current window size (from a sender's
         perspective): 50K.  No need to update the consumer cursor.
         Plenty of space is available for the sender.

      b. Receive buffer size: 64K, current window size (from a sender's
         perspective): 30K, current window size from a receiver's
         perspective: 31K.  No need to update the consumer cursor; even
         though the sender's window size is < 1/2 of the 64K, the window
         update would only increase that by 1K, which is < 1/10th of the
         64K buffer size.

      c. Receive buffer size: 64K, current window size (from a sender's
         perspective): 30K, current window size from a receiver's
         perspective: 64K.  The receiver updates the consumer cursor
         (sender's window size is < 1/2 of the 64K; the window update
         would increase that by > 6.4K).

   o  The receiver must always include a consumer cursor update whenever
      it sends a CDC message to the partner for another flow (i.e., send
      flow in the opposite direction).  This allows the window size
      update to be delivered with no additional overhead.  This is
      somewhat similar to TCP DelayAck processing and quite effective
      for request/response data patterns.

   o  If a peer has set the B-bit in a CDC message, then any consumption
      of data by the receiver causes a CDC message to be sent, updating
      the consumer cursor until a CDC message with that bit cleared is
      received from the peer.

   o  The optimized window size updates are overridden when the sender
      sets the Consumer Cursor Update Requested flag in a CDC message to
      the receiver.  When this indicator is on, the consumer must send a
      consumer cursor update immediately when data is consumed by the
      local application or if the cursor has not been updated for a
      while (i.e., local copy of the consumer cursor does not match the
      last consumer cursor value sent to the partner).  This allows the
      sender to perform optional diagnostics for detecting a stalled
      receiver application (data has been sent but not consumed).  It is
      recommended that the Consumer Cursor Update Requested flag only be
      sent for diagnostic procedures, as it may result in non-optimal
      data path performance.
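   The conditions above for sending a stand-alone consumer cursor update
   can be summarized in a short Python sketch.  The function and
   parameter names are hypothetical, and the piggybacked update that
   accompanies a CDC message for the opposite flow is deliberately left
   out, since it needs no separate decision.

```python
def needs_cursor_update(
    rcvbuf: int,
    sender_window: int,
    receiver_window: int,
    writer_blocked: bool = False,
    update_requested: bool = False,
) -> bool:
    """Decide whether the receiver should send a stand-alone CDC message.

    sender_window is the window as the sender currently sees it (based on
    the last consumer cursor sent to it); receiver_window is the window
    the sender would see after the update.
    """
    consumed = receiver_window > sender_window
    # B-bit or Consumer Cursor Update Requested: any consumption is reported.
    if consumed and (writer_blocked or update_requested):
        return True
    window_is_small = 2 * sender_window < rcvbuf                      # < 1/2 buffer
    gain_is_large = 10 * (receiver_window - sender_window) >= rcvbuf  # >= 10% buffer
    return window_is_small and gain_is_large


K = 1024
print(needs_cursor_update(64 * K, 50 * K, 64 * K))  # example a: False
print(needs_cursor_update(64 * K, 30 * K, 31 * K))  # example b: False
print(needs_cursor_update(64 * K, 30 * K, 64 * K))  # example c: True
```

   Example a does not state the receiver-side window; the 64K used here
   is an assumption, and any value gives the same result because the
   sender's window is not below half of the buffer.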

4.5.2. Small Data Sends

The SMC-R protocol makes no special provisions for handling small data segments sent across a stream socket. Data is always sent if sufficient window space is available. In contrast to the TCP Nagle algorithm, there are no special provisions in SMC-R for coalescing small data segments. An implementation of SMC-R can be configured to optimize its sending processing by coalescing outbound data for a given SMC-R connection so that it can reduce the number of RDMA write operations it performs, in a fashion similar to Nagle's algorithm. However, any such coalescing would require a timer on the sending host that would ensure that data was eventually sent. Also, the sending host would have to opt out of this processing if Nagle's algorithm had been disabled (programmatically or via system configuration).

4.5.3. TCP Keepalive Processing

   TCP keepalive processing allows applications to direct the local
   TCP/IP host to periodically "test" the viability of an idle TCP
   connection.  Since SMC-R connections have a TCP representation along
   with an SMC-R representation, there are unique keepalive processing
   considerations:

   o  SMC-R-layer keepalive processing: If keepalive is enabled for an
      SMC-R connection, the local host maintains a keepalive timer that
      reflects how long an SMC-R connection has been idle.  The local
      host also maintains a timestamp of last activity for each SMC-R
      link (for any SMC-R connection on that link).  When it is
      determined that an SMC-R connection has been idle longer than the
      keepalive interval, the host checks to see whether or not the
      SMC-R link has been idle for a duration longer than the keepalive
      timeout.  If both conditions are met, the local host then performs
      a TEST LINK LLC command to test the viability of the SMC-R link
      over the RoCE fabric (RC-QPs).  If a TEST LINK LLC command
      response is received within a reasonable amount of time, then the
      link is considered viable, and all connections using this link are
      considered viable as well.  If, however, a response is not
      received in a reasonable amount of time or there's a failure in
      sending the TEST LINK LLC command, then this is considered a
      failure in the SMC-R link, and failover processing to an alternate
      SMC-R link must be triggered.  If no alternate SMC-R link exists
      in the SMC-R link group, then all of the SMC-R connections on this
      link are abnormally terminated by resetting the TCP connections
      represented by these SMC-R connections.

      Given that multiple SMC-R connections can share the same SMC-R
      link, implementing an SMC-R link-level probe using the TEST LINK
      LLC command will help reduce the amount of unproductive keepalive
      traffic for SMC-R connections; as long as some SMC-R connections
      on a given SMC-R link are active (i.e., have had I/O activity
      within the keepalive interval), then there is no need to perform
      additional link viability testing.

   o  TCP-layer keepalive processing: Traditional TCP "keepalive"
      packets are not as relevant for SMC-R connections, given that the
      TCP path is not used for these connections once the SMC-R
      Rendezvous processing is completed.  All SMC-R connections by
      default have associated TCP connections that are idle.  Are TCP
      keepalive probes still needed for these connections?  There are
      two main scenarios to consider:

      1. TCP keepalives that are used to determine whether or not the
         peer TCP endpoint is still active.  This is not needed for
         SMC-R connections, as the SMC-R-level keepalives mentioned
         above will determine whether or not the remote endpoint
         connections are still active.

      2. TCP keepalives that are used to ensure that TCP connections
         traversing an intermediate proxy maintain an active state.  For
         example, stateful firewalls typically maintain state
         representing every valid TCP connection that traverses the
         firewall.  These types of firewalls are known to expire idle
         connections by removing their state in the firewall to conserve
         memory.  TCP keepalives are often used in this scenario to
         prevent firewalls from timing out otherwise idle connections.
         When using SMC-R, both endpoints must reside in the same
         Layer 2 network (i.e., the same subnet).  As a result,
         firewalls cannot be injected in the path between two SMC-R
         endpoints.  However, other intermediate proxies, such as
         TCP/IP-layer load balancers, may be injected in the path of two
         SMC-R endpoints.  These types of load balancers also maintain
         connection state so that they can forward TCP connection
         traffic to the appropriate cluster endpoint.  When using SMC-R,
         these TCP connections will appear to be completely idle, making
         them susceptible to potential timeouts at the load-balancing
         proxy.  As a result, for this scenario, TCP keepalives may
         still be relevant.

   The following are the TCP-level keepalive processing requirements for
   SMC-R-enabled hosts:

   o  SMC-R peers should allow TCP keepalives to flow on the TCP path of
      SMC-R connections based on existing TCP keepalive configuration
      and programming options.  However, it is strongly recommended that
      platforms provide the ability to specify very granular keepalive
      timers (for example, single-digit-second timers) and should
      consider providing a configuration option that limits the minimum
      keepalive timer that will be used for TCP-layer keepalives on
      SMC-R connections.  This is important to minimize the amount of
      TCP keepalive packets transmitted in the network for SMC-R
      connections.
   o  SMC-R peers must always respond to inbound TCP-layer keepalives
      (by sending ACKs for these packets) even if the connection is
      using SMC-R.  Typically, once a TCP connection has completed the
      SMC-R Rendezvous processing and is using SMC-R for data flows, no
      new inbound TCP segments are expected on that TCP connection,
      other than TCP termination segments (FIN, RST, etc.).  TCP
      keepalives are the one exception that must be supported.  Also,
      since TCP keepalive probes do not carry any application-layer
      data, this has no adverse impact on the application's inbound data
      stream.
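   The SMC-R-layer keepalive decision described earlier in this section
   can be sketched as follows.  The sketch assumes that the connection's
   keepalive interval and the link's keepalive timeout are the same
   value and that all timestamps are in seconds on one clock; the
   function name is hypothetical.

```python
def should_test_link(
    now: float,
    conn_last_activity: float,
    link_last_activity: float,
    keepalive_interval: float,
) -> bool:
    """Return True when a TEST LINK LLC probe should be sent.

    The probe is sent only if both the SMC-R connection and the SMC-R
    link it uses have been idle for longer than keepalive_interval.
    """
    conn_idle = now - conn_last_activity
    link_idle = now - link_last_activity
    return conn_idle > keepalive_interval and link_idle > keepalive_interval


# The connection is idle, but another connection used the link 5 seconds ago.
print(should_test_link(now=100.0, conn_last_activity=10.0,
                       link_last_activity=95.0, keepalive_interval=60.0))
```

   Because link activity from any connection suppresses the probe, busy
   links generate no extra keepalive traffic.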

4.6. TCP Connection Failover between SMC-R Links

A peer may change which SMC-R link within a link group it sends its writes over in the event of a link failure. Since each peer independently chooses which link to send writes over for a specific TCP connection, this process is done independently by each peer.

4.6.1. Validating Data Integrity

   Even though RoCE is a reliable transport, there is a small subset of
   failure modes that could cause unrecoverable loss of data.  When an
   RNIC acknowledges receipt of an RDMA write to its peer, that creates
   a write completion event to the sending peer, which allows the sender
   to release any buffers it is holding for that write.  In normal
   operation and in most failures, this operation is reliable.  However,
   there are failure modes possible in which a receiving RNIC has
   acknowledged an RDMA write but then was not able to place the
   received data into its host memory -- for example, a sudden,
   disorderly failure of the interface between the RNIC and the host.
   While rare, these types of events must be guarded against to ensure
   data integrity.  The process for switching SMC-R links during
   failover, as described in this section, guards against this
   possibility and is mandatory.

   Each peer must track the current state of the CDC sequence numbers
   for a TCP connection.  The sender must keep track of the sequence
   number of the CDC message that described the last write acknowledged
   by the peer RNIC, or Sequence Sent (SS).  In other words, SS
   describes the last write that the sender believes its peer has
   successfully received.  The receiver must keep track of the sequence
   number of the CDC message that described the last write that it has
   successfully received (i.e., the data has been successfully placed
   into an RMBE), or Sequence Received (SR).

   When an RNIC fails and the sender changes SMC-R links, the sender
   must first send a CDC message with the F-bit (failover validation
   indicator; see Appendix A.4) set over the new SMC-R link.  This is
   the failover data validation message.  The sequence number in this
   CDC message is equal to SS.  The CDC message key, the length, and the
   SMC-R alert token are the only other fields in this CDC message that
   are significant.  No reply is expected from this validation message,
   and once the sender has sent it, the sender may resume sending on the
   new SMC-R link as described in Section 4.6.2.

   Upon receipt of the failover validation message, the receiver must
   verify that its SR value for the TCP connection is equal to or
   greater than the sequence number in the failover validation message.
   If so, no further action is required, and the TCP connection resumes
   on the new SMC-R link.  If SR is less than the sequence number value
   in the validation message, data has been lost, and the receiver must
   immediately reset the TCP connection.
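   The receiver-side check can be expressed as a minimal Python sketch.
   It compares SR with the sequence number carried in the failover
   validation message (the sender's SS) as plain integers; handling of
   sequence number wrap is omitted, which is a simplifying assumption,
   and the names used are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    RESUME = "resume"  # connection continues on the new SMC-R link
    RESET = "reset"    # data was lost; the TCP connection must be reset


def check_failover_validation(sequence_received: int, validation_seq: int) -> Outcome:
    """Compare the receiver's SR with the SS carried in the validation message.

    Plain integer comparison; sequence number wrap is not handled here.
    """
    if sequence_received >= validation_seq:
        return Outcome.RESUME
    return Outcome.RESET


print(check_failover_validation(sequence_received=41, validation_seq=42))
```

   In this example the sender believes that write 42 was delivered, but
   the receiver only placed data through write 41, so the connection is
   reset.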

4.6.2. Resuming the TCP Connection on a New SMC-R Link

   When a connection is moved to a new SMC-R link and the failover
   validation message has been sent, the sender can immediately resume
   normal transmission.  In order to preserve the application message
   stream, the sender must replay any RDMA writes (and their associated
   CDC messages) that were in progress or failed when the previous SMC-R
   link failed, before sending new data on the new SMC-R link.  The
   sender has two options for accomplishing this:

   o  Preserve the sequence numbers "as is": Retry all failed and
      pending operations as they were originally done, including
      reposting all associated RDMA write operations and their
      associated CDC messages without making any changes.  Then resume
      sending new data using new sequence numbers.

   o  Combine pending messages and possibly add new data: Combine failed
      and pending messages into a single new write with a new sequence
      number.  This allows the sender to combine pending messages into
      fewer operations.  As a further optimization, this write can also
      include new data, as long as all failed and pending data are also
      included.  If this approach is taken, the sequence number must be
      increased beyond the last failed or pending sequence number.

4.7. RMB Data Flows

   The following sections describe the RDMA wire flows for the SMC-R
   protocol after a TCP connection has switched into SMC-R mode (i.e.,
   SMC-R Rendezvous processing is complete and a pair of RMB elements
   has been assigned and communicated by the SMC-R peers).  The ladder
   diagrams below include the following:

   o  RMBE control information kept by each peer.  Only a subset of the
      information is depicted, specifically only the fields that reflect
      the stream of data written by Host A and read by Host B.

   o  Time line 0-x, which shows the wire flows in a time-relative
      fashion.

   o  Note that RMBE control information is only shown in a time
      interval if its value changed (otherwise, assume that the value is
      unchanged from the previously depicted value).

   o  The local copy of the producer cursors and consumer cursors that
      is maintained by each host is not depicted in these figures.

   Note that the cursor values in the diagram reflect the necessity of
   skipping over the eye catcher in the RMBE data area.  They start and
   wrap at 4, not 0.

4.7.1. Scenario 1: Send Flow, Window Size Unconstrained

        SMC Host A                                SMC Host B
        RMBE A Info                               RMBE B Info
    (Consumer Cursors)                        (Producer Cursors)
   Cursor Wrap Seq# Time                  Time Cursor Wrap Seq# Flags
     4        0      0                     0     4        0       0
     0        0      1   --------------->  1     0        0       0
                           RDMA-WR Data
                             (4:1003)
     4        0      2   ...............>  2    1004      0       0
                            CDC Message

       Figure 16: Scenario 1: Send Flow, Window Size Unconstrained

   Scenario assumptions:

   o  Kernel implementation.

   o  New SMC-R connection; no data has been sent on the connection.

   o  Host A: Application issues send for 1000 bytes to Host B.

   o  Host B: RMBE receive buffer size is 10,000; application has issued
      a recv for 10,000 bytes.

   Flow description:

   1. The application issues a send() for 1000 bytes; the SMC-R layer
      copies data into a kernel send buffer.  It then schedules an RDMA
      write operation to move the data into the peer's RMBE receive
      buffer, at relative position 4-1003 (to skip the 4-byte
      eye catcher in the RMBE data area).  Note that no immediate data
      or alert (i.e., interrupt) is provided to Host B for this RDMA
      operation.

   2. Host A sends a CDC message to update the producer cursor to
      byte 1004.  This CDC message will deliver an interrupt to Host B.
      At this point, the SMC-R layer can return control back to the
      application.  Host B, once notified of the completion of the
      previous RDMA operation, locates the RMBE associated with the RMBE
      alert token that was included in the message and proceeds to
      perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  It will use the producer cursor as an
      indicator of how much data is available to be delivered to the
      local application.  After this processing is complete, the SMC-R
      layer will also update its local consumer cursor to match the
      producer cursor (i.e., indicating that all data has been
      consumed).  Note that a message to the peer updating the consumer
      cursor is not needed at this time, as the window size is
      unconstrained (> 1/2 of the receive buffer size).  The window size
      is calculated by taking the difference between the producer cursor
      and the consumer cursor in the RMBEs (10,000 - 1004 = 8996).

4.7.2. Scenario 2: Send/Receive Flow, Window Size Unconstrained

        SMC Host A                                SMC Host B
        RMBE A Info                               RMBE B Info
    (Consumer Cursors)                        (Producer Cursors)
   Cursor Wrap Seq# Time                  Time Cursor Wrap Seq# Flags
     4        0      0                     0     4        0       0
     0        0      1   --------------->  1     0        0       0
                           RDMA-WR Data
                             (4:1003)
     4        0      2   ...............>  2    1004      0       0
                            CDC Message
     0        0      3   <---------------  3    1004      0       0
                           RDMA-WR Data
                              (4:503)
    1004      0      4   <...............  4    1004      0       0
                            CDC Message

    Figure 17: Scenario 2: Send/Receive Flow, Window Size Unconstrained

   Scenario assumptions:

   o  New SMC-R connection; no data has been sent on the connection.

   o  Host A: Application issues send for 1000 bytes to Host B.

   o  Host B: RMBE receive buffer size is 10,000; application has
      already issued a recv for 10,000 bytes.  Once the receive is
      completed, the application sends a 500-byte response to Host A.

   Flow description:

   1. The application issues a send() for 1000 bytes; the SMC-R layer
      copies data into a kernel send buffer.  It then schedules an RDMA
      write operation to move the data into the peer's RMBE receive
      buffer, at relative position 4-1003.  Note that no immediate data
      or alert (i.e., interrupt) is provided to Host B for this RDMA
      operation.

   2. Host A sends a CDC message to update the producer cursor to
      byte 1004.  This CDC message will deliver an interrupt to Host B.
      At this point, the SMC-R layer can return control back to the
      application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and proceeds
      to perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  After this processing is complete, the SMC-R
      layer will also update its local consumer cursor to match the
      producer cursor (i.e., indicating that all data has been
      consumed).  Note that an update of the consumer cursor to the peer
      is not needed at this time, as the window size is unconstrained
      (> 1/2 of the receive buffer size).  The application then performs
      a send() for 500 bytes to Host A.  The SMC-R layer will copy the
      data into a kernel buffer and then schedule an RDMA write into the
      partner's RMBE receive buffer.  Note that this RDMA write
      operation includes no immediate data or notification to Host A.

   4. Host B sends a CDC message to update the partner's RMBE control
      information with the latest producer cursor (set to 504 and not
      shown in the diagram above) and to also inform the peer that the
      consumer cursor value is now 1004.  It also updates the local
      current consumer cursor and the last sent consumer cursor to 1004.
      This CDC message includes notification, since we are updating our
      producer cursor; this requires attention by the peer host.

4.7.3. Scenario 3: Send Flow, Window Size Constrained

        SMC Host A                                SMC Host B
        RMBE A Info                               RMBE B Info
    (Consumer Cursors)                        (Producer Cursors)
   Cursor Wrap Seq# Time                  Time Cursor Wrap Seq# Flags
     4        0      0                     0     4        0       0
     4        0      1   --------------->  1     4        0       0
                           RDMA-WR Data
                             (4:3003)
     4        0      2   ...............>  2    3004      0       0
                            CDC Message
     4        0      3                     3    3004      0       0
     4        0      4   --------------->  4    3004      0       0
                           RDMA-WR Data
                            (3004:7003)
     4        0      5   ...............>  5    7004      0       0
                            CDC Message
    7004      0      6   <...............  6    7004      0       0
                            CDC Message

        Figure 18: Scenario 3: Send Flow, Window Size Constrained

   Scenario assumptions:

   o  New SMC-R connection; no data has been sent on this connection.

   o  Host A: Application issues send for 3000 bytes to Host B and then
      another send for 4000 bytes.

   o  Host B: RMBE receive buffer size is 10,000.  Application has
      already issued a recv for 10,000 bytes.

   Flow description:

   1. The application issues a send() for 3000 bytes; the SMC-R layer
      copies data into a kernel send buffer.  It then schedules an RDMA
      write operation to move the data into the peer's RMBE receive
      buffer, at relative position 4-3003.  Note that no immediate data
      or alert (i.e., interrupt) is provided to Host B for this RDMA
      operation.

   2. Host A sends a CDC message to update its producer cursor to
      byte 3004.  This CDC message will deliver an interrupt to Host B.
      At this point, the SMC-R layer can return control back to the
      application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and proceeds
      to perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  After this processing is complete, the SMC-R
      layer will also update its local consumer cursor to match the
      producer cursor (i.e., indicating that all data has been
      consumed).  It will not, however, update the partner with this
      information, as the window size is not constrained
      (10,000 - 3000 = 7000 bytes of available space).  The application
      on Host B also issues a new recv() for 10,000 bytes.

   4. On Host A, the application issues a send() for 4000 bytes.  The
      SMC-R layer copies the data into a kernel buffer and schedules an
      async RDMA write into the peer's RMBE receive buffer at relative
      position 3004-7003.  Note that no alert is provided to Host B for
      this flow.

   5. Host A sends a CDC message to update the producer cursor to
      byte 7004.  This CDC message will deliver an interrupt to Host B.
      At this point, the SMC-R layer can return control back to the
      application.
   6. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and proceeds
      to perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  After this processing is complete, the SMC-R
      layer will also update its local consumer cursor to match the
      producer cursor (i.e., indicating that all data has been
      consumed).  It will then determine whether or not it needs to
      update the consumer cursor to the peer.  The available window size
      is now 3000 (10,000 - (producer cursor - last sent consumer
      cursor)), which is < 1/2 of the receive buffer size (10,000/2 =
      5000), and the advance of the window size is > 10% of the receive
      buffer size (1000).  Therefore, a CDC message is issued to
      update the consumer cursor to Peer A.
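   The arithmetic in step 6 can be reproduced with a short Python
   sketch.  It follows the formula given in the step and assumes that
   both cursors are within the same wrap of the buffer; the function
   name is hypothetical.

```python
RCVBUF = 10_000  # RMBE receive buffer size used in this scenario


def sender_view_window(rcvbuf: int, producer_cursor: int,
                       last_sent_consumer_cursor: int) -> int:
    """Window as the sender sees it; assumes both cursors share one wrap."""
    return rcvbuf - (producer_cursor - last_sent_consumer_cursor)


before = sender_view_window(RCVBUF, 7004, 4)    # consumer cursor last sent as 4
after = sender_view_window(RCVBUF, 7004, 7004)  # after sending the update

constrained = before < RCVBUF // 2              # 3000 < 5000
worthwhile = (after - before) >= RCVBUF // 10   # 7000 >= 1000
print(before, after, constrained and worthwhile)
```

   Both conditions hold, which is why Host B issues the CDC message at
   time 6.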

4.7.4. Scenario 4: Large Send, Flow Control, Full Window Size Writes

        SMC Host A                                SMC Host B
        RMBE A Info                               RMBE B Info
    (Consumer Cursors)                        (Producer Cursors)
   Cursor Wrap Seq# Time                  Time Cursor Wrap Seq# Flags
    1004      1      0                     0    1004      1       0
    1004      1      1   --------------->  1    1004      1       0
                           RDMA-WR Data
                            (1004:9999)
    1004      1      2   --------------->  2    1004      1       0
                           RDMA-WR Data
                             (4:1003)
    1004      1      3   ...............>  3    1004      2     Wrt Blk
                            CDC Message
    1004      2      4   <...............  4    1004      2     Wrt Blk
                            CDC Message
    1004      2      5   --------------->  5    1004      2     Wrt Blk
                           RDMA-WR Data
                            (1004:9999)
    1004      2      6   --------------->  6    1004      2     Wrt Blk
                           RDMA-WR Data
                             (4:1003)
    1004      2      7   ...............>  7    1004      3     Wrt Blk
                            CDC Message
    1004      3      8   <...............  8    1004      3     Wrt Blk
                            CDC Message

   Figure 19: Scenario 4: Large Send, Flow Control, Full Window Size
                                 Writes

   Scenario assumptions:

   o  Kernel implementation.

   o  Existing SMC-R connection, Host B's receive window size is fully
      open (peer consumer cursor = peer producer cursor).

   o  Host A: Application issues send for 20,000 bytes to Host B.

   o  Host B: RMBE receive buffer size is 10,000; application has issued
      a recv for 10,000 bytes.

   Flow description:

   1. The application issues a send() for 20,000 bytes; the SMC-R layer
      copies data into a kernel send buffer (assumes that send buffer
      space of 20,000 is available for this connection).  It then
      schedules an RDMA write operation to move the data into the peer's
      RMBE receive buffer, at relative position 1004-9999.  Note that no
      immediate data or alert (i.e., interrupt) is provided to Host B
      for this RDMA operation.

   2. Host A then schedules an RDMA write operation to fill the
      remaining 1000 bytes of available space in the peer's RMBE receive
      buffer, at relative position 4-1003.  Note that no immediate data
      or alert (i.e., interrupt) is provided to Host B for this RDMA
      operation.  Also note that an implementation of SMC-R may optimize
      this processing by combining steps 1 and 2 into a single
      RDMA write operation (with two different data sources).

   3. Host A sends a CDC message to update the producer cursor to
      byte 1004.  Since the entire receive buffer space is filled, the
      producer writer blocked flag (the "Wrt Blk" indicator (flag) in
      Figure 19) is set and the producer cursor wrap sequence number
      (the producer "Wrap Seq#" in Figure 19) is incremented.  This CDC
      message will deliver an interrupt to Host B.  At this point, the
      SMC-R layer can return control back to the application.

   4. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and proceeds
      to perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  In this scenario, Host B notices that the
      producer cursor has not been advanced (same value as the consumer
      cursor); however, it notices that the producer cursor wrap
      sequence number is different from its local value (1), indicating
      that a full window of new data is available.  All of the data in
      the receive buffer can be processed, with the first segment
      (1004-9999) followed by the second segment (4-1003).  Because the
      producer writer blocked indicator was set, Host B schedules a CDC
      message to update its latest information to the peer: consumer
      cursor (1004), consumer cursor wrap sequence number (the current
      value of 2 is used).

   5. Host A, upon receipt of the CDC message, locates the TCP
      connection associated with the alert token and, upon examining the
      control information provided, notices that Host B has consumed all
      of the data (based on the consumer cursor and the consumer cursor
      wrap sequence number) and initiates the next RDMA write to fill
      the receive buffer at offset 1004-9999.

   6. Host A then moves the next 1000 bytes into the beginning of the
      receive buffer (4-1003) by scheduling an RDMA write operation.
      Note that at this point there are still 8 bytes remaining to be
      written.

   7. Host A then sends a CDC message to set the producer writer blocked
      indicator and to increment the producer cursor wrap sequence
      number (3).

   8. Host B, upon notification, completes the same processing as step 4
      above, including sending a CDC message to update the peer to
      indicate that all data has been consumed.  At this point, Host A
      can write the final 8 bytes to Host B's RMBE into
      positions 1004-1011 (not shown).
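
   The arithmetic behind this scenario can be sketched as follows.  This
   is a minimal illustration under stated assumptions: both helper
   functions and their names are hypothetical, the data area is assumed
   to start right after the 4-byte eye catcher described in Section 4.2,
   and the producer is assumed to be at most one wrap ahead of the
   consumer (wrap sequence number rollover is not modeled).

```python
EYE_CATCHER_LEN = 4  # RMBE data area starts after the 4-byte eye catcher


def wrap_segments(cursor, length, rmbe_size, data_start=EYE_CATCHER_LEN):
    """Split a write into inclusive (first, last) byte ranges in the RMBE."""
    capacity = rmbe_size - data_start
    if not 0 < length <= capacity:
        raise ValueError("length must fit in one window")
    # First range runs from the cursor toward the end of the buffer.
    first_len = min(length, rmbe_size - cursor)
    segments = [(cursor, cursor + first_len - 1)]
    rest = length - first_len
    if rest:
        # Remainder wraps to the start of the data area.
        segments.append((data_start, data_start + rest - 1))
    return segments


def bytes_available(prod_cursor, prod_wrap, cons_cursor, cons_wrap, capacity):
    """Bytes ready to read; the wrap sequence number distinguishes a full
    window from an empty one when both cursors are equal."""
    if prod_wrap == cons_wrap:
        return prod_cursor - cons_cursor
    # Producer is one wrap ahead of the consumer (assumption).
    return capacity - (cons_cursor - prod_cursor)


# Steps 1-2: a full window of 9996 bytes starting at cursor 1004.
print(wrap_segments(1004, 9996, 10_000))        # [(1004, 9999), (4, 1003)]
# Step 4: equal cursors, but the producer wrap sequence number is 2.
print(bytes_available(1004, 2, 1004, 1, 9996))  # 9996
# Step 8: the final 8 bytes of the 20,000-byte send.
print(wrap_segments(1004, 8, 10_000))           # [(1004, 1011)]
```

   Two full windows of 9996 bytes plus the final 8 bytes account for the
   20,000 bytes sent by the application.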

4.7.5. Scenario 5: Send Flow, Urgent Data, Window Size Unconstrained

         SMC Host A                                   SMC Host B
         RMBE A Info                                  RMBE B Info
     (Consumer Cursors)                           (Producer Cursors)
   Cursor  Wrap Seq#  Time                   Time  Cursor  Wrap Seq# Flag

    1000       1       0                      0     1000       1     0

    1000       1       1   --------------->   1     1000       1     0
                             RDMA-WR Data
                             (1000:1499)

    1000       1       2   ...............>   2     1500       1     UrgP
                             CDC Message                             UrgA

    1500       1       3   <...............   3     1500       1     UrgP
                             CDC Message                             UrgA

    1500       1       4   --------------->   4     1500       1     UrgP
                             RDMA-WR Data                            UrgA
                             (1500:2499)

    1500       1       5   ...............>   5     2500       1     0
                             CDC Message

      Figure 20: Scenario 5: Send Flow, Urgent Data, Window Size Open

   Scenario assumptions:

   o  Kernel implementation.

   o  Existing SMC-R connection; window size open (unconstrained); all
      data has been consumed by receiver.

   o  Host A: Application issues send for 500 bytes with urgent data
      indicator (out of band) to Host B, then sends 1000 bytes of normal
      data.

   o  Host B: RMBE receive buffer size is 10,000; application has issued
      a recv for 10,000 bytes and is also monitoring the socket for
      urgent data.

   Flow description:

   1. The application issues a send() for 500 bytes of urgent data; the
      SMC-R layer copies data into a kernel send buffer.  It then
      schedules an RDMA write operation to move the data into the peer's
      RMBE receive buffer, at relative position 1000-1499.  Note that no
      immediate data or alert (i.e., interrupt) is provided to Host B
      for this RDMA operation.

   2. Host A sends a CDC message to update its producer cursor to
      byte 1500 and to turn on the producer Urgent Data Pending (UrgP)
      and Urgent Data Present (UrgA) flags.  This CDC message will
      deliver an interrupt to Host B.  At this point, the SMC-R layer
      can return control back to the application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token, notices
      that the Urgent Data Pending flag is on, and proceeds with out-of-
      band socket API notification -- for example, satisfying any
      outstanding select() or poll() requests on the socket by
      indicating that urgent data is pending (i.e., by setting the
      exception bit on).  The urgent data present indicator allows
      Host B to also determine the position of the urgent data (the
      producer cursor points 1 byte beyond the last byte of urgent
      data).  Host B can then perform normal receive-side processing
      (including specific urgent data processing), copying the data into
      the application's receive buffer, etc.  Host B then sends a CDC
      message to update the partner's RMBE control area with its latest
      consumer cursor (1500).  Note that this CDC message must occur,
      regardless of the current local window size that is available.
      The partner host (Host A) cannot initiate any additional RDMA
      writes until it receives acknowledgment that the urgent data has
      been processed (or at least processed/remembered at the SMC-R
      layer).

   4. Upon receipt of the message, Host A wakes up, sees that the peer
      consumed all data up to and including the last byte of urgent
      data, and now resumes sending any pending data.  In this case, the
      application had previously issued a send for 1000 bytes of normal
      data, which would have been copied in the send buffer, and control
      would have been returned to the application.  Host A now initiates
      an RDMA write to move that data to the peer's receive buffer at
      position 1500-2499.

   5. Host A then sends a CDC message to update its producer cursor
      value (2500) and to turn off the Urgent Data Pending and Urgent
      Data Present flags.  Host B wakes up, processes the new data
      (resumes application, copies data into the application receive
      buffer), and then proceeds to update the local current consumer
      cursor (2500).  Given that the window size is unconstrained, there
      is no need for a consumer cursor update in the peer's RMBE.
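
   The receive-side handling of the urgent flags in step 3 can be
   sketched as below.  This is an illustration only: the function name,
   the action labels, and the tuple representation are hypothetical, and
   cursor wrap is not handled.

```python
def handle_urgent_cdc(producer_cursor, urg_pending, urg_present):
    """Return the receiver-side actions implied by the urgent flags."""
    actions = []
    if urg_pending:
        # e.g. complete select()/poll() with the exception condition set
        actions.append(("notify_exception", None))
    if urg_present:
        # The producer cursor points 1 byte beyond the last urgent byte.
        actions.append(("urgent_byte_at", producer_cursor - 1))
        # After processing the urgent data, this consumer cursor update
        # is mandatory regardless of the available window size; the
        # sender issues no further RDMA writes until it arrives.
        actions.append(("send_consumer_cursor", producer_cursor))
    return actions


# Step 2 of this scenario: cursor 1500 with UrgP and UrgA set.
print(handle_urgent_cdc(1500, True, True))
# [('notify_exception', None), ('urgent_byte_at', 1499),
#  ('send_consumer_cursor', 1500)]
```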

4.7.6. Scenario 6: Send Flow, Urgent Data, Window Size Closed

         SMC Host A                                   SMC Host B
         RMBE A Info                                  RMBE B Info
     (Consumer Cursors)                           (Producer Cursors)
   Cursor  Wrap Seq#  Time                   Time  Cursor  Wrap Seq# Flag

    1000       1       0                      0     1000       2     Wrt
                                                                     Blk

    1000       1       1   ...............>   1     1000       2     Wrt
                             CDC Message                             Blk
                                                                     UrgP

    1000       2       2   <...............   2     1000       2     Wrt
                             CDC Message                             Blk
                                                                     UrgP

    1000       2       3   --------------->   3     1000       2     Wrt
                             RDMA-WR Data                            Blk
                             (1000:1499)                             UrgP

    1000       2       4   ...............>   4     1500       2     UrgP
                             CDC Message                             UrgA

    1500       2       5   <...............   5     1500       2     UrgP
                             CDC Message                             UrgA

    1500       2       6   --------------->   6     1500       2     UrgP
                             RDMA-WR Data                            UrgA
                             (1500:2499)

    1000       2       7   ...............>   7     2500       2     0
                             CDC Message

     Figure 21: Scenario 6: Send Flow, Urgent Data, Window Size Closed

   Scenario assumptions:

   o  Kernel implementation.

   o  Existing SMC-R connection; window size closed; writer is blocked.

   o  Host A: Application issues send for 500 bytes with urgent data
      indicator (out of band) to Host B, then sends 1000 bytes of normal
      data.

   o  Host B: RMBE receive buffer size is 10,000; application has no
      outstanding recv() (for normal data) and is monitoring the socket
      for urgent data.

   Flow description:

   1. The application issues a send() for 500 bytes of urgent data; the
      SMC-R layer copies data into a kernel send buffer (if available).
      Since the writer is blocked (window size closed), it cannot send
      the data immediately.  It then sends a CDC message to notify the
      peer of the Urgent Data Pending (UrgP) indicator (the writer
      blocked indicator remains on as well).  This serves as a signal to
      Host B that urgent data is pending in the stream.  Control is also
      returned to the application at this point.

   2. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token, notices
      that the Urgent Data Pending flag is on, and proceeds with out-of-
      band socket API notification -- for example, satisfying any
      outstanding select() or poll() requests on the socket by
      indicating that urgent data is pending (i.e., by setting the
      exception bit on).  At this point, it is expected that the
      application will enter urgent data mode processing, expeditiously
      processing all normal data (by issuing recv API calls) so that it
      can get to the urgent data byte.  Whether the application has this
      urgent mode processing or not, at some point, the application will
      consume some or all of the pending data in the receive buffer.
      When this occurs, Host B will also send a CDC message to update
      its consumer cursor and consumer cursor wrap sequence number to
      the peer.  In the example above, a full window's worth of data was
      consumed.

   3. Host A, once awakened by the message, will notice that the window
      size is now open on this connection (based on the consumer cursor
      and the consumer cursor wrap sequence number, which now matches
      the producer cursor wrap sequence number) and resume sending of
      the urgent data segment by scheduling an RDMA write into relative
      position 1000-1499.

   4. Host A then sends a CDC message to advance its producer cursor
      (1500) and to also notify Host B of the Urgent Data Present (UrgA)
      indicator (and turn off the writer blocked indicator).  This
      signals to Host B that the urgent data is now in the local receive
      buffer and that the producer cursor points 1 byte beyond the last
      byte of urgent data.

   5. Host B wakes up, processes the urgent data, and, once the urgent
      data is consumed, sends a CDC message to update its consumer
      cursor (1500).
   6. Host A wakes up, sees that Host B has consumed the sequence number
      associated with the urgent data, and then initiates the next RDMA
      write operation to move the 1000 bytes associated with the next
      send() of normal data into the peer's receive buffer at
      position 1500-2499.  Note that the send API would have likely
      completed earlier in the process by copying the 1000 bytes into a
      send buffer and returning back to the application, even though we
      could not send any new data until the urgent data was processed
      and acknowledged by Host B.

   7. Host A sends a CDC message to advance its producer cursor to 2500
      and to reset the Urgent Data Pending and Urgent Data Present
      flags.  Host B wakes up and processes the inbound data.
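
   The sender-side choice between steps 1 and 3-4 can be sketched as
   follows.  This is a simplified illustration: the function name, the
   action tuples, and the flag labels are hypothetical, only the fully
   blocked and fully open cases from this scenario are modeled, and
   cursor wrap is ignored.

```python
def urgent_send_actions(cursor, urgent_len, writer_blocked):
    """Return the ordered actions a sender takes for queued urgent data."""
    if writer_blocked:
        # Window closed: no data can be written yet, so only announce
        # that urgent data is pending (writer blocked stays on).
        return [("CDC", cursor, ("UrgP", "WrtBlk"))]
    # Window open: write the urgent bytes, then advance the producer
    # cursor and flag the urgent data as present.
    return [
        ("RDMA-WR", (cursor, cursor + urgent_len - 1)),
        ("CDC", cursor + urgent_len, ("UrgA", "UrgP")),
    ]


# Step 1: writer blocked at cursor 1000.
print(urgent_send_actions(1000, 500, True))
# [('CDC', 1000, ('UrgP', 'WrtBlk'))]

# Steps 3-4: window reopened by Host B's consumer cursor update.
print(urgent_send_actions(1000, 500, False))
# [('RDMA-WR', (1000, 1499)), ('CDC', 1500, ('UrgA', 'UrgP'))]
```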

4.8. Connection Termination

Just as SMC-R connections are established using a combination of TCP connection establishment flows and SMC-R protocol flows, the termination of SMC-R connections also uses a similar combination of SMC-R protocol termination flows and normal TCP connection termination flows. The following sections describe the SMC-R protocol normal and abnormal connection termination flows.

4.8.1. Normal SMC-R Connection Termination Flows

   Normal SMC-R connection termination flows are triggered via the
   normal stream socket API semantics, namely by the application issuing
   a close() or shutdown() API.  Most applications, after consuming all
   incoming data and after sending any outbound data, will then issue a
   close() API to indicate that they are done both sending and receiving
   data.  Some applications, typically a small percentage, make use of
   the shutdown() API that allows them to indicate that the application
   is done sending data, receiving data, or both sending and receiving
   data.  The main use of this API is in scenarios where a TCP
   application wants to alert its partner endpoint that it is done
   sending data but is still receiving data on its socket (shutdown for
   write).  Issuing shutdown() for both sending and receiving data is
   really no different than issuing a close() and can therefore be
   treated in a similar fashion.  Shutdown for read is typically not a
   very useful operation and in normal circumstances does not trigger
   any network flows to notify the partner TCP endpoint of this
   operation.

   These same trigger points will be used by the SMC-R layer to initiate
   SMC-R connection termination flows.  The main design point for SMC-R
   normal connection termination flows is to use the SMC-R protocol to
   first shut down the SMC-R connection and free up any SMC-R RDMA
   resources, and then allow the normal TCP connection termination
   protocol (i.e., FIN processing) to drive cleanup of the TCP
   connection.  This design point is very important in ensuring that
   RDMA resources such as the RMBEs are only freed and reused when both
   SMC-R endpoints are completely done with their RDMA write operations
   to the partner's RMBE.

                                      1
                            +-----------------+
            |-------------->|     CLOSED      |<-------------|
        3D  |               |                 |              |  4D
            |               +-----------------+              |
            |                       |                        |
            |                     2 |                        |
            |                       V                        |
    +----------------+     +-----------------+     +----------------+
    |AppFinCloseWait |     |     ACTIVE      |     |PeerFinCloseWait|
    |                |     |                 |     |                |
    +----------------+     +-----------------+     +----------------+
            |                   |         |                   |
            |     Active Close  | 3A | 4A |  Passive Close    |
            |                   V    |    V                   |
            |       +--------------+ | +-------------+        |
            |--<----|PeerCloseWait1| | |AppCloseWait1|--->----|
        3C  |       |              | | |             |        |  4C
            |       +--------------+ | +-------------+        |
            |             |          |         |              |
            |             | 3B       |     4B  |              |
            |             V          |         V              |
            |       +--------------+ | +-------------+        |
            |--<----|PeerCloseWait2| | |AppCloseWait2|--->----|
                    |              | | |             |
                    +--------------+ | +-------------+
                                     |
                                     |

                    Figure 22: SMC-R Connection States

   Figure 22 describes the states that an SMC-R connection typically
   goes through.  Note that there are variations to these states that
   can occur when an SMC-R connection is abnormally terminated, similar
   in a way to when a TCP connection is reset.  The following are the
   high-level state transitions for an SMC-R connection:

   1. An SMC-R connection begins in the Closed state.  This state is
      meant to reflect an RMBE that is not currently in use (was
      previously in use but no longer is, or was never allocated).
   2. An SMC-R connection progresses to the Active state once the SMC-R
      Rendezvous processing has successfully completed, RMB element
      indices have been exchanged, and SMC-R links have been activated.
      In this state, the TCP connection is fully established, rendezvous
      processing has been completed, and SMC-R peers can begin the
      exchange of data via RDMA.

   3. Active close processing (on the SMC-R peer that is initiating the
      connection termination).

      A. When an application on one of the SMC-R connection peers issues
         a close(), a shutdown() for write, or a shutdown() for both
         read and write, the SMC-R layer on that host will initiate
         SMC-R connection termination processing.  First, if a close()
         or shutdown(both) is issued, it will check to see that there's
         no data in the local RMB element that has not been read by the
         application.  If unread data is detected, the SMC-R connection
         must be abnormally reset; for more details on this, refer to
         Section 4.8.2 ("Abnormal SMC-R Connection Termination Flows").
         If no unread data is pending, it then checks to see whether or
         not any outstanding data is waiting to be written to the peer,
         or if any outstanding RDMA writes for this SMC-R connection
         have not yet completed.  If either of these two scenarios is
         true, an indicator that this connection is in a pending close
         state is saved in internal data structures representing this
         SMC-R connection, and control is returned to the application.
         If all data to be written to the partner has completed, this
         peer will send a CDC message to notify the peer of either the
         PeerConnectionClosed indicator (close or shutdown for both was
         issued) or the PeerDoneWriting indicator.  This will provide an
         interrupt to inform the partner SMC-R peer that the connection
         is terminating.  At this point, the local side of the SMC-R
         connection transitions to the PeerCloseWait1 state, and control
         can be returned to the application.  If this process could not
         be completed synchronously (the pending close condition
         mentioned above), it is completed when all RDMA writes for data
         and control cursors have been completed.

      B. At some point, the SMC-R peer application (passive close) will
         consume all incoming data, realize that the partner is done
         sending data on this connection, and proceed to initiate its
         own close of the connection once it has completed sending all
         data from its end.  The partner application can initiate this
         connection termination processing via close() or shutdown()
         APIs.  If the application does so by issuing a shutdown() for
         write, then the partner SMC-R layer will send a CDC message to
         notify the peer (the active close side) of the PeerDoneWriting
         indicator.  When the "active close" SMC-R peer wakes up as a
         result of the previous CDC message, it will notice that the
         PeerDoneWriting indicator is now on and transition to the
         PeerCloseWait2 state.  This state indicates that the peer is
         done sending data and may still be reading data.  At this
         point, the "active close" peer will also need to ensure that
         any outstanding recv() calls for this socket are woken up and
         remember that no more data is forthcoming on this connection
         (in case the local connection was shutdown() for write only).

      C. This flow is a common transition from 3A or 3B above.  When the
         SMC-R peer (passive close) consumes all data and updates all
         necessary cursors to the peer, and the application closes its
         socket (close or shutdown for both), it will send a CDC message
         to the peer (the active close side) with the
         PeerConnectionClosed indicator set.  At this point, the
         connection can transition back to the Closed state if the local
         application has already closed (or issued shutdown for both)
         the socket.  Once in the Closed state, the RMBE can now be
         safely reused for a new SMC-R connection.  When the
         PeerConnectionClosed indicator is turned on, the SMC-R peer is
         indicating that it is done updating the partner's RMBE.

      D. Conditional state: If the local application has not yet issued
         a close() or shutdown(both), we need to wait until the
         application does so.  Once it does, the local host will send a
         CDC message to notify the peer of the PeerConnectionClosed
         indicator and then transition to the Closed state.

   4. Passive close processing (on the SMC-R peer that receives an
      indication that the partner is closing the connection).

      A. Upon receipt of a CDC message, the SMC-R layer will detect that
         the PeerConnectionClosed indicator or PeerDoneWriting indicator
         is on.  If any outstanding recv() calls are pending, they are
         completed with an indicator that the partner has closed the
         connection (zero-length data presented to the application).  If
         there is any pending data to be written and
         PeerConnectionClosed is on, then an SMC-R connection reset must
         be performed.  The connection then enters the AppCloseWait1
         state on the passive close side waiting for the local
         application to initiate its own close processing.

      B. If the local application issues a shutdown() for writing, then
         the SMC-R layer will send a CDC message to notify the partner
         of the PeerDoneWriting indicator and then transition the local
         side of the SMC-R connection to the AppCloseWait2 state.
      C. When the application issues a close() or shutdown() for both,
         the local SMC-R peer will send a message informing the peer of
         the PeerConnectionClosed indicator and transition to the Closed
         state if the remote peer has also sent the local peer the
         PeerConnectionClosed indicator.  If the peer has not sent the
         PeerConnectionClosed indicator, we transition into the
         PeerFinCloseWait state.

      D. The local SMC-R connection stays in this state until the peer
         sends the PeerConnectionClosed indicator in a CDC message.
         When the indicator is received, we transition to the Closed state
         and are then free to reuse this RMBE.
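
   The transitions listed above can be sketched as a small state
   machine.  This is an illustration only, not a definitive
   implementation: the class, method, and attribute names are
   hypothetical; the unread-data check, the pending close condition, and
   timers are not modeled; and it is assumed that a close() issued while
   in PeerCloseWait1 or PeerCloseWait2 (after an earlier shutdown for
   write) sends PeerConnectionClosed without changing state.

```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    ACTIVE = auto()
    PEER_CLOSE_WAIT1 = auto()
    PEER_CLOSE_WAIT2 = auto()
    APP_FIN_CLOSE_WAIT = auto()
    APP_CLOSE_WAIT1 = auto()
    APP_CLOSE_WAIT2 = auto()
    PEER_FIN_CLOSE_WAIT = auto()


ACTIVE_CLOSE = (State.PEER_CLOSE_WAIT1, State.PEER_CLOSE_WAIT2)
PASSIVE_CLOSE = (State.APP_CLOSE_WAIT1, State.APP_CLOSE_WAIT2)


class Connection:
    def __init__(self):
        self.state = State.ACTIVE     # state after transition 2
        self.local_closed = False     # close()/shutdown(both) issued
        self.peer_closed = False      # PeerConnectionClosed received
        self.sent = []                # indicators sent in CDC messages

    def app_shutdown_write(self):
        if self.state is State.ACTIVE:                  # 3A
            self.sent.append("PeerDoneWriting")
            self.state = State.PEER_CLOSE_WAIT1
        elif self.state is State.APP_CLOSE_WAIT1:       # 4B
            self.sent.append("PeerDoneWriting")
            self.state = State.APP_CLOSE_WAIT2

    def app_close(self):
        """close() or shutdown() for both read and write."""
        if self.local_closed or self.state is State.CLOSED:
            return
        self.local_closed = True
        self.sent.append("PeerConnectionClosed")
        if self.state is State.ACTIVE:                  # 3A
            self.state = State.PEER_CLOSE_WAIT1
        elif self.state is State.APP_FIN_CLOSE_WAIT:    # 3D
            self.state = State.CLOSED
        elif self.state in PASSIVE_CLOSE:               # 4C
            self.state = (State.CLOSED if self.peer_closed
                          else State.PEER_FIN_CLOSE_WAIT)
        # In PeerCloseWait1/2 the state is kept (assumption).

    def recv_peer_done_writing(self):
        if self.state is State.ACTIVE:                  # 4A
            self.state = State.APP_CLOSE_WAIT1
        elif self.state is State.PEER_CLOSE_WAIT1:      # 3B
            self.state = State.PEER_CLOSE_WAIT2

    def recv_peer_connection_closed(self):
        self.peer_closed = True
        if self.state is State.ACTIVE:                  # 4A
            self.state = State.APP_CLOSE_WAIT1
        elif self.state in ACTIVE_CLOSE:                # 3C
            self.state = (State.CLOSED if self.local_closed
                          else State.APP_FIN_CLOSE_WAIT)
        elif self.state is State.PEER_FIN_CLOSE_WAIT:   # 4D
            self.state = State.CLOSED


# Active close with close(): 3A, then 3C.
a = Connection()
a.app_close()
a.recv_peer_connection_closed()
print(a.state.name)  # CLOSED

# Passive close: 4A, 4B, 4C, then 4D.
b = Connection()
b.recv_peer_done_writing()
b.app_shutdown_write()
b.app_close()
print(b.state.name)  # PEER_FIN_CLOSE_WAIT
b.recv_peer_connection_closed()
print(b.state.name)  # CLOSED
```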

   Note that each SMC-R peer needs to provide some logic that will
   prevent being stranded in a termination state indefinitely.  For
   example, if an Active Close SMC-R peer is in a PeerCloseWait (1 or 2)
   state waiting for the remote SMC-R peer to update its connection
   termination status, it needs to provide a timer that will prevent it
   from waiting in that state indefinitely should the remote SMC-R peer
   not respond to this termination request.  This could occur in error
   scenarios -- for example, if the remote SMC-R peer suffered a failure
   prior to being able to respond to the termination request or the
   remote application is not responding to this connection termination
   request by closing its own socket.  This latter scenario is similar
   to the TCP FINWAIT2 state, which has been known to sometimes cause
   issues when remote TCP/IP hosts lose track of established connections
   and neglect to close them.  Even though the TCP standards do not
   mandate a timeout from the TCP FINWAIT2 state, most TCP/IP
   implementations assign a timeout for this state.  A similar timeout
   will be required for SMC-R connections.  When this timeout occurs,
   the local SMC-R peer performs TCP reset processing for this
   connection.  However, no additional RDMA writes to the partner RMBE
   can occur at this point (we have already indicated that we are done
   updating the peer's RMBE).  After the TCP connection is reset, the
   RMBE can be returned to the free pool for reallocation.  See
   Section 4.4.2 for more details.

   Also note that it is possible to have two SMC-R endpoints initiate an
   Active close concurrently.  In that scenario, the flows above still
   apply; however, both endpoints follow the active close path (path 3).

4.8.2. Abnormal SMC-R Connection Termination Flows

   Abnormal SMC-R connection termination can occur for a variety of
   reasons, including the following:

   o  The TCP connection associated with an SMC-R connection is reset.
      In TCP, either endpoint can send a RST segment to abort an
      existing TCP connection when error conditions are detected for the
      connection or the application overtly requests that the connection
      be reset.

   o  Normal SMC-R connection termination processing has unexpectedly
      stalled for a given connection.  When the stall is detected
      (connection termination timeout condition), an abnormal SMC-R
      connection termination flow is initiated.

   In these scenarios, it is very important that resources associated
   with the affected SMC-R connections are properly cleaned up to ensure
   that there are no orphaned resources and that resources can reliably
   be reused for new SMC-R connections.  Given that SMC-R relies heavily
   on the RDMA write processing, special care needs to be taken to
   ensure that an RMBE is no longer being used by an SMC-R peer before
   logically reassigning that RMBE to a new SMC-R connection.

   When an SMC-R peer initiates a TCP connection reset, it also
   initiates an SMC-R abnormal connection flow at the same time.  The
   SMC-R peers explicitly signal their intent to abnormally terminate an
   SMC-R connection and await explicit acknowledgment that the peer has
   received this notification and has also completed abnormal connection
   termination on its end.  Note that TCP connection reset processing
   can occur in parallel to these flows.

                            +-----------------+
            |-------------->|     CLOSED      |<-------------|
            |               |                 |              |
            |               +-----------------+              |
            |                                                |
            |                                                |
            |                                                |
            |           +-----------------------+            |
            |           |     Any state         |            |
            |1B         | (before setting       |          2B|
            |           |  PeerConnectionClosed |            |
            |           |  indicator in         |            |
            |           |  peer's RMBE)         |            |
            |           +-----------------------+            |
            |         1A        |         |      2A          |
            |     Active Abort  |         |  Passive Abort   |
            |                   V         V                  |
            |       +--------------+   +--------------+      |
            |-------|PeerAbortWait |   | Process Abort|------|
                    |              |   |              |
                    +--------------+   +--------------+

      Figure 23: SMC-R Abnormal Connection Termination State Diagram

   Figure 23 above shows the SMC-R abnormal connection termination state
   diagram:

   1. Active abort designates the SMC-R peer that is initiating the TCP
      RST processing.  At the time that the TCP RST is sent, the active
      abort side must also do the following:

      A. Send the PeerConnAbort indicator to the partner in a CDC
         message, and then transition to the PeerAbortWait state.
         During this state, it will monitor this SMC-R connection
         waiting for the peer to send its corresponding PeerConnAbort
         indicator but will ignore any other activity in this connection
         (i.e., new incoming data).  It will also generate an
         appropriate error to any socket API calls issued against this
         socket (e.g., ECONNABORTED, ECONNRESET).

      B. Once the peer sends the PeerConnAbort indicator to the local
         host, the local host can transition this SMC-R connection to
         the Closed state and reuse this RMBE.  Note that the SMC-R peer
         that goes into the active abort state must provide some
         protection against staying in that state indefinitely should
         the remote SMC-R peer not respond by sending its own
         PeerConnAbort indicator to the local host.  While this should
         be a rare scenario, it could occur if the remote SMC-R peer
         (passive abort) suffered a failure right after the local SMC-R
         peer (active abort) sent the PeerConnAbort indicator.  To
         protect against these types of failures, a timer can be set
         after entering the PeerAbortWait state, and if that timer pops
         before the peer has sent its local PeerConnAbort indicator (to
         the active abort side), this RMBE can be returned to the free
         pool for possible reallocation.  See Section 4.4.2 for more
         details.

   2. Passive abort designates the SMC-R peer that is the recipient of
      an SMC-R abort from the peer, signaled by the PeerConnAbort
      indicator sent by the peer in a CDC message.  Upon receiving this
      request, the local peer must do the following:

      A. Using the appropriate error codes, indicate to the socket
         application that this connection has been aborted, and then
         purge all in-flight data for this connection that is waiting to
         be read or waiting to be sent.

      B. Send a CDC message to notify the peer of the PeerConnAbort
         indicator and, once that is completed, transition this RMBE to
         the Closed state.

   If an SMC-R peer receives a TCP RST for a given SMC-R connection, it
   also initiates SMC-R abnormal connection termination processing if it
   has not already been notified (via the PeerConnAbort indicator) that
   the partner is severing the connection.  It is possible to have two
   SMC-R endpoints concurrently be in an active abort role for a given
   connection.  In that scenario, the flows above still apply but both
   endpoints take the active abort path (path 1).
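
   The abort handshake and its protective timer can be sketched as
   follows.  This is a minimal illustration under stated assumptions:
   all names are hypothetical, the 30-second default timeout is an
   arbitrary placeholder (no value is specified here), the transient
   "Process Abort" state is folded into a single step, and error
   reporting to the socket API and purging of in-flight data are not
   modeled.

```python
import time
from enum import Enum, auto


class AbortState(Enum):
    IN_USE = auto()            # "Any state" in Figure 23
    PEER_ABORT_WAIT = auto()
    CLOSED = auto()


class AbortTracker:
    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.state = AbortState.IN_USE
        self.timeout_s = timeout_s   # hypothetical value
        self.clock = clock           # injectable for testing
        self.deadline = None
        self.sent = []

    def local_abort(self):
        """1A: performed when the TCP RST is sent."""
        if self.state is AbortState.IN_USE:
            self.sent.append("PeerConnAbort")
            self.state = AbortState.PEER_ABORT_WAIT
            self.deadline = self.clock() + self.timeout_s

    def recv_peer_conn_abort(self):
        if self.state is AbortState.PEER_ABORT_WAIT:
            self.state = AbortState.CLOSED           # 1B
        elif self.state is AbortState.IN_USE:
            # 2A (notify application, purge data) is not modeled.
            self.sent.append("PeerConnAbort")        # 2B
            self.state = AbortState.CLOSED

    def poll_timer(self):
        """Leave PeerAbortWait if the peer never answers."""
        if (self.state is AbortState.PEER_ABORT_WAIT
                and self.clock() >= self.deadline):
            self.state = AbortState.CLOSED


# Both endpoints abort concurrently: each takes path 1.
a, b = AbortTracker(), AbortTracker()
a.local_abort()
b.local_abort()
a.recv_peer_conn_abort()
b.recv_peer_conn_abort()
print(a.state.name, b.state.name)  # CLOSED CLOSED
```

   Passing the clock as a parameter is a design choice made here so that
   the timeout path can be exercised without waiting in real time.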

4.8.3. Other SMC-R Connection Termination Conditions

   The following are additional conditions that have implications for
   SMC-R connection termination:

   o  An SMC-R peer being gracefully shut down.  If an SMC-R peer
      supports a graceful shutdown operation, it should attempt to
      terminate all SMC-R connections as part of shutdown processing.
      This could be accomplished via LLC DELETE LINK requests on all
      active SMC-R links.

   o  Abnormal termination of an SMC-R peer.  In this case, there may be
      no opportunity for the host to perform any SMC-R cleanup
      processing, so it is up to the remote peer to detect a RoCE
      communications failure with the failing host.  This could trigger
      SMC-R link switchover, but that would also generate RoCE errors,
      causing the remote host to eventually terminate all existing SMC-R
      connections to this peer.

   o  Loss of RoCE connectivity between two SMC-R peers.  If two peers
      are no longer reachable across any links in their SMC-R link
      group, then both peers perform a TCP reset for the connections,
      generate an error to the local applications, and free up all QP
      resources associated with the link group.
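
The loss-of-connectivity case can be sketched as a check over the link group. This is a rough illustration under stated assumptions: the `Link`, `Conn`, and `LinkGroup` classes, the `handle_connectivity_loss` helper, and the "ECONNRESET" error string are hypothetical names chosen for the example and are not defined by this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Link:
    """Hypothetical SMC-R link with its queue pair (QP) resources."""
    name: str
    reachable: bool = True
    qp_allocated: bool = True


@dataclass
class Conn:
    """Hypothetical TCP connection running over the link group."""
    conn_id: int
    tcp_reset_sent: bool = False
    app_error: str = ""


@dataclass
class LinkGroup:
    links: List[Link]
    conns: List[Conn]


def handle_connectivity_loss(group: LinkGroup, error: str = "ECONNRESET") -> bool:
    """Tear the link group down only if the peer is unreachable on every link.

    Returns True when the group was torn down, False when at least one
    link still works (in which case link switchover would apply instead).
    """
    if any(link.reachable for link in group.links):
        return False
    for conn in group.conns:
        conn.tcp_reset_sent = True  # perform a TCP reset for the connection
        conn.app_error = error      # generate an error to the local application
    for link in group.links:
        link.qp_allocated = False   # free the QP resources of the link group
    return True
```

Each peer would run this logic independently, since both sides observe the loss of reachability on their own.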

5. Security Considerations

5.1. VLAN Considerations

The concepts and access control of virtual LANs (VLANs) must be extended to also cover the RoCE network traffic flowing across the Ethernet. The RoCE VLAN configuration and access permissions must mirror the IP VLAN configuration and access permissions over the Converged Enhanced Ethernet fabric. This means that hosts, routers, and switches that have access to specific VLANs on the IP fabric must also have the same VLAN access across the RoCE fabric. In other words, the SMC-R connectivity will follow the same virtual network access permissions as normal TCP/IP traffic.

5.2. Firewall Considerations

As mentioned above, the RoCE fabric inherits the same VLAN topology/access as the IP fabric. RoCE is a Layer 2 protocol that requires both endpoints to reside in the same Layer 2 network (i.e., VLAN). RoCE traffic cannot traverse multiple VLANs, as there is no support for routing RoCE traffic beyond a single VLAN. As a result, SMC-R communications will also be confined to peers that are members of the same VLAN. IP-based firewalls are typically inserted between VLANs (or physical LANs) and rely on normal IP routing to insert themselves in the data path. Since RoCE (and by extension SMC-R) is not routable beyond the local VLAN, there is no ability to insert a firewall in the network path of two SMC-R peers.

5.3. Host-Based IP Filters

Because SMC-R maintains the TCP three-way handshake for connection setup before switching to RoCE out of band, existing IP filters that control connection setup flows remain effective in an SMC-R environment. IP filters that operate on traffic flowing in an active TCP connection are not supported, because the connection data does not flow over IP.

5.4. Intrusion Detection Services

Similar to IP filters, intrusion detection services that operate on TCP connection setups are compatible with SMC-R with no changes required. However, once the TCP connection has switched to RoCE out of band, packets are not available for examination.

5.5. IP Security (IPsec)

IP security is not compatible with SMC-R, because there are no IP packets on which to operate. TCP connections that require IP security must opt out of SMC-R.

5.6. TLS/SSL

Transport Layer Security/Secure Socket Layer (TLS/SSL) is preserved in an SMC-R environment. The TLS/SSL layer resides above the SMC-R layer, and outgoing connection data is encrypted before being passed down to the SMC-R layer for RDMA write. Similarly, incoming connection data goes through the SMC-R layer encrypted and is decrypted by the TLS/SSL layer as it is today. The TLS/SSL handshake messages flow over the TCP connection after the connection has switched to SMC-R, and so they are exchanged using RDMA writes by the SMC-R layer, transparently to the TLS/SSL layer.

6. IANA Considerations

The scarcity of TCP option codes available for assignment is understood, and this architecture uses experimental TCP options following the conventions of [RFC6994] ("Shared Use of Experimental TCP Options"). TCP ExID 0xE2D4C3D9 has been registered with IANA as a TCP Experiment Identifier. See Section 3.1. If this protocol achieves wide acceptance, a discrete option code may be requested by subsequent versions of this protocol.
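
As an illustration of the [RFC6994] convention, the following sketch builds and detects an experimental TCP option carrying the registered ExID. It assumes option Kind 254 (one of the two experimental kinds that RFC 6994 covers, the other being 253) and a 6-byte option consisting only of Kind, Length, and the 32-bit ExID. The function names are hypothetical; Section 3.1 remains the authoritative description of the option.

```python
import struct

# Assumptions for illustration: Kind 254 and an option that carries
# nothing beyond the 4-byte ExID.
TCP_OPT_KIND_EXPERIMENTAL = 254
SMCR_EXID = 0xE2D4C3D9
SMCR_OPT_LEN = 6  # Kind (1) + Length (1) + ExID (4)


def build_smcr_option() -> bytes:
    """Encode the experimental option in network byte order."""
    return struct.pack("!BBI", TCP_OPT_KIND_EXPERIMENTAL, SMCR_OPT_LEN, SMCR_EXID)


def has_smcr_option(options: bytes) -> bool:
    """Walk a raw TCP options field and look for the SMC-R ExID."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:          # End of Option List
            break
        if kind == 1:          # No-Operation (single byte, no length)
            i += 1
            continue
        if i + 1 >= len(options):
            break              # truncated option header
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break              # malformed or truncated option
        if kind in (253, 254) and length >= 6:
            (exid,) = struct.unpack_from("!I", options, i + 2)
            if exid == SMCR_EXID:
                return True
        i += length
    return False
```

Matching on the ExID rather than on the option kind alone is what lets several experiments share the same experimental option kinds without colliding.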

7. Normative References

   [RFC793]   Postel, J., "Transmission Control Protocol", STD 7, RFC 793, DOI 10.17487/RFC0793, September 1981, <http://www.rfc-editor.org/info/rfc793>.

   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options", RFC 6994, DOI 10.17487/RFC6994, August 2013, <http://www.rfc-editor.org/info/rfc6994>.

   [RoCE]     InfiniBand, "RDMA over Converged Ethernet specification", <https://cw.infinibandta.org/wg/Members/documentRevision/download/7149>.

