4. SMC-R Memory-Sharing Architecture
4.1. RMB Element Allocation Considerations
Each TCP connection using SMC-R must be allocated an RMBE by each SMC-R peer. This allocation is performed by each endpoint independently, allowing each endpoint to select an RMBE that best matches the characteristics of its TCP socket endpoint. The RMBE associated with a TCP socket endpoint must have a receive buffer at least as large as the TCP receive buffer size in effect for that connection. The receive buffer size can be set explicitly by the application using setsockopt() or implicitly via the system-configured default value. This allows the SMC-R peer to RDMA-write an entire receive buffer's worth of data on a given data flow. Given that each RMB must have fixed-length RMBEs, an SMC-R endpoint may need to maintain multiple RMBs of various sizes for SMC-R connections on a given SMC-R link, selecting for each connection the RMBE that most closely fits it.

4.2. RMB and RMBE Format
An RMB is a virtual memory buffer whose backing real memory is pinned. The RMB is subdivided into a whole number of equal-sized RMB Elements (RMBEs). Each RMBE begins with a 4-byte eye catcher for diagnostic and service purposes, followed by the receive data buffer. The contents of this diagnostic eye catcher are implementation dependent and should be used by the local SMC-R peer to check for overlay errors by verifying an intact eye catcher with every RMBE access. The RMBE is a wrapping receive buffer for receiving RDMA writes from the peer. Cursors, as described below, are exchanged between peers to manage and track RDMA writes and local data reads from the RMBE for a TCP connection.

4.3. RMBE Control Information
RMBE control information consists of consumer cursors, producer cursors, wrap counts, CDC message sequence numbers, control flags such as urgent data and "writer blocked" indicators, and TCP connection information such as termination flags. This information is exchanged between SMC-R peers using CDC messages, which are passed using RoCE SendMsg. A TCP/IP stack implementing SMC-R must receive and store this information in its internal data structures, as it is used to manage the RMBE and its data buffer.
The format and contents of the CDC message are described in detail in Appendix A.4 ("Connection Data Control (CDC) Message Format"). The following is a high-level description of what this control information contains.

o  Connection state flags such as sending done, connection closed, failover data validation, and abnormal close.

o  A sequence number that is managed by the sender. This sequence number starts at 1, is increased with each send, and wraps to 0. It counts CDC messages sent and is not related to the number of bytes sent. It is used for failover data validation.

o  Producer cursor: a wrapping offset into the receiver's RMBE data area. Set by the peer that is writing into the RMBE, it points to where the writing peer will write the next byte of data into an RMBE. This cursor is accompanied by a wrap sequence number to help the RMBE owner (the receiver) identify full window size wrapping writes. Note that this cursor must account for (i.e., skip over) the RMBE eye catcher at the beginning of the data area.

o  Consumer cursor: a wrapping offset into the receiver's RMBE data area. Set by the owner of the RMBE (the peer that is reading from it), this cursor points to the offset of the next byte of data to be consumed by the peer in its own RMBE. The sender cannot write beyond this cursor into the receiver's RMBE without causing data loss. Like the producer cursor, it is accompanied by a wrap count to help the writer identify full window size wrapping reads, and it must likewise account for (i.e., skip over) the RMBE eye catcher at the beginning of the data area.

o  Data flags such as urgent data, the writer blocked indicator, and cursor update requests.

4.4. Use of RMBEs
4.4.1. Initializing and Accessing RMBEs
The RMBE eye catcher is initialized by the RMB owner prior to assigning it to a specific TCP connection and communicating its RMBE index to the SMC-R partner. After an RMBE index is communicated to the SMC-R partner, the RMBE can only be referenced in "read-only mode" by the owner, and all updates to it are performed by the remote SMC-R partner via RDMA write operations.
Initialization of an RMBE must include the following:

o  Zeroing out the entire RMBE receive buffer, which helps minimize data integrity issues (e.g., data from a previous connection somehow being presented to the current connection).

o  Setting the beginning RMBE eye catcher. This eye catcher plays an important role in helping detect accidental overlays of the RMBE. The RMB owner should always validate these eye catchers before each new reference to the RMBE. If the eye catchers are found to be corrupted, the local host must reset the TCP connection associated with this RMBE and log the appropriate diagnostic information.

4.4.2. RMB Element Reuse and Conflict Resolution
RMB elements can be reused once their associated TCP and SMC-R connections are terminated. Under normal and abnormal SMC-R connection termination processing, both SMC-R peers must explicitly acknowledge that they are done using an RMBE before that element can be freed and reassigned to another SMC-R connection instance. For more details on SMC-R connection termination, refer to Section 4.8.

However, there are some error scenarios where this two-way explicit acknowledgment may not be completed. In these scenarios, an RMBE owner may choose to reassign the RMBE to a new SMC-R connection instance on the same SMC-R link group. When this occurs, the partner SMC-R peer must detect this condition during SMC-R Rendezvous processing when presented with an RMBE that it believes is already in use for a different SMC-R connection. In this case, the SMC-R peer must abort the existing SMC-R connection associated with this RMBE. The abort processing resets the TCP connection (if it is still active), but it must not attempt to perform any RDMA writes to this RMBE and must also ignore any data sitting in the local RMBE associated with the existing connection. It then proceeds to free up the local RMBE and notify the local application that the connection is being abnormally reset. The remote SMC-R peer then proceeds to normal processing for this new SMC-R connection.
4.5. SMC-R Protocol Considerations
The following sections describe considerations for the SMC-R protocol as compared to TCP.

4.5.1. SMC-R Protocol Optimized Window Size Updates
An SMC-R receiver host sends its consumer cursor information to the sender to convey the progress that the receiving application has made in consuming the sent data. The difference between the writer's producer cursor and the associated receiver's consumer cursor indicates the window size available for the sender to write into. This is somewhat similar to TCP window update processing and therefore has some similar considerations, such as silly window syndrome avoidance, whereby TCP minimizes the overhead of very small, unproductive window size updates associated with suboptimal socket applications consuming very small amounts of data on every receive() invocation. For SMC-R, the receiver only updates its consumer cursor via a unique CDC message under the following conditions:

o  The current window size (from a sender's perspective) is less than half of the receive buffer space, and the consumer cursor update will result in a minimum increase in the window size of 10% of the receive buffer space. Some examples:

   a.  Receive buffer size: 64K; current window size (from a sender's perspective): 50K. No need to update the consumer cursor: plenty of space is available for the sender.

   b.  Receive buffer size: 64K; current window size (from a sender's perspective): 30K; current window size (from a receiver's perspective): 31K. No need to update the consumer cursor: even though the sender's window size is < 1/2 of the 64K, the update would only increase it by 1K, which is < 1/10th of the 64K buffer size.

   c.  Receive buffer size: 64K; current window size (from a sender's perspective): 30K; current window size (from a receiver's perspective): 64K. The receiver updates the consumer cursor (the sender's window size is < 1/2 of the 64K, and the update would increase it by > 6.4K).
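The first condition above, together with its three examples, can be sketched as follows. This is an illustrative fragment, not part of the protocol; the function name and parameters are hypothetical, with the two window sizes taken from the sender's and receiver's perspectives as in the examples.

```python
# Sketch (not normative) of the receiver-side test for sending an
# unsolicited consumer cursor update via a CDC message.

def should_send_cursor_update(recv_buffer_size: int,
                              sender_window: int,
                              receiver_window: int) -> bool:
    """sender_window   -- window size as the sender currently computes it
    receiver_window -- window size the update would advertise"""
    # Only bother when the sender already sees less than half the buffer...
    constrained = sender_window < recv_buffer_size / 2
    # ...and the update would grow the window by at least 10% of the buffer.
    productive = (receiver_window - sender_window) >= recv_buffer_size / 10
    return constrained and productive


K = 1024
# Example a: 50K of a 64K buffer still open -> no update needed.
assert not should_send_cursor_update(64 * K, 50 * K, 50 * K)
# Example b: 30K open, update would only add 1K (< 6.4K) -> no update.
assert not should_send_cursor_update(64 * K, 30 * K, 31 * K)
# Example c: 30K open, update opens the full 64K (> 6.4K gain) -> update.
assert should_send_cursor_update(64 * K, 30 * K, 64 * K)
```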
o  The receiver must always include a consumer cursor update whenever it sends a CDC message to the partner for another flow (i.e., a send flow in the opposite direction). This allows the window size update to be delivered with no additional overhead. It is somewhat similar to TCP DelayAck processing and quite effective for request/response data patterns.

o  If a peer has set the B-bit (writer blocked) in a CDC message, then any consumption of data by the receiver causes a CDC message to be sent updating the consumer cursor, until a CDC message with that bit cleared is received from the peer.

o  The optimized window size updates are overridden when the sender sets the Consumer Cursor Update Requested flag in a CDC message to the receiver. When this indicator is on, the consumer must send a consumer cursor update immediately when data is consumed by the local application or if the cursor has not been updated for a while (i.e., the local copy of the consumer cursor does not match the last consumer cursor value sent to the partner). This allows the sender to perform optional diagnostics for detecting a stalled receiver application (data has been sent but not consumed). It is recommended that the Consumer Cursor Update Requested flag only be used for diagnostic procedures, as it may result in non-optimal data path performance.

4.5.2. Small Data Sends
In contrast to the TCP Nagle algorithm, the SMC-R protocol makes no special provisions for coalescing small data segments sent across a stream socket: data is always sent if sufficient window space is available. An implementation of SMC-R can be configured to optimize its sending processing by coalescing outbound data for a given SMC-R connection so that it can reduce the number of RDMA write operations it performs, in a fashion similar to Nagle's algorithm. However, any such coalescing would require a timer on the sending host to ensure that data is eventually sent, and the sending host would have to opt out of this processing if Nagle's algorithm had been disabled (programmatically or via system configuration).
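The optional coalescing choice described above might be sketched as follows. Everything here is hypothetical: the protocol mandates none of it, and the threshold, names, and decision shape are illustrative only. An armed timer (not shown) would still be required to flush any delayed data eventually.

```python
# Hypothetical sketch of a sender-side coalescing decision similar to
# Nagle's algorithm; nothing here is mandated by SMC-R.

def send_now(pending_bytes: int, coalesce_threshold: int,
             nagle_disabled: bool) -> bool:
    """Decide whether to issue the RDMA write immediately or keep
    accumulating data (a timer must eventually flush delayed data)."""
    if nagle_disabled:          # TCP_NODELAY-style opt-out: never delay
        return True
    # Otherwise delay only small writes, hoping to batch them into
    # fewer RDMA write operations.
    return pending_bytes >= coalesce_threshold

assert send_now(100, 2048, nagle_disabled=True)       # opt-out: send at once
assert not send_now(100, 2048, nagle_disabled=False)  # small: coalesce
assert send_now(4096, 2048, nagle_disabled=False)     # large: send at once
```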
4.5.3. TCP Keepalive Processing
TCP keepalive processing allows applications to direct the local TCP/IP host to periodically "test" the viability of an idle TCP connection. Since SMC-R connections have a TCP representation along with an SMC-R representation, there are unique keepalive processing considerations:

o  SMC-R-layer keepalive processing: If keepalive is enabled for an SMC-R connection, the local host maintains a keepalive timer that reflects how long an SMC-R connection has been idle. The local host also maintains a timestamp of last activity for each SMC-R link (for any SMC-R connection on that link). When it is determined that an SMC-R connection has been idle longer than the keepalive interval, the host checks whether the SMC-R link has been idle for a duration longer than the keepalive timeout. If both conditions are met, the local host performs a TEST LINK LLC command to test the viability of the SMC-R link over the RoCE fabric (RC-QPs). If a TEST LINK LLC command response is received within a reasonable amount of time, then the link is considered viable, and all connections using this link are considered viable as well. If, however, a response is not received in a reasonable amount of time, or there is a failure in sending the TEST LINK LLC command, then this is considered a failure of the SMC-R link, and failover processing to an alternate SMC-R link must be triggered. If no alternate SMC-R link exists in the SMC-R link group, then all of the SMC-R connections on this link are abnormally terminated by resetting the TCP connections they represent.
Given that multiple SMC-R connections can share the same SMC-R link, implementing an SMC-R link-level probe using the TEST LINK LLC command will help reduce the amount of unproductive keepalive traffic for SMC-R connections; as long as some SMC-R connections on a given SMC-R link are active (i.e., have had I/O activity within the keepalive interval), then there is no need to perform additional link viability testing.
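The two-level idle check described above can be sketched as follows. This is a minimal illustration, not a normative algorithm; the function name and the use of a single interval parameter for both the connection and link checks are assumptions.

```python
# Sketch (not normative) of the SMC-R-layer keepalive decision: probe the
# link with TEST LINK only when both the SMC-R connection and its
# underlying SMC-R link have been idle past the keepalive interval.

def needs_test_link(now: float, conn_last_activity: float,
                    link_last_activity: float,
                    keepalive_interval: float) -> bool:
    conn_idle = (now - conn_last_activity) > keepalive_interval
    link_idle = (now - link_last_activity) > keepalive_interval
    # Any recent traffic on the shared link proves viability for every
    # connection multiplexed over it, so no probe is needed then.
    return conn_idle and link_idle

now = 1000.0
# Connection idle 100s, but the link saw traffic 50s ago: no probe.
assert not needs_test_link(now, conn_last_activity=900.0,
                           link_last_activity=950.0, keepalive_interval=60.0)
# Both connection and link idle longer than 60s: issue TEST LINK.
assert needs_test_link(now, conn_last_activity=900.0,
                       link_last_activity=920.0, keepalive_interval=60.0)
```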
o  TCP-layer keepalive processing: Traditional TCP keepalive packets are less relevant for SMC-R connections, given that the TCP path is not used for these connections once SMC-R Rendezvous processing is completed. All SMC-R connections by default have associated TCP connections that are idle. Are TCP keepalive probes still needed for these connections? There are two main scenarios to consider:

   1.  TCP keepalives that are used to determine whether or not the peer TCP endpoint is still active. These are not needed for SMC-R connections, as the SMC-R-level keepalives mentioned above will determine whether or not the remote endpoint connections are still active.

   2.  TCP keepalives that are used to ensure that TCP connections traversing an intermediate proxy maintain an active state. For example, stateful firewalls typically maintain state representing every valid TCP connection that traverses the firewall. These types of firewalls are known to expire idle connections by removing their state in the firewall to conserve memory. TCP keepalives are often used in this scenario to prevent firewalls from timing out otherwise idle connections. When using SMC-R, both endpoints must reside in the same Layer 2 network (i.e., the same subnet). As a result, firewalls cannot be inserted in the path between two SMC-R endpoints. However, other intermediate proxies, such as TCP/IP-layer load balancers, may be inserted in the path between two SMC-R endpoints. These types of load balancers also maintain connection state so that they can forward TCP connection traffic to the appropriate cluster endpoint. When using SMC-R, these TCP connections will appear to be completely idle, making them susceptible to potential timeouts at the load-balancing proxy. As a result, for this scenario, TCP keepalives may still be relevant.
The following are the TCP-level keepalive processing requirements for SMC-R-enabled hosts:

o  SMC-R peers should allow TCP keepalives to flow on the TCP path of SMC-R connections based on existing TCP keepalive configuration and programming options. However, it is strongly recommended that platforms provide the ability to specify very granular keepalive timers (for example, single-digit-second timers) and consider providing a configuration option that limits the minimum keepalive timer that will be used for TCP-layer keepalives on SMC-R connections. This is important to minimize the number of TCP keepalive packets transmitted in the network for SMC-R connections.
o  SMC-R peers must always respond to inbound TCP-layer keepalives (by sending ACKs for these packets) even if the connection is using SMC-R. Typically, once a TCP connection has completed SMC-R Rendezvous processing and is using SMC-R for data flows, no new inbound TCP segments are expected on that TCP connection, other than TCP termination segments (FIN, RST, etc.). TCP keepalives are the one exception that must be supported. Also, since TCP keepalive probes do not carry any application-layer data, this has no adverse impact on the application's inbound data stream.

4.6. TCP Connection Failover between SMC-R Links
A peer may change which SMC-R link within a link group it sends its writes over in the event of a link failure. Because each peer chooses, per TCP connection, the link over which it sends writes, this switch is performed by each peer independently.

4.6.1. Validating Data Integrity
Even though RoCE is a reliable transport, there is a small subset of failure modes that could cause unrecoverable loss of data. When an RNIC acknowledges receipt of an RDMA write from its peer, that acknowledgment generates a write completion event at the sending peer, which allows the sender to release any buffers it is holding for that write. In normal operation and in most failures, this operation is reliable. However, failure modes are possible in which a receiving RNIC has acknowledged an RDMA write but was then unable to place the received data into its host memory -- for example, a sudden, disorderly failure of the interface between the RNIC and the host. While rare, these types of events must be guarded against to ensure data integrity. The process for switching SMC-R links during failover, as described in this section, guards against this possibility and is mandatory.

Each peer must track the current state of the CDC sequence numbers for a TCP connection. The sender must keep track of the sequence number of the CDC message that described the last write acknowledged by the peer RNIC, or Sequence Sent (SS). In other words, SS describes the last write that the sender believes its peer has successfully received. The receiver must keep track of the sequence number of the CDC message that described the last write that it has successfully received (i.e., the data has been successfully placed into an RMBE), or Sequence Received (SR).
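The SS/SR comparison used for failover data validation can be sketched as follows. This is an illustrative fragment: the 16-bit sequence width and the wrap-aware (serial-number) comparison are assumptions about one plausible implementation; the specification itself only requires that SR be equal to or greater than the sequence number in the validation message.

```python
# Sketch of failover data validation, assuming a 16-bit CDC sequence
# number compared with wrap-aware serial-number arithmetic.

SEQ_MOD = 1 << 16

def seq_geq(a: int, b: int) -> bool:
    """True when sequence number a is at or ahead of b, modulo wrap."""
    return ((a - b) % SEQ_MOD) < (SEQ_MOD // 2)

def validate_failover(sr: int, ss_in_message: int) -> bool:
    """Receiver-side check: True -> resume on the new SMC-R link;
    False -> data was lost, so reset the TCP connection."""
    return seq_geq(sr, ss_in_message)

assert validate_failover(sr=105, ss_in_message=100)      # all writes arrived
assert not validate_failover(sr=99, ss_in_message=100)   # one write lost
assert validate_failover(sr=2, ss_in_message=65530)      # across the wrap
```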
When an RNIC fails and the sender changes SMC-R links, the sender must first send a CDC message with the F-bit (failover validation indicator; see Appendix A.4) set over the new SMC-R link. This is the failover data validation message. The sequence number in this CDC message is equal to SS. The CDC message key, the length, and the SMC-R alert token are the only other fields in this CDC message that are significant. No reply is expected from this validation message, and once the sender has sent it, the sender may resume sending on the new SMC-R link as described in Section 4.6.2. Upon receipt of the failover validation message, the receiver must verify that its SR value for the TCP connection is equal to or greater than the sequence number in the failover validation message. If so, no further action is required, and the TCP connection resumes on the new SMC-R link. If SR is less than the sequence number value in the validation message, data has been lost, and the receiver must immediately reset the TCP connection.

4.6.2. Resuming the TCP Connection on a New SMC-R Link
When a connection is moved to a new SMC-R link and the failover validation message has been sent, the sender can immediately resume normal transmission. In order to preserve the application message stream, the sender must replay any RDMA writes (and their associated CDC messages) that were in progress or failed when the previous SMC-R link failed, before sending new data on the new SMC-R link. The sender has two options for accomplishing this:

o  Preserve the sequence numbers "as is": Retry all failed and pending operations as they were originally done, including reposting all associated RDMA write operations and their associated CDC messages without making any changes. Then resume sending new data using new sequence numbers.

o  Combine pending messages and possibly add new data: Combine failed and pending messages into a single new write with a new sequence number. This allows the sender to combine pending messages into fewer operations. As a further optimization, this write can also include new data, as long as all failed and pending data are also included. If this approach is taken, the sequence number must be increased beyond the last failed or pending sequence number.
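The second option above can be sketched as follows. All names here are hypothetical, and sequence-number wrap handling is omitted for brevity; the point illustrated is only that the combined write must carry all failed and pending data and a sequence number beyond every replayed one.

```python
# Illustrative sketch of collapsing failed and pending writes (plus
# optional new data) into one replacement write on the new SMC-R link.

def combine_pending(pending, new_data=b""):
    """pending: list of (seq, bytes) writes that failed or were in
    flight, in order. Returns (new_seq, payload) for the single
    replacement write."""
    if not pending:
        raise ValueError("nothing to replay")
    # New data may ride along only if every replayed byte is included.
    payload = b"".join(data for _, data in pending) + new_data
    # The combined message must advance past the last replayed sequence
    # number so the receiver never sees a stale or duplicate sequence.
    new_seq = max(seq for seq, _ in pending) + 1
    return new_seq, payload

seq, payload = combine_pending([(7, b"abc"), (8, b"def")], new_data=b"xy")
assert seq == 9 and payload == b"abcdefxy"
```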
4.7. RMB Data Flows
The following sections describe the RDMA wire flows for the SMC-R protocol after a TCP connection has switched into SMC-R mode (i.e., SMC-R Rendezvous processing is complete, and a pair of RMB elements has been assigned and communicated by the SMC-R peers). The ladder diagrams below include the following:

o  RMBE control information kept by each peer. Only a subset of the information is depicted, specifically only the fields that reflect the stream of data written by Host A and read by Host B.

o  Time line 0-x, which shows the wire flows in a time-relative fashion.

o  Note that RMBE control information is only shown in a time interval if its value changed (otherwise, assume that the value is unchanged from the previously depicted value).

o  The local copies of the producer and consumer cursors maintained by each host are not depicted in these figures.

Note that the cursor values in the diagrams reflect the necessity of skipping over the eye catcher in the RMBE data area: they start and wrap at 4, not 0.

4.7.1. Scenario 1: Send Flow, Window Size Unconstrained
  SMC Host A                                            SMC Host B
 RMBE A Info                                           RMBE B Info
(Consumer Cursors)                                  (Producer Cursors)

Cursor  Wrap Seq#  Time                     Time  Cursor  Wrap Seq#  Flags
  4         0        0                       0       4        0        0
                     1 --------------------> 1
                         RDMA-WR Data (4:1003)
  4         0        2 ....................> 2     1004       0        0
                              CDC Message

      Figure 16: Scenario 1: Send Flow, Window Size Unconstrained

Scenario assumptions:

o  Kernel implementation.

o  New SMC-R connection; no data has been sent on the connection.
o  Host A: Application issues a send for 1000 bytes to Host B.

o  Host B: RMBE receive buffer size is 10,000; the application has issued a recv for 10,000 bytes.

Flow description:

1.  The application issues a send() for 1000 bytes; the SMC-R layer copies the data into a kernel send buffer. It then schedules an RDMA write operation to move the data into the peer's RMBE receive buffer, at relative position 4-1003 (to skip the 4-byte eye catcher in the RMBE data area). Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation.

2.  Host A sends a CDC message to update the producer cursor to byte 1004. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application. Host B, once notified of the completion of the previous RDMA operation, locates the RMBE associated with the RMBE alert token that was included in the message and proceeds to perform normal receive-side processing: waking up the suspended application read thread, copying the data into the application's receive buffer, etc. It uses the producer cursor as an indicator of how much data is available to be delivered to the local application. After this processing is complete, the SMC-R layer also updates its local consumer cursor to match the producer cursor (i.e., indicating that all data has been consumed). Note that a message to the peer updating the consumer cursor is not needed at this time, as the window size is unconstrained (> 1/2 of the receive buffer size). The window size is calculated by taking the difference between the producer cursor and the consumer cursor in the RMBEs (10,000 - 1004 = 8996).
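The window arithmetic in the flow above can be sketched as follows. The helper name is hypothetical; the sketch assumes the reading under which the text arrives at 8996: the 4-byte eye catcher is excluded from the usable data area, the modulo handles buffer wrap, and the receiver's last-sent consumer cursor is still at its initial value of 4.

```python
# Sketch (not normative) of the sender's window-size computation for a
# wrapping RMBE whose first 4 bytes hold the diagnostic eye catcher.

EYE_CATCHER = 4  # bytes reserved at the start of the RMBE data area

def sender_window(buffer_size: int, producer: int,
                  last_sent_consumer: int) -> int:
    """Free space the sender believes it may still write: the usable
    data area minus the data in flight (not yet acknowledged as
    consumed via a consumer cursor update)."""
    usable = buffer_size - EYE_CATCHER
    in_flight = (producer - last_sent_consumer) % usable
    return usable - in_flight

# Scenario 1: 10,000-byte buffer, producer advanced to 1004, receiver's
# last-sent consumer cursor still at its initial value of 4:
assert sender_window(10_000, producer=1004, last_sent_consumer=4) == 8_996
```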
4.7.2. Scenario 2: Send/Receive Flow, Window Size Unconstrained
  SMC Host A                                            SMC Host B
 RMBE A Info                                           RMBE B Info
(Consumer Cursors)                                  (Producer Cursors)

Cursor  Wrap Seq#  Time                     Time  Cursor  Wrap Seq#  Flags
  4         0        0                       0       4        0        0
                     1 --------------------> 1
                         RDMA-WR Data (4:1003)
  4         0        2 ....................> 2     1004       0        0
                              CDC Message
                     3 <-------------------- 3
                         RDMA-WR Data (4:503)
 1004       0        4 <.................... 4
                              CDC Message

   Figure 17: Scenario 2: Send/Receive Flow, Window Size Unconstrained

Scenario assumptions:

o  New SMC-R connection; no data has been sent on the connection.

o  Host A: Application issues a send for 1000 bytes to Host B.

o  Host B: RMBE receive buffer size is 10,000; the application has already issued a recv for 10,000 bytes. Once the receive is completed, the application sends a 500-byte response to Host A.

Flow description:

1.  The application issues a send() for 1000 bytes; the SMC-R layer copies the data into a kernel send buffer. It then schedules an RDMA write operation to move the data into the peer's RMBE receive buffer, at relative position 4-1003. Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation.

2.  Host A sends a CDC message to update the producer cursor to byte 1004. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application.
3.  Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token and proceeds to perform normal receive-side processing: waking up the suspended application read thread, copying the data into the application's receive buffer, etc. After this processing is complete, the SMC-R layer also updates its local consumer cursor to match the producer cursor (i.e., indicating that all data has been consumed). Note that an update of the consumer cursor to the peer is not needed at this time, as the window size is unconstrained (> 1/2 of the receive buffer size). The application then performs a send() for 500 bytes to Host A. The SMC-R layer copies the data into a kernel buffer and then schedules an RDMA write into the partner's RMBE receive buffer. Note that this RDMA write operation includes no immediate data or notification to Host A.

4.  Host B sends a CDC message to update the partner's RMBE control information with the latest producer cursor (set to 503 and not shown in the diagram above) and to also inform the peer that the consumer cursor value is now 1004. It also updates the local current consumer cursor and the last sent consumer cursor to 1004. This CDC message includes notification, since updating the producer cursor requires attention by the peer host.

4.7.3. Scenario 3: Send Flow, Window Size Constrained
  SMC Host A                                            SMC Host B
 RMBE A Info                                           RMBE B Info
(Consumer Cursors)                                  (Producer Cursors)

Cursor  Wrap Seq#  Time                     Time  Cursor  Wrap Seq#  Flags
  4         0        0                       0       4        0        0
                     1 --------------------> 1
                         RDMA-WR Data (4:3003)
  4         0        2 ....................> 2     3004       0        0
                              CDC Message
                     3                       3
                     4 --------------------> 4
                        RDMA-WR Data (3004:7003)
  4         0        5 ....................> 5     7004       0        0
                              CDC Message
 7004       0        6 <.................... 6
                              CDC Message

      Figure 18: Scenario 3: Send Flow, Window Size Constrained
Scenario assumptions:

o  New SMC-R connection; no data has been sent on this connection.

o  Host A: Application issues a send for 3000 bytes to Host B and then another send for 4000 bytes.

o  Host B: RMBE receive buffer size is 10,000. The application has already issued a recv for 10,000 bytes.

Flow description:

1.  The application issues a send() for 3000 bytes; the SMC-R layer copies the data into a kernel send buffer. It then schedules an RDMA write operation to move the data into the peer's RMBE receive buffer, at relative position 4-3003. Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation.

2.  Host A sends a CDC message to update its producer cursor to byte 3004. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application.

3.  Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token and proceeds to perform normal receive-side processing: waking up the suspended application read thread, copying the data into the application's receive buffer, etc. After this processing is complete, the SMC-R layer also updates its local consumer cursor to match the producer cursor (i.e., indicating that all data has been consumed). It will not, however, update the partner with this information, as the window size is not constrained (10,000 - 3000 = 7000 bytes of available space). The application on Host B also issues a new recv() for 10,000 bytes.

4.  On Host A, the application issues a send() for 4000 bytes. The SMC-R layer copies the data into a kernel buffer and schedules an async RDMA write into the peer's RMBE receive buffer at relative position 3004-7003. Note that no alert is provided to Host B for this flow.

5.  Host A sends a CDC message to update the producer cursor to byte 7004. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application.
6.  Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token and proceeds to perform normal receive-side processing: waking up the suspended application read thread, copying the data into the application's receive buffer, etc. After this processing is complete, the SMC-R layer also updates its local consumer cursor to match the producer cursor (i.e., indicating that all data has been consumed). It then determines whether or not it needs to send a consumer cursor update to the peer. The available window size is now 3000 (10,000 - (producer cursor - last sent consumer cursor)), which is < 1/2 of the receive buffer size (10,000/2 = 5000), and the advance of the window size is > 10% of the receive buffer size (1000). Therefore, a CDC message is issued to update the consumer cursor to Peer A.

4.7.4. Scenario 4: Large Send, Flow Control, Full Window Size Writes
  SMC Host A                                            SMC Host B
 RMBE A Info                                           RMBE B Info
(Consumer Cursors)                                  (Producer Cursors)

Cursor  Wrap Seq#  Time                     Time  Cursor  Wrap Seq#  Flags
 1004       1        0                       0     1004       1        0
                     1 --------------------> 1
                       RDMA-WR Data (1004:9999)
                     2 --------------------> 2
                         RDMA-WR Data (4:1003)
 1004       1        3 ....................> 3     1004       2      Wrt Blk
                              CDC Message
 1004       2        4 <.................... 4     1004       2      Wrt Blk
                              CDC Message
                     5 --------------------> 5
                       RDMA-WR Data (1004:9999)
                     6 --------------------> 6
                         RDMA-WR Data (4:1003)
 1004       2        7 ....................> 7     1004       3      Wrt Blk
                              CDC Message
 1004       3        8 <.................... 8     1004       3      Wrt Blk
                              CDC Message

 Figure 19: Scenario 4: Large Send, Flow Control, Full Window Size Writes
Scenario assumptions: o Kernel implementation. o Existing SMC-R connection, Host B's receive window size is fully open (peer consumer cursor = peer producer cursor). o Host A: Application issues send for 20,000 bytes to Host B. o Host B: RMBE receive buffer size is 10,000; application has issued a recv for 10,000 bytes. Flow description: 1. The application issues a send() for 20,000 bytes; the SMC-R layer copies data into a kernel send buffer (assumes that send buffer space of 20,000 is available for this connection). It then schedules an RDMA write operation to move the data into the peer's RMBE receive buffer, at relative position 1004-9999. Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation. 2. Host A then schedules an RDMA write operation to fill the remaining 1000 bytes of available space in the peer's RMBE receive buffer, at relative position 4-1003. Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation. Also note that an implementation of SMC-R may optimize this processing by combining steps 1 and 2 into a single RDMA write operation (with two different data sources). 3. Host A sends a CDC message to update the producer cursor to byte 1004. Since the entire receive buffer space is filled, the producer writer blocked flag (the "Wrt Blk" indicator (flag) in Figure 19) is set and the producer cursor wrap sequence number (the producer "Wrap Seq#" in Figure 19) is incremented. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application. 4. Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token and proceeds to perform normal receive-side processing, waking up the suspended application read thread, copying the data into the application's receive buffer, etc. 
    In this scenario, Host B notices that the producer cursor has not
    been advanced (it has the same value as the consumer cursor);
    however, it notices that the producer cursor wrap sequence number
    is different from its local value (1), indicating that a full
    window of new data is available.  All of the data in the receive
    buffer can be processed, with the first segment (1004-9999)
    followed by the second segment (4-1003).  Because the producer
    writer blocked indicator was set, Host B schedules a CDC message
    to update its latest information to the peer: consumer cursor
    (1004) and consumer cursor wrap sequence number (the current
    value of 2 is used).

5.  Host A, upon receipt of the CDC message, locates the TCP
    connection associated with the alert token and, upon examining
    the control information provided, notices that Host B has
    consumed all of the data (based on the consumer cursor and the
    consumer cursor wrap sequence number) and initiates the next RDMA
    write to fill the receive buffer at offset 1004-9999.

6.  Host A then moves the next 1000 bytes into the beginning of the
    receive buffer (4-1003) by scheduling an RDMA write operation.
    Note that at this point there are still 8 bytes remaining to be
    written.

7.  Host A then sends a CDC message to set the producer writer
    blocked indicator and to increment the producer cursor wrap
    sequence number (3).

8.  Host B, upon notification, completes the same processing as
    step 4 above, including sending a CDC message to update the peer
    to indicate that all data has been consumed.  At this point,
    Host A can write the final 8 bytes to Host B's RMBE into
    positions 1004-1011 (not shown).
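The window accounting used in this scenario -- a wrapping receive buffer whose data area starts past the 4-byte eye catcher, tracked by producer/consumer cursors and wrap sequence numbers -- can be sketched as follows. This is an illustrative reader's model with hypothetical helper names, not SMC-R implementation code.

```python
# Illustrative model of RMBE cursor arithmetic for the wrapped-write
# scenario above (buffer size 10,000; data area begins at offset 4,
# past the eye catcher).  All names here are hypothetical.

EYE_CATCHER = 4                       # first usable data offset
BUF_SIZE = 10_000                     # RMBE receive buffer size
DATA_BYTES = BUF_SIZE - EYE_CATCHER   # usable bytes per wrap (9,996)

def free_space(prod_cursor, prod_wrap, cons_cursor, cons_wrap):
    """Bytes the producer may still write without overrunning the reader."""
    if prod_wrap == cons_wrap:
        # Same wrap count: free space is everything not yet written
        # ahead of the consumer.
        return DATA_BYTES - (prod_cursor - cons_cursor)
    # Producer is one wrap ahead: only the already-consumed prefix is free.
    return cons_cursor - prod_cursor

def split_write(prod_cursor, nbytes):
    """RDMA write segments (start, end inclusive) for nbytes, wrapping
    past the end of the buffer back to the eye-catcher boundary."""
    segments = []
    first = min(nbytes, BUF_SIZE - prod_cursor)
    if first:
        segments.append((prod_cursor, prod_cursor + first - 1))
    if nbytes - first:
        segments.append((EYE_CATCHER, EYE_CATCHER + (nbytes - first) - 1))
    return segments

# Steps 1-2 above: cursors level at 1004 with the window fully open;
# a 9,996-byte write fills 1004-9999 and then wraps into 4-1003.
print(split_write(1004, 9996))   # [(1004, 9999), (4, 1003)]
```

After that write, the producer cursor is again 1004 but with an incremented wrap sequence number, which is exactly how Host B distinguishes "nothing new" from "a full window of new data" in step 4.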
4.7.5. Scenario 5: Send Flow, Urgent Data, Window Size Unconstrained
      SMC Host A                                   SMC Host B
     RMBE A Info                                  RMBE B Info
  (Consumer Cursors)                           (Producer Cursors)

 Cursor  Wrap Seq#  Time                Time  Cursor  Wrap Seq#  Flag
  1000       1       0                    0    1000       1       0
  1000       1       1  --------------->  1    1000       1       0
                         RDMA-WR Data
                         (1000:1499)
  1000       1       2  ...............>  2    1500       1      UrgP
                         CDC Message                             UrgA
  1500       1       3  <...............  3    1500       1      UrgP
                         CDC Message                             UrgA
  1500       1       4  --------------->  4    1500       1      UrgP
                         RDMA-WR Data                            UrgA
                         (1500:2499)
  1500       1       5  ...............>  5    2500       1       0
                         CDC Message

   Figure 20: Scenario 5: Send Flow, Urgent Data, Window Size Open

Scenario assumptions:

o  Kernel implementation.

o  Existing SMC-R connection; window size open (unconstrained); all
   data has been consumed by the receiver.

o  Host A: Application issues send for 500 bytes with urgent data
   indicator (out of band) to Host B, then sends 1000 bytes of normal
   data.

o  Host B: RMBE receive buffer size is 10,000; application has issued
   a recv for 10,000 bytes and is also monitoring the socket for
   urgent data.

Flow description:

1.  The application issues a send() for 500 bytes of urgent data; the
    SMC-R layer copies the data into a kernel send buffer.  It then
    schedules an RDMA write operation to move the data into the
    peer's RMBE receive buffer, at relative position 1000-1499.  Note
    that no immediate data or alert (i.e., interrupt) is provided to
    Host B for this RDMA operation.
2.  Host A sends a CDC message to update its producer cursor to byte
    1500 and to turn on the producer Urgent Data Pending (UrgP) and
    Urgent Data Present (UrgA) flags.  This CDC message will deliver
    an interrupt to Host B.  At this point, the SMC-R layer can
    return control back to the application.

3.  Host B, once notified of the receipt of the previous CDC message,
    locates the RMBE associated with the RMBE alert token, notices
    that the Urgent Data Pending flag is on, and proceeds with
    out-of-band socket API notification -- for example, satisfying
    any outstanding select() or poll() requests on the socket by
    indicating that urgent data is pending (i.e., by setting the
    exception bit on).  The urgent data present indicator allows
    Host B to also determine the position of the urgent data (the
    producer cursor points 1 byte beyond the last byte of urgent
    data).  Host B can then perform normal receive-side processing
    (including specific urgent data processing), copying the data
    into the application's receive buffer, etc.  Host B then sends a
    CDC message to update the partner's RMBE control area with its
    latest consumer cursor (1500).  Note that this CDC message must
    occur, regardless of the current local window size that is
    available.  The partner host (Host A) cannot initiate any
    additional RDMA writes until it receives acknowledgment that the
    urgent data has been processed (or at least processed/remembered
    at the SMC-R layer).

4.  Upon receipt of the message, Host A wakes up, sees that the peer
    consumed all data up to and including the last byte of urgent
    data, and now resumes sending any pending data.  In this case,
    the application had previously issued a send for 1000 bytes of
    normal data, which would have been copied into the send buffer,
    and control would have been returned to the application.  Host A
    now initiates an RDMA write to move that data to the peer's
    receive buffer at position 1500-2499.

5.  Host A then sends a CDC message to update its producer cursor
    value (2500) and to turn off the Urgent Data Pending and Urgent
    Data Present flags.  Host B wakes up, processes the new data
    (resumes the application, copies data into the application
    receive buffer), and then proceeds to update the local current
    consumer cursor (2500).  Given that the window size is
    unconstrained, there is no need for a consumer cursor update in
    the peer's RMBE.
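The receive-side interpretation of the UrgP/UrgA flags described above can be sketched as follows. The function name is hypothetical; the rule it encodes -- that the producer cursor points 1 byte beyond the last byte of urgent data when UrgA is set -- comes from the flow description.

```python
# Illustrative receive-side handling of the urgent-data flags carried
# in a CDC message.  These are not protocol field names; the helper is
# a reader's sketch of the rule stated in the text above.

def urgent_byte_position(producer_cursor, urgp, urga):
    """Return the RMBE offset of the last urgent byte, or None.

    When UrgA (urgent data present) is set, the producer cursor points
    1 byte beyond the last byte of urgent data.  UrgP alone only
    signals that urgent data is pending somewhere in the stream
    (e.g., the writer is still blocked and has not written it yet).
    """
    if urga:
        return producer_cursor - 1
    return None  # pending (UrgP only) or no urgent data at all

# Step 2 of the flow: producer cursor 1500 with UrgP and UrgA set, so
# the last urgent byte sits at offset 1499 (the segment 1000-1499).
print(urgent_byte_position(1500, urgp=True, urga=True))  # 1499
```

This is why, in Scenario 6, Host B can be told about urgent data (UrgP) while the window is still closed, but can only locate it once UrgA arrives with an advanced producer cursor.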
4.7.6. Scenario 6: Send Flow, Urgent Data, Window Size Closed
      SMC Host A                                   SMC Host B
     RMBE A Info                                  RMBE B Info
  (Consumer Cursors)                           (Producer Cursors)

 Cursor  Wrap Seq#  Time                Time  Cursor  Wrap Seq#  Flag
  1000       1       0                    0    1000       2      Wrt
                                                                 Blk
  1000       1       1  ...............>  1    1000       2      Wrt
                         CDC Message                             Blk
                                                                 UrgP
  1000       2       2  <...............  2    1000       2      Wrt
                         CDC Message                             Blk
                                                                 UrgP
  1000       2       3  --------------->  3    1000       2      Wrt
                         RDMA-WR Data                            Blk
                         (1000:1499)                             UrgP
  1000       2       4  ...............>  4    1500       2      UrgP
                         CDC Message                             UrgA
  1500       2       5  <...............  5    1500       2      UrgP
                         CDC Message                             UrgA
  1500       2       6  --------------->  6    1500       2      UrgP
                         RDMA-WR Data                            UrgA
                         (1500:2499)
  1000       2       7  ...............>  7    2500       2       0
                         CDC Message

  Figure 21: Scenario 6: Send Flow, Urgent Data, Window Size Closed

Scenario assumptions:

o  Kernel implementation.

o  Existing SMC-R connection; window size closed; writer is blocked.

o  Host A: Application issues send for 500 bytes with urgent data
   indicator (out of band) to Host B, then sends 1000 bytes of normal
   data.

o  Host B: RMBE receive buffer size is 10,000; application has no
   outstanding recv() (for normal data) and is monitoring the socket
   for urgent data.
Flow description:

1.  The application issues a send() for 500 bytes of urgent data; the
    SMC-R layer copies the data into a kernel send buffer (if
    available).  Since the writer is blocked (window size closed), it
    cannot send the data immediately.  It then sends a CDC message to
    notify the peer of the Urgent Data Pending (UrgP) indicator (the
    writer blocked indicator remains on as well).  This serves as a
    signal to Host B that urgent data is pending in the stream.
    Control is also returned to the application at this point.

2.  Host B, once notified of the receipt of the previous CDC message,
    locates the RMBE associated with the RMBE alert token, notices
    that the Urgent Data Pending flag is on, and proceeds with
    out-of-band socket API notification -- for example, satisfying
    any outstanding select() or poll() requests on the socket by
    indicating that urgent data is pending (i.e., by setting the
    exception bit on).  At this point, it is expected that the
    application will enter urgent data mode processing, expeditiously
    processing all normal data (by issuing recv API calls) so that it
    can get to the urgent data byte.  Whether or not the application
    has this urgent mode processing, at some point the application
    will consume some or all of the pending data in the receive
    buffer.  When this occurs, Host B will also send a CDC message to
    update its consumer cursor and consumer cursor wrap sequence
    number to the peer.  In the example above, a full window's worth
    of data was consumed.

3.  Host A, once awakened by the message, will notice that the window
    size is now open on this connection (based on the consumer cursor
    and the consumer cursor wrap sequence number, which now matches
    the producer cursor wrap sequence number) and resume sending of
    the urgent data segment by scheduling an RDMA write into relative
    position 1000-1499.

4.  Host A then sends a CDC message to advance its producer cursor
    (1500) and to also notify Host B of the Urgent Data Present
    (UrgA) indicator (and turn off the writer blocked indicator).
    This signals to Host B that the urgent data is now in the local
    receive buffer and that the producer cursor points 1 byte beyond
    the last byte of urgent data.

5.  Host B wakes up, processes the urgent data, and, once the urgent
    data is consumed, sends a CDC message to update its consumer
    cursor (1500).
6.  Host A wakes up, sees that Host B has consumed the sequence
    number associated with the urgent data, and then initiates the
    next RDMA write operation to move the 1000 bytes associated with
    the next send() of normal data into the peer's receive buffer at
    position 1500-2499.  Note that the send API would likely have
    completed earlier in the process by copying the 1000 bytes into a
    send buffer and returning control to the application, even though
    no new data could be sent until the urgent data was processed and
    acknowledged by Host B.

7.  Host A sends a CDC message to advance its producer cursor to 2500
    and to reset the Urgent Data Pending and Urgent Data Present
    flags.  Host B wakes up and processes the inbound data.

4.8. Connection Termination
Just as SMC-R connections are established using a combination of TCP connection establishment flows and SMC-R protocol flows, the termination of SMC-R connections uses a similar combination of SMC-R protocol termination flows and normal TCP connection termination flows.  The following sections describe the SMC-R protocol's normal and abnormal connection termination flows.

4.8.1. Normal SMC-R Connection Termination Flows
Normal SMC-R connection termination flows are triggered via the normal stream socket API semantics, namely by the application issuing a close() or shutdown() API.  Most applications, after consuming all incoming data and sending any outbound data, issue a close() API to indicate that they are done both sending and receiving data.  Some applications, typically a small percentage, use the shutdown() API, which allows them to indicate that the application is done sending data, receiving data, or both.  The main use of this API is in scenarios where a TCP application wants to alert its partner endpoint that it is done sending data but is still receiving data on its socket (shutdown for write).  Issuing shutdown() for both sending and receiving data is really no different than issuing a close() and can therefore be treated in a similar fashion.  Shutdown for read is typically not a very useful operation and in normal circumstances does not trigger any network flows to notify the partner TCP endpoint of the operation.

These same trigger points are used by the SMC-R layer to initiate SMC-R connection termination flows.  The main design point for SMC-R normal connection termination flows is to use the SMC-R protocol to first shut down the SMC-R connection and free up any SMC-R RDMA resources, and then allow the normal TCP connection termination protocol (i.e., FIN processing) to drive cleanup of the TCP connection.  This design point is very important in ensuring that RDMA resources such as the RMBEs are freed and reused only when both SMC-R endpoints are completely done with their RDMA write operations to the partner's RMBE.

                       1    +-----------------+
      |-------------------->|     CLOSED      |<-------------------|
      | 3D                  |                 |                 4D |
      |                     +-----------------+                    |
      |                              |                             |
      |                              | 2                           |
      |                              V                             |
+----------------+          +-----------------+          +----------------+
|AppFinCloseWait |          |     ACTIVE      |          |PeerFinCloseWait|
|                |          |                 |          |                |
+----------------+          +-----------------+          +----------------+
      |                         |         |                        |
      |           Active Close  | 3A   4A |  Passive Close         |
      |                         V         V                        |
      |          +--------------+         +-------------+          |
      |--<-------|PeerCloseWait1|         |AppCloseWait1|----->----|
      |  3C      |              |         |             |      4C  |
      |          +--------------+         +-------------+          |
      |                 |                        |                 |
      |                 | 3B                     | 4B              |
      |                 V                        V                 |
      |          +--------------+         +-------------+          |
      |--<-------|PeerCloseWait2|         |AppCloseWait2|----->----|
                 |              |         |             |
                 +--------------+         +-------------+

                  Figure 22: SMC-R Connection States

Figure 22 describes the states that an SMC-R connection typically goes through.  Note that there are variations to these states that can occur when an SMC-R connection is abnormally terminated, similar in a way to when a TCP connection is reset.  The following are the high-level state transitions for an SMC-R connection:

1.  An SMC-R connection begins in the Closed state.  This state
    reflects an RMBE that is not currently in use (it was previously
    in use but no longer is, or was never allocated).
2.  An SMC-R connection progresses to the Active state once SMC-R
    rendezvous processing has successfully completed, RMB element
    indices have been exchanged, and SMC-R links have been activated.
    In this state, the TCP connection is fully established,
    rendezvous processing has completed, and the SMC-R peers can
    begin the exchange of data via RDMA.

3.  Active close processing (on the SMC-R peer that is initiating the
    connection termination).

    A.  When an application on one of the SMC-R connection peers
        issues a close(), a shutdown() for write, or a shutdown() for
        both read and write, the SMC-R layer on that host will
        initiate SMC-R connection termination processing.  First, if
        a close() or shutdown(both) is issued, it checks that there
        is no data in the local RMB element that has not been read by
        the application.  If unread data is detected, the SMC-R
        connection must be abnormally reset; for more details, refer
        to Section 4.8.2 ("Abnormal SMC-R Connection Termination
        Flows").  If no unread data is pending, it then checks
        whether any outstanding data is waiting to be written to the
        peer, or whether any outstanding RDMA writes for this SMC-R
        connection have not yet completed.  If either of these two
        scenarios is true, an indicator that this connection is in a
        pending close state is saved in internal data structures
        representing this SMC-R connection, and control is returned
        to the application.  If all data to be written to the partner
        has completed, this peer sends a CDC message to notify the
        peer of either the PeerConnectionClosed indicator (close or
        shutdown for both was issued) or the PeerDoneWriting
        indicator.  This provides an interrupt to inform the partner
        SMC-R peer that the connection is terminating.  At this
        point, the local side of the SMC-R connection transitions
        into the PeerCloseWait1 state, and control can be returned to
        the application.
        If this process could not be completed synchronously (the
        pending close condition mentioned above), it is completed
        when all RDMA writes for data and control cursors have been
        completed.

    B.  At some point, the SMC-R peer application (passive close)
        will consume all incoming data, realize that the partner is
        done sending data on this connection, and proceed to initiate
        its own close of the connection once it has completed sending
        all data from its end.  The partner application can initiate
        this connection termination processing via the close() or
        shutdown() APIs.  If the application does so by issuing a
        shutdown() for write, then the partner SMC-R layer will send
        a CDC message to notify the peer (the active close side) of
        the PeerDoneWriting indicator.  When the "active close" SMC-R
        peer wakes up as a result of the previous CDC message, it
        will notice that the PeerDoneWriting indicator is now on and
        transition to the PeerCloseWait2 state.  This state indicates
        that the peer is done sending data and may still be reading
        data.  At this point, the "active close" peer also needs to
        ensure that any outstanding recv() calls for this socket are
        woken up and to remember that no more data is forthcoming on
        this connection (in case the local connection was shutdown()
        for write only).

    C.  This flow is a common transition from 3A or 3B above.  When
        the SMC-R peer (passive close) consumes all data and updates
        all necessary cursors to the peer, and the application closes
        its socket (close or shutdown for both), it sends a CDC
        message to the peer (the active close side) with the
        PeerConnectionClosed indicator set.  At this point, the
        connection can transition back to the Closed state if the
        local application has already closed the socket (or issued
        shutdown for both).  Once in the Closed state, the RMBE can
        safely be reused for a new SMC-R connection.  When the
        PeerConnectionClosed indicator is turned on, the SMC-R peer
        is indicating that it is done updating the partner's RMBE.

    D.  Conditional state: If the local application has not yet
        issued a close() or shutdown(both), we need to wait until the
        application does so.  Once it does, the local host sends a
        CDC message to notify the peer of the PeerConnectionClosed
        indicator and then transitions to the Closed state.

4.  Passive close processing (on the SMC-R peer that receives an
    indication that the partner is closing the connection).

    A.  Upon receipt of a CDC message, the SMC-R layer will detect
        that the PeerConnectionClosed indicator or the
        PeerDoneWriting indicator is on.  If any outstanding recv()
        calls are pending, they are completed with an indicator that
        the partner has closed the connection (zero-length data
        presented to the application).  If there is any pending data
        to be written and PeerConnectionClosed is on, then an SMC-R
        connection reset must be performed.  The connection then
        enters the AppCloseWait1 state on the passive close side,
        waiting for the local application to initiate its own close
        processing.

    B.  If the local application issues a shutdown() for write, then
        the SMC-R layer sends a CDC message to notify the partner of
        the PeerDoneWriting indicator and then transitions the local
        side of the SMC-R connection to the AppCloseWait2 state.
    C.  When the application issues a close() or shutdown() for both,
        the local SMC-R peer sends a message informing the peer of
        the PeerConnectionClosed indicator and transitions to the
        Closed state if the remote peer has also sent the local peer
        the PeerConnectionClosed indicator.  If the peer has not sent
        the PeerConnectionClosed indicator, the connection
        transitions into the PeerFinCloseWait state.

    D.  The local SMC-R connection stays in this state until the peer
        sends the PeerConnectionClosed indicator in a CDC message.
        Once that indicator arrives, the connection transitions to
        the Closed state, and this RMBE is then free to be reused.

Note that each SMC-R peer needs to provide some logic that prevents it from being stranded in a termination state indefinitely.  For example, if an active close SMC-R peer is in a PeerCloseWait (1 or 2) state waiting for the remote SMC-R peer to update its connection termination status, it needs to provide a timer that prevents it from waiting in that state indefinitely should the remote SMC-R peer not respond to the termination request.  This could occur in error scenarios -- for example, if the remote SMC-R peer suffered a failure prior to being able to respond to the termination request, or if the remote application is not responding to the connection termination request by closing its own socket.  The latter scenario is similar to the TCP FINWAIT2 state, which has been known to sometimes cause issues when remote TCP/IP hosts lose track of established connections and neglect to close them.  Even though the TCP standards do not mandate a timeout from the TCP FINWAIT2 state, most TCP/IP implementations assign a timeout for this state.  A similar timeout is required for SMC-R connections.  When this timeout occurs, the local SMC-R peer performs TCP reset processing for this connection.  However, no additional RDMA writes to the partner's RMBE can occur at this point (the local peer has already indicated that it is done updating the peer's RMBE).  After the TCP connection is reset, the RMBE can be returned to the free pool for reallocation.  See Section 4.4.2 for more details.

Also note that it is possible for two SMC-R endpoints to initiate an active close concurrently.  In that scenario, the flows above still apply; however, both endpoints follow the active close path (path 3).
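The numbered transitions above can be summarized as a simple state table. This is a reader's simplification with hypothetical event names -- the conditional AppFinCloseWait/PeerFinCloseWait states of transitions 3D/4D are omitted -- and is not a normative part of the protocol.

```python
# Illustrative encoding of the normal SMC-R connection termination
# states from Figure 22.  Event names are hypothetical; the numbers in
# the comments refer to the transitions described in the text above.

TRANSITIONS = {
    # (state, event) -> next state
    ("Closed", "rendezvous_complete"): "Active",                 # 2
    # Active close side
    ("Active", "local_close_sent"): "PeerCloseWait1",            # 3A
    ("PeerCloseWait1", "peer_done_writing"): "PeerCloseWait2",   # 3B
    ("PeerCloseWait1", "peer_connection_closed"): "Closed",      # 3C
    ("PeerCloseWait2", "peer_connection_closed"): "Closed",      # 3C
    # Passive close side
    ("Active", "peer_close_received"): "AppCloseWait1",          # 4A
    ("AppCloseWait1", "local_shutdown_write"): "AppCloseWait2",  # 4B
    ("AppCloseWait1", "local_close_sent"): "Closed",             # 4C
    ("AppCloseWait2", "local_close_sent"): "Closed",             # 4C
}

def next_state(state, event):
    """Advance the connection state; unknown events leave it unchanged."""
    return TRANSITIONS.get((state, event), state)

# Active close path: local close() sent, then the peer's
# PeerConnectionClosed indicator arrives.
s = "Active"
for ev in ("local_close_sent", "peer_connection_closed"):
    s = next_state(s, ev)
print(s)  # Closed
```

The table makes the key design point visible: an RMBE returns to Closed (and thus becomes reusable) only after both sides have signaled PeerConnectionClosed, never on a one-sided close.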
4.8.2. Abnormal SMC-R Connection Termination Flows
Abnormal SMC-R connection termination can occur for a variety of reasons, including the following:

o  The TCP connection associated with an SMC-R connection is reset.
   In TCP, either endpoint can send a RST segment to abort an
   existing TCP connection when error conditions are detected for the
   connection or when the application overtly requests that the
   connection be reset.

o  Normal SMC-R connection termination processing has unexpectedly
   stalled for a given connection.  When the stall is detected (a
   connection termination timeout condition), an abnormal SMC-R
   connection termination flow is initiated.

In these scenarios, it is very important that resources associated with the affected SMC-R connections be properly cleaned up, to ensure that there are no orphaned resources and that resources can reliably be reused for new SMC-R connections.  Given that SMC-R relies heavily on RDMA write processing, special care needs to be taken to ensure that an RMBE is no longer being used by an SMC-R peer before logically reassigning that RMBE to a new SMC-R connection.  When an SMC-R peer initiates a TCP connection reset, it also initiates an SMC-R abnormal connection termination flow at the same time.  The SMC-R peers explicitly signal their intent to abnormally terminate an SMC-R connection and await explicit acknowledgment that the peer has received this notification and has also completed abnormal connection termination on its end.  Note that TCP connection reset processing can occur in parallel to these flows.
                    +-----------------+
    |-------------->|     CLOSED      |<-------------|
    |               |                 |              |
    |               +-----------------+              |
    |                        |                       |
    |               +-----------------------+        |
    |               |       Any state       |        |
    | 1B            |    (before setting    |     2B |
    |               |  PeerConnectionClosed |        |
    |               |     indicator in      |        |
    |               |     peer's RMBE)      |        |
    |               +-----------------------+        |
    |                 1A |             | 2A          |
    |       Active Abort |             | Passive     |
    |                    V             V  Abort      |
    |         +--------------+  +--------------+     |
    |---------|PeerAbortWait |  | Process Abort|-----|
              |              |  |              |
              +--------------+  +--------------+

   Figure 23: SMC-R Abnormal Connection Termination State Diagram

Figure 23 above shows the SMC-R abnormal connection termination state diagram:

1.  Active abort designates the SMC-R peer that is initiating the TCP
    RST processing.  At the time that the TCP RST is sent, the active
    abort side must also do the following:

    A.  Send the PeerConnAbort indicator to the partner in a CDC
        message, and then transition to the PeerAbortWait state.
        During this state, it monitors this SMC-R connection, waiting
        for the peer to send its corresponding PeerConnAbort
        indicator, but ignores any other activity on this connection
        (i.e., new incoming data).  It also generates an appropriate
        error for any socket API calls issued against this socket
        (e.g., ECONNABORTED, ECONNRESET).

    B.  Once the peer sends the PeerConnAbort indicator to the local
        host, the local host can transition this SMC-R connection to
        the Closed state and reuse this RMBE.

    Note that the SMC-R peer that goes into the active abort state
    must provide some protection against staying in that state
    indefinitely should the remote SMC-R peer not respond by sending
    its own PeerConnAbort indicator to the local host.  While this
    should be a rare scenario, it could occur if the remote SMC-R
    peer (passive abort) suffered a failure right after the local
    SMC-R peer (active abort) sent the PeerConnAbort indicator.  To
    protect against these types of failures, a timer can be set after
    entering the PeerAbortWait state; if that timer pops before the
    peer has sent its own PeerConnAbort indicator (to the active
    abort side), this RMBE can be returned to the free pool for
    possible reallocation.  See Section 4.4.2 for more details.

2.  Passive abort designates the SMC-R peer that is the recipient of
    an SMC-R abort, indicated by the PeerConnAbort indicator being
    sent by the peer in a CDC message.  Upon receiving this request,
    the local peer must do the following:

    A.  Using the appropriate error codes, indicate to the socket
        application that this connection has been aborted, and then
        purge all in-flight data for this connection that is waiting
        to be read or waiting to be sent.

    B.  Send a CDC message to notify the peer of the PeerConnAbort
        indicator and, once that is completed, transition this RMBE
        to the Closed state.

If an SMC-R peer receives a TCP RST for a given SMC-R connection, it also initiates SMC-R abnormal connection termination processing if it has not already been notified (via the PeerConnAbort indicator) that the partner is severing the connection.  It is possible for two SMC-R endpoints to be in an active abort role concurrently for a given connection.  In that scenario, the flows above still apply, but both endpoints take the active abort path (path 1).

4.8.3. Other SMC-R Connection Termination Conditions
The following are additional conditions that have implications for SMC-R connection termination:

o  An SMC-R peer being gracefully shut down.  If an SMC-R peer
   supports a graceful shutdown operation, it should attempt to
   terminate all SMC-R connections as part of shutdown processing.
   This could be accomplished via LLC DELETE LINK requests on all
   active SMC-R links.

o  Abnormal termination of an SMC-R peer.  In this case, there may be
   no opportunity for the host to perform any SMC-R cleanup
   processing.  It is then up to the remote peer to detect a RoCE
   communications failure with the failing host.  This could trigger
   SMC-R link switchover, but that would also generate RoCE errors,
   causing the remote host to eventually terminate all existing SMC-R
   connections to this peer.

o  Loss of RoCE connectivity between two SMC-R peers.  If two peers
   are no longer reachable across any links in their SMC-R link
   group, then both peers perform a TCP reset for the affected
   connections, generate an error to the local applications, and free
   up all QP resources associated with the link group.

5. Security Considerations
5.1. VLAN Considerations
The concepts and access control of virtual LANs (VLANs) must be extended to also cover the RoCE network traffic flowing across the Ethernet.  The RoCE VLAN configuration and access permissions must mirror the IP VLAN configuration and access permissions over the Converged Enhanced Ethernet fabric.  This means that hosts, routers, and switches that have access to specific VLANs on the IP fabric must also have the same VLAN access across the RoCE fabric.  In other words, SMC-R connectivity follows the same virtual network access permissions as normal TCP/IP traffic.

5.2. Firewall Considerations
As mentioned above, the RoCE fabric inherits the same VLAN topology/access as the IP fabric.  RoCE is a Layer 2 protocol that requires both endpoints to reside in the same Layer 2 network (i.e., VLAN).  RoCE traffic cannot traverse multiple VLANs, as there is no support for routing RoCE traffic beyond a single VLAN.  As a result, SMC-R communications are also confined to peers that are members of the same VLAN.  IP-based firewalls are typically inserted between VLANs (or physical LANs) and rely on normal IP routing to insert themselves in the data path.  Since RoCE (and by extension SMC-R) is not routable beyond the local VLAN, there is no ability to insert a firewall in the network path of two SMC-R peers.

5.3. Host-Based IP Filters
Because SMC-R maintains the TCP three-way handshake for connection setup before switching to RoCE out of band, existing IP filters that control connection setup flows remain effective in an SMC-R environment. IP filters that operate on traffic flowing in an active TCP connection are not supported, because the connection data does not flow over IP.
5.4. Intrusion Detection Services
Similar to IP filters, intrusion detection services that operate on TCP connection setups are compatible with SMC-R with no changes required.  However, once the TCP connection has switched to RoCE out of band, packets are not available for examination.

5.5. IP Security (IPsec)
IP security is not compatible with SMC-R, because there are no IP packets on which to operate.  TCP connections that require IP security must opt out of SMC-R.

5.6. TLS/SSL
Transport Layer Security/Secure Socket Layer (TLS/SSL) is preserved in an SMC-R environment.  The TLS/SSL layer resides above the SMC-R layer, and outgoing connection data is encrypted before being passed down to the SMC-R layer for the RDMA write.  Similarly, incoming connection data passes through the SMC-R layer encrypted and is decrypted by the TLS/SSL layer as it is today.  The TLS/SSL handshake messages flow over the TCP connection after the connection has switched to SMC-R, and so they are exchanged using RDMA writes by the SMC-R layer, transparently to the TLS/SSL layer.

6. IANA Considerations
The scarcity of TCP option codes available for assignment is understood, and this architecture uses experimental TCP options following the conventions of [RFC6994] ("Shared Use of Experimental TCP Options"). TCP ExID 0xE2D4C3D9 has been registered with IANA as a TCP Experiment Identifier. See Section 3.1. If this protocol achieves wide acceptance, a discrete option code may be requested by subsequent versions of this protocol.
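For illustration, the wire shape of such a shared experimental option can be sketched as follows. RFC 6994 defines the layout (Kind 253 or 254, a Length byte, then the ExID), and the 4-byte ExID value 0xE2D4C3D9 comes from the text above; the helper name and the choice of Kind 254 with an empty payload are assumptions of this sketch, not requirements stated here.

```python
import struct

# Illustrative encoding of an experimental TCP option carrying the
# SMC-R ExID, per the shared-use conventions of RFC 6994:
# Kind (253 or 254) | Length | 4-byte ExID | option data.
# 0xE2D4C3D9 is "SMCR" in EBCDIC.

SMCR_EXID = 0xE2D4C3D9

def smcr_tcp_option(kind=254, data=b""):
    """Build the option bytes: kind, total length, ExID, then payload."""
    length = 2 + 4 + len(data)   # kind byte + length byte + ExID + payload
    return struct.pack("!BBI", kind, length, SMCR_EXID) + data

opt = smcr_tcp_option()
print(opt.hex())  # fe06e2d4c3d9
```

Because the ExID disambiguates sharers of the experimental code points, two SMC-R peers can recognize each other's option even if unrelated experiments use the same Kind value on the network.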
7. Normative References
[RFC793]   Postel, J., "Transmission Control Protocol", STD 7,
           RFC 793, DOI 10.17487/RFC0793, September 1981,
           <http://www.rfc-editor.org/info/rfc793>.

[RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
           RFC 6994, DOI 10.17487/RFC6994, August 2013,
           <http://www.rfc-editor.org/info/rfc6994>.

[RoCE]     InfiniBand, "RDMA over Converged Ethernet specification",
           <https://cw.infinibandta.org/wg/Members/documentRevision/
           download/7149>.