Appendix B. Socket API Considerations
A key design goal for SMC-R is to require no application changes for exploitation. It is confined to socket applications using stream (i.e., TCP) sockets over IPv4 or IPv6. By virtue of the fact that the switch to the SMC-R protocol occurs after a TCP connection is established, no changes are required in a socket address family or in the IP addresses and ports that the socket applications are using. Existing socket APIs that allow applications to retrieve local and remote socket address structures for an established TCP connection (for example, getsockname() and getpeername()) will continue to function as they have before. Existing DNS setup and APIs for resolving hostnames to IP addresses and vice versa also continue to function without any changes. In general, all of the usual socket APIs that are used for TCP communications (send APIs, recv APIs, etc.) will continue to function as they do today, even if SMC-R is used as the underlying protocol.
Each SMC-R-enabled implementation does, however, need to pay special attention to any socket APIs that have a reliance on the underlying TCP and IP protocols and also ensure that their behavior in an SMC-R environment is reasonable and minimizes impact on the application. While the basic socket API set is fairly similar across different operating systems, there is more variability when it comes to advanced socket API options. Each implementation needs to perform a detailed analysis of its API options, any possible impact that SMC-R may have, and any resultant implications. As part of that step, a discussion or review with other implementations supporting SMC-R would be useful to ensure consistent implementation.
B.1. setsockopt() / getsockopt() Considerations
These APIs allow socket applications to manipulate socket, transport (TCP/UDP), and IP-level options associated with a given socket. Typically, a platform restricts the number of IP options available to stream (TCP) socket applications, given their connection-oriented nature. The general guideline here is to continue processing these APIs in a manner that allows for application compatibility. Some options will be relevant to the SMC-R protocol and will require special processing "under the covers". For example, the ability to manipulate TCP send and receive buffer sizes is still valid for SMC-R. However, other options may have no meaning for SMC-R. For example, if an application enabled the TCP_NODELAY socket option to disable Nagle's algorithm, it should have no real effect on SMC-R communications, as there is no notion of Nagle's algorithm with this new protocol. But the implementation must accept the TCP_NODELAY option as it does today and save it so that it can be later extracted via getsockopt() processing. Note that any TCP or IP-level options will still have an effect on any TCP/IP packets flowing for an SMC-R connection (i.e., as part of TCP/IP connection establishment and TCP/IP connection termination packet flows). Under the covers, manipulation of the TCP options will also include the SMC-layer setting, as well as reading the SMC-R experimental option before and after completion of the three-way TCP handshake.
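The compatibility requirement for TCP_NODELAY can be illustrated from the application's point of view. The following is a minimal sketch using the standard Python socket module; the helper name nodelay_roundtrip is hypothetical, and the code contains nothing SMC-R specific. It only shows the set-then-read-back behavior that an application may rely on and that an SMC-R-enabled stack is expected to preserve.

```python
import socket


def nodelay_roundtrip() -> int:
    """Set TCP_NODELAY on a fresh TCP socket and read the saved value back."""
    # An ordinary stream socket; the application does not request SMC-R.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Disable Nagle's algorithm. With SMC-R this has no effect on the
        # RDMA data path, but the option must still be accepted and saved.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        # The saved value must remain retrievable via getsockopt().
        return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)


if __name__ == "__main__":
    print("TCP_NODELAY enabled:", nodelay_roundtrip() != 0)
```

The same application code runs unchanged whether or not the connection later switches to SMC-R, which is the point of the guideline above.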
Appendix C. Rendezvous Error Scenarios
This section discusses error scenarios for setting up and managing SMC-R links.
C.1. SMC Decline during CLC Negotiation
A peer to the SMC-R CLC negotiation can send an SMC Decline in lieu of any expected CLC message to decline SMC and force the TCP connection back to the IP fabric. There can be several reasons for an SMC Decline during the CLC negotiation, including the following:
   o  RNIC went down
   o  SMC-R forbidden by local policy
   o  subnet (IPv4) or prefix (IPv6) doesn't match
   o  lack of resources to perform SMC-R
In all cases, when an SMC Decline is sent in lieu of an expected CLC message, no confirmation is required, and the TCP connection immediately falls back to using the IP fabric.
To prevent ambiguity between CLC messages and application data, an SMC Decline cannot "chase" another CLC message. An SMC Decline can only be sent in lieu of an expected CLC message. For example, if the client sends an SMC Proposal and then its RNIC goes down, it must wait for the SMC Accept from the server and then reply to the SMC Accept with an SMC Decline.
This "no chase" rule means that if this TCP connection is not a first contact between RoCE peers, a server cannot send an SMC Decline after sending an SMC Accept -- it can only either break the TCP connection or fail over if a problem arises in the RoCE fabric after it has sent the SMC Accept. Similarly, once the client sends an SMC Confirm on a TCP connection that isn't a first contact, it is committed to SMC-R for this TCP connection and cannot fall back to IP.
C.2. SMC Decline during LLC Negotiation
For a TCP connection that represents a first contact between RoCE peers, it is possible for SMC to fall back to IP during the LLC negotiation. This is possible until the first contact SMC-R link is confirmed. For example, see Figure 42. After a first contact SMC-R link is confirmed, fallback to IP is no longer possible. This translates to the following rule: a first contact peer can send an SMC Decline at any time during LLC negotiation until it has successfully sent its CONFIRM LINK (request or response) flow. After that point, it cannot fall back to IP.

     Host X -- Server                           Host Y -- Client
   +-------------------+                      +-------------------+
   | Peer ID = PS1     |                      |     Peer ID = PC1 |
   |            +------+                      +------+            |
   |      QP 8  |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 64      |
   | RKey X |   |MAC MA|<-------------------->|MAC MB|  |         |
   |        |   |GID GA|   attempted setup    |GID GB|  |  RKey Y2|
   |       \/   +------+                      +------+ \/         |
   |+--------+         |                      |        +--------+ |
   ||  RMB   |         |                      |        |  RMB   | |
   |+--------+         |                      |        +--------+ |
   |       /\   +------+                      +------+ /\         |
   |        |   |RNIC 3|                      |RNIC 4|  |  RKey W2|
   |        |   |MAC MC|                      |MAC MD|  |         |
   |      QP 9  |GID GC|                      |GID GD| QP 65      |
   |            +------+                      +------+            |
   +-------------------+                      +-------------------+

   SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
   <-------------------------------------------------------->

   SMC Proposal / SMC Accept / SMC Confirm exchange
   <-------------------------------------------------------->

   CONFIRM LINK(request, Link 1)
   .........................................................>

                         CONFIRM LINK(response, Link 1)
                         X...................................
                         :
                         : RoCE write failure
                         :..................................>

                         SMC Decline(PC1, reason code)
   <---------------------------------------------------------

   Connection data flows over IP fabric
   <-------------------------------------------------------->

   Legend:
   ------------ TCP/IP and CLC flows
   ............ RoCE (LLC) flows

          Figure 42: SMC Decline during LLC Negotiation
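The rules in Sections C.1 and C.2 for when a peer may send an SMC Decline can be summarized as a small decision function. The sketch below is illustrative only: the Phase enumeration, the function name, and its parameters are hypothetical names chosen for this example and are not defined by the protocol.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical negotiation phases, as seen by one peer."""
    CLC_OUR_TURN = auto()       # peer's CLC message arrived; we owe the next one
    CLC_AWAITING_PEER = auto()  # we sent a CLC message; waiting for the peer's
    LLC = auto()                # CLC exchange finished; LLC negotiation running


def may_send_smc_decline(phase: Phase, first_contact: bool,
                         confirm_link_sent: bool) -> bool:
    """Return True if this peer may still send an SMC Decline.

    confirm_link_sent means this peer has successfully sent its
    CONFIRM LINK request or response.
    """
    if phase is Phase.CLC_OUR_TURN:
        # A Decline may replace the CLC message the peer is expecting (C.1).
        return True
    if phase is Phase.CLC_AWAITING_PEER:
        # "No chase" rule: a Decline cannot follow our own CLC message.
        return False
    # LLC phase: only a first contact may fall back, and only until this
    # peer's CONFIRM LINK flow has been sent successfully (C.2).
    return first_contact and not confirm_link_sent


if __name__ == "__main__":
    # Client in Figure 42: CONFIRM LINK response failed, so it may decline.
    print(may_send_smc_decline(Phase.LLC, first_contact=True,
                               confirm_link_sent=False))
```

Modeling "successfully sent" as a single boolean is a simplification; an implementation would derive it from the write completion for the CONFIRM LINK flow.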
C.3. The SMC Decline Window
Because SMC-R does not support fallback to IP for a TCP connection that is already using RDMA, there are specific rules on when the SMC Decline CLC message, which signals a fallback to IP because of an error or problem with the RoCE fabric, can be sent during TCP connection setup. There is a "point of no return" after which a connection cannot fall back to IP, and RoCE errors that occur after this point require the connection to be broken with a RST flow in the IP fabric.
For a first contact, that point of no return is after the ADD LINK LLC message has been successfully sent for the second SMC-R link. Specifically, the server cannot fall back to IP after receiving either (1) a positive write completion indication for the ADD LINK request or (2) the ADD LINK response from the client, whichever comes first. The client cannot fall back to IP after sending a negative ADD LINK response, receiving a positive write complete on a positive ADD LINK response, or receiving a CONFIRM LINK for the second SMC-R link from the server, whichever comes first.
For a subsequent contact, that point of no return is after the last send of the CLC negotiation completes. This, in combination with the rule that error "chasers" are not allowed during CLC negotiation, means that the server cannot send an SMC Decline after sending an SMC Accept, and the client cannot send an SMC Decline after sending an SMC Confirm.
C.4. Out-of-Sync Conditions during SMC-R Negotiation
The SMC Accept CLC message contains a first contact flag that indicates to the client whether the server believes it is setting up a new link group or using an existing link group. This flag is used to detect an out-of-sync condition between the client and the server. The scenario for such a condition is as follows: there is a single existing SMC-R link between the peers. After the client sends the SMC Proposal CLC message, the existing SMC-R link between the client and the server fails. The client cannot chase the SMC Proposal CLC message with an SMC Decline CLC message in this case, because the client does not yet know that the server would have wanted to choose the SMC-R link that just crashed. The QP that failed recovers before the server returns its SMC Accept CLC message. This means that there is a QP but no SMC-R link.
Since the server had not yet learned of the SMC-R link failure when it sent the SMC Accept CLC message, it attempts to reuse the SMC-R link that just failed. This means that the server would not set the first contact flag, indicating to the client that the server thinks it is reusing an SMC-R link. However, the client does not have an SMC-R link that matches the server's specification. Because the first contact flag is off, the client realizes it is out of sync with the server and sends an SMC Decline to cause the connection to fall back to IP.
C.5. Timeouts during CLC Negotiation
Because the SMC-R negotiation flows as TCP data, there are built-in timeouts and retransmits at the TCP layer for individual messages. Implementations also must protect the overall TCP/CLC handshake with a timer or timers to prevent connections from hanging indefinitely due to SMC-R processing. This can be done with individual timers for individual CLC messages or an overall timer for the entire exchange, which may include the TCP handshake and the CLC handshake under one timer or separate timers. This decision is implementation dependent.
If the TCP and/or CLC handshakes time out, the TCP connection must be terminated as it would be in a legacy IP environment when connection setup doesn't complete in a timely manner. Because the CLC flows are TCP messages, if they cannot be sent and received in a timely fashion, the TCP connection is not healthy and would not work if fallback to IP were attempted.
C.6. Protocol Errors during CLC Negotiation
Protocol errors occur during CLC negotiation when a message is received that is not expected. For example, a peer that is expecting a CLC message but instead receives application data has experienced a protocol error; this also indicates a likely software error, as the two sides are out of sync. When application data is expected, this data is not parsed to ensure that it's not a CLC message. When a peer is expecting a CLC negotiation message, any parsing error except a bad enumerated value in that message must be treated as application data. The CLC negotiation messages are designed with beginning and ending eye catchers to help verify that a CLC negotiation message is actually the expected message. If other parsing errors in an expected CLC message occur, such as incorrect length fields or incorrectly formatted fields, the message must be treated as application data. All protocol errors, with the exception of bad enumerated values, must result in termination of the TCP connection. No fallback to IP is allowed in the case of a protocol error, because if the protocols are out of sync, mismatched, or corrupted, then data and security integrity cannot be ensured.
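The handling of an expected CLC message, including the enumerated-value exception described in the next paragraph, can be sketched as a classification step. This is a minimal illustration, not an implementation: the ClcParseResult fields, the Action names, and the function name are hypothetical, and the sketch assumes that a structural error takes precedence when a message also carries a reserved enumerated value.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    PROCEED = auto()                # message is valid; continue negotiation
    TERMINATE_TCP = auto()          # protocol error; no fallback to IP
    DECLINE_AND_FALL_BACK = auto()  # send SMC Decline and fall back to IP


@dataclass(frozen=True)
class ClcParseResult:
    """Hypothetical summary of parsing one expected CLC message."""
    eye_catchers_ok: bool        # beginning and ending eye catchers matched
    lengths_and_format_ok: bool  # length fields and field formats are valid
    reserved_enum_value: bool    # e.g., a reserved QP MTU value was received


def clc_error_action(result: ClcParseResult) -> Action:
    """Map a parse result to the action required by the rules in C.6."""
    if not (result.eye_catchers_ok and result.lengths_and_format_ok):
        # Treated as application data arriving when a CLC message was
        # expected: a protocol error, so the TCP connection is terminated.
        return Action.TERMINATE_TCP
    if result.reserved_enum_value:
        # Capability mismatch, not an integrity problem.
        return Action.DECLINE_AND_FALL_BACK
    return Action.PROCEED


if __name__ == "__main__":
    print(clc_error_action(ClcParseResult(True, True, True)).name)
```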
The exception to this rule is enumerated values -- for example, the QP MTU values on SMC Accept and SMC Confirm. If a reserved value is received, the proper error response is to send an SMC Decline and fall back to IP; this is because the use of a reserved enumerated value indicates that the other partner likely has additional support that the receiving partner does not have. This indicated mismatch of SMC-R capabilities is not an integrity problem but indicates that SMC-R cannot be used for this connection.
C.7. Timeouts during LLC Negotiation
Whenever a peer sends an LLC message to which a reply is expected, it sets a timer after the send posts to wait for the reply. An expected response may be a reply flavor of the LLC message (for example, a CONFIRM LINK reply) or a new LLC message (for example, an ADD LINK CONTINUATION expected from the server by the client if there are more RKeys to be communicated).
On LLC flows that are part of a first contact setup of a link group, the value of the timer is implementation dependent but should be long enough to allow the other peer to have a write complete timeout and 2-3 retransmits of an SMC Decline on the TCP fabric. For LLC flows that are maintaining the link group and are not part of a first contact setup of a link group, the timers may be shorter. Upon receipt of an expected reply, the timer is cancelled. If a timer pops without a reply having been received, the sender must initiate a recovery action.
During first contact processing, failure of an LLC verification timer is a "should-not-occur" that indicates a problem with one of the endpoints; this is because if there is a "routine" failure in the RoCE fabric that causes an LLC verification send to fail, the sender will get a write completion failure and will then send an SMC Decline to the partner. The only time an LLC verification timer will expire on a first contact is when the sender thinks the send succeeded but it actually didn't. Because of the reliably connected nature of QP connections on the RoCE fabric, this indicates a problem with one of the peers, not with the RoCE fabric.
After the reliably connected queue pair for the first SMC-R link in a link group is set up on initial contact, the client sets a timer to wait for a RoCE verification message from the server that the QP is actually connected and usable. If the server experiences a failure sending its QP confirmation message, it will send an SMC Decline, which should arrive at the client before the client's verification timer expires.
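The general timer rule in this section (start the timer after the send posts, cancel it when the expected reply arrives, and begin recovery if it pops) can be sketched as follows. The class name LlcReplyTimer and its methods are hypothetical, the timeout value is left to the caller because it is implementation dependent, and the current time is passed in explicitly so that the sketch stays deterministic.

```python
from typing import Optional


class LlcReplyTimer:
    """Tracks one outstanding LLC message that expects a reply."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._deadline: Optional[float] = None  # None means no timer running

    def start(self, now: float) -> None:
        """Arm the timer; call this after the LLC send has been posted."""
        self._deadline = now + self.timeout_s

    def on_reply(self) -> None:
        """Cancel the timer when the expected reply is received."""
        self._deadline = None

    def expired(self, now: float) -> bool:
        """True if the timer popped, so a recovery action is required."""
        return self._deadline is not None and now >= self._deadline


if __name__ == "__main__":
    timer = LlcReplyTimer(timeout_s=2.0)
    timer.start(now=10.0)
    print(timer.expired(now=11.0), timer.expired(now=12.5))
```

In a real stack the caller would supply a monotonic clock reading (for example, time.monotonic()) and select the recovery action according to which LLC message was outstanding.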
If the client's timer expires without receiving either an SMC Decline or a RoCE message confirmation from the server, there is a problem with either the server or the TCP fabric. In either case, the client must break the TCP connection and clean up the SMC-R link.
There are two scenarios in which the client's response to the QP verification message fails to reach the server. The main difference is whether or not the client has successfully completed the send of the CONFIRM LINK response.
In the normal case of a problem with the RoCE path, the client will learn of the failure by getting a write completion failure, before the server's timer expires. In this case, the client sends an SMC Decline CLC message to the server, and the TCP connection falls back to IP.
If the client's send of the confirmation message receives a positive return code but for some reason still does not reach the server, or the client's SMC Decline CLC message fails to reach the server after the client fails to send its RoCE confirmation message, then the server's timer will time out and the server must break the TCP connection by sending a RST. This is expected to be a very rare case, because if the client cannot send its CONFIRM LINK response LLC message, the client should get a negative return code and initiate fallback to IP. A client receiving a positive return code on a send that fails to reach the server should also be an extremely rare case.
C.7.1. Recovery Actions for LLC Timeouts and Failures
The following list describes recovery actions for LLC timeouts. A write completion failure or other indication of send failure for an LLC command is treated the same as a timeout.

LLC message: CONFIRM LINK from server (first contact, first link in the link group)
   Timer waits for: CONFIRM LINK reply from client.
   Recovery action: Break the TCP connection by sending a RST, and clean up the link. The server should have received an SMC Decline from the client by now if the client had an LLC send failure.

LLC message: CONFIRM LINK from server (first contact, second link in the link group)
   Timer waits for: CONFIRM LINK reply from client.
   Recovery action: The second link was not successfully set up. Send a DELETE LINK to the client. Connection data cannot flow in the first link in the link group, until the reply to this DELETE LINK is received, to prevent the peers from being out of sync on the state of the link group.

LLC message: CONFIRM LINK from server (not first contact)
   Timer waits for: CONFIRM LINK reply from client.
   Recovery action: Clean up the new link, and set a timer to retry. Send a DELETE LINK to the client, in case the client has a longer timer interval, so the client can stop waiting.

LLC message: CONFIRM LINK reply from client (first contact)
   Timer waits for: ADD LINK from server.
   Recovery action: Clean up the SMC-R link, and break the TCP connection by sending a RST over the IP fabric. There is a problem with the server. If the server had a send failure, it should have sent an SMC Decline by now.

LLC message: ADD LINK from server (first contact)
   Timer waits for: ADD LINK reply from client.
   Recovery action: Break the TCP connection with a RST, and clean up RoCE resources. The connection is past the point where the server can fall back to IP, and if the client had a send problem it should have sent an SMC Decline by now.

LLC message: ADD LINK from server (not first contact)
   Timer waits for: ADD LINK reply from client.
   Recovery action: Clean up resources (QP, RKeys, etc.) for the new link, and treat the link over which the ADD LINK was sent as if it had failed. If there is another link available to resend the ADD LINK and the link group still needs another link, retry the ADD LINK over another link in the link group.

LLC message: ADD LINK reply from client (and there are more RKeys to be communicated)
   Timer waits for: ADD LINK CONTINUATION from server.
   Recovery action: Treat the same as ADD LINK timer failure.

LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from client (and there are no more RKeys to be communicated, for the second link in a first contact scenario)
   Timer waits for: CONFIRM LINK from the server, over the new link.
   Recovery action: The setup of the new link failed. Send a DELETE LINK to the server. Do not consider the socket opened to the client application until receiving confirmation from the server in the form of a DELETE LINK request for this link and sending the reply (to prevent the partners from being out of sync on the state of the link group). Set a timer to send another ADD LINK to the server if there is still an unused RNIC on the client side.

LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from client (and there are no more RKeys to be communicated)
   Timer waits for: CONFIRM LINK from the server, over the new link.
   Recovery action: Send a DELETE LINK to the server for the new link, then clean up any resource allocated for the new link and set a timer to send an ADD LINK to the server if there is still an unused RNIC on the client side. The setup of the new link failed, but the link over which the ADD LINK exchange occurred is unaffected.

LLC message: ADD LINK CONTINUATION from server
   Timer waits for: ADD LINK CONTINUATION reply from client.
   Recovery action: Treat the same as ADD LINK timer failure.

LLC message: ADD LINK CONTINUATION reply from client (first contact, and RMB count fields indicate that the server owes more ADD LINK CONTINUATION messages)
   Timer waits for: ADD LINK CONTINUATION from server.
   Recovery action: Clean up the SMC-R link, and break the TCP connection by sending a RST. There is a problem with the server. If the server had a send failure, it should have sent an SMC Decline by now.

LLC message: ADD LINK CONTINUATION reply from client (not first contact, and RMB count fields indicate that the server owes more ADD LINK CONTINUATION messages)
   Timer waits for: ADD LINK CONTINUATION from server.
   Recovery action: Treat as if client detected link failure on the link that the ADD LINK exchange is using. Send a DELETE LINK to the server over another active link if one exists; otherwise, clean up the link group.

LLC message: DELETE LINK from client
   Timer waits for: DELETE LINK request from server.
   Recovery action: If the scope of the request is to delete a single link, the surviving link over which the client sent the DELETE LINK is no longer usable either. If this is the last link in the link group, end TCP connections over the link group by sending RST packets. If there are other surviving links in the link group, resend over a surviving link. Also send a DELETE LINK over a surviving link for the link over which the client attempted to send the initial DELETE LINK message. If the scope of the request is to delete the entire link group, try resending on other links in the link group until success is achieved. If all sends fail, tear down the link group and any TCP connections that exist on it.

LLC message: DELETE LINK from server (scope: entire link group)
   Timer waits for: Confirmation from the adapter that the message was delivered.
   Recovery action: Tear down the link group and any TCP connections that exist on it.

LLC message: DELETE LINK from server (scope: single link)
   Timer waits for: DELETE LINK reply from client.
   Recovery action: The link over which the server sent the DELETE LINK is no longer usable either. If this is the last link in the link group, end TCP connections over the link group by sending RST packets. If there are other surviving links in the link group, resend over a surviving link. Also send a DELETE LINK over a surviving link for the link over which the server attempted to send the initial DELETE LINK message. If the scope of the request is to delete the entire link group, try resending on other links in the link group until success is achieved. If all sends fail, tear down the link group and any TCP connections that exist on it.

LLC message: CONFIRM RKEY from client
   Timer waits for: CONFIRM RKEY reply from server.
   Recovery action: Perform normal client procedures for detection of failed link. The link over which the message was sent has failed.

LLC message: CONFIRM RKEY from server
   Timer waits for: CONFIRM RKEY reply from client.
   Recovery action: Perform normal server procedures for detection of failed link. The link over which the message was sent has failed.

LLC message: TEST LINK from client
   Timer waits for: TEST LINK reply from server.
   Recovery action: Perform normal client procedures for detection of failed link. The link over which the message was sent has failed.

LLC message: TEST LINK from server
   Timer waits for: TEST LINK reply from client.
   Recovery action: Perform normal server procedures for detection of failed link. The link over which the message was sent has failed.

The following list describes recovery actions for invalid LLC messages. These could be misformatted or contain out-of-sync data.

LLC message received: CONFIRM LINK from server
   What it indicates: Incorrect link information.
   Recovery action: Protocol error. The link must be brought down by sending a DELETE LINK for the link over another link in the link group if one exists. If this is a first contact, fall back to IP by sending an SMC Decline to the server.

LLC message received: ADD LINK
   What it indicates: Undefined enumerated MTU value.
   Recovery action: Send a negative ADD LINK reply with reason code x'2'.

LLC message received: ADD LINK reply from client
   What it indicates: Client-side link information that would result in a parallel link being set up.
   Recovery action: Parallel links are not permitted. Delete the link by sending a DELETE LINK to the client over another link in the link group.

LLC message received: Any link group command from the server, except DELETE LINK for the entire link group
   What it indicates: Client has sent a DELETE LINK for the link on which the message was received.
   Recovery action: Ignore the LLC message. Worst case: the server will time out. Best case: the DELETE LINK crosses with the command from the server, and the server realizes it failed.

LLC message received: ADD LINK CONTINUATION from server or ADD LINK CONTINUATION reply from client
   What it indicates: Number of RMBs provided doesn't match count given on initial ADD LINK or ADD LINK reply message.
   Recovery action: Protocol error. Treat as if detected link outage.

LLC message received: DELETE LINK from client
   What it indicates: Link indicated doesn't exist.
   Recovery action: If the link is in the process of being cleaned up, assume timing window and ignore message. Otherwise, send a DELETE LINK reply with reason code 1.

LLC message received: DELETE LINK from server
   What it indicates: Link indicated doesn't exist.
   Recovery action: Send a DELETE LINK reply with reason code 1.

LLC message received: CONFIRM RKEY from either client or server
   What it indicates: No RKey provided for one or more of the links in the link group.
   Recovery action: Treat as if detected failure of the link(s) for which no RKey was provided.

LLC message received: DELETE RKEY
   What it indicates: Specified RKey doesn't exist.
   Recovery action: Send a negative DELETE RKEY response.

LLC message received: TEST LINK reply
   What it indicates: User data doesn't match what was sent in the TEST LINK request.
   Recovery action: Treat as if detected that the link has gone down. This is a protocol error.

LLC message received: Unknown LLC type with high-order bits of opcode equal to b'10'
   What it indicates: This is an optional LLC message that the receiver does not support.
   Recovery action: Ignore (silently discard) the message.

LLC message received: Any unambiguously incorrect or out-of-sync LLC message
   What it indicates: Link is out of sync.
   Recovery action: Treat as if detected that the link has gone down.

Note that an unsupported or unknown LLC opcode whose two high-order bits are b'10' is not an error and must be silently discarded. Any other unknown or unsupported LLC opcode is an error.
C.8. Failure to Add Second SMC-R Link to a Link Group
When there is any failure in setting up the second SMC-R link in an SMC-R link group, including confirmation timer expiration, the SMC-R link group is allowed to continue without available failover. However, this situation is extremely undesirable, and the server must endeavor to correct it as soon as it can.
The server peer in the SMC-R link group must set a timer to drive it to retry setup of a failed additional SMC-R link. The server will immediately retry the SMC-R link setup when the first of the following events occurs:
   o  The retry timer expires.
   o  A new RNIC becomes available to the server, on the same LAN as the SMC-R link group.
   o  An ADD LINK LLC request message is received from the client; this indicates the availability of a new RNIC on the client side.
Authors' Addresses
   Mike Fox
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC 27709
   United States
   Email: mjfox@us.ibm.com

   Constantinos (Gus) Kassimis
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC 27709
   United States
   Email: kassimis@us.ibm.com

   Jerry Stevens
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC 27709
   United States
   Email: sjerry@us.ibm.com