Independent Submission                                            M. Fox
Request for Comments: 7609                                   C. Kassimis
Category: Informational                                       J. Stevens
ISSN: 2070-1721                                                      IBM
                                                             August 2015

     IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Abstract
This document describes IBM's Shared Memory Communications over RDMA (SMC-R) protocol. This protocol provides Remote Direct Memory Access (RDMA) communications to TCP endpoints in a manner that is transparent to socket applications. It further provides for dynamic discovery of partner RDMA capabilities and dynamic setup of RDMA connections, as well as transparent high availability and load balancing when redundant RDMA network paths are available. It maintains many of the traditional TCP/IP qualities of service such as filtering that enterprise users demand, as well as TCP socket semantics such as urgent data.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This is a contribution to the RFC Series, independently of any other RFC stream. The RFC Editor has chosen to publish this document at its discretion and makes no statement about its value for implementation or deployment. Documents approved for publication by the RFC Editor are not a candidate for any level of Internet Standard; see Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc7609.
Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Table of Contents
1. Introduction
   1.1. Protocol Overview
      1.1.1. Hardware Requirements
   1.2. Definition of Common Terms
   1.3. Conventions Used in This Document
2. Link Architecture
   2.1. Remote Memory Buffers (RMBs)
   2.2. SMC-R Link Groups
      2.2.1. Link Group Types
      2.2.2. Maximum Number of Links in Link Group
      2.2.3. Forming and Managing Link Groups
      2.2.4. SMC-R Link Identifiers
   2.3. SMC-R Resilience and Load Balancing
3. SMC-R Rendezvous Architecture
   3.1. TCP Options
   3.2. Connection Layer Control (CLC) Messages
   3.3. LLC Messages
   3.4. CDC Messages
   3.5. Rendezvous Flows
      3.5.1. First Contact
         3.5.1.1. Pre-negotiation of TCP Options
         3.5.1.2. Client Proposal
         3.5.1.3. Server Acceptance
         3.5.1.4. Client Confirmation
         3.5.1.5. Link (QP) Confirmation
         3.5.1.6. Second SMC-R Link Setup
            3.5.1.6.1. Client Processing of ADD LINK LLC Message from Server
            3.5.1.6.2. Server Processing of ADD LINK Reply LLC Message from Client
            3.5.1.6.3. Exchange of RKeys on Second SMC-R Link
            3.5.1.6.4. Aborting SMC-R and Falling Back to IP
      3.5.2. Subsequent Contact
         3.5.2.1. SMC-R Proposal
         3.5.2.2. SMC-R Acceptance
         3.5.2.3. SMC-R Confirmation
         3.5.2.4. TCP Data Flow Race with SMC Confirm CLC Message
      3.5.3. First Contact Variation: Creating a Parallel Link Group
      3.5.4. Normal SMC-R Link Termination
      3.5.5. Link Group Management Flows
         3.5.5.1. Adding and Deleting Links in an SMC-R Link Group
            3.5.5.1.1. Server-Initiated ADD LINK Processing
            3.5.5.1.2. Client-Initiated ADD LINK Processing
            3.5.5.1.3. Server-Initiated DELETE LINK Processing
            3.5.5.1.4. Client-Initiated DELETE LINK Request
         3.5.5.2. Managing Multiple RKeys over Multiple SMC-R Links in a Link Group
            3.5.5.2.1. Adding a New RMB to an SMC-R Link Group
            3.5.5.2.2. Deleting an RMB from an SMC-R Link Group
            3.5.5.2.3. Adding a New SMC-R Link to a Link Group with Multiple RMBs
         3.5.5.3. Serialization of LLC Exchanges, and Collisions
            3.5.5.3.1. Collisions with ADD LINK / CONFIRM LINK Exchange
            3.5.5.3.2. Collisions during DELETE LINK Exchange
            3.5.5.3.3. Collisions during CONFIRM RKEY Exchange
4. SMC-R Memory-Sharing Architecture
   4.1. RMB Element Allocation Considerations
   4.2. RMB and RMBE Format
   4.3. RMBE Control Information
   4.4. Use of RMBEs
      4.4.1. Initializing and Accessing RMBEs
      4.4.2. RMB Element Reuse and Conflict Resolution
   4.5. SMC-R Protocol Considerations
      4.5.1. SMC-R Protocol Optimized Window Size Updates
      4.5.2. Small Data Sends
      4.5.3. TCP Keepalive Processing
   4.6. TCP Connection Failover between SMC-R Links
      4.6.1. Validating Data Integrity
      4.6.2. Resuming the TCP Connection on a New SMC-R Link
   4.7. RMB Data Flows
      4.7.1. Scenario 1: Send Flow, Window Size Unconstrained
      4.7.2. Scenario 2: Send/Receive Flow, Window Size Unconstrained
      4.7.3. Scenario 3: Send Flow, Window Size Constrained
      4.7.4. Scenario 4: Large Send, Flow Control, Full Window Size Writes
      4.7.5. Scenario 5: Send Flow, Urgent Data, Window Size Unconstrained
      4.7.6. Scenario 6: Send Flow, Urgent Data, Window Size Closed
   4.8. Connection Termination
      4.8.1. Normal SMC-R Connection Termination Flows
      4.8.2. Abnormal SMC-R Connection Termination Flows
      4.8.3. Other SMC-R Connection Termination Conditions
5. Security Considerations
   5.1. VLAN Considerations
   5.2. Firewall Considerations
   5.3. Host-Based IP Filters
   5.4. Intrusion Detection Services
   5.5. IP Security (IPsec)
   5.6. TLS/SSL
6. IANA Considerations
7. Normative References
Appendix A. Formats
   A.1. TCP Option
   A.2. CLC Messages
      A.2.1. Peer ID Format
      A.2.2. SMC Proposal CLC Message Format
      A.2.3. SMC Accept CLC Message Format
      A.2.4. SMC Confirm CLC Message Format
      A.2.5. SMC Decline CLC Message Format
   A.3. LLC Messages
      A.3.1. CONFIRM LINK LLC Message Format
      A.3.2. ADD LINK LLC Message Format
      A.3.3. ADD LINK CONTINUATION LLC Message Format
      A.3.4. DELETE LINK LLC Message Format
      A.3.5. CONFIRM RKEY LLC Message Format
      A.3.6. CONFIRM RKEY CONTINUATION LLC Message Format
      A.3.7. DELETE RKEY LLC Message Format
      A.3.8. TEST LINK LLC Message Format
   A.4. Connection Data Control (CDC) Message Format
Appendix B. Socket API Considerations
   B.1. setsockopt() / getsockopt() Considerations
Appendix C. Rendezvous Error Scenarios
   C.1. SMC Decline during CLC Negotiation
   C.2. SMC Decline during LLC Negotiation
   C.3. The SMC Decline Window
   C.4. Out-of-Sync Conditions during SMC-R Negotiation
   C.5. Timeouts during CLC Negotiation
   C.6. Protocol Errors during CLC Negotiation
   C.7. Timeouts during LLC Negotiation
      C.7.1. Recovery Actions for LLC Timeouts and Failures
   C.8. Failure to Add Second SMC-R Link to a Link Group
Authors' Addresses

1. Introduction
This document specifies IBM's Shared Memory Communications over RDMA (SMC-R) protocol. SMC-R is a protocol for Remote Direct Memory Access (RDMA) communication between TCP socket endpoints. SMC-R runs over networks that support RDMA over Converged Ethernet (RoCE). It is designed to permit existing TCP applications to benefit from RDMA without requiring modifications to the applications or predefinition of RDMA partners.

SMC-R provides dynamic discovery of the RDMA capabilities of TCP peers and automatic setup of RDMA connections that those peers can use. SMC-R also provides transparent high availability and load-balancing capabilities that are demanded by enterprise installations but are missing from current RDMA protocols. If redundant RoCE-capable hardware such as RDMA-capable Network Interface Cards (RNICs) and RoCE-capable switches is present, SMC-R can load-balance over that redundant hardware and can also non-disruptively move TCP traffic from failed paths to surviving paths, all seamlessly to the application and the sockets layer.

Because SMC-R preserves socket semantics and the TCP three-way handshake, many TCP qualities of service such as filtering, load balancing, and Secure Sockets Layer (SSL) encryption are preserved, as are TCP features such as urgent data.

Because of the dynamic discovery and setup of SMC-R connectivity between peers, no RDMA connection manager (RDMA-CM) is required. This also means that support for Unreliable Datagram (UD) Queue Pairs (QPs) is not required.
It is recommended that the SMC-R services be implemented in kernel space, which enables optimizations such as resource-sharing between connections across multiple processes and also permits applications using SMC-R to spawn multiple processes (e.g., fork) without losing SMC-R functionality. A user-space implementation is compatible with this architecture, but it may not support spawned processes (e.g., fork), which limits sharing and resource optimization to TCP connections that originate from the same process. This might be an appropriate design choice if the use case is a system that hosts a large single process application that creates many TCP connections to a peer host, or in implementations where a kernel-space implementation is not possible or introduces excessive overhead for "kernel space to user space" context switches.

1.1. Protocol Overview
SMC-R defines the concept of the SMC-R link, which is a logical point-to-point link using reliably connected queue pairs between TCP/IP stack peers over a RoCE fabric. An SMC-R link is bound to a specific hardware path, meaning a specific RNIC on each peer. SMC-R links are created and maintained by an SMC-R layer, which may reside in kernel space or user space, depending upon operating system and implementation requirements. The SMC-R layer resides below the sockets layer and directs data traffic for TCP connections between connected peers over the RoCE fabric using RDMA rather than over a TCP connection. The TCP/IP stack, with its requirements for fragmentation, packetization, etc., is bypassed, and the application data is moved between peers using RDMA.

Multiple SMC-R links between the same two TCP/IP stack peers are also supported. A set of SMC-R links called a link group can be logically bonded together to provide redundant connectivity. If there is redundant hardware -- for example, two RNICs on each peer -- separate SMC-R links are created between the peers to exploit that redundant hardware. The link group architecture with redundant links provides load balancing and increased bandwidth, as well as seamless failover.

Each SMC-R link group is associated with areas of memory called Remote Memory Buffers (RMBs), which are available for SMC-R peers to write into using RDMA writes. Multiple TCP connections between peers may be multiplexed over a single SMC-R link, in which case the SMC-R layer manages the partitioning of the RMBs between the TCP connections. This multiplexing reduces the RDMA resources, such as QPs and RMBs, that are required to support multiple connections between peers, and it also reduces the processing and delays related to setting up QPs, pinning memory, and other RDMA setup tasks when new TCP connections are created. In a kernel-space SMC-R implementation in which the RMBs reside in kernel storage, this sharing and optimization works across multiple processes executing on the same host. In a user-space SMC-R implementation in which the RMBs reside in user space, this sharing and optimization is limited to multiple TCP connections created by a single process, as separate RMBs and QPs will be required for each process.

SMC-R also introduces a rendezvous protocol that is used to dynamically discover the RDMA capabilities of TCP connection partners and exchange credentials necessary to exploit that capability if present. TCP connections are set up using the normal TCP three-way handshake [RFC793], with the addition of a new TCP option that indicates SMC-R capability. If both partners indicate SMC-R capability, then at the completion of the three-way TCP handshake the SMC-R layers in each peer take control of the TCP connection and use it to exchange additional Connection Layer Control (CLC) messages to negotiate SMC-R credentials such as QP information; addressability over the RoCE fabric; RMB buffer sizes; and keys and addresses for accessing RMBs over RDMA. If at any time during this negotiation a failure or decline occurs, the TCP connection falls back to using the IP fabric.

If the SMC-R negotiation succeeds and either a new SMC-R link is set up or an existing SMC-R link is chosen for the TCP connection, then the SMC-R layers open the sockets to the applications and the applications use the sockets as normal. The SMC-R layer intercepts the socket reads and writes and moves the TCP connection data over the SMC-R link, "out of band" to the TCP connection, which remains open and idle over the IP fabric, except for termination flows and possible keepalive flows. Regular TCP sequence numbering methods are used for the TCP flows that do occur; data flowing over RDMA does not use or affect TCP sequence numbers.

This architecture does not support fallback of active SMC-R connections to IP.
Once connection data has completed the switch to RDMA, a TCP connection cannot be switched back to IP and will reset if RDMA becomes unusable.

The SMC-R protocol defines the format of the RMBs that are used to receive TCP connection data written over RDMA, as well as the semantics for managing and writing to these buffers using Connection Data Control (CDC) messages.
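
The rendezvous outcome and the no-fallback rule described above can be sketched as a small decision model. This is an illustrative sketch only, not part of the protocol definition: the names `Transport`, `rendezvous`, and `on_rdma_failure`, and the reduction of the whole CLC exchange to a single boolean, are assumptions made for the example.

```python
from enum import Enum


class Transport(Enum):
    TCP_IP = "tcp/ip"   # connection data stays on the IP fabric
    SMC_R = "smc-r"     # connection data moves over RDMA
    RESET = "reset"     # the TCP connection is reset


def rendezvous(client_option: bool, server_option: bool, clc_ok: bool) -> Transport:
    """Pick the data path once the TCP three-way handshake completes.

    client_option / server_option: whether each side indicated SMC-R
    capability with the TCP option.
    clc_ok: hypothetical summary flag; False means a failure or decline
    occurred at some point during the CLC negotiation.
    """
    if not (client_option and server_option):
        return Transport.TCP_IP   # SMC-R is never attempted
    if not clc_ok:
        return Transport.TCP_IP   # failure or decline: fall back to IP
    return Transport.SMC_R


def on_rdma_failure(current: Transport, usable_links: int) -> Transport:
    """Model the no-fallback rule for connections already using RDMA.

    usable_links: hypothetical count of SMC-R links in the link group
    that can still carry the connection.
    """
    if current is Transport.SMC_R and usable_links == 0:
        return Transport.RESET    # RDMA unusable: reset, never back to IP
    return current
```

In this model, fallback to IP is possible only inside `rendezvous`; once a connection is on `Transport.SMC_R`, the only other state it can reach is `Transport.RESET`.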
Finally, SMC-R defines Link Layer Control (LLC) messages that are exchanged over the RoCE fabric between peer SMC-R layers to manage the SMC-R links and link groups. These include messages to test and confirm connectivity over an SMC-R link, add and delete SMC-R links to or from the link group, and exchange RMB addressability information.

1.1.1. Hardware Requirements
SMC-R does not require full Converged Enhanced Ethernet switch functionality. SMC-R functions over standard Ethernet fabrics, provided that the endpoints have RNICs and that IEEE 802.3x Global Pause Frame is supported and enabled in the switch fabric.

While SMC-R as specified in this document is designed to operate over RoCE fabrics, adjustments to the rendezvous methods could enable it to run over other RDMA fabrics, such as InfiniBand [RoCE] and iWARP.

1.2. Definition of Common Terms
This section provides definitions of terms that have a specific meaning to the SMC-R protocol and are used throughout this document.

SMC-R Link
   An SMC-R link is a logical point-to-point connection over the RoCE fabric via specific physical adapters (Media Access Control / Global Identifier (MAC/GID)). The link is formed during the "first contact" sequence of the TCP/IP three-way handshake sequence that occurs over the IP fabric. During this handshake, an RDMA reliably connected queue pair (RC-QP) connection is formed between the two peer SMC hosts and is defined as the SMC-R link. The SMC-R link can then support multiple TCP connections between the two peers. An SMC-R link is associated with a single LAN (or VLAN) segment and is not routable.

SMC-R Link Group
   An SMC-R link group is a group of SMC-R links between the same two SMC-R peers, typically with each link over unique RoCE adapters. Each link in the link group has equal characteristics, such as the same VLAN ID (if VLANs are in use), access to the same RMB(s), and access to the same TCP server/client.

SMC-R Peer
   The SMC-R peer is the peer software stack within the peer operating system with respect to the Shared Memory Communications (messaging) protocol.

SMC-R Rendezvous
   SMC-R Rendezvous is the SMC-R peer discovery and handshake sequence that occurs transparently over the IP (Ethernet) fabric during and immediately after the TCP connection three-way handshake by exchanging the SMC-R capabilities and credentials using an experimental TCP option and CLC messages.

RoCE SendMsg
   RoCE SendMsg is a send operation posted to a reliably connected queue pair with inline data, for the purpose of transferring control information between peers.

TCP Client
   The TCP client is the TCP socket-based peer that initiates a TCP connection.

TCP Server
   The TCP server is the TCP socket-based peer that accepts a TCP connection.

CLC Messages
   The SMC-R protocol defines a set of Connection Layer Control messages that flow over the TCP connection and are used to manage SMC-R link rendezvous at TCP connection setup time. This mechanism is analogous to SSL setup messages.

LLC Commands
   The SMC-R protocol defines a set of RoCE Link Layer Control commands that flow over the RoCE fabric using RoCE SendMsg and are used to manage SMC-R links, SMC-R link groups, and SMC-R link group RMB expansion and contraction.

CDC Message
   The SMC-R protocol defines a Connection Data Control message that flows over the RoCE fabric using RoCE SendMsg that is used to manage the SMC-R connection data. This message provides information about data being transferred over the out-of-band RDMA connection, such as data cursors, sequence numbers, and data flags (for example, urgent data). The receipt of this message also provides an interrupt to inform the receiver that it has received RDMA data.

RMB
   A Remote (RDMA) Memory Buffer is a fixed or pinned buffer allocated in each of the peer hosts for a TCP (via SMC-R) connection. The RMB is registered to the RNIC and allows remote access by the remote peer using RDMA semantics. Each host is passed the peer's RMB-specific access information (RMB Key (RKey) and RMB element offset) during the SMC-R Rendezvous process. The host stores socket application user data directly into the peer's RMB using RDMA over RoCE.

RToken
   The RToken is the combination of an RMB's RKey and RDMA virtual address. An RToken provides RMB addressability information to an RDMA peer.

RMBE
   The Remote Memory Buffer Element (RMBE) is an area of an RMB that is allocated to a specific TCP connection. The RMBE contains data for the TCP connection. The RMBE represents the TCP receive buffer, whereby the remote peer writes into the RMBE and the local peer reads from the local RMBE. The alert token resolves to a specific RMBE.

Alert Token
   The SMC-R alert token is a 4-byte value that uniquely identifies the TCP connection over an SMC-R connection. The alert token allows the SMC peer to quickly identify the target TCP connection that now has new work. The format of the token is defined by the owning SMC-R endpoint and is considered opaque to the remote peer. However, the token should not simply be an index to an RMBE; it should reference a TCP connection and be able to be validated to avoid reading data from stale connections.

RNIC
   The RDMA-capable Network Interface Card (RNIC) is an Ethernet NIC that supports RDMA semantics and verbs using RoCE.

First Contact
   "First contact" describes an SMC-R negotiation to set up the first link in a link group.

Subsequent Contact
   "Subsequent contact" describes an SMC-R negotiation between peers who are using an already-existing SMC-R link group.

1.3. Conventions Used in This Document
In the rendezvous flow diagrams, dashed lines (----) are used to indicate flows over the TCP/IP fabric and dotted lines (....) are used to indicate flows over the RoCE fabric.

In the data transfer ladder diagrams, dashed lines (----) are used to indicate RDMA write operations and dotted lines (....) are used to indicate CDC messages, which are RDMA messages with inline data that contain control information for the connection.
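
As an illustration of the RToken and alert token terms defined in Section 1.2, the sketch below shows one way an endpoint might represent them. Section 1.2 leaves the alert token format to the owning endpoint, so everything here is an assumption made for the example: the class names, the 16-bit slot plus 16-bit generation encoding, and the string connection identifiers are hypothetical. The generation counter is one possible way to make a token refer to a connection rather than a bare RMBE index, so that a token from a closed connection can be rejected.

```python
import struct
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RToken:
    """RMB addressability information: an RKey plus an RDMA virtual address."""
    rkey: int
    virtual_address: int


class AlertTokenTable:
    """Hypothetical owner-side table mapping 4-byte alert tokens to connections."""

    def __init__(self, slots: int) -> None:
        self._generation: List[int] = [0] * slots
        self._conn: List[Optional[str]] = [None] * slots

    def allocate(self, slot: int, conn_id: str) -> bytes:
        # Bump the generation so tokens issued for an earlier connection
        # that used this slot no longer validate (wraps after 65536 reuses).
        self._generation[slot] = (self._generation[slot] + 1) & 0xFFFF
        self._conn[slot] = conn_id
        return struct.pack("!HH", slot, self._generation[slot])

    def release(self, slot: int) -> None:
        self._conn[slot] = None

    def resolve(self, token: bytes) -> Optional[str]:
        # Return the connection the token refers to, or None if it is stale.
        slot, generation = struct.unpack("!HH", token)
        if slot >= len(self._conn) or self._conn[slot] is None:
            return None
        if generation != self._generation[slot]:
            return None
        return self._conn[slot]
```

A token returned by `allocate` stops resolving after `release`, and it stays invalid when the same slot is later reused for a different connection, which is the stale-connection check the alert token definition asks for.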