
RFC 7609

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Pages: 143
Informational
Part 1 of 6 – Pages 1 to 11

Independent Submission                                            M. Fox
Request for Comments: 7609                                   C. Kassimis
Category: Informational                                       J. Stevens
ISSN: 2070-1721                                                      IBM
                                                             August 2015


     IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Abstract

   This document describes IBM's Shared Memory Communications over RDMA
   (SMC-R) protocol.  This protocol provides Remote Direct Memory Access
   (RDMA) communications to TCP endpoints in a manner that is
   transparent to socket applications.  It further provides for dynamic
   discovery of partner RDMA capabilities and dynamic setup of RDMA
   connections, as well as transparent high availability and load
   balancing when redundant RDMA network paths are available.  It
   maintains many of the traditional TCP/IP qualities of service such as
   filtering that enterprise users demand, as well as TCP socket
   semantics such as urgent data.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This is a contribution to the RFC Series, independently of any other
   RFC stream.  The RFC Editor has chosen to publish this document at
   its discretion and makes no statement about its value for
   implementation or deployment.  Documents approved for publication by
   the RFC Editor are not a candidate for any level of Internet
   Standard; see Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc7609.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1. Introduction ....................................................5
      1.1. Protocol Overview ..........................................6
           1.1.1. Hardware Requirements ...............................8
      1.2. Definition of Common Terms .................................8
      1.3. Conventions Used in This Document .........................11
   2. Link Architecture ..............................................11
      2.1. Remote Memory Buffers (RMBs) ..............................12
      2.2. SMC-R Link Groups .........................................18
           2.2.1. Link Group Types ...................................18
           2.2.2. Maximum Number of Links in Link Group ..............21
           2.2.3. Forming and Managing Link Groups ...................23
           2.2.4. SMC-R Link Identifiers .............................24
      2.3. SMC-R Resilience and Load Balancing .......................24
   3. SMC-R Rendezvous Architecture ..................................26
      3.1. TCP Options ...............................................26
      3.2. Connection Layer Control (CLC) Messages ...................27
      3.3. LLC Messages ..............................................27
      3.4. CDC Messages ..............................................29
      3.5. Rendezvous Flows ..........................................29
           3.5.1. First Contact ......................................29
                  3.5.1.1. Pre-negotiation of TCP Options ............29
                  3.5.1.2. Client Proposal ...........................30
                  3.5.1.3. Server Acceptance .........................32
                  3.5.1.4. Client Confirmation .......................32
                  3.5.1.5. Link (QP) Confirmation ....................32
                  3.5.1.6. Second SMC-R Link Setup ...................35
                           3.5.1.6.1. Client Processing of ADD
                                      LINK LLC Message from Server ...35
                           3.5.1.6.2. Server Processing of ADD
                                      LINK Reply LLC Message
                                      from Client ....................36
                           3.5.1.6.3. Exchange of RKeys on
                                      Second SMC-R Link ..............38
                           3.5.1.6.4. Aborting SMC-R and
                                      Falling Back to IP .............38
           3.5.2. Subsequent Contact .................................38
                  3.5.2.1. SMC-R Proposal ............................39
                  3.5.2.2. SMC-R Acceptance ..........................40
                  3.5.2.3. SMC-R Confirmation ........................41
                  3.5.2.4. TCP Data Flow Race with SMC
                           Confirm CLC Message .......................41
           3.5.3. First Contact Variation: Creating a
                  Parallel Link Group ................................42
           3.5.4. Normal SMC-R Link Termination ......................43
           3.5.5. Link Group Management Flows ........................44
                  3.5.5.1. Adding and Deleting Links in an
                           SMC-R Link Group ..........................44
                           3.5.5.1.1. Server-Initiated ADD
                                      LINK Processing ................45
                           3.5.5.1.2. Client-Initiated ADD
                                      LINK Processing ................45
                           3.5.5.1.3. Server-Initiated DELETE
                                      LINK Processing ................46
                           3.5.5.1.4. Client-Initiated DELETE
                                      LINK Request ...................48
                  3.5.5.2. Managing Multiple RKeys over
                           Multiple SMC-R Links in a Link Group ......49
                           3.5.5.2.1. Adding a New RMB to an
                                      SMC-R Link Group ...............50
                           3.5.5.2.2. Deleting an RMB from an
                                      SMC-R Link Group ...............53
                           3.5.5.2.3. Adding a New SMC-R Link to a
                                      Link Group with Multiple RMBs ..54
                  3.5.5.3. Serialization of LLC Exchanges,
                           and Collisions ............................56
                           3.5.5.3.1. Collisions with ADD
                                      LINK / CONFIRM LINK Exchange ...57
                           3.5.5.3.2. Collisions during
                                      DELETE LINK Exchange ...........58
                           3.5.5.3.3. Collisions during
                                      CONFIRM RKEY Exchange ..........59
   4. SMC-R Memory-Sharing Architecture ..............................60
      4.1. RMB Element Allocation Considerations .....................60
      4.2. RMB and RMBE Format .......................................60
      4.3. RMBE Control Information ..................................60
      4.4. Use of RMBEs ..............................................61
           4.4.1. Initializing and Accessing RMBEs ...................61
           4.4.2. RMB Element Reuse and Conflict Resolution ..........62
      4.5. SMC-R Protocol Considerations .............................63
           4.5.1. SMC-R Protocol Optimized Window Size Updates .......63
           4.5.2. Small Data Sends ...................................64
           4.5.3. TCP Keepalive Processing ...........................65
      4.6. TCP Connection Failover between SMC-R Links ...............67
           4.6.1. Validating Data Integrity ..........................67
           4.6.2. Resuming the TCP Connection on a New SMC-R Link ....68
      4.7. RMB Data Flows ............................................69
           4.7.1. Scenario 1: Send Flow, Window Size Unconstrained ...69
           4.7.2. Scenario 2: Send/Receive Flow, Window Size
                  Unconstrained ......................................71
           4.7.3. Scenario 3: Send Flow, Window Size Constrained .....72
           4.7.4. Scenario 4: Large Send, Flow Control, Full
                  Window Size Writes .................................74
           4.7.5. Scenario 5: Send Flow, Urgent Data, Window
                  Size Unconstrained .................................77
           4.7.6. Scenario 6: Send Flow, Urgent Data, Window
                  Size Closed ........................................79
      4.8. Connection Termination ....................................81
           4.8.1. Normal SMC-R Connection Termination Flows ..........81
           4.8.2. Abnormal SMC-R Connection Termination Flows ........86
           4.8.3. Other SMC-R Connection Termination Conditions ......88
   5. Security Considerations ........................................89
      5.1. VLAN Considerations .......................................89
      5.2. Firewall Considerations ...................................89
      5.3. Host-Based IP Filters .....................................89
      5.4. Intrusion Detection Services ..............................90
      5.5. IP Security (IPsec) .......................................90
      5.6. TLS/SSL ...................................................90
   6. IANA Considerations ............................................90
   7. Normative References ...........................................91
   Appendix A. Formats ...............................................92
     A.1. TCP Option .................................................92
     A.2. CLC Messages ...............................................92
          A.2.1. Peer ID Format ......................................93
          A.2.2. SMC Proposal CLC Message Format .....................94
          A.2.3. SMC Accept CLC Message Format .......................98
          A.2.4. SMC Confirm CLC Message Format .....................102
          A.2.5. SMC Decline CLC Message Format .....................105
     A.3. LLC Messages ..............................................106
          A.3.1. CONFIRM LINK LLC Message Format ....................107
          A.3.2. ADD LINK LLC Message Format ........................109
          A.3.3. ADD LINK CONTINUATION LLC Message Format ...........112
          A.3.4. DELETE LINK LLC Message Format .....................115
          A.3.5. CONFIRM RKEY LLC Message Format ....................117
          A.3.6. CONFIRM RKEY CONTINUATION LLC Message Format .......120
          A.3.7. DELETE RKEY LLC Message Format .....................122
          A.3.8. TEST LINK LLC Message Format .......................124
     A.4. Connection Data Control (CDC) Message Format ..............125
   Appendix B. Socket API Considerations ............................129
     B.1. setsockopt() / getsockopt() Considerations ................130
   Appendix C. Rendezvous Error Scenarios ...........................131
     C.1. SMC Decline during CLC Negotiation ........................131
     C.2. SMC Decline during LLC Negotiation ........................131
     C.3. The SMC Decline Window ....................................133
     C.4. Out-of-Sync Conditions during SMC-R Negotiation ...........133
     C.5. Timeouts during CLC Negotiation ...........................134
     C.6. Protocol Errors during CLC Negotiation ....................134
     C.7. Timeouts during LLC Negotiation ...........................135
          C.7.1. Recovery Actions for LLC Timeouts and Failures .....136
     C.8. Failure to Add Second SMC-R Link to a Link Group ..........142
   Authors' Addresses ...............................................143

1. Introduction

   This document specifies IBM's Shared Memory Communications over RDMA
   (SMC-R) protocol.  SMC-R is a protocol for Remote Direct Memory
   Access (RDMA) communication between TCP socket endpoints.  SMC-R runs
   over networks that support RDMA over Converged Ethernet (RoCE).  It
   is designed to permit existing TCP applications to benefit from RDMA
   without requiring modifications to the applications or predefinition
   of RDMA partners.

   SMC-R provides dynamic discovery of the RDMA capabilities of TCP
   peers and automatic setup of RDMA connections that those peers can
   use.  SMC-R also provides transparent high availability and
   load-balancing capabilities that are demanded by enterprise
   installations but are missing from current RDMA protocols.  If
   redundant RoCE-capable hardware such as RDMA-capable Network
   Interface Cards (RNICs) and RoCE-capable switches is present, SMC-R
   can load-balance over that redundant hardware and can also
   non-disruptively move TCP traffic from failed paths to surviving
   paths, all seamlessly to the application and the sockets layer.

   Because SMC-R preserves socket semantics and the TCP three-way
   handshake, many TCP qualities of service such as filtering, load
   balancing, and Secure Socket Layer (SSL) encryption are preserved, as
   are TCP features such as urgent data.

   Because of the dynamic discovery and setup of SMC-R connectivity
   between peers, no RDMA connection manager (RDMA-CM) is required.
   This also means that support for Unreliable Datagram (UD) Queue Pairs
   (QPs) is not required.

   It is recommended that the SMC-R services be implemented in kernel
   space, which enables optimizations such as resource-sharing between
   connections across multiple processes and also permits applications
   using SMC-R to spawn multiple processes (e.g., fork) without losing
   SMC-R functionality.  A user-space implementation is compatible with
   this architecture, but it may not support spawned processes (e.g.,
   fork), which limits sharing and resource optimization to TCP
   connections that originate from the same process.  This might be an
   appropriate design choice if the use case is a system that hosts a
   large single process application that creates many TCP connections to
   a peer host, or in implementations where a kernel-space
   implementation is not possible or introduces excessive overhead for
   "kernel space to user space" context switches.

1.1. Protocol Overview

   SMC-R defines the concept of the SMC-R link, which is a logical
   point-to-point link using reliably connected queue pairs between
   TCP/IP stack peers over a RoCE fabric.  An SMC-R link is bound to a
   specific hardware path, meaning a specific RNIC on each peer.  SMC-R
   links are created and maintained by an SMC-R layer, which may reside
   in kernel space or user space, depending upon operating system and
   implementation requirements.

   The SMC-R layer resides below the sockets layer and directs data
   traffic for TCP connections between connected peers over the RoCE
   fabric using RDMA rather than over a TCP connection.  The TCP/IP
   stack, with its requirements for fragmentation, packetization, etc.,
   is bypassed, and the application data is moved between peers using
   RDMA.

   Multiple SMC-R links between the same two TCP/IP stack peers are also
   supported.  A set of SMC-R links called a link group can be logically
   bonded together to provide redundant connectivity.  If there is
   redundant hardware -- for example, two RNICs on each peer -- separate
   SMC-R links are created between the peers to exploit that redundant
   hardware.  The link group architecture with redundant links provides
   load balancing and increased bandwidth, as well as seamless failover.

   Each SMC-R link group is associated with Remote Memory Buffers
   (RMBs), areas of memory that are available for SMC-R peers to write
   into using RDMA writes.  Multiple TCP connections between peers may
   be multiplexed over a single SMC-R link, in which case the SMC-R
   layer manages the partitioning of the RMBs between the TCP
   connections.  This multiplexing reduces the RDMA resources, such as
   QPs and RMBs, that are required to support multiple connections
   between peers, and it also reduces the processing and delays related
   to setting up QPs, pinning memory, and other RDMA setup tasks when
   new TCP connections are created.  In a kernel-space SMC-R
   implementation in which the RMBs reside in kernel
   storage, this sharing and optimization works across multiple
   processes executing on the same host.  In a user-space SMC-R
   implementation in which the RMBs reside in user space, this sharing
   and optimization is limited to multiple TCP connections created by a
   single process, as separate RMBs and QPs will be required for each
   process.
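
   As a rough illustration of the multiplexing described above, the
   following Python sketch partitions one RMB into slices, one per TCP
   connection.  The class name, the equal-size slicing, and the
   slot-reuse policy are assumptions made for illustration only; the
   actual RMB layout is the subject of Section 4.

```python
from __future__ import annotations


class RmbPartitioner:
    """Hypothetical allocator that splits one RMB into equal slices."""

    def __init__(self, rmb_size: int, element_size: int) -> None:
        if element_size <= 0 or rmb_size < element_size:
            raise ValueError("RMB must hold at least one element")
        self.element_size = element_size
        # Slot numbers that are not currently owned by any connection.
        self.free_slots = list(range(rmb_size // element_size))
        self.slot_by_conn: dict[str, int] = {}

    def attach(self, conn_id: str) -> tuple[int, int]:
        """Reserve a slice for a TCP connection; return (offset, length)."""
        if conn_id in self.slot_by_conn:
            raise KeyError(f"{conn_id} already attached")
        if not self.free_slots:
            raise MemoryError("RMB full; another RMB would be needed")
        slot = self.free_slots.pop(0)
        self.slot_by_conn[conn_id] = slot
        return slot * self.element_size, self.element_size

    def detach(self, conn_id: str) -> None:
        """Release the slice so that a later connection can reuse it."""
        self.free_slots.append(self.slot_by_conn.pop(conn_id))
```

   In this sketch, several connections share the one registered buffer,
   so a new connection costs only a slot lookup rather than a fresh
   memory registration.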

   SMC-R also introduces a rendezvous protocol that is used to
   dynamically discover the RDMA capabilities of TCP connection partners
   and exchange credentials necessary to exploit that capability if
   present.  TCP connections are set up using the normal TCP three-way
   handshake [RFC793], with the addition of a new TCP option that
   indicates SMC-R capability.  If both partners indicate SMC-R
   capability, then at the completion of the three-way TCP handshake the
   SMC-R layers in each peer take control of the TCP connection and use
   it to exchange additional Connection Layer Control (CLC) messages to
   negotiate SMC-R credentials such as QP information; addressability
   over the RoCE fabric; RMB buffer sizes; and keys and addresses for
   accessing RMBs over RDMA.  If at any time during this negotiation a
   failure or decline occurs, the TCP connection falls back to using the
   IP fabric.
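
   The decision described in this paragraph can be sketched as follows.
   This is a minimal Python illustration, not the protocol's wire
   behavior: the names HandshakeResult, Transport, and choose_transport
   are hypothetical, and the CLC exchange is reduced to a caller-supplied
   function that reports success or failure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Transport(Enum):
    TCP_IP = "tcp/ip"
    SMC_R = "smc-r"


@dataclass
class HandshakeResult:
    client_sent_smc_option: bool  # client indicated SMC-R capability
    server_sent_smc_option: bool  # server indicated SMC-R capability


def choose_transport(handshake: HandshakeResult,
                     clc_exchange: Callable[[], bool]) -> Transport:
    """Pick the data path once the TCP three-way handshake has completed.

    `clc_exchange` is a hypothetical stand-in for the CLC negotiation.
    It returns True on success and returns False or raises on a decline
    or failure.
    """
    both_capable = (handshake.client_sent_smc_option
                    and handshake.server_sent_smc_option)
    if not both_capable:
        return Transport.TCP_IP   # no SMC-R negotiation is attempted
    try:
        negotiated = clc_exchange()
    except (TimeoutError, ConnectionError):
        negotiated = False        # any failure falls back to IP
    return Transport.SMC_R if negotiated else Transport.TCP_IP
```

   For example, choose_transport(HandshakeResult(True, True),
   lambda: False) yields Transport.TCP_IP, matching the rule that a
   decline during negotiation leaves the connection on the IP fabric.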

   If the SMC-R negotiation succeeds and either a new SMC-R link is set
   up or an existing SMC-R link is chosen for the TCP connection, then
   the SMC-R layers open the sockets to the applications and the
   applications use the sockets as normal.  The SMC-R layer intercepts
   the socket reads and writes and moves the TCP connection data over
   the SMC-R link, "out of band" to the TCP connection, which remains
   open and idle over the IP fabric, except for termination flows and
   possible keepalive flows.  Regular TCP sequence numbering methods are
   used for the TCP flows that do occur; data flowing over RDMA does not
   use or affect TCP sequence numbers.

   This architecture does not support fallback of active SMC-R
   connections to IP.  Once connection data has completed the switch to
   RDMA, a TCP connection cannot be switched back to IP and will reset
   if RDMA becomes unusable.
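
   The asymmetry between fallback during negotiation and reset after the
   switch to RDMA can be expressed as a small state transition.  The
   sketch below is illustrative only; the Phase names and the function
   on_rdma_failure are assumptions, not terms defined by this document.

```python
from enum import Enum, auto


class Phase(Enum):
    NEGOTIATING = auto()  # SMC-R negotiation still in progress
    IP = auto()           # connection data flows over the IP fabric
    RDMA = auto()         # connection data flows over an SMC-R link
    RESET = auto()        # connection has been reset


def on_rdma_failure(phase: Phase) -> Phase:
    """Return the next phase when RDMA becomes unusable.

    "Unusable" is taken here to mean that no SMC-R link remains that
    could carry the connection (an assumption of this sketch).
    """
    if phase is Phase.NEGOTIATING:
        return Phase.IP     # fallback is still possible at this point
    if phase is Phase.RDMA:
        return Phase.RESET  # no switch back to IP once data uses RDMA
    return phase            # IP and RESET are unaffected
```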

   The SMC-R protocol defines the format of the RMBs that are used to
   receive TCP connection data written over RDMA, as well as the
   semantics for managing and writing to these buffers using Connection
   Data Control (CDC) messages.

   Finally, SMC-R defines Link Layer Control (LLC) messages that are
   exchanged over the RoCE fabric between peer SMC-R layers to manage
   the SMC-R links and link groups.  These include messages to test and
   confirm connectivity over an SMC-R link, add and delete SMC-R links
   to or from the link group, and exchange RMB addressability
   information.
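
   The bookkeeping that link management implies can be pictured with a
   toy model of a link group.  Everything in this Python sketch is
   hypothetical: the LinkGroup class, the least-loaded placement policy,
   and the string identifiers are chosen for illustration and are not
   prescribed by this document.

```python
from __future__ import annotations


class LinkGroup:
    """Hypothetical model of a link group and the connections it carries."""

    def __init__(self) -> None:
        self.conns_by_link: dict[str, set[str]] = {}

    def add_link(self, link_id: str) -> None:
        """Record a new link; existing links are left untouched."""
        self.conns_by_link.setdefault(link_id, set())

    def assign(self, conn_id: str) -> str:
        """Place a TCP connection on the least-loaded link."""
        if not self.conns_by_link:
            raise RuntimeError("link group has no links")
        link_id = min(self.conns_by_link,
                      key=lambda k: len(self.conns_by_link[k]))
        self.conns_by_link[link_id].add(conn_id)
        return link_id

    def delete_link(self, link_id: str) -> list[str]:
        """Remove a link and move its connections to surviving links.

        Returns the connections that could not be moved because no
        link survives.
        """
        orphans = sorted(self.conns_by_link.pop(link_id))
        if not self.conns_by_link:
            return orphans
        for conn_id in orphans:
            self.assign(conn_id)
        return []
```

   With two links in the group, deleting one moves its connections to
   the other; deleting the last link returns the stranded connections to
   the caller.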

1.1.1. Hardware Requirements

   SMC-R does not require full Converged Enhanced Ethernet switch
   functionality.  SMC-R functions over standard Ethernet fabrics,
   provided that endpoint RNICs are present and IEEE 802.3x Global Pause
   Frame is supported and enabled in the switch fabric.

   While SMC-R as specified in this document is designed to operate over
   RoCE fabrics, adjustments to the rendezvous methods could enable it
   to run over other RDMA fabrics, such as InfiniBand [RoCE] and iWARP.

1.2. Definition of Common Terms

   This section provides definitions of terms that have a specific
   meaning to the SMC-R protocol and are used throughout this document.

   SMC-R Link

      An SMC-R link is a logical point-to-point connection over the RoCE
      fabric via specific physical adapters (Media Access Control /
      Global Identifier (MAC/GID)).  The link is formed during the
      "first contact" sequence of the TCP/IP three-way handshake
      sequence that occurs over the IP fabric.  During this handshake,
      an RDMA reliably connected queue pair (RC-QP) connection is formed
      between the two peer SMC hosts and is defined as the SMC-R link.
      The SMC-R link can then support multiple TCP connections between
      the two peers.  An SMC-R link is associated with a single LAN (or
      VLAN) segment and is not routable.

   SMC-R Link Group

      An SMC-R link group is a group of SMC-R links between the same two
      SMC-R peers, typically with each link over unique RoCE adapters.
      Each link in the link group has equal characteristics, such as the
      same VLAN ID (if VLANs are in use), access to the same RMB(s), and
      access to the same TCP server/client.

   SMC-R Peer

      The SMC-R peer is the peer software stack within the peer
      operating system with respect to the Shared Memory Communications
      (messaging) protocol.

   SMC-R Rendezvous

      SMC-R Rendezvous is the SMC-R peer discovery and handshake
      sequence that occurs transparently over the IP (Ethernet) fabric
      during and immediately after the TCP connection three-way
      handshake by exchanging the SMC-R capabilities and credentials
      using an experimental TCP option and CLC messages.

   RoCE SendMsg

      RoCE SendMsg is a send operation posted to a reliably connected
      queue pair with inline data, for the purpose of transferring
      control information between peers.

   TCP Client

      The TCP client is the TCP socket-based peer that initiates a TCP
      connection.

   TCP Server

      The TCP server is the TCP socket-based peer that accepts a TCP
      connection.

   CLC Messages

      The SMC-R protocol defines a set of Connection Layer Control
      messages that flow over the TCP connection that are used to manage
      SMC-R link rendezvous at TCP connection setup time.  This
      mechanism is analogous to SSL setup messages.

   LLC Commands

      The SMC-R protocol defines a set of RoCE Link Layer Control
      commands that flow over the RoCE fabric using RoCE SendMsg, that
      are used to manage SMC-R links, SMC-R link groups, and SMC-R
      link group RMB expansion and contraction.

   CDC Message

      The SMC-R protocol defines a Connection Data Control message that
      flows over the RoCE fabric using RoCE SendMsg that is used to
      manage the SMC-R connection data.  This message provides
      information about data being transferred over the out-of-band RDMA
      connection, such as data cursors, sequence numbers, and data flags
      (for example, urgent data).  The receipt of this message also
      provides an interrupt to inform the receiver that it has received
      RDMA data.

   RMB

      A Remote (RDMA) Memory Buffer is a fixed or pinned buffer
      allocated in each of the peer hosts for a TCP (via SMC-R)
      connection.  The RMB is registered to the RNIC and allows remote
      access by the remote peer using RDMA semantics.  Each host is
      passed the peer's RMB-specific access information (RMB Key (RKey)
      and RMB element offset) during the SMC-R Rendezvous process.  The
      host stores socket application user data directly into the peer's
      RMB using RDMA over RoCE.

   RToken

      The RToken is the combination of an RMB's RKey and RDMA virtual
      address.  An RToken provides RMB addressability information to an
      RDMA peer.
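
   As a minimal sketch of this definition, an RToken can be modeled as a
   pair of values from which the target of an RDMA write is derived.
   The class below is hypothetical Python for illustration; field widths
   and encoding are deliberately not modeled, and the target() helper is
   an assumption rather than part of the protocol.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RToken:
    rkey: int             # RMB key supplied by the owner of the buffer
    virtual_address: int  # RDMA virtual address of the RMB

    def target(self, offset: int) -> tuple[int, int]:
        """Return (rkey, address) for a write `offset` bytes into the RMB."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self.rkey, self.virtual_address + offset
```

   A peer holding RToken(rkey=0x1234, virtual_address=0x7F0000000000)
   would address a write 4096 bytes into the buffer as
   (0x1234, 0x7F0000001000).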

   RMBE

      The Remote Memory Buffer Element (RMBE) is an area of an RMB that
      is allocated to a specific TCP connection.  The RMBE contains data
      for the TCP connection.  The RMBE represents the TCP receive
      buffer, whereby the remote peer writes into the RMBE and the local
      peer reads from the local RMBE.  The alert token resolves to a
      specific RMBE.

   Alert Token

      The SMC-R alert token is a 4-byte value that uniquely identifies
      the TCP connection over an SMC-R connection.  The alert token
      allows the SMC peer to quickly identify the target TCP connection
      that now has new work.  The format of the token is defined by the
      owning SMC-R endpoint and is considered opaque to the remote peer.
      However, the token should not simply be an index to an RMBE; it
      should reference a TCP connection and be able to be validated to
      avoid reading data from stale connections.

   RNIC

      The RDMA-capable Network Interface Card (RNIC) is an Ethernet NIC
      that supports RDMA semantics and verbs using RoCE.

   First Contact

      "First contact" describes an SMC-R negotiation to set up the first
      link in a link group.

   Subsequent Contact

      "Subsequent contact" describes an SMC-R negotiation between peers
      who are using an already-existing SMC-R link group.

1.3. Conventions Used in This Document

   In the rendezvous flow diagrams, dashed lines (----) are used to
   indicate flows over the TCP/IP fabric and dotted lines (....) are
   used to indicate flows over the RoCE fabric.

   In the data transfer ladder diagrams, dashed lines (----) are used to
   indicate RDMA write operations and dotted lines (....) are used to
   indicate CDC messages, which are RDMA messages with inline data that
   contain control information for the connection.


(page 11 continued on part 2)
