2. Link Architecture
An SMC-R link is based on reliably connected queue pairs (QPs) that form a "logical point-to-point link" between the two SMC-R peers over a RoCE fabric. An SMC-R link extends from SMC-R peer to SMC-R peer, where each peer is typically a TCP/IP stack and the two peers reside on separate hosts.

[Diagram: QP 8 on RNIC 1 (MAC A, GID A) and QP 64 on RNIC 2 (MAC B, GID B) are joined by a single SMC-R link that crosses the RoCE fabric on VLAN M.]

Figure 1: SMC-R Link Overview
Figure 1 gives an overview of the basic concepts of SMC-R peer-to-peer connectivity; this connectivity is called the SMC-R link. The SMC-R link forms a logical point-to-point connection between two SMC-R peers via RoCE. The SMC-R link is defined and identified by the following attributes:

   SMC-R link = RC QPs (source VMAC GID QP + target VMAC GID QP + VLAN ID)

The SMC-R link can optionally be associated with a VLAN ID. If VLANs are in use for the associated IP (LAN) connection, then the VLAN attribute is carried over on the SMC-R link. When VLANs are in use, each SMC-R link group is associated with a single and specific VLAN. The RoCE fabric is the same physical Ethernet LAN used for standard TCP/IP-over-Ethernet communications, with switches as described in Section 1.1.1.

An SMC-R link is designed to support multiple TCP connections between the same two peers. An SMC-R link is intended to be long lived, while the underlying TCP connections can dynamically come and go. The associated RMBs can also be dynamically added to and removed from the link as needed. The first TCP connection between the peers establishes the SMC-R link. Subsequent TCP connections then use the previously established link. When the last TCP connection terminates, the link can then be terminated, typically after an implementation-defined idle timeout period has elapsed. The TCP server is responsible for initiating and terminating the SMC-R link.

2.1. Remote Memory Buffers (RMBs)
Figure 2 shows the hosts -- Hosts X and Y -- and their associated RMBs within each host. With the SMC-R link, and the associated RKeys and RDMA virtual addresses, each SMC-R-enabled TCP/IP stack can remotely access its peer's RMBs using RDMA. The RKeys and virtual addresses are exchanged during the rendezvous processing when the link is established. The combination of the RKey and the virtual address is the RToken. Note that the SMC-R link ends at the QP providing access to the RMB (via the link + RToken).
[Diagram: Host X contains Protection Domain X, QP 8, RNIC 1, and an RMB referenced by RToken X. Host Y contains Protection Domain Y, QP 64, RNIC 2, and an RMB referenced by RToken Y. A single SMC-R link joins RNIC 1 and RNIC 2 across the RoCE fabric on VLAN A.]

Figure 2: SMC-R Link and RMBs

An SMC-R link can support multiple RMBs that are independently managed by each peer. The number and the size of RMBs are managed by the peers based on the host's unique memory management requirements; however, the maximum number of RMBs that can be associated with a link group on one peer is 255. The QP has a single protection domain, but each RMB has a unique RToken. All RTokens must be exchanged with the peer.

Each peer manages the RMBs in its local memory for its remote SMC-R peer by sharing access to the RMBs via RTokens with that peer. The remote peer writes into the RMBs via RDMA, and the local peer (RMB owner) then reads from the RMBs.

When two peers decide to use SMC-R for a given TCP connection, they each allocate a local RMB element for the TCP connection and communicate the location of this local RMB element during rendezvous processing. To that end, RMB elements are created in pairs, with one RMB element allocated locally on each peer of the SMC-R link.
[Diagram: An RMB is laid out as a sequence of RMB elements (RMB Element 1, RMB Element 2, and so on, up to 255 elements). Each RMB element consists of an eye catcher followed by a receive buffer.]

Figure 3: RMB Format

Figure 3 illustrates the basic format of an RMB. The RMB is a virtual memory buffer whose backing real memory is pinned; it can support up to 255 TCP connections to exactly one remote SMC-R peer. Each RMB is therefore associated with the SMC-R links within a link group for the two peers and a specific RoCE Protection Domain. Other than the two peers identified by the SMC-R link, no other SMC-R peers can have RDMA access to an RMB; this requires a unique Protection Domain for every SMC-R link. This is critical to ensure integrity of SMC-R communications.

RMBs are subdivided into multiple elements for efficiency, with each RMB Element (RMBE) associated with a single TCP connection. Therefore, multiple TCP connections across an SMC-R link group can share the same memory for RDMA purposes, reducing the overhead of having to register additional memory with the RNIC for every new TCP connection. The number of elements in an RMB and the size of each RMBE are entirely governed by the owning peer, subject to the SMC-R architecture rules; however, all RMB elements within a given RMB must be the same size. Each peer can decide the level of resource-sharing that is desirable across TCP connections based on local constraints, such as available system memory.

An RMB element is identified to the remote SMC-R peer via an RMB Element Token, which consists of the following:

o RMB RToken: The combination of the RKey and virtual address provided by the RNIC that identifies the start of the RMB for RDMA operations.

o RMB Index: Identifies the RMB element index in the RMB. Used to locate a specific RMB element within an RMB. Valid value range is 1-255.

o RMB Element Length: The length of the RMB element's eye catcher plus the length of the receive buffer. This length is equal for all RMB elements in a given RMB. This length can vary across different RMBs.

Multiple RMBs can be associated with an SMC-R link group, and each peer in an SMC-R link group manages allocation of its RMBs. RMB allocation can be asymmetric. For example, Server X can allocate two RMBs to an SMC-R link group while Server Y allocates five. This provides maximum implementation flexibility to allow hosts to optimize RMB management for their own local requirements.

The maximum number of RMBs that can be allocated on one peer to a link group is 255. If more RMBs are required, the peer may fall back to IP for subsequent connections or, if the peer is the server, create a parallel link group.

One use case for multiple RMBs is multiple receive buffer sizes. Since every element in an RMB must be the same size, multiple RMBs with different element sizes can be allocated if varying receive buffer sizes are required. Also, since the maximum number of TCP connections whose receive buffers can be allocated to an RMB is 255, multiple RMBs may be required to provide capacity for large numbers of TCP connections between two peers.
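As an illustration of how the three parts of an RMB Element Token fit together, the following minimal sketch computes where an RMB element might begin. The class name, field names, and the addressing rule (elements laid out contiguously, with index 1 starting at the RMB's virtual address) are assumptions made for this example; the text above only states that the index locates an element within the RMB.

```python
from dataclasses import dataclass

MAX_RMB_ELEMENTS = 255  # valid RMB index range is 1-255


@dataclass(frozen=True)
class RmbElementToken:
    """Hypothetical in-memory form of an RMB Element Token."""
    rkey: int             # RKey half of the RMB RToken
    virtual_address: int  # virtual address of the start of the RMB
    index: int            # RMB element index, 1-255
    element_length: int   # eye catcher plus receive buffer, in bytes

    def __post_init__(self):
        if not 1 <= self.index <= MAX_RMB_ELEMENTS:
            raise ValueError("RMB index must be in the range 1-255")
        if self.element_length <= 0:
            raise ValueError("RMB element length must be positive")

    def element_address(self) -> int:
        # Assumption: elements are contiguous and equally sized, and
        # element 1 begins at the RMB's starting virtual address.
        return self.virtual_address + (self.index - 1) * self.element_length


token = RmbElementToken(rkey=0x1234, virtual_address=0x10000,
                        index=3, element_length=4096)
print(hex(token.element_address()))
```

Because every element in a given RMB has the same length, a single multiplication is enough to locate an element under this layout assumption.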
Separately from the RMB, the TCP/IP stack that owns each RMB maintains control data for each RMB element within its local control structures. The control data contains flags for maintaining the state of the TCP data (for example, urgent data indicator) and, most importantly, the following two cursors, which are illustrated below in Figure 4: o The peer producer cursor: This is a wrapping offset into the RMB element's receive buffer that points to the next byte of data to be written by the remote peer. This cursor is provided by the remote peer in a Connection Data Control (CDC) message, which is sent using RoCE SendMsg processing, and tells the local peer how far it can consume data in the RMBE buffer. o The peer consumer cursor: This is a wrapping offset into the remote peer's RMB element's receive buffer that points to the next byte of data to be consumed by the remote peer in its own RMBE. The local peer cannot write into the remote peer's RMBE beyond this point without causing data loss. This cursor is also provided by the peer using a Connection Data Control message. Each TCP connection peer maintains its cursors for a TCP connection's RMBE in its local control structures. In other words, the peer who writes into a remote peer's RMBE provides its producer cursor to the peer whose RMBE it has written into. The peer who reads from its RMBE provides its consumer cursor to the writing peer. In this manner, the reads and writes between peers are kept coordinated. For example, referring to Figure 4, Peer B writes the hashed data into the receive buffer of Peer A's RMBE. After that write completes, Peer B uses a CDC message to update its producer cursor to Peer A, to indicate to Peer A how much data is available for Peer A to consume. The CDC message that Peer B sends to Peer A wakes up Peer A and notifies it that there is data to be consumed. 
Similarly, when Peer A consumes data written by Peer B, it uses a CDC message to update its consumer cursor to Peer B to let Peer B know how much data it has consumed, so Peer B knows how much space is available for further writes. If Peer B were to write enough data to Peer A that it would wrap the RMBE receive buffer and exceed the consumer cursor, data loss would result. Note that this is a simplistic description of the control flows, and they are optimized to minimize the number of CDC messages required, as described in Section 4.7 ("RMB Data Flows").
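The cursor coordination described above can be sketched as follows. This is a simplified model, not the wire format: the `wrap` counter used to distinguish a full buffer from an empty one, and all function names, are assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cursor:
    wrap: int    # hypothetical count of how many times the offset has wrapped
    offset: int  # byte offset into the RMBE receive buffer


def advance(cursor: Cursor, nbytes: int, buf_size: int) -> Cursor:
    """Move a cursor forward by nbytes, wrapping at the end of the buffer."""
    total = cursor.offset + nbytes
    return Cursor(cursor.wrap + total // buf_size, total % buf_size)


def bytes_to_consume(producer: Cursor, consumer: Cursor, buf_size: int) -> int:
    """Data written by the remote peer that the RMBE owner has not yet read."""
    return ((producer.wrap - consumer.wrap) * buf_size
            + producer.offset - consumer.offset)


def writable_space(producer: Cursor, consumer: Cursor, buf_size: int) -> int:
    """Bytes the writer may still place in the RMBE without overwriting data."""
    return buf_size - bytes_to_consume(producer, consumer, buf_size)


BUF = 100
producer = consumer = Cursor(0, 0)
producer = advance(producer, 80, BUF)  # Peer B writes 80 bytes, sends CDC
consumer = advance(consumer, 50, BUF)  # Peer A consumes 50 bytes, sends CDC
print(writable_space(producer, consumer, BUF))
```

In this model, Peer B would check `writable_space` before each RDMA write; writing more than that amount corresponds to the data-loss case described above.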
[Diagram: Peer A's RMBE control information holds the peer producer cursor, and Peer B's RMBE control information holds the peer consumer cursor. Both cursors point into the receive buffer of Peer A's RMBE, where they delimit the hashed region: RDMA data written by Peer B that is available to be consumed.]

Figure 4: RMBE Cursors

Additional flags and indicators are communicated between peers. In all cases, these flags and indicators are updated by the peer using CDC messages, which are sent using RoCE SendMsg. These additional flags and indicators are described in more detail in Section 4.3 ("RMBE Control Information").
2.2. SMC-R Link Groups
SMC-R links are logically grouped together to form an SMC-R link group. The purpose of the link group is to support multiple links between the same two peers in order to provide:

o Resilience: Provides transparent and dynamic switching of the link used by existing TCP connections during link failures, typically hardware related. TCP traffic using the failing link can be switched to an active link within the link group, thereby avoiding disruptions to application workloads.

o Link utilization: Provides an active/active link usage model allowing TCP traffic to be balanced across the links, which increases bandwidth and also avoids hardware imbalances and bottlenecks. Note that both adapter and switch utilization can become potential resource constraint issues.

SMC-R link group support is required. Resilience is not optional. However, the user can elect to provision a single RNIC (on one or both hosts).

Multiple links that are formed between the same two peers fall into two distinct categories:

1. Equal Links: Links providing equal access to the same RMB(s) at both endpoints, whereby all TCP connections associated with the links must have the same VLAN ID and have the same TCP server and TCP client roles or relationship.

2. Unequal Links: Links providing access to unique, unrelated, and isolated RMB(s) (i.e., for unique VLANs or unique and isolated application workloads, etc.) or having unique TCP server or client roles.

Links that are logically grouped together forming an SMC-R link group must be equal links.

2.2.1. Link Group Types
Equal links within a link group also have another "Link Group Type" attribute based on the link's associated underlying physical path. The following SMC-R link types are defined:

1. Single link: the only active link within a link group

2. Parallel link: not allowed -- SMC-R links having the same physical RNIC at both hosts

3. Asymmetric link: links that have unique RNIC adapters at one host but share a single adapter at the peer host

4. Symmetric link: links that have unique RNIC adapters at both hosts

These link group types are further explained in the following figures and descriptions.

Figure 2 above shows the single-link case. The single link illustrated in Figure 2 also establishes the SMC-R link group. Link groups are supposed to have multiple links, but when only one RNIC is available at both hosts, only a single link can be created. This is expected to be a transient case.

Figure 5 shows the symmetric-link case. Both hosts have unique and redundant RNIC adapters. This configuration provides the full RoCE redundancy needed for the level of resilience that high availability for SMC-R requires. While this configuration is not required, it is a strongly recommended "best practice" for the exploitation of SMC-R. Single and asymmetric links must be supported but are intended to provide for short-term transient conditions -- for example, during a temporary outage or recycle of an RNIC.

[Diagram: Host X (Protection Domain X) has RNIC 1 with QP 8 and RNIC 3 with QP 9; its RMB is referenced by RToken X and RToken Z. Host Y (Protection Domain Y) has RNIC 2 with QP 64 and RNIC 4 with QP 65; its RMB is referenced by RToken Y and RToken W. SMC-R Link 1 joins RNIC 1 and RNIC 2, and SMC-R Link 2 joins RNIC 3 and RNIC 4.]

Figure 5: Symmetric SMC-R Links
[Diagram: Host X (Protection Domain X) has RNIC 1 with QP 8 and RNIC 3 with QP 9; its RMB is referenced by RToken X and RToken Z. Host Y (Protection Domain Y) has RNIC 2 with QP 64 and an RMB referenced by RToken Y; RNIC 4 is down or unavailable. SMC-R Link 1 joins RNIC 1 and RNIC 2, and SMC-R Link 2 joins RNIC 3 and RNIC 2.]

Figure 6: Asymmetric SMC-R Links

In the example provided by Figure 6, Host X has two RNICs but Host Y has only one usable RNIC because RNIC 4 is not available. This configuration allows for the creation of an asymmetric link. While an asymmetric link will provide some resilience (for example, when RNIC 1 fails), ideally each host should provide two redundant RNICs. This should be a transient case, and when RNIC 4 becomes available, this configuration must transition to a symmetric-link configuration. This transition is accomplished by first creating the new symmetric link and then deleting the asymmetric link with reason code "Asymmetric link no longer needed" specified in the DELETE LINK LLC message.
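The link types defined in this section, including the parallel case that the protocol does not permit, can be summarized by comparing the RNICs used by a candidate link with those of an existing link in the same group. The following is a minimal sketch; the function name, the string labels, and the representation of a link as a pair of RNIC names are assumptions made for illustration.

```python
from typing import Optional, Tuple

# A link is modeled as (RNIC on Host X, RNIC on Host Y).
Link = Tuple[str, str]


def classify_link(candidate: Link, existing: Optional[Link]) -> str:
    """Classify a candidate link relative to one existing link in the group."""
    if existing is None:
        return "single"        # only active link within the link group
    same_x = candidate[0] == existing[0]
    same_y = candidate[1] == existing[1]
    if same_x and same_y:
        return "parallel"      # same RNIC at both hosts: not allowed
    if same_x or same_y:
        return "asymmetric"    # shares an RNIC at exactly one host
    return "symmetric"         # unique RNICs at both hosts


link1 = ("RNIC 1", "RNIC 2")
print(classify_link(link1, None))                  # the single-link case
print(classify_link(("RNIC 3", "RNIC 4"), link1))  # the Figure 5 layout
print(classify_link(("RNIC 3", "RNIC 2"), link1))  # the Figure 6 layout
```

An implementation would refuse to create a link classified as "parallel" and would replace an "asymmetric" link once a symmetric alternative becomes available.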
[Diagram: Host X (Protection Domain X) and Host Y (Protection Domain Y) each have one usable RNIC: RNIC 1 on Host X and RNIC 2 on Host Y. RNIC 3 and RNIC 4 are down or unavailable. SMC-R Link 1 (QP 8 to QP 64) and SMC-R Link 2 (QP 9 to QP 65) both run between RNIC 1 and RNIC 2, and both give access to the same RMB on each host via RToken X and RToken Y.]

Figure 7: SMC-R Parallel Links (Not Supported)

Figure 7 shows parallel links, which are two links in the link group that use the same hardware. This configuration is not permitted. Because SMC-R multiplexes multiple TCP connections over an SMC-R link and both links are using the exact same hardware, there is no additional redundancy or capacity benefit obtained from this configuration. In addition to providing no real benefit, this configuration adds the unnecessary overhead of additional queue pairs, generation of additional RKeys, etc.

2.2.2. Maximum Number of Links in Link Group
The SMC-R protocol defines a maximum of eight symmetric SMC-R links within a single SMC-R link group. This allows for support for up to eight unique physical paths between peer hosts. However, in terms of meeting the basic requirements for redundancy, support for at least two symmetric links must be implemented. Supporting more than two links also simplifies implementation for practical matters relating to dynamically adding and removing links -- for example, starting a third SMC-R link prior to taking down one of the two existing links. Recall that all links within a link group must have equal access to all associated RMBs.
The SMC-R protocol allows an implementation to assign an implementation-specific and appropriate value for maximum symmetric links. The implementation value must not exceed the architecture limit of 8; also, the value must not be lower than 2, because the SMC-R protocol requires redundancy. This does not mean that two RNICs are physically required to enable SMC-R connectivity, but at least two RNICs for redundancy are strongly recommended.

The SMC-R peers exchange their implementation maximum link values during link group establishment using the defined maximum link value in the CONFIRM LINK LLC command. Once the initial exchange completes, the value is set for the life of the link group. The maximum link value can be provided by both the server and client. The server must supply a value, whereas the client maximum link value is optional. When the client does not supply a value, it indicates that the client accepts the server-supplied maximum value. If the client provides a value, it cannot exceed the server-supplied maximum value. If the client passes a lower value, this lower value then becomes the final negotiated maximum number of symmetric links for this link group. Again, the minimum value is 2.

During run time, the client must never request that the server add a symmetric link to a link group that would exceed the negotiated maximum link value. Likewise, the server must never attempt to add a symmetric link to a link group that would exceed the negotiated maximum value.

In terms of counting the number of active links within a link group, the initial (or the only/last) link is always counted as 1. Then, as additional links are added, they are either symmetric or asymmetric links. With regard to enforcing the maximum link rules, asymmetric links are an exception and have their own set of rules:

o A link group is limited to at most one asymmetric link.

o An asymmetric link must not be counted, either when tracking the current number of links or when enforcing the negotiated maximum number of symmetric links.
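The negotiation and counting rules in this section can be sketched as follows. The function names and the string labels for link kinds are hypothetical and chosen for this example; only the limits (2 to 8), the client-may-lower rule, and the asymmetric-link exceptions come from the text above.

```python
from typing import List, Optional

ARCH_MAX_LINKS = 8  # architecture limit on symmetric links per link group
MIN_MAX_LINKS = 2   # redundancy requires a maximum of at least 2


def negotiate_max_links(server_max: int, client_max: Optional[int] = None) -> int:
    """Return the negotiated maximum number of symmetric links."""
    if not MIN_MAX_LINKS <= server_max <= ARCH_MAX_LINKS:
        raise ValueError("server maximum must be between 2 and 8")
    if client_max is None:
        return server_max  # client accepts the server-supplied value
    if not MIN_MAX_LINKS <= client_max <= server_max:
        raise ValueError("client maximum must be between 2 and the server value")
    return client_max      # an equal or lower client value becomes final


def may_add_link(active: List[str], new_kind: str, negotiated_max: int) -> bool:
    """Check the counting rules before adding a link.

    active lists the kinds of the links already in the group:
    "initial", "symmetric", or "asymmetric".
    """
    if new_kind == "asymmetric":
        # At most one asymmetric link; it never counts against the maximum.
        return "asymmetric" not in active
    counted = sum(1 for kind in active if kind != "asymmetric")
    return counted + 1 <= negotiated_max


limit = negotiate_max_links(server_max=8, client_max=4)
print(limit, may_add_link(["initial", "asymmetric"], "symmetric", limit))
```

Keeping the asymmetric link out of the count means that a transient asymmetric link never blocks the creation of the symmetric link that is meant to replace it.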
2.2.3. Forming and Managing Link Groups
SMC-R link groups are self-defining. The first SMC-R link in a link group is created using TCP option flows on the TCP three-way handshake followed by CLC message flows over the TCP connection. Subsequent SMC-R links in the link group are created by sending LLC messages over an SMC-R link that already exists in the link group. Once an SMC-R link group is created, no additional SMC-R links in that group are created using TCP and CLC negotiation. Because subsequent SMC-R links are created exclusively by sending LLC messages over an existing SMC-R link in a link group, the membership of SMC-R links in a link group is self-defining. This architecture does not define a specific identifier for an SMC-R link group. This identification may be useful for network management and may be assigned in a platform-specific manner, or in an extension to this architecture. In each SMC-R link group, one peer is the server for all TCP connections and the other peer is the client. If there are additional TCP connections between the peers that use SMC-R and have the client and server roles reversed, another SMC-R link group is set up between them with the opposite client-server relationship. This is required because there are specific responsibilities divided between the client and server in the management of an SMC-R link group. In this architecture, the decision of whether to use an existing SMC-R link group or create a new SMC-R link group for a TCP connection is made exclusively by the server. Management of the links in an SMC-R link group is also a server responsibility. The server is responsible for adding and deleting links in a link group. The client may request that the server take certain actions, but the final responsibility is the server's.
2.2.4. SMC-R Link Identifiers
This architecture defines multiple identifiers to identify SMC-R links and peers.

o Link number: This is a 1-byte value that identifies an SMC-R link within a link group. Both the server and the client use this number to distinguish an SMC-R link from other links within the same link group. It is only unique within a link group. In order to prevent timing windows that may occur when a server creates a new link while the client is still cleaning up a previously existing link, link numbers cannot be reused until the entire link numbering space has been exhausted.

o Link user ID: This is an architecturally opaque 4-byte value that a peer uses to uniquely define an SMC-R link within its own space. This means that a link user ID is unique within one peer only. Each peer defines its own link user ID for a link. The peers exchange this information once during link setup, and it is never used architecturally again. The purpose of this identifier is for network management, display, and debugging. For example, an operator on a client could provide the operator on the server with the server's link user ID when asking the server's operator to check on the operation of a link that the client is having trouble with.

o Peer ID: The SMC-R peer ID uniquely identifies a specific instance of a specific TCP/IP stack. It is required because in clustered and load-balancing environments, an IP address does not uniquely identify a TCP/IP stack. An RNIC's MAC/GID also doesn't uniquely or reliably identify a TCP/IP stack, because RNICs can go up and down and even be redeployed to other TCP/IP stacks in a multiple-partitioned or virtualized environment. The peer ID is not only unique per TCP/IP stack but is also unique per instance of a TCP/IP stack, meaning that if a TCP/IP stack is restarted, its peer ID changes.

2.3. SMC-R Resilience and Load Balancing
The SMC-R multilink architecture provides resilience for network high availability via failover capability to an alternate RoCE adapter. The SMC-R multilink architecture does not define primary, secondary, or alternate roles to the links. Instead, there are multiple active links representing multiple redundant RoCE paths over the same LAN.
Assignment of TCP connections to links is unidirectional and asymmetric. This means that the client and server may each choose a separate link for their RDMA writes associated with a specific TCP connection. If a hardware failure occurs or a QP failure associated with an individual link occurs, then the TCP connections that were associated with the failing link are dynamically and transparently switched to use another available link. The server or the client can detect a failure, immediately move their TCP connections, and then notify their peer via the DELETE LINK LLC command. While the client can notify the server of an apparent link failure with the DELETE LINK LLC command, the server performs the actual link deletion. The movement of TCP connections to another link can be accomplished with minimal coordination between the peers. The TCP connection movement is also transparent to, and non-disruptive to, the TCP socket application workloads for most failure scenarios. After a failure, the surviving links and all associated hardware must handle the link group's workload. As each SMC-R peer begins to move active TCP connections to another link, all current RDMA write operations must be allowed to complete. The moving peer then sends a signal to verify receipt of the last successful write by its peer. If this verification fails, the TCP connection must be reset. Once this verification is complete, all writes that failed may then be retried, in order, over the new link. Any data writes or CDC messages for which the sender did not receive write completion must be replayed before any subsequent data or CDC write operations are sent. LLC messages are not retried over the new link, because they are dependent on a known link configuration, which has just changed because of the failure. The initiator of an LLC message exchange that fails will be responsible for retrying once the link group configuration stabilizes. 
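The replay rules for moving a TCP connection to another link can be sketched as follows. The record type, its `seq` ordering field, and the use of an exception to model the connection reset are assumptions made for this example; the text above specifies only the ordering and retry rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PendingWrite:
    seq: int         # hypothetical local ordering of the write operations
    kind: str        # "data", "cdc", or "llc"
    completed: bool  # True if the sender received a write completion


def replay_plan(pending: List[PendingWrite],
                peer_verified: bool) -> List[PendingWrite]:
    """Return the writes to replay, in order, over the new link."""
    if not peer_verified:
        # Receipt of the last successful write could not be verified.
        raise ConnectionResetError("verification failed; reset TCP connection")
    return [w for w in sorted(pending, key=lambda w: w.seq)
            if not w.completed and w.kind in ("data", "cdc")]


pending = [
    PendingWrite(3, "cdc", False),
    PendingWrite(1, "data", True),
    PendingWrite(2, "data", False),
    PendingWrite(4, "llc", False),  # LLC messages are not retried here
]
print([w.seq for w in replay_plan(pending, peer_verified=True)])
```

Any new data or CDC writes would be queued behind the returned list so that the replayed writes reach the peer first.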
When a new link becomes available and is re-added to the link group, each peer is free to rebalance its current TCP connections as needed or only assign new TCP connections to the newly added link. Both the server and client are free to manage TCP connections across the link group as needed. TCP connection movement does not have to be stimulated by a link failure.

The SMC-R architecture also defines orderly versus disorderly failover. The type of failover is communicated in the LLC DELETE LINK command and is simply a means to indicate that the link has terminated (disorderly) or link termination is imminent (orderly). The orderly link deletion could be initiated via operator command or programmatically to bring down an idle link. For example, an operator command could initiate orderly shutdown of an adapter for service. Implementation of the two types is based on implementation requirements and is beyond the scope of the SMC-R architecture.