4. RPC-over-RDMA in Operation
Every RPC-over-RDMA version 1 message has a header that includes a copy of the message's transaction ID, data for managing RDMA flow-control credits, and lists of RDMA segments describing chunks. All RPC-over-RDMA header content is contained in the Transport stream; thus, it MUST be XDR encoded.

RPC message layout is unchanged from that described in [RFC5531] except for the possible reduction of data items that are moved by separate operations.

The RPC-over-RDMA protocol passes RPC messages without regard to their type (CALL or REPLY). Apart from restrictions imposed by ULBs, each endpoint of a connection MAY send RDMA_MSG or RDMA_NOMSG message header types at any time (subject to credit limits).

4.1. XDR Protocol Definition
This section contains a description of the core features of the RPC-over-RDMA version 1 protocol, expressed in the XDR language [RFC4506]. This description is provided in a way that makes it simple to extract into ready-to-compile form. The reader can apply the following shell script to this document to produce a machine-readable XDR description of the RPC-over-RDMA version 1 protocol.

<CODE BEGINS>

#!/bin/sh
grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'

<CODE ENDS>
That is, if the above script is stored in a file called "extract.sh" and this document is in a file called "spec.txt", then the reader can do the following to extract an XDR description file:

<CODE BEGINS>

sh extract.sh < spec.txt > rpcrdma_corev1.x

<CODE ENDS>

4.1.1. Code Component License
Code components extracted from this document must include the following license text. When the extracted XDR code is combined with other complementary XDR code, which itself has an identical license, only a single copy of the license text need be preserved.
<CODE BEGINS>

/// /*
///  * Copyright (c) 2010-2017 IETF Trust and the persons
///  * identified as authors of the code.  All rights reserved.
///  *
///  * The authors of the code are:
///  * B. Callaghan, T. Talpey, and C. Lever
///  *
///  * Redistribution and use in source and binary forms, with
///  * or without modification, are permitted provided that the
///  * following conditions are met:
///  *
///  * - Redistributions of source code must retain the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer.
///  *
///  * - Redistributions in binary form must reproduce the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer in the documentation and/or other
///  *   materials provided with the distribution.
///  *
///  * - Neither the name of Internet Society, IETF or IETF
///  *   Trust, nor the names of specific contributors, may be
///  *   used to endorse or promote products derived from this
///  *   software without specific prior written permission.
///  *
///  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
///  * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
///  * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
///  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
///  * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
///  * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
///  * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
///  * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
///  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
///  * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
///  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
///  * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
///  * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
///  * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
///  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///  */
///

<CODE ENDS>
4.1.2. RPC-over-RDMA Version 1 XDR
The XDR data items defined in this section encode the Transport Header Stream in each RPC-over-RDMA version 1 message. Comments identify items that cannot be changed in subsequent versions.

<CODE BEGINS>

/// /*
///  * Plain RDMA segment (Section 3.4.3)
///  */
/// struct xdr_rdma_segment {
///    uint32 handle;    /* Registered memory handle */
///    uint32 length;    /* Length of the chunk in bytes */
///    uint64 offset;    /* Chunk virtual address or offset */
/// };
///
/// /*
///  * RDMA read segment (Section 3.4.5)
///  */
/// struct xdr_read_chunk {
///    uint32 position;  /* Position in XDR stream */
///    struct xdr_rdma_segment target;
/// };
///
/// /*
///  * Read list (Section 4.3.1)
///  */
/// struct xdr_read_list {
///    struct xdr_read_chunk entry;
///    struct xdr_read_list *next;
/// };
///
/// /*
///  * Write chunk (Section 3.4.6)
///  */
/// struct xdr_write_chunk {
///    struct xdr_rdma_segment target<>;
/// };
///
/// /*
///  * Write list (Section 4.3.2)
///  */
/// struct xdr_write_list {
///    struct xdr_write_chunk entry;
///    struct xdr_write_list *next;
/// };
///
/// /*
///  * Chunk lists (Section 4.3)
///  */
/// struct rpc_rdma_header {
///    struct xdr_read_list   *rdma_reads;
///    struct xdr_write_list  *rdma_writes;
///    struct xdr_write_chunk *rdma_reply;
///    /* rpc body follows */
/// };
///
/// struct rpc_rdma_header_nomsg {
///    struct xdr_read_list   *rdma_reads;
///    struct xdr_write_list  *rdma_writes;
///    struct xdr_write_chunk *rdma_reply;
/// };
///
/// /* Not to be used */
/// struct rpc_rdma_header_padded {
///    uint32 rdma_align;
///    uint32 rdma_thresh;
///    struct xdr_read_list   *rdma_reads;
///    struct xdr_write_list  *rdma_writes;
///    struct xdr_write_chunk *rdma_reply;
///    /* rpc body follows */
/// };
///
/// /*
///  * Error handling (Section 4.5)
///  */
/// enum rpc_rdma_errcode {
///    ERR_VERS = 1,   /* Value fixed for all versions */
///    ERR_CHUNK = 2
/// };
///
/// /* Structure fixed for all versions */
/// struct rpc_rdma_errvers {
///    uint32 rdma_vers_low;
///    uint32 rdma_vers_high;
/// };
///
/// union rpc_rdma_error switch (rpc_rdma_errcode err) {
///    case ERR_VERS:
///       rpc_rdma_errvers range;
///    case ERR_CHUNK:
///       void;
/// };
///
/// /*
///  * Procedures (Section 4.2.4)
///  */
/// enum rdma_proc {
///    RDMA_MSG = 0,    /* Value fixed for all versions */
///    RDMA_NOMSG = 1,  /* Value fixed for all versions */
///    RDMA_MSGP = 2,   /* Not to be used */
///    RDMA_DONE = 3,   /* Not to be used */
///    RDMA_ERROR = 4   /* Value fixed for all versions */
/// };
///
/// /* The position of the proc discriminator field is
///  * fixed for all versions */
/// union rdma_body switch (rdma_proc proc) {
///    case RDMA_MSG:
///       rpc_rdma_header rdma_msg;
///    case RDMA_NOMSG:
///       rpc_rdma_header_nomsg rdma_nomsg;
///    case RDMA_MSGP:   /* Not to be used */
///       rpc_rdma_header_padded rdma_msgp;
///    case RDMA_DONE:   /* Not to be used */
///       void;
///    case RDMA_ERROR:
///       rpc_rdma_error rdma_error;
/// };
///
/// /*
///  * Fixed header fields (Section 4.2)
///  */
/// struct rdma_msg {
///    uint32 rdma_xid;     /* Position fixed for all versions */
///    uint32 rdma_vers;    /* Position fixed for all versions */
///    uint32 rdma_credit;  /* Position fixed for all versions */
///    rdma_body rdma_body;
/// };

<CODE ENDS>

4.2. Fixed Header Fields
The RPC-over-RDMA header begins with four fixed 32-bit fields that control the RDMA interaction. The first three words are individual fields in the rdma_msg structure. The fourth word is the first word of the rdma_body union, which acts as the discriminator for the switched union. The contents of this field are described in Section 4.2.4.
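As an illustration of this fixed layout, the following Python sketch packs and unpacks the four fixed fields as big-endian XDR words. This is not part of the protocol specification; the function names are invented for this example.

```python
import struct

# Procedure numbers from the rdma_proc enum
RDMA_MSG, RDMA_NOMSG, RDMA_ERROR = 0, 1, 4

def encode_fixed_header(xid: int, credit: int, proc: int) -> bytes:
    """Pack rdma_xid, rdma_vers (always 1 for this protocol version),
    rdma_credit, and the rdma_body discriminator as four big-endian
    32-bit words, per the XDR definition above."""
    return struct.pack(">IIII", xid, 1, credit, proc)

def decode_fixed_header(msg: bytes):
    """Unpack (rdma_xid, rdma_vers, rdma_credit, proc) from the first
    16 bytes of an RPC-over-RDMA message."""
    return struct.unpack(">IIII", msg[:16])
```

Because the four fields are at fixed positions, a receiver can extract the XID and procedure before decoding anything else in the message.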
These four fields must remain with the same meanings and in the same positions in all subsequent versions of the RPC-over-RDMA protocol.

4.2.1. Transaction ID (XID)
The XID generated for the RPC Call and Reply messages. Having the XID at a fixed location in the header makes it easy for the receiver to establish context as soon as each RPC-over-RDMA message arrives. This XID MUST be the same as the XID in the RPC message. The receiver MAY perform its processing based solely on the XID in the RPC-over-RDMA header, and thereby ignore the XID in the RPC message, if it so chooses.

4.2.2. Version Number
For RPC-over-RDMA version 1, this field MUST contain the value one (1). Rules regarding changes to this transport protocol version number can be found in Section 7.

4.2.3. Credit Value
When sent with an RPC Call message, the requested credit value is provided. When sent with an RPC Reply message, the granted credit value is returned. Further discussion of how the credit value is determined can be found in Section 3.3.

4.2.4. Procedure Number
o  RDMA_MSG = 0 indicates that chunk lists and a Payload stream follow. The format of the chunk lists is discussed below.

o  RDMA_NOMSG = 1 indicates that after the chunk lists there is no Payload stream. In this case, the chunk lists provide information to allow the Responder to transfer the Payload stream using explicit RDMA operations.

o  RDMA_MSGP = 2 is reserved.

o  RDMA_DONE = 3 is reserved.

o  RDMA_ERROR = 4 is used to signal an encoding error in the RPC-over-RDMA header.

An RDMA_MSG procedure conveys the Transport stream and the Payload stream via an RDMA Send operation. The Transport stream contains the four fixed fields followed by the Read and Write lists and the Reply chunk, though any or all three MAY be marked as not present. The Payload stream then follows, beginning with its XID field. If a Read or Write chunk list is present, a portion of the Payload stream has been reduced and is conveyed via separate operations.

An RDMA_NOMSG procedure conveys the Transport stream via an RDMA Send operation. The Transport stream contains the four fixed fields followed by the Read and Write chunk lists and the Reply chunk. Though any of these MAY be marked as not present, one MUST be present and MUST hold the Payload stream for this RPC-over-RDMA message. If a Read or Write chunk list is present, a portion of the Payload stream has been excised and is conveyed via separate operations.

An RDMA_ERROR procedure conveys the Transport stream via an RDMA Send operation. The Transport stream contains the four fixed fields followed by formatted error information. No Payload stream is conveyed in this type of RPC-over-RDMA message. A Requester MUST NOT send an RPC-over-RDMA header with the RDMA_ERROR procedure. A Responder MUST silently discard RDMA_ERROR procedures.

The Transport stream and Payload stream can be constructed in separate buffers. However, the total length of the gathered buffers cannot exceed the inline threshold.

4.3. Chunk Lists
The chunk lists in an RPC-over-RDMA version 1 header are three XDR optional-data fields that follow the fixed header fields in RDMA_MSG and RDMA_NOMSG procedures. Read Section 4.19 of [RFC4506] carefully to understand how optional-data fields work. Examples of XDR-encoded chunk lists are provided in Section 4.7 as an aid to understanding.

Often, an RPC-over-RDMA message has no associated chunks. In this case, the Read list, Write list, and Reply chunk are all marked "not present".

4.3.1. Read List
Each RDMA_MSG or RDMA_NOMSG procedure has one "Read list". The Read list is a list of zero or more RDMA read segments, provided by the Requester, that are grouped by their Position fields into Read chunks. Each Read chunk advertises the location of argument data the Responder is to pull from the Requester. The Requester has reduced the data items in these chunks from the call's Payload stream.
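The reduce-and-reinsert round trip can be sketched as follows. This is a hypothetical illustration with invented helper names; a real transport operates on registered memory regions rather than Python byte strings.

```python
def reduce_data_item(payload: bytes, offset: int, length: int):
    """Requester side: remove a DDP-eligible data item from the
    Payload stream. The returned position is the XDR stream offset
    that is recorded in the Read chunk's Position field."""
    item = payload[offset:offset + length]
    reduced = payload[:offset] + payload[offset + length:]
    return reduced, item, offset

def reinsert_data_item(reduced: bytes, position: int, item: bytes) -> bytes:
    """Responder side: after pulling the chunk with RDMA Read, splice
    it back into the received Payload stream at the recorded position."""
    return reduced[:position] + item + reduced[position:]
```

Reinserting each pulled chunk at its recorded Position reconstructs the original XDR stream exactly, which is why the Position field must be the offset in the stream before reduction.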
A Requester may transmit the Payload stream of an RPC Call message using a Position Zero Read chunk. If the RPC Call message has no argument data that is DDP-eligible and the Position Zero Read chunk is not being used, the Requester leaves the Read list empty. Responders MUST leave the Read list empty in all replies.

4.3.1.1. Matching Read Chunks to Arguments
When reducing a DDP-eligible argument data item, a Requester records the XDR stream offset of that data item in the Read chunk's Position field. The Responder can then tell unambiguously where that chunk is to be reinserted into the received Payload stream to form a complete RPC Call message.

4.3.2. Write List
Each RDMA_MSG or RDMA_NOMSG procedure has one "Write list". The Write list is a list of zero or more Write chunks, provided by the Requester. Each Write chunk is an array of plain segments; thus, the Write list is a list of counted arrays.

If an RPC Reply message has no possible DDP-eligible result data items, the Requester leaves the Write list empty. When a Requester provides a Write list, the Responder MUST push data corresponding to DDP-eligible result data items to Requester memory referenced in the Write list. The Responder removes these data items from the reply's Payload stream.

4.3.2.1. Matching Write Chunks to Results
A Requester constructs the Write list for an RPC transaction before the Responder has formulated its reply. When there is only one DDP-eligible result data item, the Requester inserts only a single Write chunk in the Write list. If the returned Write chunk is not an unused Write chunk, the Requester knows with certainty which result data item is contained in it.

When a Requester has provided multiple Write chunks, the Responder fills in each Write chunk with one DDP-eligible result until there are either no more DDP-eligible results or no more Write chunks. The Requester might not be able to predict in advance which DDP-eligible data item goes in which chunk. Thus, the Requester is responsible for allocating and registering Write chunks large enough to accommodate the largest result data item that might be associated with each chunk in the Write list.
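The matching rule above can be sketched as a simple in-order pairing. This is an invented helper for illustration; the assumption that results left without a chunk remain in the Payload stream follows from the reduction rules, not from this section directly.

```python
def match_results_to_write_chunks(ddp_results, write_chunks):
    """Pair each DDP-eligible result with the next Write chunk, in
    order, until either list runs out. Results without a chunk stay
    in the reply's Payload stream; chunks without a result are
    returned unused."""
    n = min(len(ddp_results), len(write_chunks))
    filled = list(zip(write_chunks[:n], ddp_results[:n]))
    inline_results = ddp_results[n:]
    unused_chunks = write_chunks[n:]
    return filled, inline_results, unused_chunks
```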
As a Requester decodes a reply Payload stream, it is clear from the contents of the RPC Reply message which Write chunk contains which result data item.

4.3.2.2. Unused Write Chunks
There are occasions when a Requester provides a non-empty Write chunk but the Responder is not able to use it. For example, a ULP may define a union result where some arms of the union contain a DDP-eligible data item while other arms do not. The Responder is required to use Requester-provided Write chunks in this case, but if the Responder returns a result that uses an arm of the union that has no DDP-eligible data item, that Write chunk remains unconsumed.

If there is a subsequent DDP-eligible result data item in the RPC Reply message, it MUST be placed in that unconsumed Write chunk. Therefore, the Requester MUST provision each Write chunk so it can be filled with the largest DDP-eligible data item that can be placed in it. If this is the last or only Write chunk available and it remains unconsumed, the Responder MUST return this Write chunk as an unused Write chunk (see Section 3.4.6). The Responder sets the segment count to a value matching the Requester-provided Write chunk, but returns only empty segments in that Write chunk.

Unused Write chunks, or unused bytes in Write chunk segments, are returned to the RPC consumer as part of RPC completion. Even if a Responder indicates that a Write chunk is not consumed, the Responder may have written data into one or more segments before choosing not to return that data item. The Requester MUST NOT assume that the memory regions backing a Write chunk have not been modified.

4.3.2.3. Empty Write Chunks
To force a Responder to return a DDP-eligible result inline, a Requester employs the following mechanism:

o  When there is only one DDP-eligible result item in an RPC Reply message, the Requester provides an empty Write list.

o  When there are multiple DDP-eligible result data items and a Requester prefers that a data item is returned inline, the Requester provides an empty Write chunk for that item (see Section 3.4.6). The Responder MUST return the corresponding result data item inline and MUST return an empty Write chunk in that Write list position in the RPC Reply message.
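Because a Write chunk is a counted array of plain segments, an empty Write chunk is simply a zero segment count. The following sketch (an invented function, for illustration only) shows the encoding:

```python
import struct

def encode_write_chunk(segments):
    """Encode one Write chunk: a 32-bit segment count followed by one
    HLOO (32-bit handle, 32-bit length, 64-bit offset) per plain
    segment. An empty segment list yields an empty Write chunk."""
    words = [struct.pack(">I", len(segments))]
    for handle, length, offset in segments:
        words.append(struct.pack(">IIQ", handle, length, offset))
    return b"".join(words)
```

An empty Write chunk is thus four bytes of zero in the Write list, distinct from an empty Write list, which is a single "not present" word.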
As always, a Requester and Responder must prepare for a Long Reply to be used if the resulting RPC Reply might be too large to be conveyed in an RDMA Send.

4.3.3. Reply Chunk
Each RDMA_MSG or RDMA_NOMSG procedure has one "Reply chunk" slot. A Requester MUST provide a Reply chunk whenever the maximum possible size of the RPC Reply message's Transport and Payload streams is larger than the inline threshold for messages from Responder to Requester. Otherwise, the Requester marks the Reply chunk as not present.

If the Transport stream and Payload stream together are smaller than the reply inline threshold, the Responder MAY return the RPC Reply message as a Short message rather than using the Requester-provided Reply chunk.

When a Requester provides a Reply chunk in an RPC Call message, the Responder MUST copy that chunk into the Transport header of the RPC Reply message. As with Write chunks, the Responder modifies the copied Reply chunk in the RPC Reply message to reflect the actual amount of data that is being returned in the Reply chunk.

4.4. Memory Registration
The cost of registering and invalidating memory can be a significant proportion of the cost of an RPC-over-RDMA transaction. Thus, an important implementation consideration is how to minimize registration activity without exposing system memory needlessly.

4.4.1. Registration Longevity
Data transferred via RDMA Read and Write can reside in a memory allocation not in the control of the RPC-over-RDMA transport. These memory allocations can persist outside the bounds of an RPC transaction. They are registered and invalidated as needed, as part of each RPC transaction.

The Requester endpoint must ensure that memory regions associated with each RPC transaction are protected from Responder access before allowing upper-layer access to the data contained in them. Moreover, the Requester must not access these memory regions while the Responder has access to them.
This includes memory regions that are associated with canceled RPCs. A Responder cannot know that the Requester is no longer waiting for a reply, and it might proceed to read or even update memory that the Requester might have released for other use.

4.4.2. Communicating DDP-Eligibility
The interface by which a ULP implementation communicates the eligibility of a data item to its local RPC-over-RDMA endpoint is not described by this specification.

Depending on the implementation and constraints imposed by ULBs, it is possible to implement reduction transparently to upper layers. Such implementations may lead to inefficiencies, either because they require the RPC layer to perform expensive registration and invalidation of memory "on the fly", or because they may require using RDMA chunks in RPC Reply messages, along with the resulting additional handshaking with the RPC-over-RDMA peer. However, these issues are internal and generally confined to the local interface between RPC and its upper layers, one in which implementations are free to innovate. The only requirement, beyond constraints imposed by the ULB, is that the resulting RPC-over-RDMA protocol sent to the peer be valid for the upper layer.

4.4.3. Registration Strategies
The choice of which memory registration strategies to employ is left to Requester and Responder implementers. To support the widest array of RDMA implementations, as well as the most general steering tag scheme, an Offset field is included in each RDMA segment.

While zero-based offset schemes are available in many RDMA implementations, their use by RPC requires individual registration of each memory region. For such implementations, this can be a significant overhead. By providing an offset in each chunk, many pre-registration or region-based registrations can be readily supported.

4.5. Error Handling
A receiver performs basic validity checks on the RPC-over-RDMA header and chunk contents before it passes the RPC message to the RPC layer. If an incoming RPC-over-RDMA message is not as long as a minimal size RPC-over-RDMA header (28 bytes), the receiver cannot trust the value of the XID field; therefore, it MUST silently discard the message before performing any parsing. If other errors are detected in the RPC-over-RDMA header of an RPC Call message, a Responder MUST send an RDMA_ERROR message back to the Requester. If errors are detected in the RPC-over-RDMA header of an RPC Reply message, a Requester MUST silently discard the message.

To form an RDMA_ERROR procedure:

o  The rdma_xid field MUST contain the same XID that was in the rdma_xid field in the failing request;

o  The rdma_vers field MUST contain the same version that was in the rdma_vers field in the failing request;

o  The rdma_proc field MUST contain the value RDMA_ERROR; and

o  The rdma_err field contains a value that reflects the type of error that occurred, as described below.

An RDMA_ERROR procedure indicates a permanent error. Receipt of this procedure completes the RPC transaction associated with XID in the rdma_xid field. A receiver MUST silently discard an RDMA_ERROR procedure that it cannot decode.

4.5.1. Header Version Mismatch
When a Responder detects an RPC-over-RDMA header version that it does not support (currently this document defines only version 1), it MUST reply with an RDMA_ERROR procedure and set the rdma_err value to ERR_VERS, also providing the low and high inclusive version numbers it does, in fact, support.

4.5.2. XDR Errors
A receiver might encounter an XDR parsing error that prevents it from processing the incoming Transport stream. Examples of such errors include an invalid value in the rdma_proc field; an RDMA_NOMSG message where the Read list, Write list, and Reply chunk are marked not present; or the value of the rdma_xid field does not match the value of the XID field in the accompanying RPC message. If the rdma_vers field contains a recognized value, but an XDR parsing error occurs, the Responder MUST reply with an RDMA_ERROR procedure and set the rdma_err value to ERR_CHUNK.

When a Responder receives a valid RPC-over-RDMA header but the Responder's ULP implementation cannot parse the RPC arguments in the RPC Call message, the Responder SHOULD return an RPC Reply message with status GARBAGE_ARGS, using an RDMA_MSG procedure. This type of parsing failure might be due to mismatches between chunk sizes or offsets and the contents of the Payload stream, for example.
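Following the rules above, an RDMA_ERROR procedure can be formed as in this illustrative sketch. The function name is invented here, and the credit value granted in an error reply is not addressed by this section, so it is left to the caller.

```python
import struct

RDMA_ERROR = 4
ERR_VERS, ERR_CHUNK = 1, 2

def encode_rdma_error(xid, vers, credit, err, vers_low=1, vers_high=1):
    """Encode an RDMA_ERROR procedure: the four fixed header fields
    (echoing the failing request's XID and version) followed by the
    rpc_rdma_error union."""
    hdr = struct.pack(">IIII", xid, vers, credit, RDMA_ERROR)
    if err == ERR_VERS:
        # ERR_VERS carries the supported version range (rpc_rdma_errvers)
        return hdr + struct.pack(">III", ERR_VERS, vers_low, vers_high)
    return hdr + struct.pack(">I", ERR_CHUNK)
```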
4.5.3. Responder RDMA Operational Errors
In RPC-over-RDMA version 1, the Responder initiates RDMA Read and Write operations that target the Requester's memory. Problems might arise as the Responder attempts to use Requester-provided resources for RDMA operations. For example:

o  Usually, chunks can be validated only by using their contents to perform data transfers. If chunk contents are invalid (e.g., a memory region is no longer registered or a chunk length exceeds the end of the registered memory region), a Remote Access Error occurs.

o  If a Requester's Receive buffer is too small, the Responder's Send operation completes with a Local Length Error.

o  If the Requester-provided Reply chunk is too small to accommodate a large RPC Reply message, a Remote Access Error occurs. A Responder might detect this problem before attempting to write past the end of the Reply chunk.

RDMA operational errors are typically fatal to the connection. To avoid a retransmission loop and repeated connection loss that deadlocks the connection, once the Requester has re-established a connection, the Responder should send an RDMA_ERROR reply with an rdma_err value of ERR_CHUNK to indicate that no RPC-level reply is possible for that XID.

4.5.4. Other Operational Errors
While a Requester is constructing an RPC Call message, an unrecoverable problem might occur that prevents the Requester from posting further RDMA Work Requests on behalf of that message. As with other transports, if a Requester is unable to construct and transmit an RPC Call message, the associated RPC transaction fails immediately.

After a Requester has received a reply, if it is unable to invalidate a memory region due to an unrecoverable problem, the Requester MUST close the connection to protect that memory from Responder access before the associated RPC transaction is complete.

While a Responder is constructing an RPC Reply message or error message, an unrecoverable problem might occur that prevents the Responder from posting further RDMA Work Requests on behalf of that message. If a Responder is unable to construct and transmit an RPC Reply or RPC-over-RDMA error message, the Responder MUST close the connection to signal to the Requester that a reply was lost.
4.5.5. RDMA Transport Errors
The RDMA connection and physical link provide some degree of error detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer (when used over TCP), the Stream Control Transmission Protocol (SCTP), as well as the InfiniBand [IBARCH] link layer all provide Cyclic Redundancy Check (CRC) protection of the RDMA payload, and CRC-class protection is a general attribute of such transports.

Additionally, the RPC layer itself can accept errors from the transport and recover via retransmission. RPC recovery can handle complete loss and re-establishment of a transport connection.

The details of reporting and recovery from RDMA link-layer errors are described in specific link-layer APIs and operational specifications and are outside the scope of this protocol specification. See Section 8 for further discussion of the use of RPC-level integrity schemes to detect errors.

4.6. Protocol Elements No Longer Supported
The following protocol elements are no longer supported in RPC-over-RDMA version 1. Related enum values and structure definitions remain in the RPC-over-RDMA version 1 protocol for backwards compatibility.

4.6.1. RDMA_MSGP
The specification of RDMA_MSGP in Section 3.9 of [RFC5666] is incomplete. To fully specify RDMA_MSGP would require:

o  Updating the definition of DDP-eligibility to include data items that may be transferred, with padding, via RDMA_MSGP procedures

o  Adding full operational descriptions of the alignment and threshold fields

o  Discussing how alignment preferences are communicated between two peers without using CCP

o  Describing the treatment of RDMA_MSGP procedures that convey Read or Write chunks

The RDMA_MSGP message type is beneficial only when the padded data payload is at the end of an RPC message's argument or result list. This is not typical for NFSv4 COMPOUND RPCs, which often include a GETATTR operation as the final element of the compound operation array.
Without a full specification of RDMA_MSGP, there has been no fully implemented prototype of it. Without a complete prototype of RDMA_MSGP support, it is difficult to assess whether this protocol element has benefit or can even be made to work interoperably.

Therefore, senders MUST NOT send RDMA_MSGP procedures. When receiving an RDMA_MSGP procedure, Responders SHOULD reply with an RDMA_ERROR procedure, setting the rdma_err field to ERR_CHUNK; Requesters MUST silently discard the message.

4.6.2. RDMA_DONE
Because no implementation of RPC-over-RDMA version 1 uses the Read-Read transfer model, there is never a need to send an RDMA_DONE procedure. Therefore, senders MUST NOT send RDMA_DONE messages. Receivers MUST silently discard RDMA_DONE messages.

4.7. XDR Examples
RPC-over-RDMA chunk lists are complex data types. In this section, illustrations are provided to help readers grasp how chunk lists are represented inside an RPC-over-RDMA header.

A plain segment is the simplest component, being made up of a 32-bit handle (H), a 32-bit length (L), and 64 bits of offset (OO). Once flattened into an XDR stream, plain segments appear as

   HLOO

An RDMA read segment has an additional 32-bit position field (P). RDMA read segments appear as

   PHLOO

A Read chunk is a list of RDMA read segments. Each RDMA read segment is preceded by a 32-bit word containing a one if a segment follows or a zero if there are no more segments in the list. In XDR form, this would look like

   1 PHLOO 1 PHLOO 1 PHLOO 0

where P would hold the same value for each RDMA read segment belonging to the same Read chunk.
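The "1 PHLOO ... 0" wire form above can be reproduced with a short sketch. The function name is invented; this simply applies XDR's optional-data encoding to a list of RDMA read segments.

```python
import struct

def encode_read_segments(read_segments):
    """Encode a list of RDMA read segments: each segment (P, H, L, OO)
    is preceded by a 32-bit 1 ("a segment follows"), and a final
    32-bit 0 terminates the list. An empty list is a single zero word."""
    words = [struct.pack(">IIIIQ", 1, p, h, l, oo)
             for (p, h, l, oo) in read_segments]
    words.append(struct.pack(">I", 0))
    return b"".join(words)
```

Each encoded entry is 24 bytes: four 32-bit words (the presence flag, P, H, and L) followed by the 64-bit offset.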
The Read list is also a list of RDMA read segments. In XDR form, this would look like a Read chunk, except that the P values could vary across the list. An empty Read list is encoded as a single 32-bit zero.

One Write chunk is a counted array of plain segments. In XDR form, the count would appear as the first 32-bit word, followed by an HLOO for each element of the array. For instance, a Write chunk with three elements would look like

   3 HLOO HLOO HLOO

The Write list is a list of counted arrays. In XDR form, this is a combination of optional-data and counted arrays. To represent a Write list containing a Write chunk with three segments and a Write chunk with two segments, XDR would encode

   1 3 HLOO HLOO HLOO 1 2 HLOO HLOO 0

An empty Write list is encoded as a single 32-bit zero.

The Reply chunk is a Write chunk. However, since it is an optional-data field, there is a 32-bit field in front of it that contains a one if the Reply chunk is present or a zero if it is not. After encoding, a Reply chunk with two segments would look like

   1 2 HLOO HLOO

Frequently, a Requester does not provide any chunks. In that case, after the four fixed fields in the RPC-over-RDMA header, there are simply three 32-bit fields that contain zero.

5. RPC Bind Parameters
In setting up a new RDMA connection, the first action by a Requester is to obtain a transport address for the Responder. The means used to obtain this address, and to open an RDMA connection, is dependent on the type of RDMA transport and is the responsibility of each RPC protocol binding and its local implementation.

RPC services normally register with a portmap or rpcbind service [RFC1833], which associates an RPC Program number with a service address. This policy is no different with RDMA transports. However, a different and distinct service address (port number) might sometimes be required for ULP operation with RPC-over-RDMA.
When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses IP port addressing due to its layering on TCP and/or SCTP, port mapping is trivial and consists merely of issuing the port in the connection process. The NFS/RDMA protocol service address has been assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP [RFC5667].

When mapped atop InfiniBand [IBARCH], which uses a service endpoint naming scheme based on a Group Identifier (GID), a translation MUST be employed. One such translation is described in Annexes A3 (Application Specific Identifiers), A4 (Sockets Direct Protocol (SDP)), and A11 (RDMA IP CM Service) of [IBARCH], which is appropriate for translating IP port addressing to the InfiniBand network. Therefore, in this case, IP port addressing may be readily employed by the upper layer.

When a mapping standard or convention exists for IP ports on an RDMA interconnect, there are several possibilities for each upper layer to consider:

o  One possibility is to have the Responder register its mapped IP port with the rpcbind service under the netid (or netids) defined here. An RPC-over-RDMA-aware Requester can then resolve its desired service to a mappable port and proceed to connect. This is the most flexible and compatible approach, for those upper layers that are defined to use the rpcbind service.

o  A second possibility is to have the Responder's portmapper register itself on the RDMA interconnect at a "well-known" service address (on UDP or TCP, this corresponds to port 111). A Requester could connect to this service address and use the portmap protocol to obtain a service address in response to a program number, e.g., an iWARP port number or an InfiniBand GID.

o  Alternately, the Requester could simply connect to the mapped well-known port for the service itself, if it is appropriately defined. By convention, the NFS/RDMA service, when operating atop such an InfiniBand fabric, uses the same 20049 assignment as for iWARP.
Historically, different RPC protocols have taken different approaches to their port assignment. Therefore, the specific method is left to each RPC-over-RDMA-enabled ULB and is not addressed in this document.
In Section 9, this specification defines two new netid values, to be used for registration of upper layers atop iWARP [RFC5040] [RFC5041] and (when a suitable port translation service is available) InfiniBand [IBARCH]. Additional RDMA-capable networks MAY define their own netids, or if they provide a port translation, they MAY share the one defined in this document.

6. ULB Specifications
A ULP is typically defined independently of any particular RPC transport. An Upper-Layer Binding (ULB) specification provides guidance that helps the ULP interoperate correctly and efficiently over a particular transport. For RPC-over-RDMA version 1, a ULB may provide:

o  A taxonomy of XDR data items that are eligible for DDP

o  Constraints on which upper-layer procedures may be reduced and on how many chunks may appear in a single RPC request

o  A method for determining the maximum size of the reply Payload stream for all procedures in the ULP

o  An rpcbind port assignment for operation of the RPC Program and Version on an RPC-over-RDMA transport

Each RPC Program and Version tuple that utilizes RPC-over-RDMA version 1 needs to have a ULB specification.

6.1. DDP-Eligibility
A ULB designates some XDR data items as eligible for DDP. As an RPC-over-RDMA message is formed, DDP-eligible data items can be removed from the Payload stream and placed directly in the receiver's memory.

An XDR data item should be considered for DDP-eligibility if there is a clear benefit to moving the contents of the item directly from the sender's memory to the receiver's memory. Criteria for DDP-eligibility include:

o  The XDR data item is frequently sent or received, and its size is often much larger than typical inline thresholds.

o  If the XDR data item is a result, its maximum size must be predictable in advance by the Requester.
o  Transport-level processing of the XDR data item is not needed. For example, the data item is an opaque byte array, which requires no XDR encoding and decoding of its content.

o  The content of the XDR data item is sensitive to address alignment. For example, a data copy operation would be required on the receiver to enable the message to be parsed correctly, or to enable the data item to be accessed.

o  The XDR data item does not contain DDP-eligible data items.

In addition to defining the set of data items that are DDP-eligible, a ULB may also limit the use of chunks to particular upper-layer procedures. If more than one data item in a procedure is DDP-eligible, the ULB may also limit the number of chunks that a Requester can provide for a particular upper-layer procedure.

Senders MUST NOT reduce data items that are not DDP-eligible. Such data items MAY, however, be moved as part of a Position Zero Read chunk or a Reply chunk.

The programming interface by which an upper-layer implementation indicates the DDP-eligibility of a data item to the RPC transport is not described by this specification. The only requirements are that the receiver can re-assemble the transmitted RPC-over-RDMA message into a valid XDR stream, and that DDP-eligibility rules specified by the ULB are respected.

There is no provision to express DDP-eligibility within the XDR language. The only definitive specification of DDP-eligibility is a ULB.

In general, a DDP-eligibility violation occurs when:

o  A Requester reduces a non-DDP-eligible argument data item. The Responder MUST NOT process this RPC Call message and MUST report the violation as described in Section 4.5.2.

o  A Responder reduces a non-DDP-eligible result data item. The Requester MUST terminate the pending RPC transaction and report an appropriate permanent error to the RPC consumer.

o  A Responder does not reduce a DDP-eligible result data item into an available Write chunk. The Requester MUST terminate the pending RPC transaction and report an appropriate permanent error to the RPC consumer.
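The violation cases and their mandated outcomes can be summarized as a small decision function. This is a hedged sketch of the rules stated above, not an implementation: the function, its parameters, and the outcome strings are all hypothetical labels.

```python
def ddp_violation_outcome(reducer, ddp_eligible, reduced,
                          write_chunk_available=False):
    """Classify a reduction decision against the ULB's rules.

    reducer:      "requester" (argument items) or "responder" (results)
    ddp_eligible: whether the ULB marks the item DDP-eligible
    reduced:      whether the item was actually moved as a chunk
    """
    if reducer == "requester" and reduced and not ddp_eligible:
        # Responder MUST NOT process the Call; it reports the
        # violation as described in Section 4.5.2
        return "responder reports violation"
    if reducer == "responder" and reduced and not ddp_eligible:
        # Requester MUST terminate the pending RPC transaction
        return "requester terminates transaction"
    if (reducer == "responder" and ddp_eligible and not reduced
            and write_chunk_available):
        # Eligible result not placed in an available Write chunk
        return "requester terminates transaction"
    return "no violation"
```

Note the asymmetry the sketch captures: a Requester that leaves a DDP-eligible argument inline commits no violation, since DDP-eligibility permits reduction rather than requiring it, whereas a Responder must use an available Write chunk for a DDP-eligible result.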
6.2. Maximum Reply Size
A Requester provides resources for both an RPC Call message and its matching RPC Reply message. A Requester forms the RPC Call message itself; thus, the Requester can compute the exact resources needed.

A Requester must allocate resources for the RPC Reply message (an RPC-over-RDMA credit, a Receive buffer, and possibly a Write list and Reply chunk) before the Responder has formed the actual reply. To accommodate all possible replies for the procedure in the RPC Call message, a Requester must allocate reply resources based on the maximum possible size of the expected RPC Reply message.

If there are procedures in the ULP for which there is no clear reply size maximum, the ULB needs to specify a dependable means for determining the maximum.

6.3. Additional Considerations
There may be other details provided in a ULB.

o  A ULB may recommend inline threshold values or other transport-related parameters for RPC-over-RDMA version 1 connections bearing that ULP.

o  A ULP may provide a means to communicate these transport-related parameters between peers. Note that RPC-over-RDMA version 1 does not specify any mechanism for changing any transport-related parameter after a connection has been established.

o  Multiple ULPs may share a single RPC-over-RDMA version 1 connection when their ULBs allow the use of RPC-over-RDMA version 1 and the rpcbind port assignments for the Protocols allow connection sharing. In this case, the same transport parameters (such as inline threshold) apply to all Protocols using that connection.

Each ULB needs to be designed to allow correct interoperation without regard to the transport parameters actually in use. Furthermore, implementations of ULPs must be designed to interoperate correctly regardless of the connection parameters in effect on a connection.

6.4. ULP Extensions
An RPC Program and Version tuple may be extensible. For instance, there may be a minor versioning scheme that is not reflected in the RPC version number, or the ULP may allow additional features to be specified after the original RPC Program specification was ratified.
ULBs keep such extended RPC Programs and Versions interoperable: the existing ULB is extended in turn to reflect the changes made necessary by each addition to the existing XDR.