Network Working Group                                          P. Culley
Request for Comments: 5044                       Hewlett-Packard Company
Category: Standards Track                                       U. Elzur
                                                    Broadcom Corporation
                                                                R. Recio
                                                         IBM Corporation
                                                               S. Bailey
                                                   Sandburst Corporation
                                                              J. Carrier
                                                               Cray Inc.
                                                            October 2007

Marker PDU Aligned Framing for TCP Specification

Status of This Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Abstract
Marker PDU Aligned Framing (MPA) is designed to work as an "adaptation layer" between TCP and the Direct Data Placement protocol (DDP) as described in RFC 5041. It preserves the reliable, in-order delivery of TCP, while adding the preservation of higher-level protocol record boundaries that DDP requires. MPA is fully compliant with applicable TCP RFCs and can be utilized with existing TCP implementations. MPA also supports integrated implementations that combine TCP, MPA and DDP to reduce buffering requirements in the implementation and improve performance at the system level.
Table of Contents
1. Introduction
   1.1. Motivation
   1.2. Protocol Overview
2. Glossary
3. MPA's Interactions with DDP
4. MPA Full Operation Phase
   4.1. FPDU Format
   4.2. Marker Format
   4.3. MPA Markers
   4.4. CRC Calculation
   4.5. FPDU Size Considerations
5. MPA's interactions with TCP
   5.1. MPA transmitters with a standard layered TCP
   5.2. MPA receivers with a standard layered TCP
6. MPA Receiver FPDU Identification
7. Connection Semantics
   7.1. Connection Setup
      7.1.1. MPA Request and Reply Frame Format
      7.1.2. Connection Startup Rules
      7.1.3. Example Delayed Startup Sequence
      7.1.4. Use of Private Data
         7.1.4.1. Motivation
         7.1.4.2. Example Immediate Startup Using Private Data
      7.1.5. "Dual Stack" Implementations
   7.2. Normal Connection Teardown
8. Error Semantics
9. Security Considerations
   9.1. Protocol-Specific Security Considerations
      9.1.1. Spoofing
         9.1.1.1. Impersonation
         9.1.1.2. Stream Hijacking
         9.1.1.3. Man-in-the-Middle Attack
      9.1.2. Eavesdropping
   9.2. Introduction to Security Options
   9.3. Using IPsec with MPA
   9.4. Requirements for IPsec Encapsulation of MPA/DDP
10. IANA Considerations
Appendix A. Optimized MPA-Aware TCP Implementations
   A.1. Optimized MPA/TCP Transmitters
   A.2. Effects of Optimized MPA/TCP Segmentation
   A.3. Optimized MPA/TCP Receivers
   A.4. Re-segmenting Middleboxes and Non-Optimized MPA/TCP Senders
   A.5. Receiver Implementation
      A.5.1. Network Layer Reassembly Buffers
      A.5.2. TCP Reassembly Buffers
Appendix B. Analysis of MPA over TCP Operations
   B.1. Assumptions
      B.1.1. MPA Is Layered beneath DDP
      B.1.2. MPA Preserves DDP Message Framing
      B.1.3. The Size of the ULPDU Passed to MPA Is Less Than EMSS Under Normal Conditions
      B.1.4. Out-of-Order Placement but NO Out-of-Order Delivery
   B.2. The Value of FPDU Alignment
      B.2.1. Impact of Lack of FPDU Alignment on the Receiver Computational Load and Complexity
      B.2.2. FPDU Alignment Effects on TCP Wire Protocol
Appendix C. IETF Implementation Interoperability with RDMA Consortium Protocols
   C.1. Negotiated Parameters
   C.2. RDMAC RNIC and Non-Permissive IETF RNIC
      C.2.1. RDMAC RNIC Initiator
      C.2.2. Non-Permissive IETF RNIC Initiator
      C.2.3. RDMAC RNIC and Permissive IETF RNIC
      C.2.4. RDMAC RNIC Initiator
      C.2.5. Permissive IETF RNIC Initiator
   C.3. Non-Permissive IETF RNIC and Permissive IETF RNIC
Normative References
Informative References
Contributors

Table of Figures

Figure 1: ULP MPA TCP Layering
Figure 2: FPDU Format
Figure 3: Marker Format
Figure 4: Example FPDU Format with Marker
Figure 5: Annotated Hex Dump of an FPDU
Figure 6: Annotated Hex Dump of an FPDU with Marker
Figure 7: Fully Layered Implementation
Figure 8: MPA Request/Reply Frame
Figure 9: Example Delayed Startup Negotiation
Figure 10: Example Immediate Startup Negotiation
Figure 11: Optimized MPA/TCP Implementation
Figure 12: Non-Aligned FPDU Freely Placed in TCP Octet Stream
Figure 13: Aligned FPDU Placed Immediately after TCP Header
Figure 14: Connection Parameters for the RNIC Types
Figure 15: MPA Negotiation between an RDMAC RNIC and a Non-Permissive IETF RNIC
Figure 16: MPA Negotiation between an RDMAC RNIC and a Permissive IETF RNIC
Figure 17: MPA Negotiation between a Non-Permissive IETF RNIC and a Permissive IETF RNIC
1. Introduction
This section discusses the motivation for creating MPA on TCP and gives a general overview of the protocol.

1.1. Motivation
The Direct Data Placement protocol [DDP], when used with TCP [RFC793], requires a mechanism to detect record boundaries. The DDP records are referred to as Upper Layer Protocol Data Units by this document. The ability to locate the Upper Layer Protocol Data Unit (ULPDU) boundary is useful to a hardware network adapter that uses DDP to directly place the data in the application buffer based on the control information carried in the ULPDU header. This may be done without requiring that the packets arrive in order. Potential benefits of this capability are the avoidance of the memory copy overhead and a smaller memory requirement for handling out-of-order or dropped packets. Many approaches have been proposed for a generalized framing mechanism. Some are probabilistic in nature and others are deterministic. An example probabilistic approach is characterized by a detectable value embedded in the octet stream, with no method of preventing that value elsewhere within user data. It is probabilistic because under some conditions the receiver may incorrectly interpret application data as the detectable value. Under these conditions, the protocol may fail with unacceptable frequency. One deterministic approach is characterized by embedded controls at known locations in the octet stream. Because the receiver can guarantee it will only examine the data stream at locations that are known to contain the embedded control, the protocol can never misinterpret application data as being embedded control data. For unambiguous handling of an out-of-order packet, a deterministic approach is preferred. The MPA protocol provides a framing mechanism for DDP running over TCP using the deterministic approach. It allows the location of the ULPDU to be determined in the TCP stream even if the TCP segments arrive out of order.
1.2. Protocol Overview
The layering of PDUs with MPA is shown in Figure 1, below.

   +------------------+
   |    ULP client    |
   +------------------+  <- Consumer messages
   |       DDP        |
   +------------------+  <- ULPDUs
   |       MPA*       |
   +------------------+  <- FPDUs (containing ULPDUs)
   |       TCP*       |
   +------------------+  <- TCP Segments (containing FPDUs)
   |     IP etc.      |
   +------------------+
    * These may be fully layered or optimized together.

Figure 1: ULP MPA TCP Layering

MPA is described as an extra layer above TCP and below DDP. The operation sequence is:

1. A TCP connection is established by ULP action. This is done using methods not described by this specification. The ULP may exchange some amount of data in streaming mode prior to starting MPA, but is not required to do so.

2. The Consumer negotiates the use of DDP and MPA at both ends of a connection. The mechanisms to do this are not described in this specification. The negotiation may be done in streaming mode, or by some other mechanism (such as a pre-arranged port number).

3. The ULP activates MPA on each end in the Startup Phase, either as an Initiator or a Responder, as determined by the ULP. This mode verifies the usage of MPA, specifies the use of CRC and Markers, and allows the ULP to communicate some additional data via a Private Data exchange. See Section 7.1, Connection Setup, for more details on the startup process.

4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into Full Operation and begins sending DDP data as further described below. In this document, DDP data chunks are called ULPDUs. For a description of the DDP data, see [DDP].
Following is a description of data transfer when MPA is in Full Operation.

1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA for this value. MPA derives this information from TCP or IP, when it is available, or chooses a reasonable value.

2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to MPA at the sender.

3. MPA creates a Framed Protocol Data Unit (FPDU) by prepending a header, optionally inserting Markers, and appending a CRC field after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP.

4. The TCP sender puts the FPDUs into the TCP stream. If the sender is optimized MPA/TCP, it segments the TCP stream in such a way that a TCP Segment boundary is also the boundary of an FPDU. TCP then passes each segment to the IP layer for transmission.

5. The receiver may or may not be optimized. If it is optimized MPA/TCP, it may separate passing the TCP payload to MPA from passing the TCP payload ordering information to MPA. In either case, RFC-compliant TCP wire behavior is observed at both the sender and receiver.

6. The MPA receiver locates and assembles complete FPDUs within the stream, verifies their integrity, and removes MPA Markers (when present), ULPDU_Length, PAD, and the CRC field.

7. MPA then provides the complete ULPDUs to DDP. MPA may also separate passing MPA payload to DDP from passing the MPA payload ordering information.

A fully layered MPA on TCP is implemented as a data stream ULP for TCP and is therefore RFC compliant.

An optimized DDP/MPA/TCP uses a TCP layer that potentially contains some additional behaviors as suggested in this document. When DDP/MPA/TCP are cross-layer optimized, the behavior of TCP (especially sender segmentation) may change from that of the un-optimized implementation, but the changes are within the bounds permitted by the TCP RFC specifications, and will interoperate with an un-optimized TCP.
The additional behaviors are described in Appendix A and are not normative; they are described at a TCP interface layer as a convenience. Implementations may achieve the described functionality using any method, including cross-layer optimizations between TCP, MPA, and DDP.
An optimized DDP/MPA/TCP sender is able to segment the data stream such that TCP segments begin with FPDUs (FPDU Alignment). This has significant advantages for receivers. When segments arrive with aligned FPDUs, the receiver usually need not buffer any portion of the segment, allowing DDP to place it in its destination memory immediately, thus avoiding copies from intermediate buffers (DDP's reason for existence). An optimized DDP/MPA/TCP receiver allows a DDP on MPA implementation to locate the start of ULPDUs that may be received out of order. It also allows the implementation to determine if the entire ULPDU has been received. As a result, MPA can pass out-of-order ULPDUs to DDP for immediate use. This enables a DDP on MPA implementation to save a significant amount of intermediate storage by placing the ULPDUs in the right locations in the application buffers when they arrive, rather than waiting until full ordering can be restored. The ability of a receiver to recover out-of-order ULPDUs is optional and declared to the transmitter during startup. When the receiver declares that it does not support out-of-order recovery, the transmitter does not add the control information to the data stream needed for out-of-order recovery. If the receiver is fully layered, then MPA receives a strictly ordered stream of data and does not deal with out-of-order ULPDUs. In this case, MPA passes each ULPDU to DDP when the last bytes arrive from TCP, along with the indication that they are in order. MPA implementations that support recovery of out-of-order ULPDUs MUST support a mechanism to indicate the ordering of ULPDUs as the sender transmitted them and indicate when missing intermediate segments arrive. These mechanisms allow DDP to reestablish record ordering and report Delivery of complete messages (groups of records). MPA also addresses enhanced data integrity. Some users of TCP have noted that the TCP checksum is not as strong as could be desired (see [CRCTCP]). 
Studies such as [CRCTCP] have shown that the TCP checksum indicates segments in error at a much higher rate than the underlying link characteristics would indicate. With these higher error rates, the chance that an error will escape detection, when using only the TCP checksum for data integrity, becomes a concern. A stronger integrity check can reduce the chance of data errors being missed.

MPA includes a CRC check to increase the ULPDU data integrity to the level provided by other modern protocols, such as SCTP [RFC4960]. It is possible to disable this CRC check; however, CRCs MUST be enabled unless it is clear that the end-to-end connection through the network has data integrity at least as good as an MPA with CRC enabled (for example, when IPsec is implemented end to end). DDP's ULP expects this level of data integrity and therefore the ULP does not have to provide its own duplicate data integrity and error recovery for lost data.

2. Glossary
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

Consumer - the ULPs or applications that lie above MPA and DDP. The Consumer is responsible for making TCP connections, starting MPA and DDP connections, and generally controlling operations.

CRC - Cyclic Redundancy Check.

Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as the process of informing DDP that a particular PDU is ordered for use. A PDU is Delivered in the exact order that it was sent by the original sender; MPA uses TCP's byte stream ordering to determine when Delivery is possible. This is specifically different from "passing the PDU to DDP", which may generally occur in any order, while the order of Delivery is strictly defined.

EMSS - Effective Maximum Segment Size. EMSS is the smaller of the TCP maximum segment size (MSS) as defined in RFC 793 [RFC793], and the current path Maximum Transmission Unit (MTU) [RFC1191].

FPDU - Framed Protocol Data Unit. The unit of data created by an MPA sender.

FPDU Alignment - The property that an FPDU is Header Aligned with the TCP segment, and the TCP segment includes an integer number of FPDUs. A TCP segment with an FPDU Alignment allows immediate processing of the contained FPDUs without waiting on other TCP segments to arrive or combining with prior segments.

FPDU Pointer (FPDUPTR) - This field of the Marker is used to indicate the beginning of an FPDU.

Full Operation (Full Operation Phase) - After the completion of the Startup Phase, MPA begins exchanging FPDUs.
Header Alignment - The property that a TCP segment begins with an FPDU. The FPDU is Header Aligned when the FPDU header is exactly at the start of the TCP segment (right behind the TCP headers on the wire).

Initiator - The endpoint of a connection that sends the MPA Request Frame, i.e., the first to actually send data (which may not be the one that sends the TCP SYN).

Marker - A four-octet field that is placed in the MPA data stream at fixed octet intervals (every 512 octets).

MPA-aware TCP - A TCP implementation that is aware of the receiver efficiencies of MPA FPDU Alignment and is capable of sending TCP segments that begin with an FPDU.

MPA-enabled - MPA is enabled if the MPA protocol is visible on the wire. When the sender is MPA-enabled, it is inserting framing and Markers. When the receiver is MPA-enabled, it is interpreting framing and Markers.

MPA Request Frame - Data sent from the MPA Initiator to the MPA Responder during the Startup Phase.

MPA Reply Frame - Data sent from the MPA Responder to the MPA Initiator during the Startup Phase.

MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This document defines the MPA protocol.

MULPDU - Maximum ULPDU. The current maximum size of the record that is acceptable for DDP to pass to MPA for transmission.

Node - A computing device attached to one or more links of a network. A Node in this context does not refer to a specific application or protocol instantiation running on the computer. A Node may consist of one or more MPA on TCP devices installed in a host computer.

PAD - A 1-3 octet group of zeros used to fill an FPDU to an exact modulo 4 size.

PDU - Protocol Data Unit.

Private Data - A block of data exchanged between MPA endpoints during initial connection setup.
Protection Domain - An RDMA concept (see [VERBS-RDMA] and [RDMASEC]) that ties use of various endpoint resources (memory access, etc.) to the specific RDMA/DDP/MPA connection.

RDDP - A suite of protocols including MPA, [DDP], [RDMAP], an overall security document [RDMASEC], a problem statement [RFC4297], an architecture document [RFC4296], and an applicability document [APPL].

RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA to enable applications to transfer data directly from memory buffers. See [RDMAP].

Remote Peer - The MPA protocol implementation on the opposite end of the connection. Used to refer to the remote entity when describing protocol exchanges or other interactions between two Nodes.

Responder - The connection endpoint that responds to an incoming MPA connection request (the MPA Request Frame). This may not be the endpoint that awaited the TCP SYN.

Startup Phase - The initial exchanges of an MPA connection that serve to more fully identify MPA endpoints to each other and pass connection-specific setup information to each other.

ULP - Upper Layer Protocol. The protocol layer above the protocol layer currently being referenced. The ULP for MPA is DDP [DDP].

ULPDU - Upper Layer Protocol Data Unit. The data record defined by the layer above MPA (DDP). ULPDU corresponds to DDP's DDP segment.

ULPDU_Length - A field in the FPDU describing the length of the included ULPDU.
3. MPA's Interactions with DDP
DDP requires MPA to maintain DDP record boundaries from the sender to the receiver. When using MPA on TCP to send data, DDP provides records (ULPDUs) to MPA. MPA will use the reliable transmission abilities of TCP to transmit the data, and will insert appropriate additional information into the TCP stream to allow the MPA receiver to locate the record boundary information. As such, MPA accepts complete records (ULPDUs) from DDP at the sender and returns them to DDP at the receiver. MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU contained in one FPDU. MPA over a standard TCP stack can usually provide FPDU Alignment with the TCP Header if the FPDU is equal to TCP's EMSS. An optimized MPA/TCP stack can also maintain alignment as long as the FPDU is less than or equal to TCP's EMSS. Since FPDU Alignment is generally desired by the receiver, DDP cooperates with MPA to ensure FPDUs' lengths do not exceed the EMSS under normal conditions. This is done with the MULPDU mechanism. MPA MUST provide information to DDP on the current maximum size of the record that is acceptable to send (MULPDU). DDP SHOULD limit each record size to MULPDU. The range of MULPDU values MUST be between 128 octets and 64768 octets, inclusive. The sending DDP MUST NOT post a ULPDU larger than 64768 octets to MPA. DDP MAY post a ULPDU of any size between one and 64768 octets; however, MPA is not REQUIRED to support a ULPDU Length that is greater than the current MULPDU. While the maximum theoretical length supported by the MPA header ULPDU_Length field is 65535, TCP over IP requires the IP datagram maximum length to be 65535 octets. To enable MPA to support FPDU Alignment, the maximum size of the FPDU must fit within an IP datagram. 
Thus, the ULPDU limit of 64768 octets was derived by taking the maximum IP datagram length, subtracting from it the maximum total length of the sum of the IPv4 header, TCP header, IPv4 options, TCP options, and the worst-case MPA overhead, and then rounding the result down to a 128-octet boundary. Note that MULPDU will be significantly smaller than the theoretical maximum in most implementations for most circumstances, due to link MTUs, use of extra headers such as required for IPsec, etc.
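The derivation of the 64768-octet limit can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the text names the categories of overhead but not their sizes, so the 60-octet maximum IPv4 and TCP header sizes and the worst-case Marker count (one Marker per started 512-octet interval) are assumptions made here for the calculation.

```python
# Worked arithmetic for the 64768-octet ULPDU limit.
# The individual sizes below are assumptions used to illustrate the
# derivation; the specification text only names the categories.

MAX_IP_DATAGRAM = 65535   # maximum IPv4 datagram length
MAX_IPV4_HEADER = 60      # 20-octet header plus up to 40 octets of options
MAX_TCP_HEADER = 60       # 20-octet header plus up to 40 octets of options
MARKER_INTERVAL = 512     # Marker spacing in the TCP stream
MARKER_SIZE = 4           # each Marker is four octets
MPA_HEADER = 2            # ULPDU_Length field
MPA_MAX_PAD = 3           # worst-case PAD
MPA_CRC = 4               # CRC field


def max_ulpdu() -> int:
    """Return the largest ULPDU size, rounded down to a 128-octet boundary."""
    tcp_payload = MAX_IP_DATAGRAM - MAX_IPV4_HEADER - MAX_TCP_HEADER
    # Assumed worst case: one Marker for every started 512-octet interval.
    markers = -(-tcp_payload // MARKER_INTERVAL)  # ceiling division
    overhead = MPA_HEADER + MPA_MAX_PAD + MPA_CRC + markers * MARKER_SIZE
    return (tcp_payload - overhead) // 128 * 128


print(max_ulpdu())  # 64768
```

Under these assumptions, the TCP payload is 65415 octets, the MPA overhead is 521 octets, and 64894 rounds down to 64768.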
On receive, MPA MUST pass each ULPDU with its length to DDP when it has been validated.

If an MPA implementation supports passing out-of-order ULPDUs to DDP, the MPA implementation SHOULD:

* Pass each ULPDU with its length to DDP as soon as it has been fully received and validated.

* Provide a mechanism to indicate the ordering of ULPDUs as the sender transmitted them. One possible mechanism might be providing the TCP sequence number for each ULPDU.

* Provide a mechanism to indicate when a given ULPDU (and prior ULPDUs) are complete (Delivered to DDP). One possible mechanism might be to allow DDP to see the current outgoing TCP ACK sequence number.

* Provide an indication to DDP that the TCP has closed or has begun to close the connection (e.g., received a FIN).

MPA MUST provide the protocol version negotiated with its peer to DDP. DDP will use this version to set the version in its header and to report the version to [RDMAP].
4. MPA Full Operation Phase
The following sections describe the main semantics of the Full Operation Phase of MPA.

4.1. FPDU Format
MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown below MUST be used for all MPA FPDUs. For purposes of clarity, Markers are not shown in Figure 2.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          ULPDU_Length         |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   ~                                                               ~
   ~                             ULPDU                             ~
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |          PAD (0-3 octets)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              CRC                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2: FPDU Format

ULPDU_Length: 16 bits (unsigned integer). This is the number of octets of the contained ULPDU. It does not include the length of the FPDU header itself, the pad, the CRC, or of any Markers that fall within the ULPDU. The 16-bit ULPDU Length field is large enough to support the largest IP datagrams for IPv4 or IPv6.

PAD: The PAD field trails the ULPDU and contains between 0 and 3 octets of data. The pad data MUST be set to zero by the sender and ignored by the receiver (except for CRC checking). The length of the pad is set so as to make the size of the FPDU an integral multiple of four.

CRC: 32 bits. When CRCs are enabled, this field contains a CRC32c check value, which is used to verify the entire contents of the FPDU, using CRC32c. See Section 4.4, CRC Calculation. When CRCs are not enabled, this field is still present, may contain any value, and MUST NOT be checked.
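The framing rules for Figure 2 can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are hypothetical, Markers are omitted (as in Figure 2), and the CRC field is passed in as four opaque octets so that the sketch does not depend on the CRC calculation of Section 4.4.

```python
import struct

MAX_ULPDU = 64768  # largest ULPDU that DDP may post to MPA


def build_fpdu(ulpdu: bytes, crc_field: bytes = bytes(4)) -> bytes:
    """Frame one ULPDU as an FPDU without Markers (hypothetical helper)."""
    if not 1 <= len(ulpdu) <= MAX_ULPDU:
        raise ValueError("ULPDU size out of range")
    if len(crc_field) != 4:
        raise ValueError("CRC field is always four octets")
    header = struct.pack("!H", len(ulpdu))   # ULPDU_Length, network order
    pad_len = -(len(header) + len(ulpdu)) % 4  # 0-3 zero octets
    return header + ulpdu + bytes(pad_len) + crc_field


def parse_fpdu(fpdu: bytes) -> bytes:
    """Return the ULPDU carried by a Marker-free FPDU (CRC not checked)."""
    (ulpdu_len,) = struct.unpack_from("!H", fpdu, 0)
    pad_len = -(2 + ulpdu_len) % 4
    if len(fpdu) != 2 + ulpdu_len + pad_len + 4:
        raise ValueError("FPDU length does not match ULPDU_Length")
    return fpdu[2:2 + ulpdu_len]


# A 42-octet ULPDU needs no PAD: 2 + 42 + 0 + 4 = 48 octets.
print(len(build_fpdu(bytes(42))))  # 48
# A 16-octet ULPDU needs 2 PAD octets: 2 + 16 + 2 + 4 = 24 octets.
print(len(build_fpdu(bytes(16))))  # 24
```

The PAD length is derived from the header plus ULPDU size alone, because the CRC field and any Markers are each four octets and so never disturb the modulo 4 alignment.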
The FPDU adds a minimum of 6 octets to the length of the ULPDU. In addition, the total length of the FPDU will include the length of any Markers and from 0 to 3 pad octets added to round up the ULPDU size.

4.2. Marker Format
The format of a Marker MUST be as specified in Figure 3:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           RESERVED            |            FPDUPTR            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 3: Marker Format

RESERVED: The Reserved field MUST be set to zero on transmit and ignored on receive (except for CRC calculation).

FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long, interpreted as an unsigned integer that indicates the number of octets in the TCP stream from the beginning of the ULPDU Length field to the first octet of the entire Marker. The least significant two bits MUST always be set to zero at the transmitter, and the receivers MUST always treat these as zero for calculations.

4.3. MPA Markers
MPA Markers are used to identify the start of FPDUs when packets are received out of order. This is done by locating the Markers at fixed intervals in the data stream (which is correlated to the TCP sequence number) and using the Marker value to locate the preceding FPDU start. All MPA Markers are included in the containing FPDU CRC calculation (when both CRCs and Markers are in use). The MPA receiver's ability to locate out-of-order FPDUs and pass the ULPDUs to DDP is implementation dependent. MPA/DDP allows those receivers that are able to deal with out-of-order FPDUs in this way to require the insertion of Markers in the data stream. When the receiver cannot deal with out-of-order FPDUs in this way, it may disable the insertion of Markers at the sender. All MPA senders MUST be able to generate Markers when their use is declared by the opposing receiver (see Section 7.1, Connection Setup).
When Markers are enabled, MPA senders MUST insert a Marker into the data stream at a 512-octet periodic interval in the TCP Sequence Number Space. The Marker contains a 16-bit unsigned integer referred to as the FPDUPTR (FPDU Pointer). If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16-bit relative back-pointer. FPDUPTR MUST contain the number of octets in the TCP stream from the beginning of the ULPDU Length field to the first octet of the Marker, unless the Marker falls between FPDUs. Thus, the location of the first octet of the previous FPDU header can be determined by subtracting the value of the given Marker from the current octet-stream sequence number (i.e., TCP sequence number) of the first octet of the Marker. Note that this computation MUST take into account that the TCP sequence number could have wrapped between the Marker and the header. An FPDUPTR value of 0x0000 is a special case -- it is used when the Marker falls exactly between FPDUs (between the preceding FPDU CRC field and the next FPDU's ULPDU Length field). In this case, the Marker is considered to be contained in the following FPDU; the Marker MUST be included in the CRC calculation of the FPDU following the Marker (if CRCs are being generated or checked). Thus, an FPDUPTR value of 0x0000 means that immediately following the Marker is an FPDU header (the ULPDU Length field). Since all FPDUs are integral multiples of 4 octets, the bottom two bits of the FPDUPTR as calculated by the sender are zero. MPA reserves these bits so they MUST be treated as zero for computation at the receiver. When Markers are enabled (see Section 7.1, Connection Setup), the MPA Markers MUST be inserted immediately preceding the first FPDU of Full Operation Phase, and at every 512th octet of the TCP octet stream thereafter. As a result, the first Marker has an FPDUPTR value of 0x0000. 
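The Marker encoding and the back-pointer computation described above can be sketched as follows. The function names are hypothetical and the code is only an illustration of the rules in the text: RESERVED is sent as zero and ignored on receive, the low two bits of FPDUPTR are masked, sequence arithmetic is done modulo 2^32 to handle wrap, and an FPDUPTR of 0x0000 is treated as "the FPDU header immediately follows the Marker".

```python
import struct

SEQ_MOD = 1 << 32   # TCP sequence numbers are 32 bits wide
MARKER_SIZE = 4


def pack_marker(fpduptr: int) -> bytes:
    """Encode a Marker; RESERVED is transmitted as zero."""
    if not 0 <= fpduptr <= 0xFFFF or fpduptr & 0x3:
        raise ValueError("FPDUPTR must be a 16-bit multiple of 4")
    return struct.pack("!HH", 0, fpduptr)


def unpack_fpduptr(marker: bytes) -> int:
    """Decode FPDUPTR; RESERVED is ignored, low two bits treated as zero."""
    _reserved, fpduptr = struct.unpack("!HH", marker)
    return fpduptr & 0xFFFC


def fpdu_header_seq(marker_seq: int, marker: bytes) -> int:
    """Sequence number of the ULPDU Length field the Marker refers to."""
    fpduptr = unpack_fpduptr(marker)
    if fpduptr == 0:
        # Marker fell between FPDUs: the header follows the Marker.
        return (marker_seq + MARKER_SIZE) % SEQ_MOD
    # Back-pointer, taking sequence number wrap into account.
    return (marker_seq - fpduptr) % SEQ_MOD


# Marker at stream octet 0x200 with FPDUPTR 0x14 (as in Figure 6).
print(hex(fpdu_header_seq(0x200, pack_marker(0x14))))  # 0x1ec
# The FPDU header may sit before a sequence number wrap.
print(hex(fpdu_header_seq(8, pack_marker(0x10))))      # 0xfffffff8
```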
If the first Marker begins at octet sequence number SeqStart, then Markers are inserted such that the first octet of the Marker is at octet sequence number SeqNum if the remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum can wrap. For example, if the TCP sequence number were used to calculate the insertion point of the Marker, the starting TCP sequence number is unlikely to be zero, and 512-octet multiples are unlikely to fall on a modulo 512 of zero. If the MPA connection is started at TCP sequence number 11, then the 1st Marker will begin at 11, and subsequent Markers will begin at 523, 1035, etc.
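The placement rule can be checked with a short sketch. The helper names are hypothetical; the only inputs are the rule that a Marker starts at SeqNum when (SeqNum - SeqStart) mod 512 is zero and the fact that TCP sequence numbers wrap at 2^32.

```python
SEQ_MOD = 1 << 32       # TCP sequence number space
MARKER_INTERVAL = 512   # Marker spacing in octets


def is_marker_position(seq_num: int, seq_start: int) -> bool:
    """True if a Marker begins at seq_num, given the first Marker at seq_start."""
    return (seq_num - seq_start) % SEQ_MOD % MARKER_INTERVAL == 0


def next_marker_seq(seq_num: int, seq_start: int) -> int:
    """First Marker position at or after seq_num."""
    offset = (seq_num - seq_start) % SEQ_MOD
    gap = -offset % MARKER_INTERVAL
    return (seq_num + gap) % SEQ_MOD


def marker_positions(seq_start: int, count: int) -> list:
    """Sequence numbers of the first `count` Markers."""
    return [(seq_start + i * MARKER_INTERVAL) % SEQ_MOD for i in range(count)]


# The example from the text: MPA started at TCP sequence number 11.
print(marker_positions(11, 3))  # [11, 523, 1035]
```

Because 2^32 is a multiple of 512, reducing the difference modulo 2^32 first keeps the Marker spacing intact across a sequence number wrap.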
If an FPDU is large enough to contain multiple Markers, they MUST all point to the same point in the TCP stream: the first octet of the ULPDU Length field for the FPDU.

If a Marker interval contains multiple FPDUs (the FPDUs are small), the Marker MUST point to the start of the ULPDU Length field for the FPDU containing the Marker unless the Marker falls between FPDUs, in which case the Marker MUST be zero.

The following example shows an FPDU containing a Marker.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     ULPDU Length (0x0010)     |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   +                                                               +
   |                      ULPDU (octets 0-9)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           (0x0000)            |       FPDU ptr (0x000C)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     ULPDU (octets 10-15)                      |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |       PAD (2 octets:0,0)      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              CRC                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 4: Example FPDU Format with Marker

MPA Receivers MUST preserve ULPDU boundaries when passing data to DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to DDP and not the Markers, headers, and CRC.

4.4. CRC Calculation
An MPA implementation MUST implement CRC support and MUST either:

(1) always use CRCs; the MPA provider is not REQUIRED to support an
    administrator's request that CRCs not be used.

or

(2a) only indicate a preference not to use CRCs on the explicit
     request of the system administrator, via an interface not defined
     in this spec. The default configuration for a connection MUST be
     to use CRCs.

(2b) disable CRC checking (and possibly generation) if both the local
     and remote endpoints indicate preference not to use CRCs.

An administrative decision to have a host request CRC suppression SHOULD NOT be made unless there is assurance that the TCP connection involved provides protection from undetected errors that is at least as strong as an end-to-end CRC32c. End-to-end usage of an IPsec cryptographic integrity check is among the ways to provide such protection, and the use of channel bindings [NFSv4CHANNEL] by the ULP can provide a high level of assurance that the IPsec protection scope is end-to-end with respect to the ULP.

The process MUST be invisible to the ULP.

After receipt of an MPA startup declaration indicating that its peer requires CRCs, an MPA instance MUST continue generating and checking CRCs until the connection terminates. If an MPA instance has declared that it does not require CRCs, it MUST turn off CRC checking immediately after receipt of an MPA mode declaration indicating that its peer also does not require CRCs. It MAY continue generating CRCs. See Section 7.1, Connection Setup, for details on the MPA startup.

When sending an FPDU, the sender MUST include a CRC field. When CRCs are enabled, the CRC field in the MPA FPDU MUST be computed using the CRC32c polynomial in the manner described in the iSCSI Protocol [iSCSI] document for Header and Data Digests.

The fields which MUST be included in the CRC calculation when sending an FPDU are as follows:

1) If a Marker does not immediately precede the ULPDU Length field,
   the CRC-32c is calculated from the first octet of the ULPDU Length
   field, through all the ULPDU and Markers (if present), to the last
   octet of the PAD (if present), inclusive. If there is a Marker
   immediately following the PAD, the Marker is included in the CRC
   calculation for this FPDU.

2) If a Marker immediately precedes the first octet of the ULPDU
   Length field of the FPDU (i.e., the Marker fell between FPDUs, and
   thus is required to be included in the second FPDU), the CRC-32c is
   calculated from the first octet of the Marker, through the ULPDU
   Length header, through all the ULPDU and Markers (if present), to
   the last octet of the PAD (if present), inclusive.

3) After calculating the CRC-32c, the resultant value is placed into
   the CRC field at the end of the FPDU.
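The coverage rules above can be illustrated with a short, non-normative Python sketch. The helper names (crc32c, crc_covered_octets, fpdu_crc) and the offset parameters are hypothetical and are not defined by MPA. The bitwise CRC routine is written for clarity rather than speed, and the sketch deliberately leaves the serialization of the result into the CRC field to [iSCSI].

```python
# Non-normative sketch. Helper names are illustrative, not defined by MPA.

CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form
MARKER_SIZE = 4                     # a Marker occupies 4 octets


def crc32c(data: bytes) -> int:
    """Bitwise CRC32c: initial value all ones, reflected, final XOR all ones."""
    crc = 0xFFFFFFFF
    for octet in data:
        crc ^= octet
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


def crc_covered_octets(stream: bytes, length_field_off: int,
                       crc_field_off: int, marker_precedes: bool) -> bytes:
    """Select the octets that rules 1) and 2) place under the CRC.

    Coverage starts at the ULPDU Length field, or at the Marker when one
    immediately precedes that field, and ends just before the CRC field,
    so Markers inside the FPDU or right after the PAD are included.
    """
    start = length_field_off - MARKER_SIZE if marker_precedes else length_field_off
    return stream[start:crc_field_off]


def fpdu_crc(stream: bytes, length_field_off: int,
             crc_field_off: int, marker_precedes: bool) -> int:
    """CRC-32c value for one FPDU, as an integer (rule 3 stores it in the FPDU)."""
    return crc32c(crc_covered_octets(stream, length_field_off,
                                     crc_field_off, marker_precedes))


# An FPDU that opens the stream: Marker, ULPDU Length = 0x002a, then a
# 42-octet ULPDU (DDP control 0x4143, no STag, Queue 0, MSN 1, MO 0, 24 zeros).
first_fpdu = (bytes(4) + bytes.fromhex("002a") + bytes.fromhex("4143")
              + bytes(4) + bytes(4) + bytes.fromhex("00000001")
              + bytes(4) + bytes(24))

# Standard CRC-32C check value, useful as a self-test of the routine.
assert crc32c(b"123456789") == 0xE3069283
```

A receiver would apply the same selection to the incoming octets and compare the result with the received CRC field.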
When an FPDU is received, and CRC checking is enabled, the receiver MUST first perform the following:

1) Calculate the CRC of the incoming FPDU in the same fashion as
   defined above.

2) Verify that the calculated CRC-32c value is the same as the
   received CRC-32c value found in the FPDU CRC field. If not, the
   receiver MUST treat the FPDU as an invalid FPDU.

The procedure for handling invalid FPDUs is covered in Section 8, Error Semantics.

The following is an annotated hex dump of an example FPDU sent as the first FPDU on the stream. As such, it starts with a Marker. The FPDU contains a 42-octet ULPDU (an example DDP segment), which in turn carries a 24-octet data payload that is all zeros. The CRC32c has been correctly calculated and can be used as a reference. See the [DDP] and [RDMAP] specifications for definitions of the DDP Control field, Queue, MSN, MO, and Send Data.
Octet Contents  Annotation
Count

0000     00     Marker: Reserved
0001     00
0002     00     Marker: FPDUPTR
0003     00
0004     00     ULPDU Length
0005     2a
0006     41     DDP Control Field, Send with Last flag set
0007     43
0008     00     Reserved (DDP STag position with no STag)
0009     00
000a     00
000b     00
000c     00     DDP Queue = 0
000d     00
000e     00
000f     00
0010     00     DDP MSN = 1
0011     00
0012     00
0013     01
0014     00     DDP MO = 0
0015     00
0016     00
0017     00
0018     00     DDP Send Data (24 octets of zeros)
...
002f     00
0030     52     CRC32c
0031     23
0032     99
0033     83

Figure 5: Annotated Hex Dump of an FPDU
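The octet counts in Figure 5 can be cross-checked with a small, non-normative Python sketch. The function name fpdu_size is hypothetical, and the sketch assumes the Section 4.1 rule that the PAD brings the ULPDU Length field plus the ULPDU to a multiple of 4 octets.

```python
# Non-normative sketch. fpdu_size is an illustrative name, not part of MPA.

LENGTH_FIELD_SIZE = 2  # ULPDU Length field
CRC_SIZE = 4           # CRC field
MARKER_SIZE = 4        # each Marker


def fpdu_size(ulpdu_len: int, markers: int) -> dict:
    """Return the PAD size and total on-the-wire size of one FPDU.

    ASSUMPTION: PAD aligns (ULPDU Length field + ULPDU) to 4 octets,
    as described in Section 4.1.
    """
    pad = -(LENGTH_FIELD_SIZE + ulpdu_len) % 4
    total = (MARKER_SIZE * markers + LENGTH_FIELD_SIZE
             + ulpdu_len + pad + CRC_SIZE)
    return {"pad": pad, "total": total}


# Figure 5: a 42-octet ULPDU carried with one Marker.
figure5 = fpdu_size(ulpdu_len=42, markers=1)
print(figure5)  # {'pad': 0, 'total': 52}
```

A total of 52 octets matches the dump, which runs from octet 0000 through octet 0033 (hex).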
The following is an example sent as the second FPDU of the stream where the first FPDU (which is not shown here) had a length of 492 octets and was also a Send to Queue 0 with Last Flag set. This example contains a Marker.

Octet Contents  Annotation
Count

01ec     00     Length
01ed     2a
01ee     41     DDP Control Field: Send with Last Flag set
01ef     43
01f0     00     Reserved (DDP STag position with no STag)
01f1     00
01f2     00
01f3     00
01f4     00     DDP Queue = 0
01f5     00
01f6     00
01f7     00
01f8     00     DDP MSN = 2
01f9     00
01fa     00
01fb     02
01fc     00     DDP MO = 0
01fd     00
01fe     00
01ff     00
0200     00     Marker: Reserved
0201     00
0202     00     Marker: FPDUPTR
0203     14
0204     00     DDP Send Data (24 octets of zeros)
...
021b     00
021c     84     CRC32c
021d     92
021e     58
021f     98

Figure 6: Annotated Hex Dump of an FPDU with Marker
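The Marker position and FPDUPTR value in this second example can be reproduced with a short, non-normative Python sketch. It assumes the 512-octet Marker spacing described in Section 4.3, with stream offset 0 holding the first Marker. The function names next_marker_offset and fpduptr are hypothetical.

```python
# Non-normative sketch. Function names are illustrative, not part of MPA.

MARKER_INTERVAL = 512  # ASSUMPTION: Marker spacing per Section 4.3


def next_marker_offset(stream_off: int) -> int:
    """Offset of the first Marker at or after stream_off."""
    return -(-stream_off // MARKER_INTERVAL) * MARKER_INTERVAL


def fpduptr(marker_off: int, length_field_off: int) -> int:
    """Octets from the FPDU's ULPDU Length field to the Marker.

    A Marker that immediately precedes the ULPDU Length field (one that
    fell between FPDUs) carries 0, so negative distances clamp to 0.
    """
    return max(0, marker_off - length_field_off)


# Second FPDU: its ULPDU Length field sits at offset 0x01ec (492).
marker_off = next_marker_offset(0x01EC)
print(hex(marker_off))                   # 0x200
print(hex(fpduptr(marker_off, 0x01EC)))  # 0x14
```

The value 0x14 (20 octets) is the distance from the Length field at 01ec to the Marker at 0200, which is the FPDUPTR shown at octets 0202 and 0203.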
4.5. FPDU Size Considerations
MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as the size of the largest ULPDU fitting in an FPDU. For an empty TCP Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus space for Markers and pad octets.

The maximum ULPDU Length for a single ULPDU when Markers are present MUST be computed as:

    MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)

The formula above accounts for the worst-case number of Markers.

The maximum ULPDU Length for a single ULPDU when Markers are NOT present MUST be computed as:

    MULPDU = EMSS - (6 + EMSS mod 4)

As a further optimization of the wire efficiency, an MPA implementation MAY dynamically adjust the MULPDU (see Section 5 for latency and wire efficiency trade-offs). When one or more FPDUs are already packed into a TCP Segment, MULPDU MAY be reduced accordingly.

DDP SHOULD provide ULPDUs that are as large as possible, but less than or equal to MULPDU.

If the TCP implementation needs to adjust EMSS to support MTU changes or changing TCP options, the MULPDU value is changed accordingly.

In certain rare situations, the EMSS may shrink below 128 octets in size. If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU below 128 octets and is not required to follow the segmentation rules in Section 5.1 and Appendix A.

If one or more FPDUs are already packed into a TCP segment, such that the remaining room is less than 128 octets, MPA MUST NOT provide a MULPDU smaller than 128. In this case, MPA would typically provide a MULPDU for the next full-sized segment, but may still pack the next FPDU into the small remaining room, provided that the next FPDU is small enough to fit.

The value 128 is chosen to allow DDP designers room for the DDP Header and some user data.
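The two MULPDU formulas and the 128-octet floor can be worked through in a brief, non-normative Python sketch. The function names mulpdu and advertised_mulpdu are hypothetical, and the EMSS value of 1460 octets is only an example input. Applying the floor with max() is one reading of the rule above, not the only possible implementation.

```python
# Non-normative sketch. Function names and the sample EMSS are illustrative.

FPDU_OVERHEAD = 6      # ULPDU Length field (2 octets) + CRC (4 octets)
MARKER_SIZE = 4
MARKER_INTERVAL = 512
MIN_MULPDU = 128


def mulpdu(emss: int, markers: bool = True) -> int:
    """MULPDU for an empty TCP Segment, per the formulas in this section."""
    overhead = FPDU_OVERHEAD + emss % 4
    if markers:
        # Worst-case Marker count: Ceiling(EMSS / 512), in integer arithmetic.
        overhead += MARKER_SIZE * -(-emss // MARKER_INTERVAL)
    return emss - overhead


def advertised_mulpdu(emss: int, markers: bool = True) -> int:
    """ASSUMPTION: model the 'MUST NOT shrink below 128' rule as a floor."""
    return max(MIN_MULPDU, mulpdu(emss, markers))


print(mulpdu(1460, markers=True))    # 1442 = 1460 - (6 + 4*3 + 0)
print(mulpdu(1460, markers=False))   # 1454 = 1460 - (6 + 0)
print(advertised_mulpdu(100))        # 128 (formula alone would give 90)
```

With Markers enabled, an EMSS of 1460 spans at most three Marker positions, so 12 octets are reserved for Markers in addition to the fixed 6 octets of FPDU overhead.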