5. MPA's interactions with TCP
The following sections describe MPA's interactions with TCP. This section discusses using a standard layered TCP stack with MPA attached above a TCP socket. Discussion of using an optimized MPA-aware TCP with an MPA implementation that takes advantage of the extra optimizations is done in Appendix A.

   +-----------------------------------+
   | +-----+       +-----------------+ |
   | | MPA |       | Other Protocols | |
   | +-----+       +-----------------+ |
   |    ||                  ||         |
   |  ----- socket API --------------  |
   |                ||                 |
   |             +-----+               |
   |             | TCP |               |
   |             +-----+               |
   |                ||                 |
   |             +-----+               |
   |             | IP  |               |
   |             +-----+               |
   +-----------------------------------+

   Figure 7: Fully Layered Implementation

The fully layered implementation is described for completeness; however, the user is cautioned that the reduced probability of FPDU alignment when transmitting with this implementation will tend to introduce a higher overhead at optimized receivers. In addition, the lack of out-of-order receive processing will significantly reduce the value of DDP/MPA by imposing higher buffering and copying overhead in the local receiver.

5.1. MPA transmitters with a standard layered TCP
MPA transmitters SHOULD calculate a MULPDU as described in Section 4.5. If the TCP implementation allows EMSS to be determined by MPA, that value should be used. If the transmit side TCP implementation is not able to report the EMSS, MPA SHOULD use the current MTU value to establish a likely FPDU size, taking into account the various expected header sizes. MPA transmitters SHOULD also use whatever facilities the TCP stack presents to cause the TCP transmitter to start TCP segments at FPDU boundaries. Multiple FPDUs MAY be packed into a single TCP segment as determined by the EMSS calculation as long as they are entirely contained in the TCP segment.
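As a rough illustration of the sizing step above, the following sketch estimates an EMSS from the MTU and derives a worst-case MULPDU from it. It assumes the MULPDU formula of Section 4.5 (EMSS minus 6 octets of length and CRC, 4 octets per Marker at 512-octet spacing, and worst-case padding), which lies outside this section; the function names and the default header sizes (IPv4 and TCP without options) are hypothetical choices for the example, not part of the specification.

```python
import math

MARKER_INTERVAL = 512  # assumed Marker spacing in octets (Section 4.3)


def estimate_emss(mtu: int, ip_header: int = 20, tcp_header: int = 20,
                  tcp_options: int = 0) -> int:
    """Estimate EMSS from the MTU when TCP cannot report it.

    The default header sizes are assumptions (IPv4 without options, base
    TCP header); pass tcp_options=12 if TCP timestamps are in use.
    """
    return mtu - ip_header - tcp_header - tcp_options


def mulpdu_from_emss(emss: int) -> int:
    """Worst-case MULPDU: EMSS minus length+CRC (6), Markers, and padding."""
    overhead = 6 + 4 * math.ceil(emss / MARKER_INTERVAL) + emss % 4
    return emss - overhead


emss = estimate_emss(1500, tcp_options=12)
print(emss, mulpdu_from_emss(emss))  # 1448 1430
```

An implementation that can query the real EMSS from its TCP stack would use that value instead of the MTU-based estimate.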
For example, passing FPDU buffers sized to the current EMSS to the TCP socket and using the TCP_NODELAY socket option to disable the Nagle [RFC896] algorithm will usually result in many of the segments starting with an FPDU.

It is recognized that various effects can cause an FPDU Alignment to be lost. Following are a few of the effects:

* ULPDUs that are smaller than the MULPDU. If these are sent in a continuous stream, FPDU Alignment will be lost. Note that careful use of a dynamic MULPDU can help in this case; the MULPDU for future FPDUs can be adjusted to re-establish alignment with the segments based on the current EMSS.

* Sending enough data that the TCP receive window limit is reached. TCP may send a smaller segment to exactly fill the receive window.

* Sending data when TCP is operating up against the congestion window. If TCP is not tracking the congestion window in segments, it may transmit a smaller segment to exactly fill the congestion window.

* Changes in EMSS due to varying TCP options, or changes in MTU.

If FPDU Alignment with TCP segments is lost for any reason, the alignment is regained after a break in transmission where the TCP send buffers are emptied. Many usage models for DDP/MPA will include such breaks.

MPA receivers are REQUIRED to be able to operate correctly even if alignment is lost (see Section 6).

5.2. MPA receivers with a standard layered TCP
MPA receivers will get TCP data in the usual ordered stream. The receivers MUST identify FPDU boundaries by using the ULPDU_LENGTH field, as described in Section 6. Receivers MAY utilize markers to check for FPDU boundary consistency, but they are NOT required to examine the markers to determine the FPDU boundaries.
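The following is a minimal sketch of how a receiver on an ordered TCP stream might walk FPDU boundaries using only the ULPDU_LENGTH field. It assumes the FPDU layout of Section 4 (a 2-octet big-endian ULPDU_LENGTH, the ULPDU, padding to a 4-octet boundary, and a 4-octet CRC field), and it further assumes that Markers are disabled and that CRC checking was negotiated off, so the CRC field is skipped rather than verified. The function name split_fpdus is hypothetical.

```python
import struct


def split_fpdus(stream: bytes):
    """Split an in-order byte stream into ULPDUs.

    Returns (ulpdus, remainder), where remainder holds the octets of a
    trailing FPDU that has not fully arrived yet.
    """
    ulpdus = []
    offset = 0
    while len(stream) - offset >= 2:
        (ulpdu_len,) = struct.unpack_from("!H", stream, offset)
        pad = -(2 + ulpdu_len) % 4          # pad FPDU to a multiple of 4
        fpdu_len = 2 + ulpdu_len + pad + 4  # length + ULPDU + pad + CRC
        if len(stream) - offset < fpdu_len:
            break  # entire FPDU not yet present; wait for more data
        ulpdus.append(stream[offset + 2:offset + 2 + ulpdu_len])
        offset += fpdu_len
    return ulpdus, stream[offset:]
```

The remainder is meant to be prepended to the next chunk read from the socket, so a partially received FPDU is never forwarded to DDP.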
6. MPA Receiver FPDU Identification
An MPA receiver MUST first verify the FPDU before passing the ULPDU to DDP. To do this, the receiver MUST:

* locate the start of the FPDU unambiguously,

* verify its CRC (if CRC checking is enabled).

If the above conditions are true, the MPA receiver passes the ULPDU to DDP.

To detect the start of the FPDU unambiguously, one of the following MUST be used:

1: In an ordered TCP stream, the ULPDU Length field in the current FPDU, when the FPDU has a valid CRC, can be used to identify the beginning of the next FPDU.

2: For optimized MPA/TCP receivers that support out-of-order reception of FPDUs (see Section 4.3, MPA Markers), a Marker can always be used to locate the beginning of an FPDU (in FPDUs with valid CRCs). Since the location of the Marker is known in the octet stream (sequence number space), the Marker can always be found.

3: Having found an FPDU by means of a Marker, an optimized MPA/TCP receiver can find following contiguous FPDUs by using the ULPDU Length fields (from FPDUs with valid CRCs) to establish the next FPDU boundary.

The ULPDU Length field (see Section 4) MUST be used to determine if the entire FPDU is present before forwarding the ULPDU to DDP.

CRC calculation is discussed in Section 4.4 above.

7. Connection Semantics
7.1. Connection Setup
MPA requires that the Consumer MUST activate MPA, and any TCP enhancements for MPA, on a TCP half connection at the same location in the octet stream at both the sender and the receiver. This is required in order for the Marker scheme to correctly locate the Markers (if enabled) and to correctly locate the first FPDU. MPA, and any TCP enhancements for MPA, are enabled by the ULP in both directions at once at an endpoint.
This can be accomplished several ways, and is left up to DDP's ULP:

* DDP's ULP MAY require DDP on MPA startup immediately after TCP connection setup. This has the advantage that no streaming mode negotiation is needed. An example of such a protocol is shown in Figure 10: Example Immediate Startup negotiation.

  This may be accomplished by using a well-known port, or a service locator protocol to locate an appropriate port on which DDP on MPA is expected to operate.

* DDP's ULP MAY negotiate the start of DDP on MPA sometime after a normal TCP startup, using TCP streaming data exchanges on the same connection. The exchange establishes that DDP on MPA (as well as other ULPs) will be used, and exactly locates the point in the octet stream where MPA is to begin operation. Note that such a negotiation protocol is outside the scope of this specification. A simplified example of such a protocol is shown in Figure 9: Example Delayed Startup negotiation.

An MPA endpoint operates in two distinct phases.

The Startup Phase is used to verify correct MPA setup, exchange CRC and Marker configuration, and optionally pass Private Data between endpoints prior to completing a DDP connection. During this phase, specifically formatted frames are exchanged as TCP byte streams without using CRCs or Markers. During this phase a DDP endpoint need not be "bound" to the MPA connection. In fact, the choice of DDP endpoint and its operating parameters may not be known until the Consumer supplied Private Data (if any) has been examined by the Consumer.

The second distinct phase is Full Operation during which FPDUs are sent using all the rules that pertain (CRCs, Markers, MULPDU restrictions, etc.). A DDP endpoint MUST be "bound" to the MPA connection at entry to this phase.

When Private Data is passed between ULPs in the Startup Phase, the ULP is responsible for interpreting that data, and then placing MPA into Full Operation.
Note: The following text differentiates the two endpoints by calling them Initiator and Responder. This is quite arbitrary and is NOT related to the TCP startup (SYN, SYN/ACK sequence). The Initiator is the side that sends first in the MPA startup sequence (the MPA Request Frame).
Note: The possibility that both endpoints would be allowed to make a connection at the same time, sometimes called an active/active connection, was considered by the work group and rejected. There were several motivations for this decision. One was that applications needing this facility were few (none other than theoretical at the time of this document). Another was that the facility created some implementation difficulties, particularly with the "dual stack" designs described later on. A last issue was that dealing with rejected connections at startup would have required at least an additional frame type, and more recovery actions, complicating the protocol. While none of these issues was overwhelming, the group and implementers were not motivated to do the work to resolve these issues. The protocol includes a method of detecting these active/active startup attempts so that they can be rejected and an error reported.

The ULP is responsible for determining which side is Initiator or Responder. For client/server type ULPs, this is easy. For peer-peer ULPs (which might utilize a TCP style active/active startup), some mechanism (not defined by this specification) must be established, or some streaming mode data exchanged prior to MPA startup to determine which side starts in Initiator and which starts in Responder MPA mode.

7.1.1. MPA Request and Reply Frame Format
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0  |                                                               |
   +         Key (16 bytes containing "MPA ID Req Frame")          +
4  |       (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)       |
   +         Or  (16 bytes containing "MPA ID Rep Frame")          +
8  |       (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)       |
   +                                                               +
12 |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |M|C|R| Res     |     Rev       |          PD_Length            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                                                               ~
   ~                   Private Data                                ~
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 8: MPA Request/Reply Frame
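The layout in Figure 8 can be sketched as a pair of encode/decode helpers. This is an illustrative sketch only: the names build_frame and parse_frame are hypothetical, bit 0 of the figure is taken as the most significant bit of octet 16 (the usual network bit order), and the parser assumes it is handed exactly one complete frame rather than reading from a live TCP stream. All failures are raised as ValueError; a real implementation would close the connection and report the error locally.

```python
import struct

REQ_KEY = b"MPA ID Req Frame"
REP_KEY = b"MPA ID Rep Frame"
MAX_PRIVATE_DATA = 512
HEADER_LEN = 20  # 16-octet Key + flags + Rev + 2-octet PD_Length


def build_frame(key: bytes, markers: bool, crc: bool, reject: bool = False,
                private_data: bytes = b"", rev: int = 1) -> bytes:
    """Encode an MPA Request or Reply Frame (Res bits are sent as zero)."""
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 octets")
    flags = (markers << 7) | (crc << 6) | (reject << 5)
    return key + struct.pack("!BBH", flags, rev, len(private_data)) + private_data


def parse_frame(data: bytes, expected_key: bytes) -> dict:
    """Decode and validate one complete startup frame."""
    if len(data) < HEADER_LEN:
        raise ValueError("frame shorter than fixed header")
    if data[:16] != expected_key:
        raise ValueError("unexpected Key")
    flags, rev, pd_len = struct.unpack_from("!BBH", data, 16)
    if rev != 1:
        raise ValueError("unsupported MPA revision")  # this sketch knows only 1
    if pd_len > MAX_PRIVATE_DATA or len(data) - HEADER_LEN != pd_len:
        raise ValueError("PD_Length does not match Private Data")
    return {
        "markers": bool(flags & 0x80),  # M bit
        "crc": bool(flags & 0x40),      # C bit
        "reject": bool(flags & 0x20),   # R bit (meaningful in Reply only)
        "rev": rev,
        "private_data": data[HEADER_LEN:],
    }
```

For example, parse_frame(build_frame(REQ_KEY, True, True, private_data=b"hi"), REQ_KEY) returns a dictionary with both the M and C bits set and the two octets of Private Data.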
Key: This field contains the "key" used to validate that the sender is an MPA sender. Initiator mode senders MUST set this field to the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder mode receivers MUST check this field for the same value, and close the connection and report an error locally if any other value is detected. Responder mode senders MUST set this field to the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator mode receivers MUST check this field for the same value, and close the connection and report an error locally if any other value is detected.

M: This bit declares an endpoint's REQUIRED Marker usage. When this bit is '1' in an MPA Request Frame, the Initiator declares that Markers are REQUIRED in FPDUs sent from the Responder. When set to '1' in an MPA Reply Frame, this bit declares that Markers are REQUIRED in FPDUs sent from the Initiator. When in a received MPA Request Frame or MPA Reply Frame and the value is '0', Markers MUST NOT be added to the data stream by that endpoint. When '1', Markers MUST be added as described in Section 4.3, MPA Markers.

C: This bit declares an endpoint's preferred CRC usage. When this field is '0' in the MPA Request Frame and the MPA Reply Frame, CRCs MUST NOT be checked and need not be generated by either endpoint. When this bit is '1' in either the MPA Request Frame or MPA Reply Frame, CRCs MUST be generated and checked by both endpoints. Note that even when not in use, the CRC field remains present in the FPDU. When CRCs are not in use, the CRC field MUST be considered valid for FPDU checking regardless of its contents.

R: This bit is set to zero, and not checked on reception in the MPA Request Frame. In the MPA Reply Frame, this bit is the Rejected Connection bit, set by the Responder's ULP to indicate acceptance '0', or rejection '1', of the connection parameters provided in the Private Data.

Res: This field is reserved for future use. It MUST be set to zero when sending, and not checked on reception.
Rev: This field contains the revision of MPA. For this version of the specification, senders MUST set this field to one. MPA receivers compliant with this version of the specification MUST check this field. If the MPA receiver cannot interoperate with the received version, then it MUST close the connection and report an error locally. Otherwise, the MPA receiver should report the received version to the ULP.

PD_Length: This field MUST contain the length in octets of the Private Data field. A value of zero indicates that there is no Private Data field present at all. If the receiver detects that the PD_Length field does not match the length of the Private Data field, or if the length of the Private Data field exceeds 512 octets, the receiver MUST close the connection and report an error locally. Otherwise, the MPA receiver should pass the PD_Length value and Private Data to the ULP.

Private Data: This field may contain any value defined by ULPs or may not be present. The Private Data field MUST be between 0 and 512 octets in length. ULPs define how to size, set, and validate this field within these limits. Private Data usage is further discussed in Section 7.1.4.

7.1.2. Connection Startup Rules
The following rules apply to the MPA connection Startup Phase:

1. When MPA is started in the Initiator mode, the MPA implementation MUST send a valid MPA Request Frame. The MPA Request Frame MAY include ULP-supplied Private Data.

2. When MPA is started in the Responder mode, the MPA implementation MUST wait until an MPA Request Frame is received and validated before entering Full MPA/DDP Operation.

   If the MPA Request Frame is improperly formatted, the implementation MUST close the TCP connection and exit MPA.

   If the MPA Request Frame is properly formatted but the Private Data is not acceptable, the implementation SHOULD return an MPA Reply Frame with the Rejected Connection bit set to '1'; the MPA Reply Frame MAY include ULP-supplied Private Data; the implementation MUST exit MPA, leaving the TCP connection open. The ULP may close TCP or use the connection for other purposes.

   If the MPA Request Frame is properly formatted and the Private Data is acceptable, the implementation SHOULD return an MPA Reply Frame with the Rejected Connection bit set to '0'; the MPA Reply Frame MAY include ULP-supplied Private Data; and the Responder SHOULD prepare to interpret any data received as FPDUs and pass any received ULPDUs to DDP.

   Note: Since the receiver's ability to deal with Markers is unknown until the Request and Reply Frames have been received, sending FPDUs before this occurs is not possible.

   Note: The requirement to wait on a Request Frame before sending a Reply Frame is a design choice. It makes for a well-ordered sequence of events at each end, and avoids having to specify how to deal with situations where both ends start at the same time.

3. MPA Initiator mode implementations MUST receive and validate an MPA Reply Frame.

   If the MPA Reply Frame is improperly formatted, the implementation MUST close the TCP connection and exit MPA.

   If the MPA Reply Frame is properly formatted but the Private Data is not acceptable, or if the Rejected Connection bit is set to '1', the implementation MUST exit MPA, leaving the TCP connection open. The ULP may close TCP or use the connection for other purposes.

   If the MPA Reply Frame is properly formatted and the Private Data is acceptable, and the Rejected Connection bit is set to '0', the implementation SHOULD enter Full MPA/DDP Operation Phase; interpreting any received data as FPDUs and sending DDP ULPDUs as FPDUs.

4. MPA Responder mode implementations MUST receive and validate at least one FPDU before sending any FPDUs or Markers.

   Note: This requirement is present to allow the Initiator time to get its receiver into Full Operation before an FPDU arrives, avoiding potential race conditions at the Initiator. This was also subject to some debate in the work group before rough consensus was reached. Eliminating this requirement would allow faster startup in some types of applications. However, that would also make certain implementations (particularly "dual stack") much harder.

5. If a received "Key" does not match the expected value (see Section 7.1.1, MPA Request and Reply Frame Format), the TCP/DDP connection MUST be closed, and an error returned to the ULP.

6. The received Private Data fields may be used by Consumers at either end to further validate the connection and set up DDP or other ULP parameters. The Initiator ULP MAY close the TCP/MPA/DDP connection as a result of validating the Private Data fields. The Responder SHOULD return an MPA Reply Frame with the "Rejected Connection" bit set to '1' if the validation of the Private Data is not acceptable to the ULP.

7. When the first FPDU is to be sent, then if Markers are enabled, the first octets sent are the special Marker 0x00000000, followed by the start of the FPDU (the FPDU's ULPDU Length field). If Markers are not enabled, the first octets sent are the start of the FPDU (the FPDU's ULPDU Length field).

8. MPA implementations MUST use the difference between the MPA Request Frame and the MPA Reply Frame to check for incorrect "Initiator/Initiator" startups. Implementations SHOULD put a timeout on waiting for the MPA Request Frame when started in Responder mode, to detect incorrect "Responder/Responder" startups.

9. MPA implementations MUST validate the PD_Length field. The buffer that receives the Private Data field MUST be large enough to receive that data; the amount of Private Data MUST NOT exceed the PD_Length or the application buffer. If any of the above fails, the startup frame MUST be considered improperly formatted.

10. MPA implementations SHOULD implement a reasonable timeout while waiting for the entire set of startup frames; this prevents certain denial-of-service attacks. ULPs SHOULD implement a reasonable timeout while waiting for FPDUs, ULPDUs, and application level messages to guard against application failures and certain denial-of-service attacks.

7.1.3. Example Delayed Startup Sequence
A variety of startup sequences are possible when using MPA on TCP. Following is an example of an MPA/DDP startup that occurs after TCP has been running for a while and has exchanged some amount of streaming data. This example does not use any Private Data (an example that does is shown later in Section 7.1.4.2, Example Immediate Startup Using Private Data), although it is perfectly legal to include the Private Data. Note that since the example does not use any Private Data, there are no ULP interactions shown between receiving "startup frames" and putting MPA into Full Operation.
         Initiator                                 Responder

   +---------------------------+
   |ULP streaming mode         |
   |  <Hello> request to       |
   |  transition to DDP/MPA    |           +---------------------------+
   |  mode (optional).         | --------> |ULP gets request;          |
   +---------------------------+           |  enables MPA Responder    |
                                           |  mode with last (optional)|
                                           |  streaming mode           |
                                           |  <Hello Ack> for MPA to   |
                                           |  send.                    |
   +---------------------------+           |MPA waits for incoming     |
   |ULP receives streaming     | <-------- |  <MPA Request Frame>.     |
   |  <Hello Ack>;             |           +---------------------------+
   |Enters MPA Initiator mode; |
   |MPA sends                  |
   |  <MPA Request Frame>;     |
   |MPA waits for incoming     |           +---------------------------+
   |  <MPA Reply Frame>.       | - - - - > |MPA receives               |
   +---------------------------+           |  <MPA Request Frame>.     |
                                           |Consumer binds DDP to MPA; |
                                           |MPA sends the              |
                                           |  <MPA Reply Frame>.       |
                                           |DDP/MPA enables FPDU       |
   +---------------------------+           |  decoding, but does not   |
   |MPA receives the           | < - - - - |  send any FPDUs.          |
   |  <MPA Reply Frame>        |           +---------------------------+
   |Consumer binds DDP to MPA; |
   |DDP/MPA begins Full        |
   |  Operation.               |
   |MPA sends first FPDU (as   |           +---------------------------+
   |  DDP ULPDUs become        | ========> |MPA receives first FPDU.   |
   |  available).              |           |MPA sends first FPDU (as   |
   +---------------------------+           |  DDP ULPDUs become        |
                                   <====== |  available).              |
                                           +---------------------------+

   Figure 9: Example Delayed Startup Negotiation
An example Delayed Startup sequence is described below:

* Active and passive sides start up a TCP connection in the usual fashion, probably using sockets APIs. They exchange some amount of streaming mode data. At some point, one side (the MPA Initiator) sends streaming mode data that effectively says "Hello, let's go into MPA/DDP mode".

* When the remote side (the MPA Responder) gets this streaming mode message, the Consumer would send a last streaming mode message that effectively says "I acknowledge your Hello, and am now in MPA Responder mode". The exchange of these messages establishes the exact point in the TCP stream where MPA is enabled. The Responding Consumer enables MPA in the Responder mode and waits for the initial MPA startup message.

* The Initiating Consumer would enable MPA startup in the Initiator mode which then sends the MPA Request Frame. It is assumed that no Private Data messages are needed for this example, although it is possible to do so. The Initiating MPA (and Consumer) would also wait for the MPA connection to be accepted.

* The Responding MPA would receive the initial MPA Request Frame and would inform the Consumer that this message arrived. The Consumer can then accept the MPA/DDP connection or close the TCP connection.

* To accept the connection request, the Responding Consumer would use an appropriate API to bind the TCP/MPA connections to a DDP endpoint, thus enabling MPA/DDP into Full Operation. In the process of going to Full Operation, MPA sends the MPA Reply Frame. MPA/DDP waits for the first incoming FPDU before sending any FPDUs.

* If the initial TCP data was not a properly formatted MPA Request Frame, MPA will close or reset the TCP connection immediately.

* The Initiating MPA would receive the MPA Reply Frame and would report this message to the Consumer. The Consumer can then accept the MPA/DDP connection, or close or reset the TCP connection to abort the process.

* On determining that the connection is acceptable, the Initiating Consumer would use an appropriate API to bind the TCP/MPA connections to a DDP endpoint, thus enabling MPA/DDP into Full Operation. MPA/DDP would begin sending DDP messages as MPA FPDUs.
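The two sides of the sequence above, combined with the Startup Rules of Section 7.1.2, can be sketched as a pair of small state machines. This is a simplified model under stated assumptions, not a definitive implementation: the class, method, and phase names are hypothetical, frame parsing and timeouts are omitted, and the caller is assumed to have already extracted the Key and the Rejected Connection bit from the received frame.

```python
from enum import Enum, auto

REQ_KEY = b"MPA ID Req Frame"
REP_KEY = b"MPA ID Rep Frame"


class Phase(Enum):
    STARTUP = auto()         # exchanging MPA Request/Reply Frames
    FULL_OPERATION = auto()  # received data is interpreted as FPDUs
    EXITED = auto()          # MPA exited, TCP connection left open
    CLOSED = auto()          # TCP connection closed, error reported locally


class Responder:
    def __init__(self) -> None:
        self.phase = Phase.STARTUP
        self.may_send_fpdus = False

    def on_request(self, key: bytes, private_data_ok: bool = True):
        """Handle the startup frame; return the R bit to send, or None."""
        if key != REQ_KEY:
            self.phase = Phase.CLOSED  # improperly formatted: close TCP
            return None
        if not private_data_ok:
            self.phase = Phase.EXITED  # reject, leave TCP open
            return 1
        self.phase = Phase.FULL_OPERATION
        return 0

    def on_first_fpdu_validated(self) -> None:
        # Rule 4: the Responder may send only after one valid FPDU arrives.
        if self.phase is Phase.FULL_OPERATION:
            self.may_send_fpdus = True


class Initiator:
    def __init__(self) -> None:
        self.phase = Phase.STARTUP
        self.may_send_fpdus = False

    def on_reply(self, key: bytes, rejected: bool,
                 private_data_ok: bool = True) -> None:
        if key != REP_KEY:
            # A Request Key here means the peer also started as Initiator.
            self.phase = Phase.CLOSED
        elif rejected or not private_data_ok:
            self.phase = Phase.EXITED
        else:
            self.phase = Phase.FULL_OPERATION
            self.may_send_fpdus = True
```

Because the Request and Reply Keys differ, an "Initiator/Initiator" startup shows up in this model as an Initiator receiving REQ_KEY, which drives it to the CLOSED phase. Detecting a "Responder/Responder" startup would need the timeout that this sketch leaves out.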
7.1.4. Use of Private Data
This section is advisory in nature, in that it suggests a method by which a ULP can deal with pre-DDP connection information exchange.

7.1.4.1. Motivation
Prior RDMA protocols have been developed that provide Private Data via out-of-band mechanisms. As a result, many applications now expect some form of Private Data to be available for application use prior to setting up the DDP/RDMA connection. Following are some examples of the use of Private Data.

An RDMA endpoint (referred to as a Queue Pair, or QP, in InfiniBand and the [VERBS-RDMA]) must be associated with a Protection Domain. No receive operations may be posted to the endpoint before it is associated with a Protection Domain. Indeed under both the InfiniBand and proposed RDMA/DDP verbs [VERBS-RDMA] an endpoint/QP is created within a Protection Domain.

There are some applications where the choice of Protection Domain is dependent upon the identity of the remote ULP client. For example, if a user session requires multiple connections, it is highly desirable for all of those connections to use a single Protection Domain. Note: Use of Protection Domains is further discussed in [RDMASEC].

InfiniBand, the DAT APIs [DAT-API], and the IT-API [IT-API] all provide for the active-side ULP to provide Private Data when requesting a connection. This data is passed to the ULP to allow it to determine whether to accept the connection, and if so with which endpoint (and implicitly which Protection Domain).

The Private Data can also be used to ensure that both ends of the connection have configured their RDMA endpoints compatibly on such matters as the RDMA Read capacity (see [RDMAP]). Further ULP-specific uses are also presumed, such as establishing the identity of the client.

Private Data is also allowed for when accepting the connection, to allow completion of any negotiation on RDMA resources and for other ULP reasons.

There are several potential ways to exchange this Private Data. For example, the InfiniBand specification includes a connection management protocol that allows a small amount of Private Data to be exchanged using datagrams before actually starting the RDMA connection.
This document allows for small amounts of Private Data to be exchanged as part of the MPA startup sequence. The actual Private Data fields are carried in the MPA Request Frame and the MPA Reply Frame. If larger amounts of Private Data or more negotiation is necessary, TCP streaming mode messages may be exchanged prior to enabling MPA.
7.1.4.2. Example Immediate Startup Using Private Data
         Initiator                                 Responder

   +---------------------------+
   |TCP SYN sent.              |           +--------------------------+
   +---------------------------+ --------> |TCP gets SYN packet;      |
   +---------------------------+           |  sends SYN-Ack.          |
   |TCP gets SYN-Ack           | <-------- +--------------------------+
   |  sends Ack.               |
   +---------------------------+ --------> +--------------------------+
   +---------------------------+           |Consumer enables MPA      |
   |Consumer enables MPA       |           |Responder mode, waits for |
   |Initiator mode with        |           |  <MPA Request Frame>.    |
   |Private Data; MPA sends    |           +--------------------------+
   |  <MPA Request Frame>;     |
   |MPA waits for incoming     |           +--------------------------+
   |  <MPA Reply Frame>.       | - - - - > |MPA receives              |
   +---------------------------+           |  <MPA Request Frame>.    |
                                           |Consumer examines Private |
                                           |Data, provides MPA with   |
                                           |return Private Data,      |
                                           |binds DDP to MPA, and     |
                                           |enables MPA to send an    |
                                           |  <MPA Reply Frame>.      |
                                           |DDP/MPA enables FPDU      |
   +---------------------------+           |decoding, but does not    |
   |MPA receives the           | < - - - - |send any FPDUs.           |
   |  <MPA Reply Frame>.       |           +--------------------------+
   |Consumer examines Private  |
   |Data, binds DDP to MPA,    |
   |and enables DDP/MPA to     |
   |begin Full Operation.      |
   |MPA sends first FPDU (as   |           +--------------------------+
   |DDP ULPDUs become          | ========> |MPA receives first FPDU.  |
   |available).                |           |MPA sends first FPDU (as  |
   +---------------------------+           |DDP ULPDUs become         |
                                   <====== |available).               |
                                           +--------------------------+

   Figure 10: Example Immediate Startup Negotiation

Note: The exact order of when MPA is started in the TCP connection sequence is implementation dependent; the above diagram shows one possible sequence. Also, the Initiator "Ack" to the Responder's "SYN-Ack" may be combined into the same TCP segment containing the MPA Request Frame (as is allowed by TCP RFCs).
The example immediate startup sequence is described below:

* The passive side (Responding Consumer) would listen on the TCP destination port, to indicate its readiness to accept a connection.

* The active side (Initiating Consumer) would request a connection from a TCP endpoint (that expected to upgrade to MPA/DDP/RDMA and expected the Private Data) to a destination address and port.

* The Initiating Consumer would initiate a TCP connection to the destination port. Acceptance/rejection of the connection would proceed as per normal TCP connection establishment.

* The passive side (Responding Consumer) would receive the TCP connection request as usual, allowing normal TCP gatekeepers, such as INETD and TCPserver, to exercise their normal safeguard/logging functions. On acceptance of the TCP connection, the Responding Consumer would enable MPA in the Responder mode and wait for the initial MPA startup message.

* The Initiating Consumer would enable MPA startup in the Initiator mode to send an initial MPA Request Frame with its included Private Data message. The Initiating MPA (and Consumer) would also wait for the MPA connection to be accepted, and any returned Private Data.

* The Responding MPA would receive the initial MPA Request Frame with the Private Data message and would pass the Private Data through to the Consumer. The Consumer can then accept the MPA/DDP connection, close the TCP connection, or reject the MPA connection with a return message.

* To accept the connection request, the Responding Consumer would use an appropriate API to bind the TCP/MPA connections to a DDP endpoint, thus enabling MPA/DDP into Full Operation. In the process of going to Full Operation, MPA sends the MPA Reply Frame, which includes the Consumer-supplied Private Data containing any appropriate Consumer response. MPA/DDP waits for the first incoming FPDU before sending any FPDUs.

* If the initial TCP data was not a properly formatted MPA Request Frame, MPA will close or reset the TCP connection immediately.

* To reject the MPA connection request, the Responding Consumer would send an MPA Reply Frame with any ULP-supplied Private Data (with reason for rejection), with the "Rejected Connection" bit set to '1', and may close the TCP connection.

* The Initiating MPA would receive the MPA Reply Frame with the Private Data message and would report this message to the Consumer, including the supplied Private Data.

  If the "Rejected Connection" bit is set to a '1', MPA will close the TCP connection and exit.

  If the "Rejected Connection" bit is set to a '0', and on determining from the MPA Reply Frame Private Data that the connection is acceptable, the Initiating Consumer would use an appropriate API to bind the TCP/MPA connections to a DDP endpoint, thus enabling MPA/DDP into Full Operation. MPA/DDP would begin sending DDP messages as MPA FPDUs.

7.1.5. "Dual Stack" Implementations
MPA/DDP implementations are commonly expected to be implemented as part of a "dual stack" architecture. One stack is the traditional TCP stack, usually with a sockets interface API (Application Programming Interface). The second stack is the MPA/DDP stack with its own API, and potentially separate code or hardware to deal with the MPA/DDP data. Of course, implementations may vary, so the following comments are of an advisory nature only.

The use of the two stacks offers advantages:

   TCP connection setup is usually done with the TCP stack. This allows use of the usual naming and addressing mechanisms. It also means that any mechanisms used to "harden" the connection setup against security threats are also used when starting MPA/DDP.

   Some applications may have been originally designed for TCP, but are "enhanced" to utilize MPA/DDP after a negotiation reveals the capability to do so. The negotiation process takes place in TCP's streaming mode, using the usual TCP APIs.

   Some new applications, designed for RDMA or DDP, still need to exchange some data prior to starting MPA/DDP. This exchange can be of arbitrary length or complexity, but often consists of only a small amount of Private Data, perhaps only a single message. Using the TCP streaming mode for this exchange allows this to be done using well-understood methods.
The main disadvantage of using two stacks is the conversion of an active TCP connection between them. This process must be done with care to prevent loss of data.

To avoid some of the problems when using a "dual stack" architecture, the following additional restrictions may be required by the implementation:

1. Enabling the DDP/MPA stack SHOULD be done only when no incoming stream data is expected. This is typically managed by the ULP protocol. When following the recommended startup sequence, the Responder side enters DDP/MPA mode, sends the last streaming mode data, and then waits for the MPA Request Frame. No additional streaming mode data is expected. The Initiator side ULP receives the last streaming mode data, and then enters DDP/MPA mode. Again, no additional streaming mode data is expected.

2. The DDP/MPA MAY provide the ability to send a "last streaming message" as part of its Responder DDP/MPA enable function. This allows the DDP/MPA stack to more easily manage the conversion to DDP/MPA mode (and avoid problems with a very fast return of the MPA Request Frame from the Initiator side).

Note: Regardless of the "stack" architecture used, TCP's rules MUST be followed. For example, if network data is lost, re-segmented, or re-ordered, TCP MUST recover appropriately even when this occurs while switching stacks.

7.2. Normal Connection Teardown
Each half connection of MPA terminates when DDP closes the corresponding TCP half connection. A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware that a graceful close of the TCP connection has been received by the TCP (e.g., FIN is received).
8. Error Semantics
The following errors MUST be detected by MPA and the codes SHOULD be provided to DDP or other Consumer:

   Code  Error

   1     TCP connection closed, terminated, or lost. This includes lost by timeout, too many retries, RST received, or FIN received.

   2     Received MPA CRC does not match the calculated value for the FPDU.

   3     In the event that the CRC is valid, received MPA Marker (if enabled) and ULPDU Length fields do not agree on the start of an FPDU. If the FPDU start determined from previous ULPDU Length fields does not match with the MPA Marker position, MPA SHOULD deliver an error to DDP. It may not be possible to make this check as a segment arrives, but the check SHOULD be made when a gap creating an out-of-order sequence is closed and any time a Marker points to an already identified FPDU. It is OPTIONAL for a receiver to check each Marker, if multiple Markers are present in an FPDU, or if the segment is received in order.

   4     Invalid MPA Request Frame or MPA Response Frame received. In this case, the TCP connection MUST be immediately closed. DDP and other ULPs should treat this similar to code 1, above.

When conditions 2 or 3 above are detected, an optimized MPA/TCP implementation MAY choose to silently drop the TCP segment rather than reporting the error to DDP. In this case, the sending TCP will retry the segment, usually correcting the error, unless the problem was at the source. In that case, the source will usually exceed the number of retries and terminate the connection.

Once MPA delivers an error of any type, it MUST NOT pass or deliver any additional FPDUs on that half connection.

For Error codes 2 and 3, MPA MUST NOT close the TCP connection following a reported error. Closing the connection is the responsibility of DDP's ULP.

Note that since MPA will not Deliver any FPDUs on a half connection following an error detected on the receive side of that connection, DDP's ULP is expected to tear down the connection. This may not occur until after one or more last messages are transmitted on the opposite half connection. This allows a diagnostic error message to be sent.
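The rules in this section can be sketched as a small piece of receive-side state. The following Python fragment is illustrative only: the class HalfConnectionReceiver, its callbacks (deliver_to_ddp, report_to_ddp, close_tcp), and the symbolic names given to the four codes are hypothetical and are not defined by this specification. The sketch covers only the reporting path; it does not model the option for an optimized MPA/TCP implementation to silently drop a segment for codes 2 and 3.

```python
from enum import IntEnum


class MpaError(IntEnum):
    """The four error codes of this section; the names are illustrative."""
    TCP_CONNECTION_LOST = 1
    CRC_MISMATCH = 2
    MARKER_LENGTH_MISMATCH = 3
    INVALID_REQ_OR_REP_FRAME = 4


class HalfConnectionReceiver:
    """Hypothetical receive-side state for one MPA half connection."""

    def __init__(self, deliver_to_ddp, report_to_ddp, close_tcp):
        self._deliver = deliver_to_ddp
        self._report = report_to_ddp
        self._close_tcp = close_tcp
        self.error = None  # first error reported, if any

    def on_fpdu(self, ulpdu):
        """Deliver a validated ULPDU unless an error was already reported."""
        if self.error is not None:
            # Once an error is delivered, no further FPDUs are delivered.
            return False
        self._deliver(ulpdu)
        return True

    def on_error(self, code):
        """Report the first error to DDP and apply the close rule."""
        if self.error is not None:
            return
        self.error = MpaError(code)
        self._report(self.error)
        if self.error is MpaError.INVALID_REQ_OR_REP_FRAME:
            # Code 4: the TCP connection is closed immediately.
            self._close_tcp()
        # Codes 2 and 3: MPA does not close TCP; that is left to DDP's ULP.
        # Code 1: the connection is already closed, terminated, or lost.
```

Keeping the first error in the receiver and checking it in on_fpdu is one simple way to guarantee that nothing is delivered after an error, while leaving the opposite half connection untouched so that the ULP can still send a diagnostic message before tearing the connection down.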
9. Security Considerations
This section discusses the security considerations for MPA.

9.1. Protocol-Specific Security Considerations
The vulnerabilities of MPA to third-party attacks are no greater than any other protocol running over TCP. A third party, by sending packets into the network that are delivered to an MPA receiver, could launch a variety of attacks that take advantage of how MPA operates. For example, a third party could send random packets that are valid for TCP, but contain no FPDU headers. An MPA receiver reports an error to DDP when any packet arrives that cannot be validated as an FPDU when properly located on an FPDU boundary. A third party could also send packets that are valid for TCP, MPA, and DDP, but do not target valid buffers. These types of attacks ultimately result in loss of connection and thus become a type of DOS (Denial Of Service) attack. Communication security mechanisms such as IPsec [RFC2401, RFC4301] may be used to prevent such attacks.

Independent of how MPA operates, a third party could use ICMP messages to reduce the path MTU to such a small size that performance would likewise be severely impacted. Range checking on path MTU sizes in ICMP packets may be used to prevent such attacks.

[RDMAP] and [DDP] are used to control, read, and write data buffers over IP networks. Therefore, the control and the data packets of these protocols are vulnerable to the spoofing, tampering, and information disclosure attacks listed below. In addition, connection to/from an unauthorized or unauthenticated endpoint is a potential problem with most applications using RDMA, DDP, and MPA.

9.1.1. Spoofing
Spoofing attacks can be launched by the Remote Peer or by a network based attacker. A network-based spoofing attack applies to all Remote Peers. Because the MPA Stream requires a TCP Stream in the ESTABLISHED state, certain types of traditional forms of wire attacks do not apply -- an end-to-end handshake must have occurred to establish the MPA Stream. So, the only form of spoofing that applies is one when a remote node can both send and receive packets. Yet even with this limitation the Stream is still exposed to the following spoofing attacks.
9.1.1.1. Impersonation
A network-based attacker can impersonate a legal MPA/DDP/RDMAP peer (by spoofing a legal IP address) and establish an MPA/DDP/RDMAP Stream with the victim. End-to-end authentication (i.e., IPsec or ULP authentication) provides protection against this attack.

9.1.1.2. Stream Hijacking
Stream hijacking happens when a network-based attacker follows the Stream establishment phase, and waits until the authentication phase (if such a phase exists) is completed successfully. The attacker can then spoof the IP address and redirect the Stream from the victim to the attacker's own machine. For example, an attacker can wait until an iSCSI authentication is completed successfully, and hijack the iSCSI Stream.

The best protection against this form of attack is end-to-end integrity protection and authentication, such as IPsec, to prevent spoofing. Another option is to provide physical security. Discussion of physical security is out of scope for this document.

9.1.1.3. Man-in-the-Middle Attack
If a network-based attacker has the ability to delete, inject, replay, or modify packets that will still be accepted by MPA (e.g., TCP sequence number is correct, FPDU is valid, etc.), then the Stream can be exposed to a man-in-the-middle attack. The attacker could potentially use the services of [DDP] and [RDMAP] to read the contents of the associated Data Buffer, to modify the contents of the associated Data Buffer, or to disable further access to the buffer. Other attacks on the connection setup sequence and even on TCP can be used to cause denial of service. The only countermeasure for this form of attack is to either secure the MPA/DDP/RDMAP Stream (i.e., integrity protect) or attempt to provide physical security to prevent man-in-the-middle type attacks. The best protection against this form of attack is end-to-end integrity protection and authentication, such as IPsec, to prevent spoofing or tampering. If Stream or session level authentication and integrity protection are not used, then a man-in-the-middle attack can occur, enabling spoofing and tampering. Another approach is to restrict access to only the local subnet/link and provide some mechanism to limit access, such as physical security or 802.1.x. This model is an extremely limited deployment scenario and will not be further examined here.
9.1.2. Eavesdropping
Generally speaking, Stream confidentiality protects against eavesdropping. Stream and/or session authentication and integrity protection are a countermeasure against various spoofing and tampering attacks. The effectiveness of authentication and integrity against a specific attack depends on whether the authentication is machine-level authentication (such as that provided by IPsec) or ULP authentication.

9.2. Introduction to Security Options
The following security services can be applied to an MPA/DDP/RDMAP Stream:

1. Session confidentiality - protects against eavesdropping.

2. Per-packet data source authentication - protects against the following spoofing attacks: network-based impersonation, Stream hijacking, and man in the middle.

3. Per-packet integrity - protects against tampering done by network-based modification of FPDUs (indirectly affecting buffer content through DDP services).

4. Packet sequencing - protects against replay attacks, which is a special case of the above tampering attack.

If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks, or Stream hijacking attacks, it is recommended that the Stream be authenticated, integrity protected, and protected from replay attacks. It may use confidentiality protection to protect from eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public network).

IPsec is capable of providing the above security services for IP and TCP traffic.

ULP protocols may be able to provide part of the above security services. See [NFSv4CHAN] for additional information on a promising approach called "channel binding". From [NFSv4CHAN]:

   "The concept of channel bindings allows applications to prove that the end-points of two secure channels at different network layers are the same by binding authentication at one channel to the session protection at the other channel. The use of channel bindings allows applications to delegate session protection to lower layers, which may significantly improve performance for some applications."

9.3. Using IPsec with MPA
IPsec can be used to protect against the packet injection attacks outlined above. Because IPsec is designed to secure individual IP packets, MPA can run above IPsec without change. IPsec packets are processed (e.g., integrity checked and decrypted) in the order they are received, and an MPA receiver will process the decrypted FPDUs contained in these packets in the same manner as FPDUs contained in unsecured IP packets.

MPA implementations MUST implement IPsec as described in Section 9.4 below. The use of IPsec is up to ULPs and administrators.

9.4. Requirements for IPsec Encapsulation of MPA/DDP
The IP Storage working group has spent significant time and effort to define the normative IPsec requirements for IP storage [RFC3723]. Portions of that specification are applicable to a wide variety of protocols, including the RDDP protocol suite. In order not to replicate this effort, an MPA on TCP implementation MUST follow the requirements defined in RFC 3723, Sections 2.3 and 5, including the associated normative references for those sections.

Additionally, since IPsec acceleration hardware may only be able to handle a limited number of active Internet Key Exchange Protocol (IKE) Phase 2 security associations (SAs), Phase 2 delete messages MAY be sent for idle SAs, as a means of keeping the number of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2 delete message MUST NOT be interpreted as a reason for tearing down a DDP/RDMA Stream. Rather, it is preferable to leave the Stream up, and if additional traffic is sent on it, to bring up another IKE Phase 2 SA to protect it. This avoids the potential for continually bringing Streams up and down.

The IPsec requirements for RDDP are based on the version of IPsec specified in RFC 2401 [RFC2401] and related RFCs, as profiled by RFC 3723 [RFC3723], despite the existence of a newer version of IPsec specified in RFC 4301 [RFC4301] and related RFCs. One of the important early applications of the RDDP protocols is their use with iSCSI [iSER]; RDDP's IPsec requirements follow those of IPsec in order to facilitate that usage by allowing a common profile of IPsec to be used with iSCSI and the RDDP protocols. In the future, RFC 3723 may be updated to the newer version of IPsec; the IPsec security requirements of any such update should apply uniformly to iSCSI and the RDDP protocols.

Note that there are serious security issues if IPsec is not implemented end-to-end. For example, if IPsec is implemented as a tunnel in the middle of the network, any hosts between the peer and the IPsec tunneling device can freely attack the unprotected Stream.

10. IANA Considerations
No IANA actions are required by this document. If a well-known port is chosen as the mechanism to identify a DDP on MPA on TCP, the well-known port must be registered with IANA. Because the use of the port is DDP specific, registration of the port with IANA is left to DDP.