RFC 5044

Marker PDU Aligned Framing for TCP Specification

Pages: 74
Proposed Standard
Updated by: 6581 7146

Part 2 of 3 – Pages 22 to 44

RFC5044 - Page 22

5.  MPA's interactions with TCP

   The following sections describe MPA's interactions with TCP.  This
   section discusses using a standard layered TCP stack with MPA
   attached above a TCP socket.  An optimized MPA-aware TCP, paired
   with an MPA implementation that takes advantage of the extra
   optimizations, is discussed in Appendix A.

                   +-----------------------------------+
                   | +-----+       +-----------------+ |
                   | | MPA |       | Other Protocols | |
                   | +-----+       +-----------------+ |
                   |    ||                  ||         |
                   |  ----- socket API --------------  |
                   |            ||                     |
                   |         +-----+                   |
                   |         | TCP |                   |
                   |         +-----+                   |
                   |            ||                     |
                   |         +-----+                   |
                   |         | IP  |                   |
                   |         +-----+                   |
                   +-----------------------------------+

                   Figure 7: Fully Layered Implementation

   The fully layered implementation is described for completeness;
   however, the user is cautioned that the reduced probability of FPDU
   alignment when transmitting with this implementation will tend to
   introduce a higher overhead at optimized receivers.  In addition, the
   lack of out-of-order receive processing will significantly reduce the
   value of DDP/MPA by imposing higher buffering and copying overhead in
   the local receiver.

5.1.  MPA transmitters with a standard layered TCP

   MPA transmitters SHOULD calculate a MULPDU as described in Section
   4.5.  If the TCP implementation allows EMSS to be determined by MPA,
   that value should be used.  If the transmit side TCP implementation
   is not able to report the EMSS, MPA SHOULD use the current MTU value
   to establish a likely FPDU size, taking into account the various
   expected header sizes.
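
   As a non-normative illustration of the fallback described above, the
   following Python sketch estimates an EMSS from the MTU and derives a
   MULPDU from it.  The header sizes (20-octet IPv4 header, 20-octet TCP
   header, 12 octets of TCP options) are assumptions chosen for the
   example, the function names are hypothetical, and the MULPDU formula
   is assumed to be the one given in Section 4.5, which is not
   reproduced in this part.

```python
def estimate_emss(mtu: int, ip_header: int = 20, tcp_header: int = 20,
                  tcp_options: int = 12) -> int:
    """Guess the EMSS from the MTU when TCP cannot report it.

    The default header sizes are assumptions (IPv4 without options,
    TCP with a 12-octet timestamp option block).
    """
    emss = mtu - ip_header - tcp_header - tcp_options
    if emss <= 0:
        raise ValueError("MTU too small for the assumed headers")
    return emss


def mulpdu_from_emss(emss: int) -> int:
    """Largest ULPDU that fits in one segment of size `emss`.

    Assumed formula (Section 4.5):
        MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)
    6 octets cover the ULPDU Length and CRC fields, the Ceiling term
    covers Markers, and the last term covers padding.
    """
    markers = 4 * ((emss + 511) // 512)  # integer form of Ceiling(EMSS / 512)
    return emss - (6 + markers + emss % 4)


if __name__ == "__main__":
    emss = estimate_emss(1500)
    print("estimated EMSS:", emss, "MULPDU:", mulpdu_from_emss(emss))
```

   An EMSS reported by the TCP stack, when available, is preferable to
   this estimate because it reflects the options actually negotiated.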

   MPA transmitters SHOULD also use whatever facilities the TCP stack
   presents to cause the TCP transmitter to start TCP segments at FPDU
   boundaries.  Multiple FPDUs MAY be packed into a single TCP segment
   as determined by the EMSS calculation as long as they are entirely
   contained in the TCP segment.

   For example, passing FPDU buffers sized to the current EMSS to the
   TCP socket and using the TCP_NODELAY socket option to disable the
   Nagle [RFC896] algorithm will usually result in many of the segments
   starting with an FPDU.
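
   The approach in the preceding paragraph can be sketched with the
   Python socket module as follows.  This is only an illustration:
   the helper names are hypothetical, TCP_MAXSEG is not available on
   every platform, and the value it reports is the TCP maximum segment
   size, which is treated here as a hint for the EMSS rather than an
   exact figure.

```python
import socket
from typing import Optional


def prepare_mpa_socket(sock: socket.socket) -> Optional[int]:
    """Disable the Nagle algorithm and return a segment-size hint.

    Returns None when the platform cannot report TCP_MAXSEG.
    """
    # TCP_NODELAY disables Nagle, so each send is pushed out promptly
    # instead of being coalesced with later writes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    maxseg_opt = getattr(socket, "TCP_MAXSEG", None)
    if maxseg_opt is None:
        return None
    try:
        return sock.getsockopt(socket.IPPROTO_TCP, maxseg_opt)
    except OSError:
        return None


def send_fpdu(sock: socket.socket, fpdu: bytes) -> None:
    """Hand one complete FPDU to TCP in a single call.

    Passing whole FPDUs sized to the EMSS makes it likely, but not
    certain, that the segment starts on an FPDU boundary.
    """
    sock.sendall(fpdu)
```

   Nothing in this sketch guarantees alignment; it only makes aligned
   segments more likely, which is why receivers must cope with
   unaligned FPDUs.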

   It is recognized that various effects can cause FPDU Alignment to be
   lost.  A few of these effects follow:

   *   ULPDUs that are smaller than the MULPDU.  If these are sent in a
       continuous stream, FPDU Alignment will be lost.  Note that
       careful use of a dynamic MULPDU can help in this case; the MULPDU
       for future FPDUs can be adjusted to re-establish alignment with
       the segments based on the current EMSS.

   *   Sending enough data that the TCP receive window limit is reached.
       TCP may send a smaller segment to exactly fill the receive
       window.

   *   Sending data when TCP is operating up against the congestion
       window.  If TCP is not tracking the congestion window in
       segments, it may transmit a smaller segment to exactly fill the
       congestion window.

   *   Changes in EMSS due to varying TCP options, or changes in MTU.

   If FPDU Alignment with TCP segments is lost for any reason, the
   alignment is regained after a break in transmission where the TCP
   send buffers are emptied.  Many usage models for DDP/MPA will include
   such breaks.

   MPA receivers are REQUIRED to be able to operate correctly even if
   alignment is lost (see Section 6).

5.2.  MPA receivers with a standard layered TCP

   MPA receivers will get TCP data in the usual ordered stream.  The
   receivers MUST identify FPDU boundaries by using the ULPDU_LENGTH
   field, as described in Section 6.  Receivers MAY utilize markers to
   check for FPDU boundary consistency, but they are NOT required to
   examine the markers to determine the FPDU boundaries.

6.  MPA Receiver FPDU Identification

   An MPA receiver MUST first verify the FPDU before passing the ULPDU
   to DDP.  To do this, the receiver MUST:

   *   locate the start of the FPDU unambiguously,

   *   verify its CRC (if CRC checking is enabled).

   If the above conditions are true, the MPA receiver passes the ULPDU
   to DDP.

   To detect the start of the FPDU unambiguously one of the following
   MUST be used:

   1:  In an ordered TCP stream, the ULPDU Length field in the current
       FPDU, when that FPDU has a valid CRC, can be used to identify the
       beginning of the next FPDU.

   2:  For optimized MPA/TCP receivers that support out-of-order
       reception of FPDUs (see Section 4.3, MPA Markers) a Marker can
       always be used to locate the beginning of an FPDU (in FPDUs with
       valid CRCs).  Since the location of the Marker is known in the
       octet stream (sequence number space), the Marker can always be
       found.

   3:  Having found an FPDU by means of a Marker, an optimized MPA/TCP
       receiver can find following contiguous FPDUs by using the ULPDU
       Length fields (from FPDUs with valid CRCs) to establish the next
       FPDU boundary.

   The ULPDU Length field (see Section 4) MUST be used to determine if
   the entire FPDU is present before forwarding the ULPDU to DDP.

   CRC calculation is discussed in Section 4.4 above.
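
   Method 1 above (walking an ordered stream by ULPDU Length) can be
   sketched as follows.  The sketch is non-normative and rests on
   several assumptions: the FPDU layout is taken to be a 2-octet ULPDU
   Length, the ULPDU, padding to a 4-octet boundary, and a 4-octet CRC
   (as defined in Section 4, not reproduced here); Markers are assumed
   not to be in use; and CRC checking is assumed to be disabled, so the
   CRC field is skipped rather than verified.  The function name is
   hypothetical.

```python
import struct
from typing import List, Tuple


def split_fpdus(stream: bytes) -> Tuple[List[bytes], bytes]:
    """Split an in-order octet stream into ULPDUs.

    Returns the ULPDUs of every complete FPDU plus the leftover octets
    of a trailing, still incomplete FPDU.
    """
    ulpdus = []
    offset = 0
    while len(stream) - offset >= 2:
        (ulpdu_len,) = struct.unpack_from("!H", stream, offset)
        pad = -(2 + ulpdu_len) % 4            # pad header + ULPDU to 4 octets
        fpdu_len = 2 + ulpdu_len + pad + 4    # plus the CRC field
        if len(stream) - offset < fpdu_len:
            break                             # entire FPDU not present yet
        # CRC verification would go here when CRC checking is enabled.
        ulpdus.append(stream[offset + 2:offset + 2 + ulpdu_len])
        offset += fpdu_len
    return ulpdus, stream[offset:]
```

   The leftover octets are kept so that the caller can append the next
   TCP data and retry, which reflects the rule that the ULPDU is only
   forwarded once the entire FPDU is present.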

7.  Connection Semantics

7.1.  Connection Setup

   MPA requires that the Consumer MUST activate MPA, and any TCP
   enhancements for MPA, on a TCP half connection at the same location
   in the octet stream at both the sender and the receiver.  This is
   required in order for the Marker scheme to correctly locate the
   Markers (if enabled) and to correctly locate the first FPDU.

   MPA, and any TCP enhancements for MPA, are enabled by the ULP in both
   directions at once at an endpoint.

   This can be accomplished in several ways, and is left up to DDP's ULP:

   *   DDP's ULP MAY require DDP on MPA startup immediately after TCP
       connection setup.  This has the advantage that no streaming mode
       negotiation is needed.  An example of such a protocol is shown in
       Figure 10: Example Immediate Startup negotiation.

       This may be accomplished by using a well-known port, or a service
       locator protocol to locate an appropriate port on which DDP on
       MPA is expected to operate.

   *   DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
       normal TCP startup, using TCP streaming data exchanges on the
       same connection.  The exchange establishes that DDP on MPA (as
       well as other ULPs) will be used, and exactly locates the point
       in the octet stream where MPA is to begin operation.  Note that
       such a negotiation protocol is outside the scope of this
       specification.  A simplified example of such a protocol is shown
       in Figure 9: Example Delayed Startup negotiation.

   An MPA endpoint operates in two distinct phases.

   The Startup Phase is used to verify correct MPA setup, exchange CRC
   and Marker configuration, and optionally pass Private Data between
   endpoints prior to completing a DDP connection.  During this phase,
   specifically formatted frames are exchanged as TCP byte streams
   without using CRCs or Markers.  During this phase a DDP endpoint need
   not be "bound" to the MPA connection.  In fact, the choice of DDP
   endpoint and its operating parameters may not be known until the
   Consumer supplied Private Data (if any) has been examined by the
   Consumer.

   The second distinct phase is Full Operation during which FPDUs are
   sent using all the rules that pertain (CRCs, Markers, MULPDU
   restrictions, etc.).  A DDP endpoint MUST be "bound" to the MPA
   connection at entry to this phase.

   When Private Data is passed between ULPs in the Startup Phase, the
   ULP is responsible for interpreting that data, and then placing MPA
   into Full Operation.

   Note: The following text differentiates the two endpoints by calling
       them Initiator and Responder.  This is quite arbitrary and is NOT
       related to the TCP startup (SYN, SYN/ACK sequence).  The
       Initiator is the side that sends first in the MPA startup
       sequence (the MPA Request Frame).

   Note: The possibility that both endpoints would be allowed to make a
       connection at the same time, sometimes called an active/active
       connection, was considered by the work group and rejected.  There
       were several motivations for this decision.  One was that
       applications needing this facility were few (none other than
       theoretical at the time of this document).  Another was that the
       facility created some implementation difficulties, particularly
       with the "dual stack" designs described later on.  A last issue
       was that dealing with rejected connections at startup would have
       required at least an additional frame type, and more recovery
       actions, complicating the protocol.  While none of these issues
       was overwhelming, the group and implementers were not motivated
       to do the work to resolve these issues.  The protocol includes a
       method of detecting these active/active startup attempts so that
       they can be rejected and an error reported.

   The ULP is responsible for determining which side is Initiator or
   Responder.  For client/server type ULPs, this is easy.  For peer-peer
   ULPs (which might utilize a TCP style active/active startup), some
   mechanism (not defined by this specification) must be established, or
   some streaming mode data exchanged prior to MPA startup to determine
   which side starts in Initiator and which starts in Responder MPA
   mode.

7.1.1.  MPA Request and Reply Frame Format

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   0  |                                                               |
      +         Key (16 bytes containing "MPA ID Req Frame")          +
   4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
      +         Or  (16 bytes containing "MPA ID Rep Frame")          +
   8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
      +                                                               +
   12 |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   16 |M|C|R| Res     |     Rev       |          PD_Length            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      ~                                                               ~
      ~                   Private Data                                ~
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 8: MPA Request/Reply Frame

   Key: This field contains the "key" used to validate that the sender
       is an MPA sender.  Initiator mode senders MUST set this field to
       the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20
       49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal).  Responder
       mode receivers MUST check this field for the same value, and
       close the connection and report an error locally if any other
       value is detected.  Responder mode senders MUST set this field to
       the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20
       49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal).  Initiator
       mode receivers MUST check this field for the same value, and
       close the connection and report an error locally if any other
       value is detected.

   M: This bit declares an endpoint's REQUIRED Marker usage.  When this
       bit is '1' in an MPA Request Frame, the Initiator declares that
       Markers are REQUIRED in FPDUs sent from the Responder.  When set
       to '1' in an MPA Reply Frame, this bit declares that Markers are
       REQUIRED in FPDUs sent from the Initiator.  When an endpoint
       receives an MPA Request Frame or MPA Reply Frame with this bit
       set to '0', that endpoint MUST NOT add Markers to its data
       stream.  When the bit is '1', Markers MUST be added as described
       in Section 4.3, MPA Markers.

   C: This bit declares an endpoint's preferred CRC usage.  When this
       bit is '0' in both the MPA Request Frame and the MPA Reply Frame,
       CRCs MUST NOT be checked and need not be generated by either
       endpoint.  When this bit is '1' in either the MPA Request Frame
       or MPA Reply Frame, CRCs MUST be generated and checked by both
       endpoints.  Note that even when not in use, the CRC field remains
       present in the FPDU.  When CRCs are not in use, the CRC field
       MUST be considered valid for FPDU checking regardless of its
       contents.

   R: This bit is set to zero, and not checked on reception in the MPA
       Request Frame.  In the MPA Reply Frame, this bit is the Rejected
       Connection bit, set by the Responder's ULP to indicate acceptance
       '0', or rejection '1', of the connection parameters provided in
       the Private Data.

   Res: This field is reserved for future use.  It MUST be set to zero
       when sending, and not checked on reception.

   Rev: This field contains the revision of MPA.  For this version of
       the specification, senders MUST set this field to one.  MPA
       receivers compliant with this version of the specification MUST
       check this field.  If the MPA receiver cannot interoperate with
       the received version, then it MUST close the connection and
       report an error locally.  Otherwise, the MPA receiver should
       report the received version to the ULP.

   PD_Length: This field MUST contain the length in octets of the
       Private Data field.  A value of zero indicates that there is no
       Private Data field present at all.  If the receiver detects that
       the PD_Length field does not match the length of the Private Data
       field, or if the length of the Private Data field exceeds 512
       octets, the receiver MUST close the connection and report an
       error locally.  Otherwise, the MPA receiver should pass the
       PD_Length value and Private Data to the ULP.

   Private Data: This field may contain any value defined by ULPs or may
       not be present.  The Private Data field MUST be between 0 and 512
       octets in length.  ULPs define how to size, set, and validate
       this field within these limits.  Private Data usage is further
       discussed in Section 7.1.4.
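
   The field definitions above can be illustrated with a short,
   non-normative Python sketch that encodes and decodes an MPA
   Request/Reply Frame.  The class and function names are hypothetical.
   The sketch assumes that the whole frame has already been read from
   the TCP stream and that only Rev 1 is interoperable; the bit masks
   follow Figure 8, where M, C, and R are the three most significant
   bits of octet 16.

```python
import struct
from dataclasses import dataclass

REQ_KEY = b"MPA ID Req Frame"
REP_KEY = b"MPA ID Rep Frame"
M_BIT, C_BIT, R_BIT = 0x80, 0x40, 0x20
MPA_REV = 1
MAX_PRIVATE_DATA = 512
HEADER = struct.Struct("!16sBBH")  # Key, M/C/R/Res, Rev, PD_Length


@dataclass
class MpaFrame:
    is_request: bool
    markers: bool            # M bit
    crc: bool                # C bit
    rejected: bool = False   # R bit, meaningful only in a Reply
    private_data: bytes = b""


def pack_frame(frame: MpaFrame) -> bytes:
    if len(frame.private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 octets")
    flags = (M_BIT if frame.markers else 0) | (C_BIT if frame.crc else 0)
    if frame.rejected and not frame.is_request:
        flags |= R_BIT       # R stays zero in a Request Frame
    key = REQ_KEY if frame.is_request else REP_KEY
    header = HEADER.pack(key, flags, MPA_REV, len(frame.private_data))
    return header + frame.private_data


def parse_frame(data: bytes, expect_request: bool) -> MpaFrame:
    """Validate and decode one frame; ValueError means 'close and report'."""
    if len(data) < HEADER.size:
        raise ValueError("frame shorter than the fixed header")
    key, flags, rev, pd_length = HEADER.unpack_from(data)
    if key != (REQ_KEY if expect_request else REP_KEY):
        raise ValueError("unexpected Key")
    if rev != MPA_REV:
        raise ValueError("unsupported Rev")
    private_data = data[HEADER.size:]
    if pd_length > MAX_PRIVATE_DATA or pd_length != len(private_data):
        raise ValueError("PD_Length does not match Private Data")
    # R is not checked on reception in a Request; Res is never checked.
    rejected = bool(flags & R_BIT) and not expect_request
    return MpaFrame(expect_request, bool(flags & M_BIT),
                    bool(flags & C_BIT), rejected, private_data)


def crc_in_use(request: MpaFrame, reply: MpaFrame) -> bool:
    """CRCs are generated and checked if either frame sets the C bit."""
    return request.crc or reply.crc
```

   Marker usage is directional, unlike CRC usage: the M bit of a
   received frame tells the receiving endpoint whether it must insert
   Markers in the FPDUs it sends.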

7.1.2.  Connection Startup Rules

   The following rules apply to MPA connection Startup Phase:

   1.  When MPA is started in the Initiator mode, the MPA implementation
       MUST send a valid MPA Request Frame.  The MPA Request Frame MAY
       include ULP-supplied Private Data.

   2.  When MPA is started in the Responder mode, the MPA implementation
       MUST wait until an MPA Request Frame is received and validated
       before entering Full MPA/DDP Operation.

       If the MPA Request Frame is improperly formatted, the
       implementation MUST close the TCP connection and exit MPA.

       If the MPA Request Frame is properly formatted but the Private
       Data is not acceptable, the implementation SHOULD return an MPA
       Reply Frame with the Rejected Connection bit set to '1'; the MPA
       Reply Frame MAY include ULP-supplied Private Data; the
       implementation MUST exit MPA, leaving the TCP connection open.
       The ULP may close TCP or use the connection for other purposes.

       If the MPA Request Frame is properly formatted and the Private
       Data is acceptable, the implementation SHOULD return an MPA Reply
       Frame with the Rejected Connection bit set to '0'; the MPA Reply
       Frame MAY include ULP-supplied Private Data; and the Responder
       SHOULD prepare to interpret any data received as FPDUs and pass
       any received ULPDUs to DDP.

       Note: Since the receiver's ability to deal with Markers is
           unknown until the Request and Reply Frames have been
           received, sending FPDUs before this occurs is not possible.


       Note: The requirement to wait on a Request Frame before sending a
           Reply Frame is a design choice.  It makes for a well-ordered
           sequence of events at each end, and avoids having to specify
           how to deal with situations where both ends start at the same
           time.

   3.  MPA Initiator mode implementations MUST receive and validate an
       MPA Reply Frame.

       If the MPA Reply Frame is improperly formatted, the
       implementation MUST close the TCP connection and exit MPA.

       If the MPA Reply Frame is properly formatted but the Private
       Data is not acceptable, or if the Rejected Connection bit is set
       to '1', the implementation MUST exit MPA, leaving the TCP
       connection open.  The ULP may close TCP or use the connection for
       other purposes.

       If the MPA Reply Frame is properly formatted and the Private Data
       is acceptable, and the Rejected Connection bit is set to '0', the
       implementation SHOULD enter Full MPA/DDP Operation Phase;
       interpreting any received data as FPDUs and sending DDP ULPDUs as
       FPDUs.

   4.  MPA Responder mode implementations MUST receive and validate at
       least one FPDU before sending any FPDUs or Markers.

       Note: This requirement is present to allow the Initiator time to
           get its receiver into Full Operation before an FPDU arrives,
           avoiding potential race conditions at the Initiator.  This
           was also subject to some debate in the work group before
           rough consensus was reached.  Eliminating this requirement
           would allow faster startup in some types of applications.
           However, that would also make certain implementations
           (particularly "dual stack") much harder.

   5.  If a received "Key" does not match the expected value (see
       Section 7.1.1, MPA Request and Reply Frame Format) the TCP/DDP
       connection MUST be closed, and an error returned to the ULP.

   6.  The received Private Data fields may be used by Consumers at
       either end to further validate the connection and set up DDP or
       other ULP parameters.  The Initiator ULP MAY close the
       TCP/MPA/DDP connection as a result of validating the Private Data
       fields.  The Responder SHOULD return an MPA Reply Frame with the
       "Rejected Connection" bit set to '1' if the validation of the
       Private Data is not acceptable to the ULP.

   7.  When the first FPDU is to be sent, then if Markers are enabled,
       the first octets sent are the special Marker 0x00000000, followed
       by the start of the FPDU (the FPDU's ULPDU Length field).  If
       Markers are not enabled, the first octets sent are the start of
       the FPDU (the FPDU's ULPDU Length field).

   8.  MPA implementations MUST use the difference between the MPA
       Request Frame and the MPA Reply Frame to check for incorrect
       "Initiator/Initiator" startups.  Implementations SHOULD put a
       timeout on waiting for the MPA Request Frame when started in
       Responder mode, to detect incorrect "Responder/Responder"
       startups.

   9.  MPA implementations MUST validate the PD_Length field.  The
       buffer that receives the Private Data field MUST be large enough
       to receive that data; the amount of Private Data MUST NOT exceed
       the PD_Length or the application buffer.  If any of the above
       fails, the startup frame MUST be considered improperly formatted.

   10. MPA implementations SHOULD implement a reasonable timeout while
       waiting for the entire set of startup frames; this prevents
       certain denial-of-service attacks.  ULPs SHOULD implement a
       reasonable timeout while waiting for FPDUs, ULPDUs, and
       application level messages to guard against application failures
       and certain denial-of-service attacks.
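
   The outcomes in rules 2 and 3 can be summarized as a pair of decision
   functions.  The Python sketch below is non-normative; the names are
   hypothetical, and the inputs are assumed to have been computed
   already (frame validation by MPA, Private Data acceptance by the
   ULP).

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    CLOSE_TCP = "close the TCP connection and exit MPA"
    EXIT_MPA = "exit MPA, leaving the TCP connection open for the ULP"
    FULL_OPERATION = "proceed to Full Operation"


def responder_on_request(well_formed: bool,
                         private_data_ok: bool) -> Tuple[Optional[int], Action]:
    """Rule 2: return (Rejected Connection bit for the Reply, action).

    The bit is None when no MPA Reply Frame is sent at all.
    """
    if not well_formed:
        return None, Action.CLOSE_TCP
    if not private_data_ok:
        return 1, Action.EXIT_MPA
    # Per rule 4, the Responder still waits for a first valid FPDU
    # before it sends any FPDUs or Markers of its own.
    return 0, Action.FULL_OPERATION


def initiator_on_reply(well_formed: bool, private_data_ok: bool,
                       rejected_bit: int) -> Action:
    """Rule 3: decide what the Initiator does with an MPA Reply Frame."""
    if not well_formed:
        return Action.CLOSE_TCP
    if rejected_bit == 1 or not private_data_ok:
        return Action.EXIT_MPA
    return Action.FULL_OPERATION
```

   Sending the Reply Frame is a SHOULD in rule 2, so an implementation
   may legitimately differ from the sketch in the rejected and accepted
   cases.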

7.1.3.  Example Delayed Startup Sequence

   A variety of startup sequences are possible when using MPA on TCP.
   Following is an example of an MPA/DDP startup that occurs after TCP
   has been running for a while and has exchanged some amount of
   streaming data.  This example does not use any Private Data (an
   example that does is shown later in Section 7.1.4.2, Example
   Immediate Startup Using Private Data), although it is perfectly legal
   to include the Private Data.  Note that since the example does not
   use any Private Data, there are no ULP interactions shown between
   receiving "startup frames" and putting MPA into Full Operation.

         Initiator                                 Responder

  +---------------------------+
  |ULP streaming mode         |
  |  <Hello> request to       |
  |  transition to DDP/MPA    |           +---------------------------+
  |  mode (optional).         | --------> |ULP gets request;          |
  +---------------------------+           |  enables MPA Responder    |
                                          |  mode with last (optional)|
                                          |  streaming mode           |
                                          |  <Hello Ack> for MPA to   |
                                          |  send.                    |
  +---------------------------+           |MPA waits for incoming     |
  |ULP receives streaming     | <-------- |  <MPA Request Frame>.     |
  |  <Hello Ack>;             |           +---------------------------+
  |Enters MPA Initiator mode; |
  |MPA sends                  |
  |  <MPA Request Frame>;     |
  |MPA waits for incoming     |           +---------------------------+
  |  <MPA Reply Frame>.       | - - - - > |MPA receives               |
  +---------------------------+           |  <MPA Request Frame>.     |
                                          |Consumer binds DDP to MPA; |
                                          |MPA sends the              |
                                          |  <MPA Reply Frame>.       |
                                          |DDP/MPA enables FPDU       |
  +---------------------------+           |  decoding, but does not   |
  |MPA receives the           | < - - - - |  send any FPDUs.          |
  |  <MPA Reply Frame>        |           +---------------------------+
  |Consumer binds DDP to MPA; |
  |DDP/MPA begins Full        |
  |  Operation.               |
  |MPA sends first FPDU (as   |           +---------------------------+
  |  DDP ULPDUs become        | ========> |MPA receives first FPDU.   |
  |  available).              |           |MPA sends first FPDU (as   |
  +---------------------------+           |  DDP ULPDUs become        |
                                  <====== |  available).              |
                                          +---------------------------+

              Figure 9: Example Delayed Startup Negotiation

   An example Delayed Startup sequence is described below:

       *   Active and passive sides start up a TCP connection in the
           usual fashion, probably using sockets APIs.  They exchange
           some amount of streaming mode data.  At some point, one side
           (the MPA Initiator) sends streaming mode data that
           effectively says "Hello, let's go into MPA/DDP mode".

   *   When the remote side (the MPA Responder) gets this streaming mode
       message, the Consumer would send a last streaming mode message
       that effectively says "I acknowledge your Hello, and am now in
       MPA Responder mode".  The exchange of these messages establishes
       the exact point in the TCP stream where MPA is enabled.  The
       Responding Consumer enables MPA in the Responder mode and waits
       for the initial MPA startup message.

       *   The Initiating Consumer would enable MPA startup in the
           Initiator mode which then sends the MPA Request Frame.  It is
           assumed that no Private Data messages are needed for this
           example, although it is possible to do so.  The Initiating
           MPA (and Consumer) would also wait for the MPA connection to
           be accepted.

   *   The Responding MPA would receive the initial MPA Request Frame
       and would inform the Consumer that this message arrived.  The
       Consumer can then accept the MPA/DDP connection or close the TCP
       connection.

   *   To accept the connection request, the Responding Consumer would
       use an appropriate API to bind the TCP/MPA connections to a DDP
       endpoint, thus enabling MPA/DDP into Full Operation.  In the
       process of going to Full Operation, MPA sends the MPA Reply
       Frame.  MPA/DDP waits for the first incoming FPDU before sending
       any FPDUs.

   *   If the initial TCP data was not a properly formatted MPA Request
       Frame, MPA will close or reset the TCP connection immediately.

       *   The Initiating MPA would receive the MPA Reply Frame and
           would report this message to the Consumer.  The Consumer can
           then accept the MPA/DDP connection, or close or reset the TCP
           connection to abort the process.

       *   On determining that the connection is acceptable, the
           Initiating Consumer would use an appropriate API to bind the
           TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP
           into Full Operation.  MPA/DDP would begin sending DDP
           messages as MPA FPDUs.

7.1.4.  Use of Private Data

   This section is advisory in nature, in that it suggests a method by
   which a ULP can deal with pre-DDP connection information exchange.

7.1.4.1.  Motivation

   Prior RDMA protocols have been developed that provide Private Data
   via out-of-band mechanisms.  As a result, many applications now
   expect some form of Private Data to be available for application use
   prior to setting up the DDP/RDMA connection.  Following are some
   examples of the use of Private Data.

   An RDMA endpoint (referred to as a Queue Pair, or QP, in InfiniBand
   and the [VERBS-RDMA]) must be associated with a Protection Domain.
   No receive operations may be posted to the endpoint before it is
   associated with a Protection Domain.  Indeed under both the
   InfiniBand and proposed RDMA/DDP verbs [VERBS-RDMA] an endpoint/QP is
   created within a Protection Domain.

   There are some applications where the choice of Protection Domain is
   dependent upon the identity of the remote ULP client.  For example,
   if a user session requires multiple connections, it is highly
   desirable for all of those connections to use a single Protection
   Domain.  Note: Use of Protection Domains is further discussed in
   [RDMASEC].

   InfiniBand, the DAT APIs [DAT-API], and the IT-API [IT-API] all
   provide for the active-side ULP to provide Private Data when
   requesting a connection.  This data is passed to the ULP to allow it
   to determine whether to accept the connection, and if so with which
   endpoint (and implicitly which Protection Domain).

   The Private Data can also be used to ensure that both ends of the
   connection have configured their RDMA endpoints compatibly on such
   matters as the RDMA Read capacity (see [RDMAP]).  Further ULP-
   specific uses are also presumed, such as establishing the identity of
   the client.

   Private Data is also allowed for when accepting the connection, to
   allow completion of any negotiation on RDMA resources and for other
   ULP reasons.

   There are several potential ways to exchange this Private Data.  For
   example, the InfiniBand specification includes a connection
   management protocol that allows a small amount of Private Data to be
   exchanged using datagrams before actually starting the RDMA
   connection.

   This document allows for small amounts of Private Data to be
   exchanged as part of the MPA startup sequence.  The actual Private
   Data fields are carried in the MPA Request Frame and the MPA Reply
   Frame.

   If larger amounts of Private Data or more negotiation is necessary,
   TCP streaming mode messages may be exchanged prior to enabling MPA.

7.1.4.2.  Example Immediate Startup Using Private Data

          Initiator                                 Responder

   +---------------------------+
   |TCP SYN sent.              |           +--------------------------+
   +---------------------------+ --------> |TCP gets SYN packet;      |
   +---------------------------+           |  sends SYN-Ack.          |
   |TCP gets SYN-Ack           | <-------- +--------------------------+
   |  sends Ack.               |
   +---------------------------+ --------> +--------------------------+
   +---------------------------+           |Consumer enables MPA      |
   |Consumer enables MPA       |           |Responder mode, waits for |
   |Initiator mode with        |           |  <MPA Request frame>.    |
   |Private Data; MPA sends    |           +--------------------------+
   |  <MPA Request Frame>;     |
   |MPA waits for incoming     |           +--------------------------+
   |  <MPA Reply Frame>.       | - - - - > |MPA receives              |
   +---------------------------+           |  <MPA Request Frame>.    |
                                           |Consumer examines Private |
                                           |Data, provides MPA with   |
                                           |return Private Data,      |
                                           |binds DDP to MPA, and     |
                                           |enables MPA to send an    |
                                           |  <MPA Reply Frame>.      |
                                           |DDP/MPA enables FPDU      |
   +---------------------------+           |decoding, but does not    |
   |MPA receives the           | < - - - - |send any FPDUs.           |
   |  <MPA Reply Frame>.       |           +--------------------------+
   |Consumer examines Private  |
   |Data, binds DDP to MPA,    |
   |and enables DDP/MPA to     |
   |begin Full Operation.      |
   |MPA sends first FPDU (as   |           +--------------------------+
   |DDP ULPDUs become          | ========> |MPA receives first FPDU.  |
   |available).                |           |MPA sends first FPDU (as  |
   +---------------------------+           |DDP ULPDUs become         |
                                   <====== |available).               |
                                           +--------------------------+

             Figure 10: Example Immediate Startup Negotiation

   Note: The exact order of when MPA is started in the TCP connection
       sequence is implementation dependent; the above diagram shows one
       possible sequence.  Also, the Initiator "Ack" to the Responder's
       "SYN-Ack" may be combined into the same TCP segment containing
       the MPA Request Frame (as is allowed by TCP RFCs).

   The example immediate startup sequence is described below:

   *   The passive side (Responding Consumer) would listen on the TCP
       destination port, to indicate its readiness to accept a
       connection.

       *   The active side (Initiating Consumer) would request a
           connection from a TCP endpoint (that expected to upgrade to
           MPA/DDP/RDMA and expected the Private Data) to a destination
           address and port.

       *   The Initiating Consumer would initiate a TCP connection to
           the destination port.  Acceptance/rejection of the connection
           would proceed as per normal TCP connection establishment.

   *   The passive side (Responding Consumer) would receive the TCP
       connection request as usual, allowing normal TCP gatekeepers,
       such as INETD and TCPserver, to exercise their normal
       safeguard/logging functions.  On acceptance of the TCP
       connection, the Responding Consumer would enable MPA in the
       Responder mode and wait for the initial MPA startup message.

       *   The Initiating Consumer would enable MPA startup in the
           Initiator mode to send an initial MPA Request Frame with its
           included Private Data message.  The Initiating MPA (and
           Consumer) would then wait for the MPA connection to be
           accepted and for any returned Private Data.

   *   The Responding MPA would receive the initial MPA Request Frame
       with the Private Data message and would pass the Private Data
       through to the Consumer.  The Consumer can then accept the
       MPA/DDP connection, close the TCP connection, or reject the MPA
       connection with a return message.

   *   To accept the connection request, the Responding Consumer would
       use an appropriate API to bind the TCP/MPA connections to a DDP
       endpoint, thus enabling MPA/DDP into Full Operation.  In the
       process of going to Full Operation, MPA sends the MPA Reply
       Frame, which includes the Consumer-supplied Private Data
       containing any appropriate Consumer response.  MPA/DDP waits for
       the first incoming FPDU before sending any FPDUs.

   *   If the initial TCP data was not a properly formatted MPA Request
       Frame, MPA will close or reset the TCP connection immediately.

   *   To reject the MPA connection request, the Responding Consumer
       would send an MPA Reply Frame with any ULP-supplied Private Data
       (with reason for rejection), with the "Rejected Connection" bit
       set to '1', and may close the TCP connection.

       *   The Initiating MPA would receive the MPA Reply Frame with the
           Private Data message and would report this message to the
           Consumer, including the supplied Private Data.

           If the "Rejected Connection" bit is set to a '1', MPA will
           close the TCP connection and exit.

           If the "Rejected Connection" bit is set to a '0', and on
           determining from the MPA Reply Frame Private Data that the
           connection is acceptable, the Initiating Consumer would use
           an appropriate API to bind the TCP/MPA connections to a DDP
           endpoint, thus enabling MPA/DDP into Full Operation.  MPA/DDP
           would begin sending DDP messages as MPA FPDUs.
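
   A minimal, non-normative sketch of the Initiator-side handling of
   the MPA Reply Frame described above follows.  It assumes the frame
   layout defined earlier in this specification: a 16-octet Key, one
   octet carrying the M, C, and R ("Rejected Connection") flag bits, a
   Rev octet, and a 16-bit PD_Length followed by the Private Data.  The
   names MpaReply, parse_mpa_reply, and initiator_action are
   hypothetical.  Closing the connection when the Consumer finds the
   Private Data unacceptable is also an assumption; the text above
   covers only the acceptable case.

```python
import struct
from dataclasses import dataclass

# Assumption: Key value and field layout as defined for the MPA Reply
# Frame earlier in this specification.
REPLY_KEY = b"MPA ID Rep Frame"
HEADER_LEN = 20  # Key (16) + flags (1) + Rev (1) + PD_Length (2)


@dataclass
class MpaReply:
    markers: bool        # M bit
    crc: bool            # C bit
    rejected: bool       # R bit ("Rejected Connection")
    revision: int
    private_data: bytes


def parse_mpa_reply(buf: bytes) -> MpaReply:
    """Parse a Reply Frame; raise ValueError if it is malformed."""
    if len(buf) < HEADER_LEN or buf[:16] != REPLY_KEY:
        raise ValueError("not a properly formatted MPA Reply Frame")
    flags, rev, pd_len = struct.unpack("!BBH", buf[16:20])
    if len(buf) < HEADER_LEN + pd_len:
        raise ValueError("truncated Private Data")
    return MpaReply(
        markers=bool(flags & 0x80),
        crc=bool(flags & 0x40),
        rejected=bool(flags & 0x20),
        revision=rev,
        private_data=buf[HEADER_LEN:HEADER_LEN + pd_len],
    )


def initiator_action(reply: MpaReply, consumer_accepts) -> str:
    """Return the next step for the Initiator as a label.

    consumer_accepts is a hypothetical callback that receives the
    returned Private Data and returns True or False.
    """
    if reply.rejected:
        return "close_tcp"
    if consumer_accepts(reply.private_data):
        return "bind_ddp_and_enter_full_operation"
    return "close_tcp"  # assumption: not covered by the text above
```

   In both outcomes, the Private Data carried in the reply remains
   available in MpaReply.private_data so that it can be reported to the
   Consumer.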

7.1.5.  "Dual Stack" Implementations

   MPA/DDP is commonly expected to be implemented as part of a "dual
   stack" architecture.  One stack is the traditional
   TCP stack, usually with a sockets interface API (Application
   Programming Interface).  The second stack is the MPA/DDP stack with
   its own API, and potentially separate code or hardware to deal with
   the MPA/DDP data.  Of course, implementations may vary, so the
   following comments are of an advisory nature only.

   The use of the two stacks offers advantages:

       TCP connection setup is usually done with the TCP stack.  This
       allows use of the usual naming and addressing mechanisms.  It
       also means that any mechanisms used to "harden" the connection
       setup against security threats are also used when starting
       MPA/DDP.

       Some applications may have been originally designed for TCP, but
       are "enhanced" to utilize MPA/DDP after a negotiation reveals the
       capability to do so.  The negotiation process takes place in
       TCP's streaming mode, using the usual TCP APIs.

       Some new applications, designed for RDMA or DDP, still need to
       exchange some data prior to starting MPA/DDP.  This exchange can
       be of arbitrary length or complexity, but often consists of only
       a small amount of Private Data, perhaps only a single message.
       Using the TCP streaming mode for this exchange allows this to be
       done using well-understood methods.

   The main disadvantage of using two stacks is the conversion of an
   active TCP connection between them.  This process must be done with
   care to prevent loss of data.

   To avoid some of the problems when using a "dual stack" architecture,
   the following additional restrictions may be required by the
   implementation:

   1.  Enabling the DDP/MPA stack SHOULD be done only when no incoming
       stream data is expected.  This is typically managed by the ULP.
       When following the recommended startup sequence, the
       Responder side enters DDP/MPA mode, sends the last streaming mode
       data, and then waits for the MPA Request Frame.  No additional
       streaming mode data is expected.  The Initiator side ULP receives
       the last streaming mode data, and then enters DDP/MPA mode.
       Again, no additional streaming mode data is expected.

   2.  The DDP/MPA MAY provide the ability to send a "last streaming
       message" as part of its Responder DDP/MPA enable function.  This
       allows the DDP/MPA stack to more easily manage the conversion to
       DDP/MPA mode (and avoid problems with a very fast return of the
       MPA Request Frame from the Initiator side).

   Note: Regardless of the "stack" architecture used, TCP's rules MUST
       be followed.  For example, if network data is lost, re-segmented,
       or re-ordered, TCP MUST recover appropriately even when this
       occurs while switching stacks.
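
   The two restrictions above can be illustrated with a small,
   non-normative Python sketch of a Responder enable function.  The
   class DualStackConnection, its attributes, and the method
   enable_mpa_responder are hypothetical names.  Treating "unread
   streaming data is buffered" as a stand-in for "incoming stream data
   is still expected" is a simplifying assumption; in practice, the ULP
   decides when the streaming exchange is complete.

```python
class DualStackConnection:
    """Toy model of one TCP connection shared by two stacks."""

    def __init__(self):
        self.mode = "streaming"
        self.pending_rx = b""   # streaming bytes not yet read by the ULP
        self.sent = []          # messages handed to TCP, in order

    def enable_mpa_responder(self, last_streaming_message: bytes = b""):
        # Restriction 1: do not convert while streaming data is
        # outstanding, since the conversion could lose it.
        if self.pending_rx:
            raise RuntimeError("unread streaming data; refusing to switch")
        # Enter DDP/MPA mode first, so that a very fast MPA Request
        # Frame from the Initiator is handled by the MPA side.
        self.mode = "mpa_responder"
        # Restriction 2: optionally send the "last streaming message"
        # as part of the enable function.
        if last_streaming_message:
            self.sent.append(last_streaming_message)
```

   Combining the mode switch and the final streaming send in one call
   keeps the ULP from having to order the two steps itself.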

7.2.  Normal Connection Teardown

   Each half connection of MPA terminates when DDP closes the
   corresponding TCP half connection.

   A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
   that a graceful close of the TCP connection has been received by the
   TCP (e.g., FIN is received).

8.  Error Semantics

   The following errors MUST be detected by MPA and the codes SHOULD be
   provided to DDP or other Consumer:

   Code Error

   1   TCP connection closed, terminated, or lost.  This includes lost
       by timeout, too many retries, RST received, or FIN received.

   2   Received MPA CRC does not match the calculated value for the
       FPDU.

   3   In the event that the CRC is valid, received MPA Marker (if
       enabled) and ULPDU Length fields do not agree on the start of an
       FPDU.  If the FPDU start determined from previous ULPDU Length
       fields does not match with the MPA Marker position, MPA SHOULD
       deliver an error to DDP.  It may not be possible to make this
       check as a segment arrives, but the check SHOULD be made when a
       gap creating an out-of-order sequence is closed and any time a
       Marker points to an already identified FPDU.  It is OPTIONAL for
       a receiver to check each Marker, if multiple Markers are present
       in an FPDU, or if the segment is received in order.

   4   Invalid MPA Request Frame or MPA Reply Frame received.  In this
       case, the TCP connection MUST be immediately closed.  DDP and
       other ULPs should treat this similarly to code 1, above.

   When conditions 2 or 3 above are detected, an optimized MPA/TCP
   implementation MAY choose to silently drop the TCP segment rather
   than reporting the error to DDP.  In this case, the sending TCP will
   retry the segment, usually correcting the error, unless the problem
   was at the source.  In that case, the source will usually exceed the
   number of retries and terminate the connection.

   Once MPA delivers an error of any type, it MUST NOT pass or deliver
   any additional FPDUs on that half connection.

   For Error codes 2 and 3, MPA MUST NOT close the TCP connection
   following a reported error.  Closing the connection is the
   responsibility of DDP's ULP.

       Note that since MPA will not Deliver any FPDUs on a half
       connection following an error detected on the receive side of
       that connection, DDP's ULP is expected to tear down the
       connection.  This may not occur until after one or more last
       messages are transmitted on the opposite half connection.  This
       allows a diagnostic error message to be sent.
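
   The error rules of this section can be summarized in a short,
   non-normative Python sketch.  The numeric codes are those listed
   above; the names MpaError, ReceiveHalf, report_error, and
   deliver_fpdu are hypothetical, and the sketch models only one
   receive-side half connection.

```python
from enum import IntEnum


class MpaError(IntEnum):
    TCP_CONNECTION_LOST = 1      # closed, terminated, or lost
    CRC_MISMATCH = 2             # received CRC differs from calculated
    MARKER_LENGTH_MISMATCH = 3   # Marker and ULPDU Length disagree
    INVALID_STARTUP_FRAME = 4    # bad MPA Request or Reply Frame


class ReceiveHalf:
    """Receive side of one half connection, as seen by DDP."""

    def __init__(self):
        self.error = None            # first error reported, if any
        self.delivered = []          # FPDUs passed up to DDP
        self.mpa_closed_tcp = False  # True only if MPA itself closed TCP

    def report_error(self, code: MpaError) -> None:
        if self.error is None:
            self.error = code
        if code == MpaError.INVALID_STARTUP_FRAME:
            # Code 4: the TCP connection is closed immediately.
            self.mpa_closed_tcp = True
        # Codes 2 and 3: MPA leaves the TCP connection open; closing
        # it is the responsibility of DDP's ULP.

    def deliver_fpdu(self, fpdu: bytes) -> bool:
        # After any error, no further FPDUs are delivered.
        if self.error is not None:
            return False
        self.delivered.append(fpdu)
        return True
```

   Because a CRC or Marker error leaves the TCP connection open in this
   model, the ULP can still transmit a final diagnostic message on the
   opposite half connection before tearing the connection down.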

9.  Security Considerations

   This section discusses the security considerations for MPA.

9.1.  Protocol-Specific Security Considerations

   The vulnerabilities of MPA to third-party attacks are no greater than
   those of any other protocol running over TCP.  A third party, by sending
   packets into the network that are delivered to an MPA receiver, could
   launch a variety of attacks that take advantage of how MPA operates.
   For example, a third party could send random packets that are valid
   for TCP, but contain no FPDU headers.  An MPA receiver reports an
   error to DDP when any packet arrives that cannot be validated as an
   FPDU when properly located on an FPDU boundary.  A third party could
   also send packets that are valid for TCP, MPA, and DDP, but do not
   target valid buffers.  These types of attacks ultimately result in
   loss of connection and thus become a type of DoS (Denial of Service)
   attack.  Communication security mechanisms such as IPsec [RFC2401,
   RFC4301] may be used to prevent such attacks.

   Independent of how MPA operates, a third party could use ICMP
   messages to reduce the path MTU to such a small size that performance
   would likewise be severely impacted.  Range checking on path MTU
   sizes in ICMP packets may be used to prevent such attacks.
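
   As a non-normative illustration, range checking of an
   ICMP-advertised path MTU might look like the following Python
   sketch.  The function name accept_icmp_mtu and the default floor of
   576 octets are assumptions chosen for the example; this
   specification does not define a particular floor.

```python
def accept_icmp_mtu(current_mtu: int, advertised_mtu: int,
                    floor: int = 576) -> int:
    """Return the path MTU to use after an ICMP MTU advertisement.

    The advertised value is accepted only if it lies in the range
    [floor, current_mtu).  Anything else is ignored and the current
    estimate is kept.  The floor is a hypothetical policy value.
    """
    if advertised_mtu < floor:
        return current_mtu   # implausibly small: ignore
    if advertised_mtu >= current_mtu:
        return current_mtu   # ICMP may only lower the estimate
    return advertised_mtu
```

   Ignoring values below the floor protects performance, but it can
   stall traffic on a path whose real MTU is smaller than the floor;
   clamping to the floor instead of ignoring is an alternative policy.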

   [RDMAP] and [DDP] are used to control, read, and write data buffers
   over IP networks.  Therefore, the control and the data packets of
   these protocols are vulnerable to the spoofing, tampering, and
   information disclosure attacks listed below.  In addition, connection
   to/from an unauthorized or unauthenticated endpoint is a potential
   problem with most applications using RDMA, DDP, and MPA.

9.1.1.  Spoofing

   Spoofing attacks can be launched by the Remote Peer or by a
   network-based attacker.  A network-based spoofing attack applies to
   all Remote Peers.  Because the MPA Stream requires a TCP Stream in
   the ESTABLISHED state, certain traditional forms of wire attack do
   not apply -- an end-to-end handshake must have occurred to establish
   the MPA Stream.  So, the only form of spoofing that applies is one
   in which a remote node can both send and receive packets.  Yet even
   with this limitation, the Stream is still exposed to the following
   spoofing attacks.

9.1.1.1.  Impersonation

   A network-based attacker can impersonate a legal MPA/DDP/RDMAP peer
   (by spoofing a legal IP address) and establish an MPA/DDP/RDMAP
   Stream with the victim.  End-to-end authentication (i.e., IPsec or
   ULP authentication) provides protection against this attack.

9.1.1.2.  Stream Hijacking

   Stream hijacking happens when a network-based attacker follows the
   Stream establishment phase, and waits until the authentication phase
   (if such a phase exists) is completed successfully.  The attacker
   can then spoof the IP address and redirect the Stream from the
   victim to the attacker's own machine.  For example, an attacker can
   wait until an iSCSI authentication is completed successfully, and
   then hijack the iSCSI Stream.

   The best protection against this form of attack is end-to-end
   integrity protection and authentication, such as IPsec, to prevent
   spoofing.  Another option is to provide physical security.
   Discussion of physical security is out of scope for this document.

9.1.1.3.  Man-in-the-Middle Attack

   If a network-based attacker has the ability to delete, inject,
   replay, or modify packets that will still be accepted by MPA (e.g.,
   TCP sequence number is correct, FPDU is valid, etc.), then the Stream
   can be exposed to a man-in-the-middle attack.  The attacker could
   potentially use the services of [DDP] and [RDMAP] to read the
   contents of the associated Data Buffer, to modify the contents of the
   associated Data Buffer, or to disable further access to the buffer.
   Other attacks on the connection setup sequence and even on TCP can be
   used to cause denial of service.  The only countermeasure for this
   form of attack is either to secure the MPA/DDP/RDMAP Stream (i.e.,
   integrity protect it) or to attempt to provide physical security to
   prevent man-in-the-middle type attacks.

   The best protection against this form of attack is end-to-end
   integrity protection and authentication, such as IPsec, to prevent
   spoofing or tampering.  If Stream or session level authentication and
   integrity protection are not used, then a man-in-the-middle attack
   can occur, enabling spoofing and tampering.

   Another approach is to restrict access to only the local subnet/link
   and provide some mechanism to limit access, such as physical security
   or IEEE 802.1X.  This model is an extremely limited deployment
   scenario and will not be further examined here.

9.1.2.  Eavesdropping

   Generally speaking, Stream confidentiality protects against
   eavesdropping.  Stream and/or session authentication and integrity
   protection are a countermeasure against various spoofing and
   tampering attacks.  The effectiveness of authentication and integrity
   protection against a specific attack depends on whether the
   authentication is machine-level authentication (such as that
   provided by IPsec) or ULP authentication.

9.2.  Introduction to Security Options

   The following security services can be applied to an MPA/DDP/RDMAP
   Stream:

   1.  Session confidentiality - protects against eavesdropping.

   2.  Per-packet data source authentication - protects against the
       following spoofing attacks: network-based impersonation, Stream
       hijacking, and man-in-the-middle.

   3.  Per-packet integrity - protects against tampering done by
       network-based modification of FPDUs (indirectly affecting buffer
       content through DDP services).

   4.  Packet sequencing - protects against replay attacks, which are a
       special case of the above tampering attack.

   If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks,
   or Stream hijacking attacks, it is recommended that the Stream be
   authenticated, integrity protected, and protected from replay
   attacks.  It may use confidentiality protection to protect from
   eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public
   network).

   IPsec is capable of providing the above security services for IP and
   TCP traffic.

   ULPs may be able to provide part of the above security
   services.  See [NFSv4CHAN] for additional information on a promising
   approach called "channel binding".  From [NFSv4CHAN]:

       "The concept of channel bindings allows applications to prove
       that the end-points of two secure channels at different network
       layers are the same by binding authentication at one channel to
       the session protection at the other channel.  The use of channel
       bindings allows applications to delegate session protection to
       lower layers, which may significantly improve performance for
       some applications."

9.3.  Using IPsec with MPA

   IPsec can be used to protect against the packet injection attacks
   outlined above.  Because IPsec is designed to secure individual IP
   packets, MPA can run above IPsec without change.  IPsec packets are
   processed (e.g., integrity checked and decrypted) in the order they
   are received, and an MPA receiver will process the decrypted FPDUs
   contained in these packets in the same manner as FPDUs contained in
   unsecured IP packets.

   MPA implementations MUST implement IPsec as described in Section 9.4
   below.  The use of IPsec is up to ULPs and administrators.

9.4.  Requirements for IPsec Encapsulation of MPA/DDP

   The IP Storage working group has spent significant time and effort to
   define the normative IPsec requirements for IP storage [RFC3723].
   Portions of that specification are applicable to a wide variety of
   protocols, including the RDDP protocol suite.  In order not to
   replicate this effort, an MPA on TCP implementation MUST follow the
   requirements defined in RFC 3723, Sections 2.3 and 5, including the
   associated normative references for those sections.

   Additionally, since IPsec acceleration hardware may only be able to
   handle a limited number of active Internet Key Exchange Protocol
   (IKE) Phase 2 security associations (SAs), Phase 2 delete messages
   MAY be sent for idle SAs, as a means of keeping the number of active
   Phase 2 SAs to a minimum.  The receipt of an IKE Phase 2 delete
   message MUST NOT be interpreted as a reason for tearing down a
   DDP/RDMA Stream.  Rather, it is preferable to leave the Stream up,
   and if additional traffic is sent on it, to bring up another IKE
   Phase 2 SA to protect it.  This avoids the potential for continually
   bringing Streams up and down.
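
   The intended behavior on receipt of an IKE Phase 2 delete message
   can be sketched as follows.  This is a non-normative Python model;
   the class ProtectedStream and its members are hypothetical, and SA
   negotiation is reduced to a counter.

```python
class ProtectedStream:
    """Toy model of a DDP/RDMA Stream protected by an IPsec SA."""

    def __init__(self):
        self.stream_up = True
        self.sa_active = True
        self.sa_negotiations = 0   # Phase 2 SAs brought up after a delete

    def on_phase2_delete(self) -> None:
        # The SA goes away, but the Stream is deliberately left up.
        self.sa_active = False

    def send(self, data: bytes) -> int:
        if not self.stream_up:
            raise RuntimeError("Stream is down")
        if not self.sa_active:
            # New traffic on an idle Stream: bring up another Phase 2
            # SA to protect it, rather than re-creating the Stream.
            self.sa_negotiations += 1
            self.sa_active = True
        return len(data)
```

   In this model, a delete followed by new traffic costs one SA
   negotiation and no Stream teardown.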

   The IPsec requirements for RDDP are based on the version of IPsec
   specified in RFC 2401 [RFC2401] and related RFCs, as profiled by RFC
   3723 [RFC3723], despite the existence of a newer version of IPsec
   specified in RFC 4301 [RFC4301] and related RFCs.  One of the
   important early applications of the RDDP protocols is their use with
   iSCSI [iSER]; RDDP's IPsec requirements follow those of iSCSI in
   order to facilitate that usage by allowing a common profile of IPsec
   to be used with iSCSI and the RDDP protocols.  In the future, RFC
   3723 may be updated to the newer version of IPsec; the IPsec security
   requirements of any such update should apply uniformly to iSCSI and
   the RDDP protocols.

   Note that there are serious security issues if IPsec is not
   implemented end-to-end.  For example, if IPsec is implemented as a
   tunnel in the middle of the network, any hosts between the peer and
   the IPsec tunneling device can freely attack the unprotected Stream.

10.  IANA Considerations

   No IANA actions are required by this document.

   If a well-known port is chosen as the mechanism to identify a DDP on
   MPA on TCP, the well-known port must be registered with IANA.
   Because the use of the port is DDP specific, registration of the port
   with IANA is left to DDP.
