RFC 6824

TCP Extensions for Multipath Operation with Multiple Addresses

Pages: 64
Obsoleted by: 8684

Part 3 of 4 – Pages 35 to 56

3.4.  Address Knowledge Exchange (Path Management)

   We use the term "path management" to refer to the exchange of
   information about additional paths between hosts, which in this
   design is managed by multiple addresses at hosts.  For more detail on
   the architectural thinking behind this design, see the MPTCP
   Architecture document [2].

   This design makes use of two methods of sharing such information, and
   both can be used on a connection.  The first is the direct setup of
   new subflows, already described in Section 3.2, where the initiator
   has an additional address.  The second method, described in the
   following subsections, signals addresses explicitly to the other host
   to allow it to initiate new subflows.  The two mechanisms are
   complementary: the first is implicit and simple, while the second is
   explicit and more complex, but more robust.  Together, the mechanisms
   allow addresses to change in flight (and thus support operation
   through NATs, since the source address need not be known), and also
   allow the signaling of previously unknown addresses, and of addresses
   belonging to other address families (e.g., both IPv4 and IPv6).

   Here is an example of typical operation of the protocol:

   o  An MPTCP connection is initially set up between address/port A1 of
      Host A and address/port B1 of Host B.  If Host A is multihomed and
      multiaddressed, it can start an additional subflow from its
      address A2 to B1, by sending a SYN with a Join option from A2 to
      B1, using B's previously declared token for this connection.
      Alternatively, if B is multihomed, it can try to set up a new
      subflow from B2 to A1, using A's previously declared token.  In
      either case, the SYN will be sent to the port already in use for
      the original subflow on the receiving host.

   o  Simultaneously (or after a timeout), an ADD_ADDR option
      (Section 3.4.1) is sent on an existing subflow, informing the
      receiver of the sender's alternative address(es).  The recipient
      can use this information to open a new subflow to the sender's
      additional address.  In our example, A will send an ADD_ADDR option
      informing B of address/port A2.  The mix of using the SYN-based
      option and the ADD_ADDR option, including timeouts, is
      implementation specific and can be tailored to agree with local
      policy.

   o  If subflow A2-B1 is successfully set up, Host B can use the
      Address ID in the Join option to correlate this with the ADD_ADDR
      option that will also arrive on an existing subflow; now B knows
      not to open A2-B1, ignoring the ADD_ADDR.  Otherwise, if B has not
      received the A2-B1 MP_JOIN SYN but received the ADD_ADDR, it can
      try to initiate a new subflow from one or more of its addresses to
      address A2.  This permits new sessions to be opened if one host is
      behind a NAT.

   Other ways of using the two signaling mechanisms are possible; for
   instance, signaling addresses in other address families can only be
   done explicitly using the Add Address option.

3.4.1.  Address Advertisement

   The Add Address (ADD_ADDR) TCP option announces additional addresses
   (and optionally, ports) on which a host can be reached (Figure 12).
   Multiple instances of this TCP option can be added in a single
   message if there is sufficient TCP option space; otherwise, multiple
   TCP messages containing this option will be sent.  This option can be
   used at any time during a connection, depending on when the sender
   wishes to enable multiple paths and/or when paths become available.
   As with all MPTCP signals, the receiver MUST undertake standard TCP
   validity checks before acting upon it.

   Every address has an Address ID that can be used for uniquely
   identifying the address within a connection for address removal.
   This is also used to identify MP_JOIN options (see Section 3.2)
   relating to the same address, even when address translators are in
   use.  The Address ID MUST uniquely identify the address to the sender
   (within the scope of the connection), but the mechanism for
   allocating such IDs is implementation specific.

   All address IDs learned via either MP_JOIN or ADD_ADDR SHOULD be
   stored by the receiver in a data structure that gathers all the
   Address ID to address mappings for a connection (identified by a
   token pair).  In this way, there is a stored mapping between Address
   ID, observed source address, and token pair for future processing of
   control information for a connection.  Note that an implementation
   MAY discard incoming address advertisements at will, for example, for
   avoiding the required mapping state, or because advertised addresses
   are of no use to it (for example, IPv6 addresses when it has IPv4
   only).  Therefore, a host MUST treat address advertisements as soft
   state, and it MAY choose to refresh advertisements periodically.
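
   As a non-normative illustration, the Address ID mapping described
   above could be held in a small per-connection table.  The sketch
   below is in Python; the names AddressTable, learn, lookup, and
   max_entries are hypothetical and do not come from this
   specification.  Because advertisements are soft state, the table is
   free to refuse an entry.

```python
from typing import Dict, Optional, Tuple

# (address, port); port is None when the advertisement carried no port.
Endpoint = Tuple[str, Optional[int]]


class AddressTable:
    """Soft-state map of a peer's Address IDs for one MPTCP connection."""

    def __init__(self, token_pair: Tuple[int, int], max_entries: int = 8):
        self.token_pair = token_pair      # identifies the connection
        self.max_entries = max_entries    # hypothetical local policy limit
        self.entries: Dict[int, Endpoint] = {}

    def learn(self, address_id: int, address: str,
              port: Optional[int] = None) -> bool:
        """Store a mapping seen in MP_JOIN or ADD_ADDR.

        Returns False when the advertisement is discarded to bound state,
        which the text permits.  Rules for reusing an existing Address ID
        are not modeled here.
        """
        is_new = address_id not in self.entries
        if is_new and len(self.entries) >= self.max_entries:
            return False
        self.entries[address_id] = (address, port)
        return True

    def lookup(self, address_id: int) -> Optional[Endpoint]:
        """Return the stored endpoint, or None if the ID is unknown."""
        return self.entries.get(address_id)
```

   A caller would create one table per connection and consult lookup()
   when later control information refers to an Address ID.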

   This option is shown in Figure 12.  The illustration is sized for
   IPv4 addresses (IPVer = 4).  For IPv6, the IPVer field will read 6,
   and the length of the address will be 16 octets (instead of 4).

   The final 2 octets, which specify the TCP port number to use, are
   optional; their presence can be inferred from the option length.
   Although it is expected that the majority of use cases will use the
   same port pairs as used for the initial subflow (e.g., port 80
   remains port 80 on all subflows, as does the ephemeral port at the
   client), there may be cases (such as port-based load balancing) where
   the explicit specification of a different port is required.  If no
   port is specified, MPTCP SHOULD attempt to connect to the specified
   address on the same port as is already in use by the subflow on which
   the ADD_ADDR signal was sent; this is discussed in more detail in
   Section 3.8.

                           1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +---------------+---------------+-------+-------+---------------+
      |     Kind      |     Length    |Subtype| IPVer |  Address ID   |
      +---------------+---------------+-------+-------+---------------+
      |          Address (IPv4 - 4 octets / IPv6 - 16 octets)         |
      +-------------------------------+-------------------------------+
      |   Port (2 octets, optional)   |
      +-------------------------------+

                 Figure 12: Add Address (ADD_ADDR) Option
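
   The layout in Figure 12 can be exercised with a short, non-normative
   Python sketch.  The function names are hypothetical.  The Kind value
   (30) and the ADD_ADDR subtype (0x3) are taken from the IANA
   considerations of this RFC rather than from this section, and the
   sketch assumes the option is handled as a standalone byte string.

```python
import ipaddress
import struct
from typing import Optional, Tuple

MPTCP_KIND = 30          # MPTCP option kind, from the IANA section of this RFC
SUBTYPE_ADD_ADDR = 0x3   # ADD_ADDR subtype, from the same section


def encode_add_addr(address_id: int, address: str,
                    port: Optional[int] = None) -> bytes:
    """Build an ADD_ADDR option; the port is appended only when given."""
    ip = ipaddress.ip_address(address)
    body = ip.packed
    if port is not None:
        body += struct.pack("!H", port)
    length = 4 + len(body)   # Kind, Length, Subtype/IPVer, Address ID
    subtype_ipver = (SUBTYPE_ADD_ADDR << 4) | ip.version
    return struct.pack("!BBBB", MPTCP_KIND, length,
                       subtype_ipver, address_id) + body


def decode_add_addr(option: bytes) -> Tuple[int, str, Optional[int]]:
    """Parse an ADD_ADDR option into (address_id, address, port or None)."""
    if len(option) < 4 or option[0] != MPTCP_KIND or option[1] != len(option):
        raise ValueError("not a well-formed MPTCP option")
    if option[2] >> 4 != SUBTYPE_ADD_ADDR:
        raise ValueError("not an ADD_ADDR option")
    ipver, address_id = option[2] & 0x0F, option[3]
    if ipver == 4:
        addr_len, addr_type = 4, ipaddress.IPv4Address
    elif ipver == 6:
        addr_len, addr_type = 16, ipaddress.IPv6Address
    else:
        raise ValueError("unknown IPVer")
    rest = option[4 + addr_len:]
    # The port is present only if exactly 2 octets follow the address.
    if len(option) < 4 + addr_len or len(rest) not in (0, 2):
        raise ValueError("length does not match IPVer")
    address = str(addr_type(option[4:4 + addr_len]))
    port = struct.unpack("!H", rest)[0] if rest else None
    return address_id, address, port
```

   With this layout, an IPv4 advertisement is 8 octets without a port
   and 10 with one; the IPv6 equivalents are 20 and 22 octets.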

   Due to the proliferation of NATs, it is reasonably likely that one
   host may attempt to advertise private addresses [18].  It is not
   desirable to prohibit this, since there may be cases where both hosts
   have additional interfaces on the same private network, and a host
   MAY want to advertise such addresses.  The MP_JOIN handshake to
   create a new subflow (Section 3.2) provides mechanisms to minimize
   security risks.  The MP_JOIN message contains a 32-bit token that
   uniquely identifies the connection to the receiving host.  If the
   token is unknown, the host will return with a RST.  In the unlikely
   event that the token is known, subflow setup will continue, but the
   HMAC exchange must occur for authentication.  This will fail, and
   will provide sufficient protection against two unconnected hosts
   accidentally setting up a new subflow upon the signal of a private
   address.  Further security considerations around the issue of
   ADD_ADDR messages that accidentally misdirect, or maliciously direct,
   new MP_JOIN attempts are discussed in Section 5.

   Ideally, ADD_ADDR and REMOVE_ADDR options would be sent reliably, and
   in order, to the other end.  This would ensure that address
   management does not unnecessarily cause an outage in the connection,
   as could happen if remove and add requests were processed in reverse
   order, and that all possible paths are used.  Note, however, that
   losing reliability and ordering will not break the multipath
   connection; it will just reduce the opportunity to open additional
   paths and to survive different patterns of path failures.

   Therefore, implementing reliability signals for these TCP options is
   not necessary.  In order to minimize the impact of the loss of these
   options, however, it is RECOMMENDED that a sender send these
   options on all available subflows.  If these options need to be
   received in order, an implementation SHOULD only send one ADD_ADDR/
   REMOVE_ADDR option per RTT, to minimize the risk of misordering.

   A host can send an ADD_ADDR message with an already assigned Address
   ID, but the Address MUST be the same as previously assigned to this
   Address ID, and the Port MUST be different from one already in use
   for this Address ID.  If these conditions are not met, the receiver
   SHOULD silently ignore the ADD_ADDR.  A host wishing to replace an
   existing Address ID MUST first remove the existing one
   (Section 3.4.2).
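
   The reuse rule above can be sketched as a receiver-side check.  This
   is a non-normative Python illustration; the function name
   accept_add_addr and the table shape are hypothetical, and
   representing "no port given" as None is an assumption of the sketch.
   Refreshing a previously advertised address/port pair is not modeled.

```python
from typing import Dict, Optional, Set, Tuple

# address_id -> (address, ports already advertised for that ID)
Known = Dict[int, Tuple[str, Set[Optional[int]]]]


def accept_add_addr(known: Known, address_id: int, address: str,
                    port: Optional[int]) -> bool:
    """Return True if the ADD_ADDR is accepted, False if it is ignored."""
    entry = known.get(address_id)
    if entry is None:
        # First use of this Address ID: record it.
        known[address_id] = (address, {port})
        return True
    known_address, ports = entry
    # Reuse is valid only for the same address with a port not yet in use.
    if address != known_address or port in ports:
        return False      # silently ignore
    ports.add(port)
    return True
```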

   A host that receives an ADD_ADDR but finds that a connection setup to
   that IP address and port number is unsuccessful SHOULD NOT perform
   further connection attempts to this address/port combination for this
   connection.  A sender that wants to trigger a new incoming connection
   attempt on a previously advertised address/port combination can
   therefore refresh ADD_ADDR information by sending the option again.

   During normal MPTCP operation, it is unlikely that there will be
   sufficient TCP option space for ADD_ADDR to be included along with
   those for data sequence numbering (Section 3.3.1).  Therefore, it is
   expected that an MPTCP implementation will send the ADD_ADDR option
   on separate ACKs.  As discussed earlier, however, an MPTCP
   implementation MUST NOT treat duplicate ACKs with any MPTCP option,
   with the exception of the DSS option, as indications of congestion
   [12], and an MPTCP implementation SHOULD NOT send more than two
   duplicate ACKs in a row for signaling purposes.

3.4.2.  Remove Address

   If, during the lifetime of an MPTCP connection, a previously
   announced address becomes invalid (e.g., if the interface
   disappears), the affected host SHOULD announce this so that the peer
   can remove subflows related to this address.

   This is achieved through the Remove Address (REMOVE_ADDR) option
   (Figure 13), which will remove a previously added address (or list of
   addresses) from a connection and terminate any subflows currently
   using that address.

   For security purposes, if a host receives a REMOVE_ADDR option, it
   must ensure the affected path(s) are no longer in use before it
   instigates closure.  The receipt of REMOVE_ADDR SHOULD first trigger
   the sending of a TCP keepalive [19] on the path, and if a response is
   received the path SHOULD NOT be removed.  Typical TCP validity tests
   on the subflow (e.g., ensuring sequence and ACK numbers are correct)
   MUST also be undertaken.  An implementation can use indications of
   these test failures as part of intrusion detection or error logging.

   The sending and receipt (if no keepalive response was received) of
   this message SHOULD trigger the sending of RSTs by both hosts on the
   affected subflow(s) (if possible), as a courtesy to allow middlebox
   state to be cleaned up, before cleaning up any local state.

   Address removal is undertaken by ID, so as to permit the use of NATs
   and other middleboxes that rewrite source addresses.  If there is no
   address at the requested ID, the receiver will silently ignore the
   request.
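
   The receiver-side handling described above might be sketched as
   follows.  This Python fragment is non-normative: the function name,
   the subflow labels, and the keepalive_answered callback are
   hypothetical stand-ins for whatever an implementation uses to probe
   a path, and the standard TCP validity tests are assumed to have
   passed already.

```python
from typing import Callable, Dict, List


def handle_remove_addr(address_ids: List[int],
                       subflows_by_address_id: Dict[int, List[str]],
                       keepalive_answered: Callable[[str], bool]) -> List[str]:
    """Return the subflows that should be reset after a REMOVE_ADDR."""
    to_reset = []
    for address_id in address_ids:
        # An unknown Address ID yields no subflows: the request is ignored.
        for subflow in subflows_by_address_id.get(address_id, []):
            # A path that still answers a keepalive SHOULD NOT be removed.
            if not keepalive_answered(subflow):
                to_reset.append(subflow)
    return to_reset
```

   The caller would then send a RST on each returned subflow, where
   possible, before releasing local state.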

   A subflow that is still functioning MUST be closed with a FIN
   exchange as in regular TCP, rather than using this option.  For more
   information, see Section 3.3.3.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+-------+-------+---------------+
   |     Kind      |  Length = 3+n |Subtype|(resvd)|   Address ID  | ...
   +---------------+---------------+-------+-------+---------------+
                              (followed by n-1 Address IDs, if required)

              Figure 13: Remove Address (REMOVE_ADDR) Option
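
   A non-normative Python sketch of the Figure 13 layout follows.  The
   function names are hypothetical; the Kind value (30) and the
   REMOVE_ADDR subtype (0x4) come from the IANA considerations of this
   RFC, not from this section.

```python
import struct
from typing import List, Sequence

MPTCP_KIND = 30             # MPTCP option kind, from the IANA section
SUBTYPE_REMOVE_ADDR = 0x4   # REMOVE_ADDR subtype, from the same section


def encode_remove_addr(address_ids: Sequence[int]) -> bytes:
    """Build a REMOVE_ADDR option carrying n Address IDs (Length = 3+n)."""
    if not address_ids:
        raise ValueError("at least one Address ID is required")
    header = struct.pack("!BBB", MPTCP_KIND, 3 + len(address_ids),
                         SUBTYPE_REMOVE_ADDR << 4)   # low 4 bits reserved
    return header + bytes(address_ids)


def decode_remove_addr(option: bytes) -> List[int]:
    """Return the list of Address IDs carried in a REMOVE_ADDR option."""
    if len(option) < 4 or option[0] != MPTCP_KIND or option[1] != len(option):
        raise ValueError("not a well-formed MPTCP option")
    if option[2] >> 4 != SUBTYPE_REMOVE_ADDR:
        raise ValueError("not a REMOVE_ADDR option")
    return list(option[3:])
```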

3.5.  Fast Close

   Regular TCP has the means of sending a reset (RST) signal to abruptly
   close a connection.  With MPTCP, the RST only has the scope of the
   subflow and will only close the concerned subflow but not affect the
   remaining subflows.  MPTCP's connection will stay alive at the data
   level, in order to permit break-before-make handover between
   subflows.  It is therefore necessary to provide an MPTCP-level
   "reset" to allow the abrupt closure of the whole MPTCP connection,
   and this is the MP_FASTCLOSE option.

   MP_FASTCLOSE is used to indicate to the peer that the connection will
   be abruptly closed and no data will be accepted anymore.  The reasons
   for triggering an MP_FASTCLOSE are implementation specific.  Regular
   TCP does not allow sending a RST while the connection is in a
   synchronized state [1].  Nevertheless, implementations allow the
   sending of a RST in this state, if, for example, the operating system
   is running out of resources.  In these cases, MPTCP should send the
   MP_FASTCLOSE.  This option is illustrated in Figure 14.

                            1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +---------------+---------------+-------+-----------------------+
       |     Kind      |    Length     |Subtype|      (reserved)       |
       +---------------+---------------+-------+-----------------------+
       |                      Option Receiver's Key                    |
       |                            (64 bits)                          |
       |                                                               |
       +---------------------------------------------------------------+

                Figure 14: Fast Close (MP_FASTCLOSE) Option

   If Host A wants to force the closure of an MPTCP connection, the
   MPTCP Fast Close procedure is as follows:

   o  Host A sends an ACK containing the MP_FASTCLOSE option on one
      subflow, containing the key of Host B as declared in the initial
      connection handshake.  On all the other subflows, Host A sends a
      regular TCP RST to close these subflows, and tears them down.
      Host A now enters FASTCLOSE_WAIT state.

   o  Upon receipt of an MP_FASTCLOSE, containing the valid key, Host B
      answers on the same subflow with a TCP RST and tears down all
      subflows.  Host B can now close the whole MPTCP connection (it
      transitions directly to CLOSED state).

   o  As soon as Host A has received the TCP RST on the remaining
      subflow, it can close this subflow and tear down the whole
      connection (transition from FASTCLOSE_WAIT to CLOSED states).  If
      Host A receives an MP_FASTCLOSE instead of a TCP RST, both hosts
      attempted fast closure simultaneously.  Host A should reply with a
      TCP RST and tear down the connection.

   o  If Host A does not receive a TCP RST in reply to its MP_FASTCLOSE
      after one retransmission timeout (RTO) (the RTO of the subflow
      where the MP_FASTCLOSE has been sent), it SHOULD retransmit the
      MP_FASTCLOSE.  The number of retransmissions SHOULD be limited to
      prevent this connection from being retained for a long time, but
      this limit is implementation specific.  A RECOMMENDED number is 3.
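
   The initiator's side of the procedure above can be sketched as a
   small state machine.  This Python fragment is non-normative: the
   class, method, and event names are hypothetical, segments are
   modeled as plain strings, and closing the connection once the
   retransmission limit is reached is an assumption of the sketch (the
   text only says the number of retransmissions should be limited).

```python
from enum import Enum
from typing import List, Optional, Tuple


class State(Enum):
    ESTABLISHED = "ESTABLISHED"
    FASTCLOSE_WAIT = "FASTCLOSE_WAIT"
    CLOSED = "CLOSED"


class FastCloseInitiator:
    MAX_RETRANSMISSIONS = 3   # the RECOMMENDED number from the text

    def __init__(self) -> None:
        self.state = State.ESTABLISHED
        self.retransmissions = 0
        self.remaining: Optional[str] = None
        self.sent: List[Tuple[str, str]] = []   # (segment type, subflow)

    def start(self, subflows: List[str]) -> None:
        """Send MP_FASTCLOSE on one subflow and a RST on all others."""
        first, *others = subflows
        self.sent.append(("MP_FASTCLOSE", first))
        self.sent.extend(("RST", s) for s in others)
        self.remaining = first
        self.state = State.FASTCLOSE_WAIT

    def on_segment(self, segment_type: str) -> None:
        """Handle a RST or MP_FASTCLOSE arriving on the remaining subflow."""
        if self.state is not State.FASTCLOSE_WAIT:
            return
        if segment_type == "MP_FASTCLOSE":
            # Both hosts attempted fast closure simultaneously: reply with RST.
            self.sent.append(("RST", self.remaining))
        if segment_type in ("RST", "MP_FASTCLOSE"):
            self.state = State.CLOSED

    def on_rto(self) -> None:
        """Retransmit MP_FASTCLOSE after an RTO, up to the limit."""
        if self.state is not State.FASTCLOSE_WAIT:
            return
        if self.retransmissions < self.MAX_RETRANSMISSIONS:
            self.retransmissions += 1
            self.sent.append(("MP_FASTCLOSE", self.remaining))
        else:
            self.state = State.CLOSED   # assumption: give up at the limit
```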

3.6.  Fallback

   Sometimes, middleboxes will exist on a path that could prevent the
   operation of MPTCP.  MPTCP has been designed in order to cope with
   many middlebox modifications (see Section 6), but there are still
   some cases where a subflow could fail to operate within the MPTCP
   requirements.  These cases are notably the following: the loss of TCP
   options on a path and the modification of payload data.  If such an
   event occurs, it is necessary to "fall back" to the previous, safe
   operation.  This may be either falling back to regular TCP or
   removing a problematic subflow.

   At the start of an MPTCP connection (i.e., the first subflow), it is
   important to ensure that the path is fully MPTCP capable and the
   necessary TCP options can reach each host.  The handshake as
   described in Section 3.1 SHOULD fall back to regular TCP if either of
   the SYN messages lacks the MPTCP options: this is the same, and
   desired, behavior in the case where a host is not MPTCP capable, or
   the path does not support the MPTCP options.  When attempting to join
   an existing MPTCP connection (Section 3.2), if a path is not MPTCP
   capable and the TCP options do not get through on the SYNs, the
   subflow will be closed according to the MP_JOIN logic.

   There is, however, another corner case that should be addressed:
   MPTCP options getting through on the SYN, but not on regular packets.
   This can be resolved if the subflow is the first subflow, and thus
   all data in flight is contiguous, using the following rules.

   A sender MUST include a DSS option with data sequence mapping in
   every segment until one of the sent segments has been acknowledged
   with a DSS option containing a Data ACK.  Upon reception of the
   acknowledgment, the sender has the confirmation that the DSS option
   passes in both directions and may choose to send fewer DSS options
   than once per segment.

   If, however, an ACK is received for data (not just for the SYN)
   without a DSS option containing a Data ACK, the sender determines the
   path is not MPTCP capable.  In the case of this occurring on an
   additional subflow (i.e., one started with MP_JOIN), the host MUST
   close the subflow with a RST.  In the case of the first subflow
   (i.e., that started with MP_CAPABLE), it MUST drop out of an MPTCP
   mode back to regular TCP.  The sender will send one final data
   sequence mapping, with the Data-Level Length value of 0 indicating an
   infinite mapping (in case the path drops options in one direction
   only), and then revert to sending data on the single subflow without
   any MPTCP options.
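
   The sender's decision on the first ACK that covers data can be
   summarized in a short, non-normative Python sketch.  The names
   Action and on_first_data_ack are hypothetical, and the two boolean
   inputs stand in for state an implementation would track itself.

```python
from enum import Enum


class Action(Enum):
    CONTINUE_MPTCP = "path verified; DSS may be sent less often"
    RESET_SUBFLOW = "close the subflow with a RST"
    FALL_BACK = "send one infinite mapping, then regular TCP"


def on_first_data_ack(has_dss_data_ack: bool,
                      started_with_mp_join: bool) -> Action:
    """Decide what the sender does when data is first acknowledged."""
    if has_dss_data_ack:
        # DSS passes in both directions.
        return Action.CONTINUE_MPTCP
    if started_with_mp_join:
        # Additional subflow on a path that is not MPTCP capable.
        return Action.RESET_SUBFLOW
    # First subflow (MP_CAPABLE): drop back to regular TCP.
    return Action.FALL_BACK
```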

   Note that this rule essentially prohibits the sending of data on the
   third packet of an MP_CAPABLE or MP_JOIN handshake, since both that
   option and a DSS cannot fit in TCP option space.  If the initiator is
   to send first, another segment must be sent that contains the data
   and DSS.  Note also that an additional subflow cannot be used until
   the initial path has been verified as MPTCP capable.

   These rules should cover all cases where such a failure could happen:
   whether it's on the forward or reverse path and whether the server or
   the client first sends data.  If lost options on data packets occur
   on any other subflow apart from the initial subflow, it should be
   treated as a standard path failure.  The data would not be DATA_ACKed
   (since there is no mapping for the data), and the subflow can be
   closed with a RST.

   The case described above is a specialized case of fallback, for when
   the lack of MPTCP support is detected before any data is acknowledged
   at the connection level on a subflow.  More generally, fallback
   (either closing a subflow, or to regular TCP) can become necessary at
   any point during a connection if a non-MPTCP-aware middlebox changes
   the data stream.

   As described in Section 3.3, each portion of data for which there is
   a mapping is protected by a checksum.  This mechanism is used to
   detect if middleboxes have made any adjustments to the payload
   (added, removed, or changed data).  A checksum will fail if the data
   has been changed in any way.  This will also detect if the length of
   data on the subflow is increased or decreased, and this means the
   data sequence mapping is no longer valid.  The sender no longer knows
   what subflow-level sequence number the receiver is genuinely
   operating at (the middlebox will be faking ACKs in return), and it
   cannot signal any further mappings.  Furthermore, in addition to the
   possibility of payload modifications that are valid at the
   application layer, there is the possibility that false positives
   could be hit across MPTCP segment boundaries, corrupting the data.
   Therefore, all data from the start of the segment that failed the
   checksum onwards is not trustworthy.

   When multiple subflows are in use, the data in flight on a subflow
   will likely involve data that is not contiguously part of the
   connection-level stream, since segments will be spread across the
   multiple subflows.  Due to the problems identified above, it is not
   possible to determine what the adjustment has done to the data
   (notably, any changes to the subflow sequence numbering).  Therefore,
   it is not possible to recover the subflow, and the affected subflow
   must be immediately closed with a RST, featuring an MP_FAIL option
   (Figure 15), which defines the data sequence number at the start of
   the segment (defined by the data sequence mapping) that had the
   checksum failure.  Note that the MP_FAIL option requires the use of
   the full 64-bit sequence number, even if 32-bit sequence numbers are
   normally in use in the DSS signals on the path.

                           1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +---------------+---------------+-------+----------------------+
      |     Kind      |   Length=12   |Subtype|      (reserved)      |
      +---------------+---------------+-------+----------------------+
      |                                                              |
      |                 Data Sequence Number (8 octets)              |
      |                                                              |
      +--------------------------------------------------------------+

                   Figure 15: Fallback (MP_FAIL) Option
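
   A non-normative Python sketch of the Figure 15 layout follows.  The
   function names are hypothetical; the Kind value (30) and the MP_FAIL
   subtype (0x6) come from the IANA considerations of this RFC.  The
   caller is assumed to supply the full 64-bit data sequence number.

```python
import struct

MPTCP_KIND = 30         # MPTCP option kind, from the IANA section
SUBTYPE_MP_FAIL = 0x6   # MP_FAIL subtype, from the same section
MP_FAIL_LENGTH = 12     # fixed length shown in Figure 15


def encode_mp_fail(data_sequence_number: int) -> bytes:
    """Build an MP_FAIL option: 4-bit subtype, 12 reserved bits, 64-bit DSN."""
    return struct.pack("!BBHQ", MPTCP_KIND, MP_FAIL_LENGTH,
                       SUBTYPE_MP_FAIL << 12, data_sequence_number)


def decode_mp_fail(option: bytes) -> int:
    """Return the 64-bit data sequence number carried in an MP_FAIL option."""
    if len(option) != MP_FAIL_LENGTH:
        raise ValueError("MP_FAIL must be 12 octets")
    kind, length, subtype_reserved, dsn = struct.unpack("!BBHQ", option)
    if (kind != MPTCP_KIND or length != MP_FAIL_LENGTH
            or subtype_reserved >> 12 != SUBTYPE_MP_FAIL):
        raise ValueError("not an MP_FAIL option")
    return dsn
```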

   The receiver MUST discard all data following the data sequence number
   specified.  Failed data MUST NOT be DATA_ACKed and so will be
   retransmitted on other subflows (Section 3.3.6).

   A special case is when there is a single subflow and it fails with a
   checksum error.  If it is known that all unacknowledged data in
   flight is contiguous (which will usually be the case with a single
   subflow), an infinite mapping can be applied to the subflow without
   the need to close it first, and essentially turn off all further
   MPTCP signaling.  In this case, if a receiver identifies a checksum
   failure when there is only one path, it will send back an MP_FAIL
   option on the subflow-level ACK, referring to the data-level sequence
   number of the start of the segment on which the checksum error was
   detected.  The sender will receive this, and if all unacknowledged
   data in flight is contiguous, will signal an infinite mapping.  This
   infinite mapping will be a DSS option (Section 3.3) on the first new
   packet, containing a data sequence mapping that acts retroactively,
   referring to the start of the subflow sequence number of the last
   segment that was known to be delivered intact.  From that point
   onwards, data can be altered by a middlebox without affecting MPTCP,
   as the data stream is equivalent to a regular, legacy TCP session.

   In the rare case that the data is not contiguous (which could happen
   when there is only one subflow but it is retransmitting data from a
   subflow that has recently been uncleanly closed), the receiver MUST
   close the subflow with a RST with MP_FAIL.  The receiver MUST discard
   all data that follows the data sequence number specified.  The sender
   MAY attempt to create a new subflow belonging to the same connection,
   and, if it chooses to do so, SHOULD place the single subflow
   immediately in single-path mode by setting an infinite data sequence
   mapping.  This mapping will begin from the data-level sequence number
   that was declared in the MP_FAIL.
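
   The receiver's choice between the cases above can be condensed into
   a non-normative Python sketch.  The names are hypothetical, and the
   sketch assumes the receiver can tell whether the data in flight is
   contiguous, which the text treats as known.

```python
from enum import Enum


class Action(Enum):
    RST_WITH_MP_FAIL = "close the subflow with a RST carrying MP_FAIL"
    ACK_WITH_MP_FAIL = "send MP_FAIL on the subflow-level ACK"


def on_checksum_failure(subflow_count: int,
                        in_flight_contiguous: bool) -> Action:
    """Pick the receiver's reaction to a DSS checksum failure."""
    if subflow_count > 1 or not in_flight_contiguous:
        # The subflow cannot be recovered; the data is retransmitted
        # elsewhere or on a new subflow.
        return Action.RST_WITH_MP_FAIL
    # Single subflow with contiguous data: keep it open and let the
    # sender answer with an infinite mapping.
    return Action.ACK_WITH_MP_FAIL
```

   In both outcomes the receiver discards all data following the data
   sequence number reported in the MP_FAIL.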

   After a sender signals an infinite mapping, it MUST only use subflow
   ACKs to clear its send buffer.  This is because Data ACKs may become
   misaligned with the subflow ACKs when middleboxes insert or delete
   data.  The receiver SHOULD stop generating Data ACKs after it
   receives an infinite mapping.

   When a connection has fallen back, only one subflow can send data;
   otherwise, the receiver would not know how to reorder the data.  In
   practice, this means that all MPTCP subflows will have to be
   terminated except one.  Once MPTCP falls back to regular TCP, it MUST
   NOT revert to MPTCP later in the connection.

   It should be emphasized that we are not attempting to prevent the use
   of middleboxes that want to adjust the payload.  An MPTCP-aware
   middlebox could provide such functionality by also rewriting
   checksums.

3.7.  Error Handling

   In addition to the fallback mechanism as described above, the
   standard classes of TCP errors may need to be handled in an MPTCP-
   specific way.  Note that changing semantics -- such as the relevance
   of a RST -- are covered in Section 4.  Where possible, we do not want
   to deviate from regular TCP behavior.

   The following list covers possible errors and the appropriate MPTCP
   behavior:

   o  Unknown token in MP_JOIN (or HMAC failure in MP_JOIN ACK, or
      missing MP_JOIN in SYN/ACK response): send RST (analogous to TCP's
      behavior on an unknown port)

   o  DSN out of window (during normal operation): drop the data, do not
      send Data ACKs

   o  Remove request for unknown address ID: silently ignore
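
   The list above maps naturally onto a lookup table.  The following
   Python fragment is a non-normative sketch; the enumeration and
   function names are hypothetical.

```python
from enum import Enum


class Event(Enum):
    UNKNOWN_TOKEN_IN_MP_JOIN = 1
    HMAC_FAILURE_IN_MP_JOIN_ACK = 2
    MISSING_MP_JOIN_IN_SYN_ACK = 3
    DSN_OUT_OF_WINDOW = 4
    REMOVE_UNKNOWN_ADDRESS_ID = 5


class Response(Enum):
    SEND_RST = "send RST"
    DROP_DATA = "drop the data, do not send Data ACKs"
    IGNORE = "silently ignore"


# One entry per error case in the list above.
RESPONSES = {
    Event.UNKNOWN_TOKEN_IN_MP_JOIN: Response.SEND_RST,
    Event.HMAC_FAILURE_IN_MP_JOIN_ACK: Response.SEND_RST,
    Event.MISSING_MP_JOIN_IN_SYN_ACK: Response.SEND_RST,
    Event.DSN_OUT_OF_WINDOW: Response.DROP_DATA,
    Event.REMOVE_UNKNOWN_ADDRESS_ID: Response.IGNORE,
}


def respond(event: Event) -> Response:
    """Return the MPTCP-specific reaction to an error event."""
    return RESPONSES[event]
```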

3.8.  Heuristics

   There are a number of heuristics that are needed for performance or
   deployment but that are not required for protocol correctness.  In
   this section, we detail such heuristics.  Note that buffering and
   certain sender and receiver window behaviors are discussed in
   Sections 3.3.4 and 3.3.5, and retransmission in Section 3.3.6.

3.8.1.  Port Usage

   Under typical operation, an MPTCP implementation SHOULD use the same
   ports as already in use.  In other words, the destination port of a
   SYN containing an MP_JOIN option SHOULD be the same as the remote
   port of the first subflow in the connection.  The local port for such
   SYNs SHOULD also be the same as for the first subflow (and as such,
   an implementation SHOULD reserve ephemeral ports across all local IP
   addresses), although there may be cases where this is infeasible.
   This strategy is intended to maximize the probability of the SYN
   being permitted by a firewall or NAT at the recipient and to avoid
   confusing any network monitoring software.

   There may also be cases, however, where the passive opener wishes to
   signal to the other host that a specific port should be used, and
   this facility is provided in the Add Address option as documented in
   Section 3.4.1.  It is therefore feasible to allow multiple subflows
   between the same two addresses but using different port pairs, and
   such a facility could be used to allow load balancing within the
   network based on 5-tuples (e.g., some ECMP implementations [7]).

3.8.2.  Delayed Subflow Start

   Many TCP connections are short-lived and consist only of a few
   segments, and so the overheads of using MPTCP outweigh any benefits.
   A heuristic is required, therefore, to decide when to start using
   additional subflows in an MPTCP connection.  We expect that
   experience gathered from deployments will provide further guidance on
   this, and that it will depend on application characteristics
   (which are likely to change over time).  However, a suggested
   general-purpose heuristic that an implementation MAY choose to employ
   is as follows.  Results from experimental deployments are needed in
   order to verify the correctness of this proposal.

   If a host has data buffered for its peer (which implies that the
   application has received a request for data), the host opens one
   subflow for each initial window's worth of data that is buffered.

   Consideration should also be given to limiting the rate of adding new
   subflows, as well as limiting the total number of subflows open for a
   particular connection.  A host may choose to vary these values based
   on its load or knowledge of traffic and path characteristics.
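
   One possible reading of the suggested heuristic, combined with a cap
   on the number of subflows, is sketched below in Python.  It is
   non-normative.  The function name, the parameters, and the choice to
   count only full initial windows and to include the first subflow in
   the total are assumptions of the sketch; the figures in the usage
   note are illustrative values only.

```python
def desired_subflows(buffered_bytes: int, initial_window_bytes: int,
                     max_subflows: int) -> int:
    """Return the total number of subflows the heuristic would aim for."""
    if initial_window_bytes <= 0 or max_subflows < 1:
        raise ValueError("window and limit must be positive")
    # One subflow per full initial window's worth of buffered data.
    wanted = buffered_bytes // initial_window_bytes
    # Never fewer than the existing subflow, never more than local policy.
    return max(1, min(wanted, max_subflows))
```

   For example, with an initial window of 14600 octets and a limit of
   four subflows, 30000 buffered octets would call for two subflows.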

   Note that this heuristic alone is probably insufficient.  Traffic for
   many common applications, such as downloads, is highly asymmetric and
   the host that is multihomed may well be the client that will never
   fill its buffers, and thus never use MPTCP.  Advanced APIs that allow
   an application to signal its traffic requirements would aid in these
   decisions.

   An additional time-based heuristic could be applied, opening
   additional subflows after a given period of time has passed.  This
   would alleviate the above issue, and also provide resilience for low-
   bandwidth but long-lived applications.

   This section has shown some of the considerations that an implementer
   should take into account when developing MPTCP heuristics, but it is
   not intended to be prescriptive.

3.8.3.  Failure Handling

   Requirements for MPTCP's handling of unexpected signals have been
   given in Section 3.7.  There are other failure cases, however, where
   a host can choose appropriate behavior.

   For example, Section 3.1 suggests that a host SHOULD fall back to
   trying regular TCP SYNs after one or more failures of MPTCP SYNs for
   a connection.  A host may keep a system-wide cache of such
   information, so that it can back off from using MPTCP, firstly for
   that particular destination host, and eventually on a whole
   interface, if MPTCP connections continue failing.

   Another failure could occur when the MP_JOIN handshake fails.
   Section 3.7 specifies that an incorrect handshake MUST lead to the
   subflow being closed with a RST.  A host operating an active
   intrusion detection system may choose to start blocking MP_JOIN
   packets from the source host if multiple failed MP_JOIN attempts are
   seen.  From the connection initiator's point of view, if an MP_JOIN
   fails, it SHOULD NOT attempt to connect to the same IP address and
   port during the lifetime of the connection, unless the other host
   refreshes the information with another ADD_ADDR option.  Note that
   the ADD_ADDR option is informational only, and does not guarantee the
   other host will attempt a connection.
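
   The rule for failed MP_JOIN attempts can be tracked with a small
   per-connection set.  The Python fragment below is a non-normative
   sketch; the class and method names are hypothetical.

```python
from typing import Set, Tuple


class JoinTargets:
    """Tracks address/port pairs that must not be retried on a connection."""

    def __init__(self) -> None:
        self.blocked: Set[Tuple[str, int]] = set()

    def on_join_failed(self, address: str, port: int) -> None:
        # After a failed MP_JOIN, stop trying this address/port pair.
        self.blocked.add((address, port))

    def on_add_addr(self, address: str, port: int) -> None:
        # A fresh ADD_ADDR from the peer makes the pair eligible again.
        self.blocked.discard((address, port))

    def may_attempt(self, address: str, port: int) -> bool:
        return (address, port) not in self.blocked
```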

   In addition, an implementation may learn, over a number of
   connections, that certain interfaces or destination addresses
   consistently fail and may default to not trying to use MPTCP for
   these.  Behavior could also be learned for particularly badly
   performing subflows or subflows that regularly fail during use, in
   order to temporarily choose not to use these paths.

4.  Semantic Issues

   In order to support multipath operation, the semantics of some TCP
   components have changed.  To aid clarity, this section collects these
   semantic changes as a reference.

   Sequence number:  The (in-header) TCP sequence number is specific to
      the subflow.  To allow the receiver to reorder application data,
      an additional data-level sequence space is used.  In this data-
      level sequence space, the initial SYN and the final DATA_FIN
      occupy 1 octet of sequence space.  There is an explicit mapping of
      data sequence space to subflow sequence space, which is signaled
      through TCP options in data packets.

   ACK:  The ACK field in the TCP header acknowledges only the subflow
      sequence number, not the data-level sequence space.
      Implementations SHOULD NOT attempt to infer a data-level
      acknowledgment from the subflow ACKs.  This separates subflow- and
      connection-level processing at an end host.

   Duplicate ACK:  A duplicate ACK that includes any MPTCP signaling
      (with the exception of the DSS option) MUST NOT be treated as a
      signal of congestion.  To limit the chances of non-MPTCP-aware
      entities mistakenly interpreting duplicate ACKs as a signal of
      congestion, MPTCP SHOULD NOT send more than two duplicate ACKs
      containing (non-DSS) MPTCP signals in a row.

   Receive Window:  The receive window in the TCP header indicates the
      amount of free buffer space for the whole data-level connection
      (as opposed to for this subflow) that is available at the
      receiver.  This is the same semantics as regular TCP, but to
      maintain these semantics the receive window must be interpreted at
      the sender as relative to the sequence number given in the
      DATA_ACK rather than the subflow ACK in the TCP header.  In this
      way, the original flow control role is preserved.  Note that some
      middleboxes may change the receive window, and so a host SHOULD
      use the maximum value of those recently seen on the constituent
      subflows for the connection-level receive window, and also needs
      to maintain a subflow-level window for subflow-level processing.

   FIN:  The FIN flag in the TCP header applies only to the subflow it
      is sent on, not to the whole connection.  For connection-level FIN
      semantics, the DATA_FIN option is used.

   RST:  The RST flag in the TCP header applies only to the subflow it
      is sent on, not to the whole connection.  The MP_FASTCLOSE option
      provides the fast close functionality of a RST at the MPTCP
      connection level.

   Address List:  Address list management (i.e., knowledge of the local
      and remote hosts' lists of available IP addresses) is handled on a
      per-connection basis (as opposed to per subflow, per host, or per
      pair of communicating hosts).  This permits the application of
      per-connection local policy.  Adding an address to one connection
      (either explicitly through an Add Address message, or implicitly
      through a Join) has no implication for other connections between
      the same pair of hosts.

   5-tuple:  The 5-tuple (protocol, local address, local port, remote
      address, remote port) presented by kernel APIs to the application
      layer in a non-multipath-aware application is that of the first
      subflow, even if the subflow has since been closed and removed
      from the connection.  This decision, and other related API issues,
      are discussed in more detail in [6].
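
   The Receive Window semantics above can be illustrated with a short
   sketch.  It assumes that "recently seen" means the latest window
   advertised on each subflow and that data sequence numbers are 64
   bits wide; the function and parameter names are hypothetical.

```python
DSN_MODULUS = 1 << 64  # assumes 64-bit data sequence numbers


def connection_send_limit(data_ack, subflow_windows):
    """Return the data-level sequence number up to which data may be sent.

    data_ack: latest DATA_ACK received (data-level sequence space).
    subflow_windows: latest receive window seen on each subflow; at least
        one subflow is assumed to exist.
    """
    # Middleboxes may shrink the window on some subflows, so the
    # connection-level window is the maximum of the values seen.
    window = max(subflow_windows.values())
    # The window is relative to the DATA_ACK, not to any subflow ACK.
    return (data_ack + window) % DSN_MODULUS
```

   A sender would still track each subflow's own window separately for
   subflow-level processing; that part is not shown.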

5.  Security Considerations

   As identified in [9], the addition of multipath capability to TCP
   will bring with it a number of new classes of threat.  In order to
   prevent these, [2] presents a set of requirements for a security
   solution for MPTCP.  The fundamental goal is for the security of
   MPTCP to be "no worse" than regular TCP today, and the key security
   requirements are:

   o  Provide a mechanism to confirm that the parties in a subflow
      handshake are the same as in the original connection setup.

   o  Provide verification that the peer can receive traffic at a new
      address before using it as part of a connection.

   o  Provide replay protection, i.e., ensure that a request to add/
      remove a subflow is 'fresh'.

   In order to achieve these goals, MPTCP includes a hash-based
   handshake algorithm documented in Sections 3.1 and 3.2.

   The security of the MPTCP connection hangs on the use of keys that
   are shared once at the start of the first subflow, and are never sent
   again over the network (unless used in the fast close mechanism,
   Section 3.5).  To ease demultiplexing while not giving away any
   cryptographic material, future subflows use a truncated cryptographic
   hash of this key as the connection identification "token".  The keys
   are concatenated and used as keys for creating Hash-based Message
   Authentication Codes (HMACs) used on subflow setup, in order to
   verify that the parties in the handshake are the same as in the
   original connection setup.  It also provides verification that the
   peer can receive traffic at this new address.  Replay attacks would
   still be possible when only keys are used; therefore, the handshakes
   use single-use random numbers (nonces) at both ends -- this ensures
   the HMAC will never be the same on two handshakes.  Guidance on
   generating random numbers suitable for use as keys is given in [14]
   and discussed in Section 3.1.
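
   The following sketch illustrates the token and HMAC derivation
   outlined above.  It assumes the details given in Sections 3.1 and
   3.2 (64-bit keys, 32-bit nonces, SHA-1, a token taken from the most
   significant 32 bits of the key's hash, and the sender's key and
   nonce placed first); those sections remain the normative reference.
   The function names are hypothetical.

```python
import hashlib
import hmac


def token_from_key(key: bytes) -> bytes:
    """Most significant 32 bits of the SHA-1 hash of a key."""
    return hashlib.sha1(key).digest()[:4]


def join_hmac(local_key: bytes, remote_key: bytes,
              local_nonce: bytes, remote_nonce: bytes) -> bytes:
    """HMAC-SHA1 keyed with both keys, computed over both nonces.

    The sender's own key and nonce come first, so the two hosts
    produce different HMACs for the same handshake.
    """
    return hmac.new(local_key + remote_key,
                    local_nonce + remote_nonce,
                    hashlib.sha1).digest()


def verify_join_hmac(received: bytes, expected: bytes) -> bool:
    """Constant-time comparison of a received HMAC with the expected one."""
    return hmac.compare_digest(received, expected)
```

   Because fresh nonces enter every computation, the HMAC changes from
   one handshake to the next even though the keys stay the same.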

   The use of crypto capability bits in the initial connection handshake
   to negotiate use of a particular algorithm allows the deployment of
   additional crypto mechanisms in the future.  Note that this would be
   susceptible to bid-down attacks only if the attacker was on-path (and
   thus would be able to modify the data anyway).  The security
   mechanism presented in this document should therefore protect against
   all forms of flooding and hijacking attacks discussed in [9].

   During normal operation, regular TCP protection mechanisms (such as
   ensuring sequence numbers are in-window) will provide the same level
   of protection against attacks on individual TCP subflows as exists
   for regular TCP today.  Implementations will introduce additional
   buffers compared to regular TCP, to reassemble data at the connection
   level.  The application of window sizing will minimize the risk of
   denial-of-service attacks consuming resources.

   As discussed in Section 3.4.1, a host may advertise its private
   addresses, but these might point to different hosts in the receiver's
   network.  The MP_JOIN handshake (Section 3.2) will ensure that this
   does not succeed in setting up a subflow to the incorrect host.
   However, it could still create unwanted TCP handshake traffic.  This
   feature of MPTCP could be a target for denial-of-service exploits,
   with malicious participants in MPTCP connections encouraging the
   recipient to target other hosts in the network.  Therefore,
   implementations should consider heuristics (Section 3.8) at both the
   sender and receiver to reduce the impact of this.

   A small security risk could theoretically exist with key reuse, but
   in order to accomplish a replay attack, both the sender and receiver
   keys, and the sender and receiver random numbers, in the MP_JOIN
   handshake (Section 3.2) would have to match.

   This specification defines a "medium" security solution that meets
   the criteria specified at the start of this section and in the
   threat analysis [9].  However, since attacks only ever get worse, it
   is likely that a future Standards Track version of MPTCP would need
   to be able to support stronger security.  There are several ways the
   security of MPTCP could potentially be improved; some of these would
   be compatible with MPTCP as defined in this document, whilst others
   may not be.  For now, the best approach is to get experience with the
   current approach, establish what might work, and check that the
   threat analysis is still accurate.

   Possible ways of improving MPTCP security could include:

   o  defining a new MPTCP cryptographic algorithm, as negotiated in
      MP_CAPABLE.  A sub-case could be to include an additional
      deployment assumption, such as stateful servers, in order to allow
      a more powerful algorithm to be used.

   o  defining how to secure data transfer with MPTCP, whilst not
      changing the signaling part of the protocol.

   o  defining security that requires more option space, perhaps in
      conjunction with a "long options" proposal for extending the TCP
      options space (such as those surveyed in [20]), or perhaps
      building on the current approach with a second stage of MPTCP-
      option-based security.

   o  revisiting the working group's decision to exclusively use TCP
      options for MPTCP signaling, and instead look at also making use
      of the TCP payloads.

   MPTCP has been designed with several methods available to indicate a
   new security mechanism, including:

   o  available flags in MP_CAPABLE (Figure 4);

   o  available subtypes in the MPTCP option (Figure 3); and

   o  the version field in MP_CAPABLE (Figure 4).

6.  Interactions with Middleboxes

   Multipath TCP was designed to be deployable in the present world.
   Its design takes into account "reasonable" existing middlebox
   behavior.  In this section, we outline a few representative
   middlebox-related failure scenarios and show how Multipath TCP
   handles them.  Next, we list the design decisions made in MPTCP to
   accommodate the different middleboxes.

   A primary concern is our use of a new TCP option.  Middleboxes should
   forward packets with unknown options unchanged, yet there are some
   that don't.  These we expect will either strip options and pass the
   data, drop packets with new options, copy the same option into
   multiple segments (e.g., when doing segmentation), or drop options
   during segment coalescing.

   MPTCP uses a single new TCP option "Kind", and all message types are
   defined by "subtype" values (see Section 8).  This should reduce the
   chances of only some types of MPTCP options being passed; instead,
   the key characteristics that determine whether an option gets
   through are the path it takes and the presence of the SYN flag.

   MPTCP SYN packets on the first subflow of a connection contain the
   MP_CAPABLE option (Section 3.1).  If this is dropped, MPTCP SHOULD
   fall back to regular TCP.  If packets with the MP_JOIN option
   (Section 3.2) are dropped, the paths will simply not be used.

   If a middlebox strips options but otherwise passes the packets
   unchanged, MPTCP will behave safely.  If an MP_CAPABLE option is
   dropped on either the outgoing or the return path, the initiating
   host can fall back to regular TCP, as illustrated in Figure 16 and
   discussed in Section 3.1.

   Subflow SYNs contain the MP_JOIN option.  If this option is stripped
   on the outgoing path, the SYN will appear to be a regular SYN to Host
   B.  Depending on whether there is a listening socket on the target
   port, Host B will reply either with SYN/ACK or RST (subflow
   connection fails).  When Host A receives the SYN/ACK it sends a RST
   because the SYN/ACK does not contain the MP_JOIN option and its
   token.  Either way, the subflow setup fails, but otherwise does not
   affect the MPTCP connection as a whole.

        Host A                             Host B
         |              Middlebox M            |
         |                   |                 |
         |  SYN(MP_CAPABLE)  |        SYN      |
         |-------------------|---------------->|
         |                SYN/ACK              |
         |<------------------------------------|
     a) MP_CAPABLE option stripped on outgoing path

       Host A                               Host B
         |            SYN(MP_CAPABLE)          |
         |------------------------------------>|
         |             Middlebox M             |
         |                 |                   |
         |    SYN/ACK      |SYN/ACK(MP_CAPABLE)|
         |<----------------|-------------------|
     b) MP_CAPABLE option stripped on return path

   Figure 16: Connection Setup with Middleboxes that
              Strip Options from Packets
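
   The initiator's reaction to a SYN/ACK in the cases above can be
   summarized in a small sketch.  The function name, the string
   labels, and the representation of options as a set of names are
   assumptions made for illustration only.

```python
def on_syn_ack(syn_option, syn_ack_options):
    """Decide the initiator's next step after receiving a SYN/ACK.

    syn_option: "MP_CAPABLE" (first subflow) or "MP_JOIN" (new subflow).
    syn_ack_options: set of MPTCP option names found in the SYN/ACK.
    """
    if syn_option == "MP_CAPABLE":
        # Cases a) and b) of Figure 16 look the same to Host A: the
        # SYN/ACK arrives without MP_CAPABLE.
        if "MP_CAPABLE" in syn_ack_options:
            return "continue_mptcp"
        return "fallback_to_tcp"
    if syn_option == "MP_JOIN":
        # A SYN/ACK without MP_JOIN cannot belong to the MPTCP
        # connection, so only this subflow is reset.
        if "MP_JOIN" in syn_ack_options:
            return "continue_join"
        return "send_rst"
    raise ValueError("unexpected SYN option: %r" % (syn_option,))
```

   In the MP_JOIN case only the new subflow is abandoned; the existing
   MPTCP connection carries on over its other subflows.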

   We now examine data flow with MPTCP, assuming the flow is correctly
   set up, which implies the options in the SYN packets were allowed
   through by the relevant middleboxes.  If options are allowed through
   and there is no resegmentation or coalescing to TCP segments,
   Multipath TCP flows can proceed without problems.

   The case where options are stripped from data packets is discussed
   in Section 3.6 (Fallback).  If a fraction of options are stripped,
   behavior is not deterministic.  If some data sequence mappings are
   lost, the connection can continue so long as mappings exist for the
   subflow-level data (e.g., if multiple maps have been sent that
   reinforce each other).  If some subflow-level space is left unmapped,
   however, the subflow is treated as broken and is closed, through the
   process described in Section 3.6.  MPTCP should survive with a loss
   of some Data ACKs, but performance will degrade as the fraction of
   stripped options increases.  We do not expect such cases to appear in
   practice, though: most middleboxes will either strip all options or
   let them all through.
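
   As a sketch of the check implied above, a receiver could test
   whether the mappings it holds cover all of the subflow-level data
   it has received.  The function name and the (start, length) tuple
   representation are hypothetical, and sequence number wraparound is
   ignored for brevity.

```python
def unmapped_ranges(start, end, mappings):
    """Return the parts of [start, end) not covered by any mapping.

    mappings: iterable of (subflow_seq, length) pairs in relative
        subflow sequence space; overlapping mappings are allowed.
    """
    gaps = []
    cursor = start
    for seq, length in sorted(mappings):
        map_end = seq + length
        if map_end <= cursor:
            continue  # mapping lies entirely before the cursor
        if seq >= end:
            break  # mapping starts after the range of interest
        if seq > cursor:
            gaps.append((cursor, seq))
        cursor = max(cursor, map_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps
```

   An empty result means the subflow data is fully mapped and the
   connection can continue; any gap would lead to the subflow being
   treated as broken.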

   We end this section with a list of middlebox classes, their behavior,
   and the elements in the MPTCP design that allow operation through
   such middleboxes.  Issues surrounding dropping packets with options
   or stripping options were discussed above, and are not included here:

   o  NATs [21] (Network Address (and Port) Translators) change the
      source address (and often source port) of packets.  This means
      that a host will not know its public-facing address for signaling
      in MPTCP.  Therefore, MPTCP permits implicit address addition via
      the MP_JOIN option, and the handshake mechanism ensures that
      connection attempts to private addresses [18] do not cause
      problems.  Explicit address removal refers to an address by its
      Address ID, so that no knowledge of the source address is needed.

   o  Performance Enhancing Proxies (PEPs) [22] might proactively ACK
      data to increase performance.  MPTCP, however, relies on accurate
      congestion control signals from the end host, and non-MPTCP-aware
      PEPs will not be able to provide such signals.  MPTCP will,
      therefore, fall back to single-path TCP, or close the problematic
      subflow (see Section 3.6).

   o  Traffic Normalizers [23] may not allow holes in sequence numbers,
      and may cache packets and retransmit the same data.  MPTCP looks
      like standard TCP on the wire, and will not retransmit different
      data on the same subflow sequence number.  In the event of a
      retransmission, the same data will be retransmitted on the
      original TCP subflow even if it is additionally retransmitted at
      the connection level on a different subflow.

   o  Firewalls [24] might perform initial sequence number randomization
      on TCP connections.  MPTCP uses relative sequence numbers in data
      sequence mapping to cope with this.  Like NATs, firewalls will not
      permit many incoming connections, so MPTCP supports address
      signaling (ADD_ADDR) so that a multiaddressed host can invite its
      peer behind the firewall/NAT to connect out to its additional
      interface.

   o  Intrusion Detection Systems look out for traffic patterns and
      content that could threaten a network.  With MPTCP, such data may
      be spread across several subflows, which makes it more difficult
      for an IDS to analyze the traffic as a whole and potentially
      increases the risk of false positives.  An MPTCP-aware IDS,
      however, can read the tokens to correlate multiple subflows and
      reassemble the data for analysis.

   o  Application-level middleboxes such as content-aware firewalls may
      alter the payload within a subflow, such as rewriting URIs in HTTP
      traffic.  MPTCP will detect these using the checksum and close the
      affected subflow(s), if there are other subflows that can be used.
      If all subflows are affected, MPTCP will fall back to regular TCP,
      allowing such middleboxes to change the payload.  MPTCP-aware
      middleboxes should be able to adjust the payload and MPTCP
      metadata in order not to break the connection.

   In addition, all classes of middleboxes may affect TCP traffic in the
   following ways:

   o  TCP options may be removed, or packets with unknown options
      dropped, by many classes of middleboxes.  It is intended that the
      initial SYN exchange, with a TCP option, will be sufficient to
      identify the path capabilities.  If such a packet does not get
      through, MPTCP will end up falling back to regular TCP.

   o  Segmentation/Coalescing (e.g., TCP segmentation offloading) might
      copy options between packets and might strip some options.
      MPTCP's data sequence mapping includes the relative subflow
      sequence number instead of using the sequence number in the
      segment.  In this way, the mapping is independent of the packets
      that carry it.

   o  The receive window may be shrunk by some middleboxes at the
      subflow level.  MPTCP will use the maximum window at data level,
      but will also obey subflow-specific windows.
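
   The benefit of relative subflow sequence numbers, mentioned in the
   firewall and segmentation items above, can be shown with a short
   sketch.  It assumes 32-bit subflow sequence numbers and a middlebox
   that adds a constant offset to every sequence number on a subflow;
   the names are hypothetical.

```python
SEQ_MODULUS = 1 << 32  # TCP sequence numbers are 32 bits wide


def relative_ssn(absolute_seq, initial_seq):
    """Offset of a sequence number from the subflow's initial one."""
    return (absolute_seq - initial_seq) % SEQ_MODULUS


def randomize(seq, offset):
    """Model a middlebox that shifts all sequence numbers by `offset`."""
    return (seq + offset) % SEQ_MODULUS
```

   Since the middlebox shifts the initial sequence number and every
   later sequence number by the same amount, relative_ssn() yields the
   same value on both sides of it, so a mapping expressed in relative
   terms stays valid.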

7.  Acknowledgments

   The authors were originally supported by Trilogy
   (http://www.trilogy-project.org), a research project (ICT-216372)
   partially funded by the European Community under its Seventh
   Framework Program.

   Alan Ford was originally supported by Roke Manor Research.

   The authors gratefully acknowledge significant input into this
   document from Sebastien Barre, Christoph Paasch, and Andrew McDonald.

   The authors also wish to acknowledge reviews and contributions from
   Iljitsch van Beijnum, Lars Eggert, Marcelo Bagnulo, Robert Hancock,
   Pasi Sarolahti, Toby Moncaster, Philip Eardley, Sergio Lembo,
   Lawrence Conroy, Yoshifumi Nishida, Bob Briscoe, Stein Gjessing,
   Andrew McGregor, Georg Hampel, Anumita Biswas, Wes Eddy, Alexey
   Melnikov, Francis Dupont, Adrian Farrel, Barry Leiba, Robert Sparks,
   Sean Turner, Stephen Farrell, and Martin Stiemerling.

8.  IANA Considerations

   This document defines a new TCP option for MPTCP, assigned a value of
   30 (decimal) from the TCP option space.  This value is the value of
   "Kind" as seen in all MPTCP options in this document.  This value is
   defined as:

           +------+--------+-----------------------+-----------+
           | Kind | Length |        Meaning        | Reference |
           +------+--------+-----------------------+-----------+
           |  30  |    N   | Multipath TCP (MPTCP) |  RFC 6824 |
           +------+--------+-----------------------+-----------+

                     Table 1: TCP Option Kind Numbers

   This document also defines a 4-bit subtype field, for which IANA has
   created and will maintain a new sub-registry entitled "MPTCP Option
   Subtypes" under the "Transmission Control Protocol (TCP) Parameters"
   registry.  Initial values for the MPTCP option subtype registry are
   given below; future assignments are to be defined by Standards Action
   as defined by [25].  Assignments consist of the MPTCP subtype's
   symbolic name and its associated value, as per the following table.

   +-------+--------------+----------------------------+---------------+
   | Value |    Symbol    |            Name            |   Reference   |
   +-------+--------------+----------------------------+---------------+
   |  0x0  |  MP_CAPABLE  |      Multipath Capable     |  Section 3.1  |
   |  0x1  |    MP_JOIN   |       Join Connection      |  Section 3.2  |
   |  0x2  |      DSS     | Data Sequence Signal (Data |  Section 3.3  |
   |       |              |    ACK and data sequence   |               |
   |       |              |          mapping)          |               |
   |  0x3  |   ADD_ADDR   |         Add Address        | Section 3.4.1 |
   |  0x4  |  REMOVE_ADDR |       Remove Address       | Section 3.4.2 |
   |  0x5  |    MP_PRIO   |   Change Subflow Priority  | Section 3.3.8 |
   |  0x6  |    MP_FAIL   |          Fallback          |  Section 3.6  |
   |  0x7  | MP_FASTCLOSE |         Fast Close         |  Section 3.5  |
   +-------+--------------+----------------------------+---------------+

                      Table 2: MPTCP Option Subtypes

   Values 0x8 through 0xe are currently unassigned.  The value 0xf is
   reserved for Private Use within controlled testbeds.
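
   As an illustration, the Kind value and the subtypes in Table 2
   could be used to classify a raw option as sketched below.  The
   sketch assumes the layout of Figure 3 (Kind, Length, then the
   subtype in the upper 4 bits of the third octet); the names are
   hypothetical.

```python
from enum import IntEnum

MPTCP_OPTION_KIND = 30


class MptcpSubtype(IntEnum):
    """Initial values of the MPTCP Option Subtypes registry (Table 2)."""
    MP_CAPABLE = 0x0
    MP_JOIN = 0x1
    DSS = 0x2
    ADD_ADDR = 0x3
    REMOVE_ADDR = 0x4
    MP_PRIO = 0x5
    MP_FAIL = 0x6
    MP_FASTCLOSE = 0x7


def parse_subtype(option: bytes):
    """Return the subtype of one MPTCP option given as raw bytes.

    Returns an MptcpSubtype for assigned values, or the plain integer
    for unassigned (0x8-0xe) and Private Use (0xf) values.
    """
    if len(option) < 3 or option[0] != MPTCP_OPTION_KIND:
        raise ValueError("not an MPTCP option")
    if option[1] != len(option):
        raise ValueError("length field does not match option size")
    value = option[2] >> 4  # subtype is the upper 4 bits
    try:
        return MptcpSubtype(value)
    except ValueError:
        return value
```

   Returning the bare integer for unassigned values keeps the parser
   usable when future subtypes are registered.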

   IANA has created another sub-registry, "MPTCP Handshake Algorithms"
   under the "Transmission Control Protocol (TCP) Parameters" registry,
   based on the flags in MP_CAPABLE (Section 3.1).  The flags consist of
   8 bits, labeled "A" through "H", and this document assigns the bits
   as follows:

         +----------+-------------------+-----------------------+
         | Flag Bit |      Meaning      |       Reference       |
         +----------+-------------------+-----------------------+
         |     A    | Checksum required | RFC 6824, Section 3.1 |
         |     B    |   Extensibility   | RFC 6824, Section 3.1 |
         |    C-G   |     Unassigned    |                       |
         |     H    |     HMAC-SHA1     | RFC 6824, Section 3.2 |
         +----------+-------------------+-----------------------+

                    Table 3: MPTCP Handshake Algorithms

   Note that the meanings of bits C through H can be dependent upon bit
   B, depending on how Extensibility is defined in future
   specifications; see Section 3.1 for more information.
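
   A sketch of decoding the flags octet follows.  It assumes that bit
   "A" is the most significant bit of the octet and "H" the least
   significant, following the left-to-right order of Figure 4, and it
   interprets the bits as assigned in Table 3 only (that is, without
   any future Extensibility semantics).  The function name is
   hypothetical.

```python
FLAG_NAMES = "ABCDEFGH"


def decode_mp_capable_flags(flags: int) -> dict:
    """Map each flag letter to True/False for an 8-bit flags value."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in 8 bits")
    # Assumption: "A" is the most significant bit (0x80), "H" the least.
    return {name: bool(flags & (0x80 >> index))
            for index, name in enumerate(FLAG_NAMES)}
```

   For example, a value of 0x81 decodes to "Checksum required" (A) and
   HMAC-SHA1 (H) being set, with all other bits clear.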

   Future assignments in this registry are also to be defined by
   Standards Action as defined by [25].  Assignments consist of the
   value of the flags, a symbolic name for the algorithm, and a
   reference to its specification.
