RFC 7323

TCP Extensions for High Performance

Pages: 49
Proposed Standard
Obsoletes: 1323

6.  Conclusions and Acknowledgments

   This memo presented a set of extensions to TCP to provide efficient
   operation over large bandwidth * delay product paths and reliable
   operation over very high-speed paths.  These extensions are designed
   to provide compatible interworking with TCP stacks that do not
   implement the extensions.

   These mechanisms are implemented using TCP options for scaled windows
   and timestamps.  The timestamps are used for two distinct mechanisms:
   RTTM and PAWS.

   The Window Scale option was originally suggested by Mike St. Johns of
   USAF/DCA.  The present form of the option was suggested by Mike
   Karels of UC Berkeley in response to a more cumbersome scheme defined
   by Van Jacobson.  Lixia Zhang helped formulate the PAWS mechanism
   description in [RFC1185].

   Finally, much of this work originated as the result of discussions
   within the End-to-End Task Force on the theoretical limitations of
   transport protocols in general and TCP in particular.  Task force
   members and others on the end2end-interest list have made valuable
   contributions by pointing out flaws in the algorithms and the
   documentation.  Continued discussion and development since the
   publication of [RFC1323] originally occurred in the IETF TCP Large
   Windows Working Group, later on in the End-to-End Task Force, and
   most recently in the IETF TCP Maintenance Working Group.  The authors
   are grateful for all these contributions.

7.  Security Considerations

   The TCP sequence space is a fixed size, and as the window becomes
   larger, it becomes easier for an attacker to generate forged packets
   that can fall within the TCP window and be accepted as valid
   segments.  While use of timestamps and PAWS can help to mitigate
   this, when using PAWS, if an attacker is able to forge a packet that
   is acceptable to the TCP connection, a timestamp that is in the
   future would cause valid segments to be dropped due to PAWS checks.
   Hence, implementers should take care to not open the TCP window
   drastically beyond the requirements of the connection.

   See [RFC5961] for mitigation strategies to blind in-window attacks.

   A naive implementation that derives the timestamp clock value
   directly from a system uptime clock may unintentionally leak this
   information to an attacker.  This does not directly compromise any of
   the mechanisms described in this document.  However, this may be
   valuable information to a potential attacker.  It is therefore
   RECOMMENDED to generate a random, per-connection offset to be used
   with the clock source when generating the Timestamps option value
   (see Section 5.4).  By carefully choosing this random offset, further
   improvements as described in [RFC6191] are possible.
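
   The recommendation above can be illustrated with a minimal sketch.
   It assumes the offset is drawn once per connection from a
   cryptographically strong source; the function names are
   hypothetical and are not part of this specification.

```python
import secrets

TS_MASK = 0xFFFFFFFF  # TSval is a 32-bit field


def new_ts_offset() -> int:
    """Pick a random 32-bit offset once, when the connection is created."""
    return secrets.randbits(32)


def snd_tsclock(my_tsclock: int, snd_tsoffset: int) -> int:
    """Snd.TSclock = my.TSclock + Snd.TSoffset, wrapped to 32 bits."""
    return (my_tsclock + snd_tsoffset) & TS_MASK
```

   Because the offset is constant for the lifetime of a connection,
   differences between two Snd.TSclock values are unchanged, so RTTM
   and PAWS are unaffected.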

   Expanding the TCP window beyond 64 KiB for IPv6 allows Jumbograms
   [RFC2675] to be used when the local network supports packets larger
   than 64 KiB.  When larger TCP segments are used, the TCP checksum
   becomes weaker.

   Mechanisms to protect the TCP header from modification should also
   protect the TCP options.

   Middleboxes and TCP options:

      Some middleboxes have been known to remove the TCP options
      described in this document from TCP segments [Honda11].
      Middleboxes that remove TCP options described in this document
      from the <SYN> segment interfere with the selection of parameters
      appropriate for the session.  Removing any of these options in a
      <SYN,ACK> segment will leave the end hosts in a state that
      destroys the proper operation of the protocol.

      *  If a Window Scale option is removed from a <SYN,ACK> segment,
         the end hosts will not negotiate the window scaling factor
         correctly.  Middleboxes must not remove or modify the Window
         Scale option from <SYN,ACK> segments.

      *  If a stateful firewall uses the window field to detect whether
         a received segment is inside the current window, and does not
         support the Window Scale option, it will not be able to
         correctly determine whether or not a packet is in the window.
         Such a middlebox must also support the Window Scale option
         and apply the scale factor when processing segments.  If the
         window scale factor cannot be determined, it must not do
         window-based processing.

      *  If the Timestamps option is removed from the <SYN> or <SYN,ACK>
         segments, high speed connections that need PAWS would not have
         that protection.  Successful negotiation of the Timestamps
         option enforces a stricter verification of incoming segments at
         the receiver.  If the Timestamps option was removed from a
         subsequent data segment after a successful negotiation (e.g.,
         as part of resegmentation), the segment is discarded by the
         receiver without further processing.  Middleboxes should not
         remove the Timestamps option.

      *  It must be noted that [RFC1323] doesn't address the case of the
         Timestamps option being dropped or selectively omitted after
         being negotiated, and that the update in this document may
         cause some broken middlebox behavior to be detected
         (potentially unresponsive TCP sessions).
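
      The window-based check described for stateful firewalls might
      look like the following sketch.  It assumes the middlebox has
      recorded the shift count advertised by the host whose window is
      being checked and that both endpoints offered the Window Scale
      option; the function and parameter names are hypothetical.

```python
from typing import Optional

SEQ_MOD = 1 << 32  # TCP sequence numbers are 32-bit and wrap


def in_window(seg_seq: int, rcv_nxt: int, seg_wnd: int,
              wind_shift: Optional[int]) -> Optional[bool]:
    """Return True/False for the in-window test, or None to skip it.

    wind_shift is the shift count the middlebox recorded from the
    advertising host's <SYN> or <SYN,ACK>; None means it is unknown.
    """
    if wind_shift is None:
        return None  # scale factor unknown: no window-based processing
    true_window = seg_wnd << wind_shift  # apply the scale factor
    # Modular distance from the left window edge handles wraparound.
    return (seg_seq - rcv_nxt) % SEQ_MOD < true_window
```

      A firewall that ignored the shift count would treat the second
      case in the test below (a segment 70000 bytes into a 128 KiB
      window) as out of window.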

   Implementations that depend on PAWS could provide a mechanism for the
   application to determine whether or not PAWS is in use on the
   connection and choose to terminate the connection if that protection
   doesn't exist.  This is not just to protect the connection against
   middleboxes that might remove the Timestamps option, but also against
   remote hosts that do not have Timestamp support.
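
   As one possible illustration of such a mechanism, the sketch below
   reads the negotiated options on Linux through the TCP_INFO socket
   option.  This is platform specific and is an assumption of the
   example, not something this document defines: it relies on the
   Linux struct tcp_info layout, in which tcpi_options is the sixth
   byte and bit 0 (TCPI_OPT_TIMESTAMPS) indicates that timestamps were
   negotiated.

```python
import socket
import struct

TCPI_OPT_TIMESTAMPS = 1  # bit in tcpi_options (Linux <linux/tcp.h>)


def timestamps_negotiated(tcp_info: bytes) -> bool:
    """Inspect the tcpi_options byte of a Linux struct tcp_info."""
    # The struct begins with six u8 fields; tcpi_options is the sixth.
    options = struct.unpack_from("6B", tcp_info)[5]
    return bool(options & TCPI_OPT_TIMESTAMPS)


def connection_uses_timestamps(sock: socket.socket) -> bool:
    """Ask the kernel which options were negotiated (Linux only)."""
    info = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    return timestamps_negotiated(info)
```

   An application that requires PAWS could call
   connection_uses_timestamps() once the connection is established and
   close the socket if it returns False.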

7.1.  Privacy Considerations

   The TCP options described in this document do not expose individual
   users' data.  However, a naive implementation simply using the system
   clock as a source for the Timestamps option will reveal
   characteristics of the TCP, potentially allowing more targeted
   attacks.  It is therefore RECOMMENDED to generate a random, per-
   connection offset to be used with the clock source when generating
   the Timestamps option value (see Section 5.4).

   Furthermore, the combination, relative ordering, and padding of the
   TCP options described in Sections 2.2 and 3.2 will reveal additional
   clues to allow the fingerprinting of the system.

8.  IANA Considerations

   The described TCP options are well known from the superseded
   [RFC1323].  IANA has updated the "TCP Option Kind Numbers" table
   under "TCP Parameters" to list this document (RFC 7323) as the
   reference for "Window Scale" and "Timestamps".

9.  References

9.1.  Normative References

   [RFC793]   Postel, J., "Transmission Control Protocol", STD 7, RFC
              793, September 1981.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
              November 1990.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [Allman99] Allman, M. and V. Paxson, "On Estimating End-to-End
              Network Path Properties", Proceedings of the ACM SIGCOMM
              Technical Symposium, Cambridge, MA, September 1999,
              <http://aciri.org/mallman/papers/estimation-la.pdf>.

   [Floyd05]  Floyd, S., "Subject: Re: [tcpm] RFC 1323: Timestamps
              option", message to the TCPM mailing list, 26 January
              2007, <http://www.ietf.org/mail-archive/web/tcpm/current/
              msg02508.html>.

   [Garlick77]
              Garlick, L., Rom, R., and J. Postel, "Issues in Reliable
              Host-to-Host Protocols", Proceedings of the Second
              Berkeley Workshop on Distributed Data Management and
              Computer Networks, March 1977,
              <http://www.rfc-editor.org/ien/ien12.txt>.

   [Honda11]  Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A.,
              Handley, M., and H. Tokuda, "Is it Still Possible to
              Extend TCP?", Proceedings of the ACM Internet Measurement
              Conference (IMC) '11, November 2011.

   [Jacobson88a]
              Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
              '88, Stanford, CA, August 1988,
              <http://ee.lbl.gov/papers/congavoid.pdf>.

   [Jacobson90a]
              Jacobson, V., "4BSD Header Prediction", ACM Computer
              Communication Review, April 1990.

   [Jacobson90c]
              Jacobson, V., "Subject: modified TCP congestion avoidance
              algorithm", message to the End2End-Interest mailing list,
              30 April 1990, <ftp://ftp.isi.edu/end2end/
              end2end-interest-1990.mail>.

   [Karn87]   Karn, P. and C. Partridge, "Estimating Round-Trip Times in
              Reliable Transport Protocols", Proceedings of SIGCOMM '87,
              August 1987.

   [Kuehlewind10]
              Kuehlewind, M. and B. Briscoe, "Chirping for Congestion
              Control - Implementation Feasibility", November 2010,
              <http://bobbriscoe.net/projects/netsvc_i-f/
              chirp_pfldnet10.pdf>.

   [Kuzmanovic03]
              Kuzmanovic, A. and E. Knightly, "TCP-LP: Low-Priority
              Service via End-Point Congestion Control", 2003,
              <www.cs.northwestern.edu/~akuzma/doc/TCP-LP-ToN.pdf>.

   [Ludwig00] Ludwig, R. and K. Sklower, "The Eifel Retransmission
              Timer", ACM SIGCOMM Computer Communication Review Volume
              30 Issue 3, July 2000,
              <http://ccr.sigcomm.org/archive/2000/july00/
              LudwigFinal.pdf>.

   [Martin03] Martin, D., "Subject: [Tsvwg] RFC 1323.bis", message to
              the TSVWG mailing list, 30 September 2003,
              <http://www.ietf.org/mail-archive/web/tsvwg/current/
              msg04435.html>.

   [Medina04] Medina, A., Allman, M., and S. Floyd, "Measuring
              Interactions Between Transport Protocols and Middleboxes",
              Proceedings of the ACM SIGCOMM/USENIX Internet Measurement
              Conference, October 2004,
              <http://www.icir.net/tbit/tbit-Aug2004.pdf>.

   [Medina05] Medina, A., Allman, M., and S. Floyd, "Measuring the
              Evolution of Transport Protocols in the Internet", ACM
              Computer Communication Review Volume 35, No. 2, April
              2005,
              <http://icir.net/floyd/papers/TCPevolution-Mar2005.pdf>.

   [RE-1323BIS]
              Oppermann, A., "Subject: Re: [tcpm] I-D Action: draft-
              ietf.tcpm-1323bis-13.txt", message to the TCPM mailing
              list, 01 June 2013, <http://www.ietf.org/
              mail-archive/web/tcpm/current/msg08001.html>.

   [RFC1072]  Jacobson, V. and R. Braden, "TCP extensions for long-delay
              paths", RFC 1072, October 1988.

   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC1185]  Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for
              High-Speed Paths", RFC 1185, October 1990.

   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
              for High Performance", RFC 1323, May 1992.

   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
              for IP version 6", RFC 1981, August 1996.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2675]  Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms",
              RFC 2675, August 1999.

   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
              Extension to the Selective Acknowledgement (SACK) Option
              for TCP", RFC 2883, July 2000.

   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
              for TCP", RFC 3522, April 2003.

   [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
              for TCP", RFC 4015, February 2005.

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, March 2007.

   [RFC4963]  Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
              Errors at High Data Rates", RFC 4963, July 2007.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC5961]  Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's
              Robustness to Blind In-Window Attacks", RFC 5961, August
              2010.

   [RFC6191]  Gont, F., "Reducing the TIME-WAIT State Using TCP
              Timestamps", BCP 159, RFC 6191, April 2011.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298, June
              2011.

   [RFC6528]  Gont, F. and S. Bellovin, "Defending against Sequence
              Number Attacks", RFC 6528, February 2012.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
              and Y. Nishida, "A Conservative Loss Recovery Algorithm
              Based on Selective Acknowledgment (SACK) for TCP", RFC
              6675, August 2012.

   [RFC6691]  Borman, D., "TCP Options and Maximum Segment Size (MSS)",
              RFC 6691, July 2012.

   [RFC6817]  Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind,
              "Low Extra Delay Background Transport (LEDBAT)", RFC 6817,
              December 2012.

Appendix A.  Implementation Suggestions

   TCP Option Layout

      The following layout is recommended for sending options on
      non-<SYN> segments to achieve maximum feasible alignment of 32-bit
      and 64-bit machines.

                   +--------+--------+--------+--------+
                   |   NOP  |  NOP   |  TSopt |   10   |
                   +--------+--------+--------+--------+
                   |          TSval timestamp          |
                   +--------+--------+--------+--------+
                   |          TSecr timestamp          |
                   +--------+--------+--------+--------+
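
      The layout above can be produced with a few lines of code.  The
      sketch below is illustrative only; the option kind values (1 for
      NOP, 8 for Timestamps) are the registered TCP option kinds, and
      the function name is hypothetical.

```python
import struct

TCPOPT_NOP = 1          # No-Operation option kind
TCPOPT_TIMESTAMP = 8    # Timestamps option kind
TCPOLEN_TIMESTAMP = 10  # Timestamps option length in bytes


def pack_tsopt(tsval: int, tsecr: int) -> bytes:
    """Build NOP, NOP, TSopt so TSval and TSecr are 32-bit aligned."""
    return struct.pack(
        "!BBBBII",
        TCPOPT_NOP, TCPOPT_NOP, TCPOPT_TIMESTAMP, TCPOLEN_TIMESTAMP,
        tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF,
    )
```

      The two leading NOPs pad the 10-byte option to 12 bytes, so the
      two timestamp fields start on 4-byte boundaries.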

   Interaction with the TCP Urgent Pointer

      The TCP Urgent Pointer, like the TCP window, is a 16-bit value.
      Some of the original discussion for the TCP Window Scale option
      included proposals to increase the Urgent Pointer to 32 bits.  As
      it turns out, this is unnecessary.  There are two observations
      that should be made:

      (1)  With IP version 4, the largest amount of TCP data that can be
           sent in a single packet is 65495 bytes (64 KiB - 1 - size of
           fixed IP and TCP headers).

      (2)  Updates to the Urgent Pointer while the user is in "urgent
           mode" are invisible to the user.

      This means that if the Urgent Pointer points beyond the end of the
      TCP data in the current segment, then the user will remain in
      urgent mode until the next TCP segment arrives.  That segment will
      update the Urgent Pointer to a new offset, and the user will never
      have left urgent mode.

      Thus, to properly implement the Urgent Pointer, the sending TCP
      only has to check for overflow of the 16-bit Urgent Pointer field
      before filling it in.  If it does overflow, then a value of 65535
      should be inserted into the Urgent Pointer.
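
      A minimal sketch of this overflow check follows.  The names are
      hypothetical, and the 20-byte header sizes assume IPv4 and TCP
      headers without options, as in observation (1).

```python
URG_MAX = 0xFFFF  # largest value the 16-bit Urgent Pointer can hold

# Largest TCP payload in one IPv4 packet: 64 KiB - 1, minus the fixed
# 20-byte IP header and the fixed 20-byte TCP header.
MAX_IPV4_TCP_PAYLOAD = 65535 - 20 - 20


def urgent_pointer_field(urgent_offset: int) -> int:
    """Clamp the offset from SEG.SEQ to the urgent data to 16 bits."""
    return min(urgent_offset, URG_MAX)
```

      Since a clamped pointer always lands beyond the end of the data
      in the current segment, the receiver simply stays in urgent mode
      until a later segment carries an offset that fits.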

      The same technique applies to IP version 6, except in the case of
      IPv6 Jumbograms.  When IPv6 Jumbograms are supported, [RFC2675]
      requires additional steps for dealing with the Urgent Pointer;
      these steps are described in Section 5.2 of [RFC2675].

Appendix B.  Duplicates from Earlier Connection Incarnations

   There are two cases to be considered: (1) a system crashing (and
   losing connection state) and restarting, and (2) the same connection
   being closed and reopened without a loss of host state.  These will
   be described in the following two sections.

B.1.  System Crash with Loss of State

   TCP's quiet time of one MSL upon system startup handles the loss of
   connection state in a system crash/restart.  For an explanation, see,
   for example, "Knowing When to Keep Quiet" in the TCP protocol
   specification [RFC793].  The MSL that is required here does not
   depend upon the transfer speed.  The current TCP MSL of 2 minutes
   seemed acceptable as an operational compromise, when many host
   systems used to take this long to boot after a crash.  Current host
   systems can boot considerably faster.

   The Timestamps option may be used to ease the MSL requirements (or to
   provide additional security against data corruption).  If timestamps
   are being used and if the timestamp clock can be guaranteed to be
   monotonic over a system crash/restart, i.e., if the first value of
   the sender's timestamp clock after a crash/restart can be guaranteed
   to be greater than the last value before the restart, then a quiet
   time is unnecessary.

   To dispense totally with the quiet time would require that the host
   clock be synchronized to a time source that is stable over the crash/
   restart period, with an accuracy of one timestamp clock tick or
   better.  We can back off from this strict requirement to take
   advantage of approximate clock synchronization.  Suppose that the
   clock is always resynchronized to within N timestamp clock ticks and
   that booting (extended with a quiet time, if necessary) takes more
   than N ticks.  This will guarantee monotonicity of the timestamps,
   which can then be used to reject old duplicates even without an
   enforced MSL.

B.2.  Closing and Reopening a Connection

   When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state
   ties up the socket pair for 4 minutes (see Section 3.5 of [RFC793]).
   Applications built upon TCP that close one connection and open a new
   one (e.g., an FTP data transfer connection using Stream mode) must
   choose a new socket pair each time.  The TIME-WAIT delay serves two
   different purposes:

   (a)  Implement the full-duplex reliable close handshake of TCP.

        The proper time to delay the final close step is not really
        related to the MSL; it depends instead upon the RTO for the FIN
        segments and, therefore, upon the RTT of the path.  (It could be
        argued that the side that is sending a FIN knows what degree of
        reliability it needs, and therefore it should be able to
        determine the length of the TIME-WAIT delay for the FIN's
        recipient.  This could be accomplished with an appropriate TCP
        option in FIN segments.)

        Although there is no formal upper bound on RTT, common network
        engineering practice makes an RTT greater than 1 minute very
        unlikely.  Thus, the 4-minute delay in TIME-WAIT state works
        satisfactorily to provide a reliable full-duplex TCP close.
        Note again that this is independent of MSL enforcement and
        network speed.

        The TIME-WAIT state could cause an indirect performance problem
        if an application needed to repeatedly close one connection and
        open another at a very high frequency, since the number of
        available TCP ports on a host is less than 2^16.  However, high
        network speeds are not the major contributor to this problem;
        the RTT is the limiting factor in how quickly connections can be
        opened and closed.  Therefore, this problem will be no worse at
        high transfer speeds.

   (b)  Allow old duplicate segments to expire.

        To replace this function of TIME-WAIT state, a mechanism would
        have to operate across connections.  PAWS is defined strictly
        within a single connection; the last timestamp (TS.Recent) is
        kept in the connection control block and discarded when a
        connection is closed.

        An additional mechanism could be added to the TCP, a per-host
        cache of the last timestamp received from any connection.  This
        value could then be used in the PAWS mechanism to reject old
        duplicate segments from earlier incarnations of the connection,
        if the timestamp clock can be guaranteed to have ticked at least
        once since the old connection was open.  This would require that
        the TIME-WAIT delay plus the RTT together must be at least one
        tick of the sender's timestamp clock.  Such an extension is not
        part of the proposal of this RFC.

        Note that this is a variant on the mechanism proposed by
        Garlick, Rom, and Postel [Garlick77], which required each host
        to maintain connection records containing the highest sequence
        numbers on every connection.  Using timestamps instead, it is
        only necessary to keep one quantity per remote host, regardless
        of the number of simultaneous connections to that host.

Appendix C.  Summary of Notation

   The following notation has been used in this document.

   Options

      WSopt:            TCP Window Scale option
      TSopt:            TCP Timestamps option

   Option Fields

      shift.cnt:        Window scale byte in WSopt
      TSval:            32-bit Timestamp Value field in TSopt
      TSecr:            32-bit Timestamp Reply field in TSopt

   Option Fields in Current Segment

      SEG.TSval:        TSval field from TSopt in current segment
      SEG.TSecr:        TSecr field from TSopt in current segment
      SEG.WSopt:        8-bit value in WSopt

   Clock Values

      my.TSclock:       System-wide source of 32-bit timestamp values
      my.TSclock.rate:  Period of my.TSclock (1 ms to 1 sec)
      Snd.TSoffset:     An offset for randomizing Snd.TSclock
      Snd.TSclock:      my.TSclock + Snd.TSoffset

   Per-Connection State Variables

      TS.Recent:        Latest received Timestamp
      Last.ACK.sent:    Last ACK field sent
      Snd.TS.OK:        1-bit flag
      Snd.WS.OK:        1-bit flag
      Rcv.Wind.Shift:   Receive window scale exponent
      Snd.Wind.Shift:   Send window scale exponent
      Start.Time:       Snd.TSclock value when the segment being timed
                        was sent (used by code from before RFC 1323).

   Procedure

      Update_SRTT(m)    Procedure to update the smoothed RTT and RTT
                        variance estimates, using the rules of
                        [Jacobson88a], given m, a new RTT measurement

   Send Sequence Variables

      SND.UNA:          Send unacknowledged
      SND.NXT:          Send next
      SND.WND:          Send window
      ISS:              Initial send sequence number

   Receive Sequence Variables

      RCV.NXT:          Receive next
      RCV.WND:          Receive window
      IRS:              Initial receive sequence number

Appendix D.  Event Processing Summary

   This appendix attempts to specify the algorithms unambiguously by
   presenting modifications to the Event Processing rules in Section 3.9
   of RFC 793.  The change bars ("|") indicate lines that are different
   from RFC 793.

   OPEN Call

      ...

      An initial send sequence number (ISS) is selected.  Send a <SYN>
 |    segment of the form:
 |
 |      <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Shift>

      ...

   SEND Call

      CLOSED STATE (i.e., TCB does not exist)

         ...

      LISTEN STATE

         If active and the foreign socket is specified, then change the
         connection from passive to active, select an ISS.  Send a SYN
 |       segment containing the options: <TSval=Snd.TSclock> and
 |       <WSopt=Rcv.Wind.Shift>.  Set SND.UNA to ISS, SND.NXT to ISS+1.
         Enter SYN-SENT state.  ...

      SYN-SENT STATE
      SYN-RECEIVED STATE

         ...

      ESTABLISHED STATE
      CLOSE-WAIT STATE

         Segmentize the buffer and send it with a piggybacked
         acknowledgment (acknowledgment value = RCV.NXT).  ...

         If the urgent flag is set ...

 |       If the Snd.TS.OK flag is set, then include the TCP Timestamps
 |       option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data
 |       segment.
 |
 |       Scale the receive window for transmission in the segment
 |       header:
 |
 |               SEG.WND = (RCV.WND >> Rcv.Wind.Shift).
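
      The scaling step, together with the matching rescaling done by
      the peer on receipt, can be sketched as follows.  The function
      names are hypothetical, and the clamp to 16 bits is an added
      safeguard assumed by this example rather than part of the rule
      above.

```python
def scale_for_header(rcv_wnd: int, rcv_wind_shift: int) -> int:
    """SEG.WND = RCV.WND >> Rcv.Wind.Shift, for the 16-bit window field."""
    # The min() clamp is this sketch's assumption, guarding the field size.
    return min(rcv_wnd >> rcv_wind_shift, 0xFFFF)


def true_window(seg_wnd: int, snd_wind_shift: int) -> int:
    """Receiver side: TrueWindow = SEG.WND << Snd.Wind.Shift."""
    return seg_wnd << snd_wind_shift
```

      With a shift of 7, a 1,000,000-byte window is advertised as 7812
      and recovered as 999,936 bytes: the right shift discards the low
      seven bits, so the advertised window has a granularity of 128
      bytes.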

   SEGMENT ARRIVES

      ...

      If the state is LISTEN then

         first check for an RST

            ...

         second check for an ACK

            ...

         third check for a SYN

            If the SYN bit is set, check the security.  If the ...

               ...

            If the SEG.PRC is less than the TCB.PRC then continue.

 |          Check for a Window Scale option (WSopt); if one is found,
 |          save SEG.WSopt in Snd.Wind.Shift and set Snd.WS.OK flag on.
 |          Otherwise, set both Snd.Wind.Shift and Rcv.Wind.Shift to
 |          zero and clear Snd.WS.OK flag.
 |
 |          Check for a TSopt option; if one is found, save SEG.TSval in
 |          the variable TS.Recent and turn on the Snd.TS.OK bit.

            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
            other control or text should be queued for processing later.
            ISS should be selected and a SYN segment sent of the form:

                    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

 |           If the Snd.WS.OK bit is on, include a WSopt
 |           <WSopt=Rcv.Wind.Shift> in this segment.  If the Snd.TS.OK
 |           bit is on, include a TSopt <TSval=Snd.TSclock,
 |           TSecr=TS.Recent> in this segment.  Last.ACK.sent is set to
 |           RCV.NXT.

            SND.NXT is set to ISS+1 and SND.UNA to ISS.  The connection
            state should be changed to SYN-RECEIVED.  Note that any
            other incoming control or data (combined with SYN) will be
            processed in the SYN-RECEIVED state, but processing of SYN
            and ACK should not be repeated.  If the listen was not fully
            specified (i.e., the foreign socket was not fully
            specified), then the unspecified fields should be filled in
            now.

         fourth other text or control

            ...
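
      The option handling in the third step above (a <SYN> arriving in
      LISTEN state) can be sketched as follows.  The Tcb class and the
      function name are hypothetical; a missing option is represented
      by None.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tcb:
    rcv_wind_shift: int       # Rcv.Wind.Shift we would like to use
    snd_wind_shift: int = 0   # Snd.Wind.Shift
    snd_ws_ok: bool = False   # Snd.WS.OK
    snd_ts_ok: bool = False   # Snd.TS.OK
    ts_recent: int = 0        # TS.Recent


def process_syn_options(tcb: Tcb, seg_wsopt: Optional[int],
                        seg_tsval: Optional[int]) -> List[str]:
    """Record options from a <SYN>; return those to echo in <SYN,ACK>."""
    if seg_wsopt is not None:
        tcb.snd_wind_shift = seg_wsopt
        tcb.snd_ws_ok = True
    else:
        # No WSopt from the peer: scaling is off in both directions.
        tcb.snd_wind_shift = 0
        tcb.rcv_wind_shift = 0
        tcb.snd_ws_ok = False
    if seg_tsval is not None:
        tcb.ts_recent = seg_tsval
        tcb.snd_ts_ok = True
    reply = []
    if tcb.snd_ws_ok:
        reply.append("WSopt")
    if tcb.snd_ts_ok:
        reply.append("TSopt")
    return reply
```

      Note that each option is echoed in the <SYN,ACK> only if the
      peer offered it in its <SYN>.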

      If the state is SYN-SENT then

         first check the ACK bit

            ...

         ...

         fourth check the SYN bit

            ...

            If the SYN bit is on and the security/compartment and
            precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
            IRS is set to SEG.SEQ.  SND.UNA should be advanced to equal
            SEG.ACK (if there is an ACK), and any segments on the
            retransmission queue which are thereby acknowledged should
            be removed.

 |          Check for a Window Scale option (WSopt); if it is found,
 |          save SEG.WSopt in Snd.Wind.Shift; otherwise, set both
 |          Snd.Wind.Shift and Rcv.Wind.Shift to zero.
 |
 |          Check for a TSopt option; if one is found, save SEG.TSval in
 |          variable TS.Recent and turn on the Snd.TS.OK bit in the
 |          connection control block.  If the ACK bit is set, use
 |          Snd.TSclock - SEG.TSecr as the initial RTT estimate.

            If SND.UNA > ISS (our SYN has been ACKed), change the
            connection state to ESTABLISHED, form an <ACK> segment:

                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

 |          and send it.  If the Snd.TS.OK bit is on, include a TSopt
 |          option <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK>
 |          segment.  Last.ACK.sent is set to RCV.NXT.

            Data or controls that were queued for transmission may be
            included.  If there are other controls or text in the
            segment, then continue processing at the sixth step below
            where the URG bit is checked; otherwise, return.

            Otherwise, enter SYN-RECEIVED, form a <SYN,ACK> segment:

                    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

 |          and send it.  If the Snd.TS.OK bit is on, include a TSopt
 |          option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
 |          If the Snd.WS.OK bit is on, include a WSopt option
 |          <WSopt=Rcv.Wind.Shift> in this segment.  Last.ACK.sent is
 |          set to RCV.NXT.

            If there are other controls or text in the segment, queue
            them for processing after the ESTABLISHED state has been
            reached, return.

         fifth, if neither of the SYN or RST bits is set then drop the
         segment and return.

      Otherwise

      first check the sequence number

         SYN-RECEIVED STATE
         ESTABLISHED STATE
         FIN-WAIT-1 STATE
         FIN-WAIT-2 STATE
         CLOSE-WAIT STATE
         CLOSING STATE
         LAST-ACK STATE
         TIME-WAIT STATE

            Segments are processed in sequence.  Initial tests on
            arrival are used to discard old duplicates, but further
            processing is done in SEG.SEQ order.  If a segment's
            contents straddle the boundary between old and new, only the
            new parts should be processed.

 |          Rescale the received window field:
 |
 |                TrueWindow = SEG.WND << Snd.Wind.Shift,
 |
 |          and use "TrueWindow" in place of SEG.WND in the following
 |          steps.
 |
 |          Check whether the segment contains a Timestamps option and
 |          if bit Snd.TS.OK is on.  If so:
 |
 |             If SEG.TSval < TS.Recent and the RST bit is off:
 |
 |                If the connection has been idle more than 24 days,
 |                save SEG.TSval in variable TS.Recent, else the segment
 |                is not acceptable; follow the steps below for an
 |                unacceptable segment.
 |
 |             If SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent,
 |             then save SEG.TSval in variable TS.Recent.
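
            The timestamp test above can be sketched as follows.  The
            sketch assumes that timestamps and sequence numbers are
            compared in 32-bit modular arithmetic, and that the idle
            time is tracked in seconds; the function names are
            hypothetical.

```python
from typing import Tuple

MOD32 = 1 << 32
HALF32 = 1 << 31
PAWS_IDLE_SECS = 24 * 24 * 60 * 60  # 24 days


def lt32(a: int, b: int) -> bool:
    """a < b in 32-bit modular arithmetic."""
    return 0 < (b - a) % MOD32 < HALF32


def paws_check(seg_tsval: int, seg_seq: int, rst: bool, ts_recent: int,
               last_ack_sent: int, idle_secs: float) -> Tuple[bool, int]:
    """Return (passes_paws, new TS.Recent) for an arriving segment."""
    if lt32(seg_tsval, ts_recent) and not rst:
        if idle_secs > PAWS_IDLE_SECS:
            return True, seg_tsval   # TS.Recent is stale; replace it
        return False, ts_recent      # old duplicate: not acceptable
    seq_leq = not lt32(last_ack_sent, seg_seq)  # SEG.SEQ <= Last.ACK.sent
    if not lt32(seg_tsval, ts_recent) and seq_leq:
        return True, seg_tsval
    return True, ts_recent
```

            A segment that passes this test is still subject to the
            ordinary sequence-number acceptability test that follows.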

            There are four cases for the acceptability test for an
            incoming segment:

               ...

            If an incoming segment is not acceptable, an acknowledgment
            should be sent in reply (unless the RST bit is set; if so
            drop the segment and return):

                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

 |          If the Snd.TS.OK bit is on, include the Timestamps option
 |          <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
 |          Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
 |          it.  After sending the acknowledgment, drop the unacceptable
            segment and return.

      ...

      fifth check the ACK field,

         if the ACK bit is off drop the segment and return

         if the ACK bit is on

            ...

            ESTABLISHED STATE

               If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
 |             SEG.ACK.  Also compute a new estimate of round-trip time.
 |             If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr;
 |             otherwise, use the elapsed time since the first segment
 |             in the retransmission queue was sent.  Any segments on
               the retransmission queue that are thereby entirely
               acknowledged...
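
               The timestamp-based RTT measurement can be sketched as
               follows.  The sketch assumes the subtraction is done in
               32-bit modular arithmetic so that clock wraparound is
               harmless; the function names and the tick_seconds
               parameter are hypothetical.

```python
TS_MOD = 1 << 32  # timestamp values are 32-bit and wrap


def rtt_sample_ticks(snd_tsclock: int, seg_tsecr: int) -> int:
    """Snd.TSclock - SEG.TSecr, in timestamp clock ticks."""
    return (snd_tsclock - seg_tsecr) % TS_MOD


def rtt_sample_seconds(snd_tsclock: int, seg_tsecr: int,
                       tick_seconds: float) -> float:
    """Convert the tick difference using the local clock period."""
    return rtt_sample_ticks(snd_tsclock, seg_tsecr) * tick_seconds
```

               The resulting sample is the measurement m passed to
               Update_SRTT(m).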

      ...

      seventh, process the segment text,

         ESTABLISHED STATE
         FIN-WAIT-1 STATE
         FIN-WAIT-2 STATE

            ...

            Send an acknowledgment of the form:

                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

 |          If the Snd.TS.OK bit is on, include the Timestamps option
 |          <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
 |          Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
 |          it.  This acknowledgment should be piggybacked on a segment
            being transmitted if possible without incurring undue delay.

            ...
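
   The timestamp-related steps in the event processing above can be
   sketched in Python as follows.  This is only an illustration, not
   part of the specification: the function names are hypothetical, and
   the use of 32-bit modular comparison for both sequence numbers and
   timestamps is an assumption of this sketch.  The RTT sample is
   expressed in timestamp clock ticks.

```python
MOD = 1 << 32          # sequence numbers and timestamps are 32-bit values
HALF = 1 << 31


def mod_leq(a, b):
    """True if a <= b under 32-bit modular comparison (sketch assumption)."""
    return a == b or ((b - a) % MOD) < HALF


def update_ts_recent(ts_recent, last_ack_sent, seg_tsval, seg_seq):
    """Apply: if SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent,
    then save SEG.TSval in TS.Recent.  Returns the new TS.Recent."""
    if mod_leq(ts_recent, seg_tsval) and mod_leq(seg_seq, last_ack_sent):
        return seg_tsval
    return ts_recent


def rtt_sample(snd_tsclock, seg_tsecr):
    """RTT estimate when Snd.TS.OK is on: Snd.TSclock - SEG.TSecr."""
    return (snd_tsclock - seg_tsecr) % MOD


# Segment at or before Last.ACK.sent with a newer timestamp: accepted.
new_recent = update_ts_recent(100, 5000, 120, 4990)
# Segment beyond Last.ACK.sent: TS.Recent is left unchanged.
unchanged = update_ts_recent(100, 5000, 120, 5001)
# The subtraction still works when the timestamp clock has wrapped.
wrapped_rtt = rtt_sample(5, 0xFFFFFFFE)
```

   The modular subtraction in rtt_sample() keeps the sample meaningful
   across a wrap of the timestamp clock.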

Appendix E.  Timestamps Edge Cases

   While the rules laid out for when to calculate RTTM produce the
   correct results most of the time, there are some edge cases where an
   incorrect RTTM can be calculated.  All of these situations involve
   the loss of segments.  These scenarios are believed to be rare, and
   if they do happen, they inflate only a single RTTM measurement,
   which limits the effect on RTO calculations.

   [Martin03] cites two similar cases when the returning <ACK> is lost,
   and before the retransmission timer fires, another returning <ACK>
   segment arrives, which acknowledges the data.  In this case, the RTTM
   calculated will be inflated:

          clock
            tc=1   <A, TSval=1> ------------------->

            tc=2   (lost) <---- <ACK(A), TSecr=1, win=n>
                (RTTM would have been 1)

                   (receive window opens, window update is sent)
            tc=5        <---- <ACK(A), TSecr=1, win=m>
                   (RTTM is calculated at 4)

   One thing to note about this situation is that it is somewhat bounded
   by RTO + RTT, limiting how far off the RTTM calculation will be.
   While more complex scenarios can be constructed that produce larger
   inflations (e.g., retransmissions are lost), those scenarios involve
   multiple segment losses, and the connection will have other more
   serious operational problems than using an inflated RTTM in the RTO
   calculation.
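
   The arithmetic of the example above can be reproduced with a short
   Python sketch.  The helper names and the prior SRTT of 1 tick are
   assumptions made for illustration; the smoothing gain of 1/8 is the
   alpha value from [RFC6298].  Clock wrap is ignored for brevity.

```python
ALPHA = 1 / 8  # smoothing gain from RFC 6298


def rttm(arrival_tick, tsecr):
    """RTTM in clock ticks: arrival time of the ACK minus the echoed TSval."""
    return arrival_tick - tsecr


def smooth(srtt, sample, alpha=ALPHA):
    """One SRTT update step: SRTT <- (1 - alpha) * SRTT + alpha * R'."""
    return (1 - alpha) * srtt + alpha * sample


would_have_been = rttm(2, 1)   # the lost <ACK> at tc=2
measured = rttm(5, 1)          # the window update at tc=5
inflation = measured - would_have_been

# Assumed prior SRTT of 1 tick: a single inflated sample moves the
# smoothed estimate by only alpha * inflation.
srtt_after = smooth(1.0, measured)
```

   With these assumed numbers, the sample is inflated by 3 ticks, but
   the smoothed estimate rises only from 1 to 1.375 ticks.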

Appendix F.  Window Retraction Example

   Consider an established TCP connection using a scale factor of 128
   (Snd.Wind.Shift=7 and Rcv.Wind.Shift=7) that is running with a very
   small window because the receiver is bottlenecked and both ends are
   doing small reads and writes.

   Consider the ACKs coming back:

   SEG.ACK  SEG.WIN computed SND.WIN   receiver's actual window
   1000     2       1256               1300

   The sender writes 40 bytes and receiver ACKs:

   1040     2       1296               1300

   The sender writes 5 additional bytes and the receiver has a problem.
   Two choices:

   1045     2       1301               1300   - BEYOND BUFFER

   1045     1       1173               1300   - RETRACTED WINDOW

   This is a general problem and can happen any time the sender does a
   write that is smaller than the window scale factor.

   In most stacks, it is at least partially obscured when the window
   size is larger than some small number of segments because the stacks
   prefer to announce windows that are an integral number of segments,
   rounded up to the next scale factor.  This plus silly window
   suppression tends to cause less frequent, larger window updates.  If
   the window were rounded down to a segment size, there would be more
   opportunity to advance the window (the BEYOND BUFFER case above)
   rather than retract it.
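
   The figures in the table above can be reproduced with the following
   Python sketch.  It assumes that the "computed SND.WIN" column is the
   right edge of the window as seen by the sender, that is, SEG.ACK
   plus SEG.WIN shifted left by the shift count, and that 1300 is the
   right edge of the receiver's buffer.  The function names are
   hypothetical.

```python
SHIFT = 7            # Snd.Wind.Shift / Rcv.Wind.Shift in the example
BUFFER_EDGE = 1300   # right edge of the receiver's buffer (assumed reading)


def right_edge(seg_ack, seg_wnd, shift=SHIFT):
    """Right window edge the sender computes from an incoming <ACK>."""
    return seg_ack + (seg_wnd << shift)


def classify(prev_edge, edge, buffer_edge=BUFFER_EDGE):
    """Label an advertisement relative to the previous right edge."""
    if edge > buffer_edge:
        return "BEYOND BUFFER"
    if edge < prev_edge:
        return "RETRACTED WINDOW"
    return "ok"


first = right_edge(1000, 2)     # initial <ACK>
second = right_edge(1040, 2)    # after the 40-byte write
choice_a = right_edge(1045, 2)  # keep SEG.WIN=2 after the 5-byte write
choice_b = right_edge(1045, 1)  # drop to SEG.WIN=1 instead

labels = [classify(second, choice_a), classify(second, choice_b)]
```

   Neither choice lands between the previous edge (1296) and the buffer
   edge (1300), because the advertised window can only move in steps of
   128 bytes.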

Appendix G.  RTO Calculation Modification

   Taking multiple RTT samples per window would shorten the history
   calculated by the RTO mechanism in [RFC6298].  The algorithm below
   aims to maintain a history similar to the one originally intended by
   [RFC6298].

   It is roughly known how many samples a congestion window worth of
   data will yield, not accounting for ACK compression and ACK losses.
   Such events will result in more history of the path being reflected
   in the final value for RTO and are not critical.  This modification
   will ensure that a similar amount of time is taken into account for
   the RTO estimation, regardless of how many samples are taken per
   window:

      ExpectedSamples = ceiling(FlightSize / (SMSS * 2))

      alpha' = alpha / ExpectedSamples

      beta' = beta / ExpectedSamples

   Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs".

   In the algorithm of [RFC6298], use alpha' and beta' in place of
   alpha and beta:

      RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'|

      SRTT <- (1 - alpha') * SRTT + alpha' * R'

      (for each sample R')
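
   A minimal Python sketch of this modification follows.  It uses the
   alpha = 1/8 and beta = 1/4 values from [RFC6298].  The function
   names are hypothetical, and clamping ExpectedSamples to at least 1
   (to avoid a division by zero when FlightSize is 0) is an assumption
   of the sketch rather than something stated above.

```python
import math

ALPHA = 1 / 8  # RFC 6298 gain for SRTT
BETA = 1 / 4   # RFC 6298 gain for RTTVAR


def expected_samples(flight_size, smss):
    """ExpectedSamples = ceiling(FlightSize / (SMSS * 2)).

    Clamped to >= 1 as a sketch assumption for an empty flight.
    """
    return max(1, math.ceil(flight_size / (smss * 2)))


def update_rtt_state(srtt, rttvar, sample, flight_size, smss):
    """Apply one RTT sample R' using the scaled gains alpha' and beta'."""
    n = expected_samples(flight_size, smss)
    alpha_p = ALPHA / n
    beta_p = BETA / n
    # RTTVAR is updated first, using the SRTT from before this sample.
    rttvar = (1 - beta_p) * rttvar + beta_p * abs(srtt - sample)
    srtt = (1 - alpha_p) * srtt + alpha_p * sample
    return srtt, rttvar


# One sample per window: identical to the unmodified RFC 6298 update.
single = update_rtt_state(100.0, 20.0, 140.0, flight_size=1460, smss=1460)
# Four expected samples per window: each sample carries a quarter of
# the weight.
scaled = update_rtt_state(100.0, 20.0, 140.0, flight_size=11680, smss=1460)
```

   With one expected sample, the result matches the unmodified update.
   With four, each individual sample moves SRTT and RTTVAR by a
   correspondingly smaller amount.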

Appendix H.  Changes from RFC 1323

   Several important updates and clarifications to the specification in
   RFC 1323 are made in this document.  The technical changes are
   summarized below:

   (a)  A wrong reference to SND.WND was corrected to SEG.WND in
        Section 2.3.

   (b)  Section 2.4 was added describing the unavoidable window
        retraction issue and explicitly describing the mitigation steps
        necessary.

   (c)  In Section 3.2, the description of how the Timestamps option
        negotiation is to be performed was updated with RFC 2119
        wording.  Further, a number of paragraphs were added to clarify
        the expected behavior of a compliant implementation using
        TSopt, as RFC 1323 left room for interpretation -- e.g.,
        potential late enablement of TSopt.

   (d)  The description of which TSecr values can be used to update the
        measured RTT has been clarified.  Specifically, with timestamps,
        the Karn algorithm [Karn87] is disabled.  The Karn algorithm
        disables all RTT measurements during retransmission, since it is
        ambiguous whether the <ACK> is for the original segment, or the
        retransmitted segment.  With timestamps, that ambiguity is
        removed since the TSecr in the <ACK> will contain the TSval from
        whichever data segment made it to the destination.

   (e)  RTTM update processing explicitly excludes segments not updating
        SND.UNA.  The original text could be interpreted to allow taking
        RTT samples when SACK acknowledges some new, non-continuous
        data.

   (f)  In RFC 1323, Section 3.4, step (2) of the algorithm to control
        which timestamp is echoed was incorrect in two regards:

        (1)  It failed to update TS.Recent for a retransmitted segment
             that resulted from a lost <ACK>.

        (2)  It failed if SEG.LEN = 0.

        In the new algorithm, the case of SEG.TSval >= TS.Recent is
        included for consistency with the PAWS test.

   (g)  It is now recommended that the Timestamps option be included in
        <RST> segments if the incoming segment contained a Timestamps
        option.

   (h)  <RST> segments are explicitly excluded from PAWS processing.

   (i)  Added text to clarify the precedence between regular TCP
        [RFC0793] and this document's Timestamps option / PAWS
        processing.  Discussions about combined acceptability checks are
        ongoing.

   (j)  Snd.TSoffset and Snd.TSclock variables have been added.
        Snd.TSclock is the sum of my.TSclock and Snd.TSoffset.  This
        allows the starting points for timestamp values to be randomized
        on a per-connection basis.  Setting Snd.TSoffset to zero yields
        the same results as [RFC1323].  Text was added to guide
        implementers to the proper selection of these offsets, as
        entirely random offsets for each new connection will conflict
        with PAWS.

   (k)  Appendix A has been expanded with information about the TCP
        Urgent Pointer.  An earlier revision contained text around the
        TCP MSS option, which was split off into [RFC6691].

   (l)  One correction was made to the Event Processing Summary in
        Appendix D.  In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
        fill in the SEG.WND value, not SND.WND.

   (m)  Appendix G was added to exemplify how an RTO calculation might
        be updated to properly take the much higher RTT sampling
        frequency enabled by the Timestamps option into account.

   Editorial changes to the document that do not impact the
   implementation or function of the mechanisms described in this
   document include:

   (a)  Removed much of the discussion in Section 1 to streamline the
        document.  However, detailed examples and discussions in
        Sections 2, 3, and 5 are kept as guidelines for implementers.

   (b)  Added short text noting that the use of WS increases the chances
        of sequence number wrap; thus, the PAWS mechanism is required in
        certain environments.

   (c)  Removed references to "new" options, as the options were
        introduced in [RFC1323] already.  Changed the text in
        Section 1.3 to specifically address TS and WS options.

   (d)  Section 1.4 was added for [RFC2119] wording.  Normative text was
        updated with the appropriate phrases.

   (e)  Added < > brackets to mark specific types of segments, and
        replaced most occurrences of "packet" with "segment", where TCP
        segments are referred to.

   (f)  Updated the text in Section 3 to take into account what has been
        learned since [RFC1323].

   (g)  Removed some unused references.

   (h)  Removed the list of changes between [RFC1323] and prior
        versions.  These changes are mentioned in Appendix C of
        [RFC1323].

   (i)  Moved "Changes from RFC 1323" to the end of the appendices for
        easier lookup.  In addition, the entries were split into a
        technical and an editorial part, and sorted to roughly
        correspond with the sections in the text where they apply.

Authors' Addresses

   David Borman
   Quantum Corporation
   Mendota Heights, MN  55120
   USA

   EMail: david.borman@quantum.com


   Bob Braden
   University of Southern California
   4676 Admiralty Way
   Marina del Rey, CA  90292
   USA

   EMail: braden@isi.edu


   Van Jacobson
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, CA  94043
   USA

   EMail: vanj@google.com


   Richard Scheffenegger (editor)
   NetApp, Inc.
   Am Euro Platz 2
   Vienna,  1120
   Austria

   EMail: rs@netapp.com