6. Conclusions and Acknowledgments
This memo presented a set of extensions to TCP to provide efficient operation over large bandwidth * delay product paths and reliable operation over very high-speed paths. These extensions are designed to provide compatible interworking with TCP stacks that do not implement the extensions.

These mechanisms are implemented using TCP options for scaled windows and timestamps. The timestamps are used for two distinct mechanisms: RTTM and PAWS.

The Window Scale option was originally suggested by Mike St. Johns of USAF/DCA. The present form of the option was suggested by Mike Karels of UC Berkeley in response to a more cumbersome scheme defined by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism description in [RFC1185].

Finally, much of this work originated as the result of discussions within the End-to-End Task Force on the theoretical limitations of transport protocols in general and TCP in particular. Task force members and others on the end2end-interest list have made valuable contributions by pointing out flaws in the algorithms and the documentation.

Continued discussion and development since the publication of [RFC1323] originally occurred in the IETF TCP Large Windows Working Group, later on in the End-to-End Task Force, and most recently in the IETF TCP Maintenance Working Group. The authors are grateful for all these contributions.

7. Security Considerations
The TCP sequence space is a fixed size, and as the window becomes larger, it becomes easier for an attacker to generate forged packets that can fall within the TCP window and be accepted as valid segments. While use of timestamps and PAWS can help to mitigate this, when using PAWS, if an attacker is able to forge a packet that is acceptable to the TCP connection, a timestamp that is in the future would cause valid segments to be dropped due to PAWS checks. Hence, implementers should take care to not open the TCP window drastically beyond the requirements of the connection.
See [RFC5961] for mitigation strategies to blind in-window attacks.

A naive implementation that derives the timestamp clock value directly from a system uptime clock may unintentionally leak the uptime to an attacker. This does not directly compromise any of the mechanisms described in this document. However, this may be valuable information to a potential attacker. It is therefore RECOMMENDED to generate a random, per-connection offset to be used with the clock source when generating the Timestamps option value (see Section 5.4). By carefully choosing this random offset, further improvements as described in [RFC6191] are possible.

Expanding the TCP window beyond 64 KiB for IPv6 allows Jumbograms [RFC2675] to be used when the local network supports packets larger than 64 KiB. When larger TCP segments are used, the TCP checksum becomes weaker.

Mechanisms to protect the TCP header from modification should also protect the TCP options.

Middleboxes and TCP options:

Some middleboxes have been known to remove the TCP options described in this document from TCP segments [Honda11]. Middleboxes that remove TCP options described in this document from the <SYN> segment interfere with the selection of parameters appropriate for the session. Removing any of these options in a <SYN,ACK> segment will leave the end hosts in a state that destroys the proper operation of the protocol.

* If a Window Scale option is removed from a <SYN,ACK> segment, the end hosts will not negotiate the window scaling factor correctly. Middleboxes must not remove the Window Scale option from <SYN,ACK> segments or modify it.

* If a stateful firewall uses the window field to detect whether a received segment is inside the current window, and does not support the Window Scale option, it will not be able to correctly determine whether or not a packet is in the window. These middleboxes must also support the Window Scale option and apply the scale factor when processing segments. If the window scale factor cannot be determined, the middlebox must not do window-based processing.

* If the Timestamps option is removed from the <SYN> or <SYN,ACK> segments, high-speed connections that need PAWS would not have that protection. Successful negotiation of the Timestamps option enforces a stricter verification of incoming segments at the receiver. If the Timestamps option was removed from a subsequent data segment after a successful negotiation (e.g., as part of resegmentation), the segment is discarded by the receiver without further processing. Middleboxes should not remove the Timestamps option.

* It must be noted that [RFC1323] doesn't address the case of the Timestamps option being dropped or selectively omitted after being negotiated, and that the update in this document may cause some broken middlebox behavior to be detected (potentially unresponsive TCP sessions).

Implementations that depend on PAWS could provide a mechanism for the application to determine whether or not PAWS is in use on the connection and choose to terminate the connection if that protection doesn't exist. This is not just to protect the connection against middleboxes that might remove the Timestamps option, but also against remote hosts that do not have Timestamp support.

7.1. Privacy Considerations
The TCP options described in this document do not expose individual users' data. However, a naive implementation simply using the system clock as a source for the Timestamps option will reveal characteristics of the TCP, potentially allowing more targeted attacks. It is therefore RECOMMENDED to generate a random, per-connection offset to be used with the clock source when generating the Timestamps option value (see Section 5.4). Furthermore, the combination, relative ordering, and padding of the TCP options described in Sections 2.2 and 3.2 will reveal additional clues to allow the fingerprinting of the system.

8. IANA Considerations
The described TCP options are well known from the superseded [RFC1323]. IANA has updated the "TCP Option Kind Numbers" table under "TCP Parameters" to list this document (RFC 7323) as the reference for "Window Scale" and "Timestamps".
9. References
9.1. Normative References
[RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.

[RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2. Informative References
[Allman99] Allman, M. and V. Paxson, "On Estimating End-to-End Network Path Properties", Proceedings of the ACM SIGCOMM Technical Symposium, Cambridge, MA, September 1999, <http://aciri.org/mallman/papers/estimation-la.pdf>.

[Floyd05] Floyd, S., "Subject: Re: [tcpm] RFC 1323: Timestamps option", message to the TCPM mailing list, 26 January 2007, <http://www.ietf.org/mail-archive/web/tcpm/current/msg02508.html>.

[Garlick77] Garlick, L., Rom, R., and J. Postel, "Issues in Reliable Host-to-Host Protocols", Proceedings of the Second Berkeley Workshop on Distributed Data Management and Computer Networks, March 1977, <http://www.rfc-editor.org/ien/ien12.txt>.

[Honda11] Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A., Handley, M., and H. Tokuda, "Is it Still Possible to Extend TCP?", Proceedings of the ACM Internet Measurement Conference (IMC) '11, November 2011.

[Jacobson88a] Jacobson, V., "Congestion Avoidance and Control", SIGCOMM '88, Stanford, CA, August 1988, <http://ee.lbl.gov/papers/congavoid.pdf>.

[Jacobson90a] Jacobson, V., "4BSD Header Prediction", ACM Computer Communication Review, April 1990.

[Jacobson90c] Jacobson, V., "Subject: modified TCP congestion avoidance algorithm", message to the End2End-Interest mailing list, 30 April 1990, <ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail>.

[Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times in Reliable Transport Protocols", Proceedings of SIGCOMM '87, August 1987.

[Kuehlewind10] Kuehlewind, M. and B. Briscoe, "Chirping for Congestion Control - Implementation Feasibility", November 2010, <http://bobbriscoe.net/projects/netsvc_i-f/chirp_pfldnet10.pdf>.

[Kuzmanovic03] Kuzmanovic, A. and E. Knightly, "TCP-LP: Low-Priority Service via End-Point Congestion Control", 2003, <www.cs.northwestern.edu/~akuzma/doc/TCP-LP-ToN.pdf>.

[Ludwig00] Ludwig, R. and K. Sklower, "The Eifel Retransmission Timer", ACM SIGCOMM Computer Communication Review Volume 30 Issue 3, July 2000, <http://ccr.sigcomm.org/archive/2000/july00/LudwigFinal.pdf>.

[Martin03] Martin, D., "Subject: [Tsvwg] RFC 1323.bis", message to the TSVWG mailing list, 30 September 2003, <http://www.ietf.org/mail-archive/web/tsvwg/current/msg04435.html>.

[Medina04] Medina, A., Allman, M., and S. Floyd, "Measuring Interactions Between Transport Protocols and Middleboxes", Proceedings of the ACM SIGCOMM/USENIX Internet Measurement Conference, October 2004, <http://www.icir.net/tbit/tbit-Aug2004.pdf>.

[Medina05] Medina, A., Allman, M., and S. Floyd, "Measuring the Evolution of Transport Protocols in the Internet", ACM Computer Communication Review Volume 35, No. 2, April 2005, <http://icir.net/floyd/papers/TCPevolution-Mar2005.pdf>.

[RE-1323BIS] Oppermann, A., "Subject: Re: [tcpm] I-D Action: draft-ietf.tcpm-1323bis-13.txt", message to the TCPM mailing list, 01 June 2013, <http://www.ietf.org/mail-archive/web/tcpm/current/msg08001.html>.

[RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay paths", RFC 1072, October 1988.

[RFC1122] Braden, R., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, October 1989.

[RFC1185] Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for High-Speed Paths", RFC 1185, October 1990.

[RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions for High Performance", RFC 1323, May 1992.

[RFC1981] McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery for IP version 6", RFC 1981, August 1996.

[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP Selective Acknowledgment Options", RFC 2018, October 1996.

[RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", RFC 2675, August 1999.

[RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An Extension to the Selective Acknowledgement (SACK) Option for TCP", RFC 2883, July 2000.

[RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm for TCP", RFC 3522, April 2003.

[RFC4015] Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm for TCP", RFC 4015, February 2005.

[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU Discovery", RFC 4821, March 2007.

[RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly Errors at High Data Rates", RFC 4963, July 2007.

[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009.

[RFC5961] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's Robustness to Blind In-Window Attacks", RFC 5961, August 2010.

[RFC6191] Gont, F., "Reducing the TIME-WAIT State Using TCP Timestamps", BCP 159, RFC 6191, April 2011.

[RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, "Computing TCP's Retransmission Timer", RFC 6298, June 2011.

[RFC6528] Gont, F. and S. Bellovin, "Defending against Sequence Number Attacks", RFC 6528, February 2012.

[RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., and Y. Nishida, "A Conservative Loss Recovery Algorithm Based on Selective Acknowledgment (SACK) for TCP", RFC 6675, August 2012.

[RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)", RFC 6691, July 2012.

[RFC6817] Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind, "Low Extra Delay Background Transport (LEDBAT)", RFC 6817, December 2012.
Appendix A. Implementation Suggestions
TCP Option Layout

The following layout is recommended for sending options on non-<SYN> segments to achieve maximum feasible alignment for 32-bit and 64-bit machines.

      +--------+--------+--------+--------+
      |   NOP  |  NOP   |  TSopt |   10   |
      +--------+--------+--------+--------+
      |          TSval timestamp          |
      +--------+--------+--------+--------+
      |          TSecr timestamp          |
      +--------+--------+--------+--------+

Interaction with the TCP Urgent Pointer

The TCP Urgent Pointer, like the TCP window, is a 16-bit value. Some of the original discussion for the TCP Window Scale option included proposals to increase the Urgent Pointer to 32 bits. As it turns out, this is unnecessary. There are two observations that should be made:

(1) With IP version 4, the largest amount of TCP data that can be sent in a single packet is 65495 bytes (64 KiB - 1 - size of fixed IP and TCP headers).

(2) Updates to the Urgent Pointer while the user is in "urgent mode" are invisible to the user.

This means that if the Urgent Pointer points beyond the end of the TCP data in the current segment, then the user will remain in urgent mode until the next TCP segment arrives. That segment will update the Urgent Pointer to a new offset, and the user will never have left urgent mode.

Thus, to properly implement the Urgent Pointer, the sending TCP only has to check for overflow of the 16-bit Urgent Pointer field before filling it in. If it does overflow, then a value of 65535 should be inserted into the Urgent Pointer.

The same technique applies to IP version 6, except in the case of IPv6 Jumbograms. When IPv6 Jumbograms are supported, [RFC2675] requires additional steps for dealing with the Urgent Pointer; these steps are described in Section 5.2 of [RFC2675].
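The two suggestions above can be illustrated with a short sketch. It is not taken from any particular TCP stack: the function names are hypothetical, and the only protocol constants assumed are the well-known option kinds (NOP = 1, Timestamps = 8) and the Timestamps option length of 10 shown in the layout.

```python
import struct

# Well-known TCP option constants; the function names below are
# illustrative only and do not come from this document.
TCPOPT_NOP = 1
TCPOPT_TIMESTAMP = 8
TCPOLEN_TIMESTAMP = 10


def build_timestamp_option(tsval: int, tsecr: int) -> bytes:
    """Encode NOP, NOP, TSopt, 10, TSval, TSecr as 12 bytes.

    The two leading NOPs pad the 10-byte option so that TSval and
    TSecr each start on a 32-bit boundary.
    """
    return struct.pack(
        "!BBBBII",
        TCPOPT_NOP,
        TCPOPT_NOP,
        TCPOPT_TIMESTAMP,
        TCPOLEN_TIMESTAMP,
        tsval & 0xFFFFFFFF,
        tsecr & 0xFFFFFFFF,
    )


def urgent_pointer_field(urgent_offset: int) -> int:
    """Return the value for the 16-bit Urgent Pointer field.

    If the true offset overflows 16 bits, 65535 is inserted, as
    suggested in the text above.
    """
    return min(urgent_offset, 0xFFFF)
```

With this layout, the option block occupies exactly three 32-bit words, which is why the padding is placed in front of the option rather than behind it.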
Appendix B. Duplicates from Earlier Connection Incarnations
There are two cases to be considered: (1) a system crashing (and losing connection state) and restarting, and (2) the same connection being closed and reopened without a loss of host state. These will be described in the following two sections.

B.1. System Crash with Loss of State

TCP's quiet time of one MSL upon system startup handles the loss of connection state in a system crash/restart. For an explanation, see, for example, "Knowing When to Keep Quiet" in the TCP protocol specification [RFC0793]. The MSL that is required here does not depend upon the transfer speed. The current TCP MSL of 2 minutes seemed acceptable as an operational compromise, when many host systems used to take this long to boot after a crash. Current host systems can boot considerably faster.

The Timestamps option may be used to ease the MSL requirements (or to provide additional security against data corruption). If timestamps are being used and if the timestamp clock can be guaranteed to be monotonic over a system crash/restart, i.e., if the first value of the sender's timestamp clock after a crash/restart can be guaranteed to be greater than the last value before the restart, then a quiet time is unnecessary.

To dispense totally with the quiet time would require that the host clock be synchronized to a time source that is stable over the crash/restart period, with an accuracy of one timestamp clock tick or better. We can back off from this strict requirement to take advantage of approximate clock synchronization. Suppose that the clock is always resynchronized to within N timestamp clock ticks and that booting (extended with a quiet time, if necessary) takes more than N ticks. This will guarantee monotonicity of the timestamps, which can then be used to reject old duplicates even without an enforced MSL.

B.2. Closing and Reopening a Connection
When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]). Applications built upon TCP that close one connection and open a new one (e.g., an FTP data transfer connection using Stream mode) must choose a new socket pair each time. The TIME-WAIT delay serves two different purposes:
(a) Implement the full-duplex reliable close handshake of TCP.

The proper time to delay the final close step is not really related to the MSL; it depends instead upon the RTO for the FIN segments and, therefore, upon the RTT of the path. (It could be argued that the side that is sending a FIN knows what degree of reliability it needs, and therefore it should be able to determine the length of the TIME-WAIT delay for the FIN's recipient. This could be accomplished with an appropriate TCP option in FIN segments.)

Although there is no formal upper bound on RTT, common network engineering practice makes an RTT greater than 1 minute very unlikely. Thus, the 4-minute delay in TIME-WAIT state works satisfactorily to provide a reliable full-duplex TCP close. Note again that this is independent of MSL enforcement and network speed.

The TIME-WAIT state could cause an indirect performance problem if an application needed to repeatedly close one connection and open another at a very high frequency, since the number of available TCP ports on a host is less than 2^16. However, high network speeds are not the major contributor to this problem; the RTT is the limiting factor in how quickly connections can be opened and closed. Therefore, this problem will be no worse at high transfer speeds.

(b) Allow old duplicate segments to expire.

To replace this function of TIME-WAIT state, a mechanism would have to operate across connections. PAWS is defined strictly within a single connection; the last timestamp (TS.Recent) is kept in the connection control block and discarded when a connection is closed.

An additional mechanism could be added to the TCP, a per-host cache of the last timestamp received from any connection. This value could then be used in the PAWS mechanism to reject old duplicate segments from earlier incarnations of the connection, if the timestamp clock can be guaranteed to have ticked at least once since the old connection was open. This would require that the TIME-WAIT delay plus the RTT together must be at least one tick of the sender's timestamp clock. Such an extension is not part of the proposal of this RFC.

Note that this is a variant on the mechanism proposed by Garlick, Rom, and Postel [Garlick77], which required each host to maintain connection records containing the highest sequence numbers on every connection. Using timestamps instead, it is only necessary to keep one quantity per remote host, regardless of the number of simultaneous connections to that host.

Appendix C. Summary of Notation
The following notation has been used in this document.

Options

   WSopt:            TCP Window Scale option
   TSopt:            TCP Timestamps option

Option Fields

   shift.cnt:        Window scale byte in WSopt
   TSval:            32-bit Timestamp Value field in TSopt
   TSecr:            32-bit Timestamp Reply field in TSopt

Option Fields in Current Segment

   SEG.TSval:        TSval field from TSopt in current segment
   SEG.TSecr:        TSecr field from TSopt in current segment
   SEG.WSopt:        8-bit value in WSopt

Clock Values

   my.TSclock:       System-wide source of 32-bit timestamp values
   my.TSclock.rate:  Period of my.TSclock (1 ms to 1 sec)
   Snd.TSoffset:     An offset for randomizing Snd.TSclock
   Snd.TSclock:      my.TSclock + Snd.TSoffset

Per-Connection State Variables

   TS.Recent:        Latest received Timestamp
   Last.ACK.sent:    Last ACK field sent
   Snd.TS.OK:        1-bit flag
   Snd.WS.OK:        1-bit flag
   Rcv.Wind.Shift:   Receive window scale exponent
   Snd.Wind.Shift:   Send window scale exponent
   Start.Time:       Snd.TSclock value when the segment being timed
                     was sent (used by code from before RFC 1323).

Procedure

   Update_SRTT(m):   Procedure to update the smoothed RTT and RTT
                     variance estimates, using the rules of
                     [Jacobson88a], given m, a new RTT measurement

Send Sequence Variables

   SND.UNA:          Send unacknowledged
   SND.NXT:          Send next
   SND.WND:          Send window
   ISS:              Initial send sequence number

Receive Sequence Variables

   RCV.NXT:          Receive next
   RCV.WND:          Receive window
   IRS:              Initial receive sequence number

Appendix D. Event Processing Summary
This appendix attempts to specify the algorithms unambiguously by presenting modifications to the Event Processing rules in Section 3.9 of RFC 793. The change bars ("|") indicate lines that are different from RFC 793.

   OPEN Call

      ...

      An initial send sequence number (ISS) is selected.  Send a <SYN>
|     segment of the form:
|
|        <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Shift>

      ...

   SEND Call

      CLOSED STATE (i.e., TCB does not exist)

         ...

      LISTEN STATE

         If active and the foreign socket is specified, then change the
         connection from passive to active, select an ISS.  Send a SYN
|        segment containing the options: <TSval=Snd.TSclock> and
|        <WSopt=Rcv.Wind.Shift>.  Set SND.UNA to ISS, SND.NXT to ISS+1.
         Enter SYN-SENT state. ...

      SYN-SENT STATE
      SYN-RECEIVED STATE

         ...

      ESTABLISHED STATE
      CLOSE-WAIT STATE

         Segmentize the buffer and send it with a piggybacked
         acknowledgment (acknowledgment value = RCV.NXT). ...

         If the urgent flag is set ...

|        If the Snd.TS.OK flag is set, then include the TCP Timestamps
|        option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data
|        segment.
|
|        Scale the receive window for transmission in the segment
|        header:
|
|           SEG.WND = (RCV.WND >> Rcv.Wind.Shift).

   SEGMENT ARRIVES

      ...

      If the state is LISTEN then

         first check for an RST

            ...

         second check for an ACK

            ...

         third check for a SYN

            If the SYN bit is set, check the security.  If the ...

               ...

            If the SEG.PRC is less than the TCB.PRC then continue.

|           Check for a Window Scale option (WSopt); if one is found,
|           save SEG.WSopt in Snd.Wind.Shift and set Snd.WS.OK flag on.
|           Otherwise, set both Snd.Wind.Shift and Rcv.Wind.Shift to
|           zero and clear Snd.WS.OK flag.
|
|           Check for a TSopt option; if one is found, save SEG.TSval in
|           the variable TS.Recent and turn on the Snd.TS.OK bit.

            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any other
            control or text should be queued for processing later.  ISS
            should be selected and a SYN segment sent of the form:

               <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

|           If the Snd.WS.OK bit is on, include a WSopt
|           <WSopt=Rcv.Wind.Shift> in this segment.  If the Snd.TS.OK
|           bit is on, include a TSopt <TSval=Snd.TSclock,
|           TSecr=TS.Recent> in this segment.  Last.ACK.sent is set to
|           RCV.NXT.

            SND.NXT is set to ISS+1 and SND.UNA to ISS.  The connection
            state should be changed to SYN-RECEIVED.  Note that any other
            incoming control or data (combined with SYN) will be
            processed in the SYN-RECEIVED state, but processing of SYN
            and ACK should not be repeated.  If the listen was not fully
            specified (i.e., the foreign socket was not fully specified),
            then the unspecified fields should be filled in now.

         fourth other text or control

            ...

      If the state is SYN-SENT then

         first check the ACK bit

            ...

         ...

         fourth check the SYN bit

            ...

            If the SYN bit is on and the security/compartment and
            precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
            IRS is set to SEG.SEQ.  SND.UNA should be advanced to equal
            SEG.ACK (if there is an ACK), and any segments on the
            retransmission queue which are thereby acknowledged should be
            removed.

|           Check for a Window Scale option (WSopt); if it is found,
|           save SEG.WSopt in Snd.Wind.Shift; otherwise, set both
|           Snd.Wind.Shift and Rcv.Wind.Shift to zero.
|
|           Check for a TSopt option; if one is found, save SEG.TSval in
|           variable TS.Recent and turn on the Snd.TS.OK bit in the
|           connection control block.  If the ACK bit is set, use
|           Snd.TSclock - SEG.TSecr as the initial RTT estimate.

            If SND.UNA > ISS (our SYN has been ACKed), change the
            connection state to ESTABLISHED, form an <ACK> segment:

               <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

|           and send it.  If the Snd.TS.OK bit is on, include a TSopt
|           option <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK>
|           segment.  Last.ACK.sent is set to RCV.NXT.

            Data or controls that were queued for transmission may be
            included.  If there are other controls or text in the
            segment, then continue processing at the sixth step below
            where the URG bit is checked; otherwise, return.

            Otherwise, enter SYN-RECEIVED, form a <SYN,ACK> segment:

               <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

|           and send it.  If the Snd.TS.OK bit is on, include a TSopt
|           option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
|           If the Snd.WS.OK bit is on, include a WSopt option
|           <WSopt=Rcv.Wind.Shift> in this segment.  Last.ACK.sent is
|           set to RCV.NXT.

            If there are other controls or text in the segment, queue
            them for processing after the ESTABLISHED state has been
            reached, return.

         fifth, if neither of the SYN or RST bits is set then drop the
         segment and return.

      Otherwise

         first check the sequence number

            SYN-RECEIVED STATE
            ESTABLISHED STATE
            FIN-WAIT-1 STATE
            FIN-WAIT-2 STATE
            CLOSE-WAIT STATE
            CLOSING STATE
            LAST-ACK STATE
            TIME-WAIT STATE

               Segments are processed in sequence.  Initial tests on
               arrival are used to discard old duplicates, but further
               processing is done in SEG.SEQ order.  If a segment's
               contents straddle the boundary between old and new, only
               the new parts should be processed.

|              Rescale the received window field:
|
|                 TrueWindow = SEG.WND << Snd.Wind.Shift,
|
|              and use "TrueWindow" in place of SEG.WND in the following
|              steps.
|
|              Check whether the segment contains a Timestamps option and
|              if bit Snd.TS.OK is on.  If so:
|
|                 If SEG.TSval < TS.Recent and the RST bit is off:
|
|                    If the connection has been idle more than 24 days,
|                    save SEG.TSval in variable TS.Recent, else the segment
|                    is not acceptable; follow the steps below for an
|                    unacceptable segment.
|
|                 If SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent,
|                 then save SEG.TSval in variable TS.Recent.

               There are four cases for the acceptability test for an
               incoming segment:

                  ...

               If an incoming segment is not acceptable, an
               acknowledgment should be sent in reply (unless the RST bit
               is set; if so drop the segment and return):

                  <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

|              Last.ACK.sent is set to SEG.ACK of the acknowledgment.  If
|              the Snd.TS.OK bit is on, include the Timestamps option
|              <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
               Set Last.ACK.sent to SEG.ACK and send the <ACK> segment.
               After sending the acknowledgment, drop the unacceptable
               segment and return.

         ...

         fifth check the ACK field,

            if the ACK bit is off drop the segment and return

            if the ACK bit is on

               ...

               ESTABLISHED STATE

                  If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
|                 SEG.ACK.  Also compute a new estimate of round-trip time.
|                 If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr;
|                 otherwise, use the elapsed time since the first segment
|                 in the retransmission queue was sent.  Any segments on
                  the retransmission queue that are thereby entirely
                  acknowledged...

         ...

         seventh, process the segment text,

            ESTABLISHED STATE
            FIN-WAIT-1 STATE
            FIN-WAIT-2 STATE

               ...

               Send an acknowledgment of the form:

                  <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

|              If the Snd.TS.OK bit is on, include the Timestamps option
|              <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
|              Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
|              it.  This acknowledgment should be piggybacked on a segment
               being transmitted if possible without incurring undue
               delay.

               ...
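As an illustration of the window rescaling and timestamp checks in the SEGMENT ARRIVES rules above, the following is a minimal sketch, not a complete implementation. The class and function names are hypothetical. Two points are assumptions that go beyond the text of this appendix: timestamps and sequence numbers are compared in 32-bit modular arithmetic, and the idle time is tracked in seconds in a field named idle_seconds. The check function is meant to be called only for segments that carry a Timestamps option.

```python
from dataclasses import dataclass

MOD32 = 1 << 32
HALF32 = 1 << 31
PAWS_IDLE_LIMIT = 24 * 24 * 60 * 60  # 24 days, in seconds


def ts_before(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is older than b (modular compare)."""
    return 0 < ((b - a) % MOD32) < HALF32


def seq_leq(a: int, b: int) -> bool:
    """True if sequence number a <= b in modular arithmetic."""
    return ((b - a) % MOD32) < HALF32


@dataclass
class Conn:
    """Hypothetical subset of the per-connection state variables."""
    ts_recent: int = 0          # TS.Recent
    last_ack_sent: int = 0      # Last.ACK.sent
    snd_ts_ok: bool = False     # Snd.TS.OK
    snd_wind_shift: int = 0     # Snd.Wind.Shift
    rcv_wind_shift: int = 0     # Rcv.Wind.Shift
    idle_seconds: float = 0.0   # assumption: idle time kept by the stack


def true_window(conn: Conn, seg_wnd: int) -> int:
    """TrueWindow = SEG.WND << Snd.Wind.Shift."""
    return seg_wnd << conn.snd_wind_shift


def advertised_window(conn: Conn, rcv_wnd: int) -> int:
    """SEG.WND = RCV.WND >> Rcv.Wind.Shift."""
    return rcv_wnd >> conn.rcv_wind_shift


def timestamp_check(conn: Conn, seg_tsval: int, seg_seq: int,
                    rst: bool = False) -> bool:
    """Apply the timestamp test to a segment carrying a TSopt.

    Returns False if the segment is not acceptable; may update
    TS.Recent as a side effect.
    """
    if not conn.snd_ts_ok:
        return True  # timestamps were not negotiated; nothing to test
    if ts_before(seg_tsval, conn.ts_recent) and not rst:
        if conn.idle_seconds > PAWS_IDLE_LIMIT:
            conn.ts_recent = seg_tsval  # stale TS.Recent is replaced
            return True
        return False  # old duplicate; caller sends an <ACK> and drops it
    if (not ts_before(seg_tsval, conn.ts_recent)
            and seq_leq(seg_seq, conn.last_ack_sent)):
        conn.ts_recent = seg_tsval
    return True
```

A caller would run timestamp_check before the regular sequence-number acceptability test and treat a False result like any other unacceptable segment.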
Appendix E. Timestamps Edge Cases
While the rules laid out for when to calculate RTTM produce the correct results most of the time, there are some edge cases where an incorrect RTTM can be calculated. All of these situations involve the loss of segments. It is felt that these scenarios are rare, and that if they should happen, they will cause a single RTTM measurement to be inflated, which mitigates its effects on RTO calculations.

[Martin03] cites two similar cases when the returning <ACK> is lost, and before the retransmission timer fires, another returning <ACK> segment arrives, which acknowledges the data. In this case, the RTTM calculated will be inflated:

     clock
       tc=1   <A, TSval=1> ------------------->

       tc=2   (lost) <---- <ACK(A), TSecr=1, win=n>
              (RTTM would have been 1)

              (receive window opens, window update is sent)
       tc=5          <---- <ACK(A), TSecr=1, win=m>
              (RTTM is calculated at 4)

One thing to note about this situation is that it is somewhat bounded by RTO + RTT, limiting how far off the RTTM calculation will be. While more complex scenarios can be constructed that produce larger inflations (e.g., retransmissions are lost), those scenarios involve multiple segment losses, and the connection will have other more serious operational problems than using an inflated RTTM in the RTO calculation.

Appendix F. Window Retraction Example
Consider an established TCP connection using a scale factor of 128, Snd.Wind.Shift=7 and Rcv.Wind.Shift=7, that is running with a very small window because the receiver is bottlenecked and both ends are doing small reads and writes.

Consider the ACKs coming back:

   SEG.ACK  SEG.WIN  computed SND.WIN  receiver's actual window
   1000     2        1256              1300

The sender writes 40 bytes and receiver ACKs:

   1040     2        1296              1300

The sender writes 5 additional bytes and the receiver has a problem. Two choices:

   1045     2        1301              1300 - BEYOND BUFFER

   1045     1        1173              1300 - RETRACTED WINDOW

This is a general problem and can happen any time the sender does a write, which is smaller than the window scale factor.

In most stacks, it is at least partially obscured when the window size is larger than some small number of segments because the stacks prefer to announce windows that are an integral number of segments, rounded up to the next scale factor. This plus silly window suppression tends to cause less frequent, larger window updates. If the window was rounded down to a segment size, there is more opportunity to advance the window, the BEYOND BUFFER case above, rather than retracting it.

Appendix G. RTO Calculation Modification
Taking multiple RTT samples per window would shorten the history calculated by the RTO mechanism in [RFC6298], and the algorithm below aims to maintain a similar history as originally intended by [RFC6298].

It is roughly known how many samples a congestion window worth of data will yield, not accounting for ACK compression and ACK losses. Such events will result in more history of the path being reflected in the final value for RTO, and are not critical. This modification will ensure that a similar amount of time is taken into account for the RTO estimation, regardless of how many samples are taken per window:

   ExpectedSamples = ceiling(FlightSize / (SMSS * 2))

   alpha' = alpha / ExpectedSamples

   beta' = beta / ExpectedSamples

Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs".
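The gain scaling can be sketched as follows, together with the way the scaled gains would feed into the smoothing step of [RFC6298]. This is a minimal illustration under stated assumptions: the function names are hypothetical, alpha = 1/8 and beta = 1/4 are the values given in [RFC6298], and clamping ExpectedSamples to at least 1 (to avoid a division by zero when FlightSize is 0) is a safeguard added here, not something this appendix specifies.

```python
ALPHA = 1 / 8  # smoothing gain for SRTT, as in [RFC6298]
BETA = 1 / 4   # smoothing gain for RTTVAR, as in [RFC6298]


def expected_samples(flight_size: int, smss: int) -> int:
    """ceiling(FlightSize / (SMSS * 2)), using integer arithmetic.

    The lower bound of 1 is an assumption of this sketch.
    """
    return max(1, -(-flight_size // (smss * 2)))


def update_rtt(srtt: float, rttvar: float, r: float,
               flight_size: int, smss: int) -> tuple[float, float]:
    """Apply one RTT sample r with gains scaled by ExpectedSamples.

    RTTVAR is updated before SRTT, so that the old SRTT is used in
    the deviation term, matching the order used in [RFC6298].
    """
    n = expected_samples(flight_size, smss)
    alpha_p = ALPHA / n
    beta_p = BETA / n
    rttvar = (1 - beta_p) * rttvar + beta_p * abs(srtt - r)
    srtt = (1 - alpha_p) * srtt + alpha_p * r
    return srtt, rttvar
```

With ten full-sized segments in flight, for example, five samples are expected per window, so each sample moves the estimates by one fifth of the usual amount.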
In the algorithm of [RFC6298], use alpha' and beta' in place of alpha and beta:

   RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'|

   SRTT <- (1 - alpha') * SRTT + alpha' * R'

   (for each sample R')

Appendix H. Changes from RFC 1323
Several important updates and clarifications to the specification in RFC 1323 are made in this document. The technical changes are summarized below:

(a) A wrong reference to SND.WND was corrected to SEG.WND in Section 2.3.

(b) Section 2.4 was added describing the unavoidable window retraction issue and explicitly describing the mitigation steps necessary.

(c) In Section 3.2, the wording on how the Timestamps option negotiation is to be performed was updated with RFC2119 wording. Further, a number of paragraphs were added to clarify the expected behavior with a compliant implementation using TSopt, as RFC 1323 left room for interpretation -- e.g., potential late enablement of TSopt.

(d) The description of which TSecr values can be used to update the measured RTT has been clarified. Specifically, with timestamps, the Karn algorithm [Karn87] is disabled. The Karn algorithm disables all RTT measurements during retransmission, since it is ambiguous whether the <ACK> is for the original segment, or the retransmitted segment. With timestamps, that ambiguity is removed since the TSecr in the <ACK> will contain the TSval from whichever data segment made it to the destination.

(e) RTTM update processing explicitly excludes segments not updating SND.UNA. The original text could be interpreted to allow taking RTT samples when SACK acknowledges some new, non-continuous data.

(f) In RFC 1323, Section 3.4, step (2) of the algorithm to control which timestamp is echoed was incorrect in two regards:

    (1) It failed to update TS.Recent for a retransmitted segment that resulted from a lost <ACK>.

    (2) It failed if SEG.LEN = 0.

    In the new algorithm, the case of SEG.TSval >= TS.Recent is included for consistency with the PAWS test.

(g) It is now recommended that the Timestamps option is included in <RST> segments if the incoming segment contained a Timestamps option.

(h) <RST> segments are explicitly excluded from PAWS processing.

(i) Added text to clarify the precedence between regular TCP [RFC0793] and this document's Timestamps option / PAWS processing. Discussion about combined acceptability checks is ongoing.

(j) Snd.TSoffset and Snd.TSclock variables have been added. Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This allows the starting points for timestamp values to be randomized on a per-connection basis. Setting Snd.TSoffset to zero yields the same results as [RFC1323]. Text was added to guide implementers to the proper selection of these offsets, as entirely random offsets for each new connection will conflict with PAWS.

(k) Appendix A has been expanded with information about the TCP Urgent Pointer. An earlier revision contained text around the TCP MSS option, which was split off into [RFC6691].

(l) One correction was made to the Event Processing Summary in Appendix D. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to fill in the SEG.WND value, not SND.WND.

(m) Appendix G was added to exemplify how an RTO calculation might be updated to properly take the much higher RTT sampling frequency enabled by the Timestamps option into account.
Editorial changes to the document that do not affect the implementation or function of the mechanisms described in this document include:

(a) Removed much of the discussion in Section 1 to streamline the document. However, detailed examples and discussions in Sections 2, 3, and 5 are kept as guidelines for implementers.

(b) Added short text noting that the use of WS increases the chances of sequence number wrap; thus, the PAWS mechanism is required in certain environments.

(c) Removed references to "new" options, as the options were introduced in [RFC1323] already. Changed the text in Section 1.3 to specifically address TS and WS options.

(d) Section 1.4 was added for [RFC2119] wording. Normative text was updated with the appropriate phrases.

(e) Added < > brackets to mark specific types of segments, and replaced most occurrences of "packet" with "segment", where TCP segments are referred to.

(f) Updated the text in Section 3 to take into account what has been learned since [RFC1323].

(g) Removed some unused references.

(h) Removed the list of changes between [RFC1323] and prior versions. These changes are mentioned in Appendix C of [RFC1323].

(i) Moved "Changes from RFC 1323" to the end of the appendices for easier lookup. In addition, the entries were split into a technical and an editorial part, and sorted to roughly correspond with the sections in the text where they apply.
Authors' Addresses
David Borman
Quantum Corporation
Mendota Heights, MN 55120
USA

EMail: david.borman@quantum.com

Bob Braden
University of Southern California
4676 Admiralty Way
Marina del Rey, CA 90292
USA

EMail: braden@isi.edu

Van Jacobson
Google, Inc.
1600 Amphitheatre Parkway
Mountain View, CA 94043
USA

EMail: vanj@google.com

Richard Scheffenegger (editor)
NetApp, Inc.
Am Euro Platz 2
Vienna, 1120
Austria

EMail: rs@netapp.com