Network Working Group                                   K. Ramakrishnan
Request for Comments: 3168                           TeraOptic Networks
Updates: 2474, 2401, 793                                       S. Floyd
Obsoletes: 2481                                                   ACIRI
Category: Standards Track                                      D. Black
                                                                    EMC
                                                         September 2001

      The Addition of Explicit Congestion Notification (ECN) to IP

Status of this Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract
This memo specifies the incorporation of ECN (Explicit Congestion Notification) to TCP and IP, including ECN's use of two bits in the IP header.

Table of Contents
1. Introduction
2. Conventions and Acronyms
3. Assumptions and General Principles
4. Active Queue Management (AQM)
5. Explicit Congestion Notification in IP
5.1. ECN as an Indication of Persistent Congestion
5.2. Dropped or Corrupted Packets
5.3. Fragmentation
6. Support from the Transport Protocol
6.1. TCP
6.1.1. TCP Initialization
6.1.1.1. Middlebox Issues
6.1.1.2. Robust TCP Initialization with an Echoed Reserved Field
6.1.2. The TCP Sender
6.1.3. The TCP Receiver
6.1.4. Congestion on the ACK-path
6.1.5. Retransmitted TCP packets
6.1.6. TCP Window Probes
7. Non-compliance by the End Nodes
8. Non-compliance in the Network
8.1. Complications Introduced by Split Paths
9. Encapsulated Packets
9.1. IP packets encapsulated in IP
9.1.1. The Limited-functionality and Full-functionality Options
9.1.2. Changes to the ECN Field within an IP Tunnel
9.2. IPsec Tunnels
9.2.1. Negotiation between Tunnel Endpoints
9.2.1.1. ECN Tunnel Security Association Database Field
9.2.1.2. ECN Tunnel Security Association Attribute
9.2.1.3. Changes to IPsec Tunnel Header Processing
9.2.2. Changes to the ECN Field within an IPsec Tunnel
9.2.3. Comments for IPsec Support
9.3. IP packets encapsulated in non-IP Packet Headers
10. Issues Raised by Monitoring and Policing Devices
11. Evaluations of ECN
11.1. Related Work Evaluating ECN
11.2. A Discussion of the ECN nonce
11.2.1. The Incremental Deployment of ECT(1) in Routers
12. Summary of changes required in IP and TCP
13. Conclusions
14. Acknowledgements
15. References
16. Security Considerations
17. IPv4 Header Checksum Recalculation
18. Possible Changes to the ECN Field in the Network
18.1. Possible Changes to the IP Header
18.1.1. Erasing the Congestion Indication
18.1.2. Falsely Reporting Congestion
18.1.3. Disabling ECN-Capability
18.1.4. Falsely Indicating ECN-Capability
18.2. Information carried in the Transport Header
18.3. Split Paths
19. Implications of Subverting End-to-End Congestion Control
19.1. Implications for the Network and for Competing Flows
19.2. Implications for the Subverted Flow
19.3. Non-ECN-Based Methods of Subverting End-to-end Congestion Control
20. The Motivation for the ECT Codepoints
20.1. The Motivation for an ECT Codepoint
20.2. The Motivation for two ECT Codepoints
21. Why use Two Bits in the IP Header?
22. Historical Definitions for the IPv4 TOS Octet
23. IANA Considerations
23.1. IPv4 TOS Byte and IPv6 Traffic Class Octet
23.2. TCP Header Flags
23.3. IPSEC Security Association Attributes
24. Authors' Addresses
25. Full Copyright Statement

1. Introduction
We begin by describing TCP's use of packet drops as an indication of congestion. Next we explain that with the addition of active queue management (e.g., RED) to the Internet infrastructure, where routers detect congestion before the queue overflows, routers are no longer limited to packet drops as an indication of congestion. Routers can instead set the Congestion Experienced (CE) codepoint in the IP header of packets from ECN-capable transports. We describe when the CE codepoint is to be set in routers, and describe modifications needed to TCP to make it ECN-capable. Modifications to other transport protocols (e.g., unreliable unicast or multicast, reliable multicast, other reliable unicast transport protocols) could be considered as those protocols are developed and advance through the standards process.

We also describe in this document the issues involving the use of ECN within IP tunnels, and within IPsec tunnels in particular.

One of the guiding principles for this document is that, to the extent possible, the mechanisms specified here be incrementally deployable. One challenge to the principle of incremental deployment has been the prior existence of some IP tunnels that were not compatible with the use of ECN. As ECN becomes deployed, non-compatible IP tunnels will have to be upgraded to conform to this document.

This document obsoletes RFC 2481, "A Proposal to add Explicit Congestion Notification (ECN) to IP", which defined ECN as an Experimental Protocol for the Internet Community. This document also updates RFC 2474, "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers", in defining the ECN field in the IP header, RFC 2401, "Security Architecture for the Internet Protocol" to change the handling of IPv4 TOS Byte and IPv6 Traffic Class Octet in tunnel mode header construction to be compatible with the use of ECN, and RFC 793, "Transmission Control Protocol", in defining two new flags in the TCP header.

TCP's congestion control and avoidance algorithms are based on the notion that the network is a black-box [Jacobson88, Jacobson90]. The network's state of congestion or otherwise is determined by end-systems probing for the network state, by gradually increasing the load on the network (by increasing the window of packets that are outstanding in the network) until the network becomes congested and a packet is lost. Treating the network as a "black-box" and treating
loss as an indication of congestion in the network is appropriate for pure best-effort data carried by TCP, with little or no sensitivity to delay or loss of individual packets. In addition, TCP's congestion management algorithms have techniques built-in (such as Fast Retransmit and Fast Recovery) to minimize the impact of losses, from a throughput perspective. However, these mechanisms are not intended to help applications that are in fact sensitive to the delay or loss of one or more individual packets. Interactive traffic such as telnet, web-browsing, and transfer of audio and video data can be sensitive to packet losses (especially when using an unreliable data delivery transport such as UDP) or to the increased latency of the packet caused by the need to retransmit the packet after a loss (with the reliable data delivery semantics provided by TCP). Since TCP determines the appropriate congestion window to use by gradually increasing the window size until it experiences a dropped packet, this causes the queues at the bottleneck router to build up. With most packet drop policies at the router that are not sensitive to the load placed by each individual flow (e.g., tail-drop on queue overflow), this means that some of the packets of latency-sensitive flows may be dropped. In addition, such drop policies lead to synchronization of loss across multiple flows. Active queue management mechanisms detect congestion before the queue overflows, and provide an indication of this congestion to the end nodes. Thus, active queue management can reduce unnecessary queuing delay for all traffic sharing that queue. The advantages of active queue management are discussed in RFC 2309 [RFC2309]. Active queue management avoids some of the bad properties of dropping on queue overflow, including the undesirable synchronization of loss across multiple flows. 
More importantly, active queue management means that transport protocols with mechanisms for congestion control (e.g., TCP) do not have to rely on buffer overflow as the only indication of congestion. Active queue management mechanisms may use one of several methods for indicating congestion to end-nodes. One is to use packet drops, as is currently done. However, active queue management allows the router to separate policies of queuing or dropping packets from the policies for indicating congestion. Thus, active queue management allows routers to use the Congestion Experienced (CE) codepoint in a packet header as an indication of congestion, instead of relying solely on packet drops. This has the potential of reducing the impact of loss on latency-sensitive flows.
There exist some middleboxes (firewalls, load balancers, or intrusion detection systems) in the Internet that either drop a TCP SYN packet configured to negotiate ECN, or respond with a RST. This document specifies procedures that TCP implementations may use to provide robust connectivity even in the presence of such equipment.

2. Conventions and Acronyms
The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear in this document, are to be interpreted as described in [RFC2119].

3. Assumptions and General Principles
In this section, we describe some of the important design principles and assumptions that guided the design choices in this proposal.

* Because ECN is likely to be adopted gradually, accommodating migration is essential. Some routers may still only drop packets to indicate congestion, and some end-systems may not be ECN-capable. The most viable strategy is one that accommodates incremental deployment without having to resort to "islands" of ECN-capable and non-ECN-capable environments.

* New mechanisms for congestion control and avoidance need to co-exist and cooperate with existing mechanisms for congestion control. In particular, new mechanisms have to co-exist with TCP's current methods of adapting to congestion and with routers' current practice of dropping packets in periods of congestion.

* Congestion may persist over different time-scales. The time scales that we are concerned with are congestion events that may last longer than a round-trip time.

* The number of packets in an individual flow (e.g., TCP connection or an exchange using UDP) may range from a small number of packets to quite a large number. We are interested in managing the congestion caused by flows that send enough packets so that they are still active when network feedback reaches them.

* Asymmetric routing is likely to be a normal occurrence in the Internet. The path (sequence of links and routers) followed by data packets may be different from the path followed by the acknowledgment packets in the reverse direction.

* Many routers process the "regular" headers in IP packets more efficiently than they process the header information in IP options. This suggests keeping congestion experienced information in the regular headers of an IP packet.

* It must be recognized that not all end-systems will cooperate in mechanisms for congestion control. However, new mechanisms shouldn't make it easier for TCP applications to disable TCP congestion control. The benefit of lying about participating in new mechanisms such as ECN-capability should be small.

4. Active Queue Management (AQM)
Random Early Detection (RED) is one mechanism for Active Queue Management (AQM) that has been proposed to detect incipient congestion [FJ93], and is currently being deployed in the Internet [RFC2309]. AQM is meant to be a general mechanism using one of several alternatives for congestion indication, but in the absence of ECN, AQM is restricted to using packet drops as a mechanism for congestion indication. AQM drops packets based on the average queue length exceeding a threshold, rather than only when the queue overflows. However, because AQM may drop packets before the queue actually overflows, AQM is not always forced by memory limitations to discard the packet.

AQM can set a Congestion Experienced (CE) codepoint in the packet header instead of dropping the packet, when such a field is provided in the IP header and understood by the transport protocol. The use of the CE codepoint with ECN allows the receiver(s) to receive the packet, avoiding the potential for excessive delays due to retransmissions after packet losses. We use the term 'CE packet' to denote a packet that has the CE codepoint set.

5. Explicit Congestion Notification in IP
This document specifies that the Internet provide a congestion indication for incipient congestion (as in RED and earlier work [RJ90]) where the notification can sometimes be through marking packets rather than dropping them. This uses an ECN field in the IP header with two bits, making four ECN codepoints, '00' to '11'. The ECN-Capable Transport (ECT) codepoints '10' and '01' are set by the data sender to indicate that the end-points of the transport protocol are ECN-capable; we call them ECT(0) and ECT(1) respectively. The phrase "the ECT codepoint" in this document refers to either of the two ECT codepoints. Routers treat the ECT(0) and ECT(1) codepoints as equivalent. Senders are free to use either the ECT(0) or the ECT(1) codepoint to indicate ECT, on a packet-by-packet basis.
The use of two codepoints for ECT, ECT(0) and ECT(1), is motivated primarily by the desire to allow mechanisms for the data sender to verify that network elements are not erasing the CE codepoint, and that data receivers are properly reporting to the sender the receipt of packets with the CE codepoint set, as required by the transport protocol. Guidelines for the senders and receivers to differentiate between the ECT(0) and ECT(1) codepoints will be addressed in separate documents, for each transport protocol. In particular, this document does not address mechanisms for TCP end-nodes to differentiate between the ECT(0) and ECT(1) codepoints. Protocols and senders that only require a single ECT codepoint SHOULD use ECT(0).

The not-ECT codepoint '00' indicates a packet that is not using ECN. The CE codepoint '11' is set by a router to indicate congestion to the end nodes. Routers that have a packet arriving at a full queue drop the packet, just as they do in the absence of ECN.

      +-----+-----+
      | ECN FIELD |
      +-----+-----+
        ECT   CE         [Obsolete] RFC 2481 names for the ECN bits.
         0     0         Not-ECT
         0     1         ECT(1)
         1     0         ECT(0)
         1     1         CE

      Figure 1: The ECN Field in IP.

The use of two ECT codepoints essentially gives a one-bit ECN nonce in packet headers, and routers necessarily "erase" the nonce when they set the CE codepoint [SCWA99]. For example, routers that erased the CE codepoint would face additional difficulty in reconstructing the original nonce, and thus repeated erasure of the CE codepoint would be more likely to be detected by the end-nodes. The ECN nonce also can address the problem of misbehaving transport receivers lying to the transport sender about whether or not the CE codepoint was set in a packet. The motivations for the use of two ECT codepoints are discussed in more detail in Section 20, along with some discussion of alternate possibilities for the fourth ECT codepoint (that is, the codepoint '01').

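As an illustration of the four codepoints of Figure 1, the following is a minimal Python sketch, not part of this specification. The names `EcnCodepoint`, `is_ect`, and `nonce_bit` are hypothetical. The mapping of ECT(0) to nonce value 0 and ECT(1) to nonce value 1 is an assumption made only for this example; this document leaves the use of the two ECT codepoints to separate documents.

```python
from enum import IntEnum
from typing import Optional


class EcnCodepoint(IntEnum):
    """The four values of the two-bit ECN field (Figure 1)."""
    NOT_ECT = 0b00  # packet is not using ECN
    ECT_1 = 0b01    # ECN-Capable Transport, ECT(1)
    ECT_0 = 0b10    # ECN-Capable Transport, ECT(0)
    CE = 0b11       # Congestion Experienced, set by a router


def is_ect(cp: EcnCodepoint) -> bool:
    """True for either ECT codepoint; routers treat the two as equivalent."""
    return cp in (EcnCodepoint.ECT_0, EcnCodepoint.ECT_1)


def nonce_bit(cp: EcnCodepoint) -> Optional[int]:
    """Illustrative one-bit nonce carried by the choice of ECT codepoint.

    Assumption for this sketch: ECT(0) -> 0 and ECT(1) -> 1.  Once a
    router has set CE (or for Not-ECT), the nonce cannot be recovered,
    so None is returned.
    """
    if cp == EcnCodepoint.ECT_0:
        return 0
    if cp == EcnCodepoint.ECT_1:
        return 1
    return None
```

The sketch shows why setting CE necessarily "erases" the nonce: both ECT(0) and ECT(1) map to the same CE value, so an element that later cleared CE would have to guess which ECT codepoint to restore.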
Backwards compatibility with earlier ECN implementations that do not understand the ECT(1) codepoint is discussed in Section 11. In RFC 2481 [RFC2481], the ECN field was divided into the ECN-Capable Transport (ECT) bit and the CE bit. The ECN field with only the ECN-Capable Transport (ECT) bit set in RFC 2481 corresponds to the ECT(0) codepoint in this document, and the ECN field with both the
ECT and CE bit in RFC 2481 corresponds to the CE codepoint in this document. The '01' codepoint was left undefined in RFC 2481, and this is the reason for recommending the use of ECT(0) when only a single ECT codepoint is needed.

         0     1     2     3     4     5     6     7
      +-----+-----+-----+-----+-----+-----+-----+-----+
      |          DS FIELD, DSCP           | ECN FIELD |
      +-----+-----+-----+-----+-----+-----+-----+-----+

        DSCP: differentiated services codepoint
        ECN:  Explicit Congestion Notification

      Figure 2: The Differentiated Services and ECN Fields in IP.

Bits 6 and 7 in the IPv4 TOS octet are designated as the ECN field. The IPv4 TOS octet corresponds to the Traffic Class octet in IPv6, and the ECN field is defined identically in both cases. The definitions for the IPv4 TOS octet [RFC791] and the IPv6 Traffic Class octet have been superseded by the six-bit DS (Differentiated Services) Field [RFC2474, RFC2780]. Bits 6 and 7 are listed in [RFC2474] as Currently Unused, and are specified in RFC 2780 as approved for experimental use for ECN. Section 22 gives a brief history of the TOS octet.

Because of the unstable history of the TOS octet, the use of the ECN field as specified in this document cannot be guaranteed to be backwards compatible with those past uses of these two bits that pre-date ECN. The potential dangers of this lack of backwards compatibility are discussed in Section 22.

Upon the receipt by an ECN-Capable transport of a single CE packet, the congestion control algorithms followed at the end-systems MUST be essentially the same as the congestion control response to a *single* dropped packet. For example, for ECN-Capable TCP the source TCP is required to halve its congestion window for any window of data containing either a packet drop or an ECN indication.

One reason for requiring that the congestion-control response to the CE packet be essentially the same as the response to a dropped packet is to accommodate the incremental deployment of ECN in both end-systems and in routers.
Some routers may drop ECN-Capable packets (e.g., using the same AQM policies for congestion detection) while other routers set the CE codepoint, for equivalent levels of congestion. Similarly, a router might drop a non-ECN-Capable packet but set the CE codepoint in an ECN-Capable packet, for equivalent
levels of congestion. If there were different congestion control responses to a CE codepoint than to a packet drop, this could result in unfair treatment for different flows.

An additional goal is that the end-systems should react to congestion at most once per window of data (i.e., at most once per round-trip time), to avoid reacting multiple times to multiple indications of congestion within a round-trip time.

For a router, the CE codepoint of an ECN-Capable packet SHOULD only be set if the router would otherwise have dropped the packet as an indication of congestion to the end nodes. When the router's buffer is not yet full and the router is prepared to drop a packet to inform end nodes of incipient congestion, the router should first check to see if the ECT codepoint is set in that packet's IP header. If so, then instead of dropping the packet, the router MAY instead set the CE codepoint in the IP header.

An environment where all end nodes were ECN-Capable could allow new criteria to be developed for setting the CE codepoint, and new congestion control mechanisms for end-node reaction to CE packets. However, this is a research issue, and as such is not addressed in this document.

When a CE packet (i.e., a packet that has the CE codepoint set) is received by a router, the CE codepoint is left unchanged, and the packet is transmitted as usual. When severe congestion has occurred and the router's queue is full, then the router has no choice but to drop some packet when a new packet arrives. We anticipate that such packet losses will become relatively infrequent when a majority of end-systems become ECN-Capable and participate in TCP or other compatible congestion control mechanisms. In an ECN-Capable environment that is adequately-provisioned, packet losses should occur primarily during transients or in the presence of non-cooperating sources.

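The end-system reaction described earlier in this section (respond to a CE packet as to a single dropped packet, and react at most once per window of data) can be sketched as follows. This is a toy model under stated assumptions, not a TCP implementation: the class `EcnSenderWindow`, the `recover` marker used to delimit a window of data, the segment-based sequence numbers, and the `min_cwnd` floor are all hypothetical choices made for illustration.

```python
class EcnSenderWindow:
    """Toy sender: cwnd counted in segments, sequence numbers count segments."""

    def __init__(self, cwnd: int, min_cwnd: int = 1) -> None:
        self.cwnd = cwnd          # congestion window, in segments
        self.min_cwnd = min_cwnd  # assumed lower bound for this sketch
        self.next_seq = 0         # next segment number to be sent
        self.recover = -1         # highest segment outstanding at last reduction

    def on_send(self) -> int:
        """Record the transmission of one segment and return its number."""
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def on_congestion_signal(self, seq: int) -> bool:
        """Handle a drop or an ECN indication that concerns segment `seq`.

        Both kinds of signal take the same path, since the response to a
        CE packet is essentially the same as the response to a drop.
        Returns True if the window was reduced.
        """
        if seq <= self.recover:
            # This segment belongs to a window of data for which the
            # sender has already reacted: do not reduce again.
            return False
        self.cwnd = max(self.min_cwnd, self.cwnd // 2)  # halve the window
        self.recover = self.next_seq - 1
        return True
```

For example, a sender with sixteen segments outstanding that learns of congestion on segments 3 and 7 halves its window once, not twice; a later signal for a segment sent after the reduction triggers a new halving.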
The above discussion of when CE may be set instead of dropping a packet applies by default to all Differentiated Services Per-Hop Behaviors (PHBs) [RFC2475]. Specifications for PHBs MAY provide more specifics on how a compliant implementation is to choose between setting CE and dropping a packet, but this is NOT REQUIRED. A router MUST NOT set CE instead of dropping a packet when the drop that would occur is caused by reasons other than congestion or the desire to indicate incipient congestion to end nodes (e.g., a diffserv edge node may be configured to unconditionally drop certain classes of traffic to prevent them from entering its diffserv domain).
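The router behavior described in this section can be summarized in a short Python sketch. It is illustrative only and rests on several assumptions: the function `router_decision`, the `Action` enumeration, and the three boolean inputs are hypothetical names; the congestion decision itself (e.g., from RED) is taken as an input and not modeled; and the sketch always marks where this document says the router MAY mark. A real IPv4 router would also have to update the header checksum after changing the octet (see Section 17).

```python
from enum import Enum
from typing import Tuple

ECN_MASK = 0b00000011  # bits 6 and 7 of the TOS / Traffic Class octet
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


class Action(Enum):
    FORWARD = "forward"
    MARK_CE = "mark_ce"
    DROP = "drop"


def router_decision(tos: int, queue_full: bool, aqm_signal: bool,
                    policy_drop: bool = False) -> Tuple[Action, int]:
    """Return (action, possibly updated TOS octet) for one arriving packet.

    tos         -- the 8-bit TOS / Traffic Class octet of the packet
    queue_full  -- True if the queue has no room for the packet
    aqm_signal  -- True if AQM has chosen this packet to signal
                   incipient congestion (it would otherwise be dropped)
    policy_drop -- True if the packet is to be dropped for a reason
                   other than congestion
    """
    ecn = tos & ECN_MASK
    if policy_drop:
        return Action.DROP, tos   # MUST NOT set CE instead of this drop
    if queue_full:
        return Action.DROP, tos   # no choice but to drop
    if aqm_signal:
        if ecn == NOT_ECT:
            return Action.DROP, tos  # non-ECN-Capable packet: drop as before
        # ECT(0), ECT(1), or already CE: set CE, leaving the DSCP bits alone.
        return Action.MARK_CE, tos | CE
    return Action.FORWARD, tos    # includes CE packets, left unchanged
```

Because the mark is applied with a bitwise OR on the two low-order bits, the six DSCP bits of Figure 2 are preserved, and a packet that already carries CE is forwarded with CE still set.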
We expect that routers will set the CE codepoint in response to incipient congestion as indicated by the average queue size, using the RED algorithms suggested in [FJ93, RFC2309]. To the best of our knowledge, this is the only proposal currently under discussion in the IETF for routers to drop packets proactively, before the buffer overflows. However, this document does not attempt to specify a particular mechanism for active queue management, leaving that endeavor, if needed, to other areas of the IETF. While ECN is inextricably tied up with the need to have a reasonable active queue management mechanism at the router, the reverse does not hold; active queue management mechanisms have been developed and deployed independent of ECN, using packet drops as indications of congestion in the absence of ECN in the IP architecture.

5.1. ECN as an Indication of Persistent Congestion
We emphasize that a *single* packet with the CE codepoint set in an IP packet causes the transport layer to respond, in terms of congestion control, as it would to a packet drop. The instantaneous queue size is likely to see considerable variations even when the router does not experience persistent congestion. As such, it is important that transient congestion at a router, reflected by the instantaneous queue size reaching a threshold much smaller than the capacity of the queue, not trigger a reaction at the transport layer. Therefore, the CE codepoint should not be set by a router based on the instantaneous queue size. For example, since the ATM and Frame Relay mechanisms for congestion indication have typically been defined without an associated notion of average queue size as the basis for determining that an intermediate node is congested, we believe that they provide a very noisy signal. The TCP-sender reaction specified in this document for ECN is NOT the appropriate reaction for such a noisy signal of congestion notification. However, if the routers that interface to the ATM network have a way of maintaining the average queue at the interface, and use it to come to a reliable determination that the ATM subnet is congested, they may use the ECN notification that is defined here. We continue to encourage experiments in techniques at layer 2 (e.g., in ATM switches or Frame Relay switches) to take advantage of ECN. For example, using a scheme such as RED (where packet marking is based on the average queue length exceeding a threshold), layer 2 devices could provide a reasonably reliable indication of congestion. When all the layer 2 devices in a path set that layer's own Congestion Experienced codepoint (e.g., the EFCI bit for ATM, the FECN bit in Frame Relay) in this reliable manner, then the interface router to the layer 2 network could copy the state of that layer 2
Congestion Experienced codepoint into the CE codepoint in the IP header. We recognize that this is not the current practice, nor is it in current standards. However, encouraging experimentation in this manner may provide the information needed to enable evolution of existing layer 2 mechanisms to provide a more reliable means of congestion indication, when they use a single bit for indicating congestion.

5.2. Dropped or Corrupted Packets
For the proposed use for ECN in this document (that is, for a transport protocol such as TCP for which a dropped data packet is an indication of congestion), end nodes detect dropped data packets, and the congestion response of the end nodes to a dropped data packet is at least as strong as the congestion response to a received CE packet. To ensure the reliable delivery of the congestion indication of the CE codepoint, an ECT codepoint MUST NOT be set in a packet unless the loss of that packet in the network would be detected by the end nodes and interpreted as an indication of congestion.

Transport protocols such as TCP do not necessarily detect all packet drops, such as the drop of a "pure" ACK packet; for example, TCP does not reduce the arrival rate of subsequent ACK packets in response to an earlier dropped ACK packet. Any proposal for extending ECN-Capability to such packets would have to address issues such as the case of an ACK packet that was marked with the CE codepoint but was later dropped in the network. We believe that this aspect is still the subject of research, so this document specifies that at this time, "pure" ACK packets MUST NOT indicate ECN-Capability.

Similarly, if a CE packet is dropped later in the network due to corruption (bit errors), the end nodes should still invoke congestion control, just as TCP would today in response to a dropped data packet. This issue of corrupted CE packets would have to be considered in any proposal for the network to distinguish between packets dropped due to corruption, and packets dropped due to congestion or buffer overflow. In particular, the ubiquitous deployment of ECN would not, in and of itself, be a sufficient development to allow end-nodes to interpret packet drops as indications of corruption rather than congestion.

5.3. Fragmentation
ECN-capable packets MAY have the DF (Don't Fragment) bit set. Reassembly of a fragmented packet MUST NOT lose indications of congestion. In other words, if any fragment of an IP packet to be reassembled has the CE codepoint set, then one of two actions MUST be taken:
* Set the CE codepoint on the reassembled packet. However, this MUST NOT occur if any of the other fragments contributing to this reassembly carries the Not-ECT codepoint.

* The packet is dropped, instead of being reassembled, for any other reason.

If both actions are applicable, either MAY be chosen. Reassembly of a fragmented packet MUST NOT change the ECN codepoint when all of the fragments carry the same codepoint.

We would note that because RFC 2481 did not specify reassembly behavior, older ECN implementations conformant with that Experimental RFC do not necessarily perform reassembly correctly, in terms of preserving the CE codepoint in a fragment. The sender could avoid the consequences of this behavior by setting the DF bit in ECN-Capable packets.

Situations may arise in which the above reassembly specification is insufficiently precise. For example, if there is a malicious or broken entity in the path at or after the fragmentation point, packet fragments could carry a mixture of ECT(0), ECT(1), and/or Not-ECT codepoints. The reassembly specification above does not place requirements on reassembly of fragments in this case. In situations where more precise reassembly behavior would be required, protocol specifications SHOULD instead specify that DF MUST be set in all ECN-capable packets sent by the protocol.
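The reassembly rules of this section can be illustrated with a small Python sketch. The function name `reassembled_ecn` and the convention of returning None to mean "drop the packet" are hypothetical. Where both actions are applicable, the sketch chooses to set CE. For a mixture of ECT(0), ECT(1), and/or Not-ECT fragments with no CE fragment, this document places no requirement on reassembly; the sketch keeps the first fragment's codepoint, which is purely an assumption of this example.

```python
from typing import Optional, Sequence

NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def reassembled_ecn(fragment_ecns: Sequence[int]) -> Optional[int]:
    """Return the ECN codepoint for the reassembled packet.

    fragment_ecns -- the two-bit ECN codepoint of each fragment, in order.
    Returns None when the packet has to be dropped instead of reassembled.
    """
    if not fragment_ecns:
        raise ValueError("at least one fragment is required")
    distinct = set(fragment_ecns)
    if len(distinct) == 1:
        # All fragments agree: the codepoint MUST NOT be changed.
        return fragment_ecns[0]
    if CE in distinct:
        if NOT_ECT in distinct:
            # Setting CE MUST NOT occur next to a Not-ECT fragment,
            # so the only remaining action is to drop the packet.
            return None
        return CE  # preserve the congestion indication
    # Mixed ECT(0)/ECT(1)/Not-ECT without CE: not specified by the text.
    # Assumption of this sketch: keep the first fragment's codepoint.
    return fragment_ecns[0]
```

A protocol that needs more precise behavior than this would, as stated above, set DF in all of its ECN-capable packets so that reassembly never arises.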