3. Motivating Arguments
This section is informative. It justifies the recommendations made in the previous section.
3.1. Avoiding Perverse Incentives to (Ab)use Smaller Packets
Increasingly, it is being recognised that a protocol design must take care not to cause unintended consequences by giving the parties in the protocol exchange perverse incentives [Evol_cc] [RFC3426]. Given there are many good reasons why larger path maximum transmission units (PMTUs) would help solve a number of scaling issues, we do not want to create any bias against large packets that is greater than their true cost.
Imagine a scenario where the same bit rate of packets will contribute the same to bit congestion of a link irrespective of whether it is sent as fewer larger packets or more smaller packets. A protocol design that caused larger packets to be more likely to be dropped than smaller ones would be dangerous in both of the following cases:
Malicious transports: A queue that gives an advantage to small packets can be used to amplify the force of a flooding attack. By sending a flood of small packets, the attacker can get the queue to discard more large-packet traffic, allowing more attack traffic to get through to cause further damage. Such a queue allows attack traffic to have a disproportionately large effect on regular traffic without the attacker having to do much work.
Non-malicious transports: Even if an application designer is not actually malicious, if over time it is noticed that small packets tend to go faster, designers will act in their own interest and use smaller packets. Queues that give advantage to small packets create an evolutionary pressure for applications or transports to send at the same bit rate but break their data stream down into tiny segments to reduce their drop rate. Encouraging a high volume of tiny packets might in turn unnecessarily overload a completely unrelated part of the system, perhaps more limited by header processing than bandwidth.
Imagine that two unresponsive flows arrive at a bit-congestible transmission link each with the same bit rate, say 1 Mbps, but one consists of 1,500 B and the other 60 B packets, which are 25x smaller. Consider a scenario where gentle RED [gentle_RED] is used, along with the variant of RED we advise against, i.e., where the RED algorithm is configured to adjust the drop probability of packets in proportion to each packet's size (byte-mode packet drop). In this case, RED aims to drop 25x more of the larger packets than the smaller ones. Thus, for example, if RED drops 25% of the larger packets, it will aim to drop 1% of the smaller packets (but, in practice, it may drop more as congestion increases; see Appendix B.4 of [RFC4828]). Even though both flows arrive with the same bit rate, the bit rate the RED queue aims to pass to the line will be 750 kbps for the flow of larger packets but 990 kbps for the smaller packets (because of rate variations, it will actually be a little less than this target).
Note that, although the byte-mode drop variant of RED amplifies small-packet attacks, tail-drop queues amplify small-packet attacks even more (see Security Considerations in Section 6). Wherever possible, neither should be used.
3.2. Small != Control
Dropping fewer control packets considerably improves performance. It is tempting to drop small packets with lower probability in order to improve performance, because many control packets tend to be smaller (TCP SYNs and ACKs, DNS queries and responses, SIP messages, HTTP GETs, etc.). However, we must not give control packets preference purely by virtue of their smallness, otherwise it is too easy for any data source to get the same preferential treatment simply by sending data in smaller packets. Again, we should not create perverse incentives to favour small packets rather than to favour control packets, which is what we intend.
Just because many control packets are small does not mean all small packets are control packets.
So, rather than fix these problems in the network, we argue that the transport should be made more robust against losses of control packets (see Section 4.2.3).
3.3. Transport-Independent Network
TCP congestion control ensures that flows competing for the same resource each maintain the same number of segments in flight, irrespective of segment size. So under similar conditions, flows with different segment sizes will get different bit rates.
To counter this effect, it seems tempting not to follow our recommendation, and instead for the network to bias congestion notification by packet size in order to equalise the bit rates of flows with different packet sizes. However, in order to do this, the queuing algorithm has to make assumptions about the transport, which become embedded in the network. Specifically:
o The queuing algorithm has to assume how aggressively the transport will respond to congestion (see Section 4.2.4). If the network assumes the transport responds as aggressively as TCP NewReno, it will be wrong for Compound TCP and differently wrong for Cubic TCP, etc. To achieve equal bit rates, each transport then has to guess what assumption the network made, and work out how to replace this assumed aggressiveness with its own aggressiveness.
o Also, if the network biases congestion notification by packet size, it has to assume a baseline packet size -- all proposed algorithms use the local MTU (for example, see the byte-mode loss probability formula in Table 1). Then if the non-Reno transports mentioned above are trying to reverse engineer what the network assumed, they also have to guess the MTU of the congested link.
Even though reducing the drop probability of small packets (e.g., RED's byte-mode drop) helps ensure TCP flows with different packet sizes will achieve similar bit rates, we argue that this correction should be made to any future transport protocols based on TCP, not to the network in order to fix one transport, no matter how predominant it is. Effectively, favouring small packets is reverse engineering of network equipment around one particular transport protocol (TCP), contrary to the excellent advice in [RFC3426], which asks designers to question "Why are you proposing a solution at this layer of the protocol stack, rather than at another layer?"
In contrast, if the network never takes packet size into account, the transport can be certain it will never need to guess any assumptions that the network has made.
And the network passes two pieces of information to the transport that are sufficient in all cases: i) congestion notification on the packet and ii) the size of the packet. Both are available for the transport to combine (by taking packet size into account when responding to congestion) or not. Appendix B checks that these two pieces of information are sufficient for all relevant scenarios. When the network does not take packet size into account, it allows transport protocols to choose whether or not to take packet size into account. However, if the network were to bias congestion notification by packet size, transport protocols would have no choice; those that did not take into account packet size themselves would unwittingly become dependent on packet size, and those that already took packet size into account would end up taking it into account twice.
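As an illustration of how a transport could combine these two pieces of information, the following is a minimal sketch only. The class name `CongestionCounter` and its fields are hypothetical and do not come from any protocol specification. It keeps both a packet-based and a byte-based view of the same stream of congestion notifications, leaving the transport free to use either. The byte-based count is maintained by adding each notified packet's size, so no multiplication is needed.

```python
from dataclasses import dataclass


@dataclass
class CongestionCounter:
    """Hypothetical per-flow counter; all names are illustrative only."""
    packets_seen: int = 0
    packets_notified: int = 0
    bytes_seen: int = 0
    bytes_notified: int = 0

    def on_packet(self, size_bytes: int, congested: bool) -> None:
        # The network supplies only (i) the congestion notification
        # and (ii) the size of the packet it applies to.
        self.packets_seen += 1
        self.bytes_seen += size_bytes
        if congested:
            self.packets_notified += 1
            # Weighting by packet size is a repeated add, not a multiply.
            self.bytes_notified += size_bytes

    def packet_fraction(self) -> float:
        """Fraction of packets that carried a congestion notification."""
        if self.packets_seen == 0:
            return 0.0
        return self.packets_notified / self.packets_seen

    def byte_fraction(self) -> float:
        """Fraction of bytes that carried a congestion notification."""
        if self.bytes_seen == 0:
            return 0.0
        return self.bytes_notified / self.bytes_seen


if __name__ == "__main__":
    counter = CongestionCounter()
    for size, congested in [(1500, True), (60, False), (60, False), (60, True)]:
        counter.on_packet(size, congested)
    print(counter.packet_fraction(), counter.byte_fraction())
```

A transport that does not want its response to depend on packet size would read `packet_fraction()`; one that does would read `byte_fraction()`. In neither case does the network need to know which choice was made.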
3.4. Partial Deployment of AQM
In overview, the argument in this section runs as follows:
o Because the network does not and cannot always drop packets in proportion to their size, it shouldn't be given the task of making drop signals depend on packet size at all.
o Transports on the other hand don't always want to make their rate response proportional to the size of dropped packets, but if they want to, they always can.
The argument is similar to the end-to-end argument that says "Don't do X in the network if end systems can do X by themselves, and they want to be able to choose whether to do X anyway". Actually the following argument is stronger; in addition it says "Don't give the network task X that could be done by the end systems, if X is not deployed on all network nodes, and end systems won't be able to tell whether their network is doing X, or whether they need to do X themselves." In this case, the X in question is "making the response to congestion depend on packet size".
We will now re-run this argument reviewing each step in more depth. The argument applies solely to drop, not to ECN marking.
A queue drops packets for either of two reasons: a) to signal to host congestion controls that they should reduce the load and b) because there is no buffer left to store the packets. Active queue management tries to use drops as a signal for hosts to slow down (case a) so that drops due to buffer exhaustion (case b) should not be necessary.
AQM is not universally deployed in every queue in the Internet; many cheap Ethernet bridges, software firewalls, NATs on consumer devices, etc., implement simple tail-drop buffers. Even if AQM were universal, it has to be able to cope with buffer exhaustion (by switching to a behaviour like tail drop), in order to cope with unresponsive or excessive transports. For these reasons networks will sometimes be dropping packets as a last resort (case b) rather than under AQM control (case a).
When buffers are exhausted (case b), they don't naturally drop packets in proportion to their size. The network can only reduce the probability of dropping smaller packets if it has enough space to store them somewhere while it waits for a larger packet that it can drop. If the buffer is exhausted, it does not have this choice. Admittedly tail drop does naturally drop somewhat fewer small packets, but exactly how few depends more on the mix of sizes than the size of the packet in question. Nonetheless, in general, if we wanted networks to do size-dependent drop, we would need universal deployment of (packet-size dependent) AQM code, which is currently unrealistic.
A host transport cannot know whether any particular drop was a deliberate signal from an AQM or a sign of a queue shedding packets due to buffer exhaustion. Therefore, because the network cannot universally do size-dependent drop, it should not do it at all.
Whereas universality is desirable in the network, diversity is desirable between different transport-layer protocols -- some, like standards track TCP congestion control [RFC5681], may not choose to make their rate response proportionate to the size of each dropped packet, while others will (e.g., TCP-Friendly Rate Control for Small Packets (TFRC-SP) [RFC4828]).
3.5. Implementation Efficiency
Biasing against large packets typically requires an extra multiply and divide in the network (see the example byte-mode drop formula in Table 1). Taking packet size into account at the transport rather than in the network ensures that neither the network nor the transport needs to do a multiply operation -- multiplication by packet size is effectively achieved as a repeated add when the transport adds to its count of marked bytes as each congestion event is fed to it. Also, the work to do the biasing is spread over many hosts, rather than concentrated in just the congested network element.
These aren't principled reasons in themselves, but they are a happy consequence of the other principled reasons.
4. A Survey and Critique of Past Advice
This section is informative, not normative. The original 1993 paper on RED [RED93] proposed two options for the RED active queue management algorithm: packet mode and byte mode. Packet mode measured the queue length in packets and dropped (or marked) individual packets with a probability independent of their size. Byte mode measured the queue length in bytes and marked an individual packet with probability in proportion to its size (relative to the maximum packet size). In the paper's outline of further work, it was stated that no recommendation had been made on whether the queue size should be measured in bytes or packets, but noted that the difference could be significant.
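The difference between the two drop modes can be sketched as follows. This is a simplified illustration, not the RED algorithm itself: it omits queue averaging and RED's mechanism for spreading drops uniformly, and the function names are hypothetical. It assumes only what is stated above, namely that byte mode scales the drop probability by the packet's size relative to the maximum packet size. The demonstration reuses the two 1 Mbps flows from Section 3.1.

```python
def drop_probability(base_p: float, size_bytes: int,
                     max_size_bytes: int, byte_mode: bool) -> float:
    """Per-packet drop probability under packet-mode or byte-mode drop.

    base_p is the probability applied to a maximum-size packet.
    """
    if not byte_mode:
        # Packet mode: independent of the packet's own size.
        return base_p
    # Byte mode: proportional to size relative to the maximum size.
    return base_p * size_bytes / max_size_bytes


def passed_rate_kbps(arrival_kbps: float, drop_p: float) -> float:
    """Bit rate the queue aims to pass for an unresponsive flow."""
    return arrival_kbps * (1.0 - drop_p)


if __name__ == "__main__":
    for size in (1500, 60):
        for byte_mode in (False, True):
            p = drop_probability(0.25, size, 1500, byte_mode)
            rate = passed_rate_kbps(1000.0, p)
            mode = "byte" if byte_mode else "packet"
            print(f"{size:>5} B  {mode:<6} mode  p={p:.4f}  {rate:.0f} kbps")
```

With byte-mode drop, the 60 B flow is targeted with 1% drop against 25% for the 1,500 B flow, giving the 990 kbps and 750 kbps targets quoted in Section 3.1. With packet-mode drop, both flows see the same drop probability and therefore the same target rate.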
When RED was recommended for general deployment in 1998 [RFC2309], the two modes were mentioned, implying the choice between them was a question of performance, referring to a 1997 email [pktByteEmail] for advice on tuning. A later addendum to this email introduced the insight that there are in fact two orthogonal choices:
o whether to measure queue length in bytes or packets (Section 4.1), and
o whether the drop probability of an individual packet should depend on its own size (Section 4.2).
The rest of this section is structured accordingly.
4.1. Congestion Measurement Advice
The choice of which metric to use to measure queue length was left open in RFC 2309. It is now well understood that queues for bit-congestible resources should be measured in bytes, and queues for packet-congestible resources should be measured in packets [pktByteEmail].
Congestion in some legacy bit-congestible buffers is only measured in packets not bytes. In such cases, the operator has to take into account a typical mix of packet sizes when setting the thresholds. Any AQM algorithm on such a buffer will be oversensitive to high proportions of small packets, e.g., a DoS attack, and under-sensitive to high proportions of large packets. However, there is no need to make allowances for the possibility of such a legacy in future protocol design. This is safe because any under-sensitivity during unusual traffic mixes cannot lead to congestion collapse given that the buffer will eventually revert to tail drop, which discards proportionately more large packets.
4.1.1. Fixed-Size Packet Buffers
The question of whether to measure queues in bytes or packets seems to be well understood. However, measuring congestion is confusing when the resource is bit-congestible but the queue into the resource is packet-congestible. This section outlines the approach to take.
Some, mostly older, queuing hardware allocates fixed-size buffers in which to store each packet in the queue. This hardware forwards packets to the line in one of two ways:
o With some hardware, any fixed-size buffers not completely filled by a packet are padded when transmitted to the wire. This case should clearly be treated as packet-congestible, because both queuing and transmission are in fixed MTU-size units. Therefore, the queue length in packets is a good model of congestion of the link.
o More commonly, hardware with fixed-size packet buffers transmits packets to the line without padding. This implies a hybrid forwarding system with transmission congestion dependent on the size of packets but queue congestion dependent on the number of packets, irrespective of their size. Nonetheless, there would be no queue at all unless the line had become congested -- the root cause of any congestion is too many bytes arriving for the line. Therefore, the AQM should measure the queue length as the sum of all the packet sizes in bytes that are queued up waiting to be serviced by the line, irrespective of whether each packet is held in a fixed-size buffer.
In the (unlikely) first case where use of padding means the queue should be measured in packets, further confusion is likely because the fixed buffers are rarely all one size. Typically, pools of different-sized buffers are provided (Cisco uses the term 'buffer carving' for the process of dividing up memory into these pools [IOSArch]). Usually, if the pool of small buffers is exhausted, arriving small packets can borrow space in the pool of large buffers, but not vice versa.
However, there is no need to consider all this complexity, because the root cause of any congestion is still line overload -- buffer consumption is only the symptom. Therefore, the length of the queue should be measured as the sum of the bytes in the queue that will be transmitted to the line, including any padding. In the (unusual) case of transmission with padding, this means the sum of the sizes of the small buffers queued plus the sum of the sizes of the large buffers queued.
We will return to borrowing of fixed-size buffers when we discuss biasing the drop/marking probability of a specific packet because of its size in Section 4.2.1.
But here, we can repeat the simple rule for how to measure the length of queues of fixed buffers: no matter how complicated the buffering scheme is, ultimately a transmission line is nearly always bit-congestible so the number of bytes queued up waiting for the line measures how congested the line is, and it is rarely important to measure how congested the buffering system is.
4.1.2. Congestion Measurement without a Queue
AQM algorithms are nearly always described assuming there is a queue for a congested resource and the algorithm can use the queue length to determine the probability that it will drop or mark each packet. But not all congested resources lead to queues. For instance, power-limited resources are usually bit-congestible if energy is primarily required for transmission rather than header processing, but it is rare for a link protocol to build a queue as it approaches maximum power.
Nonetheless, AQM algorithms do not require a queue in order to work. For instance, spectrum congestion can be modelled by signal quality using the target bit-energy-to-noise-density ratio. And, to model radio power exhaustion, transmission-power levels can be measured and compared to the maximum power available. [ECNFixedWireless] proposes a practical and theoretically sound way to combine congestion notification for different bit-congestible resources at different layers along an end-to-end path, whether wireless or wired, and whether with or without queues.
In wireless protocols that use request to send / clear to send (RTS / CTS) control, such as some variants of IEEE 802.11, it is reasonable to base an AQM on the time spent waiting for transmission opportunities (TXOPs) even though the wireless spectrum is usually regarded as congested by bits (for a given coding scheme). This is because requests for TXOPs queue up as the spectrum gets congested by all the bits being transferred. So the time that TXOPs are queued directly reflects bit congestion of the spectrum.
4.2. Congestion Notification Advice
4.2.1. Network Bias When Encoding
4.2.1.1. Advice on Packet-Size Bias in RED
The previously mentioned email [pktByteEmail] referred to by [RFC2309] advised that most scarce resources in the Internet were bit-congestible, which is still believed to be true (Section 1.1). But it went on to offer advice that is updated by this memo. It said that drop probability should depend on the size of the packet being considered for drop if the resource is bit-congestible, but not if it is packet-congestible. The argument continued that if packet drops were inflated by packet size (byte-mode dropping), "a flow's fraction of the packet drops is then a good indication of that flow's fraction of the link bandwidth in bits per second". This was consistent with a referenced policing mechanism being worked on at the time for detecting unusually high bandwidth flows, eventually published in 1999 [pBox]. However, the problem could and should have been solved by making the policing mechanism count the volume of bytes randomly dropped, not the number of packets.
A few months before RFC 2309 was published, an addendum was added to the above archived email referenced from the RFC, in which the final paragraph seemed to partially retract what had previously been said. It clarified that the question of whether the probability of dropping/marking a packet should depend on its size was not related to whether the resource itself was bit-congestible, but a completely orthogonal question. However, the only example given had the queue measured in packets but packet drop depended on the size of the packet in question. No example was given the other way round.
In 2000, Cnodder et al. [REDbyte] pointed out that there was an error in the part of the original 1993 RED algorithm that aimed to distribute drops uniformly, because it didn't correctly take into account the adjustment for packet size. They recommended an algorithm called RED_4 to fix this. But they also recommended a further change, RED_5, to adjust the drop rate dependent on the square of the relative packet size. This was indeed consistent with one implied motivation behind RED's byte-mode drop -- that we should reverse engineer the network to improve the performance of dominant end-to-end congestion control mechanisms. This memo makes different recommendations in Section 2.
By 2003, a further change had been made to the adjustment for packet size, this time in the RED algorithm of the ns2 simulator. Instead of taking each packet's size relative to a 'maximum packet size', it was taken relative to a 'mean packet size', intended to be a static value representative of the 'typical' packet size on the link. We have not been able to find a justification in the literature for this change; however, Eddy and Allman conducted experiments [REDbias] that assessed how sensitive RED was to this parameter, amongst other things. This changed algorithm can often lead to drop probabilities of greater than 1 (which gives a hint that there is probably a mistake in the theory somewhere).
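To see why scaling by a mean rather than a maximum packet size can push the result above 1, consider the following sketch. It assumes a simple linear scaling of a base drop probability by the ratio of packet size to a reference size. This is an illustrative simplification, not the ns2 code, and the function name and the example values (a 500 B mean, a base probability of 0.5) are hypothetical.

```python
def scaled_drop_probability(base_p: float, size_bytes: int,
                            reference_size_bytes: int) -> float:
    """Linear byte-mode scaling relative to a reference packet size.

    No clamping is applied, so the caller can see when the result
    stops being a valid probability.
    """
    return base_p * size_bytes / reference_size_bytes


if __name__ == "__main__":
    base_p = 0.5
    # Reference = maximum packet size: the ratio never exceeds 1,
    # so the result never exceeds base_p.
    print(scaled_drop_probability(base_p, 1500, 1500))
    # Reference = assumed mean packet size of 500 B: a 1,500 B packet
    # has a ratio of 3, so the result exceeds 1.
    print(scaled_drop_probability(base_p, 1500, 500))
```

Whenever a packet is larger than the reference size, the ratio exceeds 1, so any base probability above the inverse of that ratio yields a value that is no longer a probability.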
On 10-Nov-2004, this variant of byte-mode packet drop was made the default in the ns2 simulator. It seems unlikely that byte-mode drop has ever been implemented in production networks (Appendix A); therefore, any conclusions based on ns2 simulations that use RED without disabling byte-mode drop may not carry over to production networks, where RED is likely to behave very differently.
4.2.1.2. Packet-Size Bias Regardless of AQM
The byte-mode drop variant of RED (or a similar variant of other AQM algorithms) is not the only possible bias towards small packets in queuing systems. We have already mentioned that tail-drop queues naturally tend to lock out large packets once they are full.
But also, queues with fixed-size buffers reduce the probability that small packets will be dropped if (and only if) they allow small packets to borrow buffers from the pools for larger packets (see Section 4.1.1). Borrowing effectively makes the maximum queue size for small packets greater than that for large packets, because more buffers can be used by small packets while fewer will fit large packets. Incidentally, the bias towards small packets from buffer borrowing is nothing like as large as that of RED's byte-mode drop.
Nonetheless, fixed-buffer memory with tail drop is still prone to lock out large packets, purely because of the tail-drop aspect. So, fixed-size packet buffers should be augmented with a good AQM algorithm and packet-mode drop. If an AQM is too complicated to implement with multiple fixed buffer pools, the minimum necessary to prevent large-packet lockout is to ensure that smaller packets never use the last available buffer in any of the pools for larger packets.
4.2.2. Transport Bias When Decoding
The above proposals to alter the network equipment to bias towards smaller packets have largely carried on outside the IETF process. Within the IETF, by contrast, there are many different proposals to alter transport protocols to achieve the same goals, i.e., either to make the flow bit rate take into account packet size, or to protect control packets from loss. This memo argues that altering transport protocols is the more principled approach.
A recently approved experimental RFC adapts its transport-layer protocol to take into account packet sizes relative to typical TCP packet sizes. This proposes a new small-packet variant of TCP-friendly rate control (TFRC [RFC5348]), which is called TFRC-SP [RFC4828]. Essentially, it proposes a rate equation that inflates the flow rate by the ratio of a typical TCP segment size (1,500 B including TCP header) over the actual segment size [PktSizeEquCC]. (There are also other important differences of detail relative to TFRC, such as using virtual packets [CCvarPktSize] to avoid responding to multiple losses per round trip and using a minimum inter-packet interval.)
Section 4.5.1 of the TFRC-SP specification discusses the implications of operating in an environment where queues have been configured to drop smaller packets with proportionately lower probability than larger ones. But it only discusses TCP operating in such an environment, only mentioning TFRC-SP briefly when discussing how to define fairness with TCP. And it only discusses the byte-mode dropping version of RED as it was before Cnodder et al. pointed out that it didn't sufficiently bias towards small packets to make TCP independent of packet size.
So the TFRC-SP specification doesn't address the issue of whether the network or the transport _should_ handle fairness between different packet sizes. In Appendix B.4 of RFC 4828, it discusses the possibility of both TFRC-SP and some network buffers duplicating each other's attempts to deliberately bias towards small packets. But the discussion is not conclusive, instead reporting simulations of many of the possibilities in order to assess performance but not recommending any particular course of action.
The paper originally proposing TFRC with virtual packets (VP-TFRC) [CCvarPktSize] proposed that there should perhaps be two variants to cater for the different variants of RED. However, as the TFRC-SP authors point out, there is no way for a transport to know whether some queues on its path have deployed RED with byte-mode packet drop (except if an exhaustive survey found that no one has deployed it! -- see Appendix A). Incidentally, VP-TFRC also proposed that byte-mode RED dropping should really square the packet-size compensation factor (like that of Cnodder's RED_5, but apparently unaware of it).
Pre-congestion notification [RFC5670] is an IETF technology to use a virtual queue for AQM marking for packets within one Diffserv class in order to give early warning prior to any real queuing. The PCN-marking algorithms have been designed not to take into account packet size when forwarding through queues. Instead, the general principle has been to take the sizes of marked packets into account when monitoring the fraction of marking at the edge of the network, as recommended here.
4.2.3. Making Transports Robust against Control Packet Losses
Recently, two RFCs have defined changes to TCP that make it more robust against losing small control packets [RFC5562] [RFC5690]. In both cases, they note that the case for these two TCP changes would be weaker if RED were biased against dropping small packets. We argue here that these two proposals are a safer and more principled way to achieve TCP performance improvements than reverse engineering RED to benefit TCP.
Although there are no known proposals, it would also be possible and perfectly valid to make control packets robust against drop by requesting a scheduling class with lower drop probability, which would be achieved by re-marking to a Diffserv code point [RFC2474] within the same behaviour aggregate.
Although not brought to the IETF, a simple proposal from Wischik [DupTCP] suggests that the first three packets of every TCP flow should be routinely duplicated after a short delay. It shows that this would greatly improve the chances of short flows completing quickly, but it would hardly increase traffic levels on the Internet, because Internet bytes have always been concentrated in the large flows. It further shows that the performance of many typical applications depends on completion of long serial chains of short messages. It argues that, given most of the value people get from the Internet is concentrated within short flows, this simple expedient would greatly increase the value of the best-effort Internet at minimal cost. A similar but more extensive approach has been evaluated on Google servers [GentleAggro].
The proposals discussed in this sub-section are experimental approaches that are not yet in wide operational use, but they are existence proofs that transports can make themselves robust against loss of control packets. The examples are all TCP-based, but applications over non-TCP transports could mitigate loss of control packets by making similar use of Diffserv, data duplication, FEC, etc.
4.2.4. Congestion Notification: Summary of Conflicting Advice
+-----------+-----------------+-----------------+-------------------+
| transport | RED_1 (packet-  | RED_4 (linear   | RED_5 (square     |
| cc        | mode drop)      | byte-mode drop) | byte-mode drop)   |
+-----------+-----------------+-----------------+-------------------+
| TCP or    | s/sqrt(p)       | sqrt(s/p)       | 1/sqrt(p)         |
| TFRC      |                 |                 |                   |
| TFRC-SP   | 1/sqrt(p)       | 1/sqrt(s*p)     | 1/(s*sqrt(p))     |
+-----------+-----------------+-----------------+-------------------+
Table 2: Dependence of flow bit rate per RTT on packet size, s, and drop probability, p, when there is network and/or transport bias towards small packets to varying degrees
Table 2 aims to summarise the potential effects of all the advice from different sources. Each column shows a different possible AQM behaviour in different queues in the network, using the terminology of Cnodder et al. outlined earlier (RED_1 is basic RED with packet-mode drop). Each row shows a different transport behaviour: TCP [RFC5681] and TFRC [RFC5348] on the top row with TFRC-SP [RFC4828] below. Each cell shows how the bits per round trip of a flow depends on packet size, s, and drop probability, p. In order to declutter the formulae to focus on packet-size dependence, they are all given per round trip, which removes any RTT term.
Let us assume that the goal is for the bit rate of a flow to be independent of packet size. Suppressing all inessential details, the table shows that this should either be achievable by not altering the TCP transport in a RED_5 network, or using the small packet TFRC-SP transport (or similar) in a network without any byte-mode dropping RED (top right and bottom left). Top left is the 'do nothing' scenario, while bottom right is the 'do both' scenario in which the bit rate would become far too biased towards small packets. Of course, if any form of byte-mode dropping RED has been deployed on a subset of queues that congest, each path through the network will present a different hybrid scenario to its transport.
Whatever the case, we can see that the linear byte-mode drop column in the middle would considerably complicate the Internet. Even if one believes the network should be doing the biasing, linear byte-mode drop is a half-way house that doesn't bias enough towards small packets. Section 2 recommends that _all_ bias in network equipment towards small packets should be turned off -- if indeed any equipment vendors have implemented it -- leaving packet-size bias solely as the preserve of the transport layer (solely the leftmost, packet-mode drop column).
In practice, it seems that no deliberate bias towards small packets has been implemented for production networks. Of the 19% of vendors who responded to a survey of 84 equipment vendors, none had implemented byte-mode drop in RED (see Appendix A for details).
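The cells of Table 2 can be reproduced with a short sketch under stated assumptions. It assumes the transport sends a number of packets per round trip proportional to 1/sqrt(p_eff), where p_eff is the drop probability the flow actually experiences. It also assumes RED_1, RED_4 and RED_5 scale p by s^0, s^1 and s^2 respectively, with s taken as packet size relative to the maximum, and that a TFRC-SP-like transport removes the factor of s from its bits per round trip. Constants of proportionality are dropped, and the function and dictionary names are hypothetical.

```python
import math

# Power of the relative packet size s applied to the drop probability.
AQM_SIZE_EXPONENT = {"RED_1": 0, "RED_4": 1, "RED_5": 2}


def relative_bits_per_rtt(s: float, p: float, aqm: str,
                          transport_compensates: bool) -> float:
    """Bits per round trip, up to a constant, for relative packet size s.

    transport_compensates=False models TCP or TFRC;
    transport_compensates=True models a TFRC-SP-like transport.
    """
    p_eff = p * s ** AQM_SIZE_EXPONENT[aqm]
    packets_per_rtt = 1.0 / math.sqrt(p_eff)
    # TCP/TFRC send packets_per_rtt packets of size s; a transport that
    # compensates for packet size sends as if every packet were full size.
    size_factor = 1.0 if transport_compensates else s
    return size_factor * packets_per_rtt


if __name__ == "__main__":
    p = 0.01
    for compensates in (False, True):
        label = "TFRC-SP" if compensates else "TCP/TFRC"
        for aqm in ("RED_1", "RED_4", "RED_5"):
            large = relative_bits_per_rtt(1.0, p, aqm, compensates)
            small = relative_bits_per_rtt(0.04, p, aqm, compensates)
            print(f"{label:<9} {aqm}  s=1: {large:8.2f}  s=0.04: {small:8.2f}")
```

Under these assumptions, the result is the same for large and small packets only in the top-right cell (TCP with RED_5) and the bottom-left cell (TFRC-SP with RED_1). In the bottom-right 'do both' cell, the small-packet flow gets a much higher rate than the large-packet flow.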