Contributions on these Path Aware techniques were analyzed to arrive at the Lessons Learned captured in
Section 4.
Our expectation is that most readers will not need to read through this section carefully, but we wanted to record these hard-fought lessons as a service to others who may revisit this document, so they'll have the details close at hand.
The suggested references for Stream Transport are:
The first version of Stream Transport, ST [
IEN-119], was published in the late 1970s and was implemented and deployed on the ARPANET at small scale. It was used throughout the 1980s for experimental transmission of voice, video, and distributed simulation.
The second version of the ST specification (ST2) [
RFC 1190] [
RFC 1819] was an experimental connection-oriented internetworking protocol that operated at the same layer as connectionless IP. ST2 packets could be distinguished by their IP header version numbers (IP, at that time, used version number 4, while ST2 used version number 5).
ST2 used a control plane layered over IP to select routes and reserve capacity for real-time streams across a network path, based on a flow specification communicated by a separate protocol. The flow specification could be associated with QoS state in routers, producing an experimental resource reservation protocol. This allowed ST2 routers along a path to offer end-to-end guarantees, primarily to satisfy the QoS requirements for real-time services over the Internet.
Although implemented in a range of equipment, ST2 was not widely used after completion of the experiments. It did not offer the scalability and fate-sharing properties that have come to be desired by the Internet community.
The ST2 protocol is no longer in use.
As time passed, the trade-off between router processing and link capacity changed. Links became faster, and router processing became comparatively more expensive.
The ST2 control protocol used "hard state" -- once a route was established, and resources were reserved, routes and resources existed until they were explicitly released via signaling. A soft-state approach was thought superior to this hard-state approach and led to development of the IntServ model described in
Section 6.2.
The suggested references for IntServ are:
In 1994, when the IntServ architecture document [
RFC 1633] was published, real-time traffic was first appearing on the Internet. At that time, bandwidth was still a scarce commodity. Internet Service Providers built networks over DS3 (45 Mbps) infrastructure, and sub-rate (< 1 Mbps) access was common. Therefore, the IETF anticipated a need for a fine-grained QoS mechanism.
In the IntServ architecture, some applications can require service guarantees. Therefore, those applications use RSVP [
RFC 2205] to signal QoS reservations across network paths. Every router in the network that participates in IntServ maintains per-flow soft state to a) perform call admission control and b) deliver guaranteed service.
Applications use Flow Specifications (Flow Specs, or FLOWSPECs) [
RFC 2210] to describe the traffic that they emit. RSVP reserves capacity for traffic on a per-Flow-Spec basis.
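As an illustration of the state involved, the sketch below (Python, illustrative only) models a Flow Spec using the token bucket parameters of the RFC 2210 TSpec and a toy per-flow admission check of the kind an IntServ router performs. The class names and the simple "sum of reserved token rates must fit within link capacity" rule are assumptions made for this sketch, not the normative IntServ algorithms.

   # Illustrative sketch only. The TSpec fields mirror the RFC 2210 token
   # bucket parameters; the admission rule and all names are assumptions,
   # not the normative IntServ admission-control algorithm.
   from dataclasses import dataclass

   @dataclass
   class TSpec:
       token_rate: float       # r, bytes/second
       bucket_depth: float     # b, bytes
       peak_rate: float        # p, bytes/second
       min_policed_unit: int   # m, bytes
       max_packet_size: int    # M, bytes

   class IntServRouterState:
       """Per-flow soft state that every participating router must keep."""

       def __init__(self, link_capacity: float):
           self.link_capacity = link_capacity   # bytes/second
           self.reservations = {}               # flow identifier -> TSpec

       def admit(self, flow_id, tspec: TSpec) -> bool:
           """Admit the flow only if the reserved token rates still fit the link."""
           reserved = sum(t.token_rate for t in self.reservations.values())
           if reserved + tspec.token_rate > self.link_capacity:
               return False                     # admission control rejects the flow
           self.reservations[flow_id] = tspec   # the per-flow state that concerned operators
           return True

   # Example: reserving 1 Mbps for one flow on a 45 Mbps (DS3) link.
   router = IntServRouterState(link_capacity=45_000_000 / 8)
   flow = TSpec(token_rate=125_000, bucket_depth=10_000, peak_rate=250_000,
                min_policed_unit=64, max_packet_size=1500)
   print(router.admit(("192.0.2.1", "198.51.100.2", 17, 5004, 5006), flow))

The point of the sketch is the assignment inside admit(): every accepted reservation adds state that each on-path router must hold and refresh for the lifetime of the flow.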
Although IntServ has been used in enterprise and government networks, IntServ was never widely deployed on the Internet because of its cost. The following factors contributed to operational cost:
-
IntServ must be deployed on every router along a path where IntServ is to be used. A router that does not participate in IntServ can be included on the controlled path, but if that router is likely to become a bottleneck, IntServ cannot be used to avoid the bottleneck.
-
IntServ maintains per-flow state in every participating router.
As IntServ was being discussed, the following occurred:
-
For many expected uses, it became more cost effective to solve the QoS problem by adding bandwidth. Between 1994 and 2000, Internet Service Providers upgraded their infrastructures from DS3 (45 Mbps) to OC-48 (2.4 Gbps). This meant that even if an endpoint was using IntServ in an IntServ-enabled network, its requests would rarely, if ever, be denied, so endpoints and Internet Service Providers had little reason to enable IntServ.
-
Diffserv [RFC 2475] offered a more cost-effective, albeit less fine-grained, solution to the QoS problem.
The following lessons were learned:
-
Any mechanism that requires every participating on-path router to maintain per-flow state is not likely to succeed, unless the additional cost for offering the feature can be recovered from the user.
-
Any mechanism that requires an operator to upgrade all of its routers is not likely to succeed, unless the additional cost for offering the feature can be recovered from the user.
In environments where IntServ has been deployed, trust relationships with endpoints are very different from trust relationships on the Internet itself. There are often clearly defined hierarchies in Service Level Agreements (SLAs) governing well-defined transport flows operating with predetermined capacity and latency requirements over paths where capacity or other attributes are constrained.
IntServ was never widely deployed to manage capacity across the Internet. However, the technique that it produced was deployed for reasons other than bandwidth management. RSVP is widely deployed as an MPLS signaling mechanism. BGP reuses the RSVP concept of Filter Specs to distribute firewall filters, although they are called "Flow Spec Component Types" in BGP [
RFC 5575].
The suggested references for Quick-Start TCP are:
-
"[Quick-Start for TCP and IP]" [RFC 4782]
-
"Determining an appropriate sending rate over an underutilized network path" [SAF07]
-
"Fast Startup Internet Congestion Control for Broadband Interactive Applications" [Sch11]
-
"Using Quick-Start to enhance TCP-friendly rate control performance in bidirectional satellite networks" [QS-SAT]
Quick-Start is defined in an Experimental RFC [
RFC 4782] and is a TCP extension that leverages support from the routers on the path to determine an allowed initial sending rate for a path through the Internet, either at the start of data transfers or after idle periods. Without information about the path, a sender cannot easily determine an appropriate initial sending rate. The default TCP congestion control therefore uses the safe but time-consuming slow-start algorithm [
RFC 5681]. With Quick-Start, connections are allowed to use higher initial sending rates if there is significant unused bandwidth along the path and if the sender and all of the routers along the path approve the request.
By examining the Time To Live (TTL) field in Quick-Start packets, a sender can determine if routers on the path have approved the Quick-Start request. However, this method is unable to take into account the routers hidden by tunnels or other network nodes invisible at the IP layer.
The protocol also includes a nonce that provides protection against cheating routers and receivers. If the Quick-Start request is explicitly approved by all routers along the path, the TCP host can send at up to the approved rate; otherwise, TCP would use the default congestion control. Quick-Start requires modifications in the involved end systems as well as in routers. Due to the resulting deployment challenges, Quick-Start was only proposed in [
RFC 4782] for controlled environments.
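The TTL comparison can be illustrated with a minimal sketch (Python). The function names and the simplified router model are assumptions for illustration; the real protocol, per RFC 4782, also carries a rate request and the nonce mentioned above. Every router decrements the IP TTL, but only routers that approve the request also decrement the Quick-Start TTL, so a non-participating router changes the difference between the two fields and the sender falls back to standard congestion control.

   # Minimal sketch of the Quick-Start TTL comparison from RFC 4782.
   # Names and the simplified router model are illustrative assumptions.

   def sender_ttl_diff(ip_ttl: int, qs_ttl: int) -> int:
       """Sender records (IP TTL - QS TTL) mod 256 when the request is sent."""
       return (ip_ttl - qs_ttl) % 256

   def router_forward(ip_ttl: int, qs_ttl: int, approves: bool):
       """Every router decrements the IP TTL; only approving routers also
       decrement the Quick-Start TTL."""
       return ip_ttl - 1, (qs_ttl - 1) if approves else qs_ttl

   def request_approved(sent_diff: int, ip_ttl_rcvd: int, qs_ttl_rcvd: int) -> bool:
       """If any router left the QS TTL alone, the difference changes and the
       sender must fall back to the default congestion control."""
       return sent_diff == (ip_ttl_rcvd - qs_ttl_rcvd) % 256

   # Example: three routers on the path; the second does not support Quick-Start.
   ip_ttl, qs_ttl = 64, 190                     # the QS TTL starts at a random value
   diff = sender_ttl_diff(ip_ttl, qs_ttl)
   for approves in (True, False, True):
       ip_ttl, qs_ttl = router_forward(ip_ttl, qs_ttl, approves)
   print(request_approved(diff, ip_ttl, qs_ttl))  # False -> use standard slow start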
The Quick-Start mechanism is a lightweight, coarse-grained, in-band, network-assisted fast startup mechanism. The benefits are studied by simulation in a research paper [
SAF07] that complements the protocol specification. The study confirms that Quick-Start can significantly speed up mid-sized data transfers. That paper also presents router algorithms that do not require keeping per-flow state. Later studies [
Sch11] comprehensively analyze Quick-Start with a full Linux implementation and with a router fast-path prototype using a network processor. In both cases, Quick-Start could be implemented with limited additional complexity.
However, experiments with Quick-Start in [
Sch11] revealed several challenges:
-
Having information from the routers along the path can reduce the risk of congestion but cannot avoid it entirely. Determining whether there is unused capacity is not trivial in actual router and host implementations. Data about available capacity visible at the IP layer may be imprecise, and due to the propagation delay, information can already be outdated when it reaches a sender. There is a trade-off between the speedup of data transfers and the risk of congestion even with Quick-Start. This could be mitigated by only allowing Quick-Start to access a proportion of the unused capacity along a path.
-
For scalable router fast-path implementations, it is important to enable parallel processing of packets, as this is a widely used method, e.g., in network processors. One challenge is synchronization of information between packets that are processed in parallel, which should be avoided as much as possible.
-
Only some types of application traffic can benefit from Quick-Start. Capacity needs to be requested and discovered. The discovered capacity needs to be utilized by the flow, or it implicitly becomes available for other flows. Failing to use the requested capacity may have already reduced the pool of Quick-Start capacity that was made available to other competing Quick-Start requests. The benefit is greatest when senders use this only for bulk flows and avoid sending unnecessary Quick-Start requests, e.g., for flows that only send a small amount of data. Choosing an appropriate request size requires application-internal knowledge that is not commonly expressed by the transport API. How a sender can determine the rate for an initial Quick-Start request is still a largely unsolved problem.
There is no known deployment of Quick-Start for TCP or other IETF transports.
Some lessons can be learned from Quick-Start. Despite being a very lightweight protocol, Quick-Start suffers from poor incremental deployment properties regarding both a) the required modifications in network infrastructure and b) its interactions with applications. Except for corner cases, congestion control can be quite efficiently performed end to end in the Internet, and in modern stacks there is not much room for significant improvement by additional network support.
After publication of the Quick-Start specification, there have been large-scale experiments with an initial window of up to 10 segments [
RFC 6928]. This alternative "IW10" approach can also ramp up data transfers faster than the standard congestion control, but it only requires sender-side modifications. As a result, this approach can be deployed more easily and incrementally in the Internet. While theoretically Quick-Start can outperform "IW10", the improvement in data transfer completion times can, in many cases, be small. After publication of [
RFC 6928], most modern TCP stacks have increased their default initial window.
The suggested reference for ICMP Source Quench is:
The ICMP Source Quench message [
RFC 0792] allowed an on-path router to request the source of a flow to reduce its sending rate. This method allowed a router to provide an early indication of impending congestion on a path to the sources that contribute to that congestion.
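The message itself was minimal. The sketch below (Python) builds a Source Quench message with the layout defined in RFC 0792: type 4, code 0, a checksum, a 32-bit unused field, and then the IP header plus the first 64 bits of the datagram that triggered the message. The helper names are illustrative.

   # Sketch of an ICMP Source Quench message per RFC 792; helper names are
   # illustrative. Note that the message carries no rate information at all.
   import struct

   def internet_checksum(data: bytes) -> int:
       """One's-complement sum of 16-bit words, as used by ICMP."""
       if len(data) % 2:
           data += b"\x00"
       total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
       while total >> 16:
           total = (total & 0xFFFF) + (total >> 16)
       return ~total & 0xFFFF

   def build_source_quench(offending_ip_header: bytes, first_8_bytes: bytes) -> bytes:
       body = offending_ip_header + first_8_bytes
       header = struct.pack("!BBHI", 4, 0, 0, 0)     # type=4, code=0, cksum=0, unused
       cksum = internet_checksum(header + body)
       return struct.pack("!BBHI", 4, 0, cksum, 0) + body

As the sketch makes visible, the message tells the source only that some router wants it to slow down; it carries no explicit rate information, and how the source should respond was left unspecified, as discussed below.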
This method was deployed in Internet routers over a period of time; the reaction of endpoints to receiving this signal has varied. For low-speed links with low multiplexing of flows, the method could be used to regulate (momentarily reduce) the transmission rate. However, the simple signal does not scale with link speed or with the number of flows sharing a link.
The approach was overtaken by the evolution of congestion control methods in TCP [
RFC 2001], and later also by other IETF transports. Because these methods were based upon measurement of the end-to-end path and an algorithm in the endpoint, they were able to evolve and mature more rapidly than methods relying on interactions between operational routers and endpoint stacks.
After ICMP Source Quench was specified, the IETF began to recommend that transports provide end-to-end congestion control [
RFC 2001]. The Source Quench method has been obsoleted by the IETF [
RFC 6633], and both hosts and routers must now silently discard this message.
This method had several problems.
First, [
RFC 0792] did not sufficiently specify how the sender would react to the ICMP Source Quench signal from the path (e.g., [
RFC 1016]). There was ambiguity in how the sender should utilize this additional information. This could lead to unfairness in the way that senders (or routers) responded to this message.
Second, while the message did provide additional information, the Explicit Congestion Notification (ECN) mechanism [
RFC 3168] provided a more robust and informative signal for network nodes to provide early indication that a path has become congested.
The mechanism originated at a time when the Internet trust model was very different. Most endpoint implementations did not attempt to verify that the message originated from an on-path node before they utilized the message. This made it vulnerable to Denial-of-Service (DoS) attacks. In theory, routers might have chosen to use the quoted packet contained in the ICMP payload to validate that the message originated from an on-path node, but this would have increased per-packet processing overhead for each router along the path and would have required transport functionality in the router to verify whether the quoted packet header corresponded to a packet the router had sent. In addition,
Section 5.2 of
RFC 4443 noted ICMPv6-based attacks on hosts that would also have threatened routers processing ICMPv6 Source Quench payloads. As time passed, it became increasingly obvious that the lack of validation of the messages exposed receivers to a security vulnerability where the messages could be forged to create a tangible DoS opportunity.
The suggested references for TRIGTRAN are:
TCP [
RFC 0793] has a well-known weakness -- the end-to-end flow control mechanism has only a single signal, the loss of a segment, detected when no acknowledgment for the lost segment is received at the sender. There are multiple reasons why the sender might not have received an acknowledgment: the segment could have been trapped in a routing loop, damaged in transmission and discarded after failing checksum verification at the receiver, or discarded by some intermediate device; alternatively, any of a variety of things could have happened to the acknowledgment on its way back from the receiver to the sender. TCP implementations since the late 1980s have made the "safe" decision and have interpreted the loss of a segment as evidence that the path between two endpoints may have become congested enough to exhaust buffers on intermediate hops, so that the TCP sender should "back off" -- reduce its sending rate until it knows that its segments are now being delivered without loss [
RFC 5681].
The thinking behind TRIGTRAN was that if a path completely stopped working because a link along the path was "down", somehow something along the path could signal TCP when that link returned to service, and the sending TCP could retry immediately, without waiting for a full retransmission timeout (RTO) period.
The early dreams for TRIGTRAN were dashed because of an assumption that TRIGTRAN triggers would be unauthenticated. This meant that any "safe" TRIGTRAN mechanism would have relied on a mechanism such as setting the IPv4 TTL or IPv6 Hop Count to 255 at a sender and testing that it was 254 upon receipt, so that a receiver could verify that a signal was generated by an adjacent sender known to be on the path being used and not some unknown sender that might not even be on the path (e.g., "The Generalized TTL Security Mechanism (GTSM)" [
RFC 5082]). This situation is very similar to the case for ICMP Source Quench messages as described in
Section 6.4, which were also unauthenticated and could be sent by an off-path attacker, resulting in deprecation of ICMP Source Quench message processing [
RFC 6633].
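A minimal sketch of that TTL-based check (Python) is shown below. The names and the one-hop threshold follow the example in the preceding paragraph and are illustrative; they are not the full GTSM procedure of RFC 5082.

   # Minimal sketch of a GTSM-style check: the sender always transmits with
   # TTL / Hop Limit 255, and the receiver discards anything whose remaining
   # TTL shows it could have come from farther away than the expected adjacency.
   # The one-hop threshold matches the example in the text; names are illustrative.

   SENT_TTL = 255

   def accept_trigger(received_ttl: int, max_hops_from_sender: int = 1) -> bool:
       """An off-path attacker many hops away cannot produce such a high TTL."""
       return received_ttl >= SENT_TTL - max_hops_from_sender

   print(accept_trigger(254))   # True: plausibly from the adjacent first-hop router
   print(accept_trigger(247))   # False: too many hops away to be the adjacent node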
TRIGTRAN's scope shrank from "the path is down" to "the first-hop link is down."
But things got worse.
Because TRIGTRAN triggers would only be provided when the first-hop link was "down", TRIGTRAN triggers couldn't replace normal TCP retransmission behavior if the path failed because some link further along the network path was "down". So TRIGTRAN triggers added complexity to an already-complex TCP state machine and did not allow any existing complexity to be removed.
There was also an issue that the TRIGTRAN signal was not sent in response to a specific host that had been sending packets and was instead a signal that stimulated a response by any sender on the link. Such a signal needs to scale when multiple flows are trying to use the same resource, yet the sender of a trigger has no understanding of how many of the potential traffic sources will respond by sending packets -- if recipients of the signal "back off" their responses to a trigger to improve scaling, that immediately reduces the benefit of the signal.
Finally, intermediate forwarding nodes required modification to provide TRIGTRAN triggers, but operators couldn't charge for TRIGTRAN triggers, so there was no way to recover the cost of modifying, testing, and deploying updated intermediate nodes.
Two TRIGTRAN BOFs were held, at IETF 55 [
TRIGTRAN-55] and IETF 56 [
TRIGTRAN-56], but this work was not chartered, and there was no interest in deploying TRIGTRAN unless it was chartered and standardized in the IETF.
The reasons why this work was not chartered, much less deployed, provide several useful lessons for researchers.
-
TRIGTRAN started with a plausible value proposition, but networking realities in the early 2000s forced reductions in scope that led directly to reductions in potential benefits but no corresponding reductions in costs and complexity.
-
These reductions in scope were the direct result of an inability for hosts to trust or authenticate TRIGTRAN signals they received from the network.
-
Operators did not believe they could charge for TRIGTRAN signaling, because first-hop links didn't fail frequently and TRIGTRAN provided no reduction in operating expenses, so there was little incentive to purchase and deploy TRIGTRAN-capable network equipment.
It is also worth noting that the targeted environment for TRIGTRAN in the late 1990s contained links with a relatively small number of directly connected hosts -- for instance, cellular or satellite links. The transport community was well aware of the dangers of sender synchronization based on multiple senders receiving the same stimulus at the same time, but the working assumption for TRIGTRAN was that there wouldn't be enough senders for this to be a meaningful problem. In the 2010s, it was common for a single "link" to support many senders and receivers, likely requiring TRIGTRAN senders to wait some random amount of time before sending after receiving a TRIGTRAN signal, which would have reduced the benefits of TRIGTRAN even more.
The suggested reference for Shim6 is:
The IPv6 routing architecture [
RFC 1887] assumed that most sites on the Internet would be identified by Provider Assigned IPv6 prefixes, so that Default-Free Zone routers only contained routes to other providers, resulting in a very small IPv6 global routing table.
For a single-homed site, this could work well. A multihomed site with only one upstream provider could also work well, although BGP multihoming from a single upstream provider was often a premium service (costing more than twice as much as two single-homed sites), and if the single upstream provider went out of service, all of the multihomed paths could fail simultaneously.
IPv4 sites often multihomed by obtaining Provider Independent prefixes and advertising these prefixes through multiple upstream providers. With the assumption that any multihomed IPv4 site would also multihome in IPv6, it seemed likely that IPv6 routing would be subject to the same pressures to announce Provider Independent prefixes, resulting in an IPv6 global routing table that exhibited the same explosive growth as the IPv4 global routing table. During the early 2000s, work began on a protocol that would provide multihoming for IPv6 sites without requiring sites to advertise Provider Independent prefixes into the IPv6 global routing table.
This protocol, called "Shim6", allowed two endpoints to exchange multiple addresses ("Locators") that all mapped to the same endpoint ("Identity"). After an endpoint learned multiple Locators for the other endpoint, it could send to any of those Locators with the expectation that those packets would all be delivered to the endpoint with the same Identity. Shim6 was an example of an "Identity/Locator Split" protocol.
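The Identity/Locator split can be illustrated with a toy mapping table (Python). This is not the Shim6 wire protocol or its failure-detection mechanism; the class, the method names, and the simple "switch to the next Locator on failure" rule are assumptions made only for illustration.

   # Toy illustration of an Identity/Locator split, not the Shim6 protocol:
   # upper layers keep addressing a stable Identity while the host maps it to
   # whichever peer Locator currently works. All names and the failover rule
   # are illustrative assumptions.

   class IdentityLocatorMap:
       def __init__(self):
           self._locators = {}   # identity -> list of peer locators, in preference order
           self._current = {}    # identity -> index of the locator in use

       def learn(self, identity, locators):
           self._locators[identity] = list(locators)
           self._current[identity] = 0

       def destination(self, identity):
           """Applications keep using the Identity; the map picks the Locator."""
           return self._locators[identity][self._current[identity]]

       def locator_failed(self, identity):
           """On a detected path failure, move to the next Locator for the same Identity."""
           self._current[identity] = (self._current[identity] + 1) % len(self._locators[identity])

   peers = IdentityLocatorMap()
   peers.learn("peer-identity", ["2001:db8:a::1", "2001:db8:b::1"])  # two PA prefixes
   print(peers.destination("peer-identity"))   # 2001:db8:a::1
   peers.locator_failed("peer-identity")       # the path via provider A fails
   print(peers.destination("peer-identity"))   # 2001:db8:b::1 -- the application never notices

Note that in this model the failover decision is made entirely by the host, which is exactly the property that raised the operational concerns discussed below.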
Shim6, as defined in [
RFC 5533] and related RFCs, provided a workable solution for IPv6 multihoming using Provider Assigned prefixes, including capability discovery and negotiation, and allowing end-to-end application communication to continue even in the face of path failure, because applications don't see Locator failures and continue to communicate with the same Identity using a different Locator.
Note that the problem being addressed was "site multihoming", but Shim6 was providing "host multihoming". That meant that the decision about what path would be used was under host control, not under edge router control.
Although more work could have been done to provide a better technical solution, the biggest impediments to Shim6 deployment were operational and business considerations. These impediments were discussed at multiple network operator group meetings, including [
Shim6-35] at [
NANOG-35].
The technical issues centered on concerns that Shim6 relied on the host to track all of its connections, while also tracking Identity/Locator mappings in the kernel and monitoring for failures so that it could recognize when a path in use had failed.
The operational issues centered on concerns that operators were performing traffic engineering on traffic aggregates. With Shim6, these traffic engineering policies would have to be pushed down to individual hosts.
In addition, operators would have no visibility or control over the decision of hosts choosing to switch to another path. They expressed concerns that relying on hosts to steer traffic exposed operator networks to oscillation based on feedback loops, if hosts moved from path to path frequently. Given that Shim6 was intended to support multihoming across operators, operators providing only one of the paths would have even less visibility as traffic suddenly appeared and disappeared on their networks.
In addition, firewalls that expected to find a TCP or UDP transport-level protocol header in the IP payload would see a Shim6 Identity header instead, and they would not perform transport-protocol-based firewalling functions because the firewall's normal processing logic would not look past the Identity header. The firewall would perform its default action, which would most likely be to drop packets that don't match any processing rule.
The business issues centered on Shim6 reducing or removing the ability of operators to sell BGP multihoming service to their own customers, a service that is often more expensive than two single-homed connectivity services.
It is extremely important to take operational concerns into account when a Path Aware protocol is making decisions about path selection that may conflict with existing operational practices and business considerations.
During discussions in the PANRG session at IETF 103 [
PANRG-103-Min], Lars Eggert, past Transport Area Director, pointed out that during charter discussions for the Multipath TCP Working Group [
MP-TCP], operators expressed concerns that customers could use Multipath TCP to load-share TCP connections across operators simultaneously and compare passive performance measurements across network paths in real time, changing the balance of power in those business relationships. Although the Multipath TCP Working Group was chartered, this concern could have acted as an obstacle to deployment.
Operator objections to Shim6 were focused on technical concerns, but this concern could have also been an obstacle to Shim6 deployment if the technical concerns had been overcome.
The suggested references for Next Steps in Signaling (NSIS) are:
The NSIS Working Group worked on signaling techniques for network-layer resources (e.g., QoS resource reservations, Firewall and NAT traversal).
When RSVP [
RFC 2205] was used in deployments, a number of questions came up about its perceived limitations and potential missing features. The issues noted in the NSIS Working Group charter [
NSIS-CHARTER-2001] include interworking between domains with different QoS architectures, mobility and roaming for IP interfaces, and complexity. Later, the lack of security in RSVP was also recognized [
RFC 4094].
The NSIS Working Group was chartered to tackle those issues and initially focused on QoS signaling as its primary use case. However, over time a new approach evolved that introduced a modular architecture using two application-specific signaling protocols: a) the NSIS Signaling Layer Protocol (NSLP) on top of b) a generic signaling transport protocol (the NSIS Transport Layer Protocol (NTLP)).
NTLP is defined in [
RFC 5971]. Two types of NSLPs are defined: an NSLP for QoS signaling [
RFC 5974] and an NSLP for NATs/firewalls [
RFC 5973].
The obstacles for deployment can be grouped into implementation-related aspects and operational aspects.
-
Implementation-related aspects:
Although NSIS provides benefits with respect to flexibility, mobility, and security compared to other network signaling techniques, hardware vendors were reluctant to deploy this solution, because it would require additional implementation effort and would result in additional complexity for router implementations.
NTLP mainly operates as a path-coupled signaling protocol, i.e., its messages are processed at the control plane of each intermediate node that is also forwarding the data flows. This requires a mechanism to intercept signaling packets while they are forwarded in the same manner (especially along the same path) as data packets. NSIS uses the IPv4 and IPv6 Router Alert Option (RAO) to allow for interception of those path-coupled signaling messages, and this technique requires router implementations to correctly understand and implement the handling of RAOs, e.g., to only process packets with RAOs of interest and to leave packets with irrelevant RAOs in the fast forwarding processing path (a comprehensive discussion of these issues can be found in [RFC 6398]). The latter was an issue with some router implementations at the time of standardization.
Another reason is that path-coupled signaling protocols that interact with routers and request manipulation of state at these routers (or any other network element in general) are under scrutiny: a packet (or sequence of packets) out of the mainly untrusted data path is requesting creation and manipulation of network state. This is seen as potentially dangerous (e.g., opens up a DoS threat to a router's control plane) and difficult for an operator to control. Path-coupled signaling approaches were considered problematic (see also Section 3 of RFC 6398). There are recommendations on how to secure NSIS nodes and deployments (e.g., [RFC 5981]).
-
Operational Aspects:
NSIS not only required trust between customers and their provider, but also among different providers. In particular, QoS signaling techniques would require some kind of dynamic SLA support that would imply (potentially quite complex) bilateral negotiations between different Internet Service Providers. This complexity was not considered to be justified, and increasing the bandwidth (and thus avoiding bottlenecks) was cheaper than actively managing network resource bottlenecks by using path-coupled QoS signaling techniques. Furthermore, an end-to-end path typically involves several provider domains, and these providers need to closely cooperate in cases of failures.
One goal of NSIS was to decrease the complexity of the signaling protocol, but a path-coupled signaling protocol comes with the intrinsic complexity of IP-based networks, beyond the complexity of the signaling protocol itself. Sources of intrinsic complexity include:
-
the presence of asymmetric routes between endpoints and routers.
-
the lack of security and trust at large in the Internet infrastructure.
-
the presence of different trust boundaries.
-
the effects of best-effort networks (e.g., robustness to packet loss).
-
divergence from the fate-sharing principle (e.g., state within the network).
Any path-coupled signaling protocol has to deal with these realities.
Operators view with suspicion the use of IPv4 and IPv6 Router Alert Options (RAOs) by end systems to signal routers along the path, because these end systems are usually not authenticated, and because heavy use of RAOs can easily increase the CPU load on routers that are designed to process most packets in a hardware "fast path" while diverting packets containing RAOs to a slower, more capable processor.
The suggested reference for IPv6 Flow Labels is:
IPv6 specifies a 20-bit Flow Label field [
RFC 6437], included in the fixed part of the IPv6 header and hence present in every IPv6 packet. An endpoint sets the value in this field to one of a set of pseudorandomly assigned values. If a packet is not part of any flow, the flow label value is set to zero [
RFC 3697]. A number of Standards Track and Best Current Practice RFCs (e.g., [
RFC 8085], [
RFC 6437], [
RFC 6438]) encourage IPv6 endpoints to set a non-zero value in this field. A multiplexing transport could choose to use multiple flow labels to allow the network to either independently forward its subflows or use one common value for the traffic aggregate. The flow label is present in all fragments. IPsec was originally put forward as one important use case for this mechanism, because the field is not encrypted and therefore remains usable by the network even when IPsec hides the transport headers [
RFC 6438].
Once set, the flow label can provide information that can help inform network nodes about subflows present at the transport layer, without needing to interpret the setting of upper-layer protocol fields [
RFC 6294]. This information can also be used to coordinate how aggregates of transport subflows are grouped when queued in the network and to select appropriate per-flow forwarding when choosing between alternate paths [
RFC 6438] (e.g., for Equal-Cost Multipath (ECMP) routing and Link Aggregation Groups (LAGs)).
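For example, a router or load balancer can select among equal-cost links by hashing only the IPv6 3-tuple of source address, destination address, and flow label, without parsing transport headers. The sketch below (Python) is illustrative; the particular hash and link-selection rule are assumptions, not any vendor's algorithm.

   # Sketch of the ECMP/LAG use described above (in the spirit of RFC 6438):
   # hash the IPv6 3-tuple (source, destination, flow label) to pick a link,
   # with no need to parse transport headers. The hash and the selection rule
   # are illustrative assumptions.
   import hashlib
   import ipaddress

   def ecmp_next_hop(src: str, dst: str, flow_label: int, links: list) -> str:
       key = (ipaddress.IPv6Address(src).packed +
              ipaddress.IPv6Address(dst).packed +
              flow_label.to_bytes(3, "big"))          # the 20-bit label fits in 3 bytes
       digest = hashlib.sha256(key).digest()
       return links[int.from_bytes(digest[:4], "big") % len(links)]

   links = ["eth0", "eth1", "eth2", "eth3"]
   # Two subflows between the same endpoints, distinguished only by flow label:
   print(ecmp_next_hop("2001:db8::1", "2001:db8::2", 0x12345, links))
   print(ecmp_next_hop("2001:db8::1", "2001:db8::2", 0x54321, links))
   # A zero flow label removes that entropy, pushing nodes back to 5-tuple methods.
   print(ecmp_next_hop("2001:db8::1", "2001:db8::2", 0, links))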
Despite the field being present in every IPv6 packet, the mechanism did not receive as much use as originally envisioned. One reason is that to be useful it requires engagement by two different stakeholders:
-
Endpoint Implementation:
For network nodes along a path to utilize the flow label, a non-zero value needs to be inserted in the field [RFC 6437] at the sending endpoint, and there needs to be an incentive for an endpoint to set an appropriate non-zero value. The value should reflect the level of aggregation that the traffic expects the network to provide. However, this requires the stack to know the granularity at which flows should be identified (or, conversely, which flows should receive aggregated treatment), i.e., which packets carry the same flow label. Therefore, setting a non-zero value may result in additional choices that need to be made by an application developer.
Although the original flow label standard [RFC 3697] forbids any encoding of meaning into the flow label value, the opportunity to use the flow label as a covert channel or to signal other meta-information may have raised concerns about setting a non-zero value [RFC 6437].
Until network methods that use the flow label are widely deployed, there is little incentive for an endpoint to set the field.
-
Operational support in network nodes:
A benefit can only be realized when a network node along the path also uses this information to inform its decisions. Network equipment (routers and/or middleboxes) need to include appropriate support in order to utilize the field when making decisions about how to classify flows or forward packets. The use of any optional feature in a network node also requires corresponding updates to operational procedures and therefore is normally only introduced when the cost can be justified.
A benefit from utilizing the flow label is expected to be increased quality of experience for applications -- but this comes at some operational cost to an operator and requires endpoints to set the field.
The flow label is a general-purpose header field for use by the path. Multiple uses have been proposed. One candidate use was to reduce the complexity of forwarding decisions. However, modern routers can use a "fast path", often taking advantage of hardware to accelerate processing. The method can assist in more complex forwarding, such as ECMP routing and load balancing.
Although [
RFC 6437] recommended that endpoints should by default choose uniformly distributed labels for their traffic, the specification permitted an endpoint to choose to set a zero value. This ability of endpoints to choose to set a flow label of zero has had consequences on deployability:
-
Before wide-scale support by endpoints, it would be impossible to rely on a non-zero flow label being set. Network nodes therefore would also need to employ other techniques to realize equivalent functions. One example is a method that assumes semantics for the transport source port field and uses it as entropy input to a network-layer hash. This use of a 5-tuple to classify a packet represents a layering violation [RFC 6294]. Once such alternative methods have been deployed, they increase the cost of deploying the standards-based method, even though they may offer less control to endpoints and can interact with other uses/interpretations of the field.
-
Even though the flow label is specified as an end-to-end field, some network paths have been observed to not transparently forward the flow label. This could result from non-conformant equipment or could indicate that some operational networks have chosen to reuse the protocol field for other (e.g., internal) purposes. This results in a lack of transparency and creates a deployment hurdle for endpoints expecting that they can set a flow label that will be utilized by the network. The more recent practice of "greasing" [GREASE] would suggest that a different outcome could have been achieved if endpoints were always required to set a non-zero value.
-
[RFC 1809] noted that the choice of flow label value can depend on the expectations of the traffic generated by an application, which suggests that an API should be provided to control the setting or the policy that is used. However, many currently available APIs do not have this support.
A growth in the use of encrypted transports (e.g., QUIC [
RFC 9000]) seems likely to raise issues similar to those discussed above and could motivate renewed interest in utilizing the flow label.
The suggested references for Explicit Congestion Notification (ECN) are:
In the early 1990s, the large majority of Internet traffic used TCP as its transport protocol, but TCP had no way to detect path congestion before the path was so congested that packets were being dropped. These congestion events could affect all senders using a path, either by "lockout", where long-lived flows monopolized the queues along a path, or by "full queues", where queues remain full, or almost full, for a long period of time.
In response to this situation, "Active Queue Management" (AQM) was deployed in the network. A number of AQM disciplines have been deployed, but one common approach was that routers dropped packets when a threshold buffer length was reached, so that transport protocols like TCP that were responsive to loss would detect this loss and reduce their sending rates. Random Early Detection (RED) was one such proposal in the IETF. As the name suggests, a router using RED as its AQM discipline, upon detecting that the time-averaged queue length had passed a threshold, would probabilistically choose incoming packets to drop [
RFC 2309].
Researchers suggested providing "explicit congestion notifications" to senders when routers along the path detected that their queues were building, giving senders an opportunity to "slow down" as if a loss had occurred, giving path queues time to drain, while the path still had sufficient buffer capacity to accommodate bursty arrivals of packets from other senders. This was proposed as an experiment in [
RFC 2481] and standardized in [
RFC 3168].
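A simplified sketch of the combined behavior (Python) is shown below: when the time-averaged queue length sits between the RED thresholds, the router marks ECN-capable packets with Congestion Experienced (CE) instead of dropping them. The two-bit codepoints are those defined in RFC 3168; the thresholds, the linear marking probability, and the overall structure are assumptions made for illustration, not a production AQM implementation.

   # Simplified sketch of RED-style AQM with ECN marking. The two-bit
   # codepoints are the RFC 3168 values; thresholds, the linear marking
   # probability, and the structure are illustrative assumptions.
   import random

   NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11   # ECN field codepoints

   def red_ecn_decision(avg_queue: float, min_th: float, max_th: float,
                        ecn_field: int) -> str:
       """Return 'forward', 'mark', or 'drop' for one arriving packet."""
       if avg_queue < min_th:
           return "forward"
       if avg_queue >= max_th:
           return "drop"                    # queue far too long; drop regardless
       p = (avg_queue - min_th) / (max_th - min_th)   # probability grows with queue
       if random.random() >= p:
           return "forward"
       if ecn_field in (ECT_0, ECT_1):
           return "mark"                    # set CE instead of dropping
       return "drop"                        # the sender is not ECN-capable

   random.seed(1)
   for field in (ECT_0, NOT_ECT):
       print(field, red_ecn_decision(avg_queue=40, min_th=20, max_th=60, ecn_field=field))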
A key aspect of ECN was the use of IP header fields rather than IP options to carry explicit congestion notifications, since the proponents recognized that
Many routers process the "regular" headers in IP packets more efficiently than they process the header information in IP options.
Unlike most of the Path Aware technologies included in this document, the story of ECN continues to the present day and encountered a large number of Lessons Learned during that time. The early history of ECN (non-)deployment provides Lessons Learned that were not captured by other contributions in
Section 6, so that is the emphasis in this section of the document.
ECN deployment relied on three factors -- support in client implementations, support in router implementations, and deployment decisions in operational networks.
The proponents of ECN did so much right, anticipating many of the Lessons Learned now recognized in
Section 4. They recognized the need to support incremental deployment (
Section 4.2). They considered the impact on router throughput (
Section 4.8). They even considered trust issues between end nodes and the network, for both non-compliant end nodes (
Section 4.10) and non-compliant routers (
Section 4.9).
They were rewarded with ECN being implemented in major operating systems, for both end nodes and routers. A number of implementations are listed under "Implementation and Deployment of ECN" at [
SallyFloyd].
What they did not anticipate was routers that would crash when they saw bits 6 and 7 in the IPv4 Type of Service (TOS) octet [
RFC 0791] / IPv6 Traffic Class field [
RFC 2460], which [
RFC 2481] redefined to be "Currently Unused", being set to a non-zero value.
As described in [
vista-impl] ("IGD" stands for "Intermediate Gateway Device"),
IGD problem #1: one of the most popular versions from one of the most popular vendors. When a data packet arrives with either ECT(0) or ECT(1) (indicating successful ECN capability negotiation) indicated, router crashed. Cannot be recovered at TCP layer [sic]
This implementation, which would be run on a significant percentage of Internet end nodes, was shipped with ECN disabled, as was true for several of the other implementations listed under "Implementation and Deployment of ECN" at [
SallyFloyd]. Even if subsequent router vendors fixed these implementations, ECN was still disabled on end nodes, and given the trade-off between the benefits of enabling ECN (somewhat better behavior during congestion) and the risks of enabling ECN (possibly crashing a router somewhere along the path), ECN tended to stay disabled on implementations that supported ECN for decades afterwards.
Of the contributions included in
Section 6, ECN may be unique in providing these lessons:
-
Even if you do everything right, you may trip over implementation bugs in devices you know nothing about, that will cause severe problems that prevent successful deployment of your Path Aware technology.
-
After implementations disable your Path Aware technology, it may take years, or even decades, to convince implementers to re-enable it by default.
These two lessons, taken together, could be summarized as "you get one chance to get it right."
During discussion of ECN at [
PANRG-110], we noted that "you get one chance to get it right" isn't quite correct today, because operating systems on so many host systems are frequently updated, and transport protocols like QUIC [
RFC 9000] are being implemented in user space and can be updated without touching installed operating systems. Neither of these factors was true in the early 2000s.
We think that these restatements of the ECN Lessons Learned are more useful for current implementers:
-
Even if you do everything right, you may trip over implementation bugs in devices you know nothing about, that will cause severe problems that prevent successful deployment of your Path Aware technology. Testing before deployment isn't enough to ensure successful deployment. It is also necessary to "deploy gently", which often means deploying for a small subset of users to gain experience and implementing feedback mechanisms to detect that user experience is being degraded.
-
After implementations disable your Path Aware technology, it may take years, or even decades, to convince implementers to re-enable it by default. This might be based on the difficulty of distributing implementations that enable it by default, but it is just as likely to be based on the "bad taste in the mouth" that implementers have after an unsuccessful deployment attempt that degraded user experience.
With these expansions, the two lessons, taken together, could be more helpfully summarized as "plan for failure" -- anticipate what your next step will be, if initial deployment is unsuccessful.
ECN deployment was also hindered by non-deployment of AQM in many devices, because of operator interest in QoS features provided in the network, rather than using the network to assist end systems in providing for themselves. But that's another story, and the AQM Lessons Learned are already covered in other contributions in
Section 6.