As a network of networks, the Internet consists of a large variety of links and systems that support a wide variety of tasks and workloads. The service provided by the network varies from best-effort delivery among loosely connected components to highly predictable delivery within controlled environments (e.g., between physically connected nodes, within a tightly controlled data center). Each path through the network has a set of path properties, e.g., available capacity, delay, and packet loss. Given the range of networks that make up the Internet, these properties range from largely static to highly dynamic.
This document provides guidelines for developing an understanding of one path property: packet loss. In particular, we offer guidelines, gradually learned over the last several decades, for developing and implementing time-based loss detectors. We focus on the general case where the loss properties of a path are (a) unknown a priori and (b) dynamically varying over time. Further, while there are numerous root causes of packet loss, we leverage the conservative notion that loss is an implicit indication of congestion [RFC 5681]. While this stance is not always correct, as a general assumption it has historically served us well [Jac88]. As we discuss further in Section 2, the guidelines in this document should be viewed as a general default for unicast communication across best-effort networks and not as optimal -- or even applicable -- for all situations.
Given that packet loss is routine in best-effort networks, loss detection is a crucial activity for many protocols and applications and is generally undertaken for two major reasons:
(1)  Ensuring reliable data delivery

     This requires a data sender to develop an understanding of which transmitted packets have not arrived at the receiver. This knowledge allows the sender to retransmit missing data.

(2)  Congestion control

     As we mention above, packet loss is often taken as an implicit indication that the sender is transmitting too fast and is overwhelming some portion of the network path. Data senders can therefore use loss to trigger transmission rate reductions (a brief illustrative sketch follows this list).
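To make the second point concrete, the following sketch shows one conventional reaction to detected loss: a multiplicative reduction of the sending rate, in the spirit of standard TCP congestion control [RFC 5681]. The sketch is purely illustrative; the Python language, the halving factor, and the floor value are conveniences for exposition, not requirements of this document.

   # Illustrative only: react to an inferred loss by multiplicatively
   # reducing the sending rate.  The halving factor mirrors standard TCP
   # congestion control; the floor simply keeps the rate positive.
   def reduce_rate_on_loss(current_rate_pps, floor_pps=1.0):
       """Return a reduced sending rate after loss has been detected."""
       return max(current_rate_pps / 2.0, floor_pps)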
Various mechanisms are used to detect losses in a packet stream. Often, we use continuous or periodic acknowledgments from the recipient to inform the sender's notion of which pieces of data are missing. However, despite our best intentions and most robust mechanisms, we cannot place ultimate faith in receiving such acknowledgments but can only truly depend on the passage of time. Therefore, our ultimate backstop for ensuring that we detect all loss is a timeout. That is, the sender sets some expectation for how long to wait for confirmation of delivery for a given piece of data. When this time period passes without delivery confirmation, the sender concludes the data was lost in transit.
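As a purely illustrative sketch of such a timeout backstop, a sender might track unacknowledged data as shown below. The class name, the fixed one-second default, and the Python language are conveniences for exposition; the requirements later in this document govern how a real timeout should be chosen and managed.

   import time

   class TimeoutBackstop:
       """Illustrative timeout-based loss detector (not normative)."""

       def __init__(self, timeout_seconds=1.0):
           self.timeout = timeout_seconds  # how long to wait for delivery confirmation
           self.outstanding = {}           # packet identifier -> time the packet was sent

       def on_send(self, packet_id):
           # Record the send time of a packet awaiting confirmation.
           self.outstanding[packet_id] = time.monotonic()

       def on_ack(self, packet_id):
           # Delivery confirmed; stop tracking this packet.
           self.outstanding.pop(packet_id, None)

       def expired(self):
           # Return packets whose confirmation has not arrived within the
           # timeout; the sender would treat these as lost.
           now = time.monotonic()
           return [pid for pid, sent in self.outstanding.items()
                   if now - sent >= self.timeout]

A sender using such a backstop would periodically call expired() and treat any returned packets as lost, retransmitting them and reducing its sending rate as discussed above.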
The specifics of time-based loss detection schemes represent a tradeoff between correctness and responsiveness. In other words, we wish to simultaneously:
-  wait long enough to ensure the detection of loss is correct, and

-  minimize the amount of delay we impose on applications (before repairing loss) and the network (before we reduce the congestion).
Serving both of these goals is difficult, as they pull in opposite directions [AP99]. By not waiting long enough to accurately determine a packet has been lost, we may provide a needed retransmission in a timely manner but risk both sending unnecessary ("spurious") retransmissions and needlessly lowering the transmission rate. By waiting long enough that we are unambiguously certain a packet has been lost, we cannot repair losses in a timely manner and we risk prolonging network congestion.
Many protocols and applications -- such as TCP [RFC 6298], SCTP [RFC 4960], and SIP [RFC 3261] -- use their own time-based loss detection mechanisms. At this point, our experience leads us to recognize that specific tweaks that deviate from standardized time-based loss detectors often do not materially impact network safety with respect to congestion control [AP99]. Therefore, in this document we outline a set of high-level, protocol-agnostic requirements for time-based loss detection. The intent is to provide a safe foundation on which implementations have the flexibility to instantiate mechanisms that best realize their specific goals.