Step 1: Update RACK.min_RTT.
Use the RTT measurements obtained via [
RFC 6298] or [
RFC 7323] to update the estimated minimum RTT in RACK.min_RTT. The sender
SHOULD track a windowed min-filtered estimate of recent RTT measurements that can adapt when migrating to significantly longer paths rather than tracking a simple global minimum of all RTT measurements.
Step 2: Update the state for the most recently sent segment that has been delivered.
In this step, RACK updates the state that tracks the most recently sent segment that has been delivered: RACK.segment. RACK maintains its latest transmission timestamp in RACK.xmit_ts and its highest sequence number in RACK.end_seq. These two variables are used in later steps to estimate if some segments not yet delivered were likely lost. Given the information provided in an ACK, each segment cumulatively ACKed or SACKed is marked as delivered in the scoreboard. Because an ACK can also acknowledge retransmitted data segments and because retransmissions can be spurious, the sender needs to take care to avoid spurious inferences. For example, if the sender were to use timing information from a spurious retransmission, the RACK.rtt could be vastly underestimated.
To avoid spurious inferences, ignore a segment as invalid if any of its sequence range has been retransmitted before and if either of two conditions is true:
-
The Timestamp Echo Reply field (TSecr) of the ACK's timestamp option [RFC 7323], if available, indicates the ACK was not acknowledging the last retransmission of the segment.
-
The segment was last retransmitted less than RACK.min_rtt ago.
The second check is a heuristic when the TCP Timestamp option is not available or when the round-trip time is less than the TCP Timestamp clock granularity.
Among all the segments newly ACKed or SACKed by this ACK that pass the checks above, update the RACK.rtt to be the RTT sample calculated using this ACK. Furthermore, record the most recent Segment.xmit_ts in RACK.xmit_ts if it is ahead of RACK.xmit_ts. If Segment.xmit_ts equals RACK.xmit_ts (e.g., due to clock granularity limits), then compare Segment.end_seq and RACK.end_seq to break the tie when deciding whether to update the RACK.segment's associated state.
Step 2 may be summarized in pseudocode as:
RACK_sent_after(t1, seq1, t2, seq2):
If t1 > t2:
Return true
Else if t1 == t2 AND seq1 > seq2:
Return true
Else:
Return false
RACK_update():
For each Segment newly acknowledged, cumulatively or selectively,
in ascending order of Segment.xmit_ts:
rtt = Now() - Segment.xmit_ts
If Segment.retransmitted is TRUE:
If ACK.ts_option.echo_reply < Segment.xmit_ts:
Continue
If rtt < RACK.min_rtt:
Continue
RACK.rtt = rtt
If RACK_sent_after(Segment.xmit_ts, Segment.end_seq
RACK.xmit_ts, RACK.end_seq):
RACK.xmit_ts = Segment.xmit_ts
RACK.end_seq = Segment.end_seq
Step 3: Detect data segment reordering.
To detect reordering, the sender looks for original data segments being delivered out of order. To detect such cases, the sender tracks the highest sequence selectively or cumulatively acknowledged in the RACK.fack variable. ".fack" stands for the most "Forward ACK" (this term is adopted from [
FACK]). If a never-retransmitted segment that's below RACK.fack is (selectively or cumulatively) acknowledged, it has been delivered out of order. The sender sets RACK.reordering_seen to TRUE if such a segment is identified.
RACK_detect_reordering():
For each Segment newly acknowledged, cumulatively or selectively,
in ascending order of Segment.end_seq:
If Segment.end_seq > RACK.fack:
RACK.fack = Segment.end_seq
Else if Segment.end_seq < RACK.fack AND
Segment.retransmitted is FALSE:
RACK.reordering_seen = TRUE
Step 4: Update the RACK reordering window.
The RACK reordering window, RACK.reo_wnd, serves as an adaptive allowance for settling time before marking a segment as lost. This step documents a detailed algorithm that follows the principles outlined in the
section.
If no reordering has been observed based on the
previous step, then one way the sender can enter fast recovery is when the number of SACKed segments matches or exceeds DupThresh (similar to [
RFC 6675]). Furthermore, when no reordering has been observed, the RACK.reo_wnd is set to 0 both upon entering and during fast recovery or RTO recovery.
Otherwise, if some reordering has been observed, then RACK does not trigger fast recovery based on DupThresh.
Whether or not reordering has been observed, RACK uses the reordering window to assess whether any segments can be marked as lost. As a consequence, the sender also enters fast recovery when there are any number of SACKed segments, as long as the reorder window has passed for some non-SACKed segments.
When the reordering window is not set to 0, it starts with a conservative RACK.reo_wnd of RACK.min_RTT/4. This value was chosen because Linux TCP used the same factor in its implementation to delay Early Retransmit [
RFC 5827] to reduce spurious loss detections in the presence of reordering, and experience showed this worked reasonably well [
DMCG11].
However, the reordering detection in the previous step,
Step 3, has a self-reinforcing drawback when the reordering window is too small to cope with the actual reordering. When that happens, RACK could spuriously mark reordered segments as lost, causing them to be retransmitted. In turn, the retransmissions can prevent the necessary conditions for
Step 3 to detect reordering since this mechanism requires ACKs or SACKs only for segments that have never been retransmitted. In some cases, such scenarios can persist, causing RACK to continue to spuriously mark segments as lost without realizing the reordering window is too small.
To avoid the issue above, RACK dynamically adapts to higher degrees of reordering using DSACK options from the receiver. Receiving an ACK with a DSACK option indicates a possible spurious retransmission, suggesting that RACK.reo_wnd may be too small. The RACK.reo_wnd increases linearly for every round trip in which the sender receives some DSACK option so that after N round trips in which a DSACK is received, the RACK.reo_wnd becomes (N+1) * min_RTT / 4, with an upper-bound of SRTT.
If the reordering is temporary, then a large adapted reordering window would unnecessarily delay loss recovery later. Therefore, RACK persists using the inflated RACK.reo_wnd for up to 16 loss recoveries, after which it resets RACK.reo_wnd to its starting value, min_RTT / 4. The downside of resetting the reordering window is the risk of triggering spurious fast recovery episodes if the reordering remains high. The rationale for this approach is to bound such spurious recoveries to approximately once every 16 recoveries (less than 7%).
To track the linear scaling factor for the adaptive reordering window, RACK uses the variable RACK.reo_wnd_mult, which is initialized to 1 and adapts with the observed reordering.
The following pseudocode implements the above algorithm for updating the RACK reordering window:
RACK_update_reo_wnd():
/* DSACK-based reordering window adaptation */
If RACK.dsack_round is not None AND
SND.UNA >= RACK.dsack_round:
RACK.dsack_round = None
/* Grow the reordering window per round that sees DSACK.
Reset the window after 16 DSACK-free recoveries */
If RACK.dsack_round is None AND
any DSACK option is present on latest received ACK:
RACK.dsack_round = SND.NXT
RACK.reo_wnd_mult += 1
RACK.reo_wnd_persist = 16
Else if exiting Fast or RTO recovery:
RACK.reo_wnd_persist -= 1
If RACK.reo_wnd_persist <= 0:
RACK.reo_wnd_mult = 1
If RACK.reordering_seen is FALSE:
If in Fast or RTO recovery:
Return 0
Else if RACK.segs_sacked >= DupThresh:
Return 0
Return min(RACK.reo_wnd_mult * RACK.min_RTT / 4, SRTT)
Step 5: Detect losses.
For each segment that has not been SACKed, RACK considers that segment lost if another segment that was sent later has been delivered and the reordering window has passed. RACK considers the reordering window to have passed if the RACK.segment was sent a sufficient time after the segment in question, if a sufficient time has elapsed since the RACK.segment was S/ACKed, or some combination of the two. More precisely, RACK marks a segment as lost if:
RACK.xmit_ts >= Segment.xmit_ts
AND
RACK.xmit_ts - Segment.xmit_ts + (now - RACK.ack_ts) >= RACK.reo_wnd
Solving this second condition for "now", the moment at which a segment is marked as lost, yields:
now >= Segment.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)
Then (RACK.ack_ts - RACK.xmit_ts) is the round-trip time of the most recently (re)transmitted segment that's been delivered. When segments are delivered in order, the most recently (re)transmitted segment that's been delivered is also the most recently delivered; hence, RACK.rtt == RACK.ack_ts - RACK.xmit_ts. But if segments were reordered, then the segment delivered most recently was sent before the most recently (re)transmitted segment. Hence, RACK.rtt > (RACK.ack_ts - RACK.xmit_ts).
Since RACK.RTT >= (RACK.ack_ts - RACK.xmit_ts), the previous equation reduces to saying that the sender can declare a segment lost when:
now >= Segment.xmit_ts + RACK.reo_wnd + RACK.rtt
In turn, that is equivalent to stating that a RACK sender should declare a segment lost when:
Segment.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0
Note that if the value on the left-hand side is positive, it represents the remaining wait time before the segment is deemed lost. But this risks a timeout (RTO) if no more ACKs come back (e.g., due to losses or application-limited transmissions) to trigger the marking. For timely loss detection, it is
RECOMMENDED that the sender install a reordering timer. This timer expires at the earliest moment when RACK would conclude that all the unacknowledged segments within the reordering window were lost.
The following pseudocode implements the algorithm above. When an ACK is received or the RACK reordering timer expires, call RACK_detect_loss_and_arm_timer(). The algorithm breaks timestamp ties by using the TCP sequence space since high-speed networks often have multiple segments with identical timestamps.
RACK_detect_loss():
timeout = 0
RACK.reo_wnd = RACK_update_reo_wnd()
For each segment, Segment, not acknowledged yet:
If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
Segment.xmit_ts, Segment.end_seq):
remaining = Segment.xmit_ts + RACK.rtt +
RACK.reo_wnd - Now()
If remaining <= 0:
Segment.lost = TRUE
Segment.xmit_ts = INFINITE_TS
Else:
timeout = max(remaining, timeout)
Return timeout
RACK_detect_loss_and_arm_timer():
timeout = RACK_detect_loss()
If timeout != 0
Arm the RACK timer to call
RACK_detect_loss_and_arm_timer() after timeout
As an optimization, an implementation can choose to check only segments that have been sent before RACK.xmit_ts. This can be more efficient than scanning the entire SACK scoreboard, especially when there are many segments in flight. The implementation can use a separate doubly linked list ordered by Segment.xmit_ts, insert a segment at the tail of the list when it is (re)transmitted, and remove a segment from the list when it is delivered or marked as lost. In Linux TCP, this optimization improved CPU usage by orders of magnitude during some fast recovery episodes on high-speed WAN networks.