This document mainly describes the issues behind serving stale data and intentionally does not provide a formal algorithm. The concept is not overly complex, and the details are best left to resolver authors to implement in their codebases. The processing of serve-stale is a local operation, and consistent variables between deployments are not needed for interoperability. However, we would like to highlight the impact of various implementation choices, starting with the timers involved.
The most obvious of these is the maximum stale timer. If this variable is too large, it could cause excessive cache memory usage, but if it is too small, the serve-stale technique becomes less effective, as the record may not be in the cache to be used if needed. Shorter values, even less than a day, can effectively handle the vast majority of outages. Longer values, as much as a week, give time for monitoring systems to notice a resolution problem and for human intervention to fix it; operational experience has been that sometimes the right people can be hard to track down and unfortunately slow to remedy the situation.
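For illustration, the maximum stale timer reduces to a single eligibility test on an expired cache entry. In the following minimal Python sketch, the entry's expiry attribute and the MAX_STALE constant are assumed names for the record's expiration time and the configured maximum stale timer, not a prescribed interface:

   import time

   MAX_STALE = 7 * 24 * 3600  # assumed configuration: maximum stale timer of one week

   def usable_as_stale(entry, now=None):
       # An expired entry may still be served, but only while it has been
       # expired for no longer than the maximum stale timer.
       now = now if now is not None else time.time()
       if now < entry.expiry:
           return False  # still fresh; the normal path applies
       return (now - entry.expiry) <= MAX_STALE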
Increased memory consumption could be mitigated by prioritizing removal of stale records over non-expired records during cache exhaustion. Eviction strategies could consider additional factors, including the last time of use or the popularity of a record, to retain active but stale records. A feature to manually flush only stale records could also be useful.
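One way such an eviction policy might be expressed is sketched below in Python; the field names (expiry, last_used, hit_count) are illustrative assumptions rather than a required data model:

   def eviction_key(entry, now):
       # Entries sort in eviction order: stale entries go before fresh ones,
       # and within each group the least recently used and least popular
       # entries go first.
       is_stale = now >= entry.expiry
       return (0 if is_stale else 1, entry.last_used, entry.hit_count)

   # When the cache is full, evict in ascending key order until enough
   # space has been reclaimed, e.g.:
   #   for entry in sorted(cache.values(), key=lambda e: eviction_key(e, now)): ...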
The client response timer is another variable that deserves consideration. If this value is too short, there is a risk that stale answers may be used even when the authoritative server is actually reachable but slow; this may result in undesirable answers being returned. Conversely, waiting too long will negatively impact user experience.
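A resolver might arrange this with a bounded wait on its normal resolution path, as in the following Python sketch; CLIENT_RESPONSE_TIMER, resolve_fresh, and lookup_stale are assumed names, and the value shown is only an example:

   import asyncio

   CLIENT_RESPONSE_TIMER = 1.8  # seconds; example value only

   async def answer_query(question, resolve_fresh, lookup_stale):
       task = asyncio.ensure_future(resolve_fresh(question))
       try:
           # Give the authoritative path a bounded amount of time to answer.
           return await asyncio.wait_for(asyncio.shield(task), CLIENT_RESPONSE_TIMER)
       except asyncio.TimeoutError:
           stale = lookup_stale(question)
           if stale is not None:
               # Serve the stale answer now; the shielded task keeps running
               # so the cache can still be refreshed if resolution succeeds.
               return stale
           return await task  # no stale data available; keep waiting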
The balance for the failure recheck timer is responsiveness in detecting the renewed availability of authorities versus the extra resource use for resolution. If this variable is set too large, stale answers may continue to be returned even after the authoritative server is reachable; per RFC 2308, Section 7, this should be no more than 5 minutes. If this variable is too small, authoritative servers may be targeted with a significant amount of excess traffic.
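In a simple form, the failure recheck timer only gates how soon a failed resolution is retried, as in this Python sketch; the per-query failure map and the constant are illustrative assumptions:

   import time

   FAILURE_RECHECK_TIMER = 30  # seconds; example value, no more than 5 minutes

   last_failed_attempt = {}  # (qname, qtype) -> time of the last failed resolution

   def should_recheck(key, now=None):
       # Retry the authorities only once the recheck interval has elapsed;
       # until then, stale answers are served straight from the cache.
       now = now if now is not None else time.time()
       last = last_failed_attempt.get(key)
       return last is None or (now - last) >= FAILURE_RECHECK_TIMER

   def record_failure(key, now=None):
       last_failed_attempt[key] = now if now is not None else time.time()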
Regarding the TTL to set on stale records in the response, historically TTLs of 0 seconds have been problematic for some implementations, and negative values can't effectively be communicated to existing software. Other very short TTLs could lead to congestive collapse as TTL-respecting clients rapidly try to refresh. The recommended value of 30 seconds not only sidesteps those potential problems with no practical negative consequences, it also rate-limits further queries from any client that honors the TTL, such as a forwarding resolver.
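Expressed as code, the choice amounts to a small branch when composing the response; this Python sketch assumes an entry carries its absolute expiry time:

   STALE_RESPONSE_TTL = 30  # seconds; the recommended TTL for stale answers

   def response_ttl(entry, now):
       # Fresh records advertise their remaining lifetime; stale records are
       # sent with a fixed 30-second TTL so TTL-honoring clients rate-limit
       # their retries rather than refreshing immediately on a 0 TTL.
       remaining = int(entry.expiry - now)
       return remaining if remaining > 0 else STALE_RESPONSE_TTL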
As for the change to treat a TTL with the high-order bit set as positive and then clamping it, as opposed to [RFC 2181] treating it as zero, the rationale here is basically one of engineering simplicity versus an inconsequential operational history. Negative TTLs had no rational intentional meaning that wouldn't have been satisfied by just sending 0 instead, and similarly there was realistically no practical purpose for sending TTLs of 2^25 seconds (1 year) or more. There's also no record of TTLs in the wild having the most significant bit set in the DNS Operations, Analysis, and Research Center's (DNS-OARC's) "Day in the Life" samples [DITL]. With no apparent reason for operators to use them intentionally, that leaves either errors or non-standard experiments as explanations as to why such TTLs might be encountered, with neither providing an obviously compelling reason as to why having the leading bit set should be treated differently from having any of the next eleven bits set and then capped per Section 4.
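The resulting decoding rule is short; in this Python sketch, TTL_CAP stands in for whatever cap the resolver applies under Section 4 and its value here is only an assumption:

   TTL_CAP = 604800  # assumed cap of one week; the actual cap is implementation-chosen

   def decode_ttl(raw_ttl):
       # Read the whole 32-bit field as an unsigned value, high-order bit
       # included, and clamp it, rather than treating values with the
       # high-order bit set as zero.
       value = raw_ttl & 0xFFFFFFFF
       return min(value, TTL_CAP)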
Another implementation consideration is the use of stale nameserver addresses for lookups. This is mentioned explicitly because, in some resolvers, getting the addresses for nameservers is a separate path from a normal cache lookup. If authoritative server addresses are not able to be refreshed, resolution can possibly still be successful if the authoritative servers themselves are up. For instance, consider an attack on a top-level domain that takes its nameservers offline; serve-stale resolvers that had expired glue addresses for subdomains within that top-level domain would still be able to resolve names within those subdomains, even those it had not previously looked up.
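The consideration is simply that the nameserver-address path honors stale data like the ordinary answer path; a Python sketch follows, where cache.lookup, resolve, and ResolutionError are assumed interfaces:

   class ResolutionError(Exception):
       # Placeholder for whatever the resolver raises when resolution fails.
       pass

   def nameserver_addresses(ns_name, cache, resolve):
       fresh = cache.lookup(ns_name, allow_stale=False)
       if fresh:
           return fresh
       try:
           return resolve(ns_name)  # attempt to refresh the addresses
       except ResolutionError:
           # Fall back to expired glue so resolution can still proceed if
           # the authoritative servers themselves are reachable.
           return cache.lookup(ns_name, allow_stale=True)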
The directive in Section 4 that only NoError and NXDomain responses should invalidate any previously associated answer stems from the fact that no other RCODEs that a resolver normally encounters make any assertions regarding the name in the question or any data associated with it. This comports with existing resolver behavior where a failed lookup (say, during prefetching) doesn't impact the existing cache state. Some authoritative server operators have said that they would prefer stale answers to be used in the event that their servers are responding with errors like ServFail instead of giving true authoritative answers. Implementers MAY decide to return stale answers in this situation.
Since the goal of serve-stale is to provide resiliency for all obvious errors to refresh data, these other RCODEs are treated as though they are equivalent to not getting an authoritative response. Although NXDomain for a previously existing name might well be an error, it is not handled that way because there is no effective way to distinguish operator intent for legitimate cases versus error cases.
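As a sketch of that behavior in Python (the constants are the standard RCODE values; the function name is illustrative):

   NOERROR, SERVFAIL, NXDOMAIN, REFUSED = 0, 2, 3, 5

   def invalidates_cached_answer(rcode):
       # Only NoError and NXDomain make an assertion about the queried name,
       # so only they may replace or invalidate cached data; any other RCODE
       # is treated as a failure to refresh, leaving stale data usable.
       return rcode in (NOERROR, NXDOMAIN)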
During discussion in the IETF, it was suggested that, if all authorities return responses with an RCODE of Refused, it may be an explicit signal to take down the zone from servers that still have the zone's delegation pointed to them. Refused, however, is also overloaded to mean multiple possible failures that could represent transient configuration failures. Operational experience has shown that purposely returning Refused is a poor way to achieve an explicit takedown of a zone compared to either updating the delegation or returning NXDomain with a suitable SOA for extended negative caching. Implementers MAY nonetheless consider whether to treat all authorities returning Refused as preempting the use of stale data.