This section explains how IP fragmentation introduces fragility to Internet communication.
Virtual reassembly is a procedure in which a device conceptually reassembles a packet, forwards its fragments, and discards the reassembled copy. In Address plus Port (A+P) [RFC 6346] and Carrier-Grade NAT (CGN) [RFC 6888], virtual reassembly is required in order to correctly translate fragment addresses. It could be useful to address the problems in Sections 3.2, 3.3, 3.4, and 3.5.
Virtual reassembly is computationally expensive and holds state for indeterminate periods of time. Therefore, it is prone to errors and attacks (Section 3.7).
IP fragmentation causes problems for routers that implement policy-based routing.
When a router receives a packet, it identifies the next hop on route to the packet's destination and forwards the packet to that next hop. In order to identify the next hop, the router interrogates a local data structure called the Forwarding Information Base (FIB).
Normally, the FIB contains destination-based entries that map a destination prefix to a next hop. Policy-based routing allows destination-based and policy-based entries to coexist in the same FIB. A policy-based FIB entry maps multiple fields, drawn from either the IP or transport-layer header, to a next hop.
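+-------+-------------------+-----------------+-----------------------+---------------+
| Entry | Type              | Dest. Prefix    | Next Hdr / Dest. Port | Next Hop      |
+-------+-------------------+-----------------+-----------------------+---------------+
| 1     | Destination-based | 2001:db8::1/128 | Any / Any             | 2001:db8:2::2 |
| 2     | Policy-based      | 2001:db8::1/128 | TCP / 80              | 2001:db8:3::3 |
+-------+-------------------+-----------------+-----------------------+---------------+
Table 1: Policy-Based Routing FIB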
Assume that a router maintains the FIB in Table 1. The first FIB entry is destination-based. It maps a destination prefix 2001:db8::1/128 to a next hop 2001:db8:2::2. The second FIB entry is policy-based. It maps the same destination prefix 2001:db8::1/128 and a destination port (TCP / 80) to a different next hop (2001:db8:3::3). The second entry is more specific than the first.
When the router receives the first fragment of a packet that is destined for TCP port 80 on 2001:db8::1, it interrogates the FIB. Both FIB entries satisfy the query. The router selects the second FIB entry because it is more specific and forwards the packet to 2001:db8:3::3.
When the router receives the second fragment of the packet, it interrogates the FIB again. This time, only the first FIB entry satisfies the query, because the second fragment contains no indication that the packet is destined for TCP port 80. Therefore, the router selects the first FIB entry and forwards the packet to 2001:db8:2::2.
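The two lookups above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `FibEntry`, `FIB`, and `lookup` are hypothetical, "most specific" is approximated by prefix length plus the presence of protocol and port constraints, and a subsequent fragment is modeled as a packet whose destination port is unknown (`None`).

```python
from dataclasses import dataclass
from ipaddress import IPv6Address, IPv6Network
from typing import Optional


@dataclass(frozen=True)
class FibEntry:
    """Hypothetical FIB entry; None in protocol/dest_port means "Any"."""
    prefix: IPv6Network
    next_hop: IPv6Address
    protocol: Optional[str] = None
    dest_port: Optional[int] = None

    def matches(self, dst, protocol, dest_port):
        if dst not in self.prefix:
            return False
        if self.protocol is not None and self.protocol != protocol:
            return False
        # A fragment without a transport header has dest_port=None and
        # therefore can never satisfy a port constraint.
        if self.dest_port is not None and self.dest_port != dest_port:
            return False
        return True

    def specificity(self):
        return (self.prefix.prefixlen,
                self.protocol is not None,
                self.dest_port is not None)


# The two entries from Table 1.
FIB = [
    FibEntry(IPv6Network("2001:db8::1/128"), IPv6Address("2001:db8:2::2")),
    FibEntry(IPv6Network("2001:db8::1/128"), IPv6Address("2001:db8:3::3"),
             protocol="TCP", dest_port=80),
]


def lookup(dst, protocol=None, dest_port=None):
    """Return the next hop of the most specific matching entry, or None."""
    candidates = [e for e in FIB if e.matches(dst, protocol, dest_port)]
    if not candidates:
        return None
    return max(candidates, key=FibEntry.specificity).next_hop


dst = IPv6Address("2001:db8::1")
first_fragment_hop = lookup(dst, "TCP", 80)         # port is visible
second_fragment_hop = lookup(dst, "TCP", None)      # port is not visible
```

In this sketch the first fragment resolves to 2001:db8:3::3 and the second to 2001:db8:2::2, so the two fragments of one packet leave the router toward different next hops.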
Policy-based routing is also known as filter-based forwarding.
IP fragmentation causes problems for Network Address Translation (NAT) devices. When a NAT device detects a new, outbound flow, it maps that flow's source port and IP address to another source port and IP address. Having created that mapping, the NAT device translates:
- The source IP address and source port on each outbound packet.
- The destination IP address and destination port on each inbound packet.
Address plus Port (A+P) [RFC 6346] and Carrier-Grade NAT (CGN) [RFC 6888] are two common NAT strategies. In both approaches, the NAT device must virtually reassemble fragmented packets in order to translate and forward each fragment.
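As a rough illustration of why a NAT device needs per-packet state, the sketch below holds subsequent fragments until the first fragment (the only one carrying ports) has been seen, then releases them with the learned ports attached. The class `VirtualReassembler`, its `receive` method, and the packet key (source, destination, identification) are assumptions made for this example and are not taken from any of the cited documents. The sketch deliberately omits timeouts and memory limits.

```python
from collections import defaultdict


class VirtualReassembler:
    """Hypothetical sketch: learn ports from the first fragment and
    attach them to every other fragment of the same packet."""

    def __init__(self):
        self._ports = {}                 # packet key -> (src_port, dst_port)
        self._held = defaultdict(list)   # packet key -> fragments awaiting ports

    def receive(self, key, offset, payload, ports=None):
        """Return the (offset, payload, ports) fragments that can be forwarded now."""
        if offset == 0:
            # First fragment: it carries the transport header, so record the
            # ports and release anything that arrived earlier out of order.
            self._ports[key] = ports
            released = [(offset, payload, ports)]
            released += [(o, p, ports) for o, p in self._held.pop(key, [])]
            return released
        if key in self._ports:
            return [(offset, payload, self._ports[key])]
        # Ports not yet known: hold the fragment (state grows until the first
        # fragment arrives, which may be never).
        self._held[key].append((offset, payload))
        return []


vr = VirtualReassembler()
key = ("2001:db8::a", "2001:db8::1", 7)   # (source, destination, identification)
early = vr.receive(key, 1448, b"tail")
released = vr.receive(key, 0, b"head", ports=(50000, 80))
late = vr.receive(key, 2896, b"more")
```

Because nothing in this sketch ever discards held fragments or learned ports, it also shows where the indeterminate state lifetime comes from; a real device would need expiry and limits.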
As discussed in more detail in Section 3.7, IP fragmentation causes problems for stateless firewalls whose rules include TCP and UDP ports. Because port information is only available in the first fragment and not available in the subsequent fragments, the firewall is limited to the following options:
- Accept all subsequent fragments, possibly admitting certain classes of attack.
- Block all subsequent fragments, possibly blocking legitimate traffic.
Neither option is attractive.
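The two options can be made concrete with a minimal sketch. The rule set `ALLOWED_TCP_PORTS`, the function `stateless_filter`, and the `accept_subsequent` switch are hypothetical names chosen for this example; fragment offsets are in 8-octet units, and a subsequent fragment is modeled as having no visible destination port.

```python
from typing import Optional

ALLOWED_TCP_PORTS = {443}  # hypothetical rule: admit only TCP port 443


def stateless_filter(frag_offset: int,
                     dest_port: Optional[int],
                     accept_subsequent: bool) -> bool:
    """Return True if the packet or fragment is admitted."""
    if frag_offset == 0:
        # Unfragmented packet or first fragment: the port is visible,
        # so the port rule can be evaluated.
        return dest_port in ALLOWED_TCP_PORTS
    # Subsequent fragment: no transport header, so the port rule cannot be
    # evaluated. The firewall can only apply a blanket choice.
    return accept_subsequent


# Option 1: accept all subsequent fragments, including those of a packet
# whose first fragment was rejected.
admitted_stray = stateless_filter(185, None, accept_subsequent=True)

# Option 2: block all subsequent fragments, including those of a packet
# whose first fragment was admitted.
blocked_legit = not stateless_filter(185, None, accept_subsequent=False)
```

Either way the decision for a subsequent fragment is independent of the rule that governed its first fragment, which is exactly the limitation described above.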
IP fragmentation causes problems for Equal-Cost Multipath (ECMP), Link Aggregate Groups (LAG), and other stateless load-distribution technologies. In order to assign a packet or packet fragment to a link, an intermediate node executes a hash (i.e., load-distributing) algorithm. The following paragraphs describe a commonly deployed hash algorithm.
If the packet or packet fragment contains a transport-layer header, the algorithm accepts the following 5-tuple as input:
- IP Source Address.
- IP Destination Address.
- IPv4 Protocol or IPv6 Next Header.
- transport-layer source port.
- transport-layer destination port.
If the packet or packet fragment does not contain a transport-layer header, the algorithm accepts only the following 3-tuple as input:
- IP Source Address.
- IP Destination Address.
- IPv4 Protocol or IPv6 Next Header.
Therefore, non-fragmented packets belonging to a flow can be assigned to one link while fragmented packets belonging to the same flow can be divided between that link and another. This can cause suboptimal load distribution.
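A minimal sketch of the hash-input selection described above follows. The function names `hash_input` and `select_link` are hypothetical, and SHA-256 is used only as a deterministic stand-in; forwarding hardware typically uses much cheaper hash functions.

```python
import hashlib
from typing import Optional, Tuple


def hash_input(src: str, dst: str, proto: int,
               sport: Optional[int] = None,
               dport: Optional[int] = None) -> Tuple:
    """Build the 5-tuple if ports are visible, else fall back to the 3-tuple."""
    if sport is not None and dport is not None:
        return (src, dst, proto, sport, dport)
    return (src, dst, proto)


def select_link(key: Tuple, num_links: int) -> int:
    """Map a hash input onto one of num_links component links."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_links


# An unfragmented packet (or a first fragment) exposes the ports.
whole = hash_input("2001:db8::a", "2001:db8::1", 6, 50000, 80)
# A subsequent fragment of the same flow does not.
fragment = hash_input("2001:db8::a", "2001:db8::1", 6)

link_for_whole = select_link(whole, 4)
link_for_fragment = select_link(fragment, 4)
```

Because the two hash inputs differ, nothing ties `link_for_whole` to `link_for_fragment`; whether they coincide is a matter of chance, so packets of one flow may be spread over more than one link.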
[RFC 6438] offers a partial solution to this problem for IPv6 devices only. According to [RFC 6438]:
At intermediate routers that perform load distribution, the hash algorithm used to determine the outgoing component-link in an ECMP and/or LAG toward the next hop MUST minimally include the 3-tuple {dest addr, source addr, flow label} and MAY also include the remaining components of the 5-tuple.
If the algorithm includes only the 3-tuple {dest addr, source addr, flow label}, it will assign all fragments belonging to a packet to the same link. (See [RFC 6437] and [RFC 7098].)
In order to avoid the problem described above, implementations SHOULD implement the recommendations provided in Section 6.4 of this document.
IPv4 fragmentation is not sufficiently robust for use under some conditions in today's Internet. At high data rates, the 16-bit IP identification field is not large enough to prevent duplicate IDs, resulting in frequent incorrectly assembled IP fragments, and the TCP and UDP checksums are insufficient to prevent the resulting corrupted datagrams from being delivered to upper-layer protocols. [RFC 4963] describes some easily reproduced experiments demonstrating the problem and discusses some of the operational implications of these observations.
These reassembly issues do not occur as frequently in IPv6 because the IPv6 identification field is 32 bits long.
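The scale of the difference can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name `id_wrap_seconds` is hypothetical, and it assumes that every packet between one source and destination is fragmented, that each packet consumes one identification value, and that packets are sent back to back at the stated rate.

```python
def id_wrap_seconds(id_bits: int, packet_bytes: int, rate_bps: float) -> float:
    """Time until the identification space is exhausted and values repeat."""
    packets_before_wrap = 2 ** id_bits
    bits_before_wrap = packets_before_wrap * packet_bytes * 8
    return bits_before_wrap / rate_bps


# 1500-byte packets at 1 Gbit/s (illustrative values, chosen for this example).
ipv4_wrap = id_wrap_seconds(16, 1500, 1e9)   # 16-bit IPv4 Identification
ipv6_wrap = id_wrap_seconds(32, 1500, 1e9)   # 32-bit IPv6 Identification
```

Under these assumptions the 16-bit field wraps in roughly 0.79 seconds, while the 32-bit field takes roughly 51,540 seconds (about 14.3 hours). If a receiver holds incomplete datagrams for longer than the wrap time, a stale fragment can be combined with fragments of a newer packet that reuses the same identification value.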
Security researchers have documented several attacks that exploit IP fragmentation. The following are examples:
- Overlapping fragment attacks [RFC 1858] [RFC 3128] [RFC 5722].
- Resource exhaustion attacks.
- Attacks based on predictable fragment identification values [RFC 7739].
- Evasion of Network Intrusion Detection Systems (NIDS) [Ptacek1998].
In the overlapping fragment attack, an attacker constructs a series of packet fragments. The first fragment contains an IP header, a transport-layer header, and some transport-layer payload. This fragment complies with local security policy and is allowed to pass through a stateless firewall. A second fragment, having a nonzero offset, overlaps with the first fragment. The second fragment also passes through the stateless firewall. When the packet is reassembled, the transport-layer header from the first fragment is overwritten by data from the second fragment. The reassembled packet does not comply with local security policy. Had it traversed the firewall in one piece, the firewall would have rejected it.
A stateless firewall cannot protect against the overlapping fragment attack. However, destination nodes can protect against the overlapping fragment attack by implementing the procedures described in [RFC 1858], [RFC 3128], and [RFC 8200]. These reassembly procedures detect the overlap and discard the packet.
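The core of overlap detection is small. The following is a simplified sketch of the idea rather than the procedure from any of the cited documents: the function `has_overlap` is a hypothetical name, fragments are represented only as (offset, length) pairs in bytes, and exact duplicates are treated as overlaps, which a real implementation may choose to handle separately.

```python
from typing import Iterable, Tuple


def has_overlap(fragments: Iterable[Tuple[int, int]]) -> bool:
    """Return True if any two (offset_bytes, length_bytes) fragments overlap."""
    end_of_covered_data = 0
    for offset, length in sorted(fragments):
        if offset < end_of_covered_data:
            # This fragment starts inside data already covered by an
            # earlier fragment.
            return True
        end_of_covered_data = max(end_of_covered_data, offset + length)
    return False


# Benign packet: the second fragment starts where the first ends.
benign = [(0, 24), (24, 100)]
# Attack packet: the second fragment starts at byte 8, inside the
# transport-layer header carried by the first fragment.
attack = [(0, 24), (8, 100)]

discard_benign = has_overlap(benign)
discard_attack = has_overlap(attack)
```

A destination that applies this check before reassembly would keep the benign packet and discard the attack packet as a whole, instead of letting the second fragment overwrite the header.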
The fragment reassembly algorithm is a stateful procedure in an otherwise stateless protocol. Therefore, it can be exploited by resource exhaustion attacks. An attacker can construct a series of fragmented packets with one fragment missing from each packet so that the reassembly is impossible. Thus, this attack causes resource exhaustion on the destination node, possibly denying reassembly services to other flows. This type of attack can be mitigated by flushing fragment reassembly buffers when necessary, at the expense of possibly dropping legitimate fragments.
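One way to picture the mitigation and its cost is a reassembly table with a fixed capacity and a timeout. This is a sketch under stated assumptions, not a description of any particular stack: the class `ReassemblyTable`, its parameters, and the oldest-first eviction policy are all choices made for this example.

```python
import time
from collections import OrderedDict


class ReassemblyTable:
    """Hypothetical bounded table of incomplete packets, oldest first."""

    def __init__(self, max_entries=1024, timeout_s=30.0):
        self.max_entries = max_entries
        self.timeout_s = timeout_s
        self._entries = OrderedDict()  # key -> (first_seen, [fragments])

    def add_fragment(self, key, fragment, now=None):
        now = time.monotonic() if now is None else now
        self._expire(now)
        if key not in self._entries:
            if len(self._entries) >= self.max_entries:
                # Table full: flush the oldest incomplete packet. It may
                # belong to an attacker or to a legitimate sender.
                self._entries.popitem(last=False)
            self._entries[key] = (now, [])
        self._entries[key][1].append(fragment)

    def _expire(self, now):
        # Entries are ordered by first arrival, so stop at the first fresh one.
        while self._entries:
            key, (first_seen, _) = next(iter(self._entries.items()))
            if now - first_seen < self.timeout_s:
                break
            del self._entries[key]

    def pending(self):
        return list(self._entries)


table = ReassemblyTable(max_entries=2, timeout_s=30.0)
table.add_fragment("legit", b"part-1", now=0.0)
table.add_fragment("attack-1", b"x", now=1.0)
table.add_fragment("attack-2", b"x", now=2.0)   # evicts "legit"
after_flood = table.pending()
table.add_fragment("later", b"y", now=100.0)    # earlier entries time out
after_timeout = table.pending()
```

In this run the legitimate packet is the one evicted when the table fills, which illustrates the trade-off noted above: flushing protects the node's memory at the expense of possibly dropping legitimate fragments.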
Each IP fragment contains an "Identification" field that destination nodes use to reassemble fragmented packets. Some implementations set the Identification field to a predictable value, thus making it easy for an attacker to forge malicious IP fragments that would cause the reassembly procedure for legitimate packets to fail.
NIDS aims at identifying malicious activity by analyzing network traffic. Ambiguity in the possible result of the fragment reassembly process may allow an attacker to evade these systems. Many of these systems try to mitigate some of these evasion techniques (e.g., by computing all possible outcomes of the fragment reassembly process, at the expense of increased processing requirements).
As mentioned in Section 2.3, upper-layer protocols can be configured to rely on PMTUD. Because PMTUD relies upon the network to deliver ICMP PTB messages, those protocols also rely on the networks to deliver ICMP PTB messages.
According to [RFC 4890], ICMPv6 PTB messages must not be filtered. However, ICMP PTB delivery is not reliable. It is subject to both transient and persistent loss.
Transient loss of ICMP PTB messages can cause transient PMTU black holes. When the conditions contributing to transient loss abate, the network regains its ability to deliver ICMP PTB messages and connectivity between the source and destination nodes is restored.
Section 3.8.1 of this document describes conditions that lead to transient loss of ICMP PTB messages.
Persistent loss of ICMP PTB messages can cause persistent black holes. Sections 3.8.2, 3.8.3, and 3.8.4 of this document describe conditions that lead to persistent loss of ICMP PTB messages.
The problem described in this section is specific to PMTUD. It does not occur when the upper-layer protocol obtains its PMTU estimate from PLPMTUD or from any other source.
The following factors can contribute to transient loss of ICMP PTB messages:
- Network congestion.
- Packet corruption.
- Transient routing loops.
- ICMP rate limiting.
The effect of rate limiting may be severe, as [RFC 4443] recommends strict rate limiting of ICMPv6 traffic.
Incorrect implementation of security policy can cause persistent loss of ICMP PTB messages.
For example, assume that a Customer Premises Equipment (CPE) router implements the following zone-based security policy:
- Allow any traffic to flow from the inside zone to the outside zone.
- Do not allow any traffic to flow from the outside zone to the inside zone unless it is part of an existing flow (i.e., it was elicited by an outbound packet).
When a correct implementation of the above-mentioned security policy receives an ICMP PTB message, it examines the ICMP PTB payload in order to determine whether the original packet (i.e., the packet that elicited the ICMP PTB message) belonged to an existing flow. If the original packet belonged to an existing flow, the implementation allows the ICMP PTB to flow from the outside zone to the inside zone. If not, the implementation discards the ICMP PTB message.
When an incorrect implementation of the above-mentioned security policy receives an ICMP PTB message, it discards the packet because its source address is not associated with an existing flow.
The security policy described above has been implemented incorrectly on many consumer CPE routers.
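The difference between the correct and incorrect behavior comes down to which field is compared against the flow table. The sketch below assumes the ICMP PTB payload has already been parsed into the flow of the original packet; the types `Flow` and `IcmpPtb` and both policy functions are hypothetical names introduced for this example.

```python
from typing import NamedTuple, Set


class Flow(NamedTuple):
    inside_addr: str
    outside_addr: str
    protocol: str
    inside_port: int
    outside_port: int


class IcmpPtb(NamedTuple):
    source_addr: str      # the router that generated the PTB message
    embedded_flow: Flow   # flow of the original packet, from the PTB payload


def correct_policy(ptb: IcmpPtb, active_flows: Set[Flow]) -> bool:
    """Admit the PTB if the packet that elicited it belongs to an existing flow."""
    return ptb.embedded_flow in active_flows


def incorrect_policy(ptb: IcmpPtb, active_flows: Set[Flow]) -> bool:
    """Admit the PTB only if its own source address is a known flow peer."""
    return any(ptb.source_addr == f.outside_addr for f in active_flows)


flow = Flow("2001:db8:a::10", "2001:db8:b::20", "TCP", 50000, 443)
active = {flow}
# The PTB comes from an intermediate router, not from the flow's peer.
ptb = IcmpPtb(source_addr="2001:db8:c::1", embedded_flow=flow)

admitted_by_correct = correct_policy(ptb, active)
admitted_by_incorrect = incorrect_policy(ptb, active)
```

Since ICMP PTB messages normally originate at an intermediate router, the source-address check in the second function rejects them even though the original packet belonged to an existing flow, and the inside host never learns the smaller PMTU.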
Anycast can cause persistent loss of ICMP PTB messages. Consider the example below:
A DNS client sends a request to an anycast address. The network routes that DNS request to the nearest instance of that anycast address (i.e., a DNS server). The DNS server generates a response and sends it back to the DNS client. While the response does not exceed the DNS server's PMTU estimate, it does exceed the actual PMTU.
A downstream router drops the packet and sends an ICMP PTB message to the packet's source (i.e., the anycast address). The network routes the ICMP PTB message to the anycast instance closest to the downstream router. That anycast instance may not be the DNS server that originated the DNS response. It may be another DNS server with the same anycast address. The DNS server that originated the response may never receive the ICMP PTB message and may never update its PMTU estimate.
Unidirectional routing can cause persistent loss of ICMP PTB messages. Consider the example below:
A source node sends a packet to a destination node. All intermediate nodes maintain a route to the destination node but do not maintain a route to the source node. In this case, when an intermediate node encounters an MTU issue, it cannot send an ICMP PTB message to the source node.
In [RFC 7872], researchers sampled Internet paths to determine whether they would convey packets that contain IPv6 extension headers. Sampled paths terminated at popular Internet sites (e.g., popular web, mail, and DNS servers).
The study revealed that at least 28% of the sampled paths did not convey packets containing the IPv6 Fragment extension header. In most cases, fragments were dropped in the destination autonomous system. In other cases, the fragments were dropped in transit autonomous systems.
Another study [Huston] confirmed this finding. It reported that 37% of sampled endpoints used IPv6-capable DNS resolvers that were incapable of receiving a fragmented IPv6 response.
It is difficult to determine why network operators drop fragments. Possible causes follow:
- Hardware inability to process fragmented packets.
- Failure to change vendor defaults.
- Unintentional misconfiguration.
- Intentional configuration (e.g., network operators consciously choose to drop IPv6 fragments in order to address the issues raised in Sections 3.2 through 3.8, above).