4. Internet Protocol Mechanisms
4.1. Fragment Reassembly
To accommodate networks with different Maximum Transmission Units (MTUs), the Internet Protocol provides a mechanism for the fragmentation of IP packets by end-systems (hosts) and/or intermediate-systems (routers). Reassembly of fragments is performed only by the end-systems. [Cerf1974] provides the rationale for why packet reassembly is not performed by intermediate-systems. During the last few decades, IP fragmentation and reassembly has been exploited in a number of ways, to perform actions such as evading NIDSs, bypassing firewall rules, and performing DoS attacks. [Bendi1998] and [Humble1998] are examples of the exploitation of these issues for performing DoS attacks. [CERT1997] and [CERT1998b] document these issues. [Anderson2001] is a survey of fragmentation attacks. [US-CERT2001] is an example of the exploitation of IP fragmentation to bypass firewall rules. [CERT1999] describes the implementation of fragmentation attacks in Distributed Denial-of-Service (DDoS) attack tools. The problem with IP fragment reassembly basically has to do with the complexity of the function, in a number of aspects: o Fragment reassembly is a stateful operation for a stateless protocol (IP). The IP module at the host performing the reassembly function must allocate memory buffers both for temporarily storing the received fragments and to perform the reassembly function. Attackers can exploit this fact to exhaust memory buffers at the system performing the fragment reassembly. o The fragmentation and reassembly mechanisms were designed at a time in which the available bandwidths were very different from the bandwidths available nowadays. With the current available bandwidths, a number of interoperability problems may arise, and these issues may be intentionally exploited by attackers to perform DoS attacks. o Fragment reassembly must usually be performed without any knowledge of the properties of the path the fragments follow. Without this information, hosts cannot make any educated guess on how long they should wait for missing fragments to arrive.
o The fragment reassembly algorithm, as described by the IETF specifications, is ambiguous, and allows for a number of interpretations, each of which has found place in different TCP/IP stack implementations. o The reassembly process is somewhat complex. Fragments may arrive out of order, duplicated, overlapping each other, etc. This complexity has lead to numerous bugs in different implementations of the IP protocol.4.1.1. Security Implications of Fragment Reassembly
4.1.1.1. Problems Related to Memory Allocation
When an IP datagram is received by an end-system, it will be temporarily stored in system memory, until the IP module processes it and hands it to the protocol machine that corresponds to the encapsulated protocol. In the case of fragmented IP packets, while the IP module may perform preliminary processing of the IP header (such as checking the header for errors and processing IP options), fragments must be kept in system buffers until all fragments are received and are reassembled into a complete Internet datagram. As mentioned above, because the Internet layer will not usually have information about the characteristics of the path between the system and the remote host, no educated guess can be made on the amount of time that should be waited for the other fragments to arrive. Therefore, the specifications recommend to wait for a period of time that is acceptable for virtually all the possible network scenarios in which the protocols might operate. After that time has elapsed, all the received fragments for the corresponding incomplete packet are discarded. The original IP Specification, RFC 791 [RFC0791], states that systems should wait for at least 15 seconds for the missing fragments to arrive. Systems that follow the "Example Reassembly Procedure" described in Section 3.2 of RFC 791 may end up using a reassembly timer of up to 4.25 minutes, with a minimum of 15 seconds. Section 3.3.2 ("Reassembly") of RFC 1122 corrected this advice, stating that the reassembly timeout should be a fixed value between 60 and 120 seconds.
However, the longer the system waits for the missing fragments to arrive, the longer the corresponding system resources must be tied to the corresponding packet. The amount of system memory is finite, and even with today's systems, it can still be considered a scarce resource. An attacker could take advantage of the uncomfortable situation the system performing fragment reassembly is in, by sending forged fragments that will never reassemble into a complete datagram. That is, an attacker would send many different fragments, with different IP IDs, without ever sending all the necessary fragments that would be needed to reassemble them into a full IP datagram. For each of the fragments, the IP module would allocate resources and tie them to the corresponding fragment, until the reassembly timer for the corresponding packet expires. There are some implementation strategies which could increase the impact of this attack. For example, upon receipt of a fragment, some systems allocate a memory buffer that will be large enough to reassemble the whole datagram. While this might be beneficial in legitimate cases, this just amplifies the impact of the possible attacks, as a single small fragment could tie up memory buffers for the size of an extremely large (and unlikely) datagram. The implementation strategy suggested in RFC 815 [RFC0815] leads to such an implementation. The impact of the aforementioned attack may vary depending on some specific implementation details: o If the system does not enforce limits on the amount of memory that can be allocated by the IP module, then an attacker could tie all system memory to fragments, at which point the system would become unusable, perhaps crashing. o If the system enforces limits on the amount of memory that can be allocated by the IP module as a whole, then, when this limit is reached, all further IP packets that arrive would be discarded, until some fragments time out and free memory is available again. o If the system enforces limits on the amount memory that can be allocated for the reassembly of fragments, then, when this limit is reached, all further fragments that arrive would be discarded, until some fragment(s) time out and free memory is available again.
4.1.1.2. Problems That Arise from the Length of the IP Identification Field
The Internet Protocols are currently being used in environments that are quite different from the ones in which they were conceived. For instance, the availability of bandwidth at the time the Internet Protocol was designed was completely different from the availability of bandwidth in today's networks. The Identification field is a 16-bit field that is used for the fragmentation and reassembly function. In the event a datagram gets fragmented, all the corresponding fragments will share the same {Source Address, Destination Address, Protocol, Identification number} four-tuple. Thus, the system receiving the fragments will be able to uniquely identify them as fragments that correspond to the same IP datagram. At a given point in time, there must be at most only one packet in the network with a given four-tuple. If not, an Identification number "collision" might occur, and the receiving system might end up "mixing" fragments that correspond to different IP datagrams which simply reused the same Identification number. For example, sending over a 1 Gbit/s path a continuous stream of (UDP) packets of roughly 1 kb size that all get fragmented into two equally sized fragments of 576 octets each (minimum reassembly buffer size) would repeat the IP Identification values within less than 0.65 seconds (assuming roughly 10% link layer overhead); with shorter packets that still get fragmented, this figure could easily drop below 0.4 seconds. With a single IP packet dropped in this short time frame, packets would start to be reassembled wrongly and continuously once in such interval until an error detection and recovery algorithm at an upper layer lets the application back out. For each group of fragments whose Identification numbers "collide", the fragment reassembly will lead to corrupted packets. The IP payload of the reassembled datagram will be handed to the corresponding upper-layer protocol, where the error will (hopefully) be detected by some error detecting code (such as the TCP checksum) and the packet will be discarded. An attacker could exploit this fact to intentionally cause a system to discard all or part of the fragmented traffic it gets, thus performing a DoS attack. Such an attacker would simply establish a flow of fragments with different IP Identification numbers, to trash all or part of the IP Identification space. As a result, the receiving system would use the attacker's fragments for the reassembly of legitimate datagrams, leading to corrupted packets which would later (and hopefully) get dropped.
In most cases, use of a long fragment timeout will benefit the attacker, as forged fragments will keep the IP Identification space trashed for a longer period of time.4.1.1.3. Problems That Arise from the Complexity of the Reassembly Algorithm
As IP packets can be duplicated by the network, and each packet may take a different path to get to the destination host, fragments may arrive not only out of order and/or duplicated but also overlapping. This means that the reassembly process can be somewhat complex, with the corresponding implementation being not specifically trivial. [Shannon2001] analyzes the causes and attributes of fragment traffic measured in several types of WANs. During the years, a number of attacks have exploited bugs in the reassembly function of several operating systems, producing buffer overflows that have led, in most cases, to a crash of the attacked system.4.1.1.4. Problems That Arise from the Ambiguity of the Reassembly Process
Network Intrusion Detection Systems (NIDSs) typically monitor the traffic on a given network with the intent of identifying traffic patterns that might indicate network intrusions. In the presence of IP fragments, in order to detect illegitimate activity at the transport or application layers (i.e., any protocol layer above the network layer), a NIDS must perform IP fragment reassembly. In order to correctly assess the traffic, the result of the reassembly function performed by the NIDS should be the same as that of the reassembly function performed by the intended recipient of the packets. However, a number of factors make the result of the reassembly process ambiguous: o The IETF specifications are ambiguous as to what should be done in the event overlapping fragments were received. Thus, in the presence of overlapping data, the system performing the reassembly function is free to honor either the first set of data received, the latest copy received, or any other copy received in between.
o As the specifications do not enforce any specific fragment timeout value, different systems may choose different values for the fragment timeout. This means that given a set of fragments received at some specified time intervals, some systems will reassemble the fragments into a full datagram, while others may timeout the fragments and therefore drop them. o As mentioned before, as the fragment buffers get full, a DoS condition will occur unless some action is taken. Many systems flush part of the fragment buffers when some threshold is reached. Thus, depending on fragment load, timing issues, and flushing policy, a NIDS may get incorrect assumptions about how (and if) fragments are being reassembled by their intended recipient. As originally discussed by [Ptacek1998], these issues can be exploited by attackers to evade intrusion detection systems. There exist freely available tools to forcefully fragment IP datagrams so as to help evade Intrusion Detection Systems. Frag router [Song1999] is an example of such a tool; it allows an attacker to perform all the evasion techniques described in [Ptacek1998]. Ftester [Barisani2006] is a tool that helps to audit systems regarding fragmentation issues.4.1.1.5. Problems That Arise from the Size of the IP Fragments
One approach to fragment filtering involves keeping track of the results of applying filter rules to the first fragment (i.e., the fragment with a Fragment Offset of 0), and applying them to subsequent fragments of the same packet. The filtering module would maintain a list of packets indexed by the Source Address, Destination Address, Protocol, and Identification number. When the initial fragment is seen, if the MF bit is set, a list item would be allocated to hold the result of filter access checks. When packets with a non-zero Fragment Offset come in, look up the list element with a matching Source Address/Destination Address/Protocol/ Identification and apply the stored result (pass or block). When a fragment with a zero MF bit is seen, free the list element. Unfortunately, the rules of this type of packet filter can usually be bypassed. [RFC1858] describes the details of the involved technique.4.1.2. Possible Security Improvements
4.1.2.1. Memory Allocation for Fragment Reassembly
A design choice usually has to be made as to how to allocate memory to reassemble the fragments of a given packet. There are basically two options:
o Upon receipt of the first fragment, allocate a buffer that will be large enough to concatenate the payload of each fragment. o Upon receipt of the first fragment, create the first node of a linked list to which each of the following fragments will be linked. When all fragments have been received, copy the IP payload of each of the fragments (in the correct order) to a separate buffer that will be handed to the protocol being encapsulated in the IP payload. While the first of the choices might seem to be the most straightforward, it implies that even when a single small fragment of a given packet is received, the amount of memory that will be allocated for that fragment will account for the size of the complete IP datagram, thus using more system resources than what is actually needed. Furthermore, the only situation in which the actual size of the whole datagram will be known is when the last fragment of the packet is received first, as that is the only packet from which the total size of the IP datagram can be asserted. Otherwise, memory should be allocated for the largest possible packet size (65535 bytes). The IP module should also enforce a limit on the amount of memory that can be allocated for IP fragments, as well as a limit on the number of fragments that at any time will be allowed in the system. This will basically limit the resources spent on the reassembly process, and prevent an attacker from trashing the whole system memory. Furthermore, the IP module should keep a different buffer for IP fragments than for complete IP datagrams. This will basically separate the effects of fragment attacks on non-fragmented traffic. Most TCP/IP implementations, such as that in Linux and those in BSD- derived systems, already implement this. [Jones2002] analyzes the amount of memory that may be needed for the fragment reassembly buffer depending on a number of network characteristics.4.1.2.2. Flushing the Fragment Buffer
In the case of those attacks that aim to consume the memory buffers used for fragments, and those that aim to cause a collision of IP Identification numbers, there are a number of countermeasures that can be implemented.
Even with these countermeasures in place, there is still the issue of what to do when the buffer pool used for IP fragments gets full. Basically, if the fragment buffer is full, no instance of communication that relies on fragmentation will be able to progress. Unfortunately, there are not many options for reacting to this situation. If nothing is done, all the instances of communication that rely on fragmentation will experience a denial of service. Thus, the only thing that can be done is flush all or part of the fragment buffer, on the premise that legitimate traffic will be able to make use of the freed buffer space to allow communication flows to progress. There are a number of factors that should be taken into consideration when flushing the fragment buffers. First, if a fragment of a given packet (i.e., fragment with a given Identification number) is flushed, all the other fragments that correspond to the same datagram should be flushed. As in order for a packet to be reassembled all of its fragments must be received by the system performing the reassembly function, flushing only a subset of the fragments of a given packet would keep the corresponding buffers tied to fragments that would never reassemble into a complete datagram. Additionally, care must be taken so that, in the event that subsequent buffer flushes need to be performed, it is not always the same set of fragments that get dropped, as such a behavior would probably cause a selective DoS to the traffic flows to which that set of fragments belongs. Many TCP/IP implementations define a threshold for the number of fragments that, when reached, triggers a fragment-buffer flush. Some systems flush 1/2 of the fragment buffer when the threshold is reached. As mentioned before, the idea of flushing the buffer is to create some free space in the fragment buffer, on the premise that this will allow for new and legitimate fragments to be processed by the IP module, thus letting communication survive the overwhelming situation. On the other hand, the idea of flushing a somewhat large portion of the buffer is to avoid flushing always the same set of packets.4.1.2.3. A More Selective Fragment Buffer Flushing Strategy
One of the difficulties in implementing countermeasures for the fragmentation attacks described throughout Section 4.1 is that it is difficult to perform validation checks on the received fragments. For instance, the fragment on which validity checks could be performed, the first fragment, may be not the first fragment to arrive at the destination host.
Fragments cannot only arrive out of order because of packet reordering performed by the network, but also because the system (or systems) that fragmented the IP datagram may indeed transmit the fragments out of order. A notable example of this is the Linux TCP/IP stack, which transmits the fragments in reverse order. This means that we cannot enforce checks on the fragments for which we allocate reassembly resources, as the first fragment we receive for a given packet may be some other fragment than the first one (the one with an Fragment Offset of 0). However, at the point in which we decide to free some space in the fragment buffer, some refinements can be done to the flushing policy. The first thing we would like to do is to stop different types of traffic from interfering with each other. This means, in principle, that we do not want fragmented UDP traffic to interfere with fragmented TCP traffic. In order to implement this traffic separation for the different protocols, a different fragment buffer pool would be needed, in principle, for each of the 256 different protocols that can be encapsulated in an IP datagram. We believe a trade-off is to implement two separate fragment buffers: one for IP datagrams that encapsulate IPsec packets and another for the rest of the traffic. This basically means that traffic not protected by IPsec will not interfere with those flows of communication that are being protected by IPsec. The processing of each of these two different fragment buffer pools would be completely independent from each other. In the case of the IPsec fragment buffer pool, when the buffers needs to be flushed, the following refined policy could be applied: o First, for each packet for which the IPsec header has been received, check that the Security Parameters Index (SPI) field of the IPsec header corresponds to an existing IPsec Security Association (SA), and probably also check that the IPsec sequence number is valid. If the check fails, drop all the fragments that correspond to this packet. o Second, if still more fragment buffers need to be flushed, drop all the fragments that correspond to packets for which the full IPsec header has not yet been received. The number of packets for which this flushing is performed depends on the amount of free space that needs to be created. o Third, if after flushing packets with invalid IPsec information (First step), and packets on which validation checks could not be performed (Second step), there is still not enough space in the
fragment buffer, drop all the fragments that correspond to packets that passed the checks of the first step, until the necessary free space is created. The rationale behind this policy is that, at the point of flushing fragment buffers, we prefer to keep those packets on which we could successfully perform a number of validation checks, over those packets on which those checks failed, or the checks could not even be performed. By checking both the IPsec SPI and the IPsec sequence number, it is virtually impossible for an attacker that is off-path to perform a DoS attack to communication flows being protected by IPsec. Unfortunately, some IP implementations (such as that in Linux [Linux]), when performing fragmentation, send the corresponding fragments in reverse order. In such cases, at the point of flushing the fragment buffer, legitimate fragments will receive the same treatment as the possible forged fragments. This refined flushing policy provides an increased level of protection against this type of resource exhaustion attack, while not making the situation of out-of-order IPsec-secured traffic worse than with the simplified flushing policy described in the previous section.4.1.2.4. Reducing the Fragment Timeout
RFC 1122 [RFC1122] states that the reassembly timeout should be a fixed value between 60 and 120 seconds. The rationale behind these long timeout values is that they should accommodate any path characteristics, such as long-delay paths. However, it must be noted that this timer is really measuring inter-fragment delays, or, more specifically, fragment jitter. If all fragments take paths of similar characteristics, the inter- fragment delay will usually be, at most, a few seconds. Nevertheless, even if fragments take different paths of different characteristics, the recommended 60 to 120 seconds are, in practice, excessive. Some systems have already reduced the fragment timeout to 30 seconds [Linux]. The fragment timeout could probably be further reduced to approximately 15 seconds; although further research on this issue is necessary.
It should be noted that in network scenarios of long-delay and high- bandwidth (usually referred to as "Long-Fat Networks"), using a long fragment timeout would likely increase the probability of collision of IP ID numbers. Therefore, in such scenarios it is highly desirable to avoid the use of fragmentation with techniques such as PMTUD [RFC1191] or PLPMTUD [RFC4821].4.1.2.5. Countermeasure for Some NIDS Evasion Techniques
[Shankar2003] introduces a technique named "Active Mapping" that prevents evasion of a NIDS by acquiring sufficient knowledge about the network being monitored, to assess which packets will arrive at the intended recipient, and how they will be interpreted by it. [Novak2005] describes some techniques that are applied by the Snort [Snort] NIDS to avoid evasion.4.1.2.6. Countermeasure for Firewall-Rules Bypassing
One of the classical techniques to bypass firewall rules involves sending packets in which the header of the encapsulated protocol is fragmented. Even when it would be legal (as far as the IETF specifications are concerned) to receive such a packets, the MTUs of the network technologies used in practice are not that small to require the header of the encapsulated protocol to be fragmented (e.g., see [RFC2544]). Therefore, the system performing reassembly should drop all packets which fragment the upper-layer protocol header, and this event should be logged (e.g., a counter could be incremented to reflect the packet drop). Additionally, given that many middle-boxes such as firewalls create state according to the contents of the first fragment of a given packet, it is best that, in the event an end-system receives overlapping fragments, it honors the information contained in the fragment that was received first. RFC 1858 [RFC1858] describes the abuse of IP fragmentation to bypass firewall rules. RFC 3128 [RFC3128] corrects some errors in RFC 1858.4.2. Forwarding
4.2.1. Precedence-Ordered Queue Service
Section 5.3.3.1 of RFC 1812 [RFC1812] states that routers should implement precedence-ordered queue service. This means that when a packet is selected for output on a (logical) link, the packet of highest precedence that has been queued for that link is sent. Section 5.3.3.2 of RFC 1812 advises routers to default to maintaining strict precedence-ordered service.
Unfortunately, given that it is trivial to forge the IP precedence field of the IP header, an attacker could simply forge a high precedence number in the packets it sends to illegitimately get better network service. If precedence-ordered queued service is not required in a particular network infrastructure, it should be disabled, and thus all packets would receive the same type of service, despite the values in their Type of Service or Differentiated Services fields. When precedence-ordered queue service is required in the network infrastructure, in order to mitigate the attack vector discussed in the previous paragraph, edge routers or switches should be configured to police and remark the Type of Service or Differentiated Services values, according to the type of service at which each end-system has been allowed to send packets. Bullet 4 of Section 5.3.3.3 of RFC 1812 states that routers "MUST NOT change precedence settings on packets it did not originate". However, given the security implications of the Precedence field, it is fair for routers, switches, or other middle-boxes, particularly those in the network edge, to overwrite the Type of Service (or Differentiated Services) field of the packets they are forwarding, according to a configured network policy (this is the specified behavior for DS domains [RFC2475]). Sections 5.3.3.1 and 5.3.6 of RFC 1812 state that if precedence- ordered queue service is implemented and enabled, the router "MUST NOT discard a packet whose precedence is higher than that of a packet that is not discarded". While this recommendation makes sense given the semantics of the Precedence field, it is important to note that it would be simple for an attacker to send packets with forged high Precedence value to congest some internet router(s), and cause all (or most) traffic with a lower Precedence value to be discarded.4.2.2. Weak Type of Service
Section 5.2.4.3 of RFC 1812 describes the algorithm for determining the next-hop address (i.e., the forwarding algorithm). Bullet 3, "Weak TOS", addresses the case in which routes contain a "type of service" attribute. It states that in case a packet contains a non- default TOS (i.e., 0000), only routes with the same TOS or with the default TOS should be considered for forwarding that packet. However, this means that if among the longest match routes for a given packet are routes with some TOS other than the one contained in the received packet, and no routes with the default TOS, the corresponding packet would be dropped. This may or may not be a desired behavior.
An alternative for the case in which among the "longest match" routes there are only routes with non-default type of service that do not match the TOS contained in the received packet, would be to use a route with any other TOS. While this route would most likely not be able to address the type of service requested by packet, it would, at least, provide a "best effort" service. It must be noted that Section 5.3.2 of RFC 1812 allows routers to not honor the TOS field. Therefore, the proposed alternative behavior is still compliant with the IETF specifications. While officially specified in the RFC series, TOS-based routing is not widely deployed in the Internet.4.2.3. Impact of Address Resolution on Buffer Management
In the case of broadcast link-layer technologies, in order for a system to transfer an IP datagram it must usually first map an IP address to the corresponding link-layer address (for example, by means of the Address Resolution Protocol (ARP) [RFC0826]) . This means that while this operation is being performed, the packets that would require such a mapping would need to be kept in memory. This may happen both in the case of hosts and in the case of routers. This situation might be exploited by an attacker, which could send a large amount of packets to a non-existent host that would supposedly be directly connected to the attacked router. While trying to map the corresponding IP address into a link-layer address, the attacked router would keep in memory all the packets that would need to make use of that link-layer address. At the point in which the mapping function times out, depending on the policy implemented by the attacked router, only the packet that triggered the call to the mapping function might be dropped. In that case, the same operation would be repeated for every packet destined to the non-existent host. Depending on the timeout value for the mapping function, this situation might lead the router to run out of free buffer space, with the consequence that incoming legitimate packets would have to be dropped, or that legitimate packets already stored in the router's buffers might get dropped. Both of these situations would lead either to a complete DoS or to a degradation of the network service. One countermeasure to this problem would be to drop, at the point the mapping function times out, all the packets destined to the address that timed out. In addition, a "negative cache entry" might be kept in the module performing the matching function, so that for some amount of time, the mapping function would return an error when the IP module requests to perform a mapping for some address for which the mapping has recently timed out.
A common implementation strategy for routers is that when a packet is received that requires an ARP resolution to be performed before the packet can be forwarded, the packet is dropped and the router is then engaged in the ARP procedure.4.2.4. Dropping Packets
In some scenarios, it may be necessary for a host or router to drop packets from the output queue. In the event that one of such packets happens to be an IP fragment, and there were other fragments of the same packet in the queue, those other fragments should also be dropped. The rationale for this policy is that it is nonsensical to spend system resources on those other fragments, because, as long as one fragment is missing, it will be impossible for the receiving system to reassemble them into a complete IP datagram. Some systems have been known to drop just a subset of fragments of a given datagram, leading to a denial-of-service condition, as only a subset of all the fragments of the packets were actually transferred to the next hop.