3. Semantic Free Index
Many of today's distributed network indexes are semantic. The semantic index is human-readable. For example, it might associate information with other keywords, a document, a database key, or even an administrative domain. It makes it easy to associate objects with particular network providers, companies, or organizations, as evidenced in the Domain Name System (DNS). However, it can also trigger legal tussles and frustrate content replication and migration [216]. Distributed Hash Tables (DHTs) have been proposed to provide semantic-free, data-centric references. DHTs enable one to find an object's persistent key in a very large, changing set of hosts. They are typically designed for [23]: a) low degree. If each node keeps routing information for only a small number of other nodes, the impact of high node arrival and departure rates is contained; b) low hop count. The hops and delay introduced by the extra indirection are minimized; c) greedy routing. Nodes independently calculate a short path to the target. At each hop, the query moves closer to the target; and d) robustness. A path to the target can be found even when links or nodes fail.3.1. Origins
To understand the origins of recent DHTs, one needs to look to three contributions from the 1990s. The first two -- Plaxton, Rajaraman, and Richa (PRR) [30] and Consistent Hashing [49] -- were published within one month of each other. The third, the Scalable Distributed Data Structure (SDDS) [52], was curiously ignored in significant structured P2P designs despite having some similar goals [2, 6, 7]. It has been briefly referenced in other P2P papers [46, 284-287].3.1.1. Plaxton, Rajaraman, and Richa (PRR)
PRR is the most recent of the three. It influenced the designs of Pastry [2], Tapestry [6], and Chord [7]. The value of PRR is that it can locate objects using fixed-length routing tables [6]. Objects and nodes are assigned a semantic-free address, for example a 160-bit key. Every node is effectively the root of a spanning tree. A message routes toward an object by matching longer address suffixes, until it encounters either the object's root node or another node
with a 'nearby' copy. It can route around link and node failure by matching nodes with a related suffix. The scheme has several disadvantages [6]: global knowledge is needed to construct the overlay; an object's root node is a single point of failure; nodes cannot be inserted and deleted; and there is no mechanism for queries to avoid congestion hot spots.3.1.2. Consistent Hashing
Consistent Hashing [288] strongly influenced the designs of Chord [7] and Koorde [37]. Karger, et al. introduced Consistent Hashing in the context of the Web-caching problem [49]. Web servers could conceivably use standard hashing to place objects across a network of caches. Clients could use the approach to find the objects. For normal hashing, most object references would be moved when caches are added or deleted. On the other hand, Consistent Hashing is "smooth" -- when caches are added or deleted, the minimum number of object references move so as to maintain load balancing. Consistent Hashing also ensures that the total number of caches responsible for a particular object is limited. Whereas Litwin's Linear Hashing (LH*) scheme requires 'buckets' to be added one at a time in sequence [50], Consistent Hashing allows them to be added in any order [49]. There is an open Consistent Hashing problem pertaining to the fraction of items moved when a node is inserted [165]. Extended Consistent Hashing was recently proposed to randomize queries over the spread of caches to significantly reduce the load variance [289]. Interestingly, Karger [49] referred to an older DHT algorithm by Devine that used "a novel autonomous location discovery algorithm that learns the buckets' locations instead of using a centralized directory" [51].3.1.3. Scalable Distributed Data Structures (LH*)
In turn, Devine's primary point of reference was Litwin's work on SDDSs and the associated LH* algorithm [52]. An SDDS satisfies three design requirements: files grow to new servers only when existing servers are well loaded; there is no centralized directory; and the basic operations like insert, search, and split never require atomic updates to multiple clients. Honicky and Miller suggested the first requirement could be considered a limitation since expansion to new servers is not under administrative control [286]. Litwin recently noted numerous similarities and differences between LH* and Chord [290]. He found that both implement key search. Although LH* refers to clients and servers, nodes can operate as peers in both. Chord 'splits' nodes when a new node is inserted, while LH* schedules 'splits' to avoid overload. Chord requests travel O(log n) hops, while LH* client requests need, at most, two hops to find the target. Chord stores a small number of 'fingers' at each node. LH* servers
store N/2 to N addresses while LH* clients store 1 to N addresses. This trade-off between hop count and the size of the index affects system robustness, and bears striking similarity to recent one- and two-hop P2P schemes in Section 2. The arrival and departure of LH* clients does not disrupt LH* server metadata at all. Given the size of the index, the arrival and departure of LH* servers are likely to cause more churn than that of Chord nodes. Unlike Chord, LH* has a single point of failure, the split coordinator. It can be replicated. Alternatively, it can be removed in later LH* variants, though details have not been progressed for lack of practical need [290].3.2. Dependability
We make four overall observations about their dependability. Dependability metrics fall into two categories: static dependability, a measure of performance before recovery mechanisms take over; and dynamic dependability, for the most likely case in massive networks where there is continual failure and recovery ("churn").3.2.1. Static Dependability
Observation A: Static dependability comparisons show that no O(log n) DHT geometry is significantly more dependable than the other O(log n) geometries. Gummadi, et al. compared the tree, hypercube, butterfly, ring, XOR, and hybrid geometries. In such geometries, nodes generally know about O(log n) neighbors and route to a destination in O(log n) hops, where N is the number of nodes in the overlay. Gummadi, et al. asked "Why not the ring?" They concluded that only the ring and XOR geometries permit flexible choice of both neighbors and alternative routes [24]. Loguinov, et al. added the de Bruijn graph to their comparison [36]. They concluded that the classical analyses, for example the probability that a particular node becomes disconnected, yield no major differences between the resilience of Chord, CAN, and de Bruijn graphs. Using bisection width (the minimum edge count between two equal partitions) and path overlap (the likelihood that backup paths will encounter the same failed nodes or links as the primary path), they argued for the superior resilience of the de Bruijn graph. In short, ring, XOR, and de Bruijn graphs all permit flexible choice of alternative paths, but only in de Bruijn are the alternate paths independent of each other [36].
3.2.2. Dynamic Dependability
Observation B: Dynamic dependability comparisons show that DHT dependability is sensitive to the underlying topology maintenance algorithms. Li, et al. give the best comparison to date of several leading DHTs during churn [291]. They relate the disparate configuration parameters of Tapestry, Chord, Kademlia, Kelips, and OneHop to fundamental design choices. For each of these DHTs, they plotted the optimal performance in terms of lookup latency (milliseconds) and fraction of failed lookups. The results led to several important insights about the underlying algorithms, for example: increasing routing table size is more cost-effective than increasing the rate of periodic stabilization; learning about new nodes during the lookup process sometimes eliminates the need for stabilization; and parallel lookups reduce latency due to timeouts more effectively than faster stabilization. Similarly, Zhuang, et al. compared keep-alive algorithms for DHT failure detection [292]. Such algorithmic comparisons can significantly improve the dependability of DHT designs. In Figure 2, we propose a taxonomy for the topology maintenance algorithms that influence dependability. The algorithms can be classified by how nodes join and leave, how they first detect failures, how they share information about topology updates, and how they react when they receive information about topology updates.
Normal Updates Joins (passive; active) [293] Leaves (passive; active) [293] Fault Detection [292] Maintenance Proactive (periodic or keep-alive probes) Reactive (correction-on-use, correction-on-failure) [294] Report Negative (all dead nodes, nodes recently failed) Positive (all live nodes; nodes recently recovered) [292] Topology Sharing: yes/ no [292] Multicast Tree (explicit, implicit) [267, 295] Gossip (timeouts; number of contacts) [39] Corrective Action Routing Rerouting actions (reroute once; route in parallel [291]; reject) Routing timeouts (TCP-style, virtual coordinates) [296] Topology Update action (evict/ replace/ tag node) Update timeliness (immediate, periodic[296], delayed [297]) Figure 2: Topology Maintenance in Distributed Hash Tables3.2.3. Ephemeral or Stable Nodes -- O(log n) or O(1) Hops
Observation C: Most DHTs use O(log n) geometries to suit ephemeral nodes. The O(1) hop DHTs suit stable nodes and deserve more research attention. Most of the DHTs in Section 3.5 assume that nodes are ephemeral, with expected lifetimes of one to two hours. Therefore, they mostly use an O(log n) geometry. The common assumption is that maintenance of full routing tables in the O(1) hop DHTs will consume excessive bandwidth when nodes are continually joining and leaving. The corollary is that, when they run on stable infrastructure servers [298], most of the DHTs in Section 3.5 are less than optimal -- lookups take many more hops than necessary, wasting latency and bandwidth budgets. The O(1) hop DHTs suit stable deployments and high lookup rates. For a churning 1024-node network, Li, et al. concluded that OneHop is superior to Chord, Tapestry, Kademlia, and Kelips in terms of latency and lookup success rate [291]. For a 3000-node network, they concluded that "OneHop is only preferable to Chord when the deployment scenario allows a communication cost
greater than 20 bytes per node per second" [291]. This apparent limitation needs to be put in context. They assumed that each node issues only one lookup every 10 minutes and has a lifetime of only 60 minutes. It seems reasonable to expect that in some deployments, nodes will have a lifetime of weeks or more, a maintenance bandwidth of tens of kilobits per second, and a load of hundreds of lookups per second. O(1) hop DHTs are superior in such situations. OneHop can scale at least to many tens of thousands of nodes [267]. The recent O(1) hop designs [267, 295] are vastly outnumbered by the O(log n) DHTs in Section 3.5. Research on the algorithms of Figure 2 will also yield improvements in the dependability of the O(1) hop DHTs.3.2.4. Simulation and Proof
Observation D: Although not yet a mature science, the study of DHT dependability is helped by recent simulation and formal development tools. While there are recent reference architectures [294, 298], much of the DHT literature in Section 3.5 does not lend itself to repeatable, comparative studies. The best comparative work to date [291] relies on the Peer-to-Peer Simulator (P2PSIM) [299]. At the time of writing, it supports more DHT geometries than any other simulator. As the study of DHTs matures, we can expect to see the simulation emphasis shift from geometric comparison to a comparison of the algorithms of Figure 2. P2P correctness proofs generally rely on less-than-complete formal specifications of system invariants and events [7, 45, 300]. Li and Plaxton expressed concern that "when many joins and leaves happen concurrently, it is not clear whether the neighbor tables will remain in a 'good' state" [47]. While acknowledging that guaranteeing consistency in a failure-prone network is impossible, Lynch, Malkhi, et al. sketched amendments to the Chord algorithm to guarantee atomicity [301]. More recently, Gilbert, Lynch, et al. gave a new algorithm for atomic read/write memory in a churning distributed network, suggesting it to be a good match for P2P [302]. Lynch and Stoica show in an enhancement to Chord that lookups are provably correct when there is a limited rate of joins and failures [303]. Fault Tolerant Active Rings is a protocol for active joins and leaves that was formally specified and proven using B-method tools [304]. A good starting point for a formal DHT development would be the numerous informal API specifications [22, 305, 306]. Such work could be informed by other efforts to formally specify routing invariants [307, 308].
3.3. Latency
The key metrics for DHT latency are: 1) Shortest-Path Distance and Diameter. In graph theory, the shortest-path distance is the minimum number of edges in any path between two vertices of the graph. Diameter is the largest of all shortest-path distances in a graph [309]. Networking synonyms for distance on a DHT are "hop count" and "lookup length". 2) Latency and Latency Stretch. Two types of latency are relevant here -- network-layer latency and overlay latency. Network-layer latency has been referred to as "proximity" or "locality" [24]. Stretch is the cost of an overlay path between two nodes, divided by the cost of the direct network path between those nodes [310]. Latency stretch is also known as the "relative delay penalty" [311].3.3.1. Hop Count and the O(1)-Hop DHTs
Hop count gives an approximate indication of path latency. O(1)-hop DHTs have path latencies lower than the O(log n)-hop DHTs [291]. This significant advantage is often overlooked on account of concern about the messaging costs to maintain large routing tables (Section 3.2.3). Such concern is justified when the mean node lifetime is only a few hours and the mean lookup interval per node is more than a few seconds (the classic profile of a P2P file-sharing node). However, for a large, practical operating range (node lifetimes of days or more, lookup rates of over tens of lookups per second per node, up to ~100,000 nodes), the total messaging cost in O(1) hop DHTs is lower than in O(log n) DHTs [312]. Lookups and routing table maintenance contribute to the total messaging cost. If a deployment fits this operating range, then O(1)-hop DHTs will give lower path latencies and lower total messaging costs. An additional merit of the O(1)-hop DHTs is that they yield lower lookup failure rates than their O(log N)-hop counterparts [291]. Low hop count can be achieved in two ways: each node has a large O(N) index of nodes; or the object references can be replicated on many nodes. Beehive [313], Kelips [39], LAND [310], and Tulip [314] are examples of the latter category. Beehive achieves O(1) hops on average and O(log n) hops in the worst case, by proactive replication of popular objects. Kelips replicates the 'file index'. It incurs O(sqrt(N)) storage costs for both the node index and the file index. LAND uses O(log n) reference pointers for each stored object and an O(log n) index to achieve a worst-case 1+e stretch, where 0<e. The Kelips-like Tulip [314] requires 2 hops per lookup. Each node
maintains 2sqrt(N)log(N) links to other nodes and objects are replicated on O(sqrt(N)) nodes. The DHTs with a large O(N) node index can be divided into two groups: those for which the index is always O(N); and those for which the index opportunistically ranges from O(log n) to O(N). Linear Hashing (LH*) servers [52], OneHop [267], and 1h-Calot [295] fall into the former category. EpiChord [315] and Accordion [316] are examples of the latter.3.3.2. Proximity and the O(log n)-Hop DHTs
If one chooses not to use single-hop DHTs, hop count is a weak indicator of end-to-end path latency. Some hops may incur large delays because of intercontinental or satellite links. Consequently, numerous DHT designs minimize path latency by considering the proximity of nodes. Gummadi, et al. classified the proximity methods as follows [24]: 1) Proximity Neighbor Selection (PNS). The nodes in the routing table are chosen based on the latency of the direct hop to those nodes. The latency may be explicitly measured [317], or it may be estimated using one of several synthetic coordinate systems [150, 154, 318]. As a lower bound on PNS performance, Dabek, et al. showed that lookups on O(log n) DHTs take at least 1.5 times the average roundtrip time of the underlying network [154]. 2) Proximity Route Selection (PRS). At lookup time, the choice of the next-hop node relies on the latency of the direct hop to that node. PRS is less effective than PNS, though it may complement it [24]. Some of the routing geometries in Section 3.5 do not support PNS and/or PRS [24]. 3) Proximity Identifier Selection (PIS). Node identifiers indicate geographic position. PIS frustrates load balancing, increases the risk of correlated failures, and is not often used [24]. The proximity study by Gummadi, et al. assumed recursive routing, though they suggested that PNS would also be superior to PRS with iterative routing [24]. Dabek, et al. found that recursive lookups take 0.6 times as long as iterative lookups [150]. Beyond the explicit use of proximity information, redundancy can help to avoid slow paths and servers. One may increase the number of replicas [150], use parallel lookups [291, 316], use alternate routes on failure [150], or use multiple gateway nodes to enter the DHT [317].
3.4. Multicasting
3.4.1. Multicasting vs. Broadcasting
"Multicasting" here means sending a message to a subset of an overlay's nodes. Nodes explicitly join and leave this subset, called a "multicast group". "Broadcasting" here is a special case of multicasting in which a message is sent to all nodes in the overlay. Broadcasting relies on overlay membership messages -- it does not need extra group membership messaging. Castro, et al. said multicasting on structured overlays is either "flooding" (one overlay per group) or "tree-based" (one tree per group) [319]. These are synonyms for broadcasting and multicasting respectively. The first DHT-based designs for multicasting were CAN multicast [320], Scribe [241], Bayeux [242], and i3 [231]. They were based on CAN [8], Pastry [2], Tapestry [31], and Chord [7] respectively. El- Ansary, et al. devised the first DHT-based broadcasting scheme [321]. It was based on Chord. Multicast trees can be constructed using reverse-path forwarding or forward-path forwarding. Scribe uses reverse-path forwarding [241]. Bayeux uses forward-path forwarding [242]. Borg, a multicast design based on Pastry, uses a combination of forward-path and reverse-path forwarding to minimize latency [237].3.4.2. Motivation for DHT-based Multicasting
Multicasting complements DHT search capability. DHTs naturally support exact match queries. With multicasting, they can support more complex queries. Multicasting also enables the dissemination and collection of global information. Consider, for example, aggregation queries like minimum, maximum, count, sum, and average (Section 5.4). A node at the root of a dissemination tree might multicast such a query [322]. The leaf nodes return local results towards the root node. Successive parents aggregate the result so that eventually the root node can compute the global result. Such queries may help to monitor the capacity and health of the overlay itself. Why bother with structured overlays for multicasting? In Section 2.1, we saw that Gnutella can multicast complex queries without them [4]. Castro, et al. posed the question, "Should we build Gnutella on a structured overlay?" [259]. While acknowledging that their study was preliminary, they did conclude that "we see no reason to build Gnutella on top of an unstructured overlay" [259]. The supposedly high maintenance costs of structured overlays were outweighed by
query cost savings. The structured overlay ensured that nodes were only visited once during a complex query. It also helped to accurately limit the total number of nodes visited. Pai, et al. acknowledged that multicast trees based on structured overlays contribute to simple routing rules, low delay and low delay variation [323]. However, they opted for unstructured, gossip-based multicasting for reliability reasons: data loss near the tree root affects all subtended nodes; interior node failures must be repaired quickly; interior nodes are obliged to disseminate more than their fair share of traffic, giving leaf nodes a "free ride". The most promising research direction is to improve on the Bimodal Multicasting approach [324]. It combines the bandwidth efficiency and low latency of structured, best-effort multicasting trees with the reliability of unstructured gossip protocols.3.4.3. Design Issues
None of the early structured overlay multicast designs addressed all of the following issues [325]: 1) Heterogeneous Node Capacity. Nodes differ in their processing, memory, and network capacity. Multicast throughput is largely determined by the node with smallest throughput [325]. To limit the multicasting load on a node, one might cap its out-degree. If the same node receives further join requests, it refers them to its children ("pushdown") [240]. Bharambe, et al. explored several pushdown strategies but found them inadequate to deal with heterogeneity [326]. They concluded that the heterogeneity issue remains open, and should be addressed before deploying DHTs for high-bandwidth multicasting applications. Independently, Zhang et al. partially tackled heterogeneity by allowing nodes in their CAM-Chord and CAM-Koorde designs to vary out-degree according to the node's capacity [325]. However, they made no mention of the "pushdown" issue -- they did not describe topology maintenance when the out-degree limit is reached. 2) Reliability (Dynamic Membership). If a multicast tree is to be resilient, it must survive dynamic membership. There are several ways to deal with dynamic membership: ensure that the root node of the multicasting tree does not handle all requests to join or leave the multicast group [242]; use multiple interior-node- disjoint trees to avoid single points of failure in tree structures [322]; and split the root node into several replicas and partition members across them [241]. For example, Bayeux requires the root node to track all group membership changes whereas Scribe does not [241]. CAN-multicast uses a single, well-known host to bootstrap the join operations [320]. The earliest DHT-based broadcasting work by El-Ansary, et al. did not
address the issue of dynamic membership [321]. Ghodsi, et al. addressed it in a subsequent paper, though, giving two broadcast algorithms that accommodate routing table inconsistencies [327]. One algorithm achieves a more optimal multicasting network at the expense of greater correction overhead. Splitstream, based on Scribe and Pastry, redundantly striped content across multiple interior-node-disjoint multicast trees -- if one interior node fails, then only one stripe is lost [240]. 3) Large Any-Source Multicast Groups. Any group member should be allowed to send multicast messages. The group should scale to a very large number of hosts. CAN-based multicast was the first application-level multicast scheme to scale to groups of several thousands of nodes without restricting the service model to a single source [320]. Bayeux scales to large groups but has a single root node for each multicast group. It supports the any- source model only by having the root node operate as a reflector for multiple senders [242].3.5. Routing Geometries
In Sections 3.5.1 to 3.5.6, we introduce the main geometries for simple key lookup and survey their robustness mechanisms.3.5.1. Plaxton Trees (Pastry, Tapestry)
Work began in March 2000 on a structured, fault-tolerant, wide-area Dynamic Object Location and Routing (DOLR) system called Tapestry [6, 155]. While DHTs fix replica locations, a DOLR API enables applications to control object placement [31]. Tapestry's basic location and routing scheme follows Plaxton, Rajaraman, and Richa (PRR) [30], but it remedies PRR's robustness shortcomings described in Section 3.1. Whereas each object has one root node in PRR, Tapestry uses several to avoid a single point of failure. Unlike PRR, it allows nodes to be inserted and deleted. Whereas PRR required a total ordering of nodes, Tapestry uses 'surrogate routing' to incrementally choose root nodes. The PRR algorithm does not address congestion, but Tapestry can put object copies close to nodes generating high query loads. PRR nodes only know of the nearest replica, whereas Tapestry nodes enable selection from a set of replicas (for example, to retrieve the most up to date). To detect routing faults, Tapestry uses TCP timeouts and UDP heartbeats for detection, sequential secondary neighbours for rerouting, and a 'second chance' window so that recovery can occur without the overhead of a full node insertion. Tapestry's dependability has been measured on a testbed of about 100 machines and on simulations of
about 1000 nodes. Successful routing rates and maintenance bandwidths were measured during instantaneous failures and ongoing churn [31]. Pastry, like Tapestry, uses Plaxton-like prefix routing [2]. As in Tapestry, Pastry nodes maintain O(log n) neighbours and route to a target in O(log n) hops. Pastry differs from Tapestry only in the method by which it handles network locality and replication [2]. Each Pastry node maintains a 'leaf set' and a 'routing table'. The leaf set contains l/2 node IDs on either side of the local node ID in the node ID space. The routing table, in row r, column c, points to the node ID with the same r-digit prefix as the local node, but with an r+1 digit of c. A Pastry node periodically probes leaf set and routing table nodes, with periodicity of Tls and Trt and a timeout Tout. Mahajan, Castry, et al. analyzed the reliability versus maintenance cost trade-offs in terms of the parameters l, Tls, Trt, and Tout [328]. They concluded that earlier concerns about excessive maintenance cost in a churning P2P network were unfounded, but suggested follow-up work for a wider range of reliability targets, maintenance costs, and probe periods. Rhea Geels, et al. concluded that existing DHTs fail at high churn rates [329]. Building on a Pastry implementation from Rice University, they found that most lookups fail to complete when there is excessive churn. They conjectured that short-lived nodes often leave the network with lookups that have not yet timed out, but no evidence was provided to confirm the theory. They identified three design issues that affect DHT performance under churn: reactive versus periodic recovery of peers; lookup timeouts; and choice of nearby neighbours. Since reactive recovery was found to add traffic to already congested links, the authors used periodic recovery in their design. For lookup timeouts, they advocated an exponentially weighted moving average of each neighbour's response time, over alternative fixed timeout or 'virtual coordinate' schemes. For selection of nearby neighbours, they found that 'global sampling' was more effective than simply sampling a 'neighbour's neighbours' or 'inverse neighbours'. Castro, Costa, et al. have refuted the suggestion that DHTs cannot cope with high churn rates [330]. By implementing methods for continuous detection and repair, their MSPastry implementation achieved shorter routing paths and a maintenance overhead of less than half a message per second per node. There have been more recent proposals based on these early Plaxton- like schemes. Kademlia uses a bit-wise exclusive or (XOR) metric for the 'distance' between 160-bit node identifiers [45]. Each node keeps a list of contact nodes for each section of the node space that is between 2^i and 2^(i+1) from itself (0.i<160). Longer-lived nodes are deliberately given preference on this list -- it has been found in Gnutella that the longer a node has been active, the more likely
it is to remain active. Like Kademlia, Willow uses the XOR metric [32]. It implements a Tree Maintenance Protocol to 'zipper' together broken segments of a tree. Where other schemes use DHT routing to inefficiently add new peers, Willow can merge disjoint or broken trees in O(log n) parallel operations.3.5.2. Rings (Chord, DKS)
Chord is the prototypical DHT ring, so we first sketch its operation. Chord maps nodes and keys to an identifier ring [7, 34]. Chord supports one main operation: find a node with the given key. It uses Consistent Hashing (Section 3.1) to minimize disruption of keys when nodes join and leave the network. However, Chord peers need only track O(log n) other peers, not all peers as in the original consistent hashing proposal [49]. It enables concurrent node insertions and deletions, improving on PRR. Compared to Pastry, it has a simpler join protocol. Each Chord peer tracks its predecessor, a list of successors, and a finger table. Using the finger table, each hop is at least half the remaining distance around the ring to the target node, giving an average lookup hop count of (1/2)log n(base 2). Each Chord node runs a periodic stabilization routine that updates predecessor and successor pointers to cater to newly added nodes. All successors of a given node need to fail for the ring to fail. Although a node departure could be treated the same as a failure, a departing Chord node first notifies the predecessor and successors, so as to improve performance. In their definitive paper, Chord's inventors critiqued its dependability under churn [34]. They provided proofs on the behaviour of the Chord network when nodes in a stable network fail, stressing that such proofs are inadequate in the general case of a perpetually churning network. An earlier paper had posed the question, "For lookups to be successful during churn, how regularly do the Chord stabilization routines need to run?" [331]. Stoica, Morris, et al. modeled a range of node join/departure rates and stabilization periods for a Chord network of 1000 nodes. They measured the number of timeouts (caused by a finger pointing to a departed node) and lookup failures (caused by nodes that temporarily point to the wrong successor during churn). They also modeled the 'lookup stretch', the ratio of the Chord lookup time to optimal lookup time on the underlying network. They demonstrated the latency advantage of recursive lookups over iterative lookups, but there remains room for delay reduction. For further work, the authors proposed to improve resilience to network partitions, using a small set of known nodes or 'remembered' random nodes. To reduce the number of messages per lookup, they suggested an increase in the size of each step around the ring, accomplished via a larger number of fingers at each node. Much of the paper assumed independent, equally
likely node failures. Analysis of correlated node failures, caused by massive site or backbone failures, will be more important in some deployments. The paper did not attempt to recommend a fixed optimal stabilization rate. Liben-Nowell, Balakrishnan, et al. had suggested that optimum stabilization rate might evolve according to measurements of peers' behaviour [331] -- such a mechanism has yet to be devised. Alima, El-Ansary, et al. considered the communication costs of Chord's stabilization routines, referred to as 'active correction', to be excessive [332]. Two other robustness issues also motivated their Distributed K-ary Search (DKS) design, which is similar to Chord. Firstly, the total system should evolve for an optimum balance between the number of peers, the lookup hop count, and the size of the routing table. Secondly, lookups should be reliable -- P2P algorithms should be able to guarantee a successful lookup for key/value pairs that have been inserted into the system. A similar lookup-correctness issue was raised elsewhere by one of Chord's authors; "Is it possible to augment the data structure to work even when nodes (and their associated finger lists) just disappear?" [333] Alima, El-Ansary, et al. asserted that P2Ps using active correction, like Chord, Pastry, and Tapestry, are unable to give such a guarantee. They propose an alternate 'correction-on-use' scheme, whereby expired routing entries are corrected by information piggybacking lookups and insertions. A prerequisite is that lookup and insertion rates are significantly higher than node arrival, departure, and failure rates. Correct lookups are guaranteed in the presence of simultaneous node arrivals or up to f concurrent node departures, where f is configurable.3.5.3. Tori (CAN)
Ratnasamy, Francis, et al. developed the Content-Addressable Network (CAN), another early DHT widely referenced alongside Tapestry, Pastry, and Chord [8, 334]. It is arranged as a virtual d-dimensional Cartesian coordinate space on a d-torus. Each node is responsible for a zone in this coordinate space. The designers used a heuristic thought to be important for large, churning P2P networks: keep the number of neighbours independent of system size. Consequently, its design differs significantly from Pastry, Tapestry, and Chord. Whereas they have O(log n) neighbours per node and O(log n) hops per lookup, CAN has O(d) neighbours and O(dn^(1/d)) hop count. When CAN's system-wide parameter d is set to log(n), CAN converges to their profile. If the number of nodes grows, a major rearrangement of the CAN network may be required [151]. The CAN designers considered building on PRR, but opted for the simple, low- state-per-node CAN algorithm instead. They had reasoned that a PRR- based design would not perform well under churn, given node
departures and arrivals would affect a logarithmic number of nodes [8]. There have been preliminary assessments of CAN's resilience. When a node leaves the CAN in an orderly fashion, it passes its own Virtual ID (VID), its neighbours' VIDs and IP addresses, and its key/value pairs to a takeover node. If a node leaves abruptly, its neighbours send recovery messages towards the designated takeover node. CAN ensures the recovery messages reach the takeover node, even if nodes die simultaneously, by maintaining a VID chain with Chord's stabilization algorithm. Some initial 'proof of concept' resilience simulations were run using the Network Simulator (NS) [335] for up to a few hundred nodes. Average hop counts and lookup failure probabilities were plotted against the total number of nodes for various node failure rates [8]. The CAN team documented several open research questions pertaining to state/hop count trade-offs, resilience, load, locality, and heterogeneous peers [44, 334].3.5.4. Butterflies (Viceroy)
Viceroy approximates a butterfly network [46]. It generally has constant degree like CAN. Like Chord, Tapestry, and Pastry, it has logarithmic diameter. It improves on these systems, inasmuch as its diameter is better than CAN and its degree is better than Chord, Tapestry, and Pastry. As with most DHTs, it utilizes Consistent Hashing. When a peer joins the Viceroy network, it takes a random but permanent 'identity' and selects its 'level' within the network. Each peer maintains general ring pointers ('predecessor' and 'successor'), level ring pointers ('nextonlevel' and 'prevonlevel'), and butterfly pointers ('left', 'right', and 'up'). When a peer departs, it normally passes its key pairs to a successor, and notifies other peers to find a replacement peer. The Viceroy paper scoped out the issue of robustness. It explicitly assumed that peers do not fail [46]. It assumed that join and leave operations do not overlap, so as to avoid the complication of concurrency mechanisms like locking. Kaashoek and Karger were somewhat critical of Viceroy's complexity [37]. They also pointed to its fault-tolerance blind spot. Li and Plaxton suggested that such constant-degree algorithms deserve further consideration [47]. They offered several pros and cons. The limited degree may increase the risk of a network partition, or inhibit use of local neighbours (for the simple reason that there are less of them). On the other hand, it may be easier to reason about the correctness of fixed-degree networks. One of the Viceroy authors has since proposed constant- degree peers in a two-tier, locality-aware DHT [310] -- the lower degree maintained by each lower-tier peer purportedly improves network adaptability. Another Viceroy author has since explored an
alternative bounded-degree graph for P2P, namely the de Bruijn graph [336].3.5.5. de Bruijn (D2B, Koorde, Distance Halving, ODRI)
De Bruijn graphs have had numerous refinements since their inception [337, 338]. Schlumberger was the first to use them for networking [339]. Two research teams independently devised the 'generalized' de Bruijn graph that accommodates a flexible number of nodes in the system [340, 341]. Rowley and Bose studied fault-tolerant rings overlaid on the de Bruijn graph [342]. Lee, Liu, et al. devised a two-level de Bruijn hierarchy, whereby clusters of local nodes are interconnected by a second-tier ring [343]. Many of the algorithms discussed previously are 'greedy' in that each time a query is forwarded, it moves closer to the destination. Unfortunately, greedy algorithms are generally suboptimal -- for a given degree, the routing distance is longer than necessary [344]. Unlike these earlier P2P designs, de Bruijn graphs of degree k achieve an asymptotically optimal diameter log n, where n is the number of nodes in the system and k can be varied to improve resilience. If there are O(log n) neighbours per node, the de Bruijn hop count is O(log n/log log n). To illustrate de Bruijn's practical advantage, consider a network with one million nodes of degree 20: Chord has a diameter of 20, while de Bruijn has a diameter of 5 [36]. In 2003, there were a quick succession of de Bruijn proposals -- D2B [345], Koorde [37], Distance Halving [132, 336], and the Optimal Diameter Routing Infrastructure (ODRI) [36]. Fraigniaud and Gauron began the D2B design by laying out an informal problem statement: keys should be evenly distributed; lookup latency should be small; traffic load should be evenly distributed; updates of routing tables and redistribution of keys should be fast when nodes join or leave the network. They defined a node's "congestion" to be the probability that a lookup will traverse it. Apart from its optimal de Bruijn diameter, they highlighted D2B's merits: a constant expected update time when nodes join and leave (O(log n) with high probability (w.h.p.)); the expected node congestion is O((log n)/n) (O(((log n)^2)/n) w.h.p.) [345]. D2B's resilience was discussed only in passing. Koorde extends Chord to attain the optimal de Bruijn degree/diameter trade-off above [37]. Unlike D2B, Koorde does not constrain the selection of node identifiers. Also unlike D2B, it caters to concurrent joins, by extension of Chord's functionality. Kaashoek and Karger investigated Koorde's resilience to a rather harsh failure scenario: "in order for a network to stay connected when all nodes fail with probability of 1/2, some nodes must have degree
omega(log n)" [37]. They sketched a mechanism to increase Koorde's degree for this more stringent fault tolerance, losing de Bruijn's constant degree advantage. Similarly, to achieve a constant-factor load balance, Koorde would have to sacrifice its degree optimality. They suggested that the ability to trade the degree, and hence the maintenance overhead, against the expected hop count may be important for churning systems. They also identified an open problem: find a load-balanced, degree optimal DHT. Datta, Girdzijauskas, et al. showed that for arbitrary key distributions, de Bruijn graphs fail to meet the dual goals of load balancing and search efficiency [346]. They posed the question, "(Is there) a constant routing table sized DHT which meets the conflicting goals of storage load balancing and search efficiency for an arbitrary and changing key distribution?" Distance Halving was also inspired by de Bruijn [336] and shares its optimal diameter. Naor and Wieder argued for a two-step "continuous-discrete" approach for its design. The correctness of its algorithms is proven in a continuous setting. The algorithms are then mapped to a discrete space. The source x and target y are points on the continuous interval [0,1). Data items are hashed to this same interval. <str> is a string that determines how messages leave any point on the ring: if bit t of the string is 0, the left leg is taken; if it is 1, the right leg is taken. <str> increases by one bit each hop, giving a sequence by which to step around the ring. A lookup has two phases. In the first, the lookup message containing the source, target, and the random string hops toward the midpoint of the source and target. On each hop, the distance between <str>(x) and <str>(y) is halved, by virtue of the specific 'left' and 'right' functions. In the second phase, the message steps 'backward' from the midpoint to the target, removing the last bit in <str> at each hop. 'Join' and 'leave' algorithms were outlined but there was no consideration of recovery times or message load on churn. Using the Distance Halving properties, the authors devised a caching scheme to relieve congestion in a large P2P network. They have also modified the algorithm to be more robust in the presence of random faults [132]. Solid comparisons of DHT resilience are scarce, but Loguinov, Kumar, et al. give just that in their ODRI paper [36]. They compare Chord, CAN, and de Bruijn in terms of routing performance, graph expansion and clustering. At the outset, they give the optimal diameter (the maximum hop count between any two nodes in the graph) and average hop count for graphs of fixed degree. De Bruijn graphs converge to both optima, and outperform Chord and CAN on both counts. These optima impact both delay and aggregate lookup load. They present two clustering measures (edge expansion and node expansion), which are interesting for resilience. Unfortunately, after decades of de Bruijn research, they have no exact solution. De Bruijn was shown to
be superior in terms of path overlap - "de Bruijn automatically selects backup paths that do not overlap with the best shortest path or with each other" [36].3.5.6. Skip Graphs
Skip Graphs have been pursued by two research camps [38, 41]. They augment the earlier Skip Lists [347, 348]. Unlike earlier balanced trees, the Skip List is probabilistic -- its insert and delete operations do not require tree rearrangements and so are faster by a constant factor. The Skip List consists of layers of ordered linked lists. All nodes participate in the bottom layer 0 list. Some of these nodes participate in the layer 1 list with some fixed probability. A subset of layer 1 nodes participate in the layer 2 list, and so on. A lookup can proceed quickly through the list by traversing the sparse upper layers until it is close to, or at, the target. Unfortunately, nodes in the upper layers of a Skip List are potential hot spots and single points of failure. Unlike Skip Lists, Skip Graphs provide multiple lists at each level for redundancy, and every node participates in one of the lists at each level. Each node in a Skip Graph has theta(log n) neighbours on average, like some of the preceding DHTs. The Skip Graph's primary edge over the DHTs is its support for prefix and proximity search. DHTs hash objects to a random point in the graph. Consequently, they give no guarantees over where the data is stored. Nor do they guarantee that the path to the data will stay within the one administration as far as possible [38]. Skip graphs, on the other hand, provide for location-sensitive name searches. For example, to find the document docname on the node user.company.com, the Skip Graph might step through its ordered lists for the prefix com.company.user [38]. Alternatively, to find an object with a numeric identifier, an algorithm might search the lowest layer of the Skip Graph for the first digit, the next layer for the next digit, in the same vein until all digits are resolved. Being ordered, Skip Graphs also facilitate range searches. In each of these examples, the Skip Graph can be arranged such that the path to the target, as far as possible, stays within an administrative boundary. If one administration is detached from the rest of the Skip Graph, routing can continue within each of the partitions. Mechanisms have been devised to merge disconnected segments [157], though at this stage, segments are re- merged one at a time. A parallel merge algorithm has been flagged for future work. The advantages of Skip Graphs come at a cost. To be able to provide range queries and data placement flexibility, Skip Graph nodes require many more pointers than their DHT counterparts. An increased number of pointers implies increased maintenance traffic. Another
shortcoming of at least one of the early proposals was that no algorithm was given to assign keys to machines. Consequently, there are no guarantees on system-wide load balancing or on the distance between adjacent keys [100]. Aspnes, Kirsch, et al. have recently devised a scheme to reduce the inter-machine pointer count from O(mlogm), where m is the number of data elements, to O(nlog n), where n is the number of nodes [100]. They proposed a two-layer scheme -- one layer for the Skip Graph itself and the second 'bucket layer'. Each machine is responsible for a number of buckets and each bucket elects a representative key. Nodes locally adjust their load. They accept additional keys if they are below their threshold or disperse keys to nearby nodes if they are above threshold. There appear to be numerous open issues: simulations have been done but analysis is outstanding; mechanisms are required to handle the arrival and departure of nodes; there were only brief hints as to how to handle nodes with different capacities.