RFC 4981

Survey of Research towards Robust Peer-to-Peer Networks: Search Methods

Pages: 91
Informational

Part 2 of 5 – Pages 15 to 33

RFC4981 - Page 15 prevText

3.  Semantic Free Index

   Many of today's distributed network indexes are semantic.  The
   semantic index is human-readable.  For example, it might associate
   information with other keywords, a document, a database key, or even
   an administrative domain.  It makes it easy to associate objects with
   particular network providers, companies, or organizations, as
   evidenced in the Domain Name System (DNS).  However, it can also
   trigger legal tussles and frustrate content replication and migration
   [216].

   Distributed Hash Tables (DHTs) have been proposed to provide
   semantic-free, data-centric references.  DHTs enable one to find an
   object's persistent key in a very large, changing set of hosts.  They
   are typically designed for [23]:

   a) low degree.  If each node keeps routing information for only a
      small number of other nodes, the impact of high node arrival and
      departure rates is contained;

   b) low hop count.  The hops and delay introduced by the extra
      indirection are minimized;

   c) greedy routing.  Nodes independently calculate a short path to the
      target.  At each hop, the query moves closer to the target; and

   d) robustness.  A path to the target can be found even when links or
      nodes fail.

3.1.  Origins

   To understand the origins of recent DHTs, one needs to look to three
   contributions from the 1990s.  The first two -- Plaxton, Rajaraman,
   and Richa (PRR) [30] and Consistent Hashing [49] -- were published
   within one month of each other.  The third, the Scalable Distributed
   Data Structure (SDDS) [52], was curiously ignored in significant
   structured P2P designs despite having some similar goals [2, 6, 7].
   It has been briefly referenced in other P2P papers [46, 284-287].

3.1.1.  Plaxton, Rajaraman, and Richa (PRR)

   PRR is the most recent of the three.  It influenced the designs of
   Pastry [2], Tapestry [6], and Chord [7].  The value of PRR is that it
   can locate objects using fixed-length routing tables [6].  Objects
   and nodes are assigned a semantic-free address, for example a 160-bit
   key.  Every node is effectively the root of a spanning tree.  A
   message routes toward an object by matching longer address suffixes,
   until it encounters either the object's root node or another node

RFC4981 - Page 16

   with a 'nearby' copy.  It can route around link and node failure by
   matching nodes with a related suffix.  The scheme has several
   disadvantages [6]: global knowledge is needed to construct the
   overlay; an object's root node is a single point of failure; nodes
   cannot be inserted and deleted; and there is no mechanism for queries
   to avoid congestion hot spots.

3.1.2.  Consistent Hashing

   Consistent Hashing [288] strongly influenced the designs of Chord [7]
   and Koorde [37].  Karger, et al. introduced Consistent Hashing in the
   context of the Web-caching problem [49].  Web servers could
   conceivably use standard hashing to place objects across a network of
   caches.  Clients could use the approach to find the objects.  For
   normal hashing, most object references would be moved when caches are
   added or deleted.  On the other hand, Consistent Hashing is "smooth"
   -- when caches are added or deleted, the minimum number of object
   references move so as to maintain load balancing.  Consistent Hashing
   also ensures that the total number of caches responsible for a
   particular object is limited.  Whereas Litwin's Linear Hashing (LH*)
   scheme requires 'buckets' to be added one at a time in sequence [50],
   Consistent Hashing allows them to be added in any order [49].  There
   is an open Consistent Hashing problem pertaining to the fraction of
   items moved when a node is inserted [165].  Extended Consistent
   Hashing was recently proposed to randomize queries over the spread of
   caches to significantly reduce the load variance [289].
   Interestingly, Karger [49] referred to an older DHT algorithm by
   Devine that used "a novel autonomous location discovery algorithm
   that learns the buckets' locations instead of using a centralized
   directory" [51].

3.1.3.  Scalable Distributed Data Structures (LH*)

   In turn, Devine's primary point of reference was Litwin's work on
   SDDSs and the associated LH* algorithm [52].  An SDDS satisfies three
   design requirements: files grow to new servers only when existing
   servers are well loaded; there is no centralized directory; and the
   basic operations like insert, search, and split never require atomic
   updates to multiple clients.  Honicky and Miller suggested the first
   requirement could be considered a limitation since expansion to new
   servers is not under administrative control [286].  Litwin recently
   noted numerous similarities and differences between LH* and Chord
   [290].  He found that both implement key search.  Although LH* refers
   to clients and servers, nodes can operate as peers in both.  Chord
   'splits' nodes when a new node is inserted, while LH* schedules
   'splits' to avoid overload.  Chord requests travel O(log n) hops,
   while LH* client requests need, at most, two hops to find the target.
   Chord stores a small number of 'fingers' at each node.  LH* servers

RFC4981 - Page 17

   store N/2 to N addresses while LH* clients store 1 to N addresses.
   This trade-off between hop count and the size of the index affects
   system robustness, and bears striking similarity to recent one- and
   two-hop P2P schemes in Section 2.  The arrival and departure of LH*
   clients does not disrupt LH* server metadata at all.  Given the size
   of the index, the arrival and departure of LH* servers are likely to
   cause more churn than that of Chord nodes.  Unlike Chord, LH* has a
   single point of failure, the split coordinator.  It can be
   replicated.  Alternatively, it can be removed in later LH* variants,
   though details have not been progressed for lack of practical need
   [290].

3.2.  Dependability

   We make four overall observations about their dependability.
   Dependability metrics fall into two categories: static dependability,
   a measure of performance before recovery mechanisms take over; and
   dynamic dependability, for the most likely case in massive networks
   where there is continual failure and recovery ("churn").

3.2.1.  Static Dependability

   Observation A: Static dependability comparisons show that no O(log n)
   DHT geometry is significantly more dependable than the other O(log n)
   geometries.

   Gummadi, et al. compared the tree, hypercube, butterfly, ring, XOR,
   and hybrid geometries.  In such geometries, nodes generally know
   about O(log n) neighbors and route to a destination in O(log n) hops,
   where N is the number of nodes in the overlay.  Gummadi, et al. asked
   "Why not the ring?"  They concluded that only the ring and XOR
   geometries permit flexible choice of both neighbors and alternative
   routes [24].  Loguinov, et al. added the de Bruijn graph to their
   comparison [36].  They concluded that the classical analyses, for
   example the probability that a particular node becomes disconnected,
   yield no major differences between the resilience of Chord, CAN, and
   de Bruijn graphs.  Using bisection width (the minimum edge count
   between two equal partitions) and path overlap (the likelihood that
   backup paths will encounter the same failed nodes or links as the
   primary path), they argued for the superior resilience of the de
   Bruijn graph.  In short, ring, XOR, and de Bruijn graphs all permit
   flexible choice of alternative paths, but only in de Bruijn are the
   alternate paths independent of each other [36].

RFC4981 - Page 18

3.2.2.  Dynamic Dependability

   Observation B: Dynamic dependability comparisons show that DHT
   dependability is sensitive to the underlying topology maintenance
   algorithms.

   Li, et al. give the best comparison to date of several leading DHTs
   during churn [291].  They relate the disparate configuration
   parameters of Tapestry, Chord, Kademlia, Kelips, and OneHop to
   fundamental design choices.  For each of these DHTs, they plotted the
   optimal performance in terms of lookup latency (milliseconds) and
   fraction of failed lookups.  The results led to several important
   insights about the underlying algorithms, for example: increasing
   routing table size is more cost-effective than increasing the rate of
   periodic stabilization; learning about new nodes during the lookup
   process sometimes eliminates the need for stabilization; and parallel
   lookups reduce latency due to timeouts more effectively than faster
   stabilization.  Similarly, Zhuang, et al. compared keep-alive
   algorithms for DHT failure detection [292].  Such algorithmic
   comparisons can significantly improve the dependability of DHT
   designs.

   In Figure 2, we propose a taxonomy for the topology maintenance
   algorithms that influence dependability.  The algorithms can be
   classified by how nodes join and leave, how they first detect
   failures, how they share information about topology updates, and how
   they react when they receive information about topology updates.

RFC4981 - Page 19

   Normal Updates
      Joins (passive; active) [293]
      Leaves (passive; active) [293]

   Fault Detection [292]
      Maintenance
         Proactive (periodic or keep-alive probes)
         Reactive (correction-on-use, correction-on-failure) [294]
      Report
         Negative (all dead nodes, nodes recently failed)
         Positive (all live nodes; nodes recently recovered) [292]

   Topology Sharing: yes/ no [292]
         Multicast Tree (explicit, implicit) [267, 295]
         Gossip (timeouts; number of contacts) [39]

   Corrective Action
      Routing
         Rerouting actions
            (reroute once; route in parallel [291]; reject)
         Routing timeouts
            (TCP-style, virtual coordinates) [296]
      Topology
         Update action (evict/ replace/ tag node)
         Update timeliness (immediate, periodic[296], delayed [297])

        Figure 2: Topology Maintenance in Distributed Hash Tables

3.2.3.  Ephemeral or Stable Nodes -- O(log n) or O(1) Hops

   Observation C: Most DHTs use O(log n) geometries to suit ephemeral
   nodes.  The O(1) hop DHTs suit stable nodes and deserve more research
   attention.

   Most of the DHTs in Section 3.5 assume that nodes are ephemeral, with
   expected lifetimes of one to two hours.  Therefore, they mostly use
   an O(log n) geometry.  The common assumption is that maintenance of
   full routing tables in the O(1) hop DHTs will consume excessive
   bandwidth when nodes are continually joining and leaving.  The
   corollary is that, when they run on stable infrastructure servers
   [298], most of the DHTs in Section 3.5 are less than optimal --
   lookups take many more hops than necessary, wasting latency and
   bandwidth budgets.  The O(1) hop DHTs suit stable deployments and
   high lookup rates.  For a churning 1024-node network, Li, et al.
   concluded that OneHop is superior to Chord, Tapestry, Kademlia, and
   Kelips in terms of latency and lookup success rate [291].  For a
   3000-node network, they concluded that "OneHop is only preferable to
   Chord when the deployment scenario allows a communication cost

RFC4981 - Page 20

   greater than 20 bytes per node per second" [291].  This apparent
   limitation needs to be put in context.  They assumed that each node
   issues only one lookup every 10 minutes and has a lifetime of only 60
   minutes.  It seems reasonable to expect that in some deployments,
   nodes will have a lifetime of weeks or more, a maintenance bandwidth
   of tens of kilobits per second, and a load of hundreds of lookups per
   second.  O(1) hop DHTs are superior in such situations.  OneHop can
   scale at least to many tens of thousands of nodes [267].  The recent
   O(1) hop designs [267, 295] are vastly outnumbered by the O(log n)
   DHTs in Section 3.5.  Research on the algorithms of Figure 2 will
   also yield improvements in the dependability of the O(1) hop DHTs.

3.2.4.  Simulation and Proof

   Observation D: Although not yet a mature science, the study of DHT
   dependability is helped by recent simulation and formal development
   tools.

   While there are recent reference architectures [294, 298], much of
   the DHT literature in Section 3.5 does not lend itself to repeatable,
   comparative studies.  The best comparative work to date [291] relies
   on the Peer-to-Peer Simulator (P2PSIM) [299].  At the time of
   writing, it supports more DHT geometries than any other simulator.
   As the study of DHTs matures, we can expect to see the simulation
   emphasis shift from geometric comparison to a comparison of the
   algorithms of Figure 2.

   P2P correctness proofs generally rely on less-than-complete formal
   specifications of system invariants and events [7, 45, 300].  Li and
   Plaxton expressed concern that "when many joins and leaves happen
   concurrently, it is not clear whether the neighbor tables will remain
   in a 'good' state" [47].  While acknowledging that guaranteeing
   consistency in a failure-prone network is impossible, Lynch, Malkhi,
   et al. sketched amendments to the Chord algorithm to guarantee
   atomicity [301].  More recently, Gilbert, Lynch, et al. gave a new
   algorithm for atomic read/write memory in a churning distributed
   network, suggesting it to be a good match for P2P [302].  Lynch and
   Stoica show in an enhancement to Chord that lookups are provably
   correct when there is a limited rate of joins and failures [303].
   Fault Tolerant Active Rings is a protocol for active joins and leaves
   that was formally specified and proven using B-method tools [304].  A
   good starting point for a formal DHT development would be the
   numerous informal API specifications [22, 305, 306].  Such work could
   be informed by other efforts to formally specify routing invariants
   [307, 308].

RFC4981 - Page 21

3.3.  Latency

   The key metrics for DHT latency are:

   1) Shortest-Path Distance and Diameter.  In graph theory, the
      shortest-path distance is the minimum number of edges in any path
      between two vertices of the graph.  Diameter is the largest of all
      shortest-path distances in a graph [309].  Networking synonyms for
      distance on a DHT are "hop count" and "lookup length".

   2) Latency and Latency Stretch.  Two types of latency are relevant
      here -- network-layer latency and overlay latency.  Network-layer
      latency has been referred to as "proximity" or "locality" [24].
      Stretch is the cost of an overlay path between two nodes, divided
      by the cost of the direct network path between those nodes [310].
      Latency stretch is also known as the "relative delay penalty"
      [311].

3.3.1.  Hop Count and the O(1)-Hop DHTs

   Hop count gives an approximate indication of path latency.  O(1)-hop
   DHTs have path latencies lower than the O(log n)-hop DHTs [291].
   This significant advantage is often overlooked on account of concern
   about the messaging costs to maintain large routing tables (Section
   3.2.3).  Such concern is justified when the mean node lifetime is
   only a few hours and the mean lookup interval per node is more than a
   few seconds (the classic profile of a P2P file-sharing node).
   However, for a large, practical operating range (node lifetimes of
   days or more, lookup rates of over tens of lookups per second per
   node, up to ~100,000 nodes), the total messaging cost in O(1) hop
   DHTs is lower than in O(log n) DHTs [312].  Lookups and routing table
   maintenance contribute to the total messaging cost.  If a deployment
   fits this operating range, then O(1)-hop DHTs will give lower path
   latencies and lower total messaging costs.  An additional merit of
   the O(1)-hop DHTs is that they yield lower lookup failure rates than
   their O(log N)-hop counterparts [291].

   Low hop count can be achieved in two ways: each node has a large O(N)
   index of nodes; or the object references can be replicated on many
   nodes.  Beehive [313], Kelips [39], LAND [310], and Tulip [314] are
   examples of the latter category.  Beehive achieves O(1) hops on
   average and O(log n) hops in the worst case, by proactive replication
   of popular objects.  Kelips replicates the 'file index'.  It incurs
   O(sqrt(N)) storage costs for both the node index and the file index.
   LAND uses O(log n) reference pointers for each stored object and an
   O(log n) index to achieve a worst-case 1+e stretch, where 0<e.  The
   Kelips-like Tulip [314] requires 2 hops per lookup.  Each node

RFC4981 - Page 22

   maintains 2sqrt(N)log(N) links to other nodes and objects are
   replicated on O(sqrt(N)) nodes.

   The DHTs with a large O(N) node index can be divided into two groups:
   those for which the index is always O(N); and those for which the
   index opportunistically ranges from O(log n) to O(N).  Linear Hashing
   (LH*) servers [52], OneHop [267], and 1h-Calot [295] fall into the
   former category.  EpiChord [315] and Accordion [316] are examples of
   the latter.

3.3.2.  Proximity and the O(log n)-Hop DHTs

   If one chooses not to use single-hop DHTs, hop count is a weak
   indicator of end-to-end path latency.  Some hops may incur large
   delays because of intercontinental or satellite links.  Consequently,
   numerous DHT designs minimize path latency by considering the
   proximity of nodes.  Gummadi, et al. classified the proximity methods
   as follows [24]:

   1) Proximity Neighbor Selection (PNS).  The nodes in the routing
      table are chosen based on the latency of the direct hop to those
      nodes.  The latency may be explicitly measured [317], or it may be
      estimated using one of several synthetic coordinate systems [150,
      154, 318].  As a lower bound on PNS performance, Dabek, et al.
      showed that lookups on O(log n) DHTs take at least 1.5 times the
      average roundtrip time of the underlying network [154].

   2) Proximity Route Selection (PRS).  At lookup time, the choice of
      the next-hop node relies on the latency of the direct hop to that
      node.  PRS is less effective than PNS, though it may complement it
      [24].  Some of the routing geometries in Section 3.5 do not
      support PNS and/or PRS [24].

   3) Proximity Identifier Selection (PIS).  Node identifiers indicate
      geographic position.  PIS frustrates load balancing, increases the
      risk of correlated failures, and is not often used [24].

   The proximity study by Gummadi, et al. assumed recursive routing,
   though they suggested that PNS would also be superior to PRS with
   iterative routing [24].  Dabek, et al. found that recursive lookups
   take 0.6 times as long as iterative lookups [150].

   Beyond the explicit use of proximity information, redundancy can help
   to avoid slow paths and servers.  One may increase the number of
   replicas [150], use parallel lookups [291, 316], use alternate routes
   on failure [150], or use multiple gateway nodes to enter the DHT
   [317].

RFC4981 - Page 23

3.4.  Multicasting

3.4.1.  Multicasting vs. Broadcasting

   "Multicasting" here means sending a message to a subset of an
   overlay's nodes.  Nodes explicitly join and leave this subset, called
   a "multicast group".  "Broadcasting" here is a special case of
   multicasting in which a message is sent to all nodes in the overlay.
   Broadcasting relies on overlay membership messages -- it does not
   need extra group membership messaging.  Castro, et al. said
   multicasting on structured overlays is either "flooding" (one overlay
   per group) or "tree-based" (one tree per group) [319].  These are
   synonyms for broadcasting and multicasting respectively.

   The first DHT-based designs for multicasting were CAN multicast
   [320], Scribe [241], Bayeux [242], and i3 [231].  They were based on
   CAN [8], Pastry [2], Tapestry [31], and Chord [7] respectively.  El-
   Ansary, et al. devised the first DHT-based broadcasting scheme [321].
   It was based on Chord.

   Multicast trees can be constructed using reverse-path forwarding or
   forward-path forwarding.  Scribe uses reverse-path forwarding [241].
   Bayeux uses forward-path forwarding [242].  Borg, a multicast design
   based on Pastry, uses a combination of forward-path and reverse-path
   forwarding to minimize latency [237].

3.4.2.  Motivation for DHT-based Multicasting

   Multicasting complements DHT search capability.  DHTs naturally
   support exact match queries.  With multicasting, they can support
   more complex queries.  Multicasting also enables the dissemination
   and collection of global information.

   Consider, for example, aggregation queries like minimum, maximum,
   count, sum, and average (Section 5.4).  A node at the root of a
   dissemination tree might multicast such a query [322].  The leaf
   nodes return local results towards the root node.  Successive parents
   aggregate the result so that eventually the root node can compute the
   global result.  Such queries may help to monitor the capacity and
   health of the overlay itself.

   Why bother with structured overlays for multicasting?  In Section
   2.1, we saw that Gnutella can multicast complex queries without them
   [4].  Castro, et al. posed the question, "Should we build Gnutella on
   a structured overlay?" [259].  While acknowledging that their study
   was preliminary, they did conclude that "we see no reason to build
   Gnutella on top of an unstructured overlay" [259].  The supposedly
   high maintenance costs of structured overlays were outweighed by

RFC4981 - Page 24

   query cost savings.  The structured overlay ensured that nodes were
   only visited once during a complex query.  It also helped to
   accurately limit the total number of nodes visited.  Pai, et al.
   acknowledged that multicast trees based on structured overlays
   contribute to simple routing rules, low delay and low delay variation
   [323].  However, they opted for unstructured, gossip-based
   multicasting for reliability reasons: data loss near the tree root
   affects all subtended nodes; interior node failures must be repaired
   quickly; interior nodes are obliged to disseminate more than their
   fair share of traffic, giving leaf nodes a "free ride".  The most
   promising research direction is to improve on the Bimodal
   Multicasting approach [324].  It combines the bandwidth efficiency
   and low latency of structured, best-effort multicasting trees with
   the reliability of unstructured gossip protocols.

3.4.3.  Design Issues

   None of the early structured overlay multicast designs addressed all
   of the following issues [325]:

   1) Heterogeneous Node Capacity.  Nodes differ in their processing,
      memory, and network capacity.  Multicast throughput is largely
      determined by the node with smallest throughput [325].  To limit
      the multicasting load on a node, one might cap its out-degree.  If
      the same node receives further join requests, it refers them to
      its children ("pushdown") [240].  Bharambe, et al. explored
      several pushdown strategies but found them inadequate to deal with
      heterogeneity [326].  They concluded that the heterogeneity issue
      remains open, and should be addressed before deploying DHTs for
      high-bandwidth multicasting applications.  Independently, Zhang et
      al. partially tackled heterogeneity by allowing nodes in their
      CAM-Chord and CAM-Koorde designs to vary out-degree according to
      the node's capacity [325].  However, they made no mention of the
      "pushdown" issue -- they did not describe topology maintenance
      when the out-degree limit is reached.

   2) Reliability (Dynamic Membership).  If a multicast tree is to be
      resilient, it must survive dynamic membership.  There are several
      ways to deal with dynamic membership: ensure that the root node of
      the multicasting tree does not handle all requests to join or
      leave the multicast group [242]; use multiple interior-node-
      disjoint trees to avoid single points of failure in tree
      structures [322]; and split the root node into several replicas
      and partition members across them [241].  For example, Bayeux
      requires the root node to track all group membership changes
      whereas Scribe does not [241].  CAN-multicast uses a single,
      well-known host to bootstrap the join operations [320].  The
      earliest DHT-based broadcasting work by El-Ansary, et al. did not

RFC4981 - Page 25

      address the issue of dynamic membership [321].  Ghodsi, et al.
      addressed it in a subsequent paper, though, giving two broadcast
      algorithms that accommodate routing table inconsistencies [327].
      One algorithm achieves a more optimal multicasting network at the
      expense of greater correction overhead.  Splitstream, based on
      Scribe and Pastry, redundantly striped content across multiple
      interior-node-disjoint multicast trees -- if one interior node
      fails, then only one stripe is lost [240].

   3) Large Any-Source Multicast Groups.  Any group member should be
      allowed to send multicast messages.  The group should scale to a
      very large number of hosts.  CAN-based multicast was the first
      application-level multicast scheme to scale to groups of several
      thousands of nodes without restricting the service model to a
      single source [320].  Bayeux scales to large groups but has a
      single root node for each multicast group.  It supports the any-
      source model only by having the root node operate as a reflector
      for multiple senders [242].

3.5.  Routing Geometries

   In Sections 3.5.1 to 3.5.6, we introduce the main geometries for
   simple key lookup and survey their robustness mechanisms.

3.5.1.  Plaxton Trees (Pastry, Tapestry)

   Work began in March 2000 on a structured, fault-tolerant, wide-area
   Dynamic Object Location and Routing (DOLR) system called Tapestry [6,
   155].  While DHTs fix replica locations, a DOLR API enables
   applications to control object placement [31].  Tapestry's basic
   location and routing scheme follows Plaxton, Rajaraman, and Richa
   (PRR) [30], but it remedies PRR's robustness shortcomings described
   in Section 3.1.  Whereas each object has one root node in PRR,
   Tapestry uses several to avoid a single point of failure.  Unlike
   PRR, it allows nodes to be inserted and deleted.  Whereas PRR
   required a total ordering of nodes, Tapestry uses 'surrogate routing'
   to incrementally choose root nodes.  The PRR algorithm does not
   address congestion, but Tapestry can put object copies close to nodes
   generating high query loads.  PRR nodes only know of the nearest
   replica, whereas Tapestry nodes enable selection from a set of
   replicas (for example, to retrieve the most up to date).  To detect
   routing faults, Tapestry uses TCP timeouts and UDP heartbeats for
   detection, sequential secondary neighbours for rerouting, and a
   'second chance' window so that recovery can occur without the
   overhead of a full node insertion.  Tapestry's dependability has been
   measured on a testbed of about 100 machines and on simulations of

RFC4981 - Page 26

   about 1000 nodes.  Successful routing rates and maintenance
   bandwidths were measured during instantaneous failures and ongoing
   churn [31].

   Pastry, like Tapestry, uses Plaxton-like prefix routing [2].  As in
   Tapestry, Pastry nodes maintain O(log n) neighbours and route to a
   target in O(log n) hops.  Pastry differs from Tapestry only in the
   method by which it handles network locality and replication [2].
   Each Pastry node maintains a 'leaf set' and a 'routing table'.  The
   leaf set contains l/2 node IDs on either side of the local node ID in
   the node ID space.  The routing table, in row r, column c, points to
   the node ID with the same r-digit prefix as the local node, but with
   an r+1 digit of c.  A Pastry node periodically probes leaf set and
   routing table nodes, with periodicity of Tls and Trt and a timeout
   Tout.  Mahajan, Castry, et al. analyzed the reliability versus
   maintenance cost trade-offs in terms of the parameters l, Tls, Trt,
   and Tout [328].  They concluded that earlier concerns about excessive
   maintenance cost in a churning P2P network were unfounded, but
   suggested follow-up work for a wider range of reliability targets,
   maintenance costs, and probe periods.  Rhea Geels, et al. concluded
   that existing DHTs fail at high churn rates [329].  Building on a
   Pastry implementation from Rice University, they found that most
   lookups fail to complete when there is excessive churn.  They
   conjectured that short-lived nodes often leave the network with
   lookups that have not yet timed out, but no evidence was provided to
   confirm the theory.  They identified three design issues that affect
   DHT performance under churn: reactive versus periodic recovery of
   peers; lookup timeouts; and choice of nearby neighbours.  Since
   reactive recovery was found to add traffic to already congested
   links, the authors used periodic recovery in their design.  For
   lookup timeouts, they advocated an exponentially weighted moving
   average of each neighbour's response time, over alternative fixed
   timeout or 'virtual coordinate' schemes.  For selection of nearby
   neighbours, they found that 'global sampling' was more effective than
   simply sampling a 'neighbour's neighbours' or 'inverse neighbours'.
   Castro, Costa, et al. have refuted the suggestion that DHTs cannot
   cope with high churn rates [330].  By implementing methods for
   continuous detection and repair, their MSPastry implementation
   achieved shorter routing paths and a maintenance overhead of less
   than half a message per second per node.

   There have been more recent proposals based on these early Plaxton-
   like schemes.  Kademlia uses a bit-wise exclusive or (XOR) metric for
   the 'distance' between 160-bit node identifiers [45].  Each node
   keeps a list of contact nodes for each section of the node space that
   is between 2^i and 2^(i+1) from itself (0.i<160).  Longer-lived nodes
   are deliberately given preference on this list -- it has been found
   in Gnutella that the longer a node has been active, the more likely

RFC4981 - Page 27

   it is to remain active.  Like Kademlia, Willow uses the XOR metric
   [32].  It implements a Tree Maintenance Protocol to 'zipper' together
   broken segments of a tree.  Where other schemes use DHT routing to
   inefficiently add new peers, Willow can merge disjoint or broken
   trees in O(log n) parallel operations.

3.5.2.  Rings (Chord, DKS)

   Chord is the prototypical DHT ring, so we first sketch its operation.
   Chord maps nodes and keys to an identifier ring [7, 34].  Chord
   supports one main operation: find a node with the given key.  It uses
   Consistent Hashing (Section 3.1) to minimize disruption of keys when
   nodes join and leave the network.  However, Chord peers need only
   track O(log n) other peers, not all peers as in the original
   consistent hashing proposal [49].  It enables concurrent node
   insertions and deletions, improving on PRR.  Compared to Pastry, it
   has a simpler join protocol.  Each Chord peer tracks its predecessor,
   a list of successors, and a finger table.  Using the finger table,
   each hop is at least half the remaining distance around the ring to
   the target node, giving an average lookup hop count of (1/2)log
   n(base 2).  Each Chord node runs a periodic stabilization routine
   that updates predecessor and successor pointers to cater to newly
   added nodes.  All successors of a given node need to fail for the
   ring to fail.  Although a node departure could be treated the same as
   a failure, a departing Chord node first notifies the predecessor and
   successors, so as to improve performance.

   In their definitive paper, Chord's inventors critiqued its
   dependability under churn [34].  They provided proofs on the
   behaviour of the Chord network when nodes in a stable network fail,
   stressing that such proofs are inadequate in the general case of a
   perpetually churning network.  An earlier paper had posed the
   question, "For lookups to be successful during churn, how regularly
   do the Chord stabilization routines need to run?" [331].  Stoica,
   Morris, et al. modeled a range of node join/departure rates and
   stabilization periods for a Chord network of 1000 nodes.  They
   measured the number of timeouts (caused by a finger pointing to a
   departed node) and lookup failures (caused by nodes that temporarily
   point to the wrong successor during churn).  They also modeled the
   'lookup stretch', the ratio of the Chord lookup time to optimal
   lookup time on the underlying network.  They demonstrated the latency
   advantage of recursive lookups over iterative lookups, but there
   remains room for delay reduction.  For further work, the authors
   proposed to improve resilience to network partitions, using a small
   set of known nodes or 'remembered' random nodes.  To reduce the
   number of messages per lookup, they suggested an increase in the size
   of each step around the ring, accomplished via a larger number of
   fingers at each node.  Much of the paper assumed independent, equally

RFC4981 - Page 28

   likely node failures.  Analysis of correlated node failures, caused
   by massive site or backbone failures, will be more important in some
   deployments.  The paper did not attempt to recommend a fixed optimal
   stabilization rate.  Liben-Nowell, Balakrishnan, et al. had suggested
   that optimum stabilization rate might evolve according to
   measurements of peers' behaviour [331] -- such a mechanism has yet to
   be devised.

   Alima, El-Ansary, et al. considered the communication costs of
   Chord's stabilization routines, referred to as 'active correction',
   to be excessive [332].  Two other robustness issues also motivated
   their Distributed K-ary Search (DKS) design, which is similar to
   Chord.  Firstly, the total system should evolve for an optimum
   balance between the number of peers, the lookup hop count, and the
   size of the routing table.  Secondly, lookups should be reliable --
   P2P algorithms should be able to guarantee a successful lookup for
   key/value pairs that have been inserted into the system.  A similar
   lookup-correctness issue was raised elsewhere by one of Chord's
   authors; "Is it possible to augment the data structure to work even
   when nodes (and their associated finger lists) just disappear?" [333]
   Alima, El-Ansary, et al. asserted that P2Ps using active correction,
   like Chord, Pastry, and Tapestry, are unable to give such a
   guarantee.  They propose an alternate 'correction-on-use' scheme,
   whereby expired routing entries are corrected by information
   piggybacking lookups and insertions.  A prerequisite is that lookup
   and insertion rates are significantly higher than node arrival,
   departure, and failure rates.  Correct lookups are guaranteed in the
   presence of simultaneous node arrivals or up to f concurrent node
   departures, where f is configurable.

3.5.3.  Tori (CAN)

   Ratnasamy, Francis, et al. developed the Content-Addressable Network
   (CAN), another early DHT widely referenced alongside Tapestry,
   Pastry, and Chord [8, 334].  It is arranged as a virtual
   d-dimensional Cartesian coordinate space on a d-torus.  Each node is
   responsible for a zone in this coordinate space.  The designers used
   a heuristic thought to be important for large, churning P2P networks:
   keep the number of neighbours independent of system size.
   Consequently, its design differs significantly from Pastry, Tapestry,
   and Chord.  Whereas they have O(log n) neighbours per node and O(log
   n) hops per lookup, CAN has O(d) neighbours and O(dn^(1/d)) hop
   count.  When CAN's system-wide parameter d is set to log(n), CAN
   converges to their profile.  If the number of nodes grows, a major
   rearrangement of the CAN network may be required [151].  The CAN
   designers considered building on PRR, but opted for the simple, low-
   state-per-node CAN algorithm instead.  They had reasoned that a PRR-
   based design would not perform well under churn, given node

RFC4981 - Page 29

   departures and arrivals would affect a logarithmic number of nodes
   [8].

   There have been preliminary assessments of CAN's resilience.  When a
   node leaves the CAN in an orderly fashion, it passes its own Virtual
   ID (VID), its neighbours' VIDs and IP addresses, and its key/value
   pairs to a takeover node.  If a node leaves abruptly, its neighbours
   send recovery messages towards the designated takeover node.  CAN
   ensures the recovery messages reach the takeover node, even if nodes
   die simultaneously, by maintaining a VID chain with Chord's
   stabilization algorithm.  Some initial 'proof of concept' resilience
   simulations were run using the Network Simulator (NS) [335] for up to
   a few hundred nodes.  Average hop counts and lookup failure
   probabilities were plotted against the total number of nodes for
   various node failure rates [8].  The CAN team documented several open
   research questions pertaining to state/hop count trade-offs,
   resilience, load, locality, and heterogeneous peers [44, 334].

3.5.4.  Butterflies (Viceroy)

   Viceroy approximates a butterfly network [46].  It generally has
   constant degree like CAN.  Like Chord, Tapestry, and Pastry, it has
   logarithmic diameter.  It improves on these systems, inasmuch as its
   diameter is better than CAN and its degree is better than Chord,
   Tapestry, and Pastry.  As with most DHTs, it utilizes Consistent
   Hashing.  When a peer joins the Viceroy network, it takes a random
   but permanent 'identity' and selects its 'level' within the network.
   Each peer maintains general ring pointers ('predecessor' and
   'successor'), level ring pointers ('nextonlevel' and 'prevonlevel'),
   and butterfly pointers ('left', 'right', and 'up').  When a peer
   departs, it normally passes its key pairs to a successor, and
   notifies other peers to find a replacement peer.

   The Viceroy paper scoped out the issue of robustness.  It explicitly
   assumed that peers do not fail [46].  It assumed that join and leave
   operations do not overlap, so as to avoid the complication of
   concurrency mechanisms like locking.  Kaashoek and Karger were
   somewhat critical of Viceroy's complexity [37].  They also pointed to
   its fault-tolerance blind spot.  Li and Plaxton suggested that such
   constant-degree algorithms deserve further consideration [47].  They
   offered several pros and cons.  The limited degree may increase the
   risk of a network partition, or inhibit use of local neighbours (for
   the simple reason that there are less of them).  On the other hand,
   it may be easier to reason about the correctness of fixed-degree
   networks.  One of the Viceroy authors has since proposed constant-
   degree peers in a two-tier, locality-aware DHT [310] -- the lower
   degree maintained by each lower-tier peer purportedly improves
   network adaptability.  Another Viceroy author has since explored an

RFC4981 - Page 30

   alternative bounded-degree graph for P2P, namely the de Bruijn graph
   [336].

3.5.5.  de Bruijn (D2B, Koorde, Distance Halving, ODRI)

   De Bruijn graphs have had numerous refinements since their inception
   [337, 338].  Schlumberger was the first to use them for networking
   [339].  Two research teams independently devised the 'generalized' de
   Bruijn graph that accommodates a flexible number of nodes in the
   system [340, 341].  Rowley and Bose studied fault-tolerant rings
   overlaid on the de Bruijn graph [342].  Lee, Liu, et al. devised a
   two-level de Bruijn hierarchy, whereby clusters of local nodes are
   interconnected by a second-tier ring [343].

   Many of the algorithms discussed previously are 'greedy' in that each
   time a query is forwarded, it moves closer to the destination.
   Unfortunately, greedy algorithms are generally suboptimal -- for a
   given degree, the routing distance is longer than necessary [344].
   Unlike these earlier P2P designs, de Bruijn graphs of degree k
   achieve an asymptotically optimal diameter log n, where n is the
   number of nodes in the system and k can be varied to improve
   resilience.  If there are O(log n) neighbours per node, the de Bruijn
   hop count is O(log n/log log n).  To illustrate de Bruijn's practical
   advantage, consider a network with one million nodes of degree 20:
   Chord has a diameter of 20, while de Bruijn has a diameter of 5 [36].
   In 2003, there were a quick succession of de Bruijn proposals -- D2B
   [345], Koorde [37], Distance Halving [132, 336], and the Optimal
   Diameter Routing Infrastructure (ODRI) [36].

   Fraigniaud and Gauron began the D2B design by laying out an informal
   problem statement: keys should be evenly distributed; lookup latency
   should be small; traffic load should be evenly distributed; updates
   of routing tables and redistribution of keys should be fast when
   nodes join or leave the network.  They defined a node's "congestion"
   to be the probability that a lookup will traverse it.  Apart from its
   optimal de Bruijn diameter, they highlighted D2B's merits: a constant
   expected update time when nodes join and leave (O(log n) with high
   probability (w.h.p.)); the expected node congestion is O((log n)/n)
   (O(((log n)^2)/n) w.h.p.) [345].  D2B's resilience was discussed only
   in passing.

   Koorde extends Chord to attain the optimal de Bruijn degree/diameter
   trade-off above [37].  Unlike D2B, Koorde does not constrain the
   selection of node identifiers.  Also unlike D2B, it caters to
   concurrent joins, by extension of Chord's functionality.  Kaashoek
   and Karger investigated Koorde's resilience to a rather harsh failure
   scenario: "in order for a network to stay connected when all nodes
   fail with probability of 1/2, some nodes must have degree

RFC4981 - Page 31

   omega(log n)" [37].  They sketched a mechanism to increase Koorde's
   degree for this more stringent fault tolerance, losing de Bruijn's
   constant degree advantage.  Similarly, to achieve a constant-factor
   load balance, Koorde would have to sacrifice its degree optimality.
   They suggested that the ability to trade the degree, and hence the
   maintenance overhead, against the expected hop count may be important
   for churning systems.  They also identified an open problem: find a
   load-balanced, degree optimal DHT.  Datta, Girdzijauskas, et al.
   showed that for arbitrary key distributions, de Bruijn graphs fail to
   meet the dual goals of load balancing and search efficiency [346].
   They posed the question, "(Is there) a constant routing table sized
   DHT which meets the conflicting goals of storage load balancing and
   search efficiency for an arbitrary and changing key distribution?"

   Distance Halving was also inspired by de Bruijn [336] and shares its
   optimal diameter.  Naor and Wieder argued for a two-step
   "continuous-discrete" approach for its design.  The correctness of
   its algorithms is proven in a continuous setting.  The algorithms are
   then mapped to a discrete space.  The source x and target y are
   points on the continuous interval [0,1).  Data items are hashed to
   this same interval.  <str> is a string that determines how messages
   leave any point on the ring: if bit t of the string is 0, the left
   leg is taken; if it is 1, the right leg is taken.  <str> increases by
   one bit each hop, giving a sequence by which to step around the ring.
   A lookup has two phases.  In the first, the lookup message containing
   the source, target, and the random string hops toward the midpoint of
   the source and target.  On each hop, the distance between <str>(x)
   and <str>(y) is halved, by virtue of the specific 'left' and 'right'
   functions.  In the second phase, the message steps 'backward' from
   the midpoint to the target, removing the last bit in <str> at each
   hop. 'Join' and 'leave' algorithms were outlined but there was no
   consideration of recovery times or message load on churn.  Using the
   Distance Halving properties, the authors devised a caching scheme to
   relieve congestion in a large P2P network.  They have also modified
   the algorithm to be more robust in the presence of random faults
   [132].

   Solid comparisons of DHT resilience are scarce, but Loguinov, Kumar,
   et al. give just that in their ODRI paper [36].  They compare Chord,
   CAN, and de Bruijn in terms of routing performance, graph expansion
   and clustering.  At the outset, they give the optimal diameter (the
   maximum hop count between any two nodes in the graph) and average hop
   count for graphs of fixed degree.  De Bruijn graphs converge to both
   optima, and outperform Chord and CAN on both counts.  These optima
   impact both delay and aggregate lookup load.  They present two
   clustering measures (edge expansion and node expansion), which are
   interesting for resilience.  Unfortunately, after decades of de
   Bruijn research, they have no exact solution.  De Bruijn was shown to

RFC4981 - Page 32

   be superior in terms of path overlap - "de Bruijn automatically
   selects backup paths that do not overlap with the best shortest path
   or with each other" [36].

3.5.6.  Skip Graphs

   Skip Graphs have been pursued by two research camps [38, 41].  They
   augment the earlier Skip Lists [347, 348].  Unlike earlier balanced
   trees, the Skip List is probabilistic -- its insert and delete
   operations do not require tree rearrangements and so are faster by a
   constant factor.  The Skip List consists of layers of ordered linked
   lists.  All nodes participate in the bottom layer 0 list.  Some of
   these nodes participate in the layer 1 list with some fixed
   probability.  A subset of layer 1 nodes participate in the layer 2
   list, and so on.  A lookup can proceed quickly through the list by
   traversing the sparse upper layers until it is close to, or at, the
   target.  Unfortunately, nodes in the upper layers of a Skip List are
   potential hot spots and single points of failure.  Unlike Skip Lists,
   Skip Graphs provide multiple lists at each level for redundancy, and
   every node participates in one of the lists at each level.

   Each node in a Skip Graph has theta(log n) neighbours on average,
   like some of the preceding DHTs.  The Skip Graph's primary edge over
   the DHTs is its support for prefix and proximity search.  DHTs hash
   objects to a random point in the graph.  Consequently, they give no
   guarantees over where the data is stored.  Nor do they guarantee that
   the path to the data will stay within the one administration as far
   as possible [38].  Skip graphs, on the other hand, provide for
   location-sensitive name searches.  For example, to find the document
   docname on the node user.company.com, the Skip Graph might step
   through its ordered lists for the prefix com.company.user [38].
   Alternatively, to find an object with a numeric identifier, an
   algorithm might search the lowest layer of the Skip Graph for the
   first digit, the next layer for the next digit, in the same vein
   until all digits are resolved.  Being ordered, Skip Graphs also
   facilitate range searches.  In each of these examples, the Skip Graph
   can be arranged such that the path to the target, as far as possible,
   stays within an administrative boundary.  If one administration is
   detached from the rest of the Skip Graph, routing can continue within
   each of the partitions.  Mechanisms have been devised to merge
   disconnected segments [157], though at this stage, segments are re-
   merged one at a time.  A parallel merge algorithm has been flagged
   for future work.

   The advantages of Skip Graphs come at a cost.  To be able to provide
   range queries and data placement flexibility, Skip Graph nodes
   require many more pointers than their DHT counterparts.  An increased
   number of pointers implies increased maintenance traffic.  Another

RFC4981 - Page 33

   shortcoming of at least one of the early proposals was that no
   algorithm was given to assign keys to machines.  Consequently, there
   are no guarantees on system-wide load balancing or on the distance
   between adjacent keys [100].  Aspnes, Kirsch, et al. have recently
   devised a scheme to reduce the inter-machine pointer count from
   O(mlogm), where m is the number of data elements, to O(nlog n), where
   n is the number of nodes [100].  They proposed a two-layer scheme --
   one layer for the Skip Graph itself and the second 'bucket layer'.
   Each machine is responsible for a number of buckets and each bucket
   elects a representative key.  Nodes locally adjust their load.  They
   accept additional keys if they are below their threshold or disperse
   keys to nearby nodes if they are above threshold.  There appear to be
   numerous open issues: simulations have been done but analysis is
   outstanding; mechanisms are required to handle the arrival and
   departure of nodes; there were only brief hints as to how to handle
   nodes with different capacities.

(page 33 continued on part 3)