Tech-invite3GPPspaceIETFspace
959493929190898887868584838281807978777675747372717069686766656463626160595857565554535251504948474645444342414039383736353433323130292827262524232221201918171615141312111009080706050403020100
in Index   Prev   Next

RFC 4981

Survey of Research towards Robust Peer-to-Peer Networks: Search Methods

Pages: 91
Informational
Part 1 of 5 – Pages 1 to 14
None   None   Next

Top   ToC   RFC4981 - Page 1
Network Working Group                                          J. Risson
Request for Comments: 4981                                      T. Moors
Category: Informational                    University of New South Wales
                                                          September 2007


       Survey of Research towards Robust Peer-to-Peer Networks:
                            Search Methods

Status of This Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

IESG Note

   This RFC is not a candidate for any level of Internet Standard.  The
   IETF disclaims any knowledge of the fitness of this RFC for any
   purpose and notes that the decision to publish is not based on IETF
   review apart from IESG review for conflict with IETF work.  The RFC
   Editor has chosen to publish this document at its discretion.  See
   RFC 3932 for more information.

Abstract

The pace of research on peer-to-peer (P2P) networking in the last five years warrants a critical survey. P2P has the makings of a disruptive technology -- it can aggregate enormous storage and processing resources while minimizing entry and scaling costs. Failures are common amongst massive numbers of distributed peers, though the impact of individual failures may be less than in conventional architectures. Thus, the key to realizing P2P's potential in applications other than casual file sharing is robustness. P2P search methods are first couched within an overall P2P taxonomy. P2P indexes for simple key lookup are assessed, including those based on Plaxton trees, rings, tori, butterflies, de Bruijn graphs, and skip graphs. Similarly, P2P indexes for keyword lookup, information retrieval and data management are explored. Finally, early efforts to optimize range, multi-attribute, join, and aggregation queries over P2P indexes are reviewed. Insofar as they are available in the primary literature, robustness mechanisms and metrics are highlighted throughout. However, the low-level mechanisms that most affect robustness are not well isolated in the literature. Recommendations are given for future research.
Top   ToC   RFC4981 - Page 2

Table of Contents

1. Introduction ....................................................3 1.1. Related Disciplines ........................................6 1.2. Structured and Unstructured Routing ........................7 1.3. Indexes and Queries ........................................9 2. Index Types ....................................................10 2.1. Local Index (Gnutella) ....................................10 2.2. Central Index (Napster) ...................................12 2.3. Distributed Index (Freenet) ...............................13 3. Semantic Free Index ............................................15 3.1. Origins ...................................................15 3.1.1. Plaxton, Rajaraman, and Richa (PRR) ................15 3.1.2. Consistent Hashing .................................16 3.1.3. Scalable Distributed Data Structures (LH*) .........16 3.2. Dependability .............................................17 3.2.1. Static Dependability ...............................17 3.2.2. Dynamic Dependability ..............................18 3.2.3. Ephemeral or Stable Nodes -- O(log n) or O(1) Hops ..........................................19 3.2.4. Simulation and Proof ...............................20 3.3. Latency ...................................................21 3.3.1. Hop Count and the O(1)-Hop DHTs ....................21 3.3.2. Proximity and the O(log n)-Hop DHTs ................22 3.4. Multicasting ..............................................23 3.4.1. Multicasting vs. Broadcasting ......................23 3.4.2. Motivation for DHT-based Multicasting ..............23 3.4.3. Design Issues ......................................24 3.5. Routing Geometries ........................................25 3.5.1. Plaxton Trees (Pastry, Tapestry) ...................25 3.5.2. Rings (Chord, DKS) .................................27 3.5.3. Tori (CAN) .........................................28 3.5.4. Butterflies (Viceroy) ..............................29 3.5.5. de Bruijn (D2B, Koorde, Distance Halving, ODRI) ....30 3.5.6. Skip Graphs ........................................32 4. Semantic Index .................................................33 4.1. Keyword Lookup ............................................34 4.1.1. Gnutella Enhancements ..............................36 4.1.2. Partition-by-Document, Partition-by-Keyword ........38 4.1.3. Partial Search, Exhaustive Search ..................39 4.2. Information Retrieval .....................................39 4.2.1. Vector Model (PlanetP, FASD, eSearch) ..............41 4.2.2. Latent Semantic Indexing (pSearch) .................43 4.2.3. Small Worlds .......................................43 5. Queries ........................................................44 5.1. Range Queries .............................................45 5.2. Multi-Attribute Queries ...................................48 5.3. Join Queries ..............................................50
Top   ToC   RFC4981 - Page 3
      5.4. Aggregation Queries .......................................50
   6. Security Considerations ........................................52
   7. Conclusions ....................................................52
   8. Acknowledgments ................................................53
   9. References .....................................................54
      9.1. Informative References ....................................54

1. Introduction

Peer-to-peer (P2P) networks are those that exhibit three characteristics: self-organization, symmetric communication, and distributed control [1]. A self-organizing P2P network "automatically adapts to the arrival, departure and failure of nodes" [2]. Communication is symmetric in that peers act as both clients and servers. It has no centralized directory or control point. USENET servers and BGP peers have these traits [3] but the emphasis here is on the flurry of research since 2000. Leading examples include Gnutella [4], Freenet [5], Pastry [2], Tapestry [6], Chord [7], the Content Addressable Network (CAN) [8], pSearch [9], and Edutella [10]. Some have suggested that peers are inherently unreliable [11]. Others have assumed well-connected, stable peers [12]. This critical survey of P2P academic literature is warranted, given the intensity of recent research. At the time of writing, one research database lists over 5,800 P2P publications [13]. One vendor surveyed P2P products and deployments [14]. There is also a tutorial survey of leading P2P systems [15]. DePaoli and Mariani recently reviewed the dependability of some early P2P systems at a high level [16]. The need for a critical survey was flagged in the peer-to-peer research group of the Internet Research Task Force (IRTF) [17]. P2P is potentially a disruptive technology with numerous applications, but this potential will not be realized unless it is demonstrated to be robust. A massively distributed search technique may yield numerous practical benefits for applications [18]. A P2P system has potential to be more dependable than architectures relying on a small number of centralized servers. It has potential to evolve better from small configurations -- the capital outlays for high performance servers can be reduced and spread over time if a P2P assembly of general purpose nodes is used. A similar argument motivated the deployment of distributed databases -- one thousand, off-the-shelf PC processors are more powerful and much less expensive than a large mainframe computer [19]. Storage and processing can be aggregated to achieve massive scale. Wasteful partitioning between servers or clusters can be avoided. As Gedik and Liu put it, if P2P is to find its way into applications other than casual file sharing, then reliability needs to be addressed [20].
Top   ToC   RFC4981 - Page 4
   The taxonomy of Figure 1 divides the entire body of P2P research
   literature along four lines: search, storage, security, and
   applications.  This survey concentrates on search aspects.  A P2P
   search network consists of an underlying index (Sections 2 to 4) and
   queries that propagate over that index (Section 5).

   Search [18, 21-29]
      Semantic-Free Indexes [2, 6, 7, 30-52]
         Plaxton Trees
         Rings
         Tori
         Butterflies
         de Bruijn Graphs
         Skip Graphs
      Semantic Indexes [4, 53-71]
         Keyword Lookup
         Peer Information Retrieval
         Peer Data Management
      Queries [20, 22, 23, 25, 32, 38, 41, 56, 72-100]
         Range Queries
         Multi-Attribute Queries
         Join Queries
         Aggregation Queries
         Continuous Queries
         Recursive Queries
         Adaptive Queries

   Storage
      Consistency & Replication [101-112]
         Eventual consistency
         Trade-offs
      Distribution [39, 42, 90, 92, 113-131]
         Epidemics, Bloom Filters
      Fault Tolerance [40, 105, 132-139]
         Erasure Coding
         Byzantine Agreement
      Locality [24, 43, 47, 140-160]
      Load Balancing [37, 86, 100, 107, 151, 161-171]
Top   ToC   RFC4981 - Page 5
   Security
      Character [172-182]
         Identity
         Reputation and Trust
         Incentives
      Goals [25, 27, 71, 183-197]
         Availability
         Authenticity
         Anonymity
         Access Control
         Fair Trading

   Applications [1, 198-200]
      Memory [32, 90, 142, 201-222]
         File Systems
         Web
         Content Delivery Networks
         Directories
      Service Discovery
      Publish / Subscribe ...
   Intelligence [223-228]
      GRID
      Security...
   Communication [12, 92, 119, 229-247]
      Multicasting
      Streaming Media
      Mobility
      Sensors...

            Figure 1: Classification of P2P Research Literature

   This survey is concerned with two questions.  The first, "How do P2P
   search networks work?"  This foundation is important given the pace
   and breadth of P2P research in the last five years.  In Section 2, we
   classify indexes as local, centralized and distributed.  Since
   distributed indexes are becoming dominant, they are given closer
   attention in Sections 3 and 4.  Section 3 compares distributed P2P
   indexes for simple key lookup; in particular, their origins (Section
   3.1), dependability (Section 3.2), latency (Section 3.3), and their
   support for multicast (Section 3.4).  It classifies those indexes
   according to their routing geometry (Section 3.5) -- Plaxton trees,
   rings, tori, butterflies, de Bruijn graphs and skip graphs.  Section
   4 reviews distributed P2P indexes supporting keyword lookup (Section
   4.1) and information retrieval (Section 4.2).  Section 5 probes the
   embryonic research on P2P queries; in particular, range queries
   (Section 5.1), multi-attribute queries (Section 5.2), join queries
   (Section 5.3), and aggregation queries (Section 5.4).
Top   ToC   RFC4981 - Page 6
   The second question, "How robust are P2P search networks?"  Insofar
   as it is available in the research literature, we tease out the
   robustness mechanisms and metrics throughout Sections 2 to 5.
   Unfortunately, robustness is often more sensitive to low-level design
   choices than it is to the broad P2P index structure, yet these
   underlying design choices are seldom isolated in the primary
   literature [248].  Furthermore, there has been little consensus on
   P2P robustness metrics (Section 3.2).  Section 8 gives
   recommendations to address these important gaps.

1.1. Related Disciplines

Peer-to-peer research draws upon numerous distributed systems disciplines. Networking researchers will recognize familiar issues of naming, routing, and congestion control. P2P designs need to address routing and security issues across network region boundaries [152]. Networking research has traditionally been host-centric. The Web's Universal Resource Identifiers are naturally tied to specific hosts, making object mobility a challenge [216]. P2P work is data-centric [249]. P2P systems for dynamic object location and routing have borrowed heavily from the distributed systems corpus. Some have used replication, erasure codes, and Byzantine agreement [111]. Others have used epidemics for durable peer group communication [39]. Similarly, P2P research is set to benefit from database research [250]. Database researchers will recognize the need to reapply Codd's principle of physical data independence, that is, to decouple data indexes from the applications that use the data [23]. It was the invention of appropriate indexing mechanisms and query optimizations that enabled data independence. Database indexes like B+ trees have an analog in P2P's distributed hash tables (DHTs). Wide-area, P2P query optimization is a ripe, but challenging, area for innovation. More flexible distribution of objects comes with increased security risks. There are opportunities for security researchers to deliver new methods for availability, file authenticity, anonymity, and access control [25]. Proactive and reactive mechanisms are needed to deal with large numbers of autonomous, distributed peers. To build robust systems from cooperating but self-interested peers, issues of identity, reputation, trust, and incentives need to be tackled. Although it is beyond the scope of this paper, robustness against malicious attacks also ought to be addressed [195]. Possibly the largest portion of P2P research has majored on basic routing structures [18], where research on algorithms comes to the
Top   ToC   RFC4981 - Page 7
   fore.  Should the overlay be "structured" or "unstructured"?  Are the
   two approaches competing or complementary?  Comparisons of the
   "structured" approaches (hypercubes, rings, toroids, butterflies, de
   Bruijn, and skip graphs) have weighed the amount of routing state per
   peer and the number of links per peer against overlay hop counts.
   While "unstructured" overlays initially used blind flooding and
   random walks, overheads usually trigger some structure, for example,
   super-peers and clusters.

   P2P applications rely on cooperation between these disciplines.
   Applications have included file sharing, directories, content
   delivery networks, email, distributed computation, publish-subscribe
   middleware, multicasting, and distributed authentication.  Which
   applications will be suited to which structures?  Are there adaptable
   mechanisms that can decouple applications from the underlying data
   structures?  What are the criteria for selection of applications
   amenable to a P2P design [1]?

   Robustness is emphasized throughout the survey.  We are particularly
   interested in two aspects.  The first, dependability, was a leading
   design goal for the original Internet [251].  It deserves the same
   status in P2P.  The measures of dependability are well established:
   reliability, a measure of the mean-time-to-failure (MTTF);
   availability, a measure of both the MTTF and the mean-time-to-repair
   (MTTR); maintainability; and safety [252].  The second aspect is the
   ability to accommodate variation in outcome, which one could call
   adaptability.  Its measures have yet to be defined.  In the context
   of the Internet, it was only recently acknowledged as a first-class
   requirement [253].  In P2P, it means planning for the tussles over
   resources and identity.  It means handling different kinds of queries
   and accommodating changeable application requirements with minimal
   intervention.  It means "organic scaling" [22], whereby the system
   grows gracefully, without a priori data center costs or architectural
   breakpoints.

   In the following section, we discuss one notable omission from the
   taxonomy of P2P networking in Figure 1 -- routing.

1.2. Structured and Unstructured Routing

P2P routing algorithms have been classified as "structured" or "unstructured". Peers in unstructured overlay networks join by connecting to any existing peers [254]. In structured overlays, the identifier of the joining peer determines the set of peers that it connects to [254]. Early instantiations of Gnutella were unstructured -- keyword queries were flooded widely [255]. Napster [256] had decentralized content and a centralized index, so it only partially satisfies the distributed control criteria for P2P systems.
Top   ToC   RFC4981 - Page 8
   Early structured algorithms included Plaxton, Rajaraman and Richa
   (PRR) [30], Pastry [2], Tapestry [31], Chord [7], and the Content
   Addressable Network [8].  Mishchke and Stiller recently classified
   P2P systems by the presence or absence of structure in routing tables
   and network topology [257].

   Some have cast unstructured and structured algorithms as competing
   alternatives.  Unstructured approaches have been called "first
   generation", implicitly inferior to the "second generation"
   structured algorithms [2, 31].  When generic key lookups are
   required, these structured, key-based routing schemes can guarantee
   location of a target within a bounded number of hops [23].  The
   broadcasting unstructured approaches, however, may have large routing
   costs, or fail to find available content [22].  Despite the apparent
   advantages of structured P2P, several research groups are still
   pursuing unstructured P2P.

   There have been two main criticisms of structured systems [61].  The
   first relates to peer transience, which in turn, affects robustness.
   Chawathe, et al. opined that highly transient peers are not well
   supported by DHTs [61].  P2P systems often exhibit "churn", with
   peers continually arriving and departing.  One objection to concerns
   about highly transient peers is that many applications use peers in
   well-connected parts of the network.  The Tapestry authors analyzed
   the impact of churn in a network of 1000 nodes [31].  Others opined
   that it is possible to maintain a robust DHT at relatively low cost
   [258].  Very few papers have quantitatively compared the resilience
   of structured systems.  Loguinov, Kumar, et al. claimed that there
   were only two such works [24, 36].

   The second criticism of structured systems is that they do not
   support keyword searches and complex queries as well as unstructured
   systems.  Given the current file-sharing deployments, keyword
   searches seem more important than exact-match key searches in the
   short term.  Paraphrased, "most queries are for hay, not needles"
   [61].

   More recently, some have justifiably seen unstructured and structured
   proposals as complementary, and have devised hybrid models [259].
   Their starting point was the observation that unstructured flooding
   or random walks are inefficient for data that is not highly
   replicated across the P2P network.  Structured graphs can find keys
   efficiently, irrespective of replication.  Castro, et al. proposed
   Structella, a hybrid of Gnutella built on top of Pastry [259].
   Another design used structured search for rare items and unstructured
   search for massively replicated items [54].
Top   ToC   RFC4981 - Page 9
   However, the "structured versus unstructured routing" taxonomy is
   becoming less useful, for two reasons, Firstly, most "unstructured"
   proposals have evolved and incorporated structure.  Consider the
   classic "unstructured" system, Gnutella [4].  For scalability, its
   peers are either ultrapeers or leaf nodes.  This hierarchy is
   augmented with a query routing protocol whereby ultrapeers receive a
   hashed summary of the resource names available at leaf nodes.
   Between ultrapeers, simple query broadcast is still used, though
   methods to reduce the query load here have been considered [260].
   Secondly, there are emerging schema-based P2P designs [59], with
   super-node hierarchies and structure within documents.  These are
   quite distinct from the structured DHT proposals.

1.3. Indexes and Queries

Given that most, if not all, P2P designs today assume some structure, a more instructive taxonomy would describe the structure. In this survey, we use a database taxonomy in lieu of the networking taxonomy, as suggested by Hellerstein, Cooper, and Garcia-Molina [23, 261]. The structure is determined by the type of index (Sections 2 , 3, and 4). Queries feature in lieu of routing (Section 5). The DHT algorithms implement a "semantic-free index" [216]. They are oblivious of whether keys represent document titles, meta-data, or text. Gnutella-like and schema-based proposals have a "semantic index". Index engineering is at the heart of P2P search methods. It captures a broad range of P2P issues, as demonstrated by the Search/Index Links model [261]. As Manber put it, "the most important of the tools for information retrieval is the index -- a collection of terms with pointers to places where information about documents can be found" [262]. Sen and Wang noted that a "P2P network" usually consists of connections between hosts for application-layer signaling, rather than for the data transfer itself [263]. Similarly, we concentrate on the "signaled" indexes and queries. Our focus here is the dependability and adaptability of the search network. Static dependability is a measure of how well queries route around failures in a network that is normally fault-free. Dynamic dependability gives an indication of query success when nodes and data are continually joining and leaving the P2P system. An adaptable index accommodates change in the data and query distribution. It enables data independence, in that it facilitates changes to the data layout without requiring changes to the applications that use the data [23]. An adaptable P2P system can support rich queries for a wide range of applications. Some applications benefit from simple, semantic-free key lookups [264]. Others require more complex, Structured Query Language (SQL)-like
Top   ToC   RFC4981 - Page 10
   queries to find documents with multiple keywords, or to aggregate or
   join query results from distributed relations [22].

2. Index Types

A P2P index can be local, centralized, or distributed. With a local index, a peer only keeps the references to its own data, and does not receive references for data at other nodes. The very early Gnutella design epitomized the local index (Section 2.1). In a centralized index, a single server keeps references to data on many peers. The classic example is Napster (Section 2.2). With distributed indexes, pointers towards the target reside at several nodes. One very early example is Freenet (Section 2.3). Distributed indexes are used in most P2P designs nowadays -- they dominate this survey. P2P indexes can also be classified as non-forwarding and forwarding. When queries are guided by a non-forwarding index, they jump to the node containing the target data in a single hop. There have been semantic and semantic-free one-hop schemes [138, 265, 266]. Where scalability to a massive number of peers is required, these schemes have been extended to two hops [267, 268]. More common are the forwarding P2Ps, where the number of hops varies with the total number of peers, often logarithmically. The related trade-offs between routing state, lookup latency, update bandwidth, and peer churn are critical to total system dependability.

2.1. Local Index (Gnutella)

P2Ps with a purely local data index are becoming rare. In such designs, peers flood queries widely and only index their own content. They enable rich queries - the search is not limited to a simple key lookup. However, they also generate a large volume of query traffic with no guarantee that a match will be found, even if it does exist on the network. For example, to find potential peers on the early instantiations of Gnutella, 'ping' messages were broadcast over the P2P network and the 'pong' responses were used to build the node index. Then, small 'query' messages, each with a list of keywords, are broadcast to peers that respond with matching filenames [4]. There have been numerous attempts to improve the scalability of local-index P2P networks. Gnutella uses fixed time-to-live (TTL) rings, where the query's TTL is set less than 7-10 hops [4]. Small TTLs reduce the network traffic and the load on peers, but also reduce the chances of a successful query hit. One paper reported, perhaps a little too bluntly, that the fixed "TTL-based mechanism does not work" [67]. To address this TTL selection problem, they proposed an expanding ring, known elsewhere as iterative deepening [29]. It uses successively larger TTL counters until there is a
Top   ToC   RFC4981 - Page 11
   match.  The flooding, ring, and expanding ring methods all increase
   network load with duplicated query messages.  A random walk, whereby
   an unduplicated query wanders about the network, does indeed reduce
   the network load but massively increases the search latency.  One
   solution is to replicate the query k times at each peer.  Called
   random k-walkers, this technique can be coupled with TTL limits, or
   periodic checks with the query originator, to cap the query load
   [67].  Adamic, Lukose, et al. suggested that the random walk searches
   be directed to nodes with a higher degree, that is, with larger
   numbers of inter-peer connections [269].  They assumed that higher-
   degree peers are also capable of higher query throughputs.  However,
   without some balancing design rule, such peers would be swamped with
   the entire P2P signaling traffic.  In addition to the above
   approaches, there is the 'directed breadth-first' algorithm [29].  It
   forwards queries within a subset of peers selected according to
   heuristics on previous performance, like the number of successful
   query results.  Another algorithm, called probabilistic flooding, has
   been modeled using percolation theory [270].

   Several measurement studies have investigated locally indexed P2Ps.
   Jovanovic noted Gnutella's power law behaviour [70].  Sen and Wang
   compared the performance of Gnutella, Fasttrack [271], and Direct
   Connect [263, 272, 273].  At the time, only Gnutella used local data
   indexes.  All three schemes now use distributed data indexes, with
   hierarchy in the form of Ultrapeers (Gnutella), Super-Nodes
   FastTrack), and Hubs (Direct Connect).  It was found that a very
   small percentage of peers have a very high degree and that the total
   system dependability is at the mercy of such peers.  While peer up-
   time and bandwidth were heavy-tailed, they did not fit well with the
   Zipf distribution.  Fortunately for Internet Service Providers,
   measures aggregated by IP prefix and Autonomous System (AS) were more
   stable than for individual IP addresses.  A study of University of
   Washington traffic found that Gnutella and Kazaa together contributed
   43% of the university's total TCP traffic [274].  They also reported
   a heavy-tailed distribution, with 600 external peers (out of 281,026)
   delivering 26% of Kazaa bytes to internal peers.  Furthermore,
   objects retrieved from the P2P network were typically three orders of
   magnitude larger than Web objects -- 300 objects contributed to
   almost half the total outbound Kazaa bandwidth.  Others reported
   Gnutella's topology mismatch, whereby only 2-5% of P2P connections
   link peers in the same Autonomous System (AS), despite over 40% of
   peers being in the top 10 ASs [65].  Together these studies
   underscore the significance of multimedia sharing applications.  They
   motivate interesting caching and locality solutions to the topology
   mismatch problem.

   These same studies bear out one main dependability lesson: total
   system dependability may be sensitive to the dependability of high-
Top   ToC   RFC4981 - Page 12
   degree peers.  The designers of Scamp translated this observation to
   the design heuristic, "have the degree of each node be of nearly
   equal size" [153].  They analyzed a system of N peers, with mean
   degree c.log(n), where link failures occur independently with
   probability e.  If d>0 is fixed and c>(1+d)/(-log(e)), then the
   probability of graph disconnection goes to zero as N->infinity.
   Otherwise, if c<(1-d)/(-log(e)), then the probability of
   disconnection goes to one as N->infinity.  They presented a
   localizer, which finds approximate minima to a global function of
   peer degree and arbitrary link costs using only local information.
   The Scamp overlay construction algorithms could support any of the
   flooding and walking routing schemes above, or other epidemic and
   multicasting schemes for that matter.  Resilience to high churn rates
   was identified for future study.

2.2. Central Index (Napster)

Centralized schemes like Napster [256] are significant because they were the first to demonstrate the P2P scalability that comes from separating the data index from the data itself. Ultimately, 36 million Napster users lost their service not because of technical failure, but because the single administration was vulnerable to the legal challenges of record companies [275]. There has since been little research on P2P systems with central data indexes. Such systems have also been called 'hybrid' since the index is centralized but the data is distributed. Yang and Garcia-Molina devised a four-way classification of hybrid systems [276]: unchained servers, where users whose index is on one server do not see other servers' indexes; chained servers, where the server that receives a query forwards it to a list of servers if it does not own the index itself; full replication, where all centralized servers keep a complete index of all available metadata; and hashing, where keywords are hashed to the server where the associated inverted list is kept. The unchained architecture was used by Napster, but it has the disadvantage that users do not see all indexed data in the system. Strictly speaking, the other three options illustrate the distributed data index, not the central index. The chained architecture was recommended as the optimum for the music-swapping application at the time. The methods by which clients update the central index were classified as batch or incremental, with the optimum determined by the query-to-login ratio. Measurements were derived from a clone of Napster called OpenNap[277]. Another study of live Napster data reported wide variation in the availability of peers, a general unwillingness to share files (20-40% of peers share few or no files), and a common understatement of available bandwidth so as to discourage other peers from sharing one's link [202].
Top   ToC   RFC4981 - Page 13
   Influenced by Napster's early demise, the P2P research community may
   have prematurely turned its back on centralized architectures.
   Chawathe, Ratnasamy, et al. opined that Google and Yahoo demonstrate
   the viability of a centralized index.  They argued that "the real
   barriers to Napster-like designs are not technical but legal and
   financial" [61].  Even this view may be a little too harsh on the
   centralized architectures -- it implies that they always have an up-
   front capital hurdle that is steeper than for distributed
   architectures.  The closer one looks at scalable 'centralized'
   architectures, the less the distinction with 'distributed'
   architectures seems to matter.  For example, it is clear that
   Google's designers consider Google a distributed, not centralized,
   file system [278].  Google demonstrates the scale and performance
   possible on commodity hardware, but still has a centralized master
   that is critical to the operation of each Google cluster.  Time may
   prove that the value of emerging P2P networks, regardless of the
   centralized-versus-distributed classification, is that they smooth
   the capital outlays and remove the single points of failure across
   the spectra of scale and geographic distribution.

2.3. Distributed Index (Freenet)

An important early P2P proposal for a distributed index was Freenet [5, 71, 279]. While its primary emphasis was the anonymity of peers, it did introduce a novel indexing scheme. Files are identified by low-level "content-hash" keys and by "secure signed-subspace" keys, which ensure that only a file owner can write to a file while anyone can read from it. To find a file, the requesting peer first checks its local table for the node with keys closest to the target. When that node receives the query, it too checks for either a match or another node with keys close to the target. Eventually, the query either finds the target or exceeds time-to-live (TTL) limits. The query response traverses the successful query path in reverse, depositing a new routing table entry (the requested key and the data holder) at each peer. The insert message similarly steps towards the target node, updating routing table entries as it goes, and finally stores the file there. Whereas early versions of Gnutella used breadth-first flooding, Freenet uses a more economic depth-first search [280]. An initial assessment has been done of Freenet's robustness. It was shown that in a network of 1000 nodes, the median query path length stayed under 20 hops for a failure of 30% of nodes. While the Freenet designers considered this as evidence that the system is "surprisingly robust against quite large failures" [71], the same datapoint may well be outside meaningful operating bounds. How many applications are useful when the first quartile of queries have path lengths of several hundred hops in a network of only 1000 nodes, per
Top   ToC   RFC4981 - Page 14
   Figure 4 of [71]?  To date, there has been no analysis of Freenet's
   dynamic robustness.  For example, how does it perform when nodes are
   continually arriving and departing?

   There have been both criticisms and extensions of the early Freenet
   work.  Gnutella proponents acknowledged the merit in Freenet's
   avoidance of query broadcasting [281].  However, they are critical on
   two counts: the exact file name is needed to construct a query; and
   exactly one match is returned for each query.  P2P designs using
   DHTs, per Section 3, share similar characteristics -- a precise query
   yields a precise response.  The similarity is not surprising since
   Freenet also uses a hash function to generate keys.  However, the
   query routing used in the DHTs has firmer theoretical foundations.
   Another difference with DHTs is that Freenet will take time, when a
   new node joins the network, to build an index that facilitates
   efficient query routing.  By the inventor's own admission, this is
   damaging for a user's first impressions [282].  It was proposed to
   download a copy of routing tables from seed nodes at startup, even
   though the new node might be far from the seed node.  Freenet's slow
   startup motivated Mache, Gilbert, et al. to amend the overlay after
   failed requests and to place additional index entries on successful
   requests -- they claim almost an order of magnitude reduction in
   average query path length [280].  Clarke also highlighted the lack of
   locality or bandwidth information available for efficient query
   routing decisions [282].  He proposed that each node gather response
   times, connection times, and proportion of successful requests for
   each entry in the query routing table.  When searching for a key that
   is not in its own routing table, it was proposed to estimate response
   times from the routing metrics for the nearest known keys and
   consequently choose the node that can retrieve the data fastest.  The
   response time heuristic assumed that nodes close in the key space
   have similar response times.  This assumption stemmed from early
   deployment observations that Freenet peers seemed to specialize in
   parts of the keyspace -- it has not been justified analytically.
   Kronfol drew attention to Freenet's inability to do keyword searches
   [283].  He suggested that peers cache lists of weighted keywords in
   order to route queries to documents, using Term Frequency Inverse
   Document Frequency (TFIDF) measures and inverted indexes (Section
   4.2.1).  With these methods, a peer can route queries for simple
   keyword lists or more complicated conjunctions and disjunctions of
   keywords.  Robustness analysis and simulation of Kronfol's proposal
   remain open.

   The vast majority of P2P proposals in following sections rely on a
   distributed index.


(next page on part 2)

Next Section