Tech-invite3GPPspaceIETFspace
9796959493929190898887868584838281807978777675747372717069686766656463626160595857565554535251504948474645444342414039383736353433323130292827262524232221201918171615141312111009080706050403020100
in Index   Prev   Next

RFC 4981

Survey of Research towards Robust Peer-to-Peer Networks: Search Methods

Pages: 91
Informational
Part 3 of 5 – Pages 33 to 54
First   Prev   Next

Top   ToC   RFC4981 - Page 33   prevText

4. Semantic Index

Semantic indexes capture object relationships. While the semantic- free methods (DHTs) have firmer theoretic foundations and guarantee that a key can be found if it exists, they do not capture the relationships between the document name and its content or metadata on their own. Semantic P2P designs do. However, since their design is often driven by heuristics, they may not guarantee that scarce items will be found. So what might the semantically indexed P2Ps add to an already crowded field of distributed information architectures? At one extreme, there are the distributed relational database management systems (RDBMSs), with their strong consistency guarantees [284]. They provide strong data independence, the flexibility of SQL queries, and strong transactional semantics -- Atomicity, Consistency, Isolation and Durability (ACID) [349]. They guarantee that the query response is complete -- all matching results are returned. The price is performance. They scale to perhaps 1000 nodes, as evidenced in Mariposa [350, 351], or require query caching front ends to constrain the load [284]. Database research has "arguably been cornered into traditional, high-end, transactional applications" [72]. Then there are distributed file systems, like the Network File System (NFS) or the Serverless Network File Systems (xFS), with little data independence, low-level file retrieval interfaces, and varied consistency [284]. Today's eclectic mix of Content Distribution Networks (CDNs) generally deload primary servers by redirecting Web requests to a nearby replica. Some intercept the HTTP requests at the DNS level and then use consistent hashing to find a replica [23]. Since this same consistent hashing was a forerunner to the DHT
Top   ToC   RFC4981 - Page 34
   approaches above, CDNs are generally constrained to the same simple
   key lookups.

   The opportunity for semantically indexed P2Ps, then, is to provide:

   a) graduated data independence, consistency, and query flexibility,
      and

   b) probabilistically complete query responses, across

   c) very large numbers of low-cost, geographically distributed,
      dynamic nodes.

4.1. Keyword Lookup

P2P keyword lookup is best understood by considering the structure of the underlying index and the algorithms by which queries are routed over that index. Figure 3 summarizes the following paragraphs by classifying the keyword query algorithms, index structures, and metrics. The research has largely focused on scalability, not dependability. There have been very few studies that quantify the impact of network churn. One exception is the work by Chawathe, et al. on the Gia system [61]. Gia's combination of algorithms from Figure 3 (receiver-based flow control, biased random walk, and one- hop replication) gave 2-4 orders of magnitude improvement in query success rates in churning networks.
Top   ToC   RFC4981 - Page 35
   QUERY
   Query routing
     Flooding: Peers only index local files so queries must propagate
       widely [4]
     Policy-based: Choice of the next hop node: random; most/least
       recently used; most files shared; most results [265, 352]
     Random walks: Parallel [67] or biased random walks [61, 66]
   Query forwarding
     Iterative: Nodes perform iterative unicast searches of ultrapeers,
       until the desired number of results is achieved.  See Gnutella
       UDP Extension for Scalable Searches (GUESS) [265, 353]
     Recursive
   Query flow control
     Receiver-controlled: Receivers grant query tokens to senders, so
       as to avoid overload [61]
     Reactive: sender throttles queries when it notices receivers are
       discarding packets [61, 66]
     Dynamic Time To Live: In the Dynamic Query Protocol, the sender
       adjusts the time-to-live on each iteration based on the number
       of results received, the number of connections left, and the
       number of nodes already theoretically reached by the search [354]

   INDEX
   Distribution
     Compression: Leaf nodes periodically send ultrapeers compressed
       query routing tables, as in the Query Routing Protocol [260]
     One hop replication: Nodes maintain an index of content on their
       nearest neighbors [61, 352]
   Partitioning
     By document [210]
     By keyword: Use an inverted list to find a matching document,
       either locally or at another peer [21].  Partition by keyword
       sets [355]
     By document and keyword: Also called Multi-Level Partitioning [21]

   METRIC
   Query load: Queries per second per node/link [65, 265]
   Degree: The number of links per node [66, 352].  Early P2P networks
     approximated power-law networks, where the number of nodes with L
     links is proportional to L^(-k), where k is a constant [65]
   Query delay: Reported in terms of time and hop count [61, 66]
   Query success rate: The "Collapse Point" is the per-node query rate
     at which the query success rate drops below 90% [61].  See
     also [61, 265, 352].

                  Figure 3: Keyword Lookup in P2P Systems
Top   ToC   RFC4981 - Page 36

4.1.1. Gnutella Enhancements

Perhaps the most widely referenced P2P system for simple keyword match is Gnutella [4]. Gnutella queries contain a string of keywords. Gnutella peers answer when they have files whose names contain all the keywords. As discussed in Section 2.1, early versions of Gnutella did not forward the document index. Queries were flooded and peers searched their own local indexes for filename matches. An early review highlighted numerous areas for improvement [65]. It was estimated that the query traffic alone from 50,000 early-generation Gnutella nodes would amount to 1.7% of the total U.S. Internet backbone traffic at December 2000 levels. It was speculated that high-degree Gnutella nodes would impede dependability. An unnecessarily high percentage of Gnutella traffic crossed Autonomous System (AS) boundaries -- a locality mechanism may have found suitable nearby peers. Fortunately, there have since been numerous enhancements within the Gnutella Developer Forum. At the time of writing, it has been reported that Gnutella has almost 350,000 unique hosts, of which nearly 90,000 accept incoming connections [356]. One of the main improvements is that an index of filename keywords, called the Query Routing Table (QRT), can now be forwarded from 'leaf peers' to its 'ultrapeers' [260]. Ultrapeers can then ensure that the leaves only receive queries for which they have a match, dramatically reducing the query traffic at the leaves. Ultrapeers can have connections to many leaf nodes (~10-100) and a small number of other ultrapeers (<10) [260]. Originally, a leaf node's QRT was not forwarded by the parent ultrapeer to other ultrapeers. More recently, there has been a proposal to distribute aggregated QRTs amongst ultrapeers [357]. To further limit traffic, QRTs are compressed by hashing, according to the Query Routing Protocol (QRP) specification [281]. This same specification claims QRP may reduce Gnutella traffic by orders of magnitude, but cautions that simulation is required before mass deployment. A known shortcoming of QRP was that the extent of query propagation was independent of the popularity of the search terms. The Dynamic Query Protocol addressed this [358]. It required leaf nodes to send single queries to high-degree ultrapeers that adjust the queries' time-to-live (TTL) bounds according to the number of received query results. An earlier proposal, called the Gnutella UDP Extension for Scalable Searches (GUESS) [353], similarly aimed to reduce the number of queries for widely distributed files. GUESS reuses the non-forwarding idea (Section 2). A GUESS peer repeatedly queries single ultrapeers with a TTL of 1, with a small timeout on each query to limit load. It chooses the number of iterations and selects ultrapeers so as to satisfy its search needs. For adaptability, a small number of experimental Gnutella nodes have
Top   ToC   RFC4981 - Page 37
   implemented eXtensible Markup Language (XML) schemas for richer
   queries [359, 360].  None of the above Gnutella proposals explicitly
   assess robustness.

   The broader research community has recently been leveraging aspects
   of the Gnutella design.  Lv, Ratnasamy, et al. exposed one assumption
   implicit in some of the early DHT work -- that designs "such as
   Gnutella are inherently not scalable, and therefore should be
   abandoned" [66].  They argued that by making better use of the more
   powerful peers, Gnutella's scalability issues could be alleviated.
   Instead of its flooding mechanism, they used random walks.  Their
   preliminary design to bias random walks towards high capacity nodes
   did not go as far as the ultrapeer proposals in that the indexes did
   not move to the high-capacity nodes.  Chawathe, Ratnasamy, et al.
   chose to extend the Gnutella design with their Gia system, in
   response to the perceived shortcomings of DHTs in Section 1.2 [61].
   Compared to the early Gnutella designs, they incorporated several
   novel features.  They devise a topology adaptation algorithm so that
   most peers are attached to high-degree peers.  They use a random walk
   search algorithm, in lieu of flooding, and bias the query load
   towards higher-degree peers.  For 'one-hop replication', they require
   all nodes to keep pointers to content on adjacent peers.  To
   implement a receiver-controlled token-based flow control, a peer must
   have a token from its neighbouring peer before it sends a query to
   it.  Chawathe, Ratnasamy, et al. show by simulations that the
   combination of these features provides a scalability improvement of
   three to five orders of magnitude over Gnutella "while retaining
   significant robustness".  The main robustness metrics they used were
   the 'collapse point' query rate (the per-node query rate at which the
   successful query rate falls below 90%) and the average hop count
   immediately prior to collapse.  Their comparison with Gnutella did
   not take into account the Gnutella enhancements above -- this was
   left as future work.  Castro, Costa, and Rowstron argued that if
   Gnutella were built on top of a structured overlay, then both the
   query and overlay maintenance traffic could be reduced [259].  Yang,
   Vinograd, et al. explore various policies for peer selection in the
   GUESS protocol, since the issue is left open in the original proposal
   [265].  For example, the peer initiating the query could choose peers
   that have been "most recently used" or that have the "most files
   shared".  Various policy pitfalls are identified.  For example, good
   peers could be overloaded, victims of their own success.
   Alternatively, malicious peers could encourage the querying peer to
   try inactive peers.  They conclude that a "most results" policy gives
   the best balance of robustness and efficiency.  Like Castro, Costa,
   and Rowstron, they concentrated on the static network scenario.
   Cholvi, Felber, et al. very briefly describe how similar "least
   recently used" and "most often used" heuristics can be used by a peer
   to select peer 'acquaintances' [352].  They were motivated by the
Top   ToC   RFC4981 - Page 38
   congestion associated with Gnutella's TTL-limited flooding.
   Recognizing that the busiest peers can quickly become overloaded
   central hubs for the entire network, they limit the number of
   acquaintances for any given peer to 25.  They sketch a mechanism to
   decrement a query's TTL multiple times when it traverses "interested
   peers".  In summary, these Gnutella-related investigations are
   characterized by a bias for high-degree peers and very short directed
   query paths, a disdain for flooding, and concern about excessive load
   on the 'better' peers.  Generally, the robustness analysis for
   dynamic networks (content updates and node arrivals/departures)
   remains open.

4.1.2. Partition-by-Document, Partition-by-Keyword

One aspect of P2P keyword search systems has received particular attention: should the index be partitioned by document or by keyword? The issue affects scalability. To be partitioned by document, each node has a local index of documents for which it is responsible. Gnutella is a prime example. Queries are generally flooded in systems partitioned by document. On the other hand, a peer may assume responsibility for a set of keywords. The peer uses an inverted list to find a matching document, either locally or at another peer. If the query contains several keywords, inverted lists may need to be retrieved from several different peers to find the intersection [21]. The initial assessment by Li, Loo, et al. was that the partition-by-document approach was superior [210]. For one scenario of a full-text Web search, they estimated the communications costs to be about six times higher than the feasible budget. However, wanting to exploit prior work on inverted list intersection, they studied the partition-by-keyword strategy. They proposed several optimizations that put the communication costs for a partition-by-keyword system within an order of magnitude of feasibility. There had been a couple of prior papers that suggested partitioned-by-keyword designs incorporate DHTs to map keywords to peers [355, 361]. In Gnawali's Keyword-set Search System (KSS), the index is partitioned by sets of keywords [355]. Terpstra, Behnel, et al. point out that by keeping keyword pairs or triples, the number of lists per document in KSS is squared or tripled [362]. Shi, Guangwen, et al. interpreted the approximations of Li, Loo, et al. to mean that neither approach is feasible on its own [21]. Their Multi-Level Partitioning (MLP) scheme incorporates both partitioning approaches. They arrange nodes into a group hierarchy, with all nodes in the single 'level 0' group, and with the same nodes sub- divided into k logical subgroups on 'level 1'. The subgroups are again divided, level by level, until level l. The inverted index is partitioned by document between groups and by keyword within groups. MLP avoids the query flooding normally associated with systems partitioned by document, since a small number of nodes in each group
Top   ToC   RFC4981 - Page 39
   process the query.  It reduces the bandwidth overheads associated
   with inverted list intersection in systems partitioned solely by
   keyword, since groups can calculate the intersection independently
   over the documents for which they are responsible.  MLP was overlaid
   on SkipNet, per Section 3.5.6 [38].  Some initial analyses of
   communications costs and query latencies were provided.

4.1.3. Partial Search, Exhaustive Search

Much of the research above addresses partial keyword search. Daswani, et al. highlighted the open problem of efficient, comprehensive keyword search [25]. How can exhaustive searches be achieved without flooding queries to every peer in the network? Terpstra, Behnel et al. couched the keyword search problem in rendezvous terms: dynamic keyword queries need to 'meet' with static document lists [362]. Their Bitzipper scheme is partitioned by document. They improved on full flooding by putting document metadata on 2sqrt(n) nodes and forwarding queries through only 6sqrt(n) nodes. They reported that Bitzipper nodes need only 1/166th of the bandwidth of full-flooding Gnutella nodes for an exhaustive search. An initial comparison of query load was given. There was little consideration of either static or dynamic resilience; that is, of nodes failing, of documents continually changing, or of nodes continually joining and leaving the network.

4.2. Information Retrieval

The field of Information Retrieval (IR) has matured considerably since its inception in the 1950s [363]. A taxonomy for IR models has been formalized [262]. It consists of four elements: a representation of documents in a collection; a representation of user queries; a framework describing relationships between document representations and queries; and a ranking function that quantifies an ordering amongst documents for a particular query. Three main issues motivate current IR research -- information relevance, query response time, and user interaction with IR systems. The dominant IR trends for searching large text collections are also threefold [262]. The size of collections is increasing dramatically. More complicated search mechanisms are being found to exploit document structure, to accommodate heterogeneous document collections, and to deal with document errors. Compression is in favour -- it may be quicker to search compact text or retrieve it from external devices. In a distributed IR system, query processing has four parts. Firstly, particular collections are targeted for the search. Secondly, queries are sent to the targeted collections. Queries are then evaluated at the individual collections. Finally, results from the collections are collated.
Top   ToC   RFC4981 - Page 40
   So how do P2P networks differ from distributed IR systems?  Bawa,
   Manku, et al. presented four differences [62].  They suggested that a
   P2P network is typically larger, with tens or hundreds of thousands
   of nodes.  It is usually more dynamic, with node lifetimes measured
   in hours.  They suggested that a P2P network is usually homogeneous,
   with a common resource description language.  It lacks the
   centralized "mediators" found in many IR systems that assume
   responsibility for selecting collections, for rewriting queries, and
   for merging ranked results.  These distinctions are generally aligned
   with the peer characteristics in Section 1.  One might add that P2P
   nodes display more symmetry -- peers are often both information
   consumers and producers.  Daswani, Garcia-Molina, et al. pointed out
   that, while there are IR techniques for ranked keyword search at
   moderate scale, research is required so that ranking mechanisms are
   efficient at the larger scale targeted by P2P designs [25].  Joseph
   and Hoshiai surveyed several P2P systems using metadata techniques
   from the IR toolkit [60].  They described an assortment of IR
   techniques and P2P systems, including various metadata formats,
   retrieval models, bloom filters, DHTs, and trust issues.

   In the ensuing paragraphs, we survey P2P work that has incorporated
   information retrieval models, particularly the Vector Model and the
   Latent Semantic Indexing Model.  We omit the P2P work based on
   Bayesian models.  Some have pointed to such work [60], but made no
   explicit mention of the model [364].  One early paper on P2P
   content-based image retrieval also leveraged the Bayesian model
   [365].  For the former two models, we briefly describe the design,
   then try to highlight robustness aspects.  On robustness, we are
   again stymied for lack of prior work.  Indeed, a search across all
   proceedings of the Annual ACM Conference on Research and Development
   in Information Retrieval for the words "reliable", "available",
   "dependable", or "adaptable" did not return any results at the time
   of writing.  In contrast, a standard text on distributed database
   management systems [366] contains a whole chapter on reliability.  IR
   research concentrates on performance measures.  Common performance
   measures include recall, the fraction of the relevant documents that
   has been retrieved and precision, the fraction of the retrieved
   documents that is relevant [262].  Ideally, an IR system would have
   high recall and high precision.  Unfortunately techniques favouring
   one often disadvantage the other [363].
Top   ToC   RFC4981 - Page 41

4.2.1. Vector Model (PlanetP, FASD, eSearch)

The vector model [367] represents both documents and queries as term vectors, where a term could be a word or a phrase. If a document or query has a term, the weight of the corresponding dimension of the vector is non-zero. The similarity of the document and query vectors gives an indication of how well a document matches a particular query. The weighting calculation is critical across the retrieval models. Amongst the numerous proposals for the probabilistic and vector models, there are some commonly recurring weighting factors [363]. One is term frequency. The more a term is repeated in a document, the more important the term is. Another is inverse document frequency. Terms common to many documents give less information about the content of a document. Then there is document length. Larger documents can bias term frequencies, so weightings are sometimes normalized against document length. The expression "TFIDF weighting" refers to the collection of weighting calculations that incorporate term frequency and inverse document frequency, not just to one. Two weighting calculations have been particularly dominant -- Okapi [368] and pivoted normalization [369]. A distributed version of Google's Pagerank algorithm has also been devised for a P2P environment [370]. It allows incremental, ongoing Pagerank calculations while documents are inserted and deleted. A couple of early P2P systems leveraged the vector model. Building on the vector model, PlanetP divided the ranking problem into two steps [215]. In the first, peers are ranked for the probability that they have matching documents. In the second, higher-priority peers are contacted and the matching documents are ranked. An Inverse Peer Frequency, analogous to the Inverse Document Frequency, is used to rank relevant peers. To further constrain the query traffic, PlanetP contacts only the first group of m peers to retrieve a relevant set of documents. In this way, it repeatedly contacts groups of m peers until the top k document rankings are stable. While the PlanetP designers first quantified recall and precision, they also considered reliability. Each PlanetP peer has a global index with a list of all other peers, their IP addresses, and their Bloom filters. This large volume of shared information needs to be maintained. Klampanos and Jose saw this as PlanetP's primary shortcoming [371]. Each Bloom filter summarized the set of terms in the local index of each peer. The time to propagate changes, be they new documents or peer arrivals/departures, was studied by simulation for up to 1000 peers. The reported propagation times were in the hundreds of seconds. Design workarounds were required for PlanetP to be viable across slower dial-up modem connections. For future work, the authors were
Top   ToC   RFC4981 - Page 42
   considering some sort of hierarchy to scale to larger numbers of
   peers.

   A second early system using the vector model is the Fault-tolerant,
   Adaptive, Scalable Distributed (FASD) search engine [283], which
   extended the Freenet design (Section 2.3) for richer queries.  The
   original Freenet design could find a document based on a globally
   unique identifier.  Kronfol's design added the ability to search, for
   example, for documents about "apples AND oranges NOT bananas".  It
   uses a TFIDF weighting scheme to build a document's term vector.
   Each peer calculates the similarity of the query vector and local
   documents and forwards the query to the best downstream peer.  Once
   the best downstream peer returns a result, the second-best peer is
   tried, and so on.  Simulations with 1000 nodes gave an indication of
   the query path lengths in various situations -- when routing queries
   in a network with constant rates of node and document insertion, when
   bootstrapping the network in a "worst-case" ring topology, or when
   failing randomly and specifically selected peers.  Kronfol claimed
   excellent average-case performance -- less than 20 hops to retrieve
   the same top n results as a centralized search engine.  There were,
   however, numerous cases where the worst-case path length was several
   hundred hops in a network of only 1000 nodes.

   In parallel, there have been some P2P designs based on the vector
   model from the University of Rochester -- pSearch [9, 372] and
   eSearch [373].  The early pSearch paper suggested a couple of
   retrieval models, one of which was the Vector Space Model, to search
   only the nodes likely to have matching documents.  To obtain
   approximate global statistics for the TFIDF calculation, a spanning
   tree was constructed across a subset of the peers.  For the m top
   terms, the term-to-document index was inserted into a Content-
   Addressable Network [334].  A variant that mapped terms to document
   clusters was also suggested. eSearch is a hybrid of the partition-
   by-document and partition-by-term approaches (Section 4.1.2) eSearch
   nodes are primarily partitioned by term.  Each is responsible for the
   inverted lists for some top terms.  For each document in the inverted
   list, the node stores the complete term list.  To reduce the size of
   the index, the complete term lists for a document are only kept on
   nodes that are responsible for top terms in the document.  eSearch
   uses the Okapi term weighting to select top terms.  It relies on the
   Chord DHT [34] to associate terms with nodes storing the inverted
   lists.  It also uses automatic query expansion.  This takes the
   significant terms from the top document matches and automatically
   adds them to the user's query to find additional relevant documents.
   The eSearch performance was quantified in terms of search precision,
   the number of retrieved documents, and various load-balancing
   metrics.  Compared to the more common proposals for partitioning by
Top   ToC   RFC4981 - Page 43
   keywords, eSearch consumed 6.8 times the storage space to achieve
   faster search times.

4.2.2. Latent Semantic Indexing (pSearch)

Another retrieval model used in P2P proposals is Latent Semantic Indexing (LSI) [374]. Its key idea is to map both the document and query vectors to a concept space with lower dimensions. The starting point is a t*N weighting matrix, where t is the total number of indexed terms, N is the total number of documents, and the matrix elements could be TFIDF rankings. Using singular value decomposition, this matrix is reduced to a smaller number of dimensions, while retaining the more significant term-to-document mappings. Baeza-Yates and Ribeiro-Neto suggested that LSI's value is a novel theoretic framework, but that its practical performance advantage for real document collections had yet to be proven [262]. pSearch incorporated LSI [9]. By placing the indices for semantically similar documents close in the network, Tang, Xu, et al. touted significant bandwidth savings relative to the early full- flooding variant of Gnutella [372]. They plotted the number of nodes visited by a query. They also explored the trade-off with accuracy, the percentage match between the documents returned by the distributed pSearch algorithm and those from a centralized LSI baseline. In a more recent update to the pSearch work, Tang, Dwarkadas, et al. summarized LSI's shortcomings [375]. Firstly, for large document collections, its retrieval quality is inherently inferior to Okapi. Secondly, singular value decomposition consumes excessive memory and computation time. Consequently, the authors used Okapi for searching while retaining LSI for indexing. With Okapi, they selected the next node to be searched and selected documents on searched nodes. With LSI, they ensured that similar documents are clustered near each other, thereby optimizing the network search costs. When retrieving a small number of top documents, the precision of LSI+Okapi approached that of Okapi. However, if retrieving a large number of documents, the LSI+Okapi precision is inferior. The authors want to improve this in future work.

4.2.3. Small Worlds

The "small world" concept originally described how people are interconnected by short chains of acquaintances [376]. Kleinberg was struck by the algorithmic lesson of the small world, namely "that individuals using local information are collectively very effective at constructing short paths between two points in a social network" [377]. Small world networks have a small diameter and a large clustering coefficient (a large number of connections amongst relevant nodes) [378].
Top   ToC   RFC4981 - Page 44
   The small world idea has had a limited impact on peer-to-peer
   algorithms.  It has influenced only a few unstructured [62, 378-380]
   and structured [344, 381] algorithms.  The most promising work on
   "small worlds" in P2P networks are those concerned with the
   information retrieval metrics, precision and recall [62, 378, 380].

5. Queries

Database research suggests directions for P2P research. Hellerstein observed that, while work on fast P2P indexes is well underway, P2P query optimization remains a promising topic for future research [23]. Kossman reviewed the state of the art of distributed query processing, highlighting areas for future research: simulation and query optimization for networks of tens of thousands of servers and millions of clients; non-relational data types (e.g., XML, text, and images); and partial query responses since on the Internet, "failure is the rule rather than the exception" [19]. A primary motivation for the P2P system, PIER, was to scale from the largest database systems of a few hundred nodes to an Internet environment in which there are over 160 million nodes [22]. Litwin and Sahri have also considered ways to combine distributed hashing, more specifically the Scalable Distributed Data Structures, with SQL databases, claiming to be first to implement scalable distributed database partitioning [382]. Motivated by the lack of transparent distribution in current distributed databases, they measure query execution times for Microsoft SQL servers aggregated by means of an SDDS layer. One of their starting assumptions was that it is too challenging to change the SQL query optimizer. Database research also suggests the approach to P2P research. Researchers of database query optimization were divided between those looking for optimal solutions in special cases and those using heuristics to answer all queries [383]. Gribble, et al. cast query optimization in terms of the data placement problem, which is to "distribute data and work so the full query workload is answered with lowest cost under the existing bandwidth and resource constraints" [250]. They pointed out that even the static version of this problem is NP-complete in P2P networks. Consequently, research on massive, dynamic P2P networks will likely progress using both strategies of early database research - heuristics and special-case optimizations. If P2P networks are going to be adaptable, if they are to support a wide range of applications, then they need to accommodate many query types [72]. Up to this point, we have reviewed queries for keys (Section 3) and keywords (Sections 4.1. and 4.2). Unfortunately, a major shortcoming of the DHTs in Section 3.5 is that they primarily support exact-match, single-key queries. Skip Graphs support range and prefix queries, but not aggregation queries. Here we probe below
Top   ToC   RFC4981 - Page 45
   the language syntax to identify the open research issues associated
   with more expressive P2P queries [25].  Triantafillou and Pitoura
   observed the disparate P2P designs for different types of queries and
   so outlined a unifying framework [76].  To classify queries, they
   considered the number of relations (single or multiple), the number
   of attributes (single or multiple), and the type of query operator.
   They described numerous operators:  equality, range, join, and
   "special functions".  The latter referred to aggregation (like sum,
   count, average, minimum, and maximum), grouping and ordering.  The
   following sections approximately fit their taxonomy -- range queries,
   multi-attribute queries, join queries and aggregation queries.  There
   has been some initial P2P work on other query types -- continuous
   queries [20, 22, 73], recursive queries [22, 74], and adaptive
   queries [23, 75].  For these, we defer to the primary references.

5.1. Range Queries

The support of efficient range predicates in P2P networks was identified as an important open research issue by Huebsch, et al. [22]. Range partitioning has been important in parallel databases to improve performance, so that a transaction commonly needs data from only one disk or node [22]. One type of range search, longest prefix match, is important because of its prevalence in routing schemes for voice and data networks alike. In other applications, users may pose broad, inexact queries, even though they require only a small number of responses. Consequently, techniques to locate similar ranges are also important [77]. Various proposals for range searches over P2P networks are summarized in Figure 4. Since the Scalable Distributed Data Structure (SDDS) has been an important influence on contemporary Distributed Hash Tables (DHTs) [49-51], we also include ongoing work on SDDS range searches. PEER-TO-PEER (P2P) Locality Sensitive Hashing (Chord) [77] Prefix Hash Trees (unspecified DHT) [78, 79] Space Filling Curves (CAN) [80] Space Filling Curves (Chord) [81] Quadtrees (Chord) [82] Skip Graphs [38, 41, 83, 100] Mercury [84] P-Grid [85, 86] SCALABLE DISTRIBUTED DATA STRUCTURES (SDDS) RP* [87, 88] Figure 4: Solutions for Range Queries on P2P and SDDS Indexes
Top   ToC   RFC4981 - Page 46
   The papers on P2P range search can be divided into those that rely on
   an underlying DHT (the first five entries in Figure 4) and those that
   do not (the subsequent three entries).  Bharambe, Agrawal, et al.
   argued that DHTs are inherently ill-suited to range queries [84].
   The very feature that makes for their good load balancing properties,
   randomized hash functions, works against range queries.  One possible
   solution would be to hash ranges, but this can require a priori
   partitioning.  If the partitions are too large, partitions risk
   overload.  If they are too small, there may be too many hops.

   Despite these potential shortcomings, there have been several range
   query proposals based on DHTs.  If hashing ranges to nodes, it is
   entirely possible that overlapping ranges map to different nodes.
   Gupta, Agrawal, et al. rely on locality sensitive hashing to ensure
   that, with high probability, similar ranges are mapped to the same
   node [77].  They propose one particular family of locality sensitive
   hash functions, called min-wise independent permutations.  The number
   of partitions per node and the path length were plotted against the
   total numbers of peers in the system.  For a network with 1000 nodes,
   the hop count distribution was very similar to that of the exact-
   matching Chord scheme.  Was it load-balanced?  For the same network
   with 50,000 partitions, there were over two orders of magnitude
   variation in the number of partitions at each node (first and
   ninety-ninth percentiles).  The Prefix Hash Tree is a trie in which
   prefixes are hashed onto any DHT.  The preliminary analysis suggests
   efficient doubly logarithmic lookup, balanced load, and fault
   resilience [78, 79].  Andrzejak and Xu were perhaps the first to
   propose a mapping from ranges to DHTs [80].  They use one particular
   Space Filling Curve, the Hilbert curve, over a Content Addressable
   Network (CAN) construction (Section 3.5.3).  They maintain two
   properties: nearby ranges map to nearby CAN zones; if a range is
   split into two sub-ranges, then the zones of the sub-ranges partition
   the zone of the primary range.  They plot path length and load proxy
   measures (the total number of messages and nodes visited) for three
   algorithms to propagate range queries: brute force, controlled
   flooding, and directed controlled flooding.  Schmidt and Parashar
   also advocated Space Filling Curves to achieve range queries over a
   DHT [81].  However, they point out that, while Andrzejak and Xu use
   an inverse Space Filling Curve to map a one-dimensional space to d-
   dimensional zones, they map a d-dimensional space back to a one-
   dimensional index.  Such a construction gives the ability to search
   across multiple attributes (Section 5.2).  Tanin, Harwood, et al.
   suggested quadtrees over Chord [82], and gave preliminary simulation
   results for query response times.

   Because DHTs are naturally constrained to exact-match, single-key
   queries, researchers have considered other P2P indexes for range
   searches.  Several were based on Skip Graphs [38, 41], which, unlike
Top   ToC   RFC4981 - Page 47
   the DHTs, do not necessitate randomizing hash functions and are
   therefore capable of range searches.  Unfortunately, they are not
   load balanced [83].  For example, in SkipNet [48], hashing was added
   to balance the load -- the Skip Graph could support range searches or
   load balancing, but not both.  One solution for load-balancing relies
   on an increased number of 'virtual' servers [168] but, in their
   search for a system that can both search for ranges and balance
   loads, Bharambe, Agrawal, et al. rejected the idea [84].  The virtual
   servers work assumed load imbalance stems from hashing; that is, by
   skewed data insertions and deletions.  In some situations, the
   imbalance is triggered by a skewed query load.  In such
   circumstances, additional virtual servers can increase the number of
   routing hops and increase the number of pointers that a Skip Graph
   needs to maintain.  Ganesan, Bawa, et al. devised an alternate method
   to balance load [83].  They proposed two Skip Graphs, one to index
   the data itself and the other to track load at each node in the
   system.  Each node is able to determine the load on its neighbours
   and the most (least) loaded nodes in the system.  They devise two
   algorithms: NBRADJUST balances load on neighbouring nodes; using
   REORDER, empty nodes can take over some of the tuples on heavily
   loaded nodes.  Their simulations focus on skewed storage load, rather
   than on skewed query loads, but they surmise that the same approach
   could be used for the latter.

   Other proposals for range queries avoid both the DHT and the Skip
   Graph.  Bharambe, Agrawal, et al. distinguish their Mercury design by
   its support for multi-attribute range queries and its explicit load
   balancing [84].  In Mercury, nodes are grouped into routing hubs,
   each of which is responsible for various query attributes.  While it
   does not use hashing, Mercury is loosely similar to the DHT
   approaches: nodes within hubs are arranged into rings, like Chord
   [34]; for efficient routing within hubs, k long-distance links are
   used, like Symphony [381].  Range lookups require O(((log n)^2)/k)
   hops.  Random sampling is used to estimate the average load on nodes
   and to find the parts of the overlay that are lightly loaded.
   Whereas Symphony assumed that nodes are responsible for ranges of
   approximately equal size, Mercury's random sampling can determine the
   location of the start of the range, even for non-uniform ranges [84].
   P-Grid [42] does provide for range queries, by virtue of the key
   ordering in its tree structures.  Ganesan, Bawa, et al. critiqued its
   capabilities [83]: P-Grid assumes fixed-capacity nodes; there was no
   formal characterization of imbalance ratios or balancing costs; every
   P-Grid periodically contacts other nodes for load information.

   The work on Scalable Distributed Data Structures (SDDSs) has
   progressed in parallel with P2P work and has addressed range queries.
   Like the DHTs above, the early SDDS Linear Hashing (LH*) schemes were
   not order-preserving [52].  To facilitate range queries, Litwin,
Top   ToC   RFC4981 - Page 48
   Niemat, et al. devised a Range Parititioning variant, RP* [87].
   There are options to dispense with the index, to add indexes to
   clients, and to add them to servers.  In the variant without an
   index, every query is issued via multicasting.  The other variants
   also use some multicasting.  The initial RP* paper suggested
   scalability to thousands of sites, but a more recent RP* simulation
   was capped at 140 servers [88].  In that work, Tsangou, Ndiaye, et
   al. investigated TCP and UDP mechanisms by which servers could return
   range query results to clients.  The primary metrics were search and
   response times.  Amongst the commercial parallel database management
   systems, they reported that the largest seems only to scale to 32
   servers (SQL Server 2000).  For future work, they planned to explore
   aggregation of query results, rather than establishing a connection
   between the client and every single server with a response.

   All in all, it seems there are numerous open research questions on
   P2P range queries.  How realistic is the maintenance of global load
   statistics considering the scale and dynamism of P2P networks?
   Simulations at larger scales are required.  Proposals should take
   into account both the storage load (insert and delete messages) and
   the query load (lookup messages).  Simplifying assumptions need to be
   attacked.  For example, how well do the above solutions work in
   networks with heterogeneous nodes, where the maximum message loads
   and index sizes are node-dependent?

5.2. Multi-Attribute Queries

There has been some work on multi-attribute P2P queries. As late as September 2003, it was suggested that there has not been an efficient solution [76]. Again, an early significant work on multi-attribute queries over aggregated commodity nodes germinated amongst SDDSs. k-RP* [89] uses the multi-dimensional binary search tree (or k-d tree, where k indicates the number of dimensions of the search index) [384]. It builds on the RP* work from the previous section and inherits their capabilities for range search and partial match. Like the other SDDSs, k-RP* indexes can fit into RAM for very fast lookup. For future work, Litwin and Neimat suggested a) a formal analysis of the range search termination algorithm and the k-d paging algorithm, b) a comparison with other multi-attribute data structures (quad-trees and R-trees) and c) exploration of query processing, concurrency control, and transaction management for k-RP* files [89]. On the latter point, others have considered transactions to be inconsequential to the core problem of supporting more complex queries in P2P networks [72].
Top   ToC   RFC4981 - Page 49
   In architecting their secure wide-area Service Discovery Service
   (SDS), Hodes, Czerwinski, et al. considered three possible designs
   for multi-criteria search -- Centralization, Mapping and Flooding
   [90].  These correlate to the index classifications of Section 2 --
   Central, Distributed, and Local.  They discounted the centralized,
   Napster-like index for its risk of a single point of failure.  They
   considered the hash-based mappings of Section 3, but concluded that
   it would not be possible to adequately partition data.  A document
   satisfying many criteria would be wastefully stored in many
   partitions.  They rejected full flooding for its lack of scalability.
   Instead, they devised a query filtering technique, reminiscent of
   Gnutella's query routing protocol (Section 4.1).  Nodes push
   proactive summaries of their data rather than waiting for a query.
   Summaries are aggregated and stored throughout a server hierarchy, to
   guide subsequent queries.  Some initial prototype measurements were
   provided for total load on the system, but not for load distribution.
   They put several issues forward for future work.  The indexing needs
   to be flexible to change according to query and storage workloads.  A
   mesh topology might improve on their hierarchic topology since query
   misses would not propagate to root servers.  The choice is analogous
   to BGP meshes and DNS trees.

   More recently, Cai, Frank, et al. devised the Multi-Attribute
   Addressable Network (MAAN) [91].  They built on Chord to provide both
   multi-attribute and range queries, claiming to be the first to
   service both query types in a structured P2P system.  Each MAAN node
   has O(log n) neighbours, where N is the number of nodes.  MAAN
   multi-attribute range queries require O(log n+N*Smin) hops, where
   Smin is the minimum range selectivity across all attributes.
   Selectivity is the ratio of the query range to the entire identifier
   range.  The paper assumed that a locality preserving hash function
   would ensure balanced load.  Per Section 5.1, the arguments by
   Bharambe, Agrawal, et al. have highlighted the shortcomings of this
   assumption [84].  MAAN required that the schema must be fixed and
   known in advance -- adaptable schemas were recommended for subsequent
   attention.  The authors also acknowledged that there is a selectivity
   breakpoint at which full flooding becomes more efficient than their
   scheme.  This begs for a query resolution algorithm that adapts to
   the profile of queries.  Cai and Frank followed up with RDFPeers
   [55].  They differentiate their work from other RDF proposals by a)
   guaranteeing to find query results if they exist and b) removing the
   requirement of prior definition of a fixed schema.  They hashed
   <subject, predicate, object> triples onto the MAAN and reported
   routing hop metrics for their implementation.  Load imbalance across
   nodes was reduced to less than one order of magnitude, but the
   specific measure was the number of triples stored per node - skewed
   query loads were not considered.  They plan to improve load balancing
   with the virtual servers of Section 5.1 [168].
Top   ToC   RFC4981 - Page 50

5.3. Join Queries

Two research teams have done some initial work on P2P join operations. Harren, Hellerstein, et al. initially described a three-layer architecture -- storage, DHT and query processing. They implemented the join operation by modifying an existing Content Addressable Network (CAN) simulator, reporting "significant hot-spots in all dimensions: storage, processing, and routing" [72]. They progressed their design more recently in the context of PIER, a distributed query engine based on CAN [22, 385]. They implemented two equi-join algorithms. In their design, a key is constructed from the "namespace" and the "resource ID". There is a namespace for each relation and the resource ID is the primary key for base tuples in that relation. Queries are multicast to all nodes in the two namespaces (relations) to be joined. Their first algorithm is a DHT version of the symmetric hash join. Each node in the two namespaces finds the relevant tuples and hashes them to a new query namespace. The resource ID in the new namespace is the concatenation of join attributes. In the second algorithm, called "fetch matches", one of the relations is already hashed on the join attributes. Each node in the second namespace finds tuples matching the query and retrieves the corresponding tuples from the first relation. They leveraged two other techniques, namely the symmetric semi-join rewrite and the Bloom filter rewrite, to reduce the high bandwidth overheads of the symmetric hash join. For an overlay of 10,000 nodes, they simulated the delay to retrieve tuples and the aggregate network bandwidth for these four schemes. The initial prototype was on a cluster of 64 PCs, but it has more recently been expanded to PlanetLab. Triantafillou and Pitoura considered multicasting to large numbers of peers to be inefficient [76]. They therefore allocated a limited number of special peers, called range guards. The domain of the join attributes was divided, one partition per range guard. Join queries were sent only to range guards, where the query was executed. Efficient selection of range guards and a quantitive evaluation of their proposal were left for future work.

5.4. Aggregation Queries

Aggregation queries invariable rely on tree-structures to combine results from a large number of nodes. Examples of aggregation queries are Count, Sum, Maximum, Minimum, Average, Median, and Top-K [92, 386, 387]. Figure 5 summarizes the tree and query characteristics that affect dependability.
Top   ToC   RFC4981 - Page 51
   Tree type: Doesn't use DHT [92], use internal DHT trees [95], use
      independent trees on top of DHTs
   Tree repair: Periodic [93], exceptional [32]
   Tree count: One per key, one per overlay [56]
   Tree flexibility: Static [92], dynamic

   Query interface: install, update, probe [98]
   Query distribution: multicast [98], gossip [92]
   Query applications: leader election, voting, resource location,
      object placement and error recovery [98, 388]
   Query semantics
      Consistency: Best-effort, eventual [92], snapshot / interval /
         single-site validity [99]
      Timeliness [388]
      Lifetime: Continuous [97, 99], single-shot
      No. attributes: Single, multiple
   Query types: Count, sum, maximum, minimum, average, median, top k
      [92, 386, 387]

          Figure 5: Aggregation Trees and Queries in P2P Networks

   Key: Astrolabe [92]; Cone [93]; Distributed Approximative System
   Information Service (DASIS) [95]; Scalable Distributed Information

   Management System (SDIMS) [98]; Self-Organized Metadata Overlay
   (SOMO) [56]; Wildfire [99]; Willow [32]; Newscast [97]

   The fundamental design choices for aggregation trees relate to how
   the overlay uses DHTs, how it repairs itself when there are failures,
   how many aggregation trees there are, and whether the tree is static
   or dynamic (Figure 5).  Astrolabe is one of the most influential P2P
   designs included in Figure 5, yet it makes no use of DHTs [92].
   Other designs make use of the internal trees of Plaxton-like DHTs.
   Others build independent tree structures on top of DHTs.  Most of the
   designs repair the aggregation tree with periodic mechanisms similar
   to those used in the DHTs themselves.  Willow is an exception [32].
   It uses a Tree Maintenance Protocol to "zip" disjoint aggregation
   trees together when there are major failures.  Yalagandula and Dahlin
   found reconfigurations at the aggregation layer to be costly,
   suggesting more research on techniques to reduce the cost and
   frequency of such reconfigurations [98].  Many of the designs use
   multiple aggregation trees, each rooted at the DHT node responsible
   for the aggregation attribute.  On the other hand, the Self-Organized
   Metadata Overlay [56] uses a single tree and is vulnerable to a
   single point of failure at its root.
Top   ToC   RFC4981 - Page 52
   At the time of writing, researchers have just begun exploring the
   performance of queries in the presence of churn.  Most designs are
   for best-effort queries.  Bawa, et al. devised a better consistency
   model, called Single-Site Validity [99] to qualify the accuracy of
   results when there is churn.  Its price was a five-fold increase in
   the message load, when compared to an efficient but best-effort
   Spanning Tree.  Gossip mechanisms are resilient to churn, but they
   delay aggregation results and incur high message cost for aggregation
   attributes with small read-to-write ratios.

6. Security Considerations

An initial list of references to research on P2P security is given in Figure 1, Section 1. This document addresses P2P search. P2P storage, security, and applications are recommended for further investigation in Section 8.

7. Conclusions

Research on peer-to-peer networks can be divided into four categories -- search, storage, security and applications. This critical survey has focused on search methods. While P2P networks have been classified by the existence of an index (structured or unstructured) or the location of the index (local, centralized, and distributed), this survey has shown that most have evolved to have some structure, whether it is indexes at superpeers or indexes defined by DHT algorithms. As for location, the distributed index is most common. The survey has characterized indexes as semantic and semantic-free. It has also critiqued P2P work on major query types. While much of it addresses work from 2000 or later, we have traced important building blocks from the 1990s. The initial motivation in this survey was to answer the question, "How robust are P2P search networks?" The question is key to the deployment of P2P technology. Balakrishnan, Kaashoek, et al. argued that the P2P architecture is appealing: the startup and growth barriers are low; they can aggregate enormous storage and processing resources; "the decentralized and distributed nature of P2P systems gives them the potential to be robust to faults or intentional attacks" [18]. If P2P is to be a disruptive technology in applications other than casual file sharing, then robustness needs to be practically verified [20]. The best comparative research on P2P dependability has been done in the context of Distributed Hash Tables (DHTs) [291]. The entire body of DHT research can be distilled to four main observations about dependability (Section 3.2). Firstly, static dependability comparisons show that no O(log n) DHT geometry is significantly more
Top   ToC   RFC4981 - Page 53
   dependable than the other O(log n) geometries.  Secondly, dynamic
   dependability comparisons show that DHT dependability is sensitive to
   the underlying topology maintenance algorithms (Figure 2).  Thirdly,
   most DHTs use O(log n) geometries to suit ephemeral nodes, whereas
   the O(1) hop DHTs suit stable nodes - they deserve more research
   attention.  Fourthly, although not yet a mature science, the study of
   DHT dependability is helped by recent simulation tools that support
   multiple DHTs [299].

   We make the following four suggestions for future P2P research:

   1) Complete the companion P2P surveys for storage, security, and
      applications.  A rough outline has been suggested in Figure 1,
      along with references.  The need for such surveys was highlighted
      within the peer-to-peer research group of the Internet Research
      Task Force (IRTF) [17].

   2) P2P indexes are maturing.  P2P queries are embryonic.  Work on
      more expressive queries over P2P indexes started to gain momentum
      in 2003, but remains fraught with efficiency and load issues.

   3) Isolate the low-level mechanisms affecting robustness.  There is
      limited value in comparing robustness of DHT geometries (like
      rings versus de Bruijn graphs), when robustness is highly
      sensitive to underlying topology maintenance algorithms (Figure
      2).

   4) Build consensus on robustness metrics and their acceptable ranges.
      This paper has teased out numerous measures that impinge on
      robustness, for example, the median query path length for a
      failure of x% of nodes, bisection width, path overlap, the number
      of alternatives available for the next hop, lookup latency,
      average live bandwidth (bytes/node/sec), successful routing rates,
      the number of timeouts (caused by a finger pointing to a departed
      node), lookup failure rates (caused by nodes that temporarily
      point to the wrong successor during churn), and clustering
      measures (edge expansion and node expansion).  Application-level
      robustness metrics need to drive a consistent assessment of the
      underlying search mechanics.

8. Acknowledgments

This document was adapted from a paper in Elsevier's Computer Networks: J. Risson & T. Moors, Survey of Research towards Robust Peer-to- Peer Networks: Search Methods, Computer Networks 51(7)2007.
Top   ToC   RFC4981 - Page 54
   We thank Bill Yeager, Ali Ghodsi, and several anonymous reviewers for
   thorough comments that significantly improved the quality of earlier
   versions of this document.



(page 54 continued on part 4)

Next Section