RFC 4981

Survey of Research towards Robust Peer-to-Peer Networks: Search Methods

Pages: 91
Informational

Part 3 of 5 – Pages 33 to 54

RFC4981 - Page 33 prevText

4.  Semantic Index

   Semantic indexes capture object relationships.  While the semantic-
   free methods (DHTs) have firmer theoretic foundations and guarantee
   that a key can be found if it exists, they do not capture the
   relationships between the document name and its content or metadata
   on their own.  Semantic P2P designs do.  However, since their design
   is often driven by heuristics, they may not guarantee that scarce
   items will be found.

   So what might the semantically indexed P2Ps add to an already crowded
   field of distributed information architectures?  At one extreme,
   there are the distributed relational database management systems
   (RDBMSs), with their strong consistency guarantees [284].  They
   provide strong data independence, the flexibility of SQL queries, and
   strong transactional semantics -- Atomicity, Consistency, Isolation
   and Durability (ACID) [349].  They guarantee that the query response
   is complete -- all matching results are returned.  The price is
   performance.  They scale to perhaps 1000 nodes, as evidenced in
   Mariposa [350, 351], or require query caching front ends to constrain
   the load [284].  Database research has "arguably been cornered into
   traditional, high-end, transactional applications" [72].  Then there
   are distributed file systems, like the Network File System (NFS) or
   the Serverless Network File Systems (xFS), with little data
   independence, low-level file retrieval interfaces, and varied
   consistency [284].  Today's eclectic mix of Content Distribution
   Networks (CDNs) generally deload primary servers by redirecting Web
   requests to a nearby replica.  Some intercept the HTTP requests at
   the DNS level and then use consistent hashing to find a replica [23].
   Since this same consistent hashing was a forerunner to the DHT

RFC4981 - Page 34

   approaches above, CDNs are generally constrained to the same simple
   key lookups.

   The opportunity for semantically indexed P2Ps, then, is to provide:

   a) graduated data independence, consistency, and query flexibility,
      and

   b) probabilistically complete query responses, across

   c) very large numbers of low-cost, geographically distributed,
      dynamic nodes.

4.1.  Keyword Lookup

   P2P keyword lookup is best understood by considering the structure of
   the underlying index and the algorithms by which queries are routed
   over that index.  Figure 3 summarizes the following paragraphs by
   classifying the keyword query algorithms, index structures, and
   metrics.  The research has largely focused on scalability, not
   dependability.  There have been very few studies that quantify the
   impact of network churn.  One exception is the work by Chawathe, et
   al. on the Gia system [61].  Gia's combination of algorithms from
   Figure 3 (receiver-based flow control, biased random walk, and one-
   hop replication) gave 2-4 orders of magnitude improvement in query
   success rates in churning networks.

RFC4981 - Page 35

   QUERY
   Query routing
     Flooding: Peers only index local files so queries must propagate
       widely [4]
     Policy-based: Choice of the next hop node: random; most/least
       recently used; most files shared; most results [265, 352]
     Random walks: Parallel [67] or biased random walks [61, 66]
   Query forwarding
     Iterative: Nodes perform iterative unicast searches of ultrapeers,
       until the desired number of results is achieved.  See Gnutella
       UDP Extension for Scalable Searches (GUESS) [265, 353]
     Recursive
   Query flow control
     Receiver-controlled: Receivers grant query tokens to senders, so
       as to avoid overload [61]
     Reactive: sender throttles queries when it notices receivers are
       discarding packets [61, 66]
     Dynamic Time To Live: In the Dynamic Query Protocol, the sender
       adjusts the time-to-live on each iteration based on the number
       of results received, the number of connections left, and the
       number of nodes already theoretically reached by the search [354]

   INDEX
   Distribution
     Compression: Leaf nodes periodically send ultrapeers compressed
       query routing tables, as in the Query Routing Protocol [260]
     One hop replication: Nodes maintain an index of content on their
       nearest neighbors [61, 352]
   Partitioning
     By document [210]
     By keyword: Use an inverted list to find a matching document,
       either locally or at another peer [21].  Partition by keyword
       sets [355]
     By document and keyword: Also called Multi-Level Partitioning [21]

   METRIC
   Query load: Queries per second per node/link [65, 265]
   Degree: The number of links per node [66, 352].  Early P2P networks
     approximated power-law networks, where the number of nodes with L
     links is proportional to L^(-k), where k is a constant [65]
   Query delay: Reported in terms of time and hop count [61, 66]
   Query success rate: The "Collapse Point" is the per-node query rate
     at which the query success rate drops below 90% [61].  See
     also [61, 265, 352].

                  Figure 3: Keyword Lookup in P2P Systems

RFC4981 - Page 36

4.1.1.  Gnutella Enhancements

   Perhaps the most widely referenced P2P system for simple keyword
   match is Gnutella [4].  Gnutella queries contain a string of
   keywords.  Gnutella peers answer when they have files whose names
   contain all the keywords.  As discussed in Section 2.1, early
   versions of Gnutella did not forward the document index.  Queries
   were flooded and peers searched their own local indexes for filename
   matches.  An early review highlighted numerous areas for improvement
   [65].  It was estimated that the query traffic alone from 50,000
   early-generation Gnutella nodes would amount to 1.7% of the total
   U.S. Internet backbone traffic at December 2000 levels.  It was
   speculated that high-degree Gnutella nodes would impede
   dependability.  An unnecessarily high percentage of Gnutella traffic
   crossed Autonomous System (AS) boundaries -- a locality mechanism may
   have found suitable nearby peers.

   Fortunately, there have since been numerous enhancements within the
   Gnutella Developer Forum.  At the time of writing, it has been
   reported that Gnutella has almost 350,000 unique hosts, of which
   nearly 90,000 accept incoming connections [356].  One of the main
   improvements is that an index of filename keywords, called the Query
   Routing Table (QRT), can now be forwarded from 'leaf peers' to its
   'ultrapeers' [260].  Ultrapeers can then ensure that the leaves only
   receive queries for which they have a match, dramatically reducing
   the query traffic at the leaves.  Ultrapeers can have connections to
   many leaf nodes (~10-100) and a small number of other ultrapeers
   (<10) [260].  Originally, a leaf node's QRT was not forwarded by the
   parent ultrapeer to other ultrapeers.  More recently, there has been
   a proposal to distribute aggregated QRTs amongst ultrapeers [357].
   To further limit traffic, QRTs are compressed by hashing, according
   to the Query Routing Protocol (QRP) specification [281].  This same
   specification claims QRP may reduce Gnutella traffic by orders of
   magnitude, but cautions that simulation is required before mass
   deployment.  A known shortcoming of QRP was that the extent of query
   propagation was independent of the popularity of the search terms.
   The Dynamic Query Protocol addressed this [358].  It required leaf
   nodes to send single queries to high-degree ultrapeers that adjust
   the queries' time-to-live (TTL) bounds according to the number of
   received query results.  An earlier proposal, called the Gnutella UDP
   Extension for Scalable Searches (GUESS) [353], similarly aimed to
   reduce the number of queries for widely distributed files.  GUESS
   reuses the non-forwarding idea (Section 2).  A GUESS peer repeatedly
   queries single ultrapeers with a TTL of 1, with a small timeout on
   each query to limit load.  It chooses the number of iterations and
   selects ultrapeers so as to satisfy its search needs.  For
   adaptability, a small number of experimental Gnutella nodes have

RFC4981 - Page 37

   implemented eXtensible Markup Language (XML) schemas for richer
   queries [359, 360].  None of the above Gnutella proposals explicitly
   assess robustness.

   The broader research community has recently been leveraging aspects
   of the Gnutella design.  Lv, Ratnasamy, et al. exposed one assumption
   implicit in some of the early DHT work -- that designs "such as
   Gnutella are inherently not scalable, and therefore should be
   abandoned" [66].  They argued that by making better use of the more
   powerful peers, Gnutella's scalability issues could be alleviated.
   Instead of its flooding mechanism, they used random walks.  Their
   preliminary design to bias random walks towards high capacity nodes
   did not go as far as the ultrapeer proposals in that the indexes did
   not move to the high-capacity nodes.  Chawathe, Ratnasamy, et al.
   chose to extend the Gnutella design with their Gia system, in
   response to the perceived shortcomings of DHTs in Section 1.2 [61].
   Compared to the early Gnutella designs, they incorporated several
   novel features.  They devise a topology adaptation algorithm so that
   most peers are attached to high-degree peers.  They use a random walk
   search algorithm, in lieu of flooding, and bias the query load
   towards higher-degree peers.  For 'one-hop replication', they require
   all nodes to keep pointers to content on adjacent peers.  To
   implement a receiver-controlled token-based flow control, a peer must
   have a token from its neighbouring peer before it sends a query to
   it.  Chawathe, Ratnasamy, et al. show by simulations that the
   combination of these features provides a scalability improvement of
   three to five orders of magnitude over Gnutella "while retaining
   significant robustness".  The main robustness metrics they used were
   the 'collapse point' query rate (the per-node query rate at which the
   successful query rate falls below 90%) and the average hop count
   immediately prior to collapse.  Their comparison with Gnutella did
   not take into account the Gnutella enhancements above -- this was
   left as future work.  Castro, Costa, and Rowstron argued that if
   Gnutella were built on top of a structured overlay, then both the
   query and overlay maintenance traffic could be reduced [259].  Yang,
   Vinograd, et al. explore various policies for peer selection in the
   GUESS protocol, since the issue is left open in the original proposal
   [265].  For example, the peer initiating the query could choose peers
   that have been "most recently used" or that have the "most files
   shared".  Various policy pitfalls are identified.  For example, good
   peers could be overloaded, victims of their own success.
   Alternatively, malicious peers could encourage the querying peer to
   try inactive peers.  They conclude that a "most results" policy gives
   the best balance of robustness and efficiency.  Like Castro, Costa,
   and Rowstron, they concentrated on the static network scenario.
   Cholvi, Felber, et al. very briefly describe how similar "least
   recently used" and "most often used" heuristics can be used by a peer
   to select peer 'acquaintances' [352].  They were motivated by the

RFC4981 - Page 38

   congestion associated with Gnutella's TTL-limited flooding.
   Recognizing that the busiest peers can quickly become overloaded
   central hubs for the entire network, they limit the number of
   acquaintances for any given peer to 25.  They sketch a mechanism to
   decrement a query's TTL multiple times when it traverses "interested
   peers".  In summary, these Gnutella-related investigations are
   characterized by a bias for high-degree peers and very short directed
   query paths, a disdain for flooding, and concern about excessive load
   on the 'better' peers.  Generally, the robustness analysis for
   dynamic networks (content updates and node arrivals/departures)
   remains open.

4.1.2.  Partition-by-Document, Partition-by-Keyword

   One aspect of P2P keyword search systems has received particular
   attention: should the index be partitioned by document or by keyword?
   The issue affects scalability.  To be partitioned by document, each
   node has a local index of documents for which it is responsible.
   Gnutella is a prime example.  Queries are generally flooded in
   systems partitioned by document.  On the other hand, a peer may
   assume responsibility for a set of keywords.  The peer uses an
   inverted list to find a matching document, either locally or at
   another peer.  If the query contains several keywords, inverted lists
   may need to be retrieved from several different peers to find the
   intersection [21].  The initial assessment by Li, Loo, et al. was
   that the partition-by-document approach was superior [210].  For one
   scenario of a full-text Web search, they estimated the communications
   costs to be about six times higher than the feasible budget.
   However, wanting to exploit prior work on inverted list intersection,
   they studied the partition-by-keyword strategy.  They proposed
   several optimizations that put the communication costs for a
   partition-by-keyword system within an order of magnitude of
   feasibility.  There had been a couple of prior papers that suggested
   partitioned-by-keyword designs incorporate DHTs to map keywords to
   peers [355, 361].  In Gnawali's Keyword-set Search System (KSS), the
   index is partitioned by sets of keywords [355].  Terpstra, Behnel, et
   al. point out that by keeping keyword pairs or triples, the number of
   lists per document in KSS is squared or tripled [362].  Shi,
   Guangwen, et al. interpreted the approximations of Li, Loo, et al. to
   mean that neither approach is feasible on its own [21].  Their
   Multi-Level Partitioning (MLP) scheme incorporates both partitioning
   approaches.  They arrange nodes into a group hierarchy, with all
   nodes in the single 'level 0' group, and with the same nodes sub-
   divided into k logical subgroups on 'level 1'.  The subgroups are
   again divided, level by level, until level l.  The inverted index is
   partitioned by document between groups and by keyword within groups.
   MLP avoids the query flooding normally associated with systems
   partitioned by document, since a small number of nodes in each group

RFC4981 - Page 39

   process the query.  It reduces the bandwidth overheads associated
   with inverted list intersection in systems partitioned solely by
   keyword, since groups can calculate the intersection independently
   over the documents for which they are responsible.  MLP was overlaid
   on SkipNet, per Section 3.5.6 [38].  Some initial analyses of
   communications costs and query latencies were provided.

4.1.3.  Partial Search, Exhaustive Search

   Much of the research above addresses partial keyword search.
   Daswani, et al. highlighted the open problem of efficient,
   comprehensive keyword search [25].  How can exhaustive searches be
   achieved without flooding queries to every peer in the network?
   Terpstra, Behnel et al. couched the keyword search problem in
   rendezvous terms: dynamic keyword queries need to 'meet' with static
   document lists [362].  Their Bitzipper scheme is partitioned by
   document.  They improved on full flooding by putting document
   metadata on 2sqrt(n) nodes and forwarding queries through only
   6sqrt(n) nodes.  They reported that Bitzipper nodes need only 1/166th
   of the bandwidth of full-flooding Gnutella nodes for an exhaustive
   search.  An initial comparison of query load was given.  There was
   little consideration of either static or dynamic resilience; that is,
   of nodes failing, of documents continually changing, or of nodes
   continually joining and leaving the network.

4.2.  Information Retrieval

   The field of Information Retrieval (IR) has matured considerably
   since its inception in the 1950s [363].  A taxonomy for IR models has
   been formalized [262].  It consists of four elements: a
   representation of documents in a collection; a representation of user
   queries; a framework describing relationships between document
   representations and queries; and a ranking function that quantifies
   an ordering amongst documents for a particular query.  Three main
   issues motivate current IR research -- information relevance, query
   response time, and user interaction with IR systems.  The dominant IR
   trends for searching large text collections are also threefold [262].
   The size of collections is increasing dramatically.  More complicated
   search mechanisms are being found to exploit document structure, to
   accommodate heterogeneous document collections, and to deal with
   document errors.  Compression is in favour -- it may be quicker to
   search compact text or retrieve it from external devices.  In a
   distributed IR system, query processing has four parts.  Firstly,
   particular collections are targeted for the search.  Secondly,
   queries are sent to the targeted collections.  Queries are then
   evaluated at the individual collections.  Finally, results from the
   collections are collated.

RFC4981 - Page 40

   So how do P2P networks differ from distributed IR systems?  Bawa,
   Manku, et al. presented four differences [62].  They suggested that a
   P2P network is typically larger, with tens or hundreds of thousands
   of nodes.  It is usually more dynamic, with node lifetimes measured
   in hours.  They suggested that a P2P network is usually homogeneous,
   with a common resource description language.  It lacks the
   centralized "mediators" found in many IR systems that assume
   responsibility for selecting collections, for rewriting queries, and
   for merging ranked results.  These distinctions are generally aligned
   with the peer characteristics in Section 1.  One might add that P2P
   nodes display more symmetry -- peers are often both information
   consumers and producers.  Daswani, Garcia-Molina, et al. pointed out
   that, while there are IR techniques for ranked keyword search at
   moderate scale, research is required so that ranking mechanisms are
   efficient at the larger scale targeted by P2P designs [25].  Joseph
   and Hoshiai surveyed several P2P systems using metadata techniques
   from the IR toolkit [60].  They described an assortment of IR
   techniques and P2P systems, including various metadata formats,
   retrieval models, bloom filters, DHTs, and trust issues.

   In the ensuing paragraphs, we survey P2P work that has incorporated
   information retrieval models, particularly the Vector Model and the
   Latent Semantic Indexing Model.  We omit the P2P work based on
   Bayesian models.  Some have pointed to such work [60], but made no
   explicit mention of the model [364].  One early paper on P2P
   content-based image retrieval also leveraged the Bayesian model
   [365].  For the former two models, we briefly describe the design,
   then try to highlight robustness aspects.  On robustness, we are
   again stymied for lack of prior work.  Indeed, a search across all
   proceedings of the Annual ACM Conference on Research and Development
   in Information Retrieval for the words "reliable", "available",
   "dependable", or "adaptable" did not return any results at the time
   of writing.  In contrast, a standard text on distributed database
   management systems [366] contains a whole chapter on reliability.  IR
   research concentrates on performance measures.  Common performance
   measures include recall, the fraction of the relevant documents that
   has been retrieved and precision, the fraction of the retrieved
   documents that is relevant [262].  Ideally, an IR system would have
   high recall and high precision.  Unfortunately techniques favouring
   one often disadvantage the other [363].

RFC4981 - Page 41

4.2.1.  Vector Model (PlanetP, FASD, eSearch)

   The vector model [367] represents both documents and queries as term
   vectors, where a term could be a word or a phrase.  If a document or
   query has a term, the weight of the corresponding dimension of the
   vector is non-zero.  The similarity of the document and query vectors
   gives an indication of how well a document matches a particular
   query.

   The weighting calculation is critical across the retrieval models.
   Amongst the numerous proposals for the probabilistic and vector
   models, there are some commonly recurring weighting factors [363].
   One is term frequency.  The more a term is repeated in a document,
   the more important the term is.  Another is inverse document
   frequency.  Terms common to many documents give less information
   about the content of a document.  Then there is document length.
   Larger documents can bias term frequencies, so weightings are
   sometimes normalized against document length.  The expression "TFIDF
   weighting" refers to the collection of weighting calculations that
   incorporate term frequency and inverse document frequency, not just
   to one.  Two weighting calculations have been particularly dominant
   -- Okapi [368] and pivoted normalization [369].  A distributed
   version of Google's Pagerank algorithm has also been devised for a
   P2P environment [370].  It allows incremental, ongoing Pagerank
   calculations while documents are inserted and deleted.

   A couple of early P2P systems leveraged the vector model.  Building
   on the vector model, PlanetP divided the ranking problem into two
   steps [215].  In the first, peers are ranked for the probability that
   they have matching documents.  In the second, higher-priority peers
   are contacted and the matching documents are ranked.  An Inverse Peer
   Frequency, analogous to the Inverse Document Frequency, is used to
   rank relevant peers.  To further constrain the query traffic, PlanetP
   contacts only the first group of m peers to retrieve a relevant set
   of documents.  In this way, it repeatedly contacts groups of m peers
   until the top k document rankings are stable.  While the PlanetP
   designers first quantified recall and precision, they also considered
   reliability.  Each PlanetP peer has a global index with a list of all
   other peers, their IP addresses, and their Bloom filters.  This large
   volume of shared information needs to be maintained.  Klampanos and
   Jose saw this as PlanetP's primary shortcoming [371].  Each Bloom
   filter summarized the set of terms in the local index of each peer.
   The time to propagate changes, be they new documents or peer
   arrivals/departures, was studied by simulation for up to 1000 peers.
   The reported propagation times were in the hundreds of seconds.
   Design workarounds were required for PlanetP to be viable across
   slower dial-up modem connections.  For future work, the authors were

RFC4981 - Page 42

   considering some sort of hierarchy to scale to larger numbers of
   peers.

   A second early system using the vector model is the Fault-tolerant,
   Adaptive, Scalable Distributed (FASD) search engine [283], which
   extended the Freenet design (Section 2.3) for richer queries.  The
   original Freenet design could find a document based on a globally
   unique identifier.  Kronfol's design added the ability to search, for
   example, for documents about "apples AND oranges NOT bananas".  It
   uses a TFIDF weighting scheme to build a document's term vector.
   Each peer calculates the similarity of the query vector and local
   documents and forwards the query to the best downstream peer.  Once
   the best downstream peer returns a result, the second-best peer is
   tried, and so on.  Simulations with 1000 nodes gave an indication of
   the query path lengths in various situations -- when routing queries
   in a network with constant rates of node and document insertion, when
   bootstrapping the network in a "worst-case" ring topology, or when
   failing randomly and specifically selected peers.  Kronfol claimed
   excellent average-case performance -- less than 20 hops to retrieve
   the same top n results as a centralized search engine.  There were,
   however, numerous cases where the worst-case path length was several
   hundred hops in a network of only 1000 nodes.

   In parallel, there have been some P2P designs based on the vector
   model from the University of Rochester -- pSearch [9, 372] and
   eSearch [373].  The early pSearch paper suggested a couple of
   retrieval models, one of which was the Vector Space Model, to search
   only the nodes likely to have matching documents.  To obtain
   approximate global statistics for the TFIDF calculation, a spanning
   tree was constructed across a subset of the peers.  For the m top
   terms, the term-to-document index was inserted into a Content-
   Addressable Network [334].  A variant that mapped terms to document
   clusters was also suggested. eSearch is a hybrid of the partition-
   by-document and partition-by-term approaches (Section 4.1.2) eSearch
   nodes are primarily partitioned by term.  Each is responsible for the
   inverted lists for some top terms.  For each document in the inverted
   list, the node stores the complete term list.  To reduce the size of
   the index, the complete term lists for a document are only kept on
   nodes that are responsible for top terms in the document.  eSearch
   uses the Okapi term weighting to select top terms.  It relies on the
   Chord DHT [34] to associate terms with nodes storing the inverted
   lists.  It also uses automatic query expansion.  This takes the
   significant terms from the top document matches and automatically
   adds them to the user's query to find additional relevant documents.
   The eSearch performance was quantified in terms of search precision,
   the number of retrieved documents, and various load-balancing
   metrics.  Compared to the more common proposals for partitioning by

RFC4981 - Page 43

   keywords, eSearch consumed 6.8 times the storage space to achieve
   faster search times.

4.2.2.  Latent Semantic Indexing (pSearch)

   Another retrieval model used in P2P proposals is Latent Semantic
   Indexing (LSI) [374].  Its key idea is to map both the document and
   query vectors to a concept space with lower dimensions.  The starting
   point is a t*N weighting matrix, where t is the total number of
   indexed terms, N is the total number of documents, and the matrix
   elements could be TFIDF rankings.  Using singular value
   decomposition, this matrix is reduced to a smaller number of
   dimensions, while retaining the more significant term-to-document
   mappings.  Baeza-Yates and Ribeiro-Neto suggested that LSI's value is
   a novel theoretic framework, but that its practical performance
   advantage for real document collections had yet to be proven [262].
   pSearch incorporated LSI [9].  By placing the indices for
   semantically similar documents close in the network, Tang, Xu, et al.
   touted significant bandwidth savings relative to the early full-
   flooding variant of Gnutella [372].  They plotted the number of nodes
   visited by a query.  They also explored the trade-off with accuracy,
   the percentage match between the documents returned by the
   distributed pSearch algorithm and those from a centralized LSI
   baseline.  In a more recent update to the pSearch work, Tang,
   Dwarkadas, et al. summarized LSI's shortcomings [375].  Firstly, for
   large document collections, its retrieval quality is inherently
   inferior to Okapi.  Secondly, singular value decomposition consumes
   excessive memory and computation time.  Consequently, the authors
   used Okapi for searching while retaining LSI for indexing.  With
   Okapi, they selected the next node to be searched and selected
   documents on searched nodes.  With LSI, they ensured that similar
   documents are clustered near each other, thereby optimizing the
   network search costs.  When retrieving a small number of top
   documents, the precision of LSI+Okapi approached that of Okapi.
   However, if retrieving a large number of documents, the LSI+Okapi
   precision is inferior.  The authors want to improve this in future
   work.

4.2.3.  Small Worlds

   The "small world" concept originally described how people are
   interconnected by short chains of acquaintances [376].  Kleinberg was
   struck by the algorithmic lesson of the small world, namely "that
   individuals using local information are collectively very effective
   at constructing short paths between two points in a social network"
   [377].  Small world networks have a small diameter and a large
   clustering coefficient (a large number of connections amongst
   relevant nodes) [378].

RFC4981 - Page 44

   The small world idea has had a limited impact on peer-to-peer
   algorithms.  It has influenced only a few unstructured [62, 378-380]
   and structured [344, 381] algorithms.  The most promising work on
   "small worlds" in P2P networks are those concerned with the
   information retrieval metrics, precision and recall [62, 378, 380].

5.  Queries

   Database research suggests directions for P2P research.  Hellerstein
   observed that, while work on fast P2P indexes is well underway, P2P
   query optimization remains a promising topic for future research
   [23].  Kossman reviewed the state of the art of distributed query
   processing, highlighting areas for future research: simulation and
   query optimization for networks of tens of thousands of servers and
   millions of clients; non-relational data types (e.g., XML, text, and
   images); and partial query responses since on the Internet, "failure
   is the rule rather than the exception" [19].  A primary motivation
   for the P2P system, PIER, was to scale from the largest database
   systems of a few hundred nodes to an Internet environment in which
   there are over 160 million nodes [22].  Litwin and Sahri have also
   considered ways to combine distributed hashing, more specifically the
   Scalable Distributed Data Structures, with SQL databases, claiming to
   be first to implement scalable distributed database partitioning
   [382].  Motivated by the lack of transparent distribution in current
   distributed databases, they measure query execution times for
   Microsoft SQL servers aggregated by means of an SDDS layer.  One of
   their starting assumptions was that it is too challenging to change
   the SQL query optimizer.

   Database research also suggests the approach to P2P research.
   Researchers of database query optimization were divided between those
   looking for optimal solutions in special cases and those using
   heuristics to answer all queries [383].  Gribble, et al. cast query
   optimization in terms of the data placement problem, which is to
   "distribute data and work so the full query workload is answered with
   lowest cost under the existing bandwidth and resource constraints"
   [250].  They pointed out that even the static version of this problem
   is NP-complete in P2P networks.  Consequently, research on massive,
   dynamic P2P networks will likely progress using both strategies of
   early database research - heuristics and special-case optimizations.

   If P2P networks are going to be adaptable, if they are to support a
   wide range of applications, then they need to accommodate many query
   types [72].  Up to this point, we have reviewed queries for keys
   (Section 3) and keywords (Sections 4.1. and 4.2).  Unfortunately, a
   major shortcoming of the DHTs in Section 3.5 is that they primarily
   support exact-match, single-key queries.  Skip Graphs support range
   and prefix queries, but not aggregation queries.  Here we probe below

RFC4981 - Page 45

   the language syntax to identify the open research issues associated
   with more expressive P2P queries [25].  Triantafillou and Pitoura
   observed the disparate P2P designs for different types of queries and
   so outlined a unifying framework [76].  To classify queries, they
   considered the number of relations (single or multiple), the number
   of attributes (single or multiple), and the type of query operator.
   They described numerous operators:  equality, range, join, and
   "special functions".  The latter referred to aggregation (like sum,
   count, average, minimum, and maximum), grouping and ordering.  The
   following sections approximately fit their taxonomy -- range queries,
   multi-attribute queries, join queries and aggregation queries.  There
   has been some initial P2P work on other query types -- continuous
   queries [20, 22, 73], recursive queries [22, 74], and adaptive
   queries [23, 75].  For these, we defer to the primary references.

5.1.  Range Queries

   The support of efficient range predicates in P2P networks was
   identified as an important open research issue by Huebsch, et al.
   [22].  Range partitioning has been important in parallel databases to
   improve performance, so that a transaction commonly needs data from
   only one disk or node [22].  One type of range search, longest prefix
   match, is important because of its prevalence in routing schemes for
   voice and data networks alike.  In other applications, users may pose
   broad, inexact queries, even though they require only a small number
   of responses.  Consequently, techniques to locate similar ranges are
   also important [77].  Various proposals for range searches over P2P
   networks are summarized in Figure 4.  Since the Scalable Distributed
   Data Structure (SDDS) has been an important influence on contemporary
   Distributed Hash Tables (DHTs) [49-51], we also include ongoing work
   on SDDS range searches.

   PEER-TO-PEER (P2P)
   Locality Sensitive Hashing (Chord) [77]
   Prefix Hash Trees (unspecified DHT) [78, 79]
   Space Filling Curves (CAN) [80]
   Space Filling Curves (Chord) [81]
   Quadtrees (Chord) [82]
   Skip Graphs [38, 41, 83, 100]
   Mercury [84]
   P-Grid [85, 86]

   SCALABLE DISTRIBUTED DATA STRUCTURES (SDDS)
   RP*   [87, 88]

       Figure 4: Solutions for Range Queries on P2P and SDDS Indexes

RFC4981 - Page 46

   The papers on P2P range search can be divided into those that rely on
   an underlying DHT (the first five entries in Figure 4) and those that
   do not (the subsequent three entries).  Bharambe, Agrawal, et al.
   argued that DHTs are inherently ill-suited to range queries [84].
   The very feature that makes for their good load balancing properties,
   randomized hash functions, works against range queries.  One possible
   solution would be to hash ranges, but this can require a priori
   partitioning.  If the partitions are too large, partitions risk
   overload.  If they are too small, there may be too many hops.

   Despite these potential shortcomings, there have been several range
   query proposals based on DHTs.  If hashing ranges to nodes, it is
   entirely possible that overlapping ranges map to different nodes.
   Gupta, Agrawal, et al. rely on locality sensitive hashing to ensure
   that, with high probability, similar ranges are mapped to the same
   node [77].  They propose one particular family of locality sensitive
   hash functions, called min-wise independent permutations.  The number
   of partitions per node and the path length were plotted against the
   total numbers of peers in the system.  For a network with 1000 nodes,
   the hop count distribution was very similar to that of the exact-
   matching Chord scheme.  Was it load-balanced?  For the same network
   with 50,000 partitions, there were over two orders of magnitude
   variation in the number of partitions at each node (first and
   ninety-ninth percentiles).  The Prefix Hash Tree is a trie in which
   prefixes are hashed onto any DHT.  The preliminary analysis suggests
   efficient doubly logarithmic lookup, balanced load, and fault
   resilience [78, 79].  Andrzejak and Xu were perhaps the first to
   propose a mapping from ranges to DHTs [80].  They use one particular
   Space Filling Curve, the Hilbert curve, over a Content Addressable
   Network (CAN) construction (Section 3.5.3).  They maintain two
   properties: nearby ranges map to nearby CAN zones; if a range is
   split into two sub-ranges, then the zones of the sub-ranges partition
   the zone of the primary range.  They plot path length and load proxy
   measures (the total number of messages and nodes visited) for three
   algorithms to propagate range queries: brute force, controlled
   flooding, and directed controlled flooding.  Schmidt and Parashar
   also advocated Space Filling Curves to achieve range queries over a
   DHT [81].  However, they point out that, while Andrzejak and Xu use
   an inverse Space Filling Curve to map a one-dimensional space to d-
   dimensional zones, they map a d-dimensional space back to a one-
   dimensional index.  Such a construction gives the ability to search
   across multiple attributes (Section 5.2).  Tanin, Harwood, et al.
   suggested quadtrees over Chord [82], and gave preliminary simulation
   results for query response times.

   Because DHTs are naturally constrained to exact-match, single-key
   queries, researchers have considered other P2P indexes for range
   searches.  Several were based on Skip Graphs [38, 41], which, unlike

RFC4981 - Page 47

   the DHTs, do not necessitate randomizing hash functions and are
   therefore capable of range searches.  Unfortunately, they are not
   load balanced [83].  For example, in SkipNet [48], hashing was added
   to balance the load -- the Skip Graph could support range searches or
   load balancing, but not both.  One solution for load-balancing relies
   on an increased number of 'virtual' servers [168] but, in their
   search for a system that can both search for ranges and balance
   loads, Bharambe, Agrawal, et al. rejected the idea [84].  The virtual
   servers work assumed load imbalance stems from hashing; that is, by
   skewed data insertions and deletions.  In some situations, the
   imbalance is triggered by a skewed query load.  In such
   circumstances, additional virtual servers can increase the number of
   routing hops and increase the number of pointers that a Skip Graph
   needs to maintain.  Ganesan, Bawa, et al. devised an alternate method
   to balance load [83].  They proposed two Skip Graphs, one to index
   the data itself and the other to track load at each node in the
   system.  Each node is able to determine the load on its neighbours
   and the most (least) loaded nodes in the system.  They devise two
   algorithms: NBRADJUST balances load on neighbouring nodes; using
   REORDER, empty nodes can take over some of the tuples on heavily
   loaded nodes.  Their simulations focus on skewed storage load, rather
   than on skewed query loads, but they surmise that the same approach
   could be used for the latter.

   Other proposals for range queries avoid both the DHT and the Skip
   Graph.  Bharambe, Agrawal, et al. distinguish their Mercury design by
   its support for multi-attribute range queries and its explicit load
   balancing [84].  In Mercury, nodes are grouped into routing hubs,
   each of which is responsible for various query attributes.  While it
   does not use hashing, Mercury is loosely similar to the DHT
   approaches: nodes within hubs are arranged into rings, like Chord
   [34]; for efficient routing within hubs, k long-distance links are
   used, like Symphony [381].  Range lookups require O(((log n)^2)/k)
   hops.  Random sampling is used to estimate the average load on nodes
   and to find the parts of the overlay that are lightly loaded.
   Whereas Symphony assumed that nodes are responsible for ranges of
   approximately equal size, Mercury's random sampling can determine the
   location of the start of the range, even for non-uniform ranges [84].
   P-Grid [42] does provide for range queries, by virtue of the key
   ordering in its tree structures.  Ganesan, Bawa, et al. critiqued its
   capabilities [83]: P-Grid assumes fixed-capacity nodes; there was no
   formal characterization of imbalance ratios or balancing costs; every
   P-Grid periodically contacts other nodes for load information.

   The work on Scalable Distributed Data Structures (SDDSs) has
   progressed in parallel with P2P work and has addressed range queries.
   Like the DHTs above, the early SDDS Linear Hashing (LH*) schemes were
   not order-preserving [52].  To facilitate range queries, Litwin,

RFC4981 - Page 48

   Niemat, et al. devised a Range Parititioning variant, RP* [87].
   There are options to dispense with the index, to add indexes to
   clients, and to add them to servers.  In the variant without an
   index, every query is issued via multicasting.  The other variants
   also use some multicasting.  The initial RP* paper suggested
   scalability to thousands of sites, but a more recent RP* simulation
   was capped at 140 servers [88].  In that work, Tsangou, Ndiaye, et
   al. investigated TCP and UDP mechanisms by which servers could return
   range query results to clients.  The primary metrics were search and
   response times.  Amongst the commercial parallel database management
   systems, they reported that the largest seems only to scale to 32
   servers (SQL Server 2000).  For future work, they planned to explore
   aggregation of query results, rather than establishing a connection
   between the client and every single server with a response.

   All in all, it seems there are numerous open research questions on
   P2P range queries.  How realistic is the maintenance of global load
   statistics considering the scale and dynamism of P2P networks?
   Simulations at larger scales are required.  Proposals should take
   into account both the storage load (insert and delete messages) and
   the query load (lookup messages).  Simplifying assumptions need to be
   attacked.  For example, how well do the above solutions work in
   networks with heterogeneous nodes, where the maximum message loads
   and index sizes are node-dependent?

5.2.  Multi-Attribute Queries

   There has been some work on multi-attribute P2P queries.  As late as
   September 2003, it was suggested that there has not been an efficient
   solution [76].

   Again, an early significant work on multi-attribute queries over
   aggregated commodity nodes germinated amongst SDDSs.  k-RP* [89] uses
   the multi-dimensional binary search tree (or k-d tree, where k
   indicates the number of dimensions of the search index) [384].  It
   builds on the RP* work from the previous section and inherits their
   capabilities for range search and partial match.  Like the other
   SDDSs, k-RP* indexes can fit into RAM for very fast lookup.  For
   future work, Litwin and Neimat suggested a) a formal analysis of the
   range search termination algorithm and the k-d paging algorithm, b) a
   comparison with other multi-attribute data structures (quad-trees and
   R-trees) and c) exploration of query processing, concurrency control,
   and transaction management for k-RP* files [89].  On the latter
   point, others have considered transactions to be inconsequential to
   the core problem of supporting more complex queries in P2P networks
   [72].

RFC4981 - Page 49

   In architecting their secure wide-area Service Discovery Service
   (SDS), Hodes, Czerwinski, et al. considered three possible designs
   for multi-criteria search -- Centralization, Mapping and Flooding
   [90].  These correlate to the index classifications of Section 2 --
   Central, Distributed, and Local.  They discounted the centralized,
   Napster-like index for its risk of a single point of failure.  They
   considered the hash-based mappings of Section 3, but concluded that
   it would not be possible to adequately partition data.  A document
   satisfying many criteria would be wastefully stored in many
   partitions.  They rejected full flooding for its lack of scalability.
   Instead, they devised a query filtering technique, reminiscent of
   Gnutella's query routing protocol (Section 4.1).  Nodes push
   proactive summaries of their data rather than waiting for a query.
   Summaries are aggregated and stored throughout a server hierarchy, to
   guide subsequent queries.  Some initial prototype measurements were
   provided for total load on the system, but not for load distribution.
   They put several issues forward for future work.  The indexing needs
   to be flexible to change according to query and storage workloads.  A
   mesh topology might improve on their hierarchic topology since query
   misses would not propagate to root servers.  The choice is analogous
   to BGP meshes and DNS trees.

   More recently, Cai, Frank, et al. devised the Multi-Attribute
   Addressable Network (MAAN) [91].  They built on Chord to provide both
   multi-attribute and range queries, claiming to be the first to
   service both query types in a structured P2P system.  Each MAAN node
   has O(log n) neighbours, where N is the number of nodes.  MAAN
   multi-attribute range queries require O(log n+N*Smin) hops, where
   Smin is the minimum range selectivity across all attributes.
   Selectivity is the ratio of the query range to the entire identifier
   range.  The paper assumed that a locality preserving hash function
   would ensure balanced load.  Per Section 5.1, the arguments by
   Bharambe, Agrawal, et al. have highlighted the shortcomings of this
   assumption [84].  MAAN required that the schema must be fixed and
   known in advance -- adaptable schemas were recommended for subsequent
   attention.  The authors also acknowledged that there is a selectivity
   breakpoint at which full flooding becomes more efficient than their
   scheme.  This begs for a query resolution algorithm that adapts to
   the profile of queries.  Cai and Frank followed up with RDFPeers
   [55].  They differentiate their work from other RDF proposals by a)
   guaranteeing to find query results if they exist and b) removing the
   requirement of prior definition of a fixed schema.  They hashed
   <subject, predicate, object> triples onto the MAAN and reported
   routing hop metrics for their implementation.  Load imbalance across
   nodes was reduced to less than one order of magnitude, but the
   specific measure was the number of triples stored per node - skewed
   query loads were not considered.  They plan to improve load balancing
   with the virtual servers of Section 5.1 [168].

RFC4981 - Page 50

5.3.  Join Queries

   Two research teams have done some initial work on P2P join
   operations.  Harren, Hellerstein, et al. initially described a
   three-layer architecture -- storage, DHT and query processing.  They
   implemented the join operation by modifying an existing Content
   Addressable Network (CAN) simulator, reporting "significant hot-spots
   in all dimensions: storage, processing, and routing" [72].  They
   progressed their design more recently in the context of PIER, a
   distributed query engine based on CAN [22, 385].  They implemented
   two equi-join algorithms.  In their design, a key is constructed from
   the "namespace" and the "resource ID".  There is a namespace for each
   relation and the resource ID is the primary key for base tuples in
   that relation.  Queries are multicast to all nodes in the two
   namespaces (relations) to be joined.  Their first algorithm is a DHT
   version of the symmetric hash join.  Each node in the two namespaces
   finds the relevant tuples and hashes them to a new query namespace.
   The resource ID in the new namespace is the concatenation of join
   attributes.  In the second algorithm, called "fetch matches", one of
   the relations is already hashed on the join attributes.  Each node in
   the second namespace finds tuples matching the query and retrieves
   the corresponding tuples from the first relation.  They leveraged two
   other techniques, namely the symmetric semi-join rewrite and the
   Bloom filter rewrite, to reduce the high bandwidth overheads of the
   symmetric hash join.  For an overlay of 10,000 nodes, they simulated
   the delay to retrieve tuples and the aggregate network bandwidth for
   these four schemes.  The initial prototype was on a cluster of 64
   PCs, but it has more recently been expanded to PlanetLab.

   Triantafillou and Pitoura considered multicasting to large numbers of
   peers to be inefficient [76].  They therefore allocated a limited
   number of special peers, called range guards.  The domain of the join
   attributes was divided, one partition per range guard.  Join queries
   were sent only to range guards, where the query was executed.
   Efficient selection of range guards and a quantitive evaluation of
   their proposal were left for future work.

5.4.  Aggregation Queries

   Aggregation queries invariable rely on tree-structures to combine
   results from a large number of nodes.  Examples of aggregation
   queries are Count, Sum, Maximum, Minimum, Average, Median, and Top-K
   [92, 386, 387].  Figure 5 summarizes the tree and query
   characteristics that affect dependability.

RFC4981 - Page 51

   Tree type: Doesn't use DHT [92], use internal DHT trees [95], use
      independent trees on top of DHTs
   Tree repair: Periodic [93], exceptional [32]
   Tree count: One per key, one per overlay [56]
   Tree flexibility: Static [92], dynamic

   Query interface: install, update, probe [98]
   Query distribution: multicast [98], gossip [92]
   Query applications: leader election, voting, resource location,
      object placement and error recovery [98, 388]
   Query semantics
      Consistency: Best-effort, eventual [92], snapshot / interval /
         single-site validity [99]
      Timeliness [388]
      Lifetime: Continuous [97, 99], single-shot
      No. attributes: Single, multiple
   Query types: Count, sum, maximum, minimum, average, median, top k
      [92, 386, 387]

          Figure 5: Aggregation Trees and Queries in P2P Networks

   Key: Astrolabe [92]; Cone [93]; Distributed Approximative System
   Information Service (DASIS) [95]; Scalable Distributed Information

   Management System (SDIMS) [98]; Self-Organized Metadata Overlay
   (SOMO) [56]; Wildfire [99]; Willow [32]; Newscast [97]

   The fundamental design choices for aggregation trees relate to how
   the overlay uses DHTs, how it repairs itself when there are failures,
   how many aggregation trees there are, and whether the tree is static
   or dynamic (Figure 5).  Astrolabe is one of the most influential P2P
   designs included in Figure 5, yet it makes no use of DHTs [92].
   Other designs make use of the internal trees of Plaxton-like DHTs.
   Others build independent tree structures on top of DHTs.  Most of the
   designs repair the aggregation tree with periodic mechanisms similar
   to those used in the DHTs themselves.  Willow is an exception [32].
   It uses a Tree Maintenance Protocol to "zip" disjoint aggregation
   trees together when there are major failures.  Yalagandula and Dahlin
   found reconfigurations at the aggregation layer to be costly,
   suggesting more research on techniques to reduce the cost and
   frequency of such reconfigurations [98].  Many of the designs use
   multiple aggregation trees, each rooted at the DHT node responsible
   for the aggregation attribute.  On the other hand, the Self-Organized
   Metadata Overlay [56] uses a single tree and is vulnerable to a
   single point of failure at its root.

RFC4981 - Page 52

   At the time of writing, researchers have just begun exploring the
   performance of queries in the presence of churn.  Most designs are
   for best-effort queries.  Bawa, et al. devised a better consistency
   model, called Single-Site Validity [99] to qualify the accuracy of
   results when there is churn.  Its price was a five-fold increase in
   the message load, when compared to an efficient but best-effort
   Spanning Tree.  Gossip mechanisms are resilient to churn, but they
   delay aggregation results and incur high message cost for aggregation
   attributes with small read-to-write ratios.

6.  Security Considerations

   An initial list of references to research on P2P security is given in
   Figure 1, Section 1.  This document addresses P2P search.  P2P
   storage, security, and applications are recommended for further
   investigation in Section 8.

7.  Conclusions

   Research on peer-to-peer networks can be divided into four categories
   -- search, storage, security and applications.  This critical survey
   has focused on search methods.  While P2P networks have been
   classified by the existence of an index (structured or unstructured)
   or the location of the index (local, centralized, and distributed),
   this survey has shown that most have evolved to have some structure,
   whether it is indexes at superpeers or indexes defined by DHT
   algorithms.  As for location, the distributed index is most common.
   The survey has characterized indexes as semantic and semantic-free.
   It has also critiqued P2P work on major query types.  While much of
   it addresses work from 2000 or later, we have traced important
   building blocks from the 1990s.

   The initial motivation in this survey was to answer the question,
   "How robust are P2P search networks?"  The question is key to the
   deployment of P2P technology.  Balakrishnan, Kaashoek, et al. argued
   that the P2P architecture is appealing: the startup and growth
   barriers are low; they can aggregate enormous storage and processing
   resources; "the decentralized and distributed nature of P2P systems
   gives them the potential to be robust to faults or intentional
   attacks" [18].  If P2P is to be a disruptive technology in
   applications other than casual file sharing, then robustness needs to
   be practically verified [20].

   The best comparative research on P2P dependability has been done in
   the context of Distributed Hash Tables (DHTs) [291].  The entire body
   of DHT research can be distilled to four main observations about
   dependability (Section 3.2).  Firstly, static dependability
   comparisons show that no O(log n) DHT geometry is significantly more

RFC4981 - Page 53

   dependable than the other O(log n) geometries.  Secondly, dynamic
   dependability comparisons show that DHT dependability is sensitive to
   the underlying topology maintenance algorithms (Figure 2).  Thirdly,
   most DHTs use O(log n) geometries to suit ephemeral nodes, whereas
   the O(1) hop DHTs suit stable nodes - they deserve more research
   attention.  Fourthly, although not yet a mature science, the study of
   DHT dependability is helped by recent simulation tools that support
   multiple DHTs [299].

   We make the following four suggestions for future P2P research:

   1) Complete the companion P2P surveys for storage, security, and
      applications.  A rough outline has been suggested in Figure 1,
      along with references.  The need for such surveys was highlighted
      within the peer-to-peer research group of the Internet Research
      Task Force (IRTF) [17].

   2) P2P indexes are maturing.  P2P queries are embryonic.  Work on
      more expressive queries over P2P indexes started to gain momentum
      in 2003, but remains fraught with efficiency and load issues.

   3) Isolate the low-level mechanisms affecting robustness.  There is
      limited value in comparing robustness of DHT geometries (like
      rings versus de Bruijn graphs), when robustness is highly
      sensitive to underlying topology maintenance algorithms (Figure
      2).

   4) Build consensus on robustness metrics and their acceptable ranges.
      This paper has teased out numerous measures that impinge on
      robustness, for example, the median query path length for a
      failure of x% of nodes, bisection width, path overlap, the number
      of alternatives available for the next hop, lookup latency,
      average live bandwidth (bytes/node/sec), successful routing rates,
      the number of timeouts (caused by a finger pointing to a departed
      node), lookup failure rates (caused by nodes that temporarily
      point to the wrong successor during churn), and clustering
      measures (edge expansion and node expansion).  Application-level
      robustness metrics need to drive a consistent assessment of the
      underlying search mechanics.

8.  Acknowledgments

   This document was adapted from a paper in Elsevier's Computer
   Networks:

      J. Risson & T. Moors, Survey of Research towards Robust Peer-to-
      Peer Networks: Search Methods, Computer Networks 51(7)2007.

RFC4981 - Page 54

   We thank Bill Yeager, Ali Ghodsi, and several anonymous reviewers for
   thorough comments that significantly improved the quality of earlier
   versions of this document.

(page 54 continued on part 4)