4. Semantic Index
Semantic indexes capture object relationships. While the semantic- free methods (DHTs) have firmer theoretic foundations and guarantee that a key can be found if it exists, they do not capture the relationships between the document name and its content or metadata on their own. Semantic P2P designs do. However, since their design is often driven by heuristics, they may not guarantee that scarce items will be found. So what might the semantically indexed P2Ps add to an already crowded field of distributed information architectures? At one extreme, there are the distributed relational database management systems (RDBMSs), with their strong consistency guarantees [284]. They provide strong data independence, the flexibility of SQL queries, and strong transactional semantics -- Atomicity, Consistency, Isolation and Durability (ACID) [349]. They guarantee that the query response is complete -- all matching results are returned. The price is performance. They scale to perhaps 1000 nodes, as evidenced in Mariposa [350, 351], or require query caching front ends to constrain the load [284]. Database research has "arguably been cornered into traditional, high-end, transactional applications" [72]. Then there are distributed file systems, like the Network File System (NFS) or the Serverless Network File Systems (xFS), with little data independence, low-level file retrieval interfaces, and varied consistency [284]. Today's eclectic mix of Content Distribution Networks (CDNs) generally deload primary servers by redirecting Web requests to a nearby replica. Some intercept the HTTP requests at the DNS level and then use consistent hashing to find a replica [23]. Since this same consistent hashing was a forerunner to the DHT
approaches above, CDNs are generally constrained to the same simple key lookups. The opportunity for semantically indexed P2Ps, then, is to provide: a) graduated data independence, consistency, and query flexibility, and b) probabilistically complete query responses, across c) very large numbers of low-cost, geographically distributed, dynamic nodes.4.1. Keyword Lookup
P2P keyword lookup is best understood by considering the structure of the underlying index and the algorithms by which queries are routed over that index. Figure 3 summarizes the following paragraphs by classifying the keyword query algorithms, index structures, and metrics. The research has largely focused on scalability, not dependability. There have been very few studies that quantify the impact of network churn. One exception is the work by Chawathe, et al. on the Gia system [61]. Gia's combination of algorithms from Figure 3 (receiver-based flow control, biased random walk, and one- hop replication) gave 2-4 orders of magnitude improvement in query success rates in churning networks.
QUERY Query routing Flooding: Peers only index local files so queries must propagate widely [4] Policy-based: Choice of the next hop node: random; most/least recently used; most files shared; most results [265, 352] Random walks: Parallel [67] or biased random walks [61, 66] Query forwarding Iterative: Nodes perform iterative unicast searches of ultrapeers, until the desired number of results is achieved. See Gnutella UDP Extension for Scalable Searches (GUESS) [265, 353] Recursive Query flow control Receiver-controlled: Receivers grant query tokens to senders, so as to avoid overload [61] Reactive: sender throttles queries when it notices receivers are discarding packets [61, 66] Dynamic Time To Live: In the Dynamic Query Protocol, the sender adjusts the time-to-live on each iteration based on the number of results received, the number of connections left, and the number of nodes already theoretically reached by the search [354] INDEX Distribution Compression: Leaf nodes periodically send ultrapeers compressed query routing tables, as in the Query Routing Protocol [260] One hop replication: Nodes maintain an index of content on their nearest neighbors [61, 352] Partitioning By document [210] By keyword: Use an inverted list to find a matching document, either locally or at another peer [21]. Partition by keyword sets [355] By document and keyword: Also called Multi-Level Partitioning [21] METRIC Query load: Queries per second per node/link [65, 265] Degree: The number of links per node [66, 352]. Early P2P networks approximated power-law networks, where the number of nodes with L links is proportional to L^(-k), where k is a constant [65] Query delay: Reported in terms of time and hop count [61, 66] Query success rate: The "Collapse Point" is the per-node query rate at which the query success rate drops below 90% [61]. See also [61, 265, 352]. Figure 3: Keyword Lookup in P2P Systems
4.1.1. Gnutella Enhancements
Perhaps the most widely referenced P2P system for simple keyword match is Gnutella [4]. Gnutella queries contain a string of keywords. Gnutella peers answer when they have files whose names contain all the keywords. As discussed in Section 2.1, early versions of Gnutella did not forward the document index. Queries were flooded and peers searched their own local indexes for filename matches. An early review highlighted numerous areas for improvement [65]. It was estimated that the query traffic alone from 50,000 early-generation Gnutella nodes would amount to 1.7% of the total U.S. Internet backbone traffic at December 2000 levels. It was speculated that high-degree Gnutella nodes would impede dependability. An unnecessarily high percentage of Gnutella traffic crossed Autonomous System (AS) boundaries -- a locality mechanism may have found suitable nearby peers. Fortunately, there have since been numerous enhancements within the Gnutella Developer Forum. At the time of writing, it has been reported that Gnutella has almost 350,000 unique hosts, of which nearly 90,000 accept incoming connections [356]. One of the main improvements is that an index of filename keywords, called the Query Routing Table (QRT), can now be forwarded from 'leaf peers' to its 'ultrapeers' [260]. Ultrapeers can then ensure that the leaves only receive queries for which they have a match, dramatically reducing the query traffic at the leaves. Ultrapeers can have connections to many leaf nodes (~10-100) and a small number of other ultrapeers (<10) [260]. Originally, a leaf node's QRT was not forwarded by the parent ultrapeer to other ultrapeers. More recently, there has been a proposal to distribute aggregated QRTs amongst ultrapeers [357]. To further limit traffic, QRTs are compressed by hashing, according to the Query Routing Protocol (QRP) specification [281]. This same specification claims QRP may reduce Gnutella traffic by orders of magnitude, but cautions that simulation is required before mass deployment. A known shortcoming of QRP was that the extent of query propagation was independent of the popularity of the search terms. The Dynamic Query Protocol addressed this [358]. It required leaf nodes to send single queries to high-degree ultrapeers that adjust the queries' time-to-live (TTL) bounds according to the number of received query results. An earlier proposal, called the Gnutella UDP Extension for Scalable Searches (GUESS) [353], similarly aimed to reduce the number of queries for widely distributed files. GUESS reuses the non-forwarding idea (Section 2). A GUESS peer repeatedly queries single ultrapeers with a TTL of 1, with a small timeout on each query to limit load. It chooses the number of iterations and selects ultrapeers so as to satisfy its search needs. For adaptability, a small number of experimental Gnutella nodes have
implemented eXtensible Markup Language (XML) schemas for richer queries [359, 360]. None of the above Gnutella proposals explicitly assess robustness. The broader research community has recently been leveraging aspects of the Gnutella design. Lv, Ratnasamy, et al. exposed one assumption implicit in some of the early DHT work -- that designs "such as Gnutella are inherently not scalable, and therefore should be abandoned" [66]. They argued that by making better use of the more powerful peers, Gnutella's scalability issues could be alleviated. Instead of its flooding mechanism, they used random walks. Their preliminary design to bias random walks towards high capacity nodes did not go as far as the ultrapeer proposals in that the indexes did not move to the high-capacity nodes. Chawathe, Ratnasamy, et al. chose to extend the Gnutella design with their Gia system, in response to the perceived shortcomings of DHTs in Section 1.2 [61]. Compared to the early Gnutella designs, they incorporated several novel features. They devise a topology adaptation algorithm so that most peers are attached to high-degree peers. They use a random walk search algorithm, in lieu of flooding, and bias the query load towards higher-degree peers. For 'one-hop replication', they require all nodes to keep pointers to content on adjacent peers. To implement a receiver-controlled token-based flow control, a peer must have a token from its neighbouring peer before it sends a query to it. Chawathe, Ratnasamy, et al. show by simulations that the combination of these features provides a scalability improvement of three to five orders of magnitude over Gnutella "while retaining significant robustness". The main robustness metrics they used were the 'collapse point' query rate (the per-node query rate at which the successful query rate falls below 90%) and the average hop count immediately prior to collapse. Their comparison with Gnutella did not take into account the Gnutella enhancements above -- this was left as future work. Castro, Costa, and Rowstron argued that if Gnutella were built on top of a structured overlay, then both the query and overlay maintenance traffic could be reduced [259]. Yang, Vinograd, et al. explore various policies for peer selection in the GUESS protocol, since the issue is left open in the original proposal [265]. For example, the peer initiating the query could choose peers that have been "most recently used" or that have the "most files shared". Various policy pitfalls are identified. For example, good peers could be overloaded, victims of their own success. Alternatively, malicious peers could encourage the querying peer to try inactive peers. They conclude that a "most results" policy gives the best balance of robustness and efficiency. Like Castro, Costa, and Rowstron, they concentrated on the static network scenario. Cholvi, Felber, et al. very briefly describe how similar "least recently used" and "most often used" heuristics can be used by a peer to select peer 'acquaintances' [352]. They were motivated by the
congestion associated with Gnutella's TTL-limited flooding. Recognizing that the busiest peers can quickly become overloaded central hubs for the entire network, they limit the number of acquaintances for any given peer to 25. They sketch a mechanism to decrement a query's TTL multiple times when it traverses "interested peers". In summary, these Gnutella-related investigations are characterized by a bias for high-degree peers and very short directed query paths, a disdain for flooding, and concern about excessive load on the 'better' peers. Generally, the robustness analysis for dynamic networks (content updates and node arrivals/departures) remains open.4.1.2. Partition-by-Document, Partition-by-Keyword
One aspect of P2P keyword search systems has received particular attention: should the index be partitioned by document or by keyword? The issue affects scalability. To be partitioned by document, each node has a local index of documents for which it is responsible. Gnutella is a prime example. Queries are generally flooded in systems partitioned by document. On the other hand, a peer may assume responsibility for a set of keywords. The peer uses an inverted list to find a matching document, either locally or at another peer. If the query contains several keywords, inverted lists may need to be retrieved from several different peers to find the intersection [21]. The initial assessment by Li, Loo, et al. was that the partition-by-document approach was superior [210]. For one scenario of a full-text Web search, they estimated the communications costs to be about six times higher than the feasible budget. However, wanting to exploit prior work on inverted list intersection, they studied the partition-by-keyword strategy. They proposed several optimizations that put the communication costs for a partition-by-keyword system within an order of magnitude of feasibility. There had been a couple of prior papers that suggested partitioned-by-keyword designs incorporate DHTs to map keywords to peers [355, 361]. In Gnawali's Keyword-set Search System (KSS), the index is partitioned by sets of keywords [355]. Terpstra, Behnel, et al. point out that by keeping keyword pairs or triples, the number of lists per document in KSS is squared or tripled [362]. Shi, Guangwen, et al. interpreted the approximations of Li, Loo, et al. to mean that neither approach is feasible on its own [21]. Their Multi-Level Partitioning (MLP) scheme incorporates both partitioning approaches. They arrange nodes into a group hierarchy, with all nodes in the single 'level 0' group, and with the same nodes sub- divided into k logical subgroups on 'level 1'. The subgroups are again divided, level by level, until level l. The inverted index is partitioned by document between groups and by keyword within groups. MLP avoids the query flooding normally associated with systems partitioned by document, since a small number of nodes in each group
process the query. It reduces the bandwidth overheads associated with inverted list intersection in systems partitioned solely by keyword, since groups can calculate the intersection independently over the documents for which they are responsible. MLP was overlaid on SkipNet, per Section 3.5.6 [38]. Some initial analyses of communications costs and query latencies were provided.4.1.3. Partial Search, Exhaustive Search
Much of the research above addresses partial keyword search. Daswani, et al. highlighted the open problem of efficient, comprehensive keyword search [25]. How can exhaustive searches be achieved without flooding queries to every peer in the network? Terpstra, Behnel et al. couched the keyword search problem in rendezvous terms: dynamic keyword queries need to 'meet' with static document lists [362]. Their Bitzipper scheme is partitioned by document. They improved on full flooding by putting document metadata on 2sqrt(n) nodes and forwarding queries through only 6sqrt(n) nodes. They reported that Bitzipper nodes need only 1/166th of the bandwidth of full-flooding Gnutella nodes for an exhaustive search. An initial comparison of query load was given. There was little consideration of either static or dynamic resilience; that is, of nodes failing, of documents continually changing, or of nodes continually joining and leaving the network.4.2. Information Retrieval
The field of Information Retrieval (IR) has matured considerably since its inception in the 1950s [363]. A taxonomy for IR models has been formalized [262]. It consists of four elements: a representation of documents in a collection; a representation of user queries; a framework describing relationships between document representations and queries; and a ranking function that quantifies an ordering amongst documents for a particular query. Three main issues motivate current IR research -- information relevance, query response time, and user interaction with IR systems. The dominant IR trends for searching large text collections are also threefold [262]. The size of collections is increasing dramatically. More complicated search mechanisms are being found to exploit document structure, to accommodate heterogeneous document collections, and to deal with document errors. Compression is in favour -- it may be quicker to search compact text or retrieve it from external devices. In a distributed IR system, query processing has four parts. Firstly, particular collections are targeted for the search. Secondly, queries are sent to the targeted collections. Queries are then evaluated at the individual collections. Finally, results from the collections are collated.
So how do P2P networks differ from distributed IR systems? Bawa, Manku, et al. presented four differences [62]. They suggested that a P2P network is typically larger, with tens or hundreds of thousands of nodes. It is usually more dynamic, with node lifetimes measured in hours. They suggested that a P2P network is usually homogeneous, with a common resource description language. It lacks the centralized "mediators" found in many IR systems that assume responsibility for selecting collections, for rewriting queries, and for merging ranked results. These distinctions are generally aligned with the peer characteristics in Section 1. One might add that P2P nodes display more symmetry -- peers are often both information consumers and producers. Daswani, Garcia-Molina, et al. pointed out that, while there are IR techniques for ranked keyword search at moderate scale, research is required so that ranking mechanisms are efficient at the larger scale targeted by P2P designs [25]. Joseph and Hoshiai surveyed several P2P systems using metadata techniques from the IR toolkit [60]. They described an assortment of IR techniques and P2P systems, including various metadata formats, retrieval models, bloom filters, DHTs, and trust issues. In the ensuing paragraphs, we survey P2P work that has incorporated information retrieval models, particularly the Vector Model and the Latent Semantic Indexing Model. We omit the P2P work based on Bayesian models. Some have pointed to such work [60], but made no explicit mention of the model [364]. One early paper on P2P content-based image retrieval also leveraged the Bayesian model [365]. For the former two models, we briefly describe the design, then try to highlight robustness aspects. On robustness, we are again stymied for lack of prior work. Indeed, a search across all proceedings of the Annual ACM Conference on Research and Development in Information Retrieval for the words "reliable", "available", "dependable", or "adaptable" did not return any results at the time of writing. In contrast, a standard text on distributed database management systems [366] contains a whole chapter on reliability. IR research concentrates on performance measures. Common performance measures include recall, the fraction of the relevant documents that has been retrieved and precision, the fraction of the retrieved documents that is relevant [262]. Ideally, an IR system would have high recall and high precision. Unfortunately techniques favouring one often disadvantage the other [363].
4.2.1. Vector Model (PlanetP, FASD, eSearch)
The vector model [367] represents both documents and queries as term vectors, where a term could be a word or a phrase. If a document or query has a term, the weight of the corresponding dimension of the vector is non-zero. The similarity of the document and query vectors gives an indication of how well a document matches a particular query. The weighting calculation is critical across the retrieval models. Amongst the numerous proposals for the probabilistic and vector models, there are some commonly recurring weighting factors [363]. One is term frequency. The more a term is repeated in a document, the more important the term is. Another is inverse document frequency. Terms common to many documents give less information about the content of a document. Then there is document length. Larger documents can bias term frequencies, so weightings are sometimes normalized against document length. The expression "TFIDF weighting" refers to the collection of weighting calculations that incorporate term frequency and inverse document frequency, not just to one. Two weighting calculations have been particularly dominant -- Okapi [368] and pivoted normalization [369]. A distributed version of Google's Pagerank algorithm has also been devised for a P2P environment [370]. It allows incremental, ongoing Pagerank calculations while documents are inserted and deleted. A couple of early P2P systems leveraged the vector model. Building on the vector model, PlanetP divided the ranking problem into two steps [215]. In the first, peers are ranked for the probability that they have matching documents. In the second, higher-priority peers are contacted and the matching documents are ranked. An Inverse Peer Frequency, analogous to the Inverse Document Frequency, is used to rank relevant peers. To further constrain the query traffic, PlanetP contacts only the first group of m peers to retrieve a relevant set of documents. In this way, it repeatedly contacts groups of m peers until the top k document rankings are stable. While the PlanetP designers first quantified recall and precision, they also considered reliability. Each PlanetP peer has a global index with a list of all other peers, their IP addresses, and their Bloom filters. This large volume of shared information needs to be maintained. Klampanos and Jose saw this as PlanetP's primary shortcoming [371]. Each Bloom filter summarized the set of terms in the local index of each peer. The time to propagate changes, be they new documents or peer arrivals/departures, was studied by simulation for up to 1000 peers. The reported propagation times were in the hundreds of seconds. Design workarounds were required for PlanetP to be viable across slower dial-up modem connections. For future work, the authors were
considering some sort of hierarchy to scale to larger numbers of peers. A second early system using the vector model is the Fault-tolerant, Adaptive, Scalable Distributed (FASD) search engine [283], which extended the Freenet design (Section 2.3) for richer queries. The original Freenet design could find a document based on a globally unique identifier. Kronfol's design added the ability to search, for example, for documents about "apples AND oranges NOT bananas". It uses a TFIDF weighting scheme to build a document's term vector. Each peer calculates the similarity of the query vector and local documents and forwards the query to the best downstream peer. Once the best downstream peer returns a result, the second-best peer is tried, and so on. Simulations with 1000 nodes gave an indication of the query path lengths in various situations -- when routing queries in a network with constant rates of node and document insertion, when bootstrapping the network in a "worst-case" ring topology, or when failing randomly and specifically selected peers. Kronfol claimed excellent average-case performance -- less than 20 hops to retrieve the same top n results as a centralized search engine. There were, however, numerous cases where the worst-case path length was several hundred hops in a network of only 1000 nodes. In parallel, there have been some P2P designs based on the vector model from the University of Rochester -- pSearch [9, 372] and eSearch [373]. The early pSearch paper suggested a couple of retrieval models, one of which was the Vector Space Model, to search only the nodes likely to have matching documents. To obtain approximate global statistics for the TFIDF calculation, a spanning tree was constructed across a subset of the peers. For the m top terms, the term-to-document index was inserted into a Content- Addressable Network [334]. A variant that mapped terms to document clusters was also suggested. eSearch is a hybrid of the partition- by-document and partition-by-term approaches (Section 4.1.2) eSearch nodes are primarily partitioned by term. Each is responsible for the inverted lists for some top terms. For each document in the inverted list, the node stores the complete term list. To reduce the size of the index, the complete term lists for a document are only kept on nodes that are responsible for top terms in the document. eSearch uses the Okapi term weighting to select top terms. It relies on the Chord DHT [34] to associate terms with nodes storing the inverted lists. It also uses automatic query expansion. This takes the significant terms from the top document matches and automatically adds them to the user's query to find additional relevant documents. The eSearch performance was quantified in terms of search precision, the number of retrieved documents, and various load-balancing metrics. Compared to the more common proposals for partitioning by
keywords, eSearch consumed 6.8 times the storage space to achieve faster search times.4.2.2. Latent Semantic Indexing (pSearch)
Another retrieval model used in P2P proposals is Latent Semantic Indexing (LSI) [374]. Its key idea is to map both the document and query vectors to a concept space with lower dimensions. The starting point is a t*N weighting matrix, where t is the total number of indexed terms, N is the total number of documents, and the matrix elements could be TFIDF rankings. Using singular value decomposition, this matrix is reduced to a smaller number of dimensions, while retaining the more significant term-to-document mappings. Baeza-Yates and Ribeiro-Neto suggested that LSI's value is a novel theoretic framework, but that its practical performance advantage for real document collections had yet to be proven [262]. pSearch incorporated LSI [9]. By placing the indices for semantically similar documents close in the network, Tang, Xu, et al. touted significant bandwidth savings relative to the early full- flooding variant of Gnutella [372]. They plotted the number of nodes visited by a query. They also explored the trade-off with accuracy, the percentage match between the documents returned by the distributed pSearch algorithm and those from a centralized LSI baseline. In a more recent update to the pSearch work, Tang, Dwarkadas, et al. summarized LSI's shortcomings [375]. Firstly, for large document collections, its retrieval quality is inherently inferior to Okapi. Secondly, singular value decomposition consumes excessive memory and computation time. Consequently, the authors used Okapi for searching while retaining LSI for indexing. With Okapi, they selected the next node to be searched and selected documents on searched nodes. With LSI, they ensured that similar documents are clustered near each other, thereby optimizing the network search costs. When retrieving a small number of top documents, the precision of LSI+Okapi approached that of Okapi. However, if retrieving a large number of documents, the LSI+Okapi precision is inferior. The authors want to improve this in future work.4.2.3. Small Worlds
The "small world" concept originally described how people are interconnected by short chains of acquaintances [376]. Kleinberg was struck by the algorithmic lesson of the small world, namely "that individuals using local information are collectively very effective at constructing short paths between two points in a social network" [377]. Small world networks have a small diameter and a large clustering coefficient (a large number of connections amongst relevant nodes) [378].
The small world idea has had a limited impact on peer-to-peer algorithms. It has influenced only a few unstructured [62, 378-380] and structured [344, 381] algorithms. The most promising work on "small worlds" in P2P networks are those concerned with the information retrieval metrics, precision and recall [62, 378, 380].5. Queries
Database research suggests directions for P2P research. Hellerstein observed that, while work on fast P2P indexes is well underway, P2P query optimization remains a promising topic for future research [23]. Kossman reviewed the state of the art of distributed query processing, highlighting areas for future research: simulation and query optimization for networks of tens of thousands of servers and millions of clients; non-relational data types (e.g., XML, text, and images); and partial query responses since on the Internet, "failure is the rule rather than the exception" [19]. A primary motivation for the P2P system, PIER, was to scale from the largest database systems of a few hundred nodes to an Internet environment in which there are over 160 million nodes [22]. Litwin and Sahri have also considered ways to combine distributed hashing, more specifically the Scalable Distributed Data Structures, with SQL databases, claiming to be first to implement scalable distributed database partitioning [382]. Motivated by the lack of transparent distribution in current distributed databases, they measure query execution times for Microsoft SQL servers aggregated by means of an SDDS layer. One of their starting assumptions was that it is too challenging to change the SQL query optimizer. Database research also suggests the approach to P2P research. Researchers of database query optimization were divided between those looking for optimal solutions in special cases and those using heuristics to answer all queries [383]. Gribble, et al. cast query optimization in terms of the data placement problem, which is to "distribute data and work so the full query workload is answered with lowest cost under the existing bandwidth and resource constraints" [250]. They pointed out that even the static version of this problem is NP-complete in P2P networks. Consequently, research on massive, dynamic P2P networks will likely progress using both strategies of early database research - heuristics and special-case optimizations. If P2P networks are going to be adaptable, if they are to support a wide range of applications, then they need to accommodate many query types [72]. Up to this point, we have reviewed queries for keys (Section 3) and keywords (Sections 4.1. and 4.2). Unfortunately, a major shortcoming of the DHTs in Section 3.5 is that they primarily support exact-match, single-key queries. Skip Graphs support range and prefix queries, but not aggregation queries. Here we probe below
the language syntax to identify the open research issues associated with more expressive P2P queries [25]. Triantafillou and Pitoura observed the disparate P2P designs for different types of queries and so outlined a unifying framework [76]. To classify queries, they considered the number of relations (single or multiple), the number of attributes (single or multiple), and the type of query operator. They described numerous operators: equality, range, join, and "special functions". The latter referred to aggregation (like sum, count, average, minimum, and maximum), grouping and ordering. The following sections approximately fit their taxonomy -- range queries, multi-attribute queries, join queries and aggregation queries. There has been some initial P2P work on other query types -- continuous queries [20, 22, 73], recursive queries [22, 74], and adaptive queries [23, 75]. For these, we defer to the primary references.5.1. Range Queries
The support of efficient range predicates in P2P networks was identified as an important open research issue by Huebsch, et al. [22]. Range partitioning has been important in parallel databases to improve performance, so that a transaction commonly needs data from only one disk or node [22]. One type of range search, longest prefix match, is important because of its prevalence in routing schemes for voice and data networks alike. In other applications, users may pose broad, inexact queries, even though they require only a small number of responses. Consequently, techniques to locate similar ranges are also important [77]. Various proposals for range searches over P2P networks are summarized in Figure 4. Since the Scalable Distributed Data Structure (SDDS) has been an important influence on contemporary Distributed Hash Tables (DHTs) [49-51], we also include ongoing work on SDDS range searches. PEER-TO-PEER (P2P) Locality Sensitive Hashing (Chord) [77] Prefix Hash Trees (unspecified DHT) [78, 79] Space Filling Curves (CAN) [80] Space Filling Curves (Chord) [81] Quadtrees (Chord) [82] Skip Graphs [38, 41, 83, 100] Mercury [84] P-Grid [85, 86] SCALABLE DISTRIBUTED DATA STRUCTURES (SDDS) RP* [87, 88] Figure 4: Solutions for Range Queries on P2P and SDDS Indexes
The papers on P2P range search can be divided into those that rely on an underlying DHT (the first five entries in Figure 4) and those that do not (the subsequent three entries). Bharambe, Agrawal, et al. argued that DHTs are inherently ill-suited to range queries [84]. The very feature that makes for their good load balancing properties, randomized hash functions, works against range queries. One possible solution would be to hash ranges, but this can require a priori partitioning. If the partitions are too large, partitions risk overload. If they are too small, there may be too many hops. Despite these potential shortcomings, there have been several range query proposals based on DHTs. If hashing ranges to nodes, it is entirely possible that overlapping ranges map to different nodes. Gupta, Agrawal, et al. rely on locality sensitive hashing to ensure that, with high probability, similar ranges are mapped to the same node [77]. They propose one particular family of locality sensitive hash functions, called min-wise independent permutations. The number of partitions per node and the path length were plotted against the total numbers of peers in the system. For a network with 1000 nodes, the hop count distribution was very similar to that of the exact- matching Chord scheme. Was it load-balanced? For the same network with 50,000 partitions, there were over two orders of magnitude variation in the number of partitions at each node (first and ninety-ninth percentiles). The Prefix Hash Tree is a trie in which prefixes are hashed onto any DHT. The preliminary analysis suggests efficient doubly logarithmic lookup, balanced load, and fault resilience [78, 79]. Andrzejak and Xu were perhaps the first to propose a mapping from ranges to DHTs [80]. They use one particular Space Filling Curve, the Hilbert curve, over a Content Addressable Network (CAN) construction (Section 3.5.3). They maintain two properties: nearby ranges map to nearby CAN zones; if a range is split into two sub-ranges, then the zones of the sub-ranges partition the zone of the primary range. They plot path length and load proxy measures (the total number of messages and nodes visited) for three algorithms to propagate range queries: brute force, controlled flooding, and directed controlled flooding. Schmidt and Parashar also advocated Space Filling Curves to achieve range queries over a DHT [81]. However, they point out that, while Andrzejak and Xu use an inverse Space Filling Curve to map a one-dimensional space to d- dimensional zones, they map a d-dimensional space back to a one- dimensional index. Such a construction gives the ability to search across multiple attributes (Section 5.2). Tanin, Harwood, et al. suggested quadtrees over Chord [82], and gave preliminary simulation results for query response times. Because DHTs are naturally constrained to exact-match, single-key queries, researchers have considered other P2P indexes for range searches. Several were based on Skip Graphs [38, 41], which, unlike
the DHTs, do not necessitate randomizing hash functions and are therefore capable of range searches. Unfortunately, they are not load balanced [83]. For example, in SkipNet [48], hashing was added to balance the load -- the Skip Graph could support range searches or load balancing, but not both. One solution for load-balancing relies on an increased number of 'virtual' servers [168] but, in their search for a system that can both search for ranges and balance loads, Bharambe, Agrawal, et al. rejected the idea [84]. The virtual servers work assumed load imbalance stems from hashing; that is, by skewed data insertions and deletions. In some situations, the imbalance is triggered by a skewed query load. In such circumstances, additional virtual servers can increase the number of routing hops and increase the number of pointers that a Skip Graph needs to maintain. Ganesan, Bawa, et al. devised an alternate method to balance load [83]. They proposed two Skip Graphs, one to index the data itself and the other to track load at each node in the system. Each node is able to determine the load on its neighbours and the most (least) loaded nodes in the system. They devise two algorithms: NBRADJUST balances load on neighbouring nodes; using REORDER, empty nodes can take over some of the tuples on heavily loaded nodes. Their simulations focus on skewed storage load, rather than on skewed query loads, but they surmise that the same approach could be used for the latter. Other proposals for range queries avoid both the DHT and the Skip Graph. Bharambe, Agrawal, et al. distinguish their Mercury design by its support for multi-attribute range queries and its explicit load balancing [84]. In Mercury, nodes are grouped into routing hubs, each of which is responsible for various query attributes. While it does not use hashing, Mercury is loosely similar to the DHT approaches: nodes within hubs are arranged into rings, like Chord [34]; for efficient routing within hubs, k long-distance links are used, like Symphony [381]. Range lookups require O(((log n)^2)/k) hops. Random sampling is used to estimate the average load on nodes and to find the parts of the overlay that are lightly loaded. Whereas Symphony assumed that nodes are responsible for ranges of approximately equal size, Mercury's random sampling can determine the location of the start of the range, even for non-uniform ranges [84]. P-Grid [42] does provide for range queries, by virtue of the key ordering in its tree structures. Ganesan, Bawa, et al. critiqued its capabilities [83]: P-Grid assumes fixed-capacity nodes; there was no formal characterization of imbalance ratios or balancing costs; every P-Grid periodically contacts other nodes for load information. The work on Scalable Distributed Data Structures (SDDSs) has progressed in parallel with P2P work and has addressed range queries. Like the DHTs above, the early SDDS Linear Hashing (LH*) schemes were not order-preserving [52]. To facilitate range queries, Litwin,
Niemat, et al. devised a Range Parititioning variant, RP* [87]. There are options to dispense with the index, to add indexes to clients, and to add them to servers. In the variant without an index, every query is issued via multicasting. The other variants also use some multicasting. The initial RP* paper suggested scalability to thousands of sites, but a more recent RP* simulation was capped at 140 servers [88]. In that work, Tsangou, Ndiaye, et al. investigated TCP and UDP mechanisms by which servers could return range query results to clients. The primary metrics were search and response times. Amongst the commercial parallel database management systems, they reported that the largest seems only to scale to 32 servers (SQL Server 2000). For future work, they planned to explore aggregation of query results, rather than establishing a connection between the client and every single server with a response. All in all, it seems there are numerous open research questions on P2P range queries. How realistic is the maintenance of global load statistics considering the scale and dynamism of P2P networks? Simulations at larger scales are required. Proposals should take into account both the storage load (insert and delete messages) and the query load (lookup messages). Simplifying assumptions need to be attacked. For example, how well do the above solutions work in networks with heterogeneous nodes, where the maximum message loads and index sizes are node-dependent?5.2. Multi-Attribute Queries
There has been some work on multi-attribute P2P queries. As late as September 2003, it was suggested that there has not been an efficient solution [76]. Again, an early significant work on multi-attribute queries over aggregated commodity nodes germinated amongst SDDSs. k-RP* [89] uses the multi-dimensional binary search tree (or k-d tree, where k indicates the number of dimensions of the search index) [384]. It builds on the RP* work from the previous section and inherits their capabilities for range search and partial match. Like the other SDDSs, k-RP* indexes can fit into RAM for very fast lookup. For future work, Litwin and Neimat suggested a) a formal analysis of the range search termination algorithm and the k-d paging algorithm, b) a comparison with other multi-attribute data structures (quad-trees and R-trees) and c) exploration of query processing, concurrency control, and transaction management for k-RP* files [89]. On the latter point, others have considered transactions to be inconsequential to the core problem of supporting more complex queries in P2P networks [72].
In architecting their secure wide-area Service Discovery Service (SDS), Hodes, Czerwinski, et al. considered three possible designs for multi-criteria search -- Centralization, Mapping and Flooding [90]. These correlate to the index classifications of Section 2 -- Central, Distributed, and Local. They discounted the centralized, Napster-like index for its risk of a single point of failure. They considered the hash-based mappings of Section 3, but concluded that it would not be possible to adequately partition data. A document satisfying many criteria would be wastefully stored in many partitions. They rejected full flooding for its lack of scalability. Instead, they devised a query filtering technique, reminiscent of Gnutella's query routing protocol (Section 4.1). Nodes push proactive summaries of their data rather than waiting for a query. Summaries are aggregated and stored throughout a server hierarchy, to guide subsequent queries. Some initial prototype measurements were provided for total load on the system, but not for load distribution. They put several issues forward for future work. The indexing needs to be flexible to change according to query and storage workloads. A mesh topology might improve on their hierarchic topology since query misses would not propagate to root servers. The choice is analogous to BGP meshes and DNS trees. More recently, Cai, Frank, et al. devised the Multi-Attribute Addressable Network (MAAN) [91]. They built on Chord to provide both multi-attribute and range queries, claiming to be the first to service both query types in a structured P2P system. Each MAAN node has O(log n) neighbours, where N is the number of nodes. MAAN multi-attribute range queries require O(log n+N*Smin) hops, where Smin is the minimum range selectivity across all attributes. Selectivity is the ratio of the query range to the entire identifier range. The paper assumed that a locality preserving hash function would ensure balanced load. Per Section 5.1, the arguments by Bharambe, Agrawal, et al. have highlighted the shortcomings of this assumption [84]. MAAN required that the schema must be fixed and known in advance -- adaptable schemas were recommended for subsequent attention. The authors also acknowledged that there is a selectivity breakpoint at which full flooding becomes more efficient than their scheme. This begs for a query resolution algorithm that adapts to the profile of queries. Cai and Frank followed up with RDFPeers [55]. They differentiate their work from other RDF proposals by a) guaranteeing to find query results if they exist and b) removing the requirement of prior definition of a fixed schema. They hashed <subject, predicate, object> triples onto the MAAN and reported routing hop metrics for their implementation. Load imbalance across nodes was reduced to less than one order of magnitude, but the specific measure was the number of triples stored per node - skewed query loads were not considered. They plan to improve load balancing with the virtual servers of Section 5.1 [168].
5.3. Join Queries
Two research teams have done some initial work on P2P join operations. Harren, Hellerstein, et al. initially described a three-layer architecture -- storage, DHT and query processing. They implemented the join operation by modifying an existing Content Addressable Network (CAN) simulator, reporting "significant hot-spots in all dimensions: storage, processing, and routing" [72]. They progressed their design more recently in the context of PIER, a distributed query engine based on CAN [22, 385]. They implemented two equi-join algorithms. In their design, a key is constructed from the "namespace" and the "resource ID". There is a namespace for each relation and the resource ID is the primary key for base tuples in that relation. Queries are multicast to all nodes in the two namespaces (relations) to be joined. Their first algorithm is a DHT version of the symmetric hash join. Each node in the two namespaces finds the relevant tuples and hashes them to a new query namespace. The resource ID in the new namespace is the concatenation of join attributes. In the second algorithm, called "fetch matches", one of the relations is already hashed on the join attributes. Each node in the second namespace finds tuples matching the query and retrieves the corresponding tuples from the first relation. They leveraged two other techniques, namely the symmetric semi-join rewrite and the Bloom filter rewrite, to reduce the high bandwidth overheads of the symmetric hash join. For an overlay of 10,000 nodes, they simulated the delay to retrieve tuples and the aggregate network bandwidth for these four schemes. The initial prototype was on a cluster of 64 PCs, but it has more recently been expanded to PlanetLab. Triantafillou and Pitoura considered multicasting to large numbers of peers to be inefficient [76]. They therefore allocated a limited number of special peers, called range guards. The domain of the join attributes was divided, one partition per range guard. Join queries were sent only to range guards, where the query was executed. Efficient selection of range guards and a quantitive evaluation of their proposal were left for future work.5.4. Aggregation Queries
Aggregation queries invariable rely on tree-structures to combine results from a large number of nodes. Examples of aggregation queries are Count, Sum, Maximum, Minimum, Average, Median, and Top-K [92, 386, 387]. Figure 5 summarizes the tree and query characteristics that affect dependability.
Tree type: Doesn't use DHT [92], use internal DHT trees [95], use independent trees on top of DHTs Tree repair: Periodic [93], exceptional [32] Tree count: One per key, one per overlay [56] Tree flexibility: Static [92], dynamic Query interface: install, update, probe [98] Query distribution: multicast [98], gossip [92] Query applications: leader election, voting, resource location, object placement and error recovery [98, 388] Query semantics Consistency: Best-effort, eventual [92], snapshot / interval / single-site validity [99] Timeliness [388] Lifetime: Continuous [97, 99], single-shot No. attributes: Single, multiple Query types: Count, sum, maximum, minimum, average, median, top k [92, 386, 387] Figure 5: Aggregation Trees and Queries in P2P Networks Key: Astrolabe [92]; Cone [93]; Distributed Approximative System Information Service (DASIS) [95]; Scalable Distributed Information Management System (SDIMS) [98]; Self-Organized Metadata Overlay (SOMO) [56]; Wildfire [99]; Willow [32]; Newscast [97] The fundamental design choices for aggregation trees relate to how the overlay uses DHTs, how it repairs itself when there are failures, how many aggregation trees there are, and whether the tree is static or dynamic (Figure 5). Astrolabe is one of the most influential P2P designs included in Figure 5, yet it makes no use of DHTs [92]. Other designs make use of the internal trees of Plaxton-like DHTs. Others build independent tree structures on top of DHTs. Most of the designs repair the aggregation tree with periodic mechanisms similar to those used in the DHTs themselves. Willow is an exception [32]. It uses a Tree Maintenance Protocol to "zip" disjoint aggregation trees together when there are major failures. Yalagandula and Dahlin found reconfigurations at the aggregation layer to be costly, suggesting more research on techniques to reduce the cost and frequency of such reconfigurations [98]. Many of the designs use multiple aggregation trees, each rooted at the DHT node responsible for the aggregation attribute. On the other hand, the Self-Organized Metadata Overlay [56] uses a single tree and is vulnerable to a single point of failure at its root.
At the time of writing, researchers have just begun exploring the performance of queries in the presence of churn. Most designs are for best-effort queries. Bawa, et al. devised a better consistency model, called Single-Site Validity [99] to qualify the accuracy of results when there is churn. Its price was a five-fold increase in the message load, when compared to an efficient but best-effort Spanning Tree. Gossip mechanisms are resilient to churn, but they delay aggregation results and incur high message cost for aggregation attributes with small read-to-write ratios.6. Security Considerations
An initial list of references to research on P2P security is given in Figure 1, Section 1. This document addresses P2P search. P2P storage, security, and applications are recommended for further investigation in Section 8.7. Conclusions
Research on peer-to-peer networks can be divided into four categories -- search, storage, security and applications. This critical survey has focused on search methods. While P2P networks have been classified by the existence of an index (structured or unstructured) or the location of the index (local, centralized, and distributed), this survey has shown that most have evolved to have some structure, whether it is indexes at superpeers or indexes defined by DHT algorithms. As for location, the distributed index is most common. The survey has characterized indexes as semantic and semantic-free. It has also critiqued P2P work on major query types. While much of it addresses work from 2000 or later, we have traced important building blocks from the 1990s. The initial motivation in this survey was to answer the question, "How robust are P2P search networks?" The question is key to the deployment of P2P technology. Balakrishnan, Kaashoek, et al. argued that the P2P architecture is appealing: the startup and growth barriers are low; they can aggregate enormous storage and processing resources; "the decentralized and distributed nature of P2P systems gives them the potential to be robust to faults or intentional attacks" [18]. If P2P is to be a disruptive technology in applications other than casual file sharing, then robustness needs to be practically verified [20]. The best comparative research on P2P dependability has been done in the context of Distributed Hash Tables (DHTs) [291]. The entire body of DHT research can be distilled to four main observations about dependability (Section 3.2). Firstly, static dependability comparisons show that no O(log n) DHT geometry is significantly more
dependable than the other O(log n) geometries. Secondly, dynamic dependability comparisons show that DHT dependability is sensitive to the underlying topology maintenance algorithms (Figure 2). Thirdly, most DHTs use O(log n) geometries to suit ephemeral nodes, whereas the O(1) hop DHTs suit stable nodes - they deserve more research attention. Fourthly, although not yet a mature science, the study of DHT dependability is helped by recent simulation tools that support multiple DHTs [299]. We make the following four suggestions for future P2P research: 1) Complete the companion P2P surveys for storage, security, and applications. A rough outline has been suggested in Figure 1, along with references. The need for such surveys was highlighted within the peer-to-peer research group of the Internet Research Task Force (IRTF) [17]. 2) P2P indexes are maturing. P2P queries are embryonic. Work on more expressive queries over P2P indexes started to gain momentum in 2003, but remains fraught with efficiency and load issues. 3) Isolate the low-level mechanisms affecting robustness. There is limited value in comparing robustness of DHT geometries (like rings versus de Bruijn graphs), when robustness is highly sensitive to underlying topology maintenance algorithms (Figure 2). 4) Build consensus on robustness metrics and their acceptable ranges. This paper has teased out numerous measures that impinge on robustness, for example, the median query path length for a failure of x% of nodes, bisection width, path overlap, the number of alternatives available for the next hop, lookup latency, average live bandwidth (bytes/node/sec), successful routing rates, the number of timeouts (caused by a finger pointing to a departed node), lookup failure rates (caused by nodes that temporarily point to the wrong successor during churn), and clustering measures (edge expansion and node expansion). Application-level robustness metrics need to drive a consistent assessment of the underlying search mechanics.8. Acknowledgments
This document was adapted from a paper in Elsevier's Computer Networks: J. Risson & T. Moors, Survey of Research towards Robust Peer-to- Peer Networks: Search Methods, Computer Networks 51(7)2007.
We thank Bill Yeager, Ali Ghodsi, and several anonymous reviewers for thorough comments that significantly improved the quality of earlier versions of this document.