Infoseek's approach to distributed search

by Steve Kirsch, Infoseek Corporation, stk@infoseek.com

One of the papers presented at the Distributed Indexing/Searching Workshop sponsored by W3C.


Abstract This paper describes some of the outstanding issues in searching the Internet, and how Infoseek is approaching the problem of distributed search and retrieval on the Internet and on intranets. Our goal is to enable someone to perform a comprehensive and accurate world-wide distributed search in less than one second.
 

Issues related to searching a centralized index of distributed data

WWW master site list The Comprehensive List of Sites was not available at the time this paper was written (May 15). We need a reliable and complete list of all WWW sites that robots can retrieve. The list should also be searchable by people using a fielded search and include basic contact information. Infoseek would be happy to host such a service as a public service.
Additional robots files needed In order to minimize net traffic caused by robots and increase the currency of indexed data, we propose that each WWW site run a script that creates a file of URLs along with their sizes and last modification dates, sorted most recent first. The file would begin with a magic number and contain a 64-bit unique ID (generated from the time, hostname, IP address, etc.). This allows web crawlers to efficiently download changes from a site with minimal overhead on the site, as well as detect hostname aliases. Infoseek is working with Excite to finalize such a standard, develop the scripts, and make them available to the community at no charge. The sitelist.txt specification is available for comments.
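
A generator for such a file can be sketched in a few lines. The exact format details below (the magic line, the ID derivation, the tab-separated fields) are illustrative assumptions, not the final sitelist.txt specification:

```python
import hashlib
import os
import time

# Hypothetical magic line; the real specification may define this differently.
MAGIC = "# sitelist.txt v0"

def build_sitelist(doc_root, base_url, hostname):
    # 64-bit unique ID generated from the hostname and current time,
    # as the proposal suggests (the derivation here is an assumption).
    raw = f"{hostname}-{time.time()}".encode()
    site_id = int.from_bytes(hashlib.sha1(raw).digest()[:8], "big")

    entries = []
    for dirpath, _, files in os.walk(doc_root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
            entries.append((st.st_mtime, st.st_size, f"{base_url}/{rel}"))

    # Most recent first, so a crawler can stop reading at the first entry
    # older than its previous visit.
    entries.sort(reverse=True)

    lines = [MAGIC, f"id: {site_id:016x}"]
    for mtime, size, url in entries:
        lines.append(f"{url}\t{size}\t{int(mtime)}")
    return "\n".join(lines) + "\n"
```

The most-recent-first ordering is what keeps the crawler's overhead minimal: on each revisit it reads only the prefix of the file that is newer than its last crawl.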
 

Issues related to searching decentralized indices ("true distributed search")

Distributed search Existing search engines on the Internet use the technique of collecting all information into a centralized index. Even with the two changes proposed above, this technique will soon break down as the number of sites and the amount of data increase dramatically. Today it costs $10M to keep an out-of-date index of 50M Web pages. Tomorrow, with the increased traffic and size of the WWW, it will cost upwards of $100M.

A better, but more sophisticated, approach is to perform true distributed searching directly from the user's computer. If implemented properly, this has several advantages over the existing approach:

  • SIZE: it scales easily to over a trillion documents. It also gains the capability of searching documents not directly on the WWW, such as documents stored within databases.
  • ACCURACY: it allows a user more opportunity to target his query (e.g., search only "English" databases about "AIDS"), i.e., a more controlled search (broad or narrow, depending on the needs of the user). Also, since each participating search engine can be tuned for the type of data under management, individual search results are also better. For example, when searching collections of legal documents, the search engine may have a thesaurus of specialized legal terms which can then be used to provide better relevance ranking.
  • SPEED: it can give sub-second response time
  • RELIABILITY: it is immune to system failure (one or more web sites going down)
  • CURRENCY: a completely up-to-the-minute search since you are always searching the latest data

In order to accomplish this, the model is to use a Java search client sitting on the user's desktop that first runs the user's query (or a special "establish your subject area" query) against a database containing the "essence" of each participating database, i.e., a "table of contents" database, often called a "meta-index." This database is normally replicated for reliability. Since this database is relatively small (compared to the actual data under management), this is very feasible. From this search, a list of the best candidate search servers is determined. The query is then passed in parallel to each engine, and the query results are assembled and re-ranked to provide a consistent ranking.
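
The client-side flow just described (select collections from the meta-index, fan the query out in parallel, merge and re-rank) can be sketched as follows. This is a toy model, not Infoseek's implementation: the meta-index is simplified to a per-server term-score table, and each server is stood in for by a callable returning (doc_id, score) pairs:

```python
from concurrent.futures import ThreadPoolExecutor

def distributed_search(query, meta_index, servers, top_k=2):
    # Step 1: select candidate collections by matching the query against
    # the meta-index (here, server name -> {term: score}).
    terms = query.lower().split()
    ranked = sorted(
        servers,
        key=lambda s: -sum(meta_index.get(s, {}).get(t, 0.0) for t in terms),
    )
    candidates = ranked[:top_k]

    # Step 2: pass the query to each candidate engine in parallel.
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda s: servers[s](query), candidates))

    # Step 3: assemble and re-rank on the client so the final ordering
    # is consistent across engines.
    merged = [hit for hits in result_lists for hit in hits]
    merged.sort(key=lambda hit: -hit[1])  # hit = (doc_id, score)
    return merged
```

Because only the small meta-index lookup happens before fan-out, and the per-engine queries run concurrently, end-to-end latency is roughly that of the slowest selected engine rather than the sum of all of them.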

So there are two principal problems to be solved here:

  1. identifying the collection(s) to run the query against (the "collection identification" problem)
  2. merging search results from heterogeneous collections (the "collection fusion" problem)
Collection identification Infoseek's new full text indexing software (Ultraseek) creates a sophisticated fingerprint file during the indexing process. This fingerprint file can be adjusted by the user to contain every word and multi-word phrase from the original corpus, along with a score for each word and phrase. The user can set a threshold of significance for more concise output. Similarly, a requestor of the fingerprint file could set a similar threshold, but this would require a more sophisticated interface than FTP (HTTP could be used). Ultraseek is capable of running a user's query against a meta-index of fingerprint files to accurately determine a rank-ordered list of the best collections to run the query against. No manual indexing is required for each collection. This means that a user can, say, run the query "satellites" against collections about "space exploration" and thus not get articles about TV satellites. Or a user could run the query "Michael Jordan" against databases primarily about "Westinghouse Electric" to avoid articles about the basketball player by the same name. Infoseek's approach is unique and patent-pending since the fingerprint file (a.k.a. meta-index) contains phrase information, not just individual words. This results in far better accuracy in determining the correct collections to search.
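
To make the idea concrete, here is a minimal sketch of a fingerprint file and of phrase-aware collection scoring. It is a deliberately crude stand-in for Ultraseek's fingerprinting (plain counts of words and adjacent-word phrases, with an assumed 2x weight for phrase matches); the real scoring is unspecified here:

```python
from collections import Counter

def build_fingerprint(docs, threshold=1):
    # Count single words and adjacent-word phrases (bigrams) across the
    # corpus; keep only entries at or above the significance threshold,
    # mirroring the threshold knob described above.
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        counts.update(words)
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return {term: n for term, n in counts.items() if n >= threshold}

def score_collection(query, fingerprint):
    # Phrase matches are weighted above word matches (the 2x factor is an
    # assumption), reflecting the claim that phrase information is what
    # makes collection selection accurate.
    words = query.lower().split()
    score = sum(fingerprint.get(w, 0) for w in words)
    score += 2 * sum(
        fingerprint.get(" ".join(pair), 0) for pair in zip(words, words[1:])
    )
    return score
```

With fingerprints built this way, a query like "space exploration" scores a collection that actually uses that phrase well above one that merely contains the word "satellites" in a TV context.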
Fusion of search results from heterogeneous servers Ultraseek performs query results merging from distributed collections in a unique way. We allow each search engine to handle the query using the most appropriate scoring algorithms and local statistics (rather than the more traditional approach of using global statistics, which can skew the search results in an undesirable way). The resulting DocIDs are returned to the user, along with a few fundamental statistics about each of the top-ranked documents. This patent-pending technique allows the documents to be precisely re-scored at the user's workstation using a consistent scoring algorithm. It is very efficient (an IDF collection pass is not required), heterogeneous search engines are supported (e.g., Verity and PLS), and most importantly, a document's score is completely independent of the collection statistics and search engine used. Our technique is the basis for the standards work being done at Stanford in this area (STARTS). We are working on implementing this interface now. A number of commercial search vendors have plans to adopt this format when it is available. Infoseek, Microsoft, Fulcrum, Excite, and various US government agencies are participating in the effort.
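
The client-side re-scoring step can be illustrated with a small sketch. The statistics each engine returns here (per-document term frequencies and document length) and the scoring formula (log-damped term frequency with length normalization) are illustrative assumptions, chosen only to show that a consistent score can be recomputed on the client without any collection-wide IDF pass:

```python
import math

def rescore(hits, query_terms):
    # hits: (doc_id, {term: tf}, doc_length) tuples, as returned by each
    # engine alongside the DocIDs -- a simplified stand-in for the
    # "fundamental statistics" described above.
    rescored = []
    for doc_id, tfs, doc_len in hits:
        # Consistent client-side score: log-damped term frequency,
        # normalized by document length. Because it uses only the
        # per-document statistics shipped with the hit, the score is
        # independent of which engine or collection produced it.
        score = sum(math.log(1 + tfs.get(t, 0)) for t in query_terms)
        score /= math.sqrt(doc_len)
        rescored.append((doc_id, score))
    rescored.sort(key=lambda x: -x[1])
    return rescored
```

Since every hit is re-scored by the same formula from the same kind of statistics, results from a Verity engine and a PLS engine land in one directly comparable ranking.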