Infoseek's approach to distributed search
by Steve Kirsch, Infoseek Corporation,
[email protected]
One of the
papers presented at the Distributed
Indexing/Searching Workshop sponsored by W3C.
Abstract
This paper describes some of the outstanding issues
for searching the Internet and also how Infoseek is
approaching the problem of distributed search and
retrieval on the Internet and on intranets. Our goal is to
enable someone to do a comprehensive and accurate
world-wide distributed search in less than 1 second.
Issues related to searching a centralized index of
distributed data
WWW master site list
The Comprehensive
List of Sites was not available at the time this
paper was written (May 15). We need a reliable and
complete list of all WWW sites that robots can retrieve.
The list should also be searchable by people using a
fielded search and include basic contact information.
Infoseek would be happy to host such a list as a
public service.
Additional robots files needed
In order to minimize net traffic caused by robots and
increase the currency of data indexed, we propose that
each WWW site run a script that will create a file of
URLs along with their size and last modification date,
sorted most recent first. The file would begin with a
magic number and have a 64-bit unique ID (generated from
the time, hostname, IP address, etc.). This allows web
crawlers to efficiently download changes from a site with
minimal overhead on the site, as well as detect hostname
aliases. Infoseek is working
with Excite to finalize such a standard, develop the
scripts, and make them available to the community at no
charge. The sitelist.txt
specification is available for comments.
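To make the proposal concrete, here is a minimal sketch of such a site script. The exact sitelist.txt format is still being finalized with Excite, so the magic number, field layout, and ID derivation below are illustrative assumptions rather than the published specification.

    import hashlib
    import os
    import socket
    import time

    MAGIC = "SITELIST/0.1"   # assumed magic number; not the finalized value

    def site_id():
        # Derive a 64-bit unique ID from the time, hostname, and IP address.
        host = socket.gethostname()
        try:
            ip = socket.gethostbyname(host)
        except socket.error:
            ip = "0.0.0.0"
        seed = "%s %s %.6f" % (host, ip, time.time())
        return int(hashlib.sha1(seed.encode()).hexdigest()[:16], 16)

    def write_sitelist(doc_root, base_url, out_path="sitelist.txt"):
        # List every document with its size and last-modification date,
        # sorted most recent first, so a crawler can fetch only what changed.
        entries = []
        for dirpath, _, filenames in os.walk(doc_root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
                st = os.stat(path)
                entries.append((st.st_mtime, st.st_size, base_url + "/" + rel))
        entries.sort(reverse=True)
        with open(out_path, "w") as out:
            out.write("%s %016x\n" % (MAGIC, site_id()))
            for mtime, size, url in entries:
                stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(mtime))
                out.write("%s %d %s\n" % (stamp, size, url))

    if __name__ == "__main__":
        write_sitelist("/var/www/html", "http://www.example.com")

Because the entries are sorted most recent first, a crawler that remembers its last visit can stop reading as soon as it sees an older date, and the fixed 64-bit ID in the header lets it recognize the same site served under several hostnames.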
Issues related to searching decentralized indices
("true distributed search")
Distributed search
Existing search engines on the Internet
use the technique of collecting all information into a
centralized index. Even with the two changes proposed
above, this technique will soon break down as the number
of sites and the amount of data increase dramatically.
Today it costs $10M to keep an out-of-date index of 50M
Web pages. Tomorrow, with the increased traffic and size
of the WWW, it will cost upwards of $100M. A better,
though more sophisticated, approach is to perform true
distributed searching directly from the user's computer.
If implemented properly, this has several advantages
over the existing approach:
- SIZE: it scales easily to over a
trillion documents, and it adds the capability
of searching documents not directly on the WWW,
such as those within databases.
- ACCURACY: it allows a user more
opportunity to target his query (e.g., search in
only "english" databases about
"AIDS"), i.e., a more controlled search
(broad or narrow, depending on the needs of the
user). Also, since each participating search
engine can be tuned for the type of data under
management, individual search results are also
better. For example, a search engine that manages
legal documents may have a thesaurus of specialized
legal terms that can be used to provide better
relevance ranking.
- SPEED: it can give sub-second
response time
- RELIABILITY: it is immune to
system failure (one or more web sites going down)
- CURRENCY: a completely
up-to-the-minute search since you are always
searching the latest data
In order to accomplish this, the model is to utilize a
Java search client sitting on the user's desktop that
first runs the user's query (or a special "establish
your subject area" query) against a database
containing the "essence" of each participating
database, i.e., a "table of contents" database,
often called a "meta-index." This database is
normally replicated for reliability. Since this database
is relatively small (compared to the actual data under
management), this is very feasible. From this search, a
list of the best candidate search servers is determined.
The query is then passed in parallel to each engine, and
the query results are assembled and re-ranked to provide
a consistent ranking.
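As a rough sketch of this client-side flow (the select_collections() and per-engine search() calls below are hypothetical placeholders, not an actual Ultraseek API):

    from concurrent.futures import ThreadPoolExecutor

    def distributed_search(query, meta_index, engines, max_collections=5):
        # 1. Run the query against the meta-index ("table of contents")
        #    to pick the best candidate collections.
        candidates = meta_index.select_collections(query, limit=max_collections)

        # 2. Pass the query to each candidate search engine in parallel.
        with ThreadPoolExecutor(max_workers=max(1, len(candidates))) as pool:
            futures = [pool.submit(engines[name].search, query)
                       for name in candidates]
            partial_results = [f.result() for f in futures]

        # 3. Assemble all hits and re-rank them with one consistent scoring
        #    function so results from different engines are comparable.
        merged = [hit for hits in partial_results for hit in hits]
        merged.sort(key=lambda hit: rescore(query, hit), reverse=True)
        return merged

    def rescore(query, hit):
        # Placeholder for the consistent client-side ranking discussed
        # under "collection fusion" below.
        return hit["score"]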
So there are two principal problems to be solved here:
- identifying the collection(s) to run the query
against (the "collection
identification" problem)
- merging search results from heterogeneous
collections (the "collection fusion"
problem)
Collection identification
Infoseek's new full-text indexing software (Ultraseek) creates a
sophisticated fingerprint file during the indexing
process. This fingerprint file can be adjusted by the
user to contain every word and multi-word phrase
from the original corpus as well as a score for each word
and phrase. The user can set a threshold of significance
for more concise output. Similarly, a requestor of the
fingerprint file could set a similar threshold, but this
would require a more sophisticated interface than FTP
(HTTP could be used). Ultraseek is capable of running a
user's query against a meta-index of fingerprint files to
accurately determine a rank-ordered list of the best
collections to run the query against. No manual indexing
is required for each collection. This means that a user
can, for example, run the query "satellites" against
collections about "space exploration" and thus
not get articles about TV satellites. Or a user could run
the query "Michael Jordan" against databases
primarily about "Westinghouse Electric" to avoid
articles about the basketball player by the same name.
Infoseek's approach is unique and patent-pending since
the fingerprint file (a.k.a. meta-index) contains phrase
information, and not just individual words. This results
in far better accuracy in determining the correct
collections to search.
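A simplified sketch of how fingerprint files might drive collection identification is shown below; the structure assumed here (a mapping from words and multi-word phrases to significance scores) follows the description above and is not the actual Ultraseek fingerprint format.

    def fingerprint_score(query_terms, fingerprint, threshold=0.0):
        # Sum the significance scores of the query words and phrases that
        # appear in a collection's fingerprint, ignoring entries below the
        # requestor's threshold.
        score = 0.0
        for term in query_terms:
            weight = fingerprint.get(term, 0.0)
            if weight >= threshold:
                score += weight
        return score

    def rank_collections(query_terms, fingerprints, limit=5):
        # Return a rank-ordered list of the best collections for the query.
        ranked = sorted(fingerprints.items(),
                        key=lambda item: fingerprint_score(query_terms, item[1]),
                        reverse=True)
        return [name for name, _ in ranked[:limit]]

    # Phrase entries let collections about "space exploration" outrank
    # collections about television for the query "satellites".
    fingerprints = {
        "space-exploration": {"satellites": 3.1, "space exploration": 4.7},
        "television":        {"satellites": 1.2, "satellite tv": 3.9},
    }
    print(rank_collections(["satellites", "space exploration"], fingerprints))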
Fusion of search results from
heterogeneous servers
Ultraseek merges query results from
distributed collections in a unique way. We allow each
search engine to handle the query using the most
appropriate scoring algorithms and local statistics
(rather than the more traditional approach of using
global statistics, which can skew the search results
in undesirable ways). The resulting
DocIDs are returned to the user, along with a few
fundamental statistics about each of the top-ranked
documents. This patent-pending technique allows the
documents to be precisely re-scored at the user's
workstation using a consistent scoring algorithm. It is
very efficient (an IDF collection pass is not required),
heterogeneous search engines are supported (e.g., Verity
and PLS), and most importantly, a document's score is
completely independent of the collection statistics and
search engine used. Our technique is the basis for the standards
work being done at Stanford in this area (STARTS). We
are working on implementing this interface now. A number
of commercial search vendors have plans to adopt this
format when it is available. Infoseek, Microsoft,
Fulcrum, Excite, and various US government agencies are
participating in the effort.
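As an illustration of this kind of client-side fusion, the sketch below re-scores merged hits using only a few per-document statistics (term frequency and document length). These statistics and the formula are assumptions for illustration; the actual statistics Ultraseek exchanges and its scoring function are not specified here.

    import math

    def rescore(query_terms, doc_stats):
        # Consistent client-side score computed from per-document statistics
        # only, so it is independent of any collection's global statistics
        # and of the engine that produced the hit.
        score = 0.0
        for term in query_terms:
            tf = doc_stats["term_freq"].get(term, 0)
            if tf:
                score += (1.0 + math.log(tf)) / math.log(2.0 + doc_stats["length"])
        return score

    def fuse_results(query_terms, per_engine_hits):
        # Merge the top-ranked hits from several engines and re-rank them
        # all with the same scoring function.
        merged = [hit for hits in per_engine_hits for hit in hits]
        merged.sort(key=lambda hit: rescore(query_terms, hit["stats"]),
                    reverse=True)
        return [(hit["doc_id"], rescore(query_terms, hit["stats"]))
                for hit in merged]

    # Hits from two heterogeneous engines, each carrying the few fundamental
    # statistics needed to re-score on the user's workstation.
    hits_a = [{"doc_id": "legal/123",
               "stats": {"term_freq": {"patent": 4}, "length": 900}}]
    hits_b = [{"doc_id": "news/77",
               "stats": {"term_freq": {"patent": 2}, "length": 300}}]
    print(fuse_results(["patent"], [hits_a, hits_b]))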