I’m excited to be publishing this interview series with federated search luminary Erik Selberg. Erik’s contributions to federated search go back to 1994. You can read my preview of this series with Erik (and the list of questions) here. The majority of my questions to Erik involve his development of MetaCrawler, one of the earliest metasearch engines.
1. In 1996, you co-authored a visionary paper: The MetaCrawler Architecture for Resource Aggregation on the Web. The architecture you describe looks very much like the federated search engines of today, although I should note that your work was focused on metasearch engines, which federate search engines that crawl and index web sites, rather than federating content that lives in databases. What background did you have that motivated that paper and enabled you to write one of the first metasearch applications?
This was a follow-up to “Multi-Service Search and Comparison using the MetaCrawler,” which was presented at WWW4 in Dec. of 1995. The papers were the result of the research I was doing in meta-search. At the time, I was a graduate student in Computer Science and Engineering at the University of Washington, and MetaCrawler was initially my quals-level project (roughly equivalent to a Master’s level project) and eventually became my dissertation topic.
2. What inspired you and Oren Etzioni to create MetaCrawler in 1994?
Frustration that I had to visit Lycos, WebCrawler, Open Text, InfoSeek, Galaxy, AND Yahoo to find results I wanted. At the time, which was roughly 2 years into the Web’s growth, search engines were all still mostly graduate student projects, and thus were prone to all sorts of problems — such as indexes being out of date, incomplete, service being down, etc. MetaCrawler was a way to query all of them in parallel and get results, independent of whether one service or another was down or incomplete.
3. Whose idea was MetaCrawler?
Mine, fundamentally scratching my own itch.
4. What role did you and Etzioni each have in creating MetaCrawler?
I wrote all the code and did all the design; Oren Etzioni, my advisor, guided me in terms of working on the “interesting” problems from a research point of view. Fundamentally, a Web service that simply sends a query to a number of search engines and brings back results isn’t all that interesting for a researcher. That’s an engineering problem, and not a difficult one. But there are a number of questions that ARE interesting — such as how do you optimally collate results? How do you scale the service? Can you automatically parse search engines, or does someone have to write a wrapper around each and every engine? How do you simulate features, like phrase search, on engines that don’t have that? Oren pushed me to answer those questions.
5. How was MetaCrawler different from its predecessor, SavvySearch?
SavvySearch was developed in parallel, and while it was released about a month before MetaCrawler, I’m not sure it’s fair to call it a predecessor. Peer is a more apt term. Essentially, they were the same general concept. SavvySearch was focused mostly on query routing — e.g. having a large set of disparate services, such as Roget’s Thesaurus — and it would route each query to what it determined were the most useful search engines. In contrast, my work on MetaCrawler was more concerned with scale and with optimal collation of results from homogeneous sources.
6. As far as I can tell, MetaCrawler was the second metasearch engine (after SavvySearch). Am I right?
SavvySearch was released about a month before MetaCrawler, yes. I’m not aware of any that were released previously.
7. What would you say were the most important features of MetaCrawler?
At the time I wrote MetaCrawler, the #1 feature was that MetaCrawler scaled well. Effectively, all metasearch engines are big dispatchers – they take in a query, dispatch it to a number of downstream hosts, receive the responses, and process them. The long pole is in dispatching and waiting for the responses, so there’s a lot of parallel I/O going on. As far as I know, every meta-engine back in the day handled that parallelism using either multiple processes (e.g. they used Perl) or multiple threads (e.g. they used Java or C++). Even MetaCrawler began on threads. However, all machines have a process or thread limit – a DEC AlphaStation 500, a pretty big box in the day (and what we ran MetaCrawler on), had a limit of about 1000 threads, but for practical purposes only about 500. If each search connected to 10 engines, that meant only 50 simultaneous users could be active at any one time.
A fellow grad student asked me to implement real-time page checking, which would download result pages in real time to ensure they existed. That meant each query MetaCrawler handled could consume up to 100 threads, since it asked about 10 services for 10 results each. In order to scale, I moved MetaCrawler to a multiplexed I/O model, which is what people used for parallel I/O before multiple processes and threads came along. It meant implementing the HTTP stack natively rather than just making a call to fetch a URL, but the result was that as traffic to MetaCrawler increased, I was able to satisfy those requests. All, and I do mean all, of the other engines at the time were unable to handle the traffic, and suffered outage after outage.
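The dispatcher pattern Erik describes — one thread driving many sockets through an event loop instead of one thread per connection — can be sketched in modern Python with the standard-library `selectors` module. This is not MetaCrawler’s code (which spoke HTTP natively); the toy “search engine” servers, their names, and the wire format below are invented stand-ins to keep the example self-contained:

```python
import selectors
import socket
import threading

# Stand-in for a remote search engine: a one-shot TCP server that reads a
# query and replies with a canned result line. (Real engines spoke HTTP.)
def tiny_server(name, ports, ready):
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))       # let the OS pick a free port
    srv.listen()
    ports[name] = srv.getsockname()[1]
    ready.set()                      # signal that we are accepting
    conn, _ = srv.accept()
    conn.recv(1024)                  # read the query (contents ignored)
    conn.sendall(f"results from {name}\n".encode())
    conn.close()
    srv.close()

# The multiplexed dispatcher: ONE thread, N non-blocking sockets, a single
# select loop that writes each query when its socket becomes connectable
# and collects each response as it arrives.
def multiplexed_query(query, ports):
    sel = selectors.DefaultSelector()
    results = {}
    for name, port in ports.items():
        s = socket.socket()
        s.setblocking(False)
        s.connect_ex(("127.0.0.1", port))        # non-blocking connect
        sel.register(s, selectors.EVENT_WRITE, name)
    pending = len(ports)
    while pending:
        for key, events in sel.select(timeout=5):
            sock, name = key.fileobj, key.data
            if events & selectors.EVENT_WRITE:
                sock.sendall(query.encode())     # connected: send query
                sel.modify(sock, selectors.EVENT_READ, name)
            elif events & selectors.EVENT_READ:
                # Responses here are tiny, so one recv suffices; a real
                # client would buffer until the peer closes.
                results[name] = sock.recv(4096).decode().strip()
                sel.unregister(sock)
                sock.close()
                pending -= 1
    sel.close()
    return results

# Demo: query three fake engines concurrently from a single thread.
ports, threads = {}, []
for name in ("alpha", "beta", "gamma"):
    ready = threading.Event()
    t = threading.Thread(target=tiny_server, args=(name, ports, ready))
    t.start()
    threads.append(t)
    ready.wait()

print(multiplexed_query("q=metasearch\n", ports))
for t in threads:
    t.join()
```

The key property is the one Erik points to: the dispatcher’s concurrency is bounded by open file descriptors rather than by the machine’s thread limit, so 10 engines × 10 page checks per query no longer costs 100 threads per user.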
Check back next week for the second installment.