Today the interview with Erik Selberg continues. (You can read my preview of this series with Erik (and the list of questions) here.) In this installment we further discuss MetaCrawler and we look at it in the context of today’s federated search applications.
8. How long did it take you to build MetaCrawler?
Three to five months initially if I recall.
9. What were your biggest challenges in implementing MetaCrawler?
Dealing with the Web, and in particular understanding the HTTP protocol. I had an undergraduate degree from CMU, so I knew the fundamentals, but I really didn’t know much about network protocols and parallel I/O at the time, so a large amount of my time was spent just trying to make it all work.
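The parallel I/O problem he mentions, sending one query to several engines at once and gathering whatever comes back, is much easier with modern tooling. A minimal sketch in Python, where the stub "engines" are stand-ins for real HTTP calls (this is illustrative, not MetaCrawler's actual code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def query_engines(query, fetchers, timeout=10):
    """Send the same query to several engines in parallel and
    gather whatever results arrive before the timeout."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in fetchers.items()}
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                results[name] = []  # a slow or broken engine simply drops out
    return results

# Hypothetical stand-ins for real HTTP fetches:
fetchers = {
    "EngineA": lambda q: [f"A result for {q}"],
    "EngineB": lambda q: [f"B result for {q}"],
}
print(query_engines("metasearch", fetchers))
```

The key design point, then as now, is that a single slow engine must not stall the whole query, hence the per-future timeout and the habit of treating a failed engine as an empty result set.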
10. What was the business model for MetaCrawler when you first developed it?
There wasn’t one. This was at a time when it wasn’t even clear if there were business models for search engines, and people were reacting rather negatively towards people who would put up ads on their site. At the time, this was just a grad student project.
11. Wikipedia makes this interesting statement about MetaCrawler that I bet few people realize:
“Originally, MetaCrawler was created in order to provide a reliable abstraction layer to early Web search engines such as WebCrawler, Lycos, and InfoSeek in order to study semantic structure on the Web.”
Can you say more about the abstraction layer and about your study of the semantic structure on the web?
As I said above, a dispatcher that would simply take a query, transform it a bit, forward it to other sites, gather results, and display them, wasn’t all that interesting. However, that tool could be used to collect a large number of web pages about a topic from “knowledgeable sources” and thus we could do something to analyze semantic structure. However, this wasn’t terribly well defined, and by the time we had MetaCrawler, we still weren’t sure what structure we’d want to investigate and even what kinds of semantics we were interested in. So, that part of the project was dropped, and we focused more on the research of MetaCrawler itself.
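The "transform it a bit" step he describes is the heart of the abstraction layer: each engine of the era had its own query syntax, and the dispatcher had to rewrite the user's query per engine. A toy sketch of the idea (the syntax rules below are invented for illustration, not the engines' actual query languages):

```python
# Hypothetical per-engine query rewriters; real engines of the era
# differed in how they expressed required-term and phrase queries.
REWRITERS = {
    "WebCrawler": lambda terms: " ".join(terms),                   # plain keywords
    "Lycos":      lambda terms: " ".join("+" + t for t in terms),  # require all terms
    "InfoSeek":   lambda terms: '"' + " ".join(terms) + '"',       # phrase query
}

def transform(query, engine):
    """Rewrite a user query into one engine's (assumed) syntax."""
    return REWRITERS[engine](query.split())

print(transform("thai food seattle", "Lycos"))
# -> +thai +food +seattle
```

With a table like this in place, adding a new engine to the metasearch layer is just a matter of writing one more rewriter and one more result parser.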
It turns out a lot of research goes that way… something is created for one purpose, and then it’s discovered that the thing is interesting in its own right, more so than the original purpose. A good research program, such as the one at UW, will recognize when that happens and encourage students like me to pursue the emergent discoveries. Oren’s advice on the matter was to always investigate surprises with great vigor. Predictable things are, well, predictable, and the research that comes from steady improvement, while beneficial, tends to be rather boring. However, when you discover something unexpected, the results and explanations are almost always exciting and fascinating.
12. What do you see as the major differences between MetaCrawler and today’s federated search applications?
Google, Yahoo, and Microsoft have done a decent job of solving Web search, such that integrating them, even tossing in a few other also-rans like Ask.com, doesn’t make that much of an improvement. So the value of MetaCrawler just integrating Web search, as it turns out, isn’t very high now. However, there are two areas where I still see a focus. One is niche search, either topical or regional. For example, there are a number of search engines focused on South Africa, such as Ananzi, Aardvark, and Funnel. They’re doing well because they’re focused on that market, and frankly Google, Yahoo, and Microsoft aren’t, because that market isn’t big enough for the big guys to spend their time on it. I do see some metasearch in that world, although it’s also a bit of a niche product.
The other area of development is the Deep Web or Invisible Web, i.e. the data hiding in databases behind web forms. This is still mostly inaccessible to the Googles of the world, and federated search is able to spend the effort to include those types of queries through a single interface. Here the focus is topical — for example, there’s lots of medical and governmental activity in this space.
13. Which are the best metasearch engines today?
Well, I don’t know if I can answer that question in an unbiased way, nor would I expect anyone to believe my answer to be unbiased, so I’ll pass.
14. Is there a relationship between MetaCrawler and Dogpile?
Both are run by InfoSpace, located in Bellevue, WA. InfoSpace has had an interesting history, but the short of it is that they’ve divested themselves of all their non-search business, and are focused exclusively on meta-search at the moment.
15. How would you compare MetaCrawler and Dogpile?
Last I checked they’re powered by the same code! There are two main differences:
- MetaCrawler collates results into a single list; DogPile just shows the Top 10 from the various engines one after another.
- The name.
Now, before you assume those two differences are trivial… think again. It turns out there are two main reasons why MetaCrawler never really became mainstream. The first is that most people couldn’t tell the difference between the results MetaCrawler provided and the results AltaVista or Lycos or whoever else provided — they all returned 10 results for a search query. Yes, the discerning user knew MetaCrawler’s value-add of providing the best from those engines, but it was a bit slower, and the interface wasn’t as nice. So that is significant.
The second reason is that the name doesn’t sell all that well. Sad, but true. Lots of people, when they hear “MetaCrawler,” actually hear “Medi-Crawler” and think medical. DogPile, whatever its connotations, is easily recognizable and fun, so it gets more mileage out of its name.
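The collation difference described above is essentially a rank-merging problem: several per-engine top-10 lists must be folded into one, with duplicates removed and pages found by multiple engines credited. A toy scoring scheme (not MetaCrawler's or Dogpile's actual algorithm) might look like this:

```python
def collate(ranked_lists):
    """Merge several ranked result lists into one.

    Each list is ordered best-first. A URL scores higher the nearer
    the top it appears, and votes from multiple engines add up.
    """
    scores = {}
    for results in ranked_lists:
        for rank, url in enumerate(results):
            scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

merged = collate([
    ["a.com", "b.com", "c.com"],   # engine 1's top results
    ["a.com", "c.com", "d.com"],   # engine 2's top results
])
print(merged)
# -> ['a.com', 'c.com', 'b.com', 'd.com']
```

Note how c.com outranks b.com despite a lower per-engine position, because two engines agreed on it; that agreement signal is exactly the value-add a collating metasearcher offers over showing each engine's list separately.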
Next Friday I’ll publish the third and final installment of the interview.