The e-resources@ uvm blog has a post this morning that, among other things, says this:
The closing speaker, Tom Wilson (University of Alabama), briefly made a point about Google that I really liked, and that led to discussion afterwards. He pointed out that Google is not a federated search engine: it uses relevancy ranking (maybe well, maybe not well) and federated searches can’t. Federated search engines are, by nature, multiple databases, and can’t apply relevancy like Google can with its single database. I had never thought through to that point, and I think it’ll be on my mind for the plane ride home.
This statement really caught my attention because it’s wrong. I worked at Deep Web Technologies (this blog’s sponsor) for five years and know their technology pretty intimately. Deep Web puts a tremendous amount of effort into doing relevance ranking. Most other federated search vendors provide relevance ranking as well.
I think the point that Mr. Wilson was trying to make was that it is much more difficult for federated search applications to do relevance ranking than it is for applications that crawl and index their content. The difference between federated search and crawling isn’t, as the blog post claims, that “Federated search engines are, by nature, multiple databases, and can’t apply relevancy like Google can with its single database.” The difference is that Google has complete information in its database while federated search engines perform their relevance ranking with very incomplete information.
When a user performs a query using Google, Google can find the user’s search terms anywhere in potential document matches because it has read (extracted text from) entire documents and indexed that text. Google can rank multiple results against one another and do it consistently because it has the full text for all of the results it’s comparing.
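To make that concrete, here is a toy sketch in Python. It is nothing like Google's actual algorithm, which also weighs links and many other signals; it only illustrates the structural point that a full-text index lets the engine score every candidate against the same complete data. The documents and the simple TF-IDF scorer are invented for illustration.

```python
# Toy full-text ranking: the engine has indexed every word of every
# document, so it can score all candidates against complete information.
import math
from collections import Counter

documents = {
    "doc1": "federated search queries many databases at once",
    "doc2": "relevance ranking orders search results by how well they match",
    "doc3": "crawlers download full documents and index every word",
}

def tf_idf_score(query, doc_id, docs):
    """Score one document against the query using simple TF-IDF."""
    words = docs[doc_id].split()
    tf = Counter(words)
    score = 0.0
    for term in query.split():
        df = sum(1 for text in docs.values() if term in text.split())
        if df == 0:
            continue  # term appears nowhere; contributes nothing
        idf = math.log(len(docs) / df)
        score += (tf[term] / len(words)) * idf
    return score

query = "relevance ranking"
ranked = sorted(documents, key=lambda d: tf_idf_score(query, d, documents), reverse=True)
print(ranked)  # doc2 ranks first: the full text of every candidate was comparable
```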
Federated search, for all its great benefits, is severely limited on the relevance ranking front. A federated search application typically has access to only the title, summary, and other small bits of metadata in the result lists returned from searches of its databases. The federated search application can also capture the underlying search engine's ranking of documents in the result list and use that information to influence its own ranking. But the federated search application is at the mercy of the databases it searches. Many databases return results that don't match queries particularly well, or, if they do return relevant results, they may not rank them well. The federated search engine doesn't have the luxury of examining the full text of the documents returned by the databases to see whether it can rank them better than the source did. So the federated search application is stuck performing relevance ranking as best it can given very limited information.
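As a hypothetical illustration (not any vendor's actual algorithm; the titles, snippets, and scoring weights here are all made up), a federated engine merging result lists might have nothing to work with beyond each hit's title, snippet, and position in its source's own ranking:

```python
# Hypothetical federated merge: each source returns only metadata
# (title, snippet, and its own rank position); no full text is available.
def merge_results(query, sources):
    """Blend per-source rank position with term overlap in the visible metadata."""
    terms = set(query.lower().split())
    merged = []
    for results in sources:
        for position, hit in enumerate(results, start=1):
            visible = (hit["title"] + " " + hit["snippet"]).lower().split()
            overlap = len(terms & set(visible))
            # 1/position trusts the source's own ordering; term overlap is
            # the only evidence the metadata gives us about relevance.
            score = overlap + 1.0 / position
            merged.append((score, hit["title"]))
    return [title for score, title in sorted(merged, reverse=True)]

source_a = [
    {"title": "Relevance ranking in federated search", "snippet": "ranking with limited metadata"},
    {"title": "Database connectors", "snippet": "building search translators"},
]
source_b = [
    {"title": "Crawling the deep web", "snippet": "relevance and coverage"},
]
print(merge_results("relevance ranking", [source_a, source_b]))
```

Note how much the sketch leans on each source's own ordering: if a database ranks poorly, that weakness flows straight into the merged list, which is exactly the "at the mercy of the databases" problem described above.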
Having pointed out the limitation of federated search relevance ranking, I must also say that not all federated search applications rank equally well. A search of this blog for the phrase “relevance ranking” turns up a number of articles where I’ve addressed the issue. In particular, “What determines quality of search results” discusses the subject at length. Plus, a federated search application can extract the full text of documents it retrieves from databases and use that text to improve its ranking. Science.gov, whose search engine was built by Deep Web Technologies, employs this and another strategy, in selected cases, to improve relevance ranking.
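Science.gov's internals aren't spelled out here, so the following is only a hypothetical sketch of the general full-text-enrichment idea: fetch the document behind each top result, re-score it with complete text, and fall back to the snippet when the fetch fails. The fetch_text and rerank_with_full_text helpers are invented for illustration.

```python
# Hypothetical full-text enrichment: fetch the document behind each of the
# top merged results and re-rank using the complete text, not the snippet.
import urllib.request

def fetch_text(url, timeout=5):
    """Download a document's raw text; a real system would parse HTML/PDF."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def rerank_with_full_text(query, top_hits):
    """Re-score the top hits with full text when it can be retrieved."""
    terms = query.lower().split()
    rescored = []
    for hit in top_hits:
        try:
            text = fetch_text(hit["url"]).lower()
        except OSError:
            text = hit["snippet"].lower()  # fall back to metadata-only scoring
        score = sum(text.count(term) for term in terms)
        rescored.append((score, hit["title"]))
    return [title for score, title in sorted(rescored, reverse=True)]
```

Fetching full text is slow, which is why a strategy like this would plausibly apply only to the top handful of results, consistent with "in selected cases" above.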
Mike Moran, in the Biznology Blog, compares Dogpile (a metasearch application that federates several of the most popular web crawler search engines, including Google) to searching the underlying search engines individually. While I don’t agree with a number of the points that Mr. Moran makes in the article, I do believe he has articulated the problem well:
Because Dogpile doesn’t actually examine the documents, it suffers from limitations that degrade its results. Relevance ranking, while difficult in a single-index search engine, is excruciating for a federated search engine. Google can rank documents based on where the words appear in the documents, which documents get links to them, and dozens of other factors. Dogpile can’t. Dogpile can only take a guess at which documents are better by examining the titles, snippets, and URLs that Google returns to display on its search results screen. That’s why most people prefer Google, or Yahoo!, or another one-index search engine to Dogpile and Metacrawler.
Note that Dogpile conducted research that refutes Mr. Moran’s claim that most people prefer the native search applications over Dogpile.
For all its strengths, federated search comes with trade-offs; relevance ranking is one of them, but a well-designed search application can overcome some of the inherent limitations.
Tags: federated search