Often overlooked in conversations about federated search vs. Google crawling is that the difference between the two approaches isn’t just whether the search engine must reach content behind search forms or simply follow links to build an index. Content quality is another fundamental difference. Library patrons expect all of the resources available to them to be of high quality, whether those resources are physical books and journals or digital content. The same expectation holds for resources presented by federated search engines, especially those used in academic, business, or scientific environments.
The Google relevance model is largely based on “authority,” which is based on popularity, and popularity is NOT the same as credibility. Particular scientific findings may be published on the Web, rarely referenced, and thus deemed unimportant by Google, yet they may be noteworthy and come from highly credible sources. Conversely, highly popular Web documents may be ranked highly by Google yet fall into the category of “pseudo-science.”
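Google’s actual ranking blends many signals and is proprietary, but the popularity-based “authority” idea traces back to the published PageRank algorithm. The toy sketch below (page names and link graph are invented for illustration) shows the core point: a page’s score is driven entirely by who links to it, so a widely linked page outranks an obscure but credible one no matter what either page actually says.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Minimal PageRank sketch: each page's score is built from the
    scores of the pages linking to it -- popularity, not accuracy."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score (the "teleport" term).
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # A page passes its score evenly to the pages it links to.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its score evenly across all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical graph: three blogs link to a "pseudo" page; a credible
# but uncited paper gets no inbound links at all.
links = {
    "pseudo": [],
    "blog1": ["pseudo"],
    "blog2": ["pseudo"],
    "blog3": ["pseudo"],
    "obscure_paper": [],
}
ranks = pagerank(links)
```

Running this, `ranks["pseudo"]` comes out well above `ranks["obscure_paper"]`, even though nothing in the computation ever looked at the content of either page.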
From the Nihil Obstat blog I learned about an upcoming workshop on information credibility on the Web. The workshop description captures the problem:
As computers and computer networks become more common, a huge amount of information, such as that found in Web documents, has been accumulated and circulated. Such information gives many people a framework for organizing their private and professional lives. However, in general, the quality control of Web content is insufficient due to low publishing barriers. As a result there is a lot of mistaken or unreliable information on the Web that can have detrimental effects on users.
The description goes on to say that technology is needed to determine the accuracy and trustworthiness of Web documents, among other characteristics. Given the explosive growth of Web 2.0 and user-generated content, the need to separate the wheat from the chaff is greater than ever.
As an aside, the keynote speaker for the information credibility conference, Ricardo Baeza-Yates from Yahoo! Research, will be sharing interesting results “that show that user generated content in Flickr, Yahoo! Answers and Wikipedia is better than what can be imagined.”
The fact that there’s an entire conference dedicated to information credibility, and that the unquoted words “information credibility” return over 15 million Google hits (the quoted phrase yields only about 15,000 results), tells me that lots of people care about this problem.
Federated search goes a long way toward solving, or bypassing, the information credibility problem. In the academic, business, and technical research environments where federated search engines are most likely to be found — i.e. not in the popular consumer-oriented metasearch engines — the content sources are all vetted, and the chaff is left out.
In my book, credibility is what separates federated search engines from crawlers. I’m not saying that Google doesn’t deliver outstanding content. It does. It’s just hard to tell which of its results are the outstanding ones. If you want to know what other factors influence the quality of search results, I recommend this article.
Tags: federated search