We’ve all heard the old adage, “Don’t believe everything you read.” The Internet is full of material to read; how do we know what to believe? Numerous search engines present us with documents in response to our queries, but how do we know whether the information in those documents is accurate? Granted, much of what’s on the Internet is personal opinion, and sometimes all we want is someone’s viewpoint. There are times, however, when we need to know that the information we are reading is of high quality. We may be researching product features to make a purchase decision, company information to form a competitive intelligence strategy, or medical information to address a health concern.

A major part of the answer to the question of whether information is accurate is to examine its source. This is where federated search engines really shine. By their nature, federated search applications usually query deep web database sources. These databases can’t be crawled; there are no links for Google to follow to extract all of the documents in such a database. Now, let’s consider the type of content that lives in these non-crawlable databases. Publishers who specialize in scientific, technical, and business research articles are most likely to store their documents in databases and to make their content searchable by federated search engines. Geological, geographic, and demographic data live in databases. Much political data lives in databases as well.
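For readers who like to see the idea in code, here is a minimal sketch (in Python, with made-up connector names and result fields, not any particular product’s API) of what a federated search application does at query time: it fans the query out to each source’s own search interface in parallel and merges whatever comes back, since there is nothing for a crawler to fetch ahead of time.

```python
# A rough sketch of the federated search idea: the query is sent live to each
# deep web source's own search interface, because the documents behind those
# interfaces can't be reached by a crawler.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical connectors for illustration. A real connector would translate
# the query into the source's search syntax and parse its response format.
def search_science_database(query):
    return [{"source": "science-db", "title": f"Result for '{query}'"}]

def search_business_database(query):
    return [{"source": "business-db", "title": f"Result for '{query}'"}]

def federated_search(query, connectors):
    """Fan the query out to every source in parallel and merge the answers."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda search: search(query), connectors))
    # Flatten the per-source result lists into one list for ranking and display.
    return [hit for hits in result_lists for hit in hits]

results = federated_search("solar cell efficiency",
                           [search_science_database, search_business_database])
for hit in results:
    print(hit["source"], "-", hit["title"])
```

The parallel fan-out matters in practice because the individual sources respond at very different speeds; the slowest source, not the search engine itself, usually determines how long the user waits.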

By its nature, data that is housed in databases is much more likely to be accurate than crawled data, because publishers who place their documents into databases and provide searchable interfaces are often, though certainly not always, charged with ensuring that their content meets certain standards. Much of the scientific and technical information of interest to the research community is vetted by government agencies. In the U.S., OSTI, the Office of Scientific and Technical Information, has created and hosts a number of applications that search deep web databases from U.S. federal agencies and even worldwide organizations that are accountable for the quality and accuracy of their information. Science.gov, WorldWideScience.org, and Science Accelerator are three such applications. Also, publishers in the business intelligence and other sectors who provide content via subscription have standards to uphold.

In contrast to the standards that many content publishers meet, the Internet at large (the crawlable web) is held to no such standards. The three-part series that Dr. Warnick, Director of OSTI, and I collaborated on and published in the OSTI Blog, “Federated Search - The Wave of the Future?”, identifies the issues with the quality of crawled content vs. federated content. When it comes to crawled content, the Internet really is the Wild West!

Peter Cochrane has an article on his blog titled “Searching for the truth.” Cochrane explains that although we have hundreds of search engines, we still have a major problem. He writes:

By and large, today’s search engines are brilliant and useless at the same time. Finding 63,800,000 results is both confounding and worrying.

Have I missed something vital and have I got the full picture, the latest and most relevant document, and perhaps more important, is what I am reading true?

Bluntly, I have no idea, and neither do you. What we really need is a search-engine filter based on validity. In short, a truth engine.

What we want to know is: are these statistics accurate and up to date? Are these historical events in order? Do I have the best available data? Is that politician correct? Is he lying, or is he bending the truth to his own advantage? And so on.

Cochrane sees Web 2.0 as the start of the answer to filtering data for “truth.” He writes:

Web 2.0 is the starting point. Bandwidth, connectivity and sensors everywhere are the vital components - without which there will be no collective intelligence.

And the next big step is adaptability of software and ideally, but not absolutely necessary, hardware. These aspects are being born of the evolutionary developments in artificial life and artificial intelligence.

I’m not so optimistic. I think it will be a very long time (Cochrane says only 20 years) before search engines give us what we’re looking for, accurate information, the first time. I don’t see how software will get us from unstructured, unvetted data to verifiably correct and actionable information. Wired Magazine writes about a different approach: if you’re looking for quality information in the Wild West, you need a guide. The article gives a number of examples of businesses that endeavor to filter out the content you don’t want and provide you with what you do want.

Think “social networking meets search.” The Wired article explains how businesses are being started to filter the information found by search engines. Brijit was founded in 2007 to take online and offline content and provide short (100-word) abstracts to the public. ChaCha hires people to answer user queries. Mahalo uses freelancers and volunteers to steer users of its search engine toward documents they have reviewed.

I think the future of quality search results lies in the intersection of a number of technologies and approaches: federated search to access the high-quality sources that Google can’t get to, human guides to lead us to better results, best-of-the-web sites and services, and perhaps some of the artificial intelligence software that Cochrane envisions. I think we should all keep an eye on Web 2.0; this growing global community, if it bands together, can really help us pinpoint what matters most to each of us.

