Google experiments with crawling the deep web: sky isn’t falling | Federated Search Blog

The blogosphere is buzzing with posts about Google starting to look for content to index behind web forms. Google announced the experiment on the Google Webmaster Central Blog. The announcement is dated April 11, so I guess this isn’t an April Fools’ joke. ComputerWorld wrote about the announcement, as did Search Engine Land, Google Blogoscoped, and others.

Google explains in their announcement that they look for HTML FORM tags in “a small number of particularly useful sites.” In a nutshell, here is what Google says it’s doing when it finds web pages with “FORM” tags:

… we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
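To make the quoted procedure concrete, here is a minimal Python sketch of the URL-generation step for a GET form. It is only an illustration of the idea as Google describes it, not Google's actual code: the function name `candidate_urls`, its parameters, and the `limit` cap are all hypothetical, and the sketch assumes the form's `action` has no query string of its own.

```python
from itertools import product
from urllib.parse import urlencode, urljoin


def candidate_urls(page_url, action, text_inputs, choice_inputs, site_words, limit=10):
    """Build up to `limit` GET URLs that a user of the form could have produced.

    text_inputs:   names of free-text fields (hypothetical input format)
    choice_inputs: dict mapping a select/checkbox/radio name to the values in the HTML
    site_words:    words taken from the site that hosts the form
    """
    names = list(text_inputs) + list(choice_inputs)
    # Text boxes draw from the site's own vocabulary; choice fields from their listed values.
    value_lists = [site_words for _ in text_inputs] + [choice_inputs[n] for n in choice_inputs]
    base = urljoin(page_url, action)  # resolve a relative form action against the page URL

    urls = []
    for combo in product(*value_lists):
        urls.append(base + "?" + urlencode(list(zip(names, combo))))
        if len(urls) >= limit:  # "a small number of queries"
            break
    return urls


urls = candidate_urls(
    "http://example.org/search.html",
    "/find",
    text_inputs=["q"],
    choice_inputs={"type": ["book", "article"]},
    site_words=["deep", "web"],
    limit=3,
)
for u in urls:
    print(u)
```

Even in this toy form, two words and two menu values already give four possible queries, so any real crawler has to cap and choose among combinations, which is where the hit-or-miss character comes from.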

So, do federated search companies need to worry that Google is going to put them out of business by automatically crawling the valuable content that many of them federate? Do their customers need to worry that their investments in federated search will go to waste? I’m reminded of a blog post that Abe wrote last December, “Federated Search: R.I.P.” Abe reminisces about how, when Google Scholar was announced in November of 2004, he received a flurry of emails from customers who were concerned that Google would make their investment in federated search obsolete. Well, almost three and a half years later, federated search has grown in popularity, and Google Scholar hasn’t displaced any federated search vendors or deployments that I’m aware of. So, no, I don’t think Google’s experiment is a threat to any enterprise that mines (and typically federates) deep web content.

I think this announcement is over-hyped for a number of reasons.

My first observation is that Google is going after the low-hanging fruit, while the best-quality content is going to take much more effort to extract. Google is only going after GET forms, it is employing a very simplistic algorithm for form submission, and it is currently only trying to extract content from a small number of sites. So, Google’s experiment is a hit-or-miss venture.

One should not gloss over the difficulty of submitting a web form. I have worked closely with engineers who build connectors to search databases, and I have developed a few myself, so I know first-hand that automated form submission, in the general case, is a very difficult problem. I’m not saying that automated form submission is not viable; I do believe, however, that it is difficult to do well enough for the large variety of search interfaces out there.

A more serious limitation of the Google approach, even when Google is able to fill out forms, is its inability to guarantee comprehensiveness against any given database. Harvesting content is difficult if the content provider does not cooperate. Assuming that you point Google to a web search form for a deep web database containing tens or hundreds of thousands of documents, how would Google extract them all? How would Google even know if it had extracted them all? To be comprehensive, Google would need a sophisticated algorithm to select search terms and perform sufficiently many searches to get back results pages with links to sufficiently many, if not all, documents in the database. If Google’s coverage of any particular database is unknown, then researchers are likely to want to go directly to the source and search it for comprehensiveness.
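The coverage problem can be seen in a small Python sketch. Everything here is hypothetical and for illustration only: `harvest` is a naive harvester that seeds itself with one term and then reuses words from the documents it retrieves, and `toy_search` stands in for a search form over a four-document database that returns at most two hits per query. Nothing about it is meant to represent how Google selects its terms.

```python
def harvest(search, seed_terms, max_queries=50):
    """Query `search` repeatedly and collect document ids.

    Returns (ids_found, queries_issued). The harvester never learns the
    true size of the database, only what its own queries happened to return.
    """
    seen = {}
    pending = list(seed_terms)
    tried = set()
    queries = 0
    while pending and queries < max_queries:
        term = pending.pop(0)
        if term in tried:
            continue
        tried.add(term)
        queries += 1
        for doc_id, text in search(term):
            if doc_id not in seen:
                seen[doc_id] = text
                # Use the vocabulary of newly found documents as future queries.
                pending.extend(w for w in text.lower().split() if w not in tried)
    return set(seen), queries


# Toy database; document 4 shares no vocabulary with the others.
DOCS = {
    1: "solar energy storage",
    2: "energy policy report",
    3: "policy on groundwater",
    4: "isotope measurement methods",
}


def toy_search(term, page_size=2):
    """Stand-in for a search form: exact word match, at most `page_size` hits."""
    hits = [(i, t) for i, t in sorted(DOCS.items()) if term in t.lower().split()]
    return hits[:page_size]


found, n = harvest(toy_search, ["solar"])
print(sorted(found), "found after", n, "queries; database holds", len(DOCS))
```

The harvester stops when it runs out of new terms, having found three of the four documents, and nothing in the results pages tells it that a fourth exists. Only the content provider knows the real total, which is why cooperation matters.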

Another issue: how will Google rank documents it finds behind forms? Google users rely on Google’s PageRank algorithm to present them with the most popular articles given their search terms. Documents inside of databases will typically have very few links to them from the web, so these deep web documents won’t be popular and won’t rank very highly. They won’t be displayed prominently in search results and users won’t see them very often, which defeats the whole purpose of Google increasing its coverage of the web. Google will likely apply some algorithm to determine an appropriate rank for deep web documents. It may, for example, assign the PageRank of the page containing the web form to all documents behind the form. How well this would work remains to be seen.

I’m not worried about Google making the federated search industry obsolete for a very fundamental reason. Researchers, students, and the public rely on federated search to provide comprehensive coverage of quality deep web databases that are free of low-quality documents. Google fails on all three counts: its coverage is not comprehensive, it is not clear which databases it will attempt to harvest, and it has no way to filter out low-quality content. To people conducting serious scholarly or scientific research, popularity is not the best indicator of scholarly value. I don’t fault Google; its approach is to index as much as it can. That approach is fine for finding facts, figures, and popular content, but that’s not what users of federated search applications are looking for.

This entry was posted on Tuesday, April 15th, 2008 at 6:04 am and is filed under industry news, viewpoints.
