When Google announced on April 11 that it was experimenting with crawling the deep web, the blogosphere buzzed with excitement, concern, and questions. The conversation, however, is not new. In March of last year, Geeking with Greg reported on a December 2006 paper: Structured Data Meets the Web: A Few Observations. The paper, attributed to Google, describes Google’s efforts to search structured data and to structure data that isn’t structured. And, an October 19, 2006 blog article at SEO by the Sea identifies a Google patent filed on April 5, 2006 and published October 12, 2006: Searching through content which is accessible through web-based forms. The three inventors listed in the patent are among the co-authors of the Google paper.
What is structured data? Much of the web is text, and for most of that text it is difficult for computers to determine what the text is about. Making the text searchable by keyword is not good enough, as anyone who has gotten 10 million results for a search can testify. What if authors or readers tagged content? Humans can categorize content better than any computer algorithm, so tagged content has great value for search engines wanting to deliver more relevant information. When a person tags or otherwise categorizes content, he or she is adding structure to the content. In the library world, document metadata is a perfect example of structure. Being able to extract the title, author, and abstract associated with a document, because the fields are tagged, makes searching on those fields possible and greatly improves the overall search experience.
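To make the point concrete, here is a toy sketch (not anything Google describes) of how tagged fields turn fuzzy keyword matching into precise fielded search. The records, field names, and helper function are all hypothetical:

```python
# Hypothetical records whose fields are explicitly tagged, as with
# library metadata (title, author, abstract).
records = [
    {"title": "Mining the Deep Web", "author": "A. Author",
     "abstract": "Techniques for querying content behind web forms."},
    {"title": "Surface Web Crawling", "author": "B. Writer",
     "abstract": "Following hyperlinks to index static pages."},
]

def search_field(records, field, term):
    """Return records whose tagged field contains the term (case-insensitive)."""
    return [r for r in records if term.lower() in r[field].lower()]

# Because fields are tagged, searching only 'author' avoids false hits
# from the same word appearing in an abstract or title.
hits = search_field(records, "author", "writer")
```

Without the tags, a keyword search for "writer" would have to scan all of the text and could not distinguish an author named Writer from a document about writers.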
Why is Google interested in the structure of data, and what is the tie-in to federated search? In a nutshell: structured data represents a very large and growing volume of information on the web; much of this data lives in the deep web; and federated search technology depends critically on mining the deep web, because that is where much of the best business, scientific, and technical information lives. Moreover, Google’s ability to access the deep web is very limited; most of what it does is crawl the surface web.
Google, in its desire to make searchable all of the world’s information, is naturally interested in getting at this structured deep web data. Google has two general approaches to dealing with structured data: (1) find ways to extract and utilize structured data from the deep web, and (2) make it easier for humans to apply structure to unstructured data. The Google paper broadly addresses both approaches.
My impression from reading the first half of the Google paper is that Google is primarily interested in mining user-generated consumer content, not the scholarly content the federated search industry is searching. Google is very interested in using annotation (tagging) data to improve searches. And, it is very interested in encouraging humans to tag content that they create or identify.
Of interest to those following the story of Google’s experiments with searching forms, the second half of the paper introduces the concept of surfacing. Surfacing is Google’s process of submitting multiple queries to a web form, then extracting and indexing the search results. The paper explains that Google employs surfacing to capture deep web data while overcoming some of the limitations of extracting and analyzing that data in real time. If you read Section 3 of the paper, or all of it for that matter, you’ll realize that Google’s goals in getting at deep web content are different from those of federated search companies. Google is aiming to increase its lead in the consumer search industry by mining web 2.0 content and deep web content of interest to consumers, and presumably to advertisers as well.
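The two mechanical pieces of surfacing, generating form submissions and harvesting result links for indexing, can be sketched in a few lines. This is my own illustrative toy, not Google's implementation; the form URL, parameter name, and results-page markup are all made up:

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

def build_form_queries(form_url, param, values):
    """Build one GET-style form submission URL per candidate query value."""
    return [f"{form_url}?{urlencode({param: v})}" for v in values]

class ResultLinkExtractor(HTMLParser):
    """Collect href attributes from anchor tags in a results page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Each query URL would be fetched (e.g. with urllib.request) and the
# response page fed to the extractor; the extracted result links could
# then be crawled and indexed offline, avoiding real-time form queries.
queries = build_form_queries("http://example.com/search", "q", ["solar", "wind"])
extractor = ResultLinkExtractor()
extractor.feed('<ul><li><a href="/doc/1">Doc 1</a></li>'
               '<li><a href="/doc/2">Doc 2</a></li></ul>')
```

The offline step is the key design point: by surfacing results into its ordinary index ahead of time, Google sidesteps the latency and form-semantics problems of querying the deep web source live, which is exactly what a federated search engine does do at query time.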
The blog articles and the Google paper I reference in the first paragraph were very helpful in deepening my understanding of what Google is thinking with its foray into deep web searching. I hope you find the references, and this article, enlightening as well.