I recently took a much-needed break. I spent a couple of days with a very dear friend in Colorado. On the drive back to Santa Fe, I called Abe to check in. During our conversation, Abe told me that there has been a fair amount of buzz in the blogosphere about Google “surfacing” deep Web content. Last April I first wrote about Google’s efforts to crawl the deep Web. A couple of months later I followed up with Why is Google interested in the deep web. Today there’s more to write about.
Yahoo! Tech News published an article on January 30: Google Researcher Targets Web’s Structured Data (PC World). The article’s first paragraph is ominous, unless you believe that Google is regurgitating old news:
Internet search engines have focused largely on crawling text on Web pages, but Google is knee-deep in research about how to analyze and organize structured data, a company scientist said Friday.
There is new news, though. Check out Google’s Deep-Web Crawl. Google is indeed stepping up its efforts to mine the deep Web. Google uses the term “surfacing”: what Google is doing more of is submitting queries to HTML forms and adding the documents it finds to its index. From Google’s perspective this makes sense. Its model is to build one comprehensive index. Google isn’t interested in building federated search applications, but it would love to index all the good content behind search forms and blend those documents in with the web pages it finds by crawling. Here is a paragraph from the paper’s abstract:
Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs which accept only values of a specific type. Third, HTML forms often have more than one input and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion into our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
The key sentences of the abstract speak for themselves. Google is making progress. If the abstract doesn’t convince you, read the entire 12-page paper. I bet you’ll conclude, as I do, that, yes, Google is gaining traction in its efforts to extract deep Web content.
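To make “surfacing” concrete, here is a minimal Python sketch of the contrast the abstract draws. Everything in it is hypothetical: the form URL, the input names, and the candidate values are made up, and the informativeness test is a stub. It’s my illustration of the idea, not Google’s actual algorithm, which does far more work to pick keywords for text inputs and to choose which input combinations (query templates) to probe.

```python
import itertools
from urllib.parse import urlencode

# Hypothetical form: a store locator with three inputs. The URL,
# parameter names, and candidate values are all made up for
# illustration; none of them come from Google's paper.
FORM_ACTION = "http://example.com/search"
FORM_INPUTS = {
    "state":    ["NM", "CO", "TX"],          # select menu
    "category": ["books", "music", "film"],  # select menu
    "keyword":  ["santa fe", "denver"],      # text input: values must
}                                            # be guessed or learned

def surface_naive(action, inputs):
    """Naive surfacing: enumerate the full Cartesian product of all
    input values and emit one crawlable GET URL per combination.
    The URL count is the product of the value-list sizes, which
    explodes on forms with many inputs."""
    names = list(inputs)
    for combo in itertools.product(*(inputs[n] for n in names)):
        yield action + "?" + urlencode(dict(zip(names, combo)))

def surface_templates(action, inputs):
    """Closer in spirit to the paper: probe small "query templates"
    (here, one input at a time) and keep only the templates that
    prove informative, i.e. whose result pages actually vary with
    the input value."""
    for name in inputs:
        if template_is_informative(action, name, inputs[name]):
            for value in inputs[name]:
                yield action + "?" + urlencode({name: value})

def template_is_informative(action, name, values):
    # Stub. A real implementation would fetch a sample of result
    # pages for this template and check that they are distinct
    # (and non-empty) before keeping it.
    return True

# 3 * 3 * 2 = 18 URLs from the naive strategy vs. 3 + 3 + 2 = 8
# from one-input templates; on real forms with dozens of values
# per input, that gap is the difference between feasible and not.
print(len(list(surface_naive(FORM_ACTION, FORM_INPUTS))))      # 18
print(len(list(surface_templates(FORM_ACTION, FORM_INPUTS))))  # 8
```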
So, is Google’s move to surface deep Web content a threat to federated search? Yes and no. Yes, because Google will be able to legitimately claim that a “bunch” of quality deep Web content is now available through its search engine. No, for a couple of reasons. First, Google will never be able to claim comprehensive coverage of any deep Web source, because its sampling approach can’t tell whether it has retrieved all of the documents from a particular source. How many serious researchers would use a search engine that returned many, but not all, of the relevant documents from a source? Not many. Second, Google’s high-quality deep Web results are going to be merged into a result list with documents of unknown quality. How many serious researchers would use a library that shelved scholarly journals alongside non-scholarly ones and left them to open each journal to figure out which was which? Again, not many.
So, while I give Google credit for developing the technology to mine the deep Web, once again, I don’t think the sky is falling.