
I recently took a much-needed break. I spent a couple of days with a very dear friend in Colorado. On the drive back to Santa Fe, I called Abe to check in. In our discussion, Abe told me that there had been a fair amount of buzz in the blogosphere about Google “surfacing” deep Web content. Last April I first wrote about Google’s efforts to crawl the deep Web. A couple of months later I followed up with Why is Google interested in the deep web. Today there’s more to write about.

Yahoo! Tech News published an article on January 30: Google Researcher Targets Web’s Structured Data (PC World). The article’s first paragraph is ominous, unless you believe that Google is regurgitating old news:

Internet search engines have focused largely on crawling text on Web pages, but Google is knee-deep in research about how to analyze and organize structured data, a company scientist said Friday.

There is new news. Check out Google’s Deep-Web Crawl. Google is indeed stepping up its efforts to mine the deep Web; Google uses the term “surfacing.” What Google is doing more of is submitting queries to HTML forms and adding the results it finds to its index. From Google’s perspective this makes sense. Its model is to build a comprehensive index. Google isn’t interested in building federated search applications. But it would love to index all the good content behind search forms and blend those documents in with the documents and web pages it finds by crawling. Here is a paragraph from the Google article’s abstract:

Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs which accept only values of a specific type. Third, HTML forms often have more than one input and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion into our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.

I’ve bolded several of the sentences in the abstract. Google is making progress. If the abstract doesn’t convince you then read the entire 12-page paper. I bet you’ll conclude, as I do, that, yes, Google is gaining traction in its efforts to extract deep Web content.
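
To make the surfacing idea a bit more concrete, here is a minimal Python sketch of what probing a single HTML search form might look like: probe keywords are drawn from text the crawler has already seen on the site, and only a small, capped subset of the select-menu combinations is submitted rather than the full Cartesian product. The form URL, field names, and heuristics below are my own illustrative assumptions, not the actual algorithms from Google’s paper.

```python
import itertools
from urllib.parse import urlencode

# Hypothetical example form: a site search with one free-text input and two
# select menus. The URL and field names are illustrative, not from the paper.
FORM_ACTION = "http://example.org/search"
TEXT_INPUT = "q"
SELECT_INPUTS = {
    "category": ["books", "articles", "reports"],
    "year": ["2007", "2008", "2009"],
}


def candidate_keywords(sample_pages, limit=5):
    """Rough stand-in for keyword selection: seed the text input with words
    that already appear on the site's visible pages, so submissions are
    likely to return real results."""
    counts = {}
    for page_text in sample_pages:
        for word in page_text.lower().split():
            if word.isalpha() and len(word) > 3:
                counts[word] = counts.get(word, 0) + 1
    return [w for w, _ in sorted(counts.items(), key=lambda kv: -kv[1])[:limit]]


def surface_urls(keywords, max_select_combos=4):
    """Generate form-submission URLs without enumerating the full Cartesian
    product: cap the number of select-menu combinations tried per keyword."""
    combos = list(itertools.islice(
        itertools.product(*SELECT_INPUTS.values()), max_select_combos))
    urls = []
    for keyword in keywords:
        for combo in combos:
            params = dict(zip(SELECT_INPUTS.keys(), combo))
            params[TEXT_INPUT] = keyword
            urls.append(FORM_ACTION + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    sample = ["reports and articles about crawling and indexing the deep web"]
    for url in surface_urls(candidate_keywords(sample)):
        print(url)
```

Each generated URL would then be fetched, and pages that return useful results could be added to the index like any other crawled page.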

So, is Google’s move to surface deep Web content a threat to federated search? Yes and no. Yes, because Google will be able to legitimately claim that a “bunch” of quality deep Web content is now available through its search engine. No, for a couple of reasons. First, Google will never be able to claim comprehensive coverage of any deep Web source, because it can’t know whether its sampling approach retrieves all of the documents from a particular source. What serious researcher would use a search engine that returned many, but not all, of the relevant documents from a particular source? Not many. Second, Google’s high-quality deep Web results are going to be merged into a result list with documents of unknown quality. What serious researcher would go to a library that shelved scholarly journals alongside non-scholarly ones, if he or she had to open each journal to determine which were scholarly and which weren’t? Again, not many.

So, while I give Google credit for developing the technology to mine the deep Web, once again, I don’t think the sky is falling.



This entry was posted on Monday, February 2nd, 2009 at 10:27 pm and is filed under viewpoints. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

6 Responses so far to "Update on Google and the deep web"

  1. Matthew Theobald
    February 3rd, 2009 at 12:45 am  

    Might be a good time to talk again about Internous LLC and ISEN and DWT’s federated workhorses.

  2. Steve Oberg
    February 3rd, 2009 at 9:50 am  

    No, the sky isn’t falling, but I think you fail to give Google its due for what is a significant and interesting advancement. I think their paper proves fairly clearly that they are gaining significantly by their investment in this approach, and that their investment is worthwhile. In my view this is precisely the kind of thoughtful, iterative, and creative approach needed to improve search engine content.

  3. Sol
    February 3rd, 2009 at 4:05 pm  

    Steve - I agree with you that Google’s achievement is significant. From the perspective of the federated search industry, the question is whether this will have a major impact.

  4. Tal Ayalon
    February 4th, 2009 at 6:08 am  

    Sol,

    Thank you for yet another insightful post.
    You mention two reasons why makers of federated searching tools should not be alarmed by Google’s dive into the deep water of the deep Web.
    I believe that Google could make great strides towards overcoming your second point (scholarly materials being mixed in with non-peer-reviewed documents of unknown quality) by incorporating the records it retrieves from the deep Web into its Google Scholar service.
    While not a 100% solution to the problem, this would reduce it significantly and enrich Google Scholar’s content.
    The first point, incomplete coverage of relevant material from any given content provider or repository, could also be overcome with time, provided that we see a settlement between scholarly journal vendors and Google along the lines (still unclear at this point) of the Google Book Search settlement with publishers.
    Once Google is licensed to include full subscription-based scholarly article repositories in its search results, compensating authors and publishers in an agreed-upon manner, it will be able to offer comprehensive deep Web content with unquestionable authority.
    Having said that, I do not believe that these future developments would render current federated searching tools obsolete.
    Federated searching tools, like the information world as a whole, will have to evolve and offer services and content that Google can’t or won’t provide.
    Human quality control and indexing of select article databases, which provide better precision than Google’s automated deep Web tools, could be one direction, as could indexing local content produced by the buying institution, content that is sometimes not on the Web at all, let alone available for Google to index.
    Plus, there will always be publishers unwilling to have their content indexed by a free search engine, which would make the commercial federated searching tool the only one able to retrieve their content within combined result sets.
    In short, I see this as a great step forward for Google, with the potential to lead to gigantic steps in the future, but I also see the commercial federated searching tool as a product that complements Google and will continue to do so for the foreseeable future.

  5. Sol
    February 5th, 2009 at 7:04 am  

    Hello Tal,

    Thanks for the thorough comment. I don’t disagree with any of your points. As you say in your last paragraph, I believe that federated search complements Google. There’s clearly a market for both. And, as you also say, I don’t believe that all content publishers will make their content indexable by Google, especially the large volume of non-free content.

  6. Update on Google and the deep web | Enterprise Social Search
    February 9th, 2009 at 1:48 am  

    [...] February An article originally posted on Federated search blog [...]
