Sometimes I write about things that are not quite related to federated search. This is one of those articles. While I am writing about the deep Web, this article is not about the aspect of the deep Web that the federated search community is focused on. But I’ve written before about two of the key people in this article, so there is some relevance here if you read on.

I received no fewer than three emails (and a flurry of Google alerts) about Alex Wright’s article in yesterday’s New York Times: Exploring a ‘Deep Web’ That Google Can’t Grasp. I like it when important publications write about the deep Web and help to spread awareness of it.

The gist of the New York Times article is that the deep Web is huge, that it contains data that can answer practical questions (e.g. finding the best airline fare), and that there is a growing effort to tackle the difficult problems of mining deep Web data.

It’s interesting to note that businesses that focus on the deep Web approach it from one of two directions: there are the federated search companies, which are primarily interested in providing searchable access to the scholarly, business, and technical documents they find within it, and there are those who are interested in mining deep Web data to solve business problems (and find cheap airfares). Alex Wright’s article focuses on the latter.

Wright is very interested in the problem of how to derive meaning from data in the deep Web. On the semantic Web, content is tagged, by humans or by software, so that machines can extract its meaning automatically. Wright is looking at the semantic Web problem in the realm of the deep Web.

Wright’s article touches on the work of software company Kosmix in tackling one aspect of the deep Web problem:

“The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources.
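To make the idea of matching a search to the most promising databases a bit more concrete, here is a minimal Python sketch. It is not Kosmix's method, which the article does not describe; it simply assumes each source has a hand-written keyword profile and ranks sources by how many query words they share. The names `rank_sources` and `SOURCES`, and the three example databases, are hypothetical.

```python
def rank_sources(query, source_profiles, top_k=2):
    """Return up to top_k source names, ordered by keyword overlap with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for name, keywords in source_profiles.items():
        overlap = len(query_terms & keywords)
        if overlap:  # skip sources that share no words with the query
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically by source name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical keyword profiles for three deep Web databases.
SOURCES = {
    "flights_db": {"flight", "airfare", "airline", "fare"},
    "art_db": {"painting", "rembrandt", "museum"},
    "recipes_db": {"recipe", "dinner", "cheap"},
}

best = rank_sources("cheap airline fare", SOURCES)
```

A real system would presumably build these profiles automatically and use a better relevance measure than raw word overlap, but the shape of the problem is the same: pick a few likely sources first, then query only those.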

Google’s efforts in mining the deep Web are not overlooked:

Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms — “Rembrandt,” “Picasso,” “Vermeer” and so on — until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.
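The probing approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: the "database" is an in-memory list standing in for a form-backed site, every candidate term is tried rather than stopping at the first match, and the "predictive model" is reduced to a simple word count over the records that came back. The names `probe_form`, `summarize`, and `fake_art_search` are hypothetical.

```python
from collections import Counter


def probe_form(search, candidate_terms):
    """Submit each candidate term and keep only the terms that return records."""
    hits = {}
    for term in candidate_terms:
        results = search(term)
        if results:
            hits[term] = results
    return hits


def summarize(hits):
    """Count words across all returned records as a crude profile of the content."""
    profile = Counter()
    for records in hits.values():
        for record in records:
            profile.update(record.lower().split())
    return profile


# Stand-in for a database that is only reachable through a search form.
_ART_RECORDS = [
    "Rembrandt self portrait oil painting",
    "Vermeer interior oil painting",
    "Rembrandt etching print",
]


def fake_art_search(term):
    """Return the records containing the term, ignoring case."""
    return [r for r in _ART_RECORDS if term.lower() in r.lower()]


hits = probe_form(fake_art_search, ["Rembrandt", "Picasso", "Vermeer"])
profile = summarize(hits)
```

Here "Picasso" returns nothing and is dropped, while the records returned for "Rembrandt" and "Vermeer" suggest a database about paintings and prints. Note that the prober never touches the database directly; it only submits the form and reads the results.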

News of Google and the deep Web is not new. I’ve written several times on the subject.

Wright reports on a major effort to index the deep Web:

In a similar vein, Prof. Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated kind of computational guessing game.

The last word in the article is given to federated search luminary Michael Bergman:

“The huge thing is the ability to connect disparate data sources,” said Mike Bergman, a computer scientist and consultant who is credited with coining the term Deep Web. Mr. Bergman said the long-term impact of Deep Web search had more to do with transforming business than with satisfying the whims of Web surfers.

Those of us interested in how access to the deep Web is evolving would certainly benefit from understanding the realized and unrealized potential of the semantic Web. Michael Bergman explains what the semantic Web is and what its future looks like in his answer to question #20 of my interview with him. I also wrote about Bergman in my exploration of the history of the deep Web. In that article I mention an earlier article by Alex Wright on the deep Web that goes into more depth than the New York Times article on various deep Web mining efforts and technologies:

…Bergman continues with a discussion of an article, Searching the Deep Web, by Alex Wright in the October 2008 edition of “Communications of the ACM.” The article, to which Bergman contributed, articulates well the differences between crawling the surface web and indexing the deep web. Additionally, the ACM article discusses new developments in technologies to mine the deep web and the emerging semantic web. Wright packs a remarkable amount of information into two pages. There are a good half dozen topics that warrant further exploration in those two pages.

These are exciting times for the deep Web and for the fledgling semantic Web, commonly referred to as the third generation of the web, or Web 3.0. Developments in mining the deep Web are sure to affect all of us in the federated search world as we learn new and better ways to connect users to content. And one day there will be federated search engines that aggregate content from semantic Web search engines. From that perspective, all aspects of deep Web searching are worth following.


This entry was posted on Monday, February 23rd, 2009 at 6:14 pm and is filed under viewpoints.

2 Responses so far to "Alex Wright on the deep web in the New York Times"

  1. 1 Kris Meister
    February 24th, 2009 at 12:58 am  

    Does it make anyone else uncomfortable that Wright describes the process as “analyzing the contents of database”? It should more clearly be described as submitting forms to find additional content not otherwise accessible to search engines.

    Though I applaud you, Sol, for not buying into Wright’s database analytics nonsense.

    The idea of a search engine having direct access to a site’s database is absurd, not to mention that databases are usually password protected and very secure.

  2. 2 Marcus Banks
    February 24th, 2009 at 12:56 pm  

    Thanks Sol, good overview of all the issues at stake and the main players. I wish Wright had mentioned librarians in his piece, but that’s more a problem for librarians to increase our visibility than it is for him.
