I’ve been getting interested in the history of the deep web. I think it’s important to know our roots as a federated search community, and those roots are in the early exploration of the deep web. Plus, without the deep web I don’t think there would be much of a federated search industry, although some may argue otherwise. (See my article, Federated search and the deep web, for further discussion of the relationship between the two concepts.)

This blog’s federated search luminary series helps to document history; it aims to acknowledge the individuals who contributed to shaping the industry, including those who have influenced how we mine deep web content today. The series currently honors two luminaries; read on to learn who the third luminary will be.

Michael Bergman, probably best known for his BrightPlanet white paper in which he quantifies the size of the deep web, has played an important role in shaping federated search history. I deem the single contribution of the white paper to be important enough that I’ve chosen Mr. Bergman as the third federated search luminary in the series, joining Kate Noerr and Todd Miller. I must note that Bergman’s contributions are much greater than a single white paper and, most recently, include major contributions to the understanding and analysis of the semantic web.

Bergman discusses a number of significant and recent documents pertaining to the deep web in a recent article on his AI3 blog, New Currents in the ‘Deep Web’. He also answers one of my burning questions: did he indeed coin the term “deep web”? It turns out he co-coined the term. He goes on to give kudos to Maureen Flynn-Burhoe for her recent publication, on her blog, of a very thorough timeline of deep web events. The timeline is the fruit of some serious research. It goes back to 1980 and includes an extensive bibliography. My one complaint: virtually no coverage of federated search vendors, a number of whom helped to popularize the deep web.

Bergman continues with a discussion of an article, Searching the Deep Web, by Alex Wright in the October 2008 edition of “Communications of the ACM.” The article, to which Bergman contributed, articulates well the differences between crawling the surface web and indexing the deep web. Additionally, the ACM article discusses new developments in technologies to mine the deep web and the emerging semantic web. Wright packs a remarkable amount of information into two pages. There are a good half dozen topics that warrant further exploration in those two pages.

Bergman next brings to our attention a recently published paper, “Toward the Semantic Deep Web,” by James Geller, of the New Jersey Institute of Technology, and colleagues Soon Ae Chun and Yoo Jung An. The paper introduces the semantic deep web as the fusion of some aspects of the semantic web with ontologies to mine deep web data.

Finally, Bergman provides a short bibliography of his own, including a reference to the 153-page 2008 Ph.D. thesis of Denis Shestakov, Search Interfaces on the Web: Querying and Characterizing. Of particular interest to those following federated search technologies is the dissertation’s description of an architecture for automatically finding deep web search interfaces, as well as its discussion of approaches to automatically submitting queries and extracting results from the queried sources.

I’m delighted to see an increased focus on the deep web. To the extent that happenings in the deep web impact federated search, I will happily cover them.

