I’m incubating a white paper about the Deep Web. The Deep Web is all that content on the web (more than 99% of it) that Google can’t find by crawling, right? It’s all that stuff that lives inside databases and can only be found by filling out forms, right? The main value-add of Deep Web search engines is that they find only Deep Web documents, right? Not all that long ago I would have answered “yes” to all these questions. Today I’m confused.
Today I was chatting with Darcy from (blog sponsor) Deep Web Technologies’ marketing department about the white paper. I’ll refer to her as Deep Web Darcy. Well, Deep Web Darcy was asking me some rather “deep” questions about the Deep Web. We discussed harvesting, crawling, indexing, Deep Web searching, and so much more. If someone’s Deep Web content finds its way to Google, has that content become surfaced, and does it no longer qualify as buried treasure? If one’s Deep Web content can be harvested, is it not really Deep Web content? If someone is browsing that content in the forest, with only one hand on the keyboard, does that content make a sound? So many koans. So little time. My brain hurts.
I suspect that the trend for content publishers to create sitemaps so the big crawlers can surface their Deep Web content will only continue to grow. More pages indexed by Google means more people will find your documents. Who wouldn’t want more traffic to their sites? If a federated search engine searches a number of Deep Web sources whose content is also available through Google, then what value does the federated search engine provide? Lots. In particular, you won’t dredge up lots of junk along with the Deep Web treasures when you go with federated search. Federated search adds tremendous value when it selects in the good sources and selects out everything else. But when Deep Web sources are delivering the same documents you can get from Google, can one still say that the value of narrowing your search is enough?
Deep Web Darcy asked the same annoying question about harvestable content. If someone gets all your Deep Web documents via some harvesting protocol, e.g. OAI-PMH, and makes them all available on the Web, is that content still part of the Deep Web just because it has a search interface in front of it?
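For anyone who hasn’t seen harvesting up close, here is a minimal Python sketch of what pulling records over OAI-PMH looks like. The `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespace come from the OAI-PMH 2.0 protocol; the repository address `https://example.org/oai`, the function names, and the sample response are my own made-up illustrations, not any real archive.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (base_url is a hypothetical repository)."""
    if resumption_token:
        # When resuming, the token is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# A made-up, heavily trimmed response, just enough to exercise the parser.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE)
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

A harvester just keeps requesting the next URL until no token comes back, at which point it holds every record in the repository without ever filling out a search form. Which is exactly why Darcy’s question stings.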
Is Google part of the Deep Web? Google is searched through a form, as well as through an API. Google and its collections (News, Images, Blog Search) are sometimes included in federated search applications alongside “real” Deep Web sources.
Deep Web Darcy had even more questions. Unfortunately (for her) I suddenly realized that my car desperately needed to be washed before the impending rainstorm and I had to terminate the very interesting (to her) conversation. Perhaps these paradoxical questions will leave me so confused that I’ll become enlightened. Perhaps if I had continued the interesting (to her) conversation my car would have washed itself.
Your thoughts on this important matter? Please enlighten me.