I’m incubating a white paper about the Deep Web. The Deep Web is all that content (more than 99% of the web) that Google can’t find by crawling, right? It’s all that stuff that lives inside databases and can only be found by filling out forms, right? The main value-add of Deep Web search engines is that they find only Deep Web documents, right? Not all that long ago I would have answered “yes” to all these questions. Today I’m confused.
Today I was chatting with Darcy from (blog sponsor) Deep Web Technologies’ marketing department about the white paper. I’ll refer to her as Deep Web Darcy. Well, Deep Web Darcy is asking me some rather “deep” questions about the Deep Web. We discussed harvesting, crawling, indexing, Deep Web searching, and so much more. If someone’s Deep Web content finds its way to Google, has that content become surfaced, and does it no longer qualify as buried treasure? If one’s Deep Web content can be harvested, is it not really Deep Web content? If someone is browsing that content in the forest, with only one hand on the keyboard, does that content make a sound? So many koans. So little time. My brain hurts.
I suspect that the trend for content publishers to create sitemaps to allow the big crawlers to surface their Deep Web content will only continue to grow. More pages indexed by Google means more people will find your documents. Who wouldn’t want more traffic to their sites? If a federated search engine searches a number of Deep Web sources whose content is also available through Google, then what value does the federated search engine provide? Lots. In particular, you won’t dredge up lots of junk along with the Deep Web treasures when you go with federated search. Federated search adds tremendous value when it selects in the good sources and selects out everything else. But when Deep Web sources are delivering the same documents you can get from Google, can one still say that the value of narrowing your search is enough?
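As an aside, here is a minimal sketch of the kind of sitemap a publisher might generate from a database so crawlers can reach pages that otherwise hide behind a search form. The base URL and record IDs are hypothetical placeholders; a real site would also want lastmod dates, sitemap index files for large collections, and a pointer from robots.txt.

```python
# Minimal sketch: generate a sitemap.xml listing database-backed pages
# so a crawler can discover them without filling out a search form.
# BASE_URL and record_ids are hypothetical placeholders.
from xml.sax.saxutils import escape

BASE_URL = "https://example.org/records"          # hypothetical site
record_ids = ["A-1001", "A-1002", "A-1003"]       # would come from your database

entries = "\n".join(
    f"  <url><loc>{escape(BASE_URL + '/' + rid)}</loc></url>"
    for rid in record_ids
)

header = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
)
sitemap = header + entries + "\n</urlset>\n"

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(sitemap)
```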
Deep Web Darcy asked the same annoying question about harvestable content. If someone gets all your Deep Web documents via some harvesting protocol, e.g. OAI-PMH, and makes them all available on the Web, is that content still part of the Deep Web just because it had a search interface to it?
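For readers who haven’t seen it, a bare-bones OAI-PMH harvest is just an HTTP GET with a verb parameter. Below is a minimal sketch in Python; the repository endpoint is a hypothetical placeholder, and a real harvester would also follow resumptionToken elements to page through large result sets.

```python
# Minimal sketch of an OAI-PMH harvest: one ListRecords request, then
# print each record's identifier and Dublin Core title.
# The endpoint URL is a hypothetical placeholder.
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

base_url = "https://repository.example.org/oai"   # hypothetical endpoint
query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})

with urlopen(f"{base_url}?{query}") as response:
    tree = ET.parse(response)

for record in tree.iter(OAI + "record"):
    identifier = record.findtext(".//" + OAI + "identifier")
    title = record.findtext(".//" + DC + "title")
    print(identifier, "-", title)
```

Once harvested this way, those same documents can be crawled and indexed like any other web page, which is exactly Darcy’s point.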
Is Google part of the Deep Web? Google is searched through a form, as well as through an API. Google and its collections (News, Images, Blog Search) are sometimes included in federated search applications alongside “real” Deep Web sources.
Deep Web Darcy had even more questions. Unfortunately (for her) I suddenly realized that my car desperately needed to be washed before the impending rainstorm and I had to terminate the very interesting (to her) conversation. Perhaps these paradoxical questions will leave me so confused that I’ll become enlightened. Perhaps if I had continued the interesting (to her) conversation my car would have washed itself.
Your thoughts on this important matter? Please enlighten me.
Tags: deep web, federated search
6 Responses so far to "If only I knew what the deep web was"
July 6th, 2009 at 7:10 pm
Sol,
The next time we are in a very interesting conversation (to me), please remember that I would be happy to search the “shallow” web for a nearby carwash so you don’t have to rush off so quickly! (smile)
I look forward to our next conversation. Perhaps we could ponder your proposed koan?
Deep Web Darcy
July 6th, 2009 at 8:19 pm
I’ve always thought ‘deep web’ was kind of a shifty concept.
Even though Google is by far the most popular free public search engine, the deep web shouldn’t be measured based only on what _Google_ has indexed, should it?
But yeah, one way or another, I think more and more of the web will become indexed by free public search engines. For the portions that aren’t — and keeping in mind that this is not an ‘existential’ category, but just an ‘accidental’ one — perhaps ‘hidden web’? With some parts of it more hidden than others?
One part of the ‘somewhat hidden web’ whose hidden nature is as much policy-based as technically-limited is for-pay licensed content.
In the academic meta-search market, providing access to this for-pay licensed material is one of the main motivators.
On the one hand, even vendors of for-pay licensed content and search indexes are trying to figure out how to expose their content to Google without giving away the cow for free. Sometimes this leads to odd arrangements where they are willing to share more with Google (for free) than they are with even their own paying customers, or to special arrangements with Google that they don’t allow other search engines. Or, on the other hand, I’m guessing these special arrangements with Google sometimes involve Google giving them special dispensation to violate a rule Google normally applies to the general public: allowing them to present a different version of a page to the Google search engine than is presented to the general public.
I’m not sure how long these kinds of weird special arrangements can go on, but they’re trying to figure out their business model, same as everyone else.
July 6th, 2009 at 8:54 pm
The Deep Web is a place. If other people get to it, that doesn’t change its location.
What has been bothering me is how to frame the discussion of the places where the Deep Web rubs up against the Semantic Web.
July 7th, 2009 at 1:46 pm
An interesting and entertaining read!
July 7th, 2009 at 6:36 pm
Of course two other areas of the Deep Web (other than “for fee” content for academics - or anybody) are enterprise databases and volatile data.
Enterprise content is kept out of the surface web by design (not always all that well), so it is never going to be “surfaced.” One could argue that the for-fee content of the publishers and aggregators is part of an enterprise data store, but it is really a “product” designed to be sold eventually. What I am talking about are the corporate internal resources: anything from HR records to secret recipes. They will stay hidden (or deep).
The volatile data is only deep because of its nature. Here today, gone in a flash. This data, by its nature, will never become part of the surface web. Although any one piece of information has a fleeting existence (the current temperature, a stock price, etc.), the data item is there long term, so it sort of flickers in and out of existence (not to get overly philosophical…). Interestingly, one category of these is search results that happen to be displayed at the exact instant one of the search engines crawls that results page. A little cunning Google research will find a few interesting OPAC pages captured and indexed in various states. And, of course, the advent of AJAX and friends means that some pages are partly static and partly volatile. And so it goes on.
One final question. Are federated search engines really associated only with the Deep Web? And if so, should they be? Or should they range more widely?
July 7th, 2009 at 10:14 pm
If the Deep Web is defined as ‘hidden’ or inaccessible, then once it becomes found and accessible, it’s no longer part of the Deep Web, right? If ‘owners’ of the deep web make some or all of their content spiderable and accessible, it’s no longer the deep web, right? To me, the Deep Web is a problem, and Google (and other search engines) are finding solutions to unearthing it; once they do, it’s no longer a problem.