This is the second installment of my interview with Michael Bergman. The first part is here. Today, Michael tells us about the early days at BrightPlanet and about his pioneering work with the deep Web. Plus, he shares the origin of the term “deep Web.”

Michael is the third person I honor in this luminary series. You can read prior interviews with luminaries Kate Noerr and Todd Miller in the blog archives.

5. What were the early days of BrightPlanet like? What were the challenges and successes?

Exciting, exhilarating, exhausting and excruciating. And, sometimes all of those at once!

On the challenges side, I would have to put forward business model and financing (which are obviously related).

Like many that survived the shakeout of the dot-com era, BrightPlanet went through a number of revenue and business model course changes including consumer product, enterprise platform, telesales, services, you name it. One key lesson for me today has been to put business model first and foremost in planning and positioning. I’ve seen enough of them now to be pretty skeptical about most models.

Also, while BrightPlanet was quite successful in venture financing, every single step of the way was a struggle. We had deals fall through at the last moment, dishonesty at times from some VCs, bait-and-switch negotiating tactics, and the need to accept unpleasant terms that go by the phrase “taking a haircut.” On the other hand, BrightPlanet was also blessed with many stalwart and supportive angels and local investors.

So, while your mileage may vary, as do real needs for capital to finance growth, I would suggest trying to avoid financing for development alone. I humbly think that putting revenue ahead of financing is a better formula for understanding and satisfying customer demand.

I have not been involved with BrightPlanet now for two years, but on my watch the successes for BrightPlanet resided in moving from a consumer commodity product, the Lexibot (which was the re-branded Mata Hari after we formed BrightPlanet), to an enterprise knowledge management platform with the Deep Query Manager and its Publisher module that automates categorized placements into a portal. Much was learned about automation, and the extension into post-harvesting analysis, along with support for multiple languages, encodings, and file formats, really expanded the DQM’s scope.

There was also considerable success (and, at times, frustration!) in working with the intelligence community, of which more can’t really be said.

6. A recent timeline of the Deep Web makes this statement: “Shestakov (2008) cites Bergman (2001) as the source for the claim that the term deep Web was coined in 2000.” Do you agree with the statement? Did you coin the term “deep Web?” If so, was there some other term you were considering instead? If not, who did coin the term?

Yes, yes, yes and no one else.

Thane Paulsen, my co-founder at BrightPlanet, and I settled on the ‘deep Web’ term over one memorable dinner at a Sioux Falls restaurant. We had been tossing around a number of options as I was nearing completion of my study and white paper. We just found ourselves using ‘deep Web’ as the natural phrase in our own excited discussions. From that moment, the term just seemed correct and felt right.

By the way, that was in summer 2000. We self-published the white paper in late July, which all of us clearly remember. The day we first released the study we were picked up by CNN and put on the screen “crawl”. We had to add nine servers that day to handle the demand and eventually got covered by 400 news outlets or some number like that. Later, Judith Turner requested to publish the study and did a fine editing job, with some updated analysis from me, for more “formal” publication in 2001.

7. Your BrightPlanet white paper, “The Deep Web: Surfacing Hidden Value,” has become a classic source of quantitative information about the deep Web. As you know, your white paper has been quoted all over the Internet. Looking back, is there anything you would change in the methodology you used to estimate the size of the Deep Web?

Well, thank you for the compliment.

Looking back, no, I do not think I would or could do anything different. We used the Lawrence and Giles “overlap analysis” methodology, looked at large sites, looked at small sites, and combined multiple analytic techniques. For its time, and with the tools available, it was pretty thorough, if I may say. We also tried to report results as ranges. I think, for the reasons noted below, that perhaps others could have done a similarly thorough job and yet come up with different range results.
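
For readers unfamiliar with overlap analysis, the underlying idea is a capture-recapture estimate: if two independent sources each cover part of a collection, the size of their overlap suggests how big the whole collection is. The sketch below illustrates only that general idea in Python; the function name and the figures are hypothetical and are not drawn from the BrightPlanet study.

```python
# Minimal sketch of the capture-recapture idea behind "overlap analysis".
# The numbers below are hypothetical, not those from the 2000 study.

def overlap_estimate(size_a: int, size_b: int, overlap: int) -> float:
    """Estimate total collection size from two independent samples.

    If source A covers size_a items, source B covers size_b items, and the
    two share `overlap` items, then (assuming independence) roughly
    overlap / size_a ~= size_b / N, so N ~= size_a * size_b / overlap.
    """
    if overlap == 0:
        raise ValueError("No overlap between samples; estimate is unbounded.")
    return size_a * size_b / overlap


if __name__ == "__main__":
    # Hypothetical example: engine A indexes 1.0B pages, engine B 0.8B pages,
    # and 0.2B pages appear in both indexes -> estimated total of 4.0B pages.
    total = overlap_estimate(1_000_000_000, 800_000_000, 200_000_000)
    print(f"Estimated total collection size: {total:,.0f} pages")
```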

Now, were I to re-do the analysis today with better tools and better understanding, I do think there could be a considerable narrowing of the results ranges and indeed better estimates for the ranges. For instance, the Deep Query Manager is a much, much more capable platform than the Lexibot we used for some parts of the study at that time. In fact, there are quite a few academic studies that have done admirable jobs in recent years.
Qualitatively, today, I would say that the importance and size of foreign-language deep Web sites are being hugely underestimated, while perhaps the English ones I concentrated on have more duplication than I earlier understood.

At a couple of points I was tempted to re-do the analysis; I updated the figures once and had much internal analysis I never published. But, in the end, my job was to run a company and not write academic papers, so I never took the time to complete and publish them.

8. Your white paper concluded that, among other things, deep Web documents were, on average, three times better, quality-wise, than surface Web documents. Can you elaborate on the linguistic techniques you used to compute document quality, and do you have a sense of what that figure might be today?

The scoring techniques were those in the Lexibot at that time. All Lexibot scoring techniques (also later in the DQM) are standard information retrieval (IR) statistical techniques or derivatives that originated with Gerard Salton. One specific method used in the study was the vector space model (VSM); another was a modification of the Extended Boolean information retrieval (EBIR) technique, based on proprietary BrightPlanet modifications that adjust for query term occurrence and normalize for document size. The latter consideration is important for much deep Web content since it tends to be smaller and more structured.
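
As a rough illustration of the standard vector space model mentioned above, here is a textbook TF-IDF scorer with cosine (length) normalization, showing one common way to keep shorter, more structured documents from being penalized simply for their size. The BrightPlanet modifications were proprietary and are not reproduced here; the toy documents, query, and function names are illustrative assumptions only.

```python
# Generic vector space model (VSM) sketch: TF-IDF weights with cosine
# normalization. A textbook illustration only, not the proprietary
# Lexibot/DQM scoring described in the interview.
import math
from collections import Counter

def tf_idf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Build a length-normalized TF-IDF vector for each tokenized document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: (1 + math.log(c)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})  # cosine-normalize
    return vectors

def score(query_terms: list[str], doc_vector: dict[str, float]) -> float:
    """Sum the normalized weights of the query terms present in the document."""
    return sum(doc_vector.get(t, 0.0) for t in query_terms)

if __name__ == "__main__":
    # Toy, illustrative documents and query.
    docs = [
        "deep web content is structured and domain specific".split(),
        "surface web pages with general content and more content".split(),
        "a small deep web database of structured records".split(),
    ]
    vectors = tf_idf_vectors(docs)
    for i, vec in enumerate(vectors):
        # The first document should score highest for this query.
        print(f"doc {i}: {score(['deep', 'web', 'content'], vec):.3f}")
```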

I think that the estimate for higher quality for the deep Web, if anything, is perhaps higher today. My intuition is that there is less duplication and spam in the deep Web, more domain-specific content, and content that is better structured and characterized (metadata).

9. For the benefit of readers, would you explain the distinction you make between the “invisible Web” and the “deep Web?”

Hahaha. Well, when we first chose the name, our issue was simply that we felt ‘invisible Web’ was incorrect. There was nothing invisible at all about the content; it was just that getting to it was different, or that it was harder to discover than the “surface” content indexed by search engines. Indeed, I don’t think either of us (the proponents of the ‘deep Web’ moniker or the proponents of the ‘invisible Web’ one) had terribly sophisticated arguments or distinctions when that terminology war was raging.

In recent years I have been amused to see all sorts of high-falutin’ distinctions about ‘invisible Web’ v. the deep Web. Frankly it all seems rather silly to me and straining at gnats.

I actually think that another phrase, ‘hidden Web,’ could perhaps be a legitimate alternative. But, obviously, we prefer deep Web, and the trend and the Wikipedia treatment seem to be in that direction.

10. In 2001, the same year that you published your white paper, Gary Price and Chris Sherman wrote a book: “The Invisible Web: Uncovering Information Sources Search Engines Can’t See.” Was there some overlap between their work and yours?

Actually, as I noted above, it was just about a year later, since our first deep Web white paper came out in summer 2000.

I had been following Gary Price for some time through his great ResourceShelf compilation and Lists of Lists while he was a librarian at George Washington University, and I knew Danny Sullivan from his Search Engine Watch and from having been a speaker at his conferences. As with John Battelle today, there was a small community of search engine watchers and researchers, and so there was much shared information. A subscription to Search Engine Watch was de rigueur.

There was no collaboration, however, with their book. I wrote a review at the time, feeling they were being overly conservative about the size of the deep Web and had no methodology behind their conclusions (or at least none was presented in their book). But their credentials on the topic were impeccable, and I actually don’t think anyone really knows how big the deep Web is.

Stay tuned for Part III of the interview next week.

