This is the third installment of my interview with Michael Bergman. The interview started here. Today, Michael answers a variety of questions about BrightPlanet, CompletePlanet, federated search and the deep Web.
11. What approaches did BrightPlanet utilize over time to make content searchable?
All material harvested by BrightPlanet technology was fully indexed and searchable with a very capable text and query engine. In fact, we often talked of bringing content up to a “highest common denominator” when managed by BrightPlanet because we could apply any technique possible to the local full-text index.
For example, very early on we had metadata or faceted indexing and searching. We could even support standard queries with regex-type expressions in combination with standard Boolean searches and with faceted additions.
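As a rough illustration of what combining Boolean terms, a regex-type expression and a facet restriction in one query can look like, here is a minimal sketch. It is not BrightPlanet's engine; the `Doc` class, the `matches` function and the `lang` facet are hypothetical names invented for this example, and Boolean terms are matched as simple case-insensitive substrings.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical indexed document: full text plus facet metadata."""
    text: str
    facets: dict = field(default_factory=dict)


def matches(doc, all_terms=(), any_terms=(), none_terms=(), pattern=None, facets=None):
    """Return True if doc satisfies the Boolean, regex and facet conditions."""
    text = doc.text.lower()
    # Boolean AND: every term must appear (substring match, an assumption).
    if not all(term.lower() in text for term in all_terms):
        return False
    # Boolean OR: at least one term must appear, if any were given.
    if any_terms and not any(term.lower() in text for term in any_terms):
        return False
    # Boolean NOT: none of these terms may appear.
    if any(term.lower() in text for term in none_terms):
        return False
    # Regex-type expression applied to the original text.
    if pattern is not None and re.search(pattern, doc.text) is None:
        return False
    # Faceted addition: every requested facet must match exactly.
    for key, value in (facets or {}).items():
        if doc.facets.get(key) != value:
            return False
    return True


docs = [
    Doc("Deep Web harvest report 2004", {"lang": "en"}),
    Doc("Deep Web rapport 2003", {"lang": "fr"}),
    Doc("Surface Web crawl report 2005", {"lang": "en"}),
]
hits = [
    d for d in docs
    if matches(d, all_terms=["deep", "web"], pattern=r"\b200[4-9]\b", facets={"lang": "en"})
]
```

Here only the first document survives all three kinds of condition: the second fails the regex and the facet, and the third fails the Boolean terms.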
The real approach, of course, was providing access to any searchable database. At our peak, we had 70,000 standard sources and automated facilities that let our enterprise customers extend those targets at will. I know that many of our customers accessed databases in all sorts of weird languages and in all remote corners of the world and dealt with the returned results in a single, powerful, canonical way.
12. Did you consider real-time federated search as a content access approach for BrightPlanet? Why or why not?
Yes, definitely. In fact, Mata Hari and the Lexibot were themselves real-time federated search tools. We had also developed a 10-point tracking scheme of the search progress, displayed with entertaining speedometers, bar charts and other visual feedback while the downloads and processing were occurring.
The Deep Query Manager, however, marked a shift to a collaboration platform amongst knowledge workers. Since it was now server-based, sharing, scheduling, and the management and analysis of results (document) sets became the focus. While the DQM has a ‘Quick Search’ option that is basically a real-time federated search, the emphasis definitely shifted to collaboration.
The scheduler also became essential to the DQM as tools for monitoring and difference reporting (what is new? what has changed? how has it changed?) became more prominent demands from our user base.
13. CompletePlanet boasts “over 70,000 searchable databases and specialty search engines.” In February of 2004, according to the Wayback Machine, that number was as high as 103,000. How did you build this catalog? How did you select the sites? Did you have tools to identify search engines?
Well, over time, I would like to think we became quite sophisticated in the building and organizing of CompletePlanet. In its first build, which occurred about August 2000, we began with a hand-compiled listing of a few hundred high-quality link listing sites. These were sites we had been assembling for a few years, and represented some of the key reference points on the Web. Like Yahoo! and early directory sites, there were hundreds of these valuable resource hubs on the early Web.
We also did directed queries to search engines and controlled crawls from such seeds. At one time we also allowed site submissions on the CompletePlanet site. That grew to over 10,000 submissions per day, with only a minuscule number meeting our searchable database qualification criteria (yields were generally about 0.05% or lower, which works out to roughly five or fewer qualifying sites per 10,000 submissions). We added cross-link analysis amongst sites to direct specific crawls as well.
These techniques for candidate identification were matched with candidate qualification and then query dialect detection and translation. Along with qualification, these also became more sophisticated over time. For example, many earlier-qualified sites failed because of slight changes in the site’s search form URL or string. Directed crawls from these failed sites proved quite effective in re-finding the correct search form.
Then, after qualification, the placement of these sites into CompletePlanet was guided by the same categorization system underlying the DQM Publisher module.
Your sharp eyes caught something: the 103,000 search site number in 2004 represents a point prior to our better duplicate and spam identification. Our reported search site numbers were 70 K prior to that point, then increased to 103 K, and subsequently dropped back to 70 K, but we never really explained the nuances. Our ability to reduce the reported count from 103 K to 70 K sources came about from improved duplicate and spam removal. The fact that before and after the high-water mark we had 70 K counts was pure serendipity; the latter 70 K were expanded and cleaner than the earlier 70 K that had duplicates.
At any rate, I fear CompletePlanet may be fading into the sunset since I don’t think its listing is being actively maintained.
14. Would you explain how Deep Query Manager is able to search 70,000 databases at once?
(Grin.) Technically the Deep Query Manager can search 70,000 databases at once, but I don’t know why anyone would want to. As specialty sites, most of those 70,000 (or others that the customer might add) would only pertain to a minority of queries or topics; if you included them in most harvests you would only waste bandwidth. In practice, the largest monitoring pool I was aware of was about 4,000 newspaper sites across the globe.
So the first answer to this question is the need to be able to assign and manage or discover particular searchable databases that might be relevant to a given query or harvest. The DQM has some nifty search group management and sharing capabilities. It also has a facility to suggest engines based on the context of a given query or harvest. Typically, these search sites are organized and managed as a harvest profile; they can be readily changed or edited or shared.
Okay, once the target group of search sites is determined, the question shifts to how the Deep Query Manager then “talks” to each search site.
At configuration time (or when updated) a small configuration file is created for each search site. This file contains standard HTTP and URL information for getting to the site, its name and other internal characteristics and metadata, and provides the “translation” syntax for how the site supports query construction such as phrases, Boolean operators and the like.
By the way, this configuration is totally automatic, very accurate, and can be periodically applied in the background to make sure the search sites are “self-healing” with respect to retrieval effectiveness.
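To make the idea of a per-site configuration with "translation" syntax more concrete, here is a minimal sketch under stated assumptions. The actual DQM file format is not described in the interview, so the `SiteConfig` fields, the `CanonicalQuery` shape and the example.org and example.net URLs are all hypothetical. The sketch only shows one canonical query being rendered into two different site dialects.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class SiteConfig:
    """Hypothetical per-site configuration (field names are assumptions)."""
    name: str
    search_url: str           # where the search form submits to
    query_param: str          # name of the query parameter
    and_op: str = " AND "     # how this site expresses Boolean AND
    or_op: str = " OR "       # how this site expresses Boolean OR
    phrase_quote: str = '"'   # how this site delimits exact phrases


@dataclass(frozen=True)
class CanonicalQuery:
    """Hypothetical canonical query: terms joined by a single operator."""
    terms: tuple              # each term is a word or a multi-word phrase
    operator: str = "AND"     # "AND" or "OR"


def translate(query, site):
    """Render the canonical query in the dialect of one site."""
    joiner = site.and_op if query.operator == "AND" else site.or_op
    parts = []
    for term in query.terms:
        if " " in term:
            # Multi-word terms are treated as phrases.
            parts.append(f"{site.phrase_quote}{term}{site.phrase_quote}")
        else:
            parts.append(term)
    return joiner.join(parts)


def build_url(query, site):
    """Build the request URL for one site from the canonical query."""
    return f"{site.search_url}?{urlencode({site.query_param: translate(query, site)})}"


explicit_site = SiteConfig("Explicit Boolean site", "https://example.org/search", "q")
implicit_site = SiteConfig("Implicit AND site", "https://example.net/find", "terms", and_op=" ")

query = CanonicalQuery(terms=("deep web", "harvest"))
explicit_text = translate(query, explicit_site)
implicit_text = translate(query, implicit_site)
implicit_url = build_url(query, implicit_site)
```

The point of the design is that the user's query is written once, and each site's quirks live in data rather than in code, which is also what would make periodic automatic reconfiguration plausible.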
At harvest or request time in the Deep Query Manager, the internal canonical request within DQM is then translated to the dialect supported by each individual engine. Multiple search threads are spawned and the requests are issued in parallel. Requests are round-robined from a pool to prevent slow-responding sites from corking up a thread. Other timeout and response metrics govern the actual harvest as it proceeds.
As limits for the number of downloads per search site are reached, a new site is queued until all candidate sites are completed. Typical harvests take about 25 min for 40-50 included searchable databases. Very long ones (such as the newspaper one I mentioned) can take a couple of hours. Thus, one prominent harvest mode for DQM users is to schedule update runs overnight and then inspect and analyze the results when they come to work in the morning.
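The parallel harvest described above can be sketched roughly as follows. This is an illustration, not the DQM implementation: the `harvest` function, its parameters and the `fake_fetch` stand-in are hypothetical, and the caller-supplied `fetch_page` is assumed to enforce its own network timeout. Each site has at most one request in flight and goes back into the pool's queue after every page, so a slow site ties up a worker thread for only one request at a time.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def harvest(sites, fetch_page, max_workers=4, max_docs_per_site=10):
    """Fetch result pages from many sites in parallel.

    fetch_page(site, page) is a caller-supplied function (an assumption of
    this sketch) that returns a list of documents, or an empty list when
    the site has no more results.
    """
    results = {site: [] for site in sites}
    next_page = {site: 0 for site in sites}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Queue the first page of every site; the pool runs max_workers at once.
        pending = {pool.submit(fetch_page, site, 0): site for site in sites}
        while pending:
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                site = pending.pop(future)
                try:
                    docs = future.result()
                except Exception:
                    continue  # a failing site is dropped from this harvest
                results[site].extend(docs)
                if len(results[site]) >= max_docs_per_site:
                    # Download limit reached: trim and stop querying this site.
                    del results[site][max_docs_per_site:]
                elif docs:
                    # Re-queue the site's next page behind other waiting requests.
                    next_page[site] += 1
                    pending[pool.submit(fetch_page, site, next_page[site])] = site
    return results


def fake_fetch(site, page):
    """Stand-in for a real HTTP request: two documents per page."""
    total = {"a": 5, "b": 25}[site]
    start = page * 2
    return [f"{site}-{i}" for i in range(start, min(start + 2, total))]


demo = harvest(["a", "b"], fake_fetch, max_workers=2, max_docs_per_site=10)
```

In the demo, site "a" runs out of results after five documents, while site "b" is cut off at the ten-document limit.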
As document results are returned, they are fully indexed, tested against the canonical query, tested for duplicates, and then committed to the results set if they pass. Other metadata characterizations might then be applied locally independent of the actual query conditions.
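A minimal sketch of that index, test, de-duplicate and commit sequence follows. The `ResultSet` class and its methods are hypothetical names. The duplicate test here is an exact match on a hash of normalized text, which is a simplifying assumption, since the interview does not say how DQM detects duplicates. The canonical query is likewise reduced to a list of required terms.

```python
import hashlib
import re


def normalize(text):
    """Lowercase the text and collapse it to space-separated word tokens."""
    return " ".join(re.findall(r"\w+", text.lower()))


class ResultSet:
    """Hypothetical results set that commits only documents passing all tests."""

    def __init__(self, required_terms):
        self.required_terms = [term.lower() for term in required_terms]
        self.docs = {}      # doc id -> original text
        self.index = {}     # token -> set of doc ids (tiny inverted index)
        self._seen = set()  # fingerprints of committed documents

    def offer(self, doc_id, text):
        """Index, test and possibly commit one returned document."""
        normalized = normalize(text)
        tokens = set(normalized.split())
        # Test against the canonical query using the local tokens.
        if not all(term in tokens for term in self.required_terms):
            return False
        # Test for duplicates (exact match after normalization only).
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False
        # Commit: store the document and add it to the inverted index.
        self._seen.add(fingerprint)
        self.docs[doc_id] = text
        for token in tokens:
            self.index.setdefault(token, set()).add(doc_id)
        return True


result_set = ResultSet(required_terms=["deep", "web"])
first = result_set.offer("doc-1", "The Deep Web is large.")
duplicate = result_set.offer("doc-2", "the deep web is LARGE")
off_topic = result_set.offer("doc-3", "Surface web only")
```

Re-testing each document locally matters because a remote site's own search may be looser than the canonical query, so its results cannot be taken on trust.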
Results are saved to disk and time-stamped. Results sets can then be compared for monitoring or difference analysis.
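Difference analysis between two saved results sets can be illustrated with a short sketch. It assumes each results set is reduced to a mapping from a document key (for example its URL) to a content fingerprint; that representation and the `diff_result_sets` name are assumptions made for illustration only.

```python
def diff_result_sets(old, new):
    """Compare two snapshots mapping document key -> content fingerprint."""
    old_keys, new_keys = set(old), set(new)
    return {
        "added": sorted(new_keys - old_keys),      # what is new?
        "removed": sorted(old_keys - new_keys),    # what disappeared?
        "changed": sorted(                         # what has changed?
            key for key in old_keys & new_keys if old[key] != new[key]
        ),
    }


monday = {"https://example.org/a": "h1", "https://example.org/b": "h2"}
tuesday = {"https://example.org/b": "h2-edited", "https://example.org/c": "h3"}
report = diff_result_sets(monday, tuesday)
```

Answering "how has it changed?" would need the stored document text as well, for instance by running a line-based diff over the two versions of each changed document.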
15. Here’s a hard question: How big do you think the Deep Web is now and how much larger is it than the Surface Web?
Here’s an easy answer: I think it is bigger. I will leave the quantification of that question to those doing current analysis in the space.
16. Who do you think has the best quantitative data about the size of the Deep Web today?
I certainly see occasional academic papers. There is a group from the University of Illinois that seems to be doing much in this area.
Here are a couple of sources that probably offer better guidance than I can, given my own recent inattention to this question: 1) Maureen Flynn-Burhoe’s timeline of the deep Web and its references; and 2) Denis Shestakov’s Ph.D. thesis from earlier this year, which has a bibliography of some 115 references.
Next week we complete the interview series.