[ Note: Two very large prime numbers were discovered recently, one last month and the other early this month. These primes, known as Mersenne primes, were discovered via a "divide and conquer" approach, validating the distributed search approach for fields as distant as federated search :) ]

Jonathan Rochkind and Stephan Schmid left comments in response to my article on federating large numbers of sources. I’d like to respond to a part of Jonathan’s comment as well as to a piece of Stephan’s comment.

Jonathan wrote, in part:

It’s not clear to me why that multi-tiered federated approach would be necessary with modern hardware and OSs. What’s the difference between doing that, and just having lots of thread/forks? Is your merging of result sets REALLY so CPU intensive that you need more than one machine? And if you do need more than one machine, wouldn’t some other method of “transparent” multi-server distribution (single tier, but distributed across various servers) under the covers be easier than a multi-tiered approach?

Stephan wrote, in part:

In my opinion, the scalability is mainly a matter of the software architecture, and if it scales nearly linearly, it can grow as needed. With my experience, in general I agree with Jonathan that it’s mostly simpler to go for a single tier approach. A multi-server distribution can make sense for sources that are geographically remote and build a topical unit.

Here’s my response but, first, a disclaimer. My thinking is based on the Deep Web Technologies federated search architecture. I’m not claiming that it’s incapable of scaling to search a large number of sources from a single application running on a single host, only that there are situations in which it’s desirable to perform tiered searching. Other vendor architectures may support scalability in different ways.

  1. Multi-tiered federated search is most practical when the federated search engines that you’re going to federate already exist as standalone applications.

  2. In the Deep Web Technologies (DWT) examples I provided, DWT didn’t have to expend much effort to create the federation of federations, and it didn’t need to modify its product to create the hierarchy.

  3. Up to some particular number of sources, you certainly could search them all from one machine with one application. What is the limit? Is it 500 sources? Is it 1000 sources? The divide and conquer approach is for when you exceed that limit, whatever you discover it to be.

  4. There are situations in which federated search applications could be placed geographically close to a number of content sources to facilitate rapid data transfer of search results. This was one of Stephan’s points.

  5. In some cases, there may be value in creating a number of federated search applications even if one could create a single application to federate all of the sources. Consider the case of wanting to federate a very large number of sources. Assume that sources naturally fit into a number of categories. By distributing the sources into federations you allow for the creation of standalone applications. If I’m going to build a science-oriented federated search application with 1000 sources, divided into ten disciplines with 100 sources per discipline, I may want to build ten separate applications even if I could, technically, build just one and have different search pages depending on which scientific discipline a user wants to search.
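To make the "federation of federations" idea concrete, here is a minimal Python sketch. It is not DWT's implementation; the names `make_source` and `federate`, the score-based merge, and the sample sources are all assumptions made for illustration. The point it tries to show is that a federation exposes the same search interface as a single source, so existing federations can be stacked into a hierarchy without modification.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Result = Tuple[float, str]                # (relevance score, title); higher is better
Searcher = Callable[[str], List[Result]]  # anything that answers a query


def make_source(name: str, scores: List[float]) -> Searcher:
    """Hypothetical stand-in for a remote content source (no network calls)."""
    def search(query: str) -> List[Result]:
        return [(s, f"{name}: {query} #{i}") for i, s in enumerate(scores)]
    return search


def federate(children: List[Searcher], top_n: int = 10) -> Searcher:
    """Build a searcher that queries its children in parallel and merges
    their results by score. The returned searcher has the same signature
    as a source, so a federation can itself be a child of another one."""
    def search(query: str) -> List[Result]:
        with ThreadPoolExecutor(max_workers=max(1, len(children))) as pool:
            result_lists = list(pool.map(lambda child: child(query), children))
        merged = [r for results in result_lists for r in results]
        return heapq.nlargest(top_n, merged)
    return search


# Two standalone, discipline-level federations (first tier) ...
physics = federate([make_source("physics-a", [0.9, 0.4]),
                    make_source("physics-b", [0.7])])
biology = federate([make_source("biology-a", [0.8, 0.2])])

# ... combined into a second tier without changing either one.
science = federate([physics, biology], top_n=3)

for score, title in science("neutrino"):
    print(score, title)
```

In this sketch each tier only passes its top results upward, so the top tier merges a handful of short lists rather than one list per underlying source. A real deployment would also need timeouts, deduplication, and a way to reconcile relevance scores across sources, none of which are modeled here.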

Thanks, Jonathan, for the question, and thanks to both Jonathan and Stephan for your comments. I hope my answer is helpful to you and to others.



This entry was posted on Wednesday, September 24th, 2008 at 7:48 am and is filed under technology, viewpoints.
