Abe and I were recently discussing the federation of large numbers of sources and the question came up: “What would it take for a single application to federate hundreds or even thousands of sources?” The conversation turned to a discussion of an approach that this blog’s sponsor, Deep Web Technologies (DWT), had developed to federate a number of federated search applications. The discussion of this “divide and conquer” approach inspired this article. You can read more about the ideas discussed here in two of Abe’s presentations:

I should note that DWT’s approach is not the norm and that large-source scalability is not something many customers need to be concerned with today. But I do believe we’ll see more federated search applications searching greater numbers of sources in the years to come.

In a nutshell, DWT federates a number of federated search applications, which may, in turn, federate other such applications. If ten federated search applications each search thirty sources, then, by federating the results from those ten applications, DWT (or anyone else implementing this approach) can search 300 sources at once. While DWT’s approach may appear difficult to implement, it is not, and it can provide source scalability where scalability might not otherwise be possible.
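To make the idea concrete, here is a minimal sketch in Python of what “treating a federation as a single source” might look like. The class and function names are hypothetical, invented for illustration; they are not DWT’s API.

```python
# Hypothetical sketch: a federator that treats each downstream
# federated search application as just another "source".

class Source:
    """One searchable endpoint: a native source or a whole federation."""
    def __init__(self, name, search_fn):
        self.name = name
        self.search_fn = search_fn  # callable: query -> list of result dicts

    def search(self, query):
        return self.search_fn(query)

def federate(query, sources):
    """Search every source and merge the results into one list."""
    merged = []
    for source in sources:
        for result in source.search(query):
            result["origin"] = source.name  # remember where it came from
            merged.append(result)
    return merged
```

Because a federation exposes the same `search` interface as a single source, ten federations of thirty sources each look like ten sources to the top-level federator, yet the merged result set effectively covers 300 underlying sources.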

Before we consider what it takes to make “federation of federation” work, let’s look at the approach in action.

Scitopia.org, Science.gov and OSTI’s Eprint Network are themselves federated search applications. While DWT built the search engines for all of the “federating” and “federated” applications I just mentioned, that’s not the point. Any vendor can implement this approach by treating a federation of sources as a single source, provided the federation can be searched from a single interface, via screen scraping or through a vendor-provided interface. The advantages DWT has are that it avoids having to screen scrape, which is becoming increasingly difficult, and that it can enhance the federation when working with two or more of its own applications. If industry standards were developed, other vendors could benefit from these enhancements as well.

Simple federation of federation is straightforward to implement but anyone developing such an approach will quickly discover that they want three enhancements:

  1. The ability to know which sources are available from a federated search application.

  2. The ability to federate only a subset of sources from a federated search application.

  3. The ability to search the federated application asynchronously, with results returned as they are received rather than all at once.

The first two enhancements are desirable because they allow one application to federate several applications that may search some of the same sources, without needing to know in advance which sources are duplicated. Searching a single source twice because two of the federated applications include it would be wasteful and would create unnecessary deduplication effort.

The first two enhancements are also desirable because someone may want to build an application that only searches a subset of sources from the federated application. Or, perhaps the application administrator wants to disable a non-functioning source that’s searched through the federation. An API from the federated application can provide for dynamic identification of supported sources and for selective searching. (DWT provides an API for source selection.)
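The two capabilities above can be sketched in a few lines of Python. This is an illustration of the idea, not DWT’s actual API; the class name, the connector dictionary, and the method names are all assumptions made for the example.

```python
# Hypothetical sketch of the first two enhancements: enumerating a
# federation's sources and searching only a requested subset.

class FederatedApp:
    def __init__(self, connectors):
        # connectors maps a source id to a callable: query -> list of results
        self.connectors = connectors

    def list_sources(self):
        """Enhancement 1: report which sources this federation can search."""
        return sorted(self.connectors)

    def search(self, query, source_ids=None):
        """Enhancement 2: search all sources, or only a chosen subset."""
        ids = self.list_sources() if source_ids is None else source_ids
        results = []
        for sid in ids:
            results.extend(self.connectors[sid](query))
        return results
```

A federating application could then compare the `list_sources()` output of two federations, spot the overlap, and exclude the duplicated sources from one of its searches, which avoids the wasteful double search described above.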

The third enhancement allows for presenting results incrementally, or for timing out a search of the federated application without losing all of its results. Without this functionality, the federating application would have to either wait for the federated application to return all of its results or, if that takes too long, return no results from any of the sources in the federated application. Neither option is good: waiting for a slow federated application degrades the performance of the federating application, and cutting off the search too soon loses results from a number of sources. A design that allows the federated application to stream results as it receives them from the sources it is searching handles both problems. (DWT will have this functionality soon.)
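One simple way to realize this streaming behavior is with worker threads and a shared queue, as in the hypothetical sketch below. The function name and the shape of the `sources` argument are assumptions for illustration; real implementations would stream over HTTP rather than in-process threads.

```python
# Hypothetical sketch of enhancement 3: yield results from each source as
# they arrive, so a slow federation can be timed out without losing the
# results already received.

import queue
import threading
import time

def stream_search(query, sources, timeout):
    """Yield (source_name, results) pairs as sources respond, until timeout."""
    q = queue.Queue()

    def worker(name, search_fn):
        q.put((name, search_fn(query)))

    for name, fn in sources.items():
        threading.Thread(target=worker, args=(name, fn), daemon=True).start()

    deadline = time.monotonic() + timeout
    received = 0
    while received < len(sources):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timed out: keep whatever was already streamed
        try:
            yield q.get(timeout=remaining)
            received += 1
        except queue.Empty:
            break  # nothing more arrived in time
```

The key property is that the deadline cuts off only the stragglers: results from fast sources have already been yielded to the caller by the time the timeout fires.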

Lest you think that only DWT applications can participate in this federation of federations, DWT-built WorldWideScience federates the non-DWT Vascoda application. And, if you’re concerned that DWT will not allow other vendors to federate its applications, the decision is in the hands of the owner of the application. DWT will provide documentation to application owners who are willing to be federated by other applications.

Federation of federation has an interesting benefit that results from the grouping of sources. In the example I cited earlier of ten applications each federating thirty sources, the application that federates the ten applications has access to 300 sources but only has to maintain ten connectors. This is a tremendous time saver since the effort involved in maintaining connectors is so great.

I should note that processing asynchronous results from the federated application does introduce complexity, particularly if the federating application displays results incrementally. When the federated application returns all of its results at once (synchronously), it can rank the full set of candidate results and return only the most highly ranked ones to the federating application. When that same federated application returns results as it gets them from the sources it federates (asynchronously), it can’t rank the full set because it hasn’t seen the full set yet. So it returns a set of results early in the search, and it may have more results to return a few seconds later.

How does the federated application tell the federating application that some of the new results are more relevant than some of those it sent a few seconds before? How does the federating application deal with the fact that it just displayed results to the user which are now less relevant than some of the new ones? If it keeps merging in the new results from the federated application, it will end up with disproportionately many results from that application. If it re-ranks and eliminates the least relevant of them, it may need to remove results the user has already seen before he asked for the next incremental display. There’s no easy solution to this problem.
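There is no clean answer, but one common compromise can be sketched: never remove results the user has already seen, and re-rank each new batch only against the not-yet-displayed pool. The function names and the `score` field are hypothetical, chosen for the example.

```python
# Hypothetical sketch of one compromise for the incremental-ranking
# problem: displayed results are frozen; new batches compete only with
# the results that have not been shown yet.

def merge_batch(displayed, pending, new_results, key=lambda r: -r["score"]):
    """Fold a new batch into the pending pool; displayed stays untouched."""
    pending = sorted(pending + new_results, key=key)
    return displayed, pending

def show_next(displayed, pending, n):
    """Move the n best pending results onto the end of the displayed list."""
    return displayed + pending[:n], pending[n:]
```

The trade-off is exactly the one described above: the displayed list stays stable for the user, at the cost that a late-arriving, highly relevant result can only appear below earlier, less relevant ones.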

I believe that federation of federations is going to be more important as federated search evolves because I believe that users will demand the ability to search larger numbers of sources. The ability to search thousands of sources in parallel is already here but no standards exist to facilitate this divide-and-conquer approach to large scale federated search. Without standards, those wanting to federate federated search engines are usually forced to screen scrape. With the increasing use of AJAX in federated search applications, screen scraping is becoming difficult, if not impossible, in many cases. So, we will need those standards sooner or later if we’re going to divide and conquer.



This entry was posted on Tuesday, September 16th, 2008 at 3:18 pm and is filed under technology, viewpoints.

4 Responses so far to "Divide and conquer: federating many sources"

  1. Jonathan Rochkind
    September 16th, 2008 at 6:42 pm

    It’s not clear to me why that multi-tiered federated approach would be necessary with modern hardware and OSs. What’s the difference between doing that and just having lots of threads/forks? Is your merging of result sets REALLY so CPU intensive that you need more than one machine? And if you do need more than one machine, wouldn’t some other method of “transparent” multi-server distribution (single tier, but distributed across various servers) under the covers be easier than a multi-tiered approach?

    It also continues to be curious to me that SerialSolutions’ federated search product claims to be able to search across 200 sources at once. I haven’t investigated it enough to know if there’s a hidden gotcha to that claim, or if it really is what it says it is.

  2. Sol
    September 16th, 2008 at 7:54 pm

    Jonathan - I do have some thoughts in response to your comments but I’d like to wait a day or two and see what other comments people post and then I’ll respond to all of them.

  3. Sol
    September 22nd, 2008 at 9:01 am

    Jonathan - I have a draft of a response to your comment that I’m fine tuning and will publish soon as a blog article. I’ve not forgotten you.

  4. Stephan Schmid
    September 23rd, 2008 at 1:07 am

    In my opinion, scalability is mainly a matter of the software architecture; if it scales nearly linearly, it can grow as needed. In my experience, I generally agree with Jonathan that it’s mostly simpler to go for a single-tier approach. A multi-server distribution can make sense for sources that are geographically remote and form a topical unit.

    Some years ago I ran a test with 250 test sources (HTTP and JDBC) that were queried multithreaded (on a rather old P4 machine running Linux). Up to four concurrent users worked pretty well; today, with fast multicore CPUs, it should run even better. For high-volume solutions I would in any case try multiplexed, non-blocking I/O.
