Abe and I were recently discussing the federation of large numbers of sources and the question came up: “What would it take for a single application to federate hundreds or even thousands of sources?” The conversation turned to a discussion of an approach that this blog’s sponsor, Deep Web Technologies (DWT), had developed to federate a number of federated search applications. The discussion of this “divide and conquer” approach inspired this article. Abe has also covered the ideas discussed here in two of his presentations.
I should note that DWT’s approach is not the norm and that scaling to large numbers of sources is not something that many customers need to be concerned with today. But I do believe that we’ll be seeing more federated search applications searching a greater number of sources in the years to come.
In a nutshell, DWT federates a number of federated search applications which may, in turn, federate other such applications. If ten federated search applications each search thirty sources then, by federating the results from these ten applications, DWT (or anyone else implementing this approach) can search 300 sources at once. DWT’s approach may appear difficult to implement, but it is not, and it can allow for source scalability where it might not otherwise be possible.
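To make the idea concrete, here is a minimal Python sketch in which a federation exposes the same search interface as a single source, so federations can be nested. The class names (`Source`, `Federation`), the substring matching, and the sample documents are all assumptions made for illustration; this is not DWT’s implementation.

```python
from dataclasses import dataclass
from typing import List, Protocol


class Searchable(Protocol):
    """Anything that can answer a query: one source or a whole federation."""
    name: str

    def search(self, query: str) -> List[str]: ...

    def source_count(self) -> int: ...


@dataclass
class Source:
    """A leaf source holding a few hypothetical documents."""
    name: str
    documents: List[str]

    def search(self, query: str) -> List[str]:
        # Naive substring match stands in for a real search engine.
        return [d for d in self.documents if query.lower() in d.lower()]

    def source_count(self) -> int:
        return 1


@dataclass
class Federation:
    """A federated search application whose members may be sources or federations."""
    name: str
    members: List[Searchable]

    def search(self, query: str) -> List[str]:
        results: List[str] = []
        for member in self.members:
            results.extend(member.search(query))
        return results

    def source_count(self) -> int:
        return sum(member.source_count() for member in self.members)


# Ten applications, each federating thirty sources, federated by one more application.
apps = [
    Federation(
        name=f"app{i}",
        members=[Source(f"app{i}-src{j}", [f"solar energy report {i}-{j}"]) for j in range(30)],
    )
    for i in range(10)
]
top = Federation("top", apps)
```

The top-level application holds only ten members, yet a single call to `top.search(...)` reaches all 300 leaf sources.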
Before we consider what it takes to make “federation of federation” work, let’s look at the approach in action.
- WorldWideScience.org is a federated search application that searches Science.gov.
- DTIC Multisearch searches WorldWideScience.org and scitopia.org.
- The newly released Science.gov 5.0 federates OSTI’s Eprint Network.
Scitopia.org, Science.gov and OSTI’s Eprint Network are themselves federated search applications. While DWT built the search engines for all of the “federating” and “federated” applications I just mentioned, that’s not the point. Any vendor can implement this approach by treating a federation of sources as a single source, provided the federation can be searched from a single interface, either by screen scraping or through a vendor-provided interface. The advantage DWT has is that it avoids having to screen scrape, which is becoming increasingly difficult, and it can enhance the federation when working with two or more of its own applications. If industry standards were to be developed, other vendors could benefit from these enhancements as well.
Simple federation of federation is straightforward to implement but anyone developing such an approach will quickly discover that they want three enhancements:
- The ability to know which sources are available from a federated search application.
- The ability to federate only a subset of sources from a federated search application.
- The ability to perform an asynchronous search of the federated application, where results are returned as they are received, not all at once.
The first two enhancements are desirable because they allow one application to federate several applications that may search some of the same sources, without needing to know in advance which sources are duplicated. Searching a single source twice because two of the federated applications include it would be wasteful and would create an unnecessary deduplication effort.
The first two enhancements are also desirable because someone may want to build an application that only searches a subset of sources from the federated application. Or, perhaps the application administrator wants to disable a non-functioning source that’s searched through the federation. An API from the federated application can provide for dynamic identification of supported sources and for selective searching. (DWT provides an API for source selection.)
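As a rough sketch of how source discovery and selective searching might combine, the Python function below takes the source lists reported by each federated application and builds a plan in which every source is searched exactly once and disabled sources are skipped. The function name, the application names, and the source names are hypothetical; the sketch assumes only that each application can report its sources and accept a subset to search, and it does not describe DWT’s actual API.

```python
from typing import AbstractSet, Dict, List, Set


def plan_searches(
    available: Dict[str, List[str]],
    disabled: AbstractSet[str] = frozenset(),
) -> Dict[str, List[str]]:
    """Assign each source to the first application that offers it.

    `available` maps an application name to the sources it reports
    (for example, from a hypothetical "list sources" call).
    Sources named in `disabled` are never searched.
    """
    assigned: Set[str] = set()
    plan: Dict[str, List[str]] = {}
    for app, sources in available.items():
        chosen: List[str] = []
        for source in sources:
            if source in assigned or source in disabled:
                continue  # already covered by another application, or switched off
            assigned.add(source)
            chosen.append(source)
        if chosen:
            plan[app] = chosen
    return plan


# Hypothetical applications with overlapping source lists.
available = {
    "AppA": ["physics-archive", "energy-reports", "med-index"],
    "AppB": ["med-index", "patents", "chem-abstracts"],
}
plan = plan_searches(available, disabled={"patents"})
```

Here `AppB` is asked to search only `chem-abstracts`, because `med-index` is already covered by `AppA` and `patents` has been disabled by the administrator.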
The third enhancement allows for presenting results incrementally or for timing out a search of the federated application without losing all of its results. Without this functionality, the federating application would have to either wait for the federated application to return all of its results or, if that takes too long, return no results from any of the sources in the federated application. Neither option is great: waiting a long time for a slow federated application degrades the performance of the federating application, and cutting off the search too soon will cause the loss of results from a number of sources. A design that allows the federated application to stream results as it gets them from the sources it is searching handles both problems. (DWT will have this functionality soon.)
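The streaming design can be sketched with Python’s `asyncio`. In this hypothetical example, `search_source` simulates a source with a fixed response time, and `stream_results` yields each batch of results as soon as its source finishes, stopping at a deadline while keeping everything that has already arrived. The names, delays, and timeout are assumptions for illustration only.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def search_source(name: str, delay: float) -> List[str]:
    """Stand-in for one source; `delay` simulates its response time."""
    await asyncio.sleep(delay)
    return [f"{name}: hit"]


async def stream_results(sources: Dict[str, float], timeout: float) -> AsyncIterator[List[str]]:
    """Yield each source's results as they arrive, until the deadline passes."""
    tasks = [asyncio.create_task(search_source(n, d)) for n, d in sources.items()]
    try:
        for next_done in asyncio.as_completed(tasks, timeout=timeout):
            try:
                batch = await next_done
            except asyncio.TimeoutError:
                break  # deadline reached; batches already yielded are not lost
            yield batch
    finally:
        for task in tasks:
            task.cancel()  # stop any source that is still running


async def demo() -> List[List[str]]:
    sources = {"fast": 0.01, "medium": 0.05, "slow": 2.0}
    return [batch async for batch in stream_results(sources, timeout=0.3)]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, the slow source misses the deadline, but the results from the two faster sources are still delivered, which is exactly what an all-or-nothing interface cannot do.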
Lest you think that only DWT applications can participate in this federation of federations, DWT-built WorldWideScience federates the non-DWT Vascoda application. And, if you’re concerned that DWT will not allow other vendors to federate its applications, the decision is in the hands of the owner of the application. DWT will provide documentation to application owners who are willing to be federated by other applications.
Federation of federation has an interesting benefit that results from the grouping of sources. In the example I cited earlier of ten applications each federating thirty sources, the application that federates the ten applications has access to 300 sources but only has to maintain ten connectors. This is a tremendous time saver since the effort involved in maintaining connectors is so great.
I need to state that processing of asynchronous results from the federated application does introduce complexity, in particular if the federating application is displaying results incrementally. When the federated application is returning all of its results at once (synchronously) it can rank all of its potential results and return only the most highly ranked ones to the federating application. When that same federated application is returning its results as it gets them from the sources it’s federating (asynchronously), it can’t rank the full set of results because it hasn’t seen the full set yet. So, it returns a set of results early in the search and then it may have more results to return a few seconds later. How does the federated application tell the federating application that some of the new results are more relevant than some of the ones it sent a few seconds before? How does the federating application deal with the fact that it just displayed some results to the user which now are less relevant than some of the new results? If it keeps merging in the new results from the federated application then it will have disproportionately many results from that federated application. If it reranks and eliminates the least relevant of the results from the federated application then it faces the situation of potentially needing to remove results that the user has already seen before he asked for the next incremental display of the results. There’s no easy solution to this problem.
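To illustrate the trade-off rather than solve it, here is a small Python sketch of one possible compromise: results already shown to the user are never reordered or removed, and newly arrived results are ranked only against other results that have not yet been displayed. The class name and the assumption that relevance scores are directly comparable across batches are mine, not something any vendor is known to use.

```python
import heapq
from typing import List, Tuple

Result = Tuple[float, str]  # (relevance score, title); scores assumed comparable


class IncrementalMerger:
    """Never reorder what was shown; rank only results not yet displayed."""

    def __init__(self) -> None:
        self.shown: List[Result] = []
        self._pending: List[Tuple[float, str]] = []  # min-heap keyed on negated score

    def add_batch(self, batch: List[Result]) -> None:
        """Accept a batch of results that just arrived from a federated application."""
        for score, title in batch:
            heapq.heappush(self._pending, (-score, title))

    def next_page(self, size: int) -> List[Result]:
        """Return the best results not yet displayed and mark them as shown."""
        page: List[Result] = []
        while self._pending and len(page) < size:
            negated, title = heapq.heappop(self._pending)
            page.append((-negated, title))
        self.shown.extend(page)
        return page


merger = IncrementalMerger()
merger.add_batch([(0.6, "early A"), (0.4, "early B")])
first_page = merger.next_page(2)
merger.add_batch([(0.9, "late C"), (0.5, "late D")])
second_page = merger.next_page(2)
```

The cost of this stability is visible in the example: the most relevant result, `late C`, ends up on the second page because it arrived after the first page had already been displayed.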
I believe that federation of federations is going to be more important as federated search evolves because I believe that users will demand the ability to search larger numbers of sources. The ability to search thousands of sources in parallel is already here but no standards exist to facilitate this divide-and-conquer approach to large scale federated search. Without standards, those wanting to federate federated search engines are usually forced to screen scrape. With the increasing use of AJAX in federated search applications, screen scraping is becoming difficult, if not impossible, in many cases. So, we will need those standards sooner or later if we’re going to divide and conquer.