[ Editor's Note: This is a guest article by Daniel Tunkelang. (See his bio below.) Daniel is passionate about designing search systems that improve users' experience with information retrieval. This passion comes across very strongly in his book about faceted search, which I recently reviewed.
This article addresses a limitation with federated search that could be removed if content sources provided specific metadata to federated search engines to improve relevance ranking. Good food for thought. ]
Daniel Tunkelang is the Chief Scientist and a co-founder of Endeca, a leading vendor of search technology. Before joining Endeca’s founding team, he worked at the IBM T. J. Watson Research Center and AT&T Bell Labs. Daniel pioneered the annual workshops on human-computer information retrieval and recently published a book on faceted search. He blogs at The Noisy Channel.
The problem with federated search
The case for federated search is straightforward: no single organization has all of the answers, and therefore no single index can ever hope to completely satisfy its users’ needs. Federation allows the developer of a search application to hedge his or her bets by bringing in knowledge from outside resources.
But federation is no panacea, at least as it is implemented today. A federated search application brokers a query, sending it to multiple search providers (i.e., the search interfaces to a variety of content repositories), whose results it then attempts to assemble into a coherent whole. Unfortunately, since most search providers return little more than the top-ranked result pages, federated search applications are largely reduced to assembling a unified ranking of those disparate result pages.
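To make that assembly step concrete, here is a minimal sketch of what a broker is often reduced to: interleaving the top-ranked pages it gets back from each provider, deduplicating along the way. The result format is invented for illustration; real providers differ.

```python
def interleave(result_lists):
    """Round-robin merge of per-source ranked result pages.

    Takes one ranked list of result dicts per source and interleaves
    them, skipping documents (by URL) already emitted by another source.
    """
    merged, seen = [], set()
    iters = [iter(lst) for lst in result_lists]
    while iters:
        still_alive = []
        for it in iters:
            for doc in it:
                if doc["url"] not in seen:
                    seen.add(doc["url"])
                    merged.append(doc)
                    still_alive.append(it)
                    break
        iters = still_alive  # drop exhausted sources
    return merged
```

Note that nothing here looks at relevance scores from the sources; with only ranked pages to work from, position is the only signal the broker has.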
This functionality is significant, and I do not mean to dismiss it. But it is not enough. In particular, this approach to federation necessarily assumes a lowest common denominator of search functionality: a consequence of the requirement to broker evenhandedly among a variety of search applications that vary in the richness of their APIs.
What I would like to see is federation of set retrieval, not just of ranked retrieval. At first glance, this aspiration may seem impractical, since the sets being combined are often too large to be aggregated at query time. I certainly don’t expect a federated search engine to dynamically aggregate gigabytes or even terabytes of documents to process each individual query!
Instead, we need search engines to support, as a standard API feature, the ability to return a summary of a set of search results. Faceted search has led the way in demonstrating the value of such summarization: it offers users a much richer overview of the search results than they could hope to obtain from a handful of top-ranked results. For the same reason, it offers far more information to a federation broker.
Indeed, even faceted search is not enough. It is unreasonable to expect all search applications to use the same faceted classification scheme, and thus federators find themselves confronting the infamous vocabulary problem, perhaps more familiar to practitioners as an aspect of master data management. How do we address this problem?
With yet more summarization. For example, a facet value corresponds to the set of documents assigned that value, and thus to a distribution of occurrence frequencies on the other facet values. In fact, we can take this idea further and summarize a facet value in terms of a distribution on the words and phrases used in documents assigned that value. Given access to these summarizations, a federator can at least make educated guesses to establish correspondences and relationships among the facets returned by different search applications.
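Here is a minimal sketch of how a federator might make such educated guesses: treat each facet value as a term-frequency distribution and match values across sources by cosine similarity. The facet names and counts are invented, and a real system would need smoothing, phrase handling, and far more care.

```python
from math import sqrt

def cosine_similarity(dist_a, dist_b):
    """Cosine similarity between two term-frequency distributions,
    each a dict mapping a term to its occurrence count."""
    shared = set(dist_a) & set(dist_b)
    dot = sum(dist_a[t] * dist_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in dist_a.values()))
    norm_b = sqrt(sum(v * v for v in dist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def align_facets(source_a, source_b, threshold=0.5):
    """Guess correspondences between two sources' facet values.

    Each argument maps a facet-value name to its term distribution.
    Returns (name_a, name_b, score) triples for the best match of each
    value in source_a that clears the similarity threshold.
    """
    matches = []
    for name_a, terms_a in source_a.items():
        best, best_score = None, threshold
        for name_b, terms_b in source_b.items():
            score = cosine_similarity(terms_a, terms_b)
            if score > best_score:
                best, best_score = name_b, score
        if best is not None:
            matches.append((name_a, best, round(best_score, 2)))
    return matches
```

For example, a facet value "Cars" at one source and "Automobiles" at another would align if the documents behind them use largely the same vocabulary, even though the labels share no string in common.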
Of course, there is far more work needed to make such an approach effective and efficient enough to be practical. Summarization is more computationally intensive than simply ranking results, and combining summarizations is more complex than simply harmonizing relevance scores. But these are the challenges that represent the best opportunity to make federation a successful strategy.
5 Responses so far to "Daniel Tunkelang on the problem with federated search"
June 15th, 2009 at 9:47 am
On the topic of “LCD” searches.
Daniel states: “In particular, this approach to federation necessarily assumes a lowest common denominator of search functionality–a consequence of the requirement to evenhandedly broker among a variety of search applications that vary in the richness of their APIs”
This assumption of LCD searching is taken as an obvious truism. But why?
If federated search systems are presumed to be capable of handling multiple record formats for data extraction from retrieved records, then why should they not be considered capable of generating Source-specific search statements?
The reason (as for federated search itself) seems to be that most don’t because it is yet another messy thing to deal with on a Source-by-Source basis. Note that here I am talking about more than adding blanks and quotes to a search statement.
The next step is to be more aware of the actual search syntax. Is the index for an author search represented by “au=” or “/au” or some other string? This is a fairly common, but not universal, capability; many systems restrict their interaction to the language of standards such as RPN for Z39.50, or a commonly implemented search language such as OpenSearch.
Moving to proprietary APIs, and also to web search interfaces, provides the much larger variety Daniel posits as beyond the reach of federated search systems.
It ain’t necessarily so. There are federated search systems which match the search to the capabilities of the Source engine: applying limits where they exist, using indices where they exist, and mapping to alternatives where the requested function is not available, all under user control for the desired strictness of the query. They act very much as a user would in the same circumstances, adapting the query to what the Source can handle.
This addresses a second point in the quote above: that the desire is to have an identical query sent to all Sources, rather than the “best” one each can handle. Why? It is contrary to the way users act. They attempt to get the best results from each Source, adapting to what it offers in the way of search tools. Surely federated search systems should try to do no less.
Admittedly, very few federated search systems do go to these lengths, but some do and, of course, we believe that Muse provides one of the more advanced capabilities of this type, or I wouldn’t be mentioning it.
The bottom line is that some federated search systems do adapt the searches to the richness, or otherwise, of the Sources they access. The LCD approach is not a necessary evil, but a chosen one.
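To sketch what such Source-by-Source adaptation might look like (the profiles, field templates, and syntax below are invented for illustration, not taken from any particular product):

```python
# Hypothetical source profiles: each maps abstract fields to that
# source's native query syntax and records which limits it supports.
SOURCE_PROFILES = {
    "library_a": {"fields": {"author": "au={term}", "title": "ti={term}"},
                  "limits": {"year"}},
    "web_b":     {"fields": {"author": "/au {term}"},
                  "limits": set()},
}

def translate_query(source, field, term, limits=None):
    """Render an abstract fielded query in a source's own syntax.

    Falls back to an unfielded keyword search when the source has no
    index for the field, and drops any limit the source cannot apply,
    much as a lenient user would.
    """
    profile = SOURCE_PROFILES[source]
    template = profile["fields"].get(field)
    if template is None:
        query = term  # no such index here: plain keyword search
    else:
        query = template.format(term=term)
    applied = {k: v for k, v in (limits or {}).items()
               if k in profile["limits"]}
    return query, applied
```

The same abstract request for author "smith" limited to 2009 would thus go out as "au=smith" with the year limit to one Source, and as "/au smith" with no limit to the other.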
June 15th, 2009 at 10:05 am
The idea of Sources returning their faceted analysis of a set of results is interesting. And it of course happens in practice. The idea that they could return some of the “reasoning” behind the facets is very interesting. And is not happening so far as I know.
Facets are an attempt to extract the semantics of a set of results. And they work pretty well, whether pre-coordinated against a vocabulary of some sort or simply the natural outcome of the retrieved documents. Normalizing these semantics is the big problem, as Daniel points out.
Returning not just the facet values, but also the terms which have been used to derive them, does indeed provide for the very useful possibility that the federated search system could derive common facet values across a number of Sources. They would of necessity be “fuzzy”, in that the supplied terms would not be co-extensive across the Sources’ facets, but they would probably be fairly good approximations. And much better than nothing at all.
Processing would be increased all round, but probably not by an unacceptable amount, and the result could be not only a “normalized” set of facet values for the user, but also a set of documents which could be de-duped or clustered on those values. This should lead to a richer set of documents for the user.
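A toy sketch of that normalize-and-merge step (the label mapping here is invented; in practice it would be derived from the supplied terms rather than hard-coded):

```python
from collections import defaultdict

# Hypothetical mapping from each Source's facet labels to shared
# labels, of the kind a federator might derive from supplied terms.
NORMALIZED = {
    ("source_a", "Cars"): "Automobiles",
    ("source_b", "Autos"): "Automobiles",
    ("source_a", "Bikes"): "Bicycles",
}

def merge_facet_counts(per_source_counts):
    """Combine per-Source facet counts into one normalized summary.

    Labels with a known correspondence are folded together; unmapped
    labels pass through unchanged.
    """
    merged = defaultdict(int)
    for source, counts in per_source_counts.items():
        for label, n in counts.items():
            shared = NORMALIZED.get((source, label), label)
            merged[shared] += n
    return dict(merged)
```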
I await the data so we can process it and see.
June 15th, 2009 at 6:34 pm
Peter, point taken: it is certainly possible to integrate with each source on a case-by-case basis and thus get beyond the least-common-denominator approach, though even then there is still the issue of amalgamating structure from the different sources. I’d love to see a live example of a federated search engine that does this well. I agree that it’s possible; indeed, I’d like to see it done!