[ Editor’s Note: This is a guest article by Daniel Tunkelang. (See his bio below.) Daniel is passionate about designing search systems that improve users’ experience with information retrieval. This passion comes across very strongly in his book about faceted search, which I recently reviewed.
This article addresses a limitation with federated search that could be removed if content sources provided specific metadata to federated search engines to improve relevance ranking. Good food for thought. ]
Daniel Tunkelang is the Chief Scientist and a co-founder of Endeca, a leading vendor of search technology. Before joining Endeca’s founding team, he worked at the IBM T. J. Watson Research Center and AT&T Bell Labs. Daniel pioneered the annual workshops on human-computer information retrieval and recently published a book on faceted search. He blogs at The Noisy Channel.
The problem with federated search
The case for federated search is straightforward: no single organization has all of the answers, and therefore no single index can ever hope to completely satisfy its users’ needs. Federation allows the developer of a search application to hedge his or her bets by bringing in knowledge from outside resources.
But federation is no panacea, at least as it is implemented today. A federated search application brokers a query, sending it to multiple search providers (i.e., the search interfaces to a variety of content repositories), whose results it then attempts to assemble into a coherent whole. Unfortunately, since most search providers provide little more than the top-ranked result pages, federated search applications are largely reduced to assembling a unified ranking of those disparate result pages.
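To make the brokering step concrete, here is a minimal sketch of a federator that sees only each provider's top-ranked results. Everything in it is an assumption for illustration: the names `federated_search` and `merge_round_robin` are hypothetical, providers are modeled as plain callables that return ranked lists of document identifiers, and round-robin interleaving stands in for whatever merging strategy a real broker would use.

```python
from itertools import zip_longest


def merge_round_robin(result_lists):
    """Interleave ranked lists rank by rank, keeping the first copy of each document."""
    merged, seen = [], set()
    # zip_longest pads shorter lists with None so longer lists are not truncated.
    for tier in zip_longest(*result_lists):
        for doc_id in tier:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def federated_search(query, providers):
    """Send the query to every provider and merge the ranked lists that come back.

    `providers` is a hypothetical mapping of provider name to a callable
    that takes a query string and returns a ranked list of document ids.
    """
    result_lists = [search(query) for search in providers.values()]
    return merge_round_robin(result_lists)


# Two stand-in providers with fixed results, for illustration only.
providers = {
    "catalog": lambda q: ["d1", "d2", "d3"],
    "archive": lambda q: ["d2", "d4"],
}
merged = federated_search("faceted search", providers)
```

Note that the broker never learns anything about the documents beyond their position in each list, which is exactly the limitation described above.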
This functionality is significant, and I do not mean to dismiss it. But it is not enough. In particular, this approach to federation necessarily assumes a lowest common denominator of search functionality–a consequence of the requirement to evenhandedly broker among a variety of search applications that vary in the richness of their APIs.
What I would like to see is federation of set retrieval, not just of ranked retrieval. At first glance, this aspiration may seem impractical, since the sets being combined are often too large to be aggregated at query time. I certainly don’t expect a federated search engine to dynamically aggregate gigabytes or even terabytes of documents to process each individual query!
Instead, we need search engines to support, as a standard API capability, the ability to return a summary of a set of search results. Faceted search has led the way in demonstrating the value of such summarization: it offers users a much richer overview of the search results than the users could hope to obtain from a handful of top-ranked results. For the same reason, it offers far more information to a federation broker.
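As a rough illustration of what such a summary capability might return, the sketch below computes facet value counts over a whole result set and then combines the summaries from several providers. The function names `summarize` and `merge_summaries` and the dictionary-based document format are assumptions made for this example, not part of any existing search API. The merge step also assumes that providers happen to use the same facet names and values, which is a simplification.

```python
from collections import Counter


def summarize(documents, facets):
    """Count how often each facet value occurs across an entire result set."""
    summary = {facet: Counter() for facet in facets}
    for doc in documents:
        for facet in facets:
            value = doc.get(facet)
            if value is not None:
                summary[facet][value] += 1
    return summary


def merge_summaries(summaries):
    """Add up facet counts from several providers (assumes a shared vocabulary)."""
    merged = {}
    for summary in summaries:
        for facet, counts in summary.items():
            # Counter.update adds counts rather than replacing them.
            merged.setdefault(facet, Counter()).update(counts)
    return merged


# Hypothetical result sets from two providers.
results_a = [{"format": "book", "year": 2009}, {"format": "article", "year": 2009}]
results_b = [{"format": "book", "year": 2008}]

combined = merge_summaries([
    summarize(results_a, ["format", "year"]),
    summarize(results_b, ["format", "year"]),
])
```

The broker receives a few counts per provider instead of the documents themselves, so the size of the underlying result sets does not matter at query time.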
Indeed, even faceted search is not enough. It is unreasonable to expect all search applications to use the same faceted classification scheme, and thus federators find themselves confronting the infamous vocabulary problem–perhaps more familiar to practitioners as an aspect of master data management. How do we address this problem?
With yet more summarization. For example, a facet value corresponds to the set of documents assigned that value, and thus to a distribution of occurrence frequencies on the other facet values. In fact, we can take this idea further and summarize a facet value in terms of a distribution on the words and phrases used in documents assigned that value. Given access to these summarizations, a federator can at least make educated guesses to establish correspondences and relationships among the facets returned by different search applications.
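One way such an educated guess might work is sketched below: each facet value is summarized as a distribution over the words in its documents, and values from different providers are paired up by cosine similarity. This is only an illustration under stated assumptions. The helper names (`term_distribution`, `cosine`, `best_match`) are hypothetical, whitespace tokenization is a deliberate simplification, and cosine similarity is one of several measures a federator could choose.

```python
import math
from collections import Counter


def term_distribution(texts):
    """Summarize a facet value as relative word frequencies over its documents."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(weight * q.get(word, 0.0) for word, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def best_match(value_dist, candidates):
    """Return the candidate facet value whose distribution is most similar.

    `candidates` maps facet value names to distributions and must be non-empty.
    """
    return max(candidates, key=lambda name: cosine(value_dist, candidates[name]))


# Provider A labels documents "Jazz"; provider B uses a different scheme.
jazz = term_distribution(["jazz piano trio", "jazz saxophone quartet"])
provider_b = {
    "Music/Jazz": term_distribution(["live jazz trio recording"]),
    "Sports": term_distribution(["football match report"]),
}
guess = best_match(jazz, provider_b)
```

A real federator would likely want a similarity threshold as well, so that a facet value with no plausible counterpart is left unmatched rather than forced onto its nearest neighbor.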
Of course, there is far more work needed to make such an approach effective and efficient enough to be practical. Summarization is more computationally intensive than simply ranking results, and combining summarizations is more complex than simply harmonizing relevance scores. But these are the challenges that represent the best opportunity to make federation a successful strategy.