Daniel Tunkelang, Endeca co-founder and Chief Scientist, wrote a guest article articulating a particular problem with federated search. In the article, Daniel wrote:
But federation is no panacea, at least as it is implemented today. A federated search application brokers a query, sending it to multiple search providers (i.e., the search interfaces to a variety of content repositories), whose results it then attempts to assemble into a coherent whole. Unfortunately, since most search providers provide little more than the top-ranked result pages, federated search applications are largely reduced to assembling a unified ranking of those disparate result pages.
This functionality is significant, and I do not mean to dismiss it. But it is not enough. In particular, this approach to federation necessarily assumes a lowest common denominator of search functionality–a consequence of the requirement to evenhandedly broker among a variety of search applications that vary in the richness of their APIs.
Note my emphasis of the phrase “lowest common denominator.” Peter Noerr, Chief Technology Officer for MuseGlobal, left a detailed comment which reads, in part:
This assumption of [lowest common denominator] LCD searching is taken as an obvious truism. But why?
If federated search systems are presumed to be capable of handling multiple record formats for data extraction from retrieved records, then why should they not be considered capable of generating Source specific search statements?
The reason (as for federated search itself) seems to be that most don’t because it is yet another messy thing to deal with on a Source by Source basis. Note that here I am talking about more than adding blanks and quotes to a search statement.
Peter raises an excellent point. There is a prevalent myth that federated search applications search as poorly as their most simple-minded source. But the myth makes no sense. If one of a dozen sources doesn’t allow an author search then the LCD myth implies that you’d get no author results from any source. Taken to its logical conclusion, given enough sources, your users would only ever get titles and URLs returned because, for any set of searchable fields, some source will fail to support one or more of them.
I suspect that what some people consider LCD behavior is this: if you run an author search against a source that has no searchable author field, you get nothing back from that source. As unpleasant as that sounds, it might be the right behavior for some sources, since returning nothing is better than returning irrelevant results. In other cases, when the user enters text into the advanced search author field, it might be better to run a full-text search against that source than to skip the source altogether. The whole question of what counts as the lowest common denominator is a messy one because it’s not clear what the right behavior is. So it’s not fair to say that federated search engines do the “wrong thing” when searching multiple sources.
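To make the two choices concrete, here is a minimal sketch of a per-source policy for a field the source can't search. Everything in it is an assumption for illustration: the `Source` class, the `fallback_to_full_text` flag, the `"fulltext"` field name, and the three example sources are hypothetical, not taken from any particular federated search product.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Source:
    name: str
    searchable_fields: frozenset   # fields the source can search directly
    fallback_to_full_text: bool    # hypothetical per-source policy flag


def plan_query(source: Source, field: str, terms: str) -> Optional[Tuple[str, str]]:
    """Return the (field, terms) to send to this source, or None to skip it."""
    if field in source.searchable_fields:
        return (field, terms)
    if source.fallback_to_full_text:
        # The source cannot search this field, so widen to a full-text search.
        return ("fulltext", terms)
    # Policy for this source: no results is better than irrelevant results.
    return None


# Hypothetical sources: one supports author search, two do not.
catalog = Source("catalog", frozenset({"author", "title"}), fallback_to_full_text=False)
news = Source("news", frozenset({"title"}), fallback_to_full_text=True)
patents = Source("patents", frozenset({"title"}), fallback_to_full_text=False)

for src in (catalog, news, patents):
    print(src.name, plan_query(src, "author", "einstein"))
```

The point of the sketch is that the decision is made source by source, so one source without an author field does not drag the others down.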
Even if we agreed that LCD meant suboptimal behavior by the federated search engine, it’s certainly NOT true that federated search is forced to do LCD. Author search is a great example of how a federated search application can do much better than LCD. Author search is a pain. There are a variety of formats that a source could require for the author name. The source could expect LASTNAME, FIRSTNAME (with or without the comma) or FIRSTNAME LASTNAME. Then there are first and middle names and first and middle initials to deal with. There are many ways a user could enter a name and many ways the source could want it. Plus, in some cases, a source could recognize more than one name format. A human could want to search for A S EINSTEIN, ALBERT S EINSTEIN, A SCIENTIST EINSTEIN, ALBERT SCIENTIST EINSTEIN, ALBERT EINSTEIN, or simply EINSTEIN. What happens if the source expects A EINSTEIN and you search for ALBERT EINSTEIN? Will the source do the right thing? Maybe. Maybe not. Smart connectors deal with these kinds of issues by translating, or mapping, the user’s search terms into the form that will yield the best results from a particular source. A smart connector would turn ALBERT EINSTEIN into A EINSTEIN just for the one source that needed that form in order to give relevant results.
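The name mapping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the style names, the `SOURCE_STYLES` table, and the source names are hypothetical, and a real connector would have to handle many more cases (suffixes, multi-word surnames, sources that accept several formats).

```python
STYLES = ("first_last", "last_comma_first", "first_initial_last")


def parse_name(raw):
    """Split user input into (given_names, last_name).

    Accepts 'LAST, FIRST [MIDDLE]' or 'FIRST [MIDDLE] LAST'.
    """
    text = raw.strip()
    if not text:
        raise ValueError("empty author name")
    if "," in text:
        last, _, given = text.partition(",")
        return given.split(), last.strip()
    parts = text.split()
    return parts[:-1], parts[-1]


def format_author(raw, style):
    """Rewrite an author name into the form one source expects."""
    if style not in STYLES:
        raise ValueError("unknown style: " + style)
    given, last = parse_name(raw)
    if not given:
        return last  # surname only: nothing to rearrange
    if style == "first_last":
        return " ".join(given + [last])
    if style == "last_comma_first":
        return last + ", " + " ".join(given)
    # first_initial_last: keep only the first initial, e.g. A EINSTEIN
    return given[0][0] + " " + last


# Hypothetical per-source configuration; a connector would own this mapping.
SOURCE_STYLES = {
    "source_a": "first_last",
    "source_b": "last_comma_first",
    "source_c": "first_initial_last",
}


def author_query_for(source_name, raw):
    return format_author(raw, SOURCE_STYLES[source_name])


print(author_query_for("source_c", "ALBERT EINSTEIN"))
print(author_query_for("source_b", "ALBERT S EINSTEIN"))
```

Only the source that needs the initial form gets it; the other sources still receive the name in whatever shape suits them.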
The reality that should replace the LCD myth is that not all federated search engines are created equal, and some deal with picky source behavior better than others. When I worked full time for blog sponsor Deep Web Technologies, I dabbled in connector building and worked closely with their connector developers on some projects. The Deep Web connectors have remarkably complex logic for what I used to think was a simple task. Dealing with search syntax, phrases, wildcards, Booleans, and a host of other factors is far from trivial. The Deep Web connector developers put a lot of sweat and testing into each connector they build.
Connector quality is not the whole LCD story. Sometimes a source will return very few results compared to other sources. This puts the source at a disadvantage because the more results you have from a source, the better the relevance ranking you can perform. A smart connector can try to get multiple results pages from that source. A source may also be slow to return any results. Rather than ignore results from that source, a smart federated search engine can initially show results from the sources that did respond quickly and then update the results to merge in the late arrivals. Another major problem for federated search applications is performing good relevance ranking when a source returns no snippet, or just the beginning of the abstract instead of a context-sensitive snippet.
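Here is a rough sketch of the "show the fast sources first, merge in the late arrivals" idea, using Python's standard `concurrent.futures` module. The two source functions and the `(score, title)` result shape are stand-ins I made up for illustration; how a real engine scores and re-ranks results is a much bigger topic.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def federated_search(sources, query):
    """Yield the merged, re-ranked result list each time another source responds."""
    merged = []
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = [pool.submit(search, query) for search in sources]
        for future in as_completed(futures):
            merged.extend(future.result())
            # Re-rank everything seen so far; results are (score, title) tuples.
            merged.sort(key=lambda result: result[0], reverse=True)
            yield list(merged)


# Hypothetical sources: one answers immediately, one is slow.
def fast_source(query):
    return [(0.9, "fast: " + query), (0.4, "fast: other")]


def slow_source(query):
    time.sleep(0.2)  # simulate a slow remote source
    return [(0.7, "slow: " + query)]


for snapshot in federated_search([slow_source, fast_source], "einstein"):
    print(snapshot)
```

The first snapshot contains only the fast source's results; the second one has the slow source's result merged into its ranked position rather than dropped or tacked on at the end.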
There’s more to the LCD story. Rather than repeat it all here, I recommend you read a fairly in-depth article I wrote a while back on what determines the quality of search results. I also recommend a white paper I wrote that distills the “quality of search results” ideas into four pages.
Hopefully, the next time someone tells you that federated search is confined to delivering the lowest common denominator of results, you’ll be able to tell them that it “ain’t so,” and you’ll be able to tell them why.