Daniel Tunkelang, Endeca co-founder and Chief Scientist, wrote a guest article articulating a particular problem with federated search. In the article, Daniel wrote:
But federation is no panacea, at least as it is implemented today. A federated search application brokers a query, sending it to multiple search providers (i.e., the search interfaces to a variety of content repositories), whose results it then attempts to assemble into a coherent whole. Unfortunately, since most search providers provide little more than the top-ranked result pages, federated search applications are largely reduced to assembling a unified ranking of those disparate result pages.
This functionality is significant, and I do not mean to dismiss it. But it is not enough. In particular, this approach to federation necessarily assumes a lowest common denominator of search functionality–a consequence of the requirement to evenhandedly broker among a variety of search applications that vary in the richness of their APIs.
Note my emphasis of the phrase “lowest common denominator.” Peter Noerr, Chief Technology Officer for MuseGlobal, left a detailed comment which reads, in part:
This assumption of [lowest common denominator] LCD searching is taken as an obvious truism. But why?
If federated search systems are presumed to be capable of handling multiple record formats for data extraction from retrieved records, then why should they not be considered capable of generating Source specific search statements?
The reason (as for federated search itself) seems to be that most don’t because it is yet another messy thing to deal with on a Source by Source basis. Note that here I am talking about more than adding blanks and quotes to a search statement.
Peter raises an excellent point. There is a prevalent myth that federated search applications search as poorly as their most simple-minded source. But the myth makes no sense. If one of a dozen sources doesn’t allow an author search then the LCD myth implies that you’d get no author results from any source. Taken to its logical conclusion, given enough sources, your users would only ever get titles and URLs returned because, for any set of searchable fields, some source will fail to support one or more of them.
I suspect that what some people consider to be LCD behavior is this: if, for example, you do an author search against a source that doesn’t have a searchable author field, you’re going to get nothing back from that source. As unpleasant as it sounds, depending on the source, that might be the right behavior, i.e. returning nothing rather than returning irrelevant results. In other cases, if the user enters text into the advanced search author field, it might be better to perform a full-text search against that source than not to search the source at all. The whole question of what the lowest common denominator is is a messy one because it’s not clear what the right behavior is. So, it’s not fair to say that federated search engines do the “wrong thing” when searching multiple sources.
Assuming that we agreed that LCD meant suboptimal behavior by the federated search engine, then it’s certainly NOT true that federated search is forced to do LCD. Author search is a great example of how a federated search application can do much better than LCD. Author search is a pain. There are a variety of formats that a source could expect for specifying the author name. The source could expect LASTNAME, FIRSTNAME (with or without the comma) or FIRSTNAME LASTNAME. Then there are first and middle names and first and middle initials to deal with. There are many ways a user could enter a name and many ways the source could want it. Plus, in some cases, a source could recognize more than one name format. A human could want to search for A S EINSTEIN, ALBERT S EINSTEIN, A SCIENTIST EINSTEIN, ALBERT SCIENTIST EINSTEIN, ALBERT EINSTEIN, or simply EINSTEIN. What happens if the source expects A EINSTEIN and you search for ALBERT EINSTEIN? Will the source do the right thing? Maybe. Maybe not. Smart connectors deal with these kinds of issues by translating, or mapping, the user’s search terms into a form that will yield the best results from a particular source. A smart connector would turn ALBERT EINSTEIN into A EINSTEIN just for the one source that needed that in order to return relevant results.
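To make the idea concrete, here is a minimal sketch of the kind of name mapping a smart connector might do. This is not any vendor's actual code; the parsing rules and style names are invented for illustration, and a real connector would handle many more edge cases.

```python
def parse_author(query: str):
    """Split a free-form author query into (first, middle, last).
    Assumes the last token is the surname; handles 'LAST, FIRST' too."""
    if "," in query:
        last, _, rest = query.partition(",")
        parts = rest.split()
        return (parts[0] if parts else "", " ".join(parts[1:]), last.strip())
    parts = query.split()
    if len(parts) == 1:
        return ("", "", parts[0])
    return (parts[0], " ".join(parts[1:-1]), parts[-1])

def format_author(query: str, style: str) -> str:
    """Render the parsed name in the format a particular source expects."""
    first, middle, last = parse_author(query)
    if style == "initial_last":          # e.g. "A EINSTEIN"
        return f"{first[:1]} {last}".strip()
    if style == "last_comma_first":      # e.g. "EINSTEIN, ALBERT"
        return f"{last}, {first}".strip(", ")
    return f"{first} {last}".strip()     # default: "ALBERT EINSTEIN"

print(format_author("ALBERT EINSTEIN", "initial_last"))   # A EINSTEIN
print(format_author("EINSTEIN, ALBERT", "initial_last"))  # A EINSTEIN
```

The point is that the mapping is configured per source: only the source that wants A EINSTEIN gets the `initial_last` rendering, while other sources receive the name in whatever form they do best with.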
The reality to replace the LCD myth is that not all federated search engines are created equal and that some deal better with picky source behavior than do others. When I worked full time for blog sponsor Deep Web Technologies I dabbled in connector building and I worked closely with their connector developers on some projects. The Deep Web connectors have remarkably complex logic for what I used to think was a simple task. Dealing with search syntax, phrases, wildcards, booleans, and a host of other factors is far from trivial. The Deep Web connector developers put a lot of sweat and testing into each connector they build.
Connector quality is not the whole LCD story. Sometimes a source will return very few results compared to other sources. This puts the source at a disadvantage because the more results you have from a source, the better the relevance ranking you can perform. A smart connector can try to get multiple results pages from that source. A source may also be slow to return any results. Rather than ignore results from that source, a smart federated search engine can initially show results from sources that responded quickly and then update the results to merge in the late arrivals. Another major problem for federated search applications is performing good relevance ranking when a source returns no snippet, or returns just the beginning of the abstract instead of a context-sensitive snippet.
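The show-fast-sources-first, merge-late-arrivals-in approach can be sketched in a few lines. The sources, delays, and scores below are simulated stand-ins, not a real federated engine:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def search_source(name, delay, results):
    time.sleep(delay)  # simulate the source's response time
    return name, results

# Hypothetical sources: (response delay in seconds, [(title, relevance score)])
sources = {
    "fast_db": (0.01, [("Paper A", 0.9), ("Paper B", 0.6)]),
    "slow_db": (0.05, [("Paper C", 0.8)]),
}

merged = []
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(search_source, n, d, r)
               for n, (d, r) in sources.items()]
    for fut in as_completed(futures):
        name, results = fut.result()
        merged.extend(results)
        # Re-rank everything received so far; late arrivals merge in
        # rather than being dropped.
        merged.sort(key=lambda hit: hit[1], reverse=True)
        print(f"after {name}: {[title for title, _ in merged]}")
```

A real engine would stream these intermediate rankings to the user interface instead of printing them, but the shape of the loop is the same: display what you have, then re-merge as slower sources respond.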
There’s more to the LCD story. Rather than repeat it all here, I recommend you read a fairly in-depth article I wrote a while back on what determines the quality of search results. I also recommend a white paper I wrote that distills the “quality of search results” ideas into four pages.
Hopefully, the next time someone tells you that federated search is confined to delivering the lowest common denominator of results you’ll be able to tell them that it “ain’t so” and you’ll be able to tell them why.
Tags: federated search
6 Responses so far to "The “lowest common denominator” myth"
June 22nd, 2009 at 7:35 pm
But there is a real LCD problem with federated search. You can argue that it is reduced in some circumstances, but I don’t think it helps to pretend it doesn’t exist.
Let’s say I federate full-text queries to sources A, B, and C. Sources A and B can filter on a geographic bounding box, but source C cannot. If a searcher gives me a full-text query term and a bounding box, I can only include results from A and B. If the searcher needs to query all three, she’s limited to full-text.
Federated search may still be very useful in this scenario, but the LCD problem needs to be recognized and understood.
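Dave's A/B/C scenario can be modeled as a capability check at query-dispatch time. The capability names below are invented for illustration; real systems describe source capabilities in many different ways:

```python
# Which search features each (hypothetical) source supports.
capabilities = {
    "A": {"fulltext", "bbox"},
    "B": {"fulltext", "bbox"},
    "C": {"fulltext"},
}

def eligible_sources(query_features):
    """Return the sources that support every feature the query uses."""
    return sorted(s for s, caps in capabilities.items()
                  if query_features <= caps)

print(eligible_sources({"fulltext", "bbox"}))  # ['A', 'B']  (C is excluded)
print(eligible_sources({"fulltext"}))          # ['A', 'B', 'C']
```

This makes the trade-off explicit: keep the bounding box and lose source C, or keep all three sources and drop down to full-text only.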
June 23rd, 2009 at 6:11 am
Yeah, I’m still a believer in some amount of LCD effect too.
Providing the kind of advanced features you require is exceedingly difficult (meaning expensive), and if you want to cover a large range of sources, it will take constant (expensive) maintenance.
And then, even when you’ve done your best, searchers can still receive unexpected results. Even in the simplest case of the author search you give, both options are somewhat undesirable. If you leave a source out, the searcher may not realize it has been left out and may think she has searched a source that in fact remains unexamined. Searching full text when the user asked for an ‘author’ search may produce results that don’t meet the specifications of the search, frustrating the user. And then there are sophisticated search criteria like Dave mentions; a more common example might be various kinds of controlled vocabularies, from MeSH terms to molecular designations.
I’m a big believer in federated search as an important service to scholarly researchers, because the convenience justifies the flaws. But there are certain inherent flaws that are difficult or impossible to get around.
June 23rd, 2009 at 6:12 am
I meant “providing the kinds of advanced features you describe”, not “you require”, above.
I am also curious to see any particular examples you can point to of federated search providers that manage to do well what you describe in this post.
June 24th, 2009 at 9:49 am
Not surprisingly, I agree with Dave and Jonathan. I’d love to see a live example of a federated search engine that addresses the LCD problem. I agree that it’s possible in theory, but I’d like to see it done in practice!
June 24th, 2009 at 12:53 pm
Everyone,
I think of the myth as the blanket statement that federated search engines rollover/suck because some sources are not easy to search well.
But, I’m interested to know what you all think LCD means. Abe and I discussed it for a while and realized that we weren’t quite sure what it means. So, please tell me.
At the same time there’s a different discussion about how some federated search engines handle tricky sources better than others. I’ll do what I can to find examples of smart connectors searching sources particularly well.
June 24th, 2009 at 6:39 pm
Since I started “LCD”, I’ll have a go.
On the operational point, I agree with the comments by Dave, Sol and Jonathan. LCD to me is literally where the Fed Search system issues a search of the same functionality (but possibly different syntax) to each Source. That implies that the functionality of the search is limited to what is available from the least functional Source. I have deliberately used “functionality” to keep the definition as inclusive as possible. Indices, operators, relations, limits, even vocabularies are all examples of “functionality”.
Thus Dave’s A, B, C case is a perfect example of what I would call an LCD situation - in the second instance. Here the search is issued only as a full text search, because that is the only functionality supported by all three Sources. The other instance (where only A & B are searched) is dealt with below.
A couple of comments need to attach to this. Firstly it is interesting to ponder that the Fed search system must know enough about the different Sources to be able to determine what the “common” functions are. If it can do this, it is halfway(-ish) towards being able to handle Source Specific Searches (SSS). So why dumb down? Who knows? I don’t.
The second point is that the first of Dave’s instances (search A & B only) uses what we call a “strict” mode for the search. That is: if the Source can’t handle the search in its entirety, then fail for that Source, and return 0 results. This is not LCD, but rather a strange form of “HCF” (if we want to stick with basic arithmetic acronyms) where only those Sources which meet ALL the requirements (support all the functionality) are searched. Different category, but a problem nonetheless for certain types of searches.
Where non-LCD searching is possible (SSS as used above) then it is possible to switch to a “relaxed” mode and allow some portion of the search to be processed and produce at least some results. (You want details - I knew you would. OK, the most obvious example is where a Source does not support a particular index and any terms for that Source are mapped to one it does support - in 99% of cases this is “keyword” or its functional equivalent. )
This mapped and relaxed operation does allow results from Sources which would return zeros to a strict search and, in our experience, most people would rather get something back which is somewhat like what they asked for than nothing. Often that is because they weren’t too sure about the query in the first place. Again, a good result for some cases, but not all. However, take the union of the two cases and you cover what virtually everybody would want, and what you would get from the native Sources themselves (always an important touchstone). (And this whole comment raises a host more issues for Sol to tease out into other posts, like the issue of notifying users of what the search has done.)
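The strict-versus-relaxed distinction Peter describes can be sketched as a per-source translation step. The source names and index names here are invented; the point is only the shape of the decision:

```python
# Which indices each (hypothetical) source supports.
source_indices = {
    "A": {"author", "keyword"},
    "C": {"keyword"},
}

def translate(source, index, terms, mode="relaxed"):
    """Translate one search clause for a given source.
    Strict mode: if the source lacks the index, skip the source (None).
    Relaxed mode: map the unsupported index down to keyword search."""
    if index in source_indices[source]:
        return (index, terms)
    if mode == "strict":
        return None                  # source fails the search: 0 results
    return ("keyword", terms)        # relaxed: fall back to keyword

print(translate("C", "author", "einstein", mode="strict"))  # None
print(translate("C", "author", "einstein"))                 # ('keyword', 'einstein')
```

As Peter notes, each mode is right for some searches and wrong for others, which is why notifying the user of what was actually searched matters.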