Beyond federated search? The conversation continues

Yesterday I wrote “Beyond federated search?” where I raised the concern about using services that provide indexed content as a way to bypass federated search and its associated challenges.

Jonathan Rochkind left two thoughtful comments which I’d like to respond to.


“Until every single content provider makes the full-text of all of their documents that can be federated available for harvesting and indexing”

EVERY SINGLE content provider does NOT make their content available for federated search in the first place. Of the approximately 800 licensed databases we have listed in our collection, only about 300 are federated search-able. The remainder are largely not there because of lack of functionality on the content provider’s end, not on our fed search vendor’s end.

So that’s a false comparison.

Jonathan, of course you’re right. Not all content can be federated. And, at the same time, not all content is available for harvesting and indexing. In both cases, access to content is controlled by the content provider. My point is that, given that plenty of excellent content isn’t available for harvesting, I don’t see the solution as being to ignore such content. Also, I’m curious to know why 500 of the 800 sources your library uses can’t be federated. While I understand that there are some sources that are very difficult or impossible to build connectors for, I’d be concerned about any federated search vendor that could only build connectors for 38% of the sources I put on my list. Can you explain further your statement that “The remainder are largely not there because of lack of functionality on the content provider’s end”?

If Summon can provide access to about the same amount of content as federated search, including our most important/most used content, it’ll be a contender.

This deeply concerns me. Some people go to CNN for their news. Others go to the BBC. Who should decide which news sources are more valuable? I strongly believe it needs to be the library or research organization that is serving its patrons, not the subscription service provider. I argue that, for all the tremendous benefits of harvesting and indexing, it’s not a complete solution. So, why isn’t your library picking its sources?

A hybrid local index/broadcast search system is an obvious idea. But it’s tricky to figure out how to search both classes of content in one search without bringing things down to the lowest common denominator of fed search.

Jonathan, yes, this is a major issue. How do you merge results from sources where one set comes from searching the unified index and another set comes from federated search? It’s not an easy problem but I think it’s a critical one for the federated search industry to solve. One possibility is to not merge the two sets of results but to put them into separate tabs, as ugly as it might seem. As an aside, I’m interested to know if Summon indexes full text or, more likely, only metadata. A metadata index is only as good as the metadata itself. If it is indeed the case that Summon is only searching metadata, then its relevance ranking will likely be poorer than that of federated search against sources where the underlying search engine is performing a full-text search.
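To make the separate-tabs idea concrete, here is a minimal sketch in Python. It assumes a toy in-memory list standing in for the unified index and plain functions standing in for federated search connectors; every name in it (`search_local_index`, `broadcast_search`, `hybrid_search`, the two example sources) is hypothetical and does not describe Summon or any vendor's product. The two result sets are kept apart rather than merged, and sources that miss the deadline are reported instead of holding up the page.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def search_local_index(query, index):
    """Return records from the local index whose title contains the query."""
    q = query.lower()
    return [rec for rec in index if q in rec["title"].lower()]


def broadcast_search(query, connectors, timeout=1.0):
    """Query every connector in parallel; skip sources that fail or miss the deadline."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(connectors)))
    futures = {pool.submit(fn, query): name for name, fn in connectors.items()}
    done, _ = wait(futures, timeout=timeout)
    hits, skipped = [], []
    for fut, name in futures.items():  # dict order keeps the output deterministic
        if fut in done and fut.exception() is None:
            hits.extend({"source": name, **hit} for hit in fut.result())
        else:
            skipped.append(name)
    pool.shutdown(wait=False)  # do not block on stragglers
    return hits, skipped


def hybrid_search(query, index, connectors, timeout=1.0):
    """Return the two result sets as separate 'tabs' instead of merging them."""
    hits, skipped = broadcast_search(query, connectors, timeout)
    return {
        "indexed": search_local_index(query, index),
        "broadcast": hits,
        "skipped": skipped,
    }


# Hypothetical sources, for illustration only.
def fast_source(query):
    return [{"title": f"{query} in a small association database"}]


def slow_source(query):
    time.sleep(1.0)  # simulates a remote site that answers too late
    return [{"title": "arrives after the deadline"}]


index = [{"title": "Federated search basics"}, {"title": "Unrelated record"}]
tabs = hybrid_search(
    "federated", index, {"fast": fast_source, "slow": slow_source}, timeout=0.2
)
for tab, content in tabs.items():
    print(tab, content)
```

Keeping the sets apart sidesteps the ranking problem entirely, at the cost of the user-facing awkwardness mentioned above. A merged view would need some way to compare relevance scores across sources, which this sketch does not attempt.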

PS: And certainly there might be SOME content that is available via broadcast search but not Summon. And vice versa. Sure. For the academic research market, that’s not important: What’s important is we’re already not able to offer unified search of ALL content, so switching to a different set of “not all” with a much better user experience will be a win, if it’s the right different set comparable in scope.

I don’t agree that “switching to a different set of ‘not all’ with a much better user experience will be a win.” A major value of federated search is that the client gets to select the sources and that, in most cases, if the source has a search page then a connector can be built for it. I don’t think it’s desirable to make an either/or decision. Take the best of both solutions and merge the two together. Not easy but critical, in my opinion.

The trick is indeed herding all that indexed metadata from many various sources. Summon’s promise is that the vendor will do that for you. If it can be done reliably at an affordable price, and can encompass a range of content _comparable for our needs_ (not identical) to existing broadcast search solutions… it’ll be a serious contender.

I do agree with you that services like Summon will become major players in unified search. I do have the concern, though, about competition among content providers. Serials Solutions is a business unit of ProQuest, a content publisher. Summon provides access to content from ProQuest and other publishers. Publishers don’t always play nicely together so I’d be nervous about being locked into offerings from any given set of publishers, some of whom might go away in the future.

The proof will be in the pudding of course. From talking to SerSol folks, they know these are the hurdles they need to clear to make it a realistic product for the academic research market. If they didn’t think they had a chance of clearing them, they probably wouldn’t be sending their R&D money down a black hole.

I’m looking forward to seeing how the service is received and how the federated search industry engages with and responds to this new offering. It’s worth noting that Summon has an API so an organization (or federated search vendor) can build a hybrid solution with Summon as one component. In fact, I wouldn’t be the least bit surprised if Serials Solutions used their federated search expertise to build their own hybrid product.


This entry was posted on Friday, March 20th, 2009 at 3:55 pm and is filed under discovery service, viewpoints.

5 Responses so far to "Beyond federated search? The conversation continues"

  1. 1 Jonathan Rochkind
    March 20th, 2009 at 7:36 pm  

    MANY of those databases simply offer no reliable machine search access. Sure, you COULD write an HTML screen-scraping solution, but when you’re talking about hundreds of resources… actually maintaining that in any reliable way would cost more than we can pay. :)

    I haven’t actually done an analysis to see exactly why each of those resources can’t be federated. Again, see limited resources. But in the academic market, this is typical; I don’t believe my vendor has significantly smaller (or larger) coverage than other vendors in the market. We have lots of websites from small vendors, often non-profit association vendors, that just don’t have a lot of functionality, even though they have important content.

    Of course there’s value in letting the user choose what to search. My point is not taking that away. My point is that their choices are _necessarily_ constrained by what is _available_ in the unified search interface. If you switch to a different set of content, you’re maybe going to take away some people’s favorite content, and give other people their favorite content they didn’t have before. The net effect on your user community may be a wash. Of course we’d LIKE to provide all content in the world, but meanwhile back to reality.

    One of the tricks of letting people choose ‘what they want’ is that when we’re talking hundreds of specialty licensed databases whose names are not household terms, the typical user has no idea what any of these are, or how to pick from them. Add to that, in a broadcast search environment, the units of selection are necessarily the individual ‘databases’ or ‘search engines’ or ‘resources’: collections of content chosen by someone else already, that can then be mixed and matched in bulk.

    The promise of the Summon approach is that you can choose content according to entirely different categories, at the _journal title_ level, crossing the boundaries of different existing ‘databases’. Summon hopes to make just such subject-selected categories of individual journals. And presumably allow librarians (or individual users) to create their own too. You aren’t constrained to just mixing and matching pre-existing collections, you can slice through what existing vendors happen to have chosen as collections.

    Indeed if they can get full text to index, they’ll be able to do more than just metadata. I believe they DO have full text where they can get it, and just metadata in other places, depending on what the source has or is willing to give them. Of course, even with broadcast search, _some_ remotely searched databases may just contain metadata anyway, others may contain fulltext. But yeah, it all depends on what Summon can get, and how well they can do with it. The proof will be in the pudding, but I don’t think this particular issue is the most likely weak point.

    I agree that two separate tabs, one for ‘indexed’ content, and one for ‘broadcast search’ content, is probably the only decent way to offer both without bringing indexed content down to the lowest common denominator of broadcast search. The trick here is it’s going to make no sense to users why some content is only available in one tab, and other content is only in the other, and others might be in both.

    If I had to choose only one of these tabs, either because it was too confusing to the users to have both, or more likely because I didn’t have the resources to maintain both (it’s going to be more expensive in terms of local maintenance and/or costs to vendors to have both), I’d pick whatever one worked best, naturally.

    Which works best will be something we’ll find out once Summon is done. But if it’s not Summon at first — my prediction is that it will be the Summon approach eventually. They might have to fine tune some things, we might need to wait until more content providers are willing/capable to share their metadata and/or indexable fulltext with a vendor like SerSol. But the Summon approach is the one with long-term legs on it in my opinion, it’s the one I’d put my money behind in the long term.

    This analysis applies mainly to the academic scholarly research market, with a main focus on searching the scholarly peer reviewed literature. That’s the market/use-case/environment I’m familiar with, and it has some special challenges that may not apply to other meta-search applications. Like the need to include a significant portion of the scholarly universe in order to serve its function, a scholarly universe which is split over hundreds if not thousands of publishers, platforms, aggregators, and other vendors. Also the absolute need to get structured citation metadata for the scholarly article ‘hits’, so it can be passed off to a ‘link resolver’ to actually deliver the article to the user (often from a different source than the citation was found in).

  2. 2 Jonathan Rochkind
    March 20th, 2009 at 7:40 pm  

    Another unique challenge of the academic scholarly meta-search use case is that different databases from different vendors may contain the _same_ content, with or without electronic full text, in unpredictably overlapping ways. So you’ve got a de-dup issue too when you combine them.

    There are more too. I’m not sure how much experience Deep Web Tech has in the academic scholarly search arena, but it really does have its own special complexities, beyond just, say, aggregating different silos of data or web pages from within a certain federal agency or whatever.

  3. 3 Paul R. Pival
    March 23rd, 2009 at 8:27 am  

    Definitely agreeing with Jonathan on this issue. We’re currently federating a little less than 200 of our nearly 800 databases, in part because some don’t return good results, and sometimes because they return overwhelming results (newspaper articles). It seems the biggest kickback we’ve seen since implementing federated search (SS 360) is that while folks appreciate the ability to cross search, the time it takes to do so seriously turns people off. From what I’ve seen on a couple of demos of Summon, the speed issue has completely gone away. What that suggests to me is that, as with Google, if you run a search and don’t like what you get back, the fact that almost no time was invested means you’ll tweak and try again until you like what you’re seeing in the results. That just does not happen with federated search, from what I’ve seen – our users aren’t waiting around for the slowest common denominator to return results.

    One of my predecessors in my position, now retired, insisted many years ago that the only way “cross database searching” would work is if the content was indexed locally. I agreed, but was sure that could never happen because of course the publishers and vendors would never play together. Now that it appears to be a possibility, I’m very eager to see how it plays out.

    I agree again with Jonathan on the issue around comprehensive coverage. Summon (and to my mind all federated search) is a *discovery* tool, not a replacement for the native interfaces and content providers. The researchers who need to dive deeply into their areas of specialization will still use individual databases, but for the majority of searchers I think the 80% (just an example number) of coverage that Summon would provide would be just fine. I also think that if Summon does a good job out of the gate, they’ll attract more content providers, snowballing their content coverage.

  4. 4 Jonathan Rochkind
    March 23rd, 2009 at 4:03 pm  

    Well, the native interfaces are discovery tools too, generally. :) If I _could_ provide a unified meta-search with as much power and flexibility as the native interfaces, surely I would, but it’s not realistic at present.

    So we ‘provide’ both. You can use native interfaces if you want, and our meta-search tool tries to give you options to do so, not hide them from you. But many users at many times will prefer meta-search, despite its limitations. They largely prefer it because it’s more convenient: no need to learn a specialty interface, no need to understand what native tool does what.

    If we can make it even more convenient, while ALSO making it work better… everyone wins. And those native interfaces are still there just as they ever were. That’s the promise of Summon, to me. I think they have a good chance of pulling it off.

    Maybe a federatedsearchblog interview with someone from Summon?

  5. 5 Sol
    March 26th, 2009 at 8:23 pm  

    Thank you, everybody, for all the great comments on both posts.

    I’m having a dialogue with Serials Solutions and I expect to get a response to publish here.
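
The de-duplication issue raised in comment 2, where different databases return the same article with or without full text, can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not a description of how any product handles it: the record fields (`doi`, `title`, `year`, `has_full_text`), the source names, and the DOI value are all made up, and matching is done on the DOI when one is present, otherwise on a normalized title plus year.

```python
import re


def citation_key(rec):
    """Build a matching key: the DOI if present, else normalized title plus year."""
    doi = (rec.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    title = re.sub(r"[^a-z0-9]+", " ", rec.get("title", "").lower()).strip()
    return ("title-year", title, rec.get("year"))


def dedupe(records):
    """Keep one record per key, preferring a copy that offers full text."""
    best = {}
    for rec in records:
        key = citation_key(rec)
        kept = best.get(key)
        if kept is None or (rec.get("has_full_text") and not kept.get("has_full_text")):
            best[key] = rec
    return list(best.values())


# Made-up records from three hypothetical sources.
records = [
    {"source": "Aggregator A", "title": "Deep Web Searching", "year": 2008,
     "doi": "10.1000/XYZ", "has_full_text": False},
    {"source": "Publisher B", "title": "Deep web searching.", "year": 2008,
     "doi": "10.1000/xyz", "has_full_text": True},
    {"source": "Index C", "title": "Another Article", "year": 2007,
     "has_full_text": False},
]
unique = dedupe(records)
for rec in unique:
    print(rec["source"], "-", rec["title"])
```

One limitation to note: a record that carries a DOI and a copy of the same article that lacks one end up under different keys, so they are not matched here. Real citation data is messier than this, which is part of why the problem is hard.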
