A couple of weeks ago I wrote about the Stanford Alumni Association giving federated search access to a number of high-profile scholarly sources. Abe read the article and called me, asking why I thought that Stanford was giving federated search access. I got that “deer in the headlights” look and had the urge to defend my position, even though I hadn’t yet had access to the service Stanford was providing. I noticed myself stating that “there are tons of sources, they must be federated.” In that moment I knew I was in trouble.
Well, I almost have access to the Stanford sources (I need to resolve a login issue) and I’m going to find out really quickly that Abe was (darn it) probably right. In particular, if all of my searches return results super quickly then I’m probably searching indexed content. I’ll leave you in suspense for just a few days, wondering whether the Stanford sources are federated or not, but I thought that my brazen, and likely incorrect, assumption was good fodder for an instructive article.
So, what happened? Why did I assume that Stanford was providing federated search access? One source of confusion is that documents are being delivered from thousands of sources by four content providers (ABI/INFORM, Business Source Alumni Edition, Dow Jones Factiva, and EBSCO Academic Alumni Edition). There is potential for federation among the four content providers and there may be federation within one or more of the content providers. I had just assumed that Stanford was providing a federated search interface to the search engines of the four providers and that the content providers were also federating all of their content. I may be wrong on both counts.
To be honest, the combination of tons of sources, scholarly content, and not having seen the search engine(s) in action biased me to assume federation. Thinking more clearly (after a whack in the side of my head) I should have asked myself some questions:
- Is access to the four content provider search engines available from a single search interface that searches the four content providers simultaneously? If so, then this is indeed a federated search application, although the more interesting question is whether the four content providers federate access to the thousands of underlying publications or not.
- Do the four content providers have access to the full-text of the articles they deliver? My guess is that these content providers do have access to the full-text.
- Do the content providers have a relationship with the scholarly and business publishers? Yes, they do.
- Do the content providers have a mechanism for determining that new content is available and do they have mechanisms for grabbing the new content and indexing it? I bet they do.
- Do the content providers have any incentive to federate content? My guess is that they don’t.
Let’s consider the reasons for federating content. I think you’ll agree with me that the documents from those thousands of publications are probably indexed and that the sources aren’t federated. Read Crawling vs. Deep Web Searching for a deeper comparison of the two access approaches.
- The full-text of the content isn’t available to retrieve and index. This one reason drives the entire federated search industry. Federated search exists because most scholarly content providers do not give away their entire documents to search engines. In other words, there is no relationship between the federated search engine and the content provider. This is clearly not the case with the four content providers in the Stanford offering: the publishers likely provide full-text of all of their articles to their partners, the four content providers.
- There’s no way to know when new content is available or what that new content is. Federated search is very useful when searching underlying sources whose content is frequently changing and when it’s not possible or practical to keep adding new content to an index. In the case of the four content providers, they have relationships with the scholarly publishers. They have mechanisms for learning of the existence of new content and for integrating the new content into their collections.
- It is desirable to merge content from different sources. This is where I was thinking the least clearly. Federation is about aggregating content from different sources, right? Well, Google aggregates content from millions of sources but it doesn’t federate. Federation is about the “live” aggregation of content from different sources. If there’s no incentive to aggregate content in real time then there’s no incentive to federate.
- It is desirable to homogenize the ranking of documents from the different publishers. Since there are only four content providers delivering search results in the Stanford offering, and since each provider likely has access to the full-text of articles from the hundreds of publishers it provides access to, each content provider probably applies a consistent relevance ranking algorithm to the articles it provides from all of the publishers it does business with. In the federated search model, all of the sources searched are unrelated and the federated search engine needs to rank documents independently of the source ranking to deliver a consistent ranking experience. But in the Stanford offering, the publishers accessed by the four content providers are NOT unrelated. That said, one could certainly make an argument that federation of the four content providers themselves is valuable.
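To make the distinction concrete, here is a toy Python sketch of the two approaches. Everything in it is made up for illustration: the provider names, the sample documents, and the `score` function are my own assumptions, not anything the Stanford providers actually run. The federated version contacts each source at query time and re-ranks the merged hits with one scoring function; the indexed version answers from an index built ahead of time.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample documents standing in for each provider's collection.
PROVIDERS = {
    "provider_a": ["federated search basics", "index freshness and crawling"],
    "provider_b": ["ranking merged search results", "federated search in libraries"],
}

def tokenize(text):
    return text.lower().split()

def score(query, doc):
    """Toy relevance: number of distinct query terms found in the document."""
    return len(set(tokenize(query)) & set(tokenize(doc)))

def search_provider(name, query):
    """Stand-in for a live call to one provider's own search engine."""
    return [(name, doc) for doc in PROVIDERS[name] if score(query, doc) > 0]

def rank(query, hits):
    """Apply one scoring function to all hits so ranking is consistent."""
    return sorted(hits, key=lambda h: (-score(query, h[1]), h[0], h[1]))

def federated_search(query):
    """Query every provider at search time, then merge and re-rank."""
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda n: search_provider(n, query), PROVIDERS))
    return rank(query, [hit for batch in batches for hit in batch])

def build_index():
    """Ahead of time, map each term to the documents that contain it."""
    index = defaultdict(set)
    for name, docs in PROVIDERS.items():
        for doc in docs:
            for term in tokenize(doc):
                index[term].add((name, doc))
    return index

def indexed_search(index, query):
    """Answer from the prebuilt index; no provider is contacted at query time."""
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return rank(query, hits)

if __name__ == "__main__":
    index = build_index()
    print(federated_search("federated search"))
    print(indexed_search(index, "federated search"))
```

On this tiny data set both functions return the same ranked list. The difference is where the work happens: the federated path pays for a round trip to every source on each query, while the indexed path pays once up front and then has to keep its index fresh.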
So, what do you think? Do the four content providers federate the content they deliver or do they index it?