A couple of weeks ago I wrote about ten unrealistic expectations that some people have about federated search. A few days ago The Krafty Librarian published a blog article expressing frustrations over PubMed having been down for a number of hours with no notification from PubMed. In my book, this makes for unrealistic expectation number 11:
11. If a source ever goes down, the content owner should immediately and widely broadcast this information.
I consider this an unrealistic expectation for the simple reason that it is hardly ever met. When I was very actively engaged in supporting applications for Deep Web Technologies, only in VERY rare situations did a content provider let us know that their search interface was down. I believe that expecting a source provider to tell you that their source is down is like asking drivers on the highway to never cut you off. It’s not going to happen, and you’re going to suffer a lot expecting people to be different than they are.
In the case of content providers there may be some explanations (not that I condone any) for not notifying the user community of problems:
- The support team might have believed that the problem was going to be quick to fix and they didn’t want to broadcast a problem if that was indeed the case.
- The PubMed management may not have wanted to embarrass themselves by publicly acknowledging the problem.
- It may have been an oversight on PubMed’s part.
The blog post about the PubMed problem does raise an interesting and very valid concern: it can be difficult to determine the source of a problem when a federated search of a source consistently returns no results. Part of the difficulty has to do with the introduction of programmatic search interfaces. A programmatic interface is provided by the owner of a particular search engine and allows a federated search application to search the source much more efficiently than by using the web form that a human would use. In the old days (5 years ago) there were far fewer programmatic interfaces to search engines. Federated search applications would perform “screen scraping” against virtually all sources, and when a source is screen scraped, diagnosing the source of a problem is usually straightforward because the application goes through the same web form that a human can try by hand. Today’s federated search applications ideally use a programmatic interface when one is available, both to lessen the burden on the source’s search engine and to simplify their own work in searching and retrieving results from the source. The content access basics articles on screen scraping and on XML provide some background on how federated search applications interface with sources to perform searches and retrieve documents. We’ll see, in a few paragraphs, how the existence of programmatic search interfaces complicates the matter of debugging federated search source access problems.
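To make the contrast between the two access methods concrete, here is a minimal sketch in Python. Both sample responses are made up for illustration and do not reflect the format of PubMed or any other real source; the element names, the `rslt` class, and the function names are all assumptions.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical XML that a programmatic interface might return.
XML_RESPONSE = """<results count="2">
  <record><title>Aspirin and heart disease</title></record>
  <record><title>Statins in older adults</title></record>
</results>"""

# Hypothetical HTML results page that a human would see in a browser.
HTML_RESPONSE = """<html><body>
<div class="nav">Home | Help</div>
<div class="rslt"><a href="/1">Aspirin and heart disease</a></div>
<div class="rslt"><a href="/2">Statins in older adults</a></div>
</body></html>"""


def titles_from_xml(text):
    """Programmatic interface: the structure is explicit, so parsing is short."""
    root = ET.fromstring(text)
    return [record.findtext("title") for record in root.iter("record")]


class ResultScraper(HTMLParser):
    """Screen scraping: pick result titles out of markup meant for display."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_result = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        # Only links inside <div class="rslt"> count as results (assumed layout).
        if tag == "div" and ("class", "rslt") in attrs:
            self._in_result = True
        elif tag == "a" and self._in_result:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "div":
            self._in_result = False

    def handle_data(self, data):
        if self._in_link:
            self.titles.append(data.strip())


def titles_from_html(text):
    scraper = ResultScraper()
    scraper.feed(text)
    return scraper.titles
```

Both functions return the same two titles from their sample input, but the scraper depends on page layout details (here, the assumed `rslt` class) that a site redesign can change without notice.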
Let’s consider different failure scenarios. For all scenarios we assume that the federated search application is consistently returning no results from PubMed for several queries that would normally return results, and that the problem isn’t a blatantly obvious one with the federated search application or with the library’s local network. We guard against falsely blaming PubMed for a problem in the federated search application itself by verifying that the application can get results from other sources.
Here’s one scenario: We can’t get to pubmed.gov to even attempt a search. Checking the source directly when federated search returns no results from the source is the first diagnostic step. If one can’t even get to pubmed.gov then I’d suspect a major problem, likely on their end, although it could be some kind of DNS or network problem somewhere between the federated search application and PubMed. Note that the fact that federated searches of other sources return results doesn’t eliminate the possibility of a network outage somewhere between you and PubMed’s network.
Let’s consider a second scenario: the search page at pubmed.gov is available but searches from the search form at pubmed.gov yield no results. Does this failure convince us that the federated search application can’t get any results from PubMed because of a problem on PubMed’s end? There’s a good chance that this is indeed the case, but not necessarily. Here’s why. PubMed provides a programmatic interface. Depending on the source of a problem, the programmatic interface, the human-search interface, or both might be unavailable. It’s possible, although not highly likely, that PubMed’s human-search interface is down and that its programmatic interface is up, but federated searches are failing due to some totally unrelated problem. If both paths to the source are down, it’s likely to be a common problem.
Let’s consider a third scenario: searches from pubmed.gov return results but searches from the federated search application don’t. Where does the problem lie? We don’t know without further investigation. It could be that the programmatic interface is broken. It could be that the programmatic interface changed and that the federated search vendor needs to modify or rewrite the connector. Users or librarians are not going to be able to diagnose the problem easily in this scenario. One would need to know how the federated search application’s connector was configured to search PubMed and be able to reproduce its access method. Reproducing the method requires esoteric skills and access to the code in the connector, and most users will possess neither of these.
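The three scenarios above can be summarized as a small triage routine. This is only a sketch of the reasoning in this article, not a diagnostic tool: the function name, the enum, and the three yes/no inputs are my own illustrative choices, and a person would answer each question by checking the source by hand.

```python
from enum import Enum


class Diagnosis(Enum):
    # Scenario 1: the source's site can't be reached at all.
    SOURCE_OR_NETWORK_OUTAGE = "Major problem, likely at the source; possibly DNS or network trouble in between."
    # Scenario 2: the site loads but its own search form returns nothing.
    PROBABLY_SOURCE_SIDE = "Probably a problem at the source, but the programmatic path could be failing for an unrelated reason."
    # Scenario 3: the source's own search works; only federated search fails.
    NEEDS_VENDOR_INVESTIGATION = "Programmatic interface broken or changed, or a connector problem; needs the vendor to investigate."
    NO_PROBLEM_OBSERVED = "Federated search is returning results from this source."


def triage(site_reachable, web_form_returns_results, federated_returns_results):
    """Map three manual observations onto the scenarios described above.

    Assumes, as the article does, that the federated search application
    still gets results from other sources.
    """
    if federated_returns_results:
        return Diagnosis.NO_PROBLEM_OBSERVED
    if not site_reachable:
        return Diagnosis.SOURCE_OR_NETWORK_OUTAGE
    if not web_form_returns_results:
        return Diagnosis.PROBABLY_SOURCE_SIDE
    return Diagnosis.NEEDS_VENDOR_INVESTIGATION


if __name__ == "__main__":
    # Third scenario: pubmed.gov works by hand, federated search returns nothing.
    print(triage(True, True, False).value)
```

Note that only the first two outcomes can be reached with checks a librarian can do from a browser; the third is a signal to call the vendor, not a diagnosis.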
The upshot is this: given the existence of multiple methods of accessing sources, with some methods being complex, it’s harder than ever to diagnose and debug the source (pun intended) of source problems. However, programmatic interfaces change less often than web-based interfaces do, so their presence is welcomed by federated search vendors. Also, as federated search applications search larger numbers of sources, it gets more difficult all the time to tell if one particular source is down, unless you happen to be looking for results from that particular source.
Another question, which I’ll address in a future article, is “What are your expectations of federated search vendor response to source problems?” Should vendors monitor sources 24 hours a day? How and how quickly should they inform you of a source problem? Should they inform you via the federated search application’s search page that a particular source is down? How quickly should they correct a connector problem?
What are your thoughts and experiences regarding the above questions?