My fur was raised when I saw Serials Solutions’ claim that their discovery service was an evolutionary step beyond federated search. I raised my concerns a couple of times: here and here. My beef isn’t with Serials Solutions as a business; it’s with their position that it’s fine not to search content that they don’t provide access to. There’s no room (yet) in their discovery service model to include access to quality content that can only be searched live, i.e. via federated search. Carl Grant joined the conversation and various people commented, making the topic a very lively one.

My concern was, and is, that libraries and research organizations would consider giving away their responsibility to select quality sources for their patrons for what I imagine to be two primary reasons: (1) library patrons don’t like to wait 30 seconds for federated search results, and (2) (possibly) cost savings. I don’t have a lot of sympathy for the Google generation. Even though I’m an American and my culture has taught me that immediate gratification is a good thing, I think 30 seconds is a small price to pay to see better results. Cost I can’t speak to, as I don’t have any figures.

One of my colleagues pointed me to an article by scientist and writer Michael Nielsen, Is scientific publishing about to be disrupted?, which only strengthens my belief that access to content from aggregators only supplements access via other methods such as federated search.

Michael Nielsen is a very accomplished scientist. His bio lists some of his impressive credentials:

Michael Nielsen is one of the pioneers of quantum computation. Together with Ike Chuang of MIT, he wrote the standard text on quantum computation. This is the most highly cited physics publication of the last 25 years, and one of the ten most highly cited physics books of all time (Source: Google Scholar, December 2007). He is the author of more than fifty scientific papers, including invited contributions to Nature and Scientific American. His research contributions include involvement in one of the first quantum teleportation experiments (related), named as one of Science Magazine’s Top Ten Breakthroughs of the Year for 1998, quantum gate teleportation, quantum process tomography, the fundamental majorization theorem for comparing entangled quantum states, and critical contributions to the formula for the quantum channel capacity. A full list of papers is here.

Nielsen’s article argues that there is impending disruption of scientific publishing. The article is fascinating; Nielsen is a compelling and well-informed writer, and I recommend you read the fairly long article and, if you have time, follow at least some of the numerous links. I also want to add that I had the opportunity to spend some time with Nielsen at a conference he helped to organize at the Perimeter Institute, and I very much appreciate how incredibly down to earth the man is.

What I found most valuable in Nielsen’s writing were various examples of science being published in non-traditional ways.

One example is Nielsen’s response to a New York Times editorial about the death of newspapers. Here’s a snippet from the editorial:

There’s a great deal of good commentary out there on the Web, as you say. Frankly, I think it is the task of bloggers to catch up to us, not the other way around… Our board is staffed with people with a wide and deep range of knowledge on many subjects. Phil Boffey, for example, has decades of science and medical writing under his belt and often writes on those issues for us… Here’s one way to look at it: If the Times editorial board were a single person, he or she would have six Pulitzer prizes…

And here’s Nielsen’s pointed response:

[The New York Times editorial piece] demonstrates a deep commitment to high-quality journalism, and the other values that have made the New York Times great. In ordinary times this kind of commitment to values would be a sign of strength. The problem is that as good as Phil Boffey might be, I prefer the combined talents of Fields medallist Terry Tao, Nobel prize winner Carl Wieman, MacArthur Fellow Luis von Ahn, acclaimed science writer Carl Zimmer, and thousands of others. The blogosphere has at least four Fields medalists (the Nobel of math), three Nobelists, and many more luminaries. The New York Times can keep its Pulitzer Prizes.

Nielsen’s point is clear. The blogosphere is a tremendous resource for scientists. Libraries and research organizations miss huge amounts of valuable and current resources if they only provide access to content from major publishers (or their aggregators). I do realize that the writings of probably all of the bloggers Nielsen mentioned are available through Google and might not make sense to federate. The problem with searching Google for excellent science is that you need the time and discernment to find the good stuff. But, however one might access science content, the power of traditional publishers is waning, which is a really good reason not to depend on them for all the science worth reading.

Here’s another excerpt from Nielsen’s article, this one on innovative ways to communicate science that are sprouting up everywhere:

What’s new today is the flourishing of an ecosystem of startups that are experimenting with new ways of communicating research, some radically different to conventional journals. Consider Chemspider, the excellent online database of more than 20 million molecules, recently acquired by the Royal Society of Chemistry. Consider Mendeley, a platform for managing, filtering and searching scientific papers, with backing from some of the people involved in Last.fm and Skype. Or consider startups like SciVee (YouTube for scientists), the Public Library of Science, the Journal of Visualized Experiments, vibrant community sites like OpenWetWare and the Alzheimer Research Forum, and dozens more. And then there are companies like WordPress, Friendfeed, and Wikimedia, that weren’t started with science in mind, but which are increasingly helping scientists communicate their research.

These Web 2.0 science offerings, at least the ones that provide an API or other mechanism for efficient search, are prime candidates for federation as they constantly generate new content.
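To make the federation idea concrete, here is a minimal Python sketch of a broadcast search over such API-backed sources. The connector functions are stand-ins for real HTTP calls (the source names are borrowed from Nielsen’s examples, but the record shapes and behavior are entirely hypothetical); the point is the fan-out-and-merge pattern, not any vendor’s actual API.

```python
import concurrent.futures

# Hypothetical connectors -- in a real system each would issue an
# HTTP request against one live source's search API. These stubs
# simply fabricate one record apiece for illustration.
def search_chemspider(query):
    return [{"source": "ChemSpider", "title": f"{query} (molecule record)"}]

def search_openwetware(query):
    return [{"source": "OpenWetWare", "title": f"{query} (protocol page)"}]

def federated_search(query, connectors, timeout=30):
    """Broadcast the query to every connector in parallel and merge
    whatever results come back before the timeout expires."""
    results = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(c, query) for c in connectors]
        for f in concurrent.futures.as_completed(futures, timeout=timeout):
            results.extend(f.result())
    return results

hits = federated_search("quantum teleportation",
                        [search_chemspider, search_openwetware])
```

The timeout is the crux of the 30-second trade-off discussed above: a live broadcast search waits on the slowest source it is willing to wait for, in exchange for results no pre-built index contains.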

One last quote from Nielsen. I very much enjoyed the examples he packed into this paragraph of outstanding science being found in, of all places, blogs.

It’s easy to miss the impact of blogs on research, because most science blogs focus on outreach. But more and more blogs contain high quality research content. Look at Terry Tao’s wonderful series of posts explaining one of the biggest breakthroughs in recent mathematical history, the proof of the Poincare conjecture. Or Tim Gowers’ recent experiment in “massively collaborative mathematics”, using open source principles to successfully attack a significant mathematical problem. Or Richard Lipton’s excellent series of posts exploring his ideas for solving a major problem in computer science, namely, finding a fast algorithm for factoring large numbers. Scientific publishers should be terrified that some of the world’s best scientists, people at or near their research peak, people whose time is at a premium, are spending hundreds of hours each year creating original research content for their blogs, content that in many cases would be difficult or impossible to publish in a conventional journal. What we’re seeing here is a spectacular expansion in the range of the blog medium. By comparison, the journals are standing still.

At SLA 2009, Abe delivered a presentation: A Journey to 10,000 sources. The talk was about (this blog’s sponsor) Deep Web Technologies’ efforts to search initially hundreds, then thousands, and eventually 10,000 sources. The accompanying paper makes this important argument for making a wider range of science information available to researchers:

By relying on only the content available from the major publishers and aggregators, researchers miss other important content, in particular the output of scientists who do not publish in mainstream journals. The world is shrinking, the brain pool is growing, and the output of science is everywhere.

While one may argue about the merits of federation vs. crawling and indexing vs. discovery services, those arguments frequently focus on the technological merits of particular approaches. The more important question, I think, is: what information is worth your while to see? For most of us, that information can’t all be federated, or all indexed, or all provided by a discovery service. I think federated search will continue to evolve into a hybrid in which multiple technologies are enlisted to give scientists what they need.
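That hybrid can be sketched in a few lines: consult a locally indexed (discovery-service-style) store first, then broadcast to live sources for what the index doesn’t cover, and merge the two result sets. Everything here — the index contents, the record fields, the live-source stub — is illustrative only, not any real product’s behavior.

```python
# A toy pre-built index, standing in for a discovery service's
# harvested/aggregated content. Keys are normalized queries.
local_index = {
    "poincare conjecture": [
        {"source": "Local index",
         "title": "Tao's posts on the Poincare conjecture"},
    ],
}

def search_live_source(query):
    # Stand-in for a live broadcast (federated) query against a
    # source that cannot be harvested into the index.
    return [{"source": "Live source", "title": f"Fresh result for '{query}'"}]

def hybrid_search(query):
    """Indexed hits first, then live hits, with duplicate titles dropped."""
    indexed = local_index.get(query.lower(), [])
    live = search_live_source(query)
    seen, merged = set(), []
    for hit in indexed + live:
        if hit["title"] not in seen:
            seen.add(hit["title"])
            merged.append(hit)
    return merged

results = hybrid_search("Poincare Conjecture")
```

The design point is simply ordering: the fast indexed answers arrive immediately, while the federated layer fills in current content the index was never able to license or harvest.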

This entry was posted on Wednesday, July 1st, 2009 at 8:21 pm and is filed under discovery service, viewpoints.

3 Responses so far to "Science source selection"

  1. Jonathan Rochkind
    July 2nd, 2009 at 6:51 am  

    Sol, I’ve said this before, but I don’t understand where you’re coming from on the “it’s fine to only search content you provide access to” point, because as far as I can tell, _federated search is exactly the same thing_. Even with broadcast search, you can only search content that is accessible to broadcast search by your tool, which is not ALL content!

    With either broadcast search or an aggregated index, you only have access to what you have access to. And in either case, providing access to any arbitrary resource may or may not be possible, and will take a non-zero amount of ‘development’ time to add. With either case, the user using the tool only has access to searching content included in the tool, unless they leave the tool, which they can do in either case.

    It seems to me you are assuming that broadcast search will have a lower barrier to including additional resources, and end up having more coverage? This to me is something that needs to be seen in practice, not assumed theoretically. The key thing will be comparing two actual tools, and seeing which tool includes more resources, and more relevant and quality resources, for some particular context (audience and use cases).

    If SerSol’s aggregated index comes up short, then that will be to its detriment. But I don’t see how that can be assumed without an actual evaluation of two actual deployed tools side by side.

  2. Sol
    July 2nd, 2009 at 7:53 am  


    I think the answer is a hybrid service. Index what you can (discovery service), federate what you can, harvest what you can, etc. A good federated search framework can integrate access to content that has been obtained by a number of different technologies.

    In other words, lead with the question of what sources are important not with the question of what technology is the most convenient. Then find ways to get access to each of them. I’ve never said that federated search makes discovery services or crawling or harvesting obsolete.

  3. Peter Noerr
    July 3rd, 2009 at 2:09 pm  

    There is generally another aspect to the “accessibility” of sources besides the technical one, which is what Jonathan talks about. And that is the commercial aspect.

    The commercial conditions for getting a set of records into an aggregated search environment seem to often be more onerous than allowing a broadcast search of the native source. Part of the problem lies in perception of the aggregated environment as being a competitor to the source of the records. This may be a true perception, a false one, or a wishful one.
