Archive for October, 2010


The AIIM Digital Landfill Blog recently published an article about making information findable in the enterprise:

With the explosion of content in the enterprise, findability has become a major concern for most organizations. Tens of millions of documents can be scattered between multiple file systems, databases and content management applications. Each application has a different access method and no common interface exists for a user who needs a piece of knowledge to get their job done. Putting all of this content in a common location and providing an interface to find this data is the job of enterprise search.

The “8 keys to findability” apply to federated search every bit as much as to enterprise search. Here are the keys with my thoughts on each:

  1. Data in Many Types of Files. Combining files of different types from multiple sources into a single index is nice in theory but very difficult in practice. An approach that is frequently more practical is to federate the content and to access each source (regardless of its content type) through a specially crafted connector.
  2. Data Locked in Systems. This is the content discovery problem. Useful content is scattered throughout the enterprise. “Out of sight out of mind” is not a good thing when it applies to enterprise data. Federating multiple sources keeps anything that might be relevant in sight.
  3. Not all Text is the Same. Context is important. Enterprise search places great value on going beyond keyword search to providing search results in context. Federated search could do more in this area.
  4. Users Can’t Spel so Gud. A spell checker is a start but it’s not enough. A domain-specific thesaurus can widen the net just enough to find relevant content that would be missed otherwise.
  5. Word Forms should not be Important. Not a whole lot to say here. Stem. Lemmatize. Do reasonable things with stop words.
  6. Relevancy is not the same for Everyone. Relevance and personalization come together. Allow tuning of what fields matter most in search results. Ideally, customize these parameters for each user.
  7. Security is Paramount. Access control is a big deal in the enterprise but is more of a special case in federated search.
  8. Effective Search Experience. Usability. Usability. Usability. Not all federated search applications are created equal. If users can't use the application, they can't find what they need.
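The connector approach in the first key above can be sketched in a few lines of Python. Everything here is invented for illustration (the class names, the toy scoring that boosts title matches over body matches); real connectors also handle authentication, query translation, and result parsing for each source.

```python
from dataclasses import dataclass

@dataclass
class Result:
    title: str
    snippet: str
    source: str
    score: float = 0.0

class Connector:
    """Adapts one content source to a common search interface."""
    name = "base"

    def search(self, query: str) -> list[Result]:
        raise NotImplementedError

class FileShareConnector(Connector):
    """Hypothetical connector over an in-memory list of (title, body) docs."""
    name = "fileshare"

    def __init__(self, docs):
        self.docs = docs

    def search(self, query):
        q = query.lower()
        hits = []
        for title, body in self.docs:
            # Toy relevance: a match in the title counts double a body match.
            score = 2.0 * title.lower().count(q) + body.lower().count(q)
            if score > 0:
                hits.append(Result(title, body[:40], self.name, score))
        return hits

def federate(connectors, query):
    """Query every connector and merge the results by descending score."""
    merged = []
    for c in connectors:
        merged.extend(c.search(query))
    return sorted(merged, key=lambda r: r.score, reverse=True)
```

Each new source then needs only its own `Connector` subclass; the federator and the result model stay unchanged, which is the practical appeal over building one giant unified index.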

The AIIM article concludes with a fitting summary:

These eight items are critical to consider when implementing an enterprise search solution. Keep in mind however, that most often this needs to be an iterative process. The goal of findability requires regular reviews of user activity and tuning to ensure that the application is still effective. The core, however, is a good search engine and some domain knowledge about the content available to an enterprise. While these projects can be difficult, achieving findability for your knowledge store will make your users productive. And that is certain to make management happy.


This great quote from a blog post, Is Discovery Better Than Google Scholar?, got my attention:

Users want one-stop shopping. But “The Internet is the world’s largest library; it’s just that all the books are on the floor! It’s time to start picking them up.”

It’s a great image of the challenges of searching, aggregating, and organizing the Web’s scholarly information.

The article is a summary of an Internet Librarian 2010 presentation which explored usage of Serials Solutions Summon vs. Google Scholar:

Four presenters from two institutions, the University of Manitoba (UM) and Stony Brook University, did some usability testing and compared usage patterns before and after implementing Serials Solutions’ Summon system.

Read the full article here.


The Axiell blog published an article: Data well - the kiss of death? Here’s the first paragraph:

When Federated Search software failed to deliver “One search to find them all” in a reasonable and comfortable way, alternatives were brought to the table. Winning concepts for the time being are “Federated indexes” or “Data Wells”. What are these creatures? They could be described as containers of metadata. Such containers include commercial alternatives like Ebsco Discovery and Serials Solutions Summon and publicly funded ones like Summa from Århus University Library.

The article makes a key point that is often overlooked:

If the “Well” is constructed as a Web Service the local library webs can choose how they want to present the metadata and it will then be a part of the local service image. It will serve the local library web platform interactivity.

The point here is that a library needs to add value to the information it makes available to its patrons; otherwise, the value of the library itself can come into question. With a web service interface to data wells, the library can combine offerings from different wells and build the services its users need.
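The idea of combining offerings from different wells can be illustrated with a toy sketch. The well names, record fields, and stub function below are all made up; a real integration would make HTTP calls against the actual schemas and protocols of services like Ebsco Discovery or Summon.

```python
def fetch_from_well(well_name):
    """Stand-in for an HTTP call to a hypothetical data-well web service."""
    sample = {
        "well_a": [{"title": "Local history of Århus", "year": 2008}],
        "well_b": [{"title": "Digital libraries", "year": 2010},
                   {"title": "Local history of Århus", "year": 2008}],
    }
    return sample.get(well_name, [])

def combine_wells(well_names):
    """Merge records from several wells, dropping duplicate titles."""
    seen, merged = set(), []
    for name in well_names:
        for record in fetch_from_well(name):
            if record["title"] not in seen:
                seen.add(record["title"])
                merged.append({**record, "well": name})
    return merged
```

Because the library controls this merge step, it also controls presentation: deduplication, ordering, and which local services get mashed up with the external metadata, which is exactly the flexibility the Axiell article argues for.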

The Axiell article provides a number of excellent questions to consider about how to make metadata available to patrons:

  • What metadata does the library want to present
  • How does the library want to present the local metadata
  • What local services does the library want to couple (mash up) with external metadata
  • What external services does the library want to couple with the local metadata
  • What metadata should be coupled to searches in the library’s web site
  • What metadata does the library want to couple to external or local Social Media functionality
  • What metadata should be available in the library’s mobile phone solutions
  • What metadata does the library want to stream to services in the local civil society
  • What metadata should be searchable together with metadata from other cultural institutions in the local community
  • What metadata can be used by the library patrons
  • What metadata is available to partners
  • What metadata should be presented in which order, which form? Physical media, own digital collection, chosen electronic media, externally available media, national collections etc… In one, two or three steps?

Read the whole Axiell article here.


This tweet pointed me to a Wall Street Journal article: ‘Scrapers’ Dig Deep for Data on Web.

At 1 a.m. on May 7, the website noticed suspicious activity on its “Mood” discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.

It was a break-in. A new member of the site, using sophisticated software, was “scraping,” or copying, every single message off PatientsLikeMe’s private online forums.

There’s a huge and growing market for deep Web information for marketing, competitive intelligence, background checking and other purposes. The deep Web isn’t just about finding scholarly documents in scientific, technical, or business journals. Private information on web forums may not be as private as we would like.

The market for personal data about Internet users is booming, and in the vanguard is the practice of “scraping.” Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

If you thought that you were safe because you don’t use your real name in online forums, think again:

New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people’s real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou’s people-search website offers records of about 250 million people, primarily in the U.S. and Canada.

Read the whole WSJ article here.


Here’s an entertaining article from DomainGang.

Yahoo to eliminate all Bots, Crawlers and Web Spiders

Posted by Lucius “Guns” Fabrice on September 20, 2010

Since the early ’90s the automated search and retrieval of the so-called “deep web” has become the ultimate goal of every search engine. While Google recently introduced the Instant Search feature that delivers results on the spot, trends might change soon.

After almost 15 years on the go, Yahoo announced that it’s pulling the plug on its bots, crawlers and web spiders.

“Our business model is simple: automation kills jobs”, said Matt Jiggerson, chief engineer at Yahoo.

“All this software, searching thousands of web sites and storing millions of gigabytes of information does not fit in the current state of bad economy we are facing”.

Read the whole article.