23 Nov

Resource Shelf alerted me to research by Harvard Professor Benjamin Edelman: “Hard-Coding Bias in Google ‘Algorithmic’ Search Results.” Edelman, who discloses that he consults for companies that compete with Google (as do I, consulting for this blog’s sponsor, Deep Web Technologies), writes about the disconnect between Google’s commitment to providing unbiased results and its efforts to keep users on its own properties.

A cynical user might expect Google to prominently link to its own services. After all, keeping users on Google properties means more opportunities to show ads — hence greater revenue. And every click Google sends through a no-cost algorithmic link is a lost revenue opportunity.

But on numerous occasions, Google has promised not to succumb to temptation to bias its search results. To the contrary, Google has committed to provide users with the best possible links, chosen fairly and even-handedly.

I have to admit that I was a bit surprised to see an article about Google favoring its own content in search results, since I expect Google, like every other search engine driven by search revenue, to feature its own results first. What surprised me more was how strong Google’s promise was “not to succumb to temptation to bias its search results.”

7 Nov

Information Today columnist Don Hawkins recently published “A Blunt Assessment of Search Discovery Tools.” Hawkins highlights some concerns that the Montana State University library raised with two discovery services that they experimented with, WorldCat Local and Summon.

WorldCat Local

  • Many records didn’t have OCLC numbers so did not show up in the database.
  • Some known items (mainly government documents) were not found.

Summon

  • The vendor promised a simple implementation, but loading one digital collection was slow.
  • Problems occurred in the details: deleted items continued to appear; known item searches may not work.
  • Database name searches may require an exact match.
  • One experiment resulted in a 29% failure rate for a subject search.
  • Sometimes discovery tools search the full text, but not always, and we don’t know when they do.
  • Relevance is not good yet.

Ex Libris Chief Librarian Carl Grant raised concerns of his own, but from a different slant. In “‘Gladiators’ to perform sleight-of-hand at Charleston Conference,” Grant makes a pretty strong assertion, referring to EBSCO and Serials Solutions:

These two particular firms are, as Library Journal says, in the “greatest competition” because they are, first and foremost, publishers/aggregators fighting head-to-head for their first line of business, which is content and content aggregation services. The discovery solution is secondary to them and it is shown in numerous ways by their actions.

He proceeds to provide questions to discovery service providers to understand their true motivations. These questions, of course, reflect the interests of discovery service provider Ex Libris. Nonetheless, those exploring discovery services need to ask these questions.

The upshot — federated search isn’t dead yet and discovery services are not the magic bullet the marketing material would have you believe.

29 Oct

The AIIM Digital Landfill Blog recently published an article about making information findable in the enterprise:

With the explosion of content in the enterprise, findability has become a major concern for most organizations. Tens of millions of documents can be scattered between multiple file systems, databases and content management applications. Each application has a different access method and no common interface exists for a user who needs a piece of knowledge to get their job done. Putting all of this content in a common location and providing an interface to find this data is the job of enterprise search.

The “8 keys to findability” apply to federated search every bit as much as to enterprise search. Here are the keys with my thoughts on each:

  1. Data in Many Types of Files. Combining files of different types from multiple sources into a single index is nice in theory but very difficult in practice. An approach that is frequently more practical is to federate the content and to access each source (regardless of its content type) through a specially crafted connector.
  2. Data Locked in Systems. This is the content discovery problem. Useful content is scattered throughout the enterprise. “Out of sight out of mind” is not a good thing when it applies to enterprise data. Federating multiple sources keeps anything that might be relevant in sight.
  3. Not all Text is the Same. Context is important. Enterprise search places great value on going beyond keyword search to providing search results in context. Federated search could do more in this area.
  4. Users Can’t Spel so Gud. A spell checker is a start but it’s not enough. A domain-specific thesaurus can widen the net just enough to find relevant content that would be missed otherwise.
  5. Word Forms should not be Important. Not a whole lot to say here. Stem. Lemmatize. Do reasonable things with stop words.
  6. Relevancy is not the same for Everyone. Relevance and personalization come together. Allow tuning of which fields matter most in search results. Ideally, customize these parameters for each user.
  7. Security is Paramount. Access control is a big deal in the enterprise but is more a special case in federated search.
  8. Effective Search Experience. Usability. Usability. Usability. Not all federated search applications are created the same. If users can’t use it, they can’t find what they need.
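
Keys 1, 4, and 5 can be sketched together in a few lines of Python. This is only a minimal illustration, not how any particular product works: the `Connector` type, the two in-memory sources, the tiny `THESAURUS`, and the crude suffix-stripping `stem` function are all hypothetical stand-ins for real connectors, a domain-specific thesaurus, and a proper stemmer.

```python
from typing import Callable, Dict, List, Set

# Hypothetical connector type: takes a set of query terms, returns matching titles.
Connector = Callable[[Set[str]], List[str]]

# Toy domain thesaurus (an assumption for this sketch): term -> related terms.
THESAURUS: Dict[str, Set[str]] = {"car": {"automobile"}, "heart": {"cardiac"}}


def stem(word: str) -> str:
    """Very crude stemmer for illustration: lowercase and strip a plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def expand(query: str) -> Set[str]:
    """Stem each query word, then widen the net with thesaurus entries."""
    terms = {stem(w) for w in query.split()}
    for term in list(terms):
        terms |= THESAURUS.get(term, set())
    return terms


def make_connector(documents: List[str]) -> Connector:
    """Build a connector over an in-memory list of titles.

    A real connector would call the source's own search interface instead.
    """
    def search(terms: Set[str]) -> List[str]:
        return [d for d in documents if terms & {stem(w) for w in d.split()}]
    return search


def federated_search(query: str, connectors: Dict[str, Connector]) -> Dict[str, List[str]]:
    """Send the expanded query to every connector and group results by source."""
    terms = expand(query)
    return {name: connector(terms) for name, connector in connectors.items()}


connectors = {
    "file_share": make_connector(["Used cars pricing guide", "Office seating plan"]),
    "cms": make_connector(["Automobile safety recalls", "Holiday schedule"]),
}
results = federated_search("car", connectors)
print(results)
```

In this sketch a search for "car" reaches a title that only says "cars" (via stemming) and one that only says "Automobile" (via the thesaurus), each through its own connector, without merging the sources into one index.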

The AIIM article concludes with a fitting summary:

These eight items are critical to consider when implementing an enterprise search solution. Keep in mind however, that most often this needs to be an iterative process. The goal of findability requires regular reviews of user activity and tuning to ensure that the application is still effective. The core, however, is a good search engine and some domain knowledge about the content available to an enterprise. While these projects can be difficult, achieving findability for your knowledge store will make your users productive. And that is certain to make management happy.

26 Oct

This great quote from a LibConf.com blog post, Is Discovery Better Than Google Scholar?, got my attention:

Users want one-stop shopping. But “The Internet is the world’s largest library; it’s just that all the books are on the floor! It’s time to start picking them up.”

It’s a great image of the challenges of searching, aggregating, and organizing the Web’s scholarly information.

The article is a summary of an Internet Librarian 2010 presentation which explored usage of Serials Solutions Summon vs. Google Scholar:

Four presenters from two institutions, the University of Manitoba (UM) and Stony Brook University did some usability testing and compared usage patterns before and after implementing Serials Solutions’ Summon system.

Read the full article here.

21 Oct

The Axiell blog published an article: Data well – the kiss of death? Here’s the first paragraph:

When Federated Search software failed to deliver “One search to find them all” in a reasonable and comfortable way alternatives were brought to the table. Winning concepts for the time being are “Federated indexes” or “Data Wells”. What are these creatures? They could be described as containers of metadata. Such containers are commercial alternatives like Ebsco Discovery and Serial Solutions Summon and publicly funded like Summa from Århus University Library.

The article makes a key point that is often overlooked:

If the “Well” is constructed as a Web Service the local library webs can choose how they want to present the metadata and it will then be a part of the local service image. It will serve the local library web platform interactivity.

The point here is that a library needs to add value to the information it makes available to its patrons; otherwise the value of the library can come into question. With a web service interface to data wells, the library can combine offerings from different wells and build the services its users need.
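
To make the idea concrete, here is a small Python sketch of a library combining two data wells behind a web service interface. It is an illustration under stated assumptions only: the well names, the record fields (`id`, `title`, `source`), and the `fetch_*` functions are hypothetical, and the functions return canned data rather than calling any real service.

```python
from typing import Dict, List

Record = Dict[str, str]


def fetch_commercial_well(query: str) -> List[Record]:
    """Hypothetical stand-in for a call to a commercial well's web service."""
    return [
        {"id": "isbn:111", "title": "Nordic Library History", "source": "commercial"},
        {"id": "isbn:222", "title": "Metadata Basics", "source": "commercial"},
    ]


def fetch_public_well(query: str) -> List[Record]:
    """Hypothetical stand-in for a call to a publicly funded well."""
    return [
        {"id": "isbn:222", "title": "Metadata Basics", "source": "public"},
        {"id": "isbn:333", "title": "Local Archives Guide", "source": "public"},
    ]


def combine_wells(query: str) -> List[Record]:
    """Merge results from both wells, keeping the first record seen for each id."""
    merged: Dict[str, Record] = {}
    for record in fetch_commercial_well(query) + fetch_public_well(query):
        merged.setdefault(record["id"], record)
    return list(merged.values())


combined = combine_wells("metadata")
for record in combined:
    print(record["title"], "(" + record["source"] + ")")
```

Because the library owns the merge step, it decides which well wins on duplicates and how the combined records are presented, which is where the local added value comes from.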

The Axiell article provides a number of excellent questions to consider about how to make metadata available to patrons:

  • What metadata wants the library to present
  • How does the library want to present the local metadata
  • What local services does the library want to couple (mash up) with external metadata
  • What external services does the library want to couple with the local metadata
  • What metadata should be coupled to searches in the libraries web site
  • What metadata does the library want to couple to external or local Social Media functionality
  • What metadata should be available in the libraries mobile phone solutions
  • What metadata does the library want to stream to services in the local civil society
  • What metadata should be searchable together with metadata from other cultural institutions in the local community
  • What metadata can be used by the library patrons
  • What metadata is available to partners
  • What metadata should be presented in which order, which form? Physical media, own digital collection, chosen electronic media, externally available media, national collections etc… In one, two or three steps?

Read the whole Axiell article here.

17 Oct

This tweet pointed me to a Wall Street Journal article: ‘Scrapers’ Dig Deep for Data on Web.

At 1 a.m. on May 7, the website PatientsLikeMe.com noticed suspicious activity on its “Mood” discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.

It was a break-in. A new member of the site, using sophisticated software, was “scraping,” or copying, every single message off PatientsLikeMe’s private online forums.

There’s a huge and growing market for deep Web information for marketing, competitive intelligence, background checking and other purposes. The deep Web isn’t just about finding scholarly documents in scientific, technical, or business journals. Private information on web forums may not be as private as we would like.

The market for personal data about Internet users is booming, and in the vanguard is the practice of “scraping.” Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

If you thought that you were safe because you don’t use your real name in online forums, think again:

New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people’s real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou’s people-search website offers records of about 250 million people, primarily in the U.S. and Canada.

Read the whole WSJ article here.

9 Oct

Here’s an entertaining article from DomainGang.

Yahoo to eliminate all Bots, Crawlers and Web Spiders

Posted by Lucius “Guns” Fabrice on September 20, 2010

Since the early 90′s the automated search and retrieval of the so-called “deep web” has become the ultimate goal of every search engine. While Google recently introduced the Instant Search feature that delivers results on the spot, trends might change soon.

After almost 15 years on the go, Yahoo announced that it’s pulling the plug on its bots, crawlers and web spiders.

“Our business model is simple: automation kills jobs”, said Matt Jiggerson, chief engineer at Yahoo.

“All this software, searching thousands of web sites and storing millions of gigabytes of information does not fit in the current state of bad economy we are facing”.

Read the whole article.

28 Sep

[ Editor's Note: This is a guest article by David Jenkins. ]

David Jenkins has worked in libraries since 2005. He has experience in the public and academic library sectors, starting as a Library and Information Assistant with Sheffield City Council and moving on to become an Assistant Librarian with the Electronic Service Development Team (ESDT) at Manchester Metropolitan University (MMU) in 2009. He holds an MA Librarianship from the University of Sheffield, graduating in 2009. This course sparked an interest in the relationship between libraries and technology that has informed his practice since. David is Web Liaison for the CILIP North West Branch committee.

Implementing Search it! at Manchester Metropolitan University (MMU)

Note – Search it! is only accessible to current MMU students and staff. This is due to licensing restrictions imposed by the publishers whose content is accessible via Search it! For further insight into using Search it!, please see our helpsheet, FAQ and video.

For the 2010/2011 academic year, Manchester Metropolitan University (MMU) Library has launched Search it!, a federated search solution for its students and staff. This piece will define Search it!, describe the way it has been implemented and examine why it has been implemented in this way. I hope that this will give you an insight into one academic library’s perspective on implementing a federated search product.

Search it! is based on Metalib by Ex Libris. It is simply MMU’s branding and configuration of its own instance of Metalib.

15 Sep

[ Editor's note: The following is a guest article by Dr. Peter Noerr.

Peter Noerr’s background is in information retrieval, where his extensive design and development experience has culminated in the creation of successful information technology product lines. Dr. Noerr was educated in South Africa and the UK, completing a Doctorate in Information Science from The City University, London. He spent six years working for the British Library as Head of Systems Development. In 1980 he left the Library to co-found IME Ltd. Dr. Noerr designed and produced the Tinman/Information Navigator line of library automation software for the company, selling over 3,000 systems throughout the world by the time the company was sold in 1996. Since then, Dr. Noerr has consulted for a variety of organizations on information management and retrieval. Dr. Noerr has authored many articles and publications and is frequently invited to speak at international conferences. Dr. Noerr is co-founder of MuseGlobal, Inc. and chief architect of the Muse product line. Dr. Noerr currently serves as Chief Technology Officer of MuseGlobal, Inc. ]

Federated Search or federated search

A little while ago, New York Law School announced the unveiling of its DRAGNET system, in which searchers use an application built with Google’s Custom Search Engine (CSE) to find answers to their questions from a stable of 72 legal websites. The announcement runs:

The New York Law School’s Mendik Library has recently developed DRAGNET, a search tool that allows the user to find a topic simultaneously in more than 80 legal web sites and databases. DRAGNET stands for “Database retrieval access using Google’s new electronic technology.”
It is located at http://www.nyls.edu/library/research_tools_and_sources/dragnet

Leaving aside the difference in the number of sources, it is a well-engineered and targeted system for its intended clientele. And it is intended for a particular purpose.

DRAGNET can be a good tool to begin a research project, giving you a sense of what kinds of materials can be found on your topic.

What is of interest to me is that it has been touted by commentators (on the web4lib listserv, for example) as a “federated search tool.” Now, admittedly this use of federated search (FS) does not include capital letters, and the actual phrase has something of an identity-crisis-laden history, but DRAGNET (which does not use the name) is not a federated search system by whatever name you wish to call the technology.

31 Aug

In May, search consultant Avi Rappoport delivered a presentation at the Enterprise Search Summit: Federated vs. Aggregated Search Architectures.

Avi Rappoport is an enterprise search consultant, helping companies improve search engine functionality for websites and intranets. She has a degree from UC Berkeley’s (then) School of Library and Information Science and spent 10 years in software development before becoming a search consultant. She is the editor of SearchTools.com and a frequent speaker and author, providing a strong focus on search usability in the broadest sense and sharing her conviction that search engines can always be better.

Avi created a web page with a summary of and links to a couple of versions of her presentation.

I greatly appreciate Avi’s consideration of the pluses and minuses of federation and aggregation (i.e. discovery services) in a world that is often polarized about one approach being better in all cases.

My research for this presentation indicated that each is useful in specific circumstances (I know, no surprise there). Many data sources are obviously best accessed by one or the other, but it’s the corner cases that are tricky. Aspects to consider include:

  • size of the content in the source
  • how often your users need that content
  • content change rate
  • importance of real-time access control permissions changes
  • content licensing rules
  • available tools for indexing / querying
  • difficulty of extracting and indexing
  • quality of the internal search engine
  • difficulty of sending queries and receiving results

The final slide has some sage advice:

Be open-minded, analyze the benefits of each approach for each data source.

One size does NOT fit all.