23
Nov

Beyond search result bias

Author: Sol

Resource Shelf alerted me to research by Harvard Professor Benjamin Edelman: “Hard-Coding Bias in Google ‘Algorithmic’ Search Results.” Edelman, who discloses that he consults for companies that compete with Google (as do I, consulting for this blog’s sponsor, Deep Web Technologies), writes about the disconnect between Google’s commitment to providing unbiased results and its efforts to keep its users on its own properties.

A cynical user might expect Google to prominently link to its own services. After all, keeping users on Google properties means more opportunities to show ads — hence greater revenue. And every click Google sends through a no-cost algorithmic link is a lost revenue opportunity.

But on numerous occasions, Google has promised not to succumb to temptation to bias its search results. To the contrary, Google has committed to provide users with the best possible links, chosen fairly and even-handedly.

I have to admit that I was a bit surprised to see an article about Google biasing some search results toward its own content, since I expect Google, and every other search engine that is driven by search revenue, to feature its own results first. What did surprise me, though, was how strong Google’s promise was “not to succumb to temptation to bias its search results.”

As you might suspect, I cite federated search as a search technology that is inherently less biased. When an organization has a search engine built for its users, the organization selects the sources, the vendor builds connectors to those sources and, at least in principle, there’s no bias in the algorithm the federated search engine uses to pick the results. I say “in principle” because, in practice, there are at least three biases that can taint results:

Some sources can return results faster than others. A slow source may time out during a search, and its results may not be seen at all.
More advanced federated search products allow different weights to be applied to results from different sources. This essentially makes one or more sources “featured sources” in that their results will appear first.
The mere act of selecting which sources to include in a federated search product creates a bias as content from other sources is never included.
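
To make the three biases concrete, here is a minimal sketch of a federated search merge in Python. It is not how Deep Web Technologies or any other vendor implements federation; the function federated_search, the toy source functions, the weights and the timeout value are all assumptions made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def federated_search(query, sources, weights=None, timeout=0.5):
    """Query every source in parallel and merge whatever arrives in time.

    sources: dict mapping a source name to a callable(query) that returns
             a list of (title, score) pairs. Hypothetical interface.
    weights: optional dict of per-source multipliers (bias 2: featured sources).
    timeout: seconds to wait before giving up on slow sources (bias 1).
    """
    weights = weights or {}
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on sources that are still running

    merged = []
    for fut in done:
        name = futures[fut]
        for title, score in fut.result():
            # A weight above 1.0 pushes this source's results toward the top.
            merged.append((score * weights.get(name, 1.0), name, title))
    merged.sort(key=lambda row: row[0], reverse=True)

    # Sources that missed the deadline contribute nothing to the result list.
    timed_out = sorted(futures[f] for f in not_done)
    return merged, timed_out


# Toy sources, invented for this example.
def fast_source(query):
    return [("Fast A", 0.9), ("Fast B", 0.4)]


def featured_source(query):
    return [("Featured A", 0.5)]


def slow_source(query):
    time.sleep(1.0)  # simulates a source that responds too slowly
    return [("Slow A", 1.0)]


# Bias 3: only the sources listed here can ever appear in the results.
results, dropped = federated_search(
    "solar energy",
    {"fast": fast_source, "featured": featured_source, "slow": slow_source},
    weights={"featured": 2.0},
    timeout=0.2,
)
```

In this toy run the slow source would have supplied the highest-scoring result, but it misses the deadline and is reported in dropped instead, while the weight of 2.0 lifts the featured source’s single result above both results from the fast source. Returning the list of timed-out sources alongside the results is one way an application could disclose the first bias to its users.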

Even so, the organization that owns the federated search application can control these biases in some cases and disclose them when it can’t (e.g., when a source does not reliably return results quickly).

Blog sponsor Deep Web Technologies is the federated search vendor I’m most familiar with. They’ve built search engines for a number of customers, including the federal government. Prominent search portals such as Science.gov and WorldWideScience.org use Deep Web Technologies’ search engine to federate results and, there being no profit motive or any other reason that I’m aware of to bias results, researchers using these portals don’t have to wonder if the results are somehow skewed. They can look at the list of sources and know that, subject to the speed at which each source returns results, the results are unbiased. (Disclosure: I also consult for DOE OSTI, the organization that spearheaded and now stewards Science.gov and WorldWideScience.org.)

The concern that Edelman raised regarding content neutrality in Google is what makes me very nervous about discovery services. While I can’t deny that having a large index that returns results really quickly is a very nice feature, I have a nagging concern about what we’re giving up to get this speed. Deep Web Technologies founder, president, and CTO Abe Lederman (yes, he’s my brother) raised this concern in the Deep Web Technologies blog.

Not everything is going to be in the index – Discovery Services index content that is of general enough interest such that it makes business sense for these products to index. Discovery Services also need to establish business relationships with the owners of content in order to index it and may also require the owner of the content to expend effort in making their content available to the Discovery Service. This means that not all content is going to be in their indices, particularly niche (long-tail) content.

So, if a library desires to provide their patrons with comprehensive, one-stop access to all the content that they subscribe to, the use of Federated Search is still required. EBSCO acknowledges this while Serials Solutions does not.

Vendor neutral — Both Summon and EDS are Discovery Services provided by multi-billion dollar publisher/aggregator companies whose main business is making sure that libraries purchase their content. Shouldn’t librarians worry whether results returned by Summon might be biased towards higher ranking of ProQuest results? Or even worse, shouldn’t librarians worry whether content from a competitor to the Discovery Service they are considering subscribing to is going to be missing. Carl Grant’s blog post addresses this issue in more detail.

Carl Grant, whose company also sells a discovery service, raises the neutrality issue as well in his article, “Gladiators” to perform sleight-of-hand at Charleston Conference.

What is “content-neutrality”, who offers it and why is it important?

Content neutrality means that the library, not the publisher/content aggregator or vendor, can minimally control the following:

What content is included in their discovery tool.
The relevance ranking on that content. Can you force the content that is unique to your library to the top of the result sets? Can you control the relevancy ranking of all the content being offered through your discovery layer?
Control of the facets offered by the system. Facets are a very quick way for users to quickly sort through a lot of content, but in order for you to meet the specific needs of your users, these must be under your control. If they’re not under your control, careful analysis of those offered and why they’re being offered is needed on the library’s part before proceeding.

Carl Grant raises one other troubling concern:

Also remember that when you sign with a publisher/aggregator for their discovery tool and you use their aggregate index that it has their competitors data loaded into it. That means they can now see the usage of not only their own content, but also that of their competitors. They can see what titles are used; they can see how often they’re used. It’s certainly possible, if you don’t control the relevancy ranking as described above, that they might force their content to rank higher than their competitors and therefore encourage greater use. I may be naïve, but no one is ever going to convince me that this information isn’t going to be mighty handy to have when it comes time for these publisher/aggregators to define the content packages for next year, what titles are in them and how they’re going to position and price them against their competitors.

One would expect some bias in a discovery service that includes content only from the publishers and aggregators it has relationships with. More troubling is the lack of transparency from the major discovery service providers about what that content is.

Abe Lederman has publicly challenged Serials Solutions and EBSCO to become transparent in what content they’re providing:

I want to use this post to challenge all providers of Discovery Services, not just Summon and EDS, to become completely transparent by clearly listing on their web sites the content that has been indexed for use by your Discovery Service. For each publisher’s content that you index please indicate what journals / databases you are indexing, the period of coverage, whether you have indexed only an item’s meta-data or also its full-text, and how much your index typically lags behind current content available from each publisher.
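
As a rough illustration of what such a disclosure could look like in machine-readable form, here is a minimal Python sketch. The CoverageEntry record, its field names, the publishers_missing helper and the placeholder publishers are all hypothetical; no discovery service is known to publish its coverage in this shape.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CoverageEntry:
    """One line of a hypothetical coverage disclosure."""
    publisher: str
    source_title: str                  # journal or database being indexed
    coverage_start_year: int
    coverage_end_year: Optional[int]   # None means coverage is ongoing
    full_text_indexed: bool            # False means metadata only
    typical_lag_days: int              # how far the index trails the publisher


def publishers_missing(disclosure, subscribed_publishers):
    """Return the subscribed publishers that never appear in the disclosure."""
    indexed = {entry.publisher for entry in disclosure}
    return sorted(set(subscribed_publishers) - indexed)


# Placeholder data, invented for this example.
disclosure = [
    CoverageEntry("Publisher A", "Journal of Examples", 1995, None, True, 14),
    CoverageEntry("Publisher B", "Example Abstracts", 2001, 2009, False, 60),
]
missing = publishers_missing(
    disclosure, ["Publisher A", "Publisher B", "Publisher C"]
)
```

With a list like this in hand, a library could check its own subscriptions against the index before signing, rather than relying on hearsay about whose content is included.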

If Lederman’s challenge is ignored, which I imagine it will be, you might want to ask yourself, as well as Serials Solutions and EBSCO, what they have to lose by disclosing whose content they include.

Does Summon include EBSCO content? Does EBSCO Discovery Service contain Serials Solutions content? Lederman is not sure:

I have been told, but have not had a chance to confirm, that EBSCO content is not in Summon and ProQuest content is not in EDS.

In summary, if you as a library or research organization value control over what content you are providing your patrons, you’ve got some hard questions to ask. Content neutrality is far from being a given.

Tags: federated search

This entry was posted on Tuesday, November 23rd, 2010 at 7:45 am and is filed under discovery service, viewpoints.

2 Responses so far to "Beyond search result bias"

1 Alan Cockerill
November 23rd, 2010 at 2:29 pm
“I have been told, but have not had a chance to confirm, that EBSCO content is not in Summon and ProQuest content is not in EDS.”

Summon harvests its data from many places but does focus on agreements with publishers rather than aggregators - this bypasses an intermediary like Ebsco - so even though they aren’t harvesting Ebsco’s data they are likely catching a fair percentage of it via the publishers Ebsco (and the database maintainers it hosts) are also indexing.

That’s a point that librarians new to the service struggle with. It doesn’t harvest A&I dbs necessarily, but it does attempt to harvest the same source content as A&I dbs. Can’t speak for EDS - which I believe is still partly a traditional federated search engine.
2 Sol
November 28th, 2010 at 5:01 pm
Alan,

Thank you for the information. It fills a gap in my knowledge.
