[ Editor's note: The following is a guest article by Dr. Peter Noerr.

Peter Noerr’s background is in information retrieval, where his extensive design and development experience has culminated in the creation of successful information technology product lines. Dr. Noerr was educated in South Africa and the UK, completing a Doctorate in Information Science from The City University, London. He spent six years working for the British Library as Head of Systems Development. In 1980 he left the Library to co-found IME Ltd. Dr. Noerr designed and produced the Tinman/Information Navigator line of library automation software for the company, selling over 3,000 systems throughout the world by the time the company was sold in 1996. Since then, Dr. Noerr has consulted for a variety of organizations on information management and retrieval. Dr. Noerr has authored many articles and publications and is frequently invited to speak at international conferences. Dr. Noerr is co-founder of MuseGlobal, Inc. and chief architect of the Muse product line. Dr. Noerr currently serves as Chief Technology Officer of MuseGlobal, Inc. ]

Federated Search or federated search

A little while ago New York Law School announced the unveiling of its DRAGNET system, where searchers can use an application built with Google's Custom Search Engine (CSE) to find answers to their questions from a stable of 72 legal websites. The announcement runs:

The New York Law School’s Mendik Library has recently developed DRAGNET, a search tool that allows the user to find a topic simultaneously in more than 80 legal web sites and databases. DRAGNET stands for “Database retrieval access using Google’s new electronic technology.”
It is located at http://www.nyls.edu/library/research_tools_and_sources/dragnet

Leaving aside the difference in the number of sources, it is a well-engineered, targeted system for its intended clientele. And it is intended for a particular purpose.

DRAGNET can be a good tool to begin a research project, giving you a sense of what kinds of materials can be found on your topic.

What interests me is that commentators (on the web4lib listserv, for example) have touted it as a "federated search tool." Now, admittedly this use of federated search (FS) does not include capital letters, and the phrase itself has something of an identity-crisis-laden history, but DRAGNET (which does not use the name) is not a federated search system, by whatever name you wish to call the technology.

The New York Law School does not call it federated search, and it gives a pretty clear idea of what the tool is doing.

A DRAGNET search is like a Google search, except that it runs in only a select group of websites, produced by the organizations and entities listed below. The sites were chosen by our Library staff for their reliability and utility to legal researchers. Your search retrieves the top 100 hits, ranked for relevance by Google’s search engine.

This gives a good excuse to discuss some of the technical differences and decide if they actually matter.

I ran a couple of example searches. Because the first two included sites are the ABA Family Legal Guide and the American Academy of Matrimonial Lawyers, I decided to try two topics that should be dear to their hearts: "buying a house" and "divorce" (sorry, but lawyers get more involved in divorces than marriages). My search "buy house" featured results from the SEC (5), the IRS (4), and the Justice Department (5), but none from the ABA or the AAML. (Numbers are from the first 20 results.) Trying "buying house" did manage to get one record from the ABA, but now 6 from the IRS and none from the SEC. Turning to "divorce", the top sites are Womenslaw (7), Hieros Gamos (5), and the State of NY (3). In passing, there were no records in the top 100 for "divorce" from either of my two targeted sites.

I am not competent to judge whether these results meet the criteria of the average searcher, and I am not complaining that they are a bad set of results. They are just entirely different from what would be obtained from a federated search engine. It is horses for courses, and users need to know the differences.

Obtained results

The search on the Google CSE is a search of the whole of Google. The results are obtained, ranked according to Google's secret sauce, and then filtered for records from the desired sites. Because the ranking is done across the whole of Google's database, pages from the desired sites are ranked by their linkage (and other criteria) to sites completely outside the list of desired sites. Their ranking is global, not specific. Thus the government sites rank highly because of links from consultancies, software vendors, and law practices, which drives the specialist sites down (and off) the list of results. Federated search systems, by contrast, query each selected site and let it send back its "top 10". These are aggregated and ranked within the result set, so even the specialist sites get a say in the results.
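The contrast between the two pipelines can be sketched in a few lines. This is a toy illustration of the logic only: the sites, scores, and record shapes are invented for the example, and nothing here reflects real Google or federated-search internals.

```python
# Toy contrast of the two ranking models described above (invented data, not a real API).

def cse_style(global_ranked_pages, allowed_sites, limit=100):
    """Rank globally first, then filter down to the chosen sites.
    Specialist pages ranked low globally can fall off the list entirely."""
    return [p for p in global_ranked_pages if p["site"] in allowed_sites][:limit]

def federated_style(per_site_results):
    """Each site returns its own 'top 10'; merge and rank within that set,
    so every selected site gets a say in the final list."""
    merged = [hit for hits in per_site_results.values() for hit in hits]
    return sorted(merged, key=lambda h: h["score"], reverse=True)
```

In the first model a specialist site with weak global linkage never surfaces; in the second it is guaranteed at least its own contribution to the merged set.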

The result records

Google (and thus the CSE) looks at website pages – these are the results returned. Federated search engines query the site's own search engine. This means an FS will get no results if the site does not have a search box it can use. If the site uses Google to index it, then the FS will get the same web pages as the Google CSE, but differently ranked. If the site has its own search engine, then the federated search engine will get results from that database – which are invisible to Google. (Invisible unless those results all come from web pages that Google can crawl and index, that is.) Thus, even from the same sites, it is very likely that the returned results will simply be different, leaving aside the "chopping" effect of the ranking.

Searching

In general, Google uses simple "keyword" indexing, where the terms are all that matter. This provides a simple and extremely powerful search across a multitude of different sites, but it treats all words the same. Thus the plaintiff and the lawyer are one and the same to Google. A federated search engine is built to recognize the intricacies of each site and its search engine. Thus more precise searches are possible – though not through a simple single "search box" interface. Precision is what federated search engines are about, whereas Google is about recall.
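One concrete form "recognizing the intricacies of each site" can take is translating a single fielded query into each source's own syntax. The source names and query syntaxes below are hypothetical, invented purely for illustration; they do not describe any real system.

```python
# Hypothetical per-source query translation; none of these syntaxes
# refer to a real product.

def translate(field, term, source):
    """Render one fielded query in the syntax a given source expects."""
    if source == "catalogue":        # imagined library-catalogue syntax
        return f"{field}:({term})"
    if source == "legaldb":          # imagined legal-database syntax
        return f"{field.upper()}={term}"
    return term                      # plain keywords: the field is lost

# The same user intent, rendered three ways:
queries = {s: translate("author", "smith", s)
           for s in ("catalogue", "legaldb", "web")}
```

The keyword fallback in the last branch is exactly the loss of precision described above: once the field name is dropped, the plaintiff and the lawyer really are one and the same.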

Processing

Finally, all good federated search engines can post-process the results: re-rank or sort them, de-duplicate them, filter them, or even manually delete records. Analyses of different sorts are available to home in on the facet of the results the user is interested in. With Google, you have to scan the list. That takes time.
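Two of those post-processing steps, de-duplication and re-sorting, can be sketched as simple passes over the merged result set. The record fields used here (title, year) are assumptions made for the example, not the fields of any particular product.

```python
def dedupe(records):
    """Keep the first of any records sharing a normalized title/year key."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["title"].strip().lower(), rec.get("year"))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def resort(records, field, descending=False):
    """Re-rank the merged set on any field present in the records."""
    return sorted(records, key=lambda r: r[field], reverse=descending)
```

Merged results from several sites routinely contain the same document under slightly different titles, which is why the key is normalized before comparison.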

The two mechanisms operate very differently, even if in both cases the user types words and sees lists of results. The results they see are quite different. And the two should not be confused by the professionals of this business, who should be looking for the best tool for the job, not a handy name they can borrow.


This entry was posted on Wednesday, September 15th, 2010 at 7:29 pm and is filed under Uncategorized. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or TrackBack URI from your own site.

2 Responses so far to "Federated Search or federated search"

  1. Gregor Erbach
    September 15th, 2010 at 11:34 pm  

    Nice explanation of the differences between federated search and a customized search engine. There are a few more:
    1. Federated search may find documents that are missed by the crawler of a search engine.
    2. A search engine indexes all the words on a page, including the page header, footer, advertisements, and navigation elements, while the search engine queried by a federated search system can be more selective.
    3. With a crawler-based search engine there will always be a delay before a new document is found and indexed, whereas the search engine queried by the federated search system can be updated at the same time as new documents are created.

  2. David Goessling
    October 25th, 2010 at 10:23 am  

    I had commented on this article previously on LinkedIn. A few more comments now that I’ve spent some more time with Google CSE:

    My understanding of the Google CSE is a bit different from that described in Dr. Noerr’s article, and if it is incorrect I would appreciate being corrected.

    Dr. Noerr says that “The search on the Google CSE is a search of the whole of Google. The results are obtained, ranked according to Google’s secret sauce, and then filtered for records from the desired sites. Because the ranking is done across the whole of Google’s database, pages from the desired sites are ranked by their linkage (and other criteria) to sites completely outside the list of desired sites. Their ranking is global, not specific.”

    Is this true? I understood that when I choose a set of sites to include in my CSE, Google is essentially putting those sites in a “black box” and building a localised index of just those sites, and presenting the results using the Google algorithms, including PageRank, based on this localised usage. Sort of like having a somewhat crippled Google Search Appliance, but in the cloud. SERPs are built using the multi-weighted index of just those sites, not “big-Google.” Also, it is possible to influence the results set(s) over time using the CSE Synonyms and Refinements features. It’s also possible to boost and give weight to specific sites in the collection of sites you choose to “point” the CSE at. So, again, you can influence the results set.
    We might presume that an instance of CSE can also be influenced by many of the same “controls” as “big Google.” So, for example, if I am pointing at a group of sites that I can somewhat “influence” by getting their webmasters to write an informative meta description for their site, then that will have the same “SEO” effect as it does in “big Google”, since it’s fairly well known that the latter gives some priority to that metadata field in indexing. There are of course other unknowns about CSE behavior… let’s share what we learn in the Google CSE forums!
