There are lots of terms used interchangeably with “federated search.” I wanted to know how frequently each of these synonyms is used, since I maintain a number of alerts on several of these terms to keep an eye out for interesting web pages, blog posts, and events. So I set out to do some exploration with Google. Here are the eight terms I googled as quoted phrases, in alphabetical order, and for which I noted the document counts:
- broadcast search
- cross-database search
- deep web search
- distributed information retrieval
- distributed search
- federated search
- meta-search (same as “meta search”)
- metasearch
Here are the eight terms and their counts in decreasing order of frequency:
- metasearch: 5,480,000
- meta-search (same as “meta search”): 3,500,000
- broadcast search: 156,000
- federated search: 141,000
- cross-database search: 135,000
- distributed search: 98,100
- deep web search: 33,600
- distributed information retrieval: 32,400
Did I learn anything interesting? I did.
- Google result counts vary between queries. When I ran a number of the queries more than once, I sometimes got different result counts the second time.
- A query for metasearch, without quotes, retrieves many more results than a search for “metasearch”. There must be some stemming that goes on when the quotes are omitted.
- “broadcast search” finds more pages than does “federated search”.
- “cross-database search” is nearly as popular as “federated search”.
I’d better set up some Google alerts for “broadcast search” and “cross database search”.
In googling around I discovered that you can ask Google to show you documents (and counts) within certain periods of time. So, for example, I can ask Google to show me documents about federated search that were indexed (for the first time, I assume) in the past month. This discovery led me to the idea of seeing which of the federated search terms were “gaining popularity.” In other words, when new documents are indexed by Google, do they tend to favor some terms over others?
I redid the eight searches with varying time periods: 3 months, 6 months, and 12 months. I recorded those numbers, in order, and tacked the “anytime” count onto the end of each row. Here are those numbers. Note that WordPress isn’t letting me embed a table in this blog article, so I’m using a fixed-width font for the table below. I hope it renders properly in your browser.
- term : 3 mo. | 6 mo. | 12 mo. | anytime
- metasearch : 163,000 | 239,000 | 252,000 | 5,480,000
- meta-search : 77,300 | 107,000 | 140,000 | 3,500,000
- broadcast search : 27,200 | 35,300 | 36,000 | 156,000
- federated search : 8,760 | 13,400 | 19,500 | 141,000
- cross-database search : 1,030 | 2,010 | 2,330 | 135,000
- distributed search : 2,610 | 4,470 | 7,070 | 98,100
- deep web search : 1,330 | 1,570 | 1,760 | 33,600
- distributed info... retrieval : 1,030 | 1,650 | 2,240 | 32,400
So, what did I learn from this exercise?
- The 3-month counts range from roughly 50% to 85% of the 6-month counts, about two-thirds on average. That implies that, of the new documents found for a term in the last 6 months, most were found in the last 3 months. This seems suspicious to me. More likely is that the Google index favors recent documents. Even more likely is that Google has a harder time dating documents than it wants to admit and that these date counts don’t mean as much as I’d like them to mean.
- “broadcast search” is increasing in document count at a much higher rate than the other terms.
- When I look at a number of the results for “broadcast search” I discover that most of them aren’t related to federated search. This is also a problem, to a lesser extent, with the term “metasearch”, which is often used to refer to search engine aggregators like Dogpile, which may not be of interest to people following federated search in libraries.
- The ratio of the 3-month count to the anytime count for the term “federated search” is about 0.06 (8,760 / 141,000). This is much higher than the same ratio for the other terms, except for “broadcast search”, which I should have left out altogether.
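For anyone who wants to check the arithmetic behind these observations, here is a minimal Python sketch. It assumes the counts are typed in by hand from the table above; the names `COUNTS`, `ratios`, and `rank_by_recent_share` are my own, made up for illustration, and nothing here queries Google.

```python
# Counts copied by hand from the table above: (3 mo., 6 mo., 12 mo., anytime).
# The dictionary and function names are hypothetical, chosen for this sketch.
COUNTS = {
    "metasearch": (163_000, 239_000, 252_000, 5_480_000),
    "meta-search": (77_300, 107_000, 140_000, 3_500_000),
    "broadcast search": (27_200, 35_300, 36_000, 156_000),
    "federated search": (8_760, 13_400, 19_500, 141_000),
    "cross-database search": (1_030, 2_010, 2_330, 135_000),
    "distributed search": (2_610, 4_470, 7_070, 98_100),
    "deep web search": (1_330, 1_570, 1_760, 33_600),
    "distributed information retrieval": (1_030, 1_650, 2_240, 32_400),
}


def ratios(counts):
    """Return {term: (3-month / 6-month, 3-month / anytime)}."""
    return {
        term: (three / six, three / anytime)
        for term, (three, six, _twelve, anytime) in counts.items()
    }


def rank_by_recent_share(counts):
    """Sort terms by their 3-month / anytime ratio, highest first."""
    result = ratios(counts)
    return sorted(result, key=lambda term: result[term][1], reverse=True)


if __name__ == "__main__":
    computed = ratios(COUNTS)
    for term in rank_by_recent_share(COUNTS):
        three_vs_six, three_vs_all = computed[term]
        print(f"{term:35s} 3mo/6mo={three_vs_six:.2f}  3mo/anytime={three_vs_all:.3f}")
```

Sorting by the 3-month / anytime ratio puts “broadcast search” first and “federated search” second, which matches the observation above. The sketch only restates the recorded numbers, so it inherits whatever is wrong with Google's dating of documents.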
Given my suspicion of Google’s dating of documents, I can’t really tell how much these counts are influenced by the index growing versus the date algorithm.
All in all, though, this has been an interesting diversion. It’ll be worth revisiting these counts in a year.