A new paradigm for federated search | Federated Search Blog
May 27, 2009

Steve Jurvetson is a Managing Director of Draper Fisher Jurvetson, a leading venture capital firm. At last week’s 11th annual Churchill Top Tech Trends event, Jurvetson spoke about a trend toward decentralized search that could dramatically change how many of us think about federated search. WebWare covered the event and discussed the trend:

Venture capital whiz-kid Steve Jurvetson gave an impassioned pitch for this trend, which he called, “The triumph of the distributed Web.” He said the aggregate power of distributed human activity will trump centralized control. His main point was that Google, and other search engines that analyze the Web and links, are much less useful than a (theoretical) search engine that knows not what people have linked to (as Google does), but rather what pages are open on people’s browsers at the moment that people are searching. “All the problems of search would be solved if search relevance was ranked by what browsers were displaying,” he said.

Jurvetson believes that the future is “federated search,” in which the Web’s users don’t just execute search queries, they participate in building the index by the very act of searching, immediately and directly.


Jurvetson’s talk on “the Triumph of the Distributed Web” is also available as a YouTube video.
Abe and I had a long conversation about Jurvetson’s very interesting idea. In the current Google-centered search world, we are limited by how quickly Google indexes content. Google’s ranking also leans heavily on links, so the pages with the most inbound links tend to come first. When we search for something, then, we find the pages that people made the effort to link to and that Google has already found and indexed.

Jurvetson’s paradigm shift is to stop relying on Google as the central web authority, and to stop depending on it to index the pages that matter to us in a timely manner. Instead, all of us “vote with our browsers.” We tell the world what matters to us, in this moment, by browsing web pages (including those with images, video, and audio) and spending time on them. Voting takes no linking effort, and there’s no delay waiting for our votes to be counted. And when yesterday’s news becomes old news, there are no out-of-date links to remove or discount. We move on to today’s pages and the search engines follow us.

Jurvetson’s paradigm could introduce a fascinating shift to federated search as we know it today. Imagine being able to search millions of sources in real time, including the Deep Web sources that make up most of the Web and that Google doesn’t access, without having to build a single connector! This is a very real possibility, albeit with a couple of serious caveats. Let’s look at how it could work.

There would exist a “collective” database, changing in real time, that records which web pages members of the Collective (ideally all of the Web’s users) are viewing and the average amount of time each page is viewed. Through a browser plugin, every browser in the Collective transmits page and time-on-page information to the big database. To address privacy issues, the plugin never sends information that could identify an individual. The plugin could even be open sourced so that privacy advocates could verify its correct operation. The content of the transmitted web pages would be indexed for searchability. Nothing about the browsing behavior of Collective members changes.

Members of the Collective could search the big database. Ranking of search results would be influenced by the popularity of web pages, i.e., the average length of time members spent on any given page. Of course, the Collective wouldn’t offer much value if Web citizens only searched the Collective itself. I imagine that a world would evolve in which rankings from the Collective are embedded into search results from many search engines, making the machinery of the Collective largely invisible to most Web users.
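
To make the mechanics a little more concrete, here is a minimal Python sketch of the flow described above. It is only an illustration under stated assumptions, not anything Jurvetson proposed in detail: the names `CollectiveIndex`, `PageStats`, `report_view`, and `search` are hypothetical, the “big database” is a pair of in-memory dictionaries, matching is naive substring matching, and ranking uses nothing but average dwell time. A report carries a URL, the page text, and seconds viewed, and deliberately no user identifier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PageStats:
    """Running dwell-time totals for one URL; no user identifiers are stored."""
    total_seconds: float = 0.0
    views: int = 0

    @property
    def average_seconds(self) -> float:
        return self.total_seconds / self.views if self.views else 0.0


class CollectiveIndex:
    """Hypothetical in-memory stand-in for the Collective's "big database"."""

    def __init__(self) -> None:
        self.stats = defaultdict(PageStats)  # url -> dwell-time totals
        self.text = {}                       # url -> lowercased page text

    def report_view(self, url: str, page_text: str, seconds: float) -> None:
        """Record one anonymous plugin report: URL, page content, time on page."""
        entry = self.stats[url]
        entry.total_seconds += seconds
        entry.views += 1
        self.text[url] = page_text.lower()

    def search(self, query: str) -> list:
        """Return (url, average dwell) for pages whose text contains every
        query term as a substring, longest average dwell time first."""
        terms = query.lower().split()
        hits = [
            (url, self.stats[url].average_seconds)
            for url, body in self.text.items()
            if all(term in body for term in terms)
        ]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)


index = CollectiveIndex()
index.report_view("https://example.org/a", "Federated search basics", 30)
index.report_view("https://example.org/a", "Federated search basics", 90)
index.report_view("https://example.org/b", "Federated search news", 20)
index.report_view("https://example.org/c", "Cooking tips", 300)

results = index.search("federated search")
print(results)
```

In this toy run, page `a` outranks page `b` because its two views average 60 seconds against 20, while page `c` never appears because its text does not match the query. A page that no member has reported is simply absent from the index.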

How is this federated search? Consider the millions of web sites and Deep Web repositories that the Web collective searches every day as content sources. The documents within these sources are searched in real time (because the index is updated in real time), and results are relevance ranked (by browser behavior). Seen that way, it looks a lot like today’s federated search.

While I find Jurvetson’s paradigm compelling, there are two sobering limitations to consider. First, the Collective only knows about content that its members have seen. Obscure content may never be seen by any Collective member and would thus never get indexed. This is different from today’s federated search paradigm, where the underlying (Deep Web) search engines will return documents even if no one has ever asked for them before. Second, researchers who are looking for high-quality scholarly material will have to separate the wheat from the chaff themselves, the way they need to with Google. This is different from the current model of federated search, where the scholarly search engines return only scholarly content. I do imagine, however, that tools will emerge that will help to identify the scholarly sources, with help from the Collective.

Jurvetson does acknowledge, in his video, that his approach isn’t going to replace Google, Wolfram Alpha, or other search engines that provide factual information. While Jurvetson claims that his approach works best for harnessing the power of the Real-time Web, I believe that even people searching for information that isn’t about what’s happening right now can benefit. A researcher seeking out scholarly articles can benefit from knowing how much attention particular documents are getting now, even if those documents are years old. And the Collective may help to locate those old documents if more current ones are receiving less attention.

Caveats aside, I’m very intrigued by the possibilities of a Web 2.0 crowdsourcing approach to building a better search engine.

Dan Rua offers insightful commentary on Jurvetson’s ideas on his blog:

Jurvetson’s observations about distributed search are very similar to some I’ve been investigating. For every explicit action people take on the web (e.g. creating a link that GOOG indexes), there are 10X+ implicit actions (e.g. browsing, scrolling, video abandons) that are not being indexed for intelligence today. There are distinct efforts happening within content indexing, social feeds and webmaster analytics (largely implicit action data) that, when combined, will deliver maximum search intelligence. Much of the “Real-time web” excitement around Twitter, FriendFeed, Facebook and others still focuses on explicit actions (e.g. status updates), but the best value will come from the real-time, socially-curated, implicitly-indexed web.

And, lest you think Google is going to be left in the dust if the world of search changes, consider this piece of a blog article by Joanne Cummings at the Google Subnet Blog:

While most pundits tend to agree that search is on the brink of morphing into something completely different–far beyond the current list of “10 blue links” when it comes to usefulness–dismissing Google’s participation in such a change is probably not the wisest course. Each time a small idea-rich yet cash-poor start-up delivers an interesting new twist on search, Google’s competitive streak goes into overdrive to not only match the capability but usually surpass it entirely. Just look at latest round of updates Google announced at Searchology, which took straight aim at search wannabes like Twitter and Wolfram Alpha.

Yes, search is poised to change into something completely un-Google-like. But don’t count Google out in the process. It may just end up leading the charge.

What do you think? Share your thoughts in the comments.

This entry was posted on Wednesday, May 27th, 2009 at 6:57 am and is filed under Uncategorized, viewpoints.

6 Responses so far to "A new paradigm for federated search"

  1. Daniel Tunkelang
    May 27th, 2009 at 12:49 pm

    I wish I could have been there to challenge Jurvetson on some of his points–and that I could refer to a transcript to rebut them. I cover some of this on my blog:

    http://thenoisychannel.com/2009/04/05/google-already-knows-what-youre-thinking/

    I also think that readers interested in crowd-sourcing relevance should take a look at the just-launched Topsy, which I blogged about today:

    http://thenoisychannel.com/2009/05/27/topsy-tippling-the-stream-of-conversations/

  2. Paul T. Jackson
    May 27th, 2009 at 9:59 pm

    I haven’t seen Daniel’s blog, but one of the problems with citation indexing (the ranking of articles done by ISI) is that the numbers sometimes reflect access, that is, how accessible an article is.
    The same would be true of such a collective index. And timing how long someone stays on a particular site only documents that they either found it interesting or couldn’t find the information they were looking for but kept trying. It doesn’t say the page was relevant to the person’s original search.

  3. Edwin Stauthamer
    May 28th, 2009 at 2:57 pm

    I like the idea but I am confused about the fact that they are calling this “federated search”.

    In the enterprise search world we define “Federated search” as the distribution of a search action over two or more search environments. The “distributed” engines deliver results and those results are aggregated by the centralized search engine and presented to the user.

    I think it would be more appropriate to name the mentioned method of indexing “distributed indexing” or “federated indexing”.

  4. Webhamer Weblog: Search & ICT-related blogging » links for 2009-05-28
    May 28th, 2009 at 3:06 pm

    […] A new paradigm for federated search » Federated Search Blog “The triumph of the distributed Web.” He said the aggregate power of distributed human activity will trump centralized control. His main point was that Google, and other search engines that analyze the Web and links, are much less useful than a (theoretical) search engine that knows not what people have linked to (as Google does), but rather what pages are open on people’s browsers at the moment that people are searching. “All the problems of search would be solved if search relevance was ranked by what browsers were displaying,” he said. (tags: search, searchtrends) […]

  5. Avi Rappoport, SearchTools
    May 28th, 2009 at 6:09 pm

    This actually makes a lot of sense to me; it’s a dynamic version of robots following links. However, the privacy issues can’t just be hand-waved away. Some kind of anonymizing proxy or something would have to be set up. I’m going to keep thinking about it, will you keep posting?

    Avi

  6. Gregg Boethin
    September 19th, 2009 at 9:10 pm

    I strongly agree that the natural human element needs to be included in future search engine algorithms, to ensure more accurate relevancy and popularity ratings. The problem is that, just like with link popularity, this is something that can be manipulated, and would be manipulated by those wishing to improve their search engine rankings.

    Even if this weren’t a foreseeable issue, this is far from the end-all solution to search.
