Kosmix is the search engine that produces “topic pages” on millions of subjects. Kosmix creates these topic pages by searching the APIs of deep web sources in real time. In other words, Kosmix relies heavily on federated search for the content of its topic pages. (Actually, Kosmix combines federated search with crawling technology. More about this later in this article.)
Kosmix co-founder Anand Rajaraman recently spoke at PARC, the prestigious Palo Alto Research Center. Rajaraman’s talk was titled “What lies beneath: harnessing the deep web.” A video of the hour-long talk is available at the PARC website. The slides are available at the Kosmix Blog.
Rajaraman has very impressive credentials. He is also co-founder of the VC firm Cambrian Ventures, he teaches a class for Stanford’s Computer Science department, and he is former Director of Technology at Amazon.com where:
he was responsible for technology strategy. Anand helped launch the transformation of Amazon.com from a retailer into a retail platform, enabling third-party retailers to sell on Amazon.com’s website. Third-party transactions now account for over 25% of all US transactions, and represent Amazon’s fastest-growing and most profitable business segment.
Rajaraman has his own Kosmix topic page.
Rajaraman begins his talk with a flashback to the summer of 1994, when he worked on searching early Internet directory services (WAIS, Gopher, and Archie) before there was much of a Web. He cites this as early work with federated search and muses on how Kosmix is doing federated search, “just like old times.”
Rajaraman explains how crawler-based search engines like Google and Bing access only a small fraction of the content of the Web. The challenge is to get at the deep Web content, which lies hidden in databases behind search forms. Google is making attempts to “surface” deep web content, i.e. to fill out search forms, extract the search results, and then index them for users to search. Google does this surfacing by trying to guess what combinations of values to use for web forms. Surfacing, when it works, is very nice because it doesn’t slow down searches the way federation does: those deep Web pages are already in the index. But, and this is a big “but,” predicting the right combination of search parameters is difficult, and it’s not practical to search all possible combinations. Rajaraman claims that Google has crawled roughly 3,000,000 deep Web sites, in 50 languages and in hundreds of subject domains, and that it performs 1,000 queries per second against the deep Web. Google’s approach to surfacing involves performing a search, looking at the text in the results, and using some of that text as keywords for the next deep Web search.
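To make the last step concrete, here is a minimal sketch of that kind of iterative probing: run a search, pull words out of the results, and use them as keywords for the next search. This is only an illustration of the idea as described in the talk, not Google’s actual implementation. The `RECORDS` data, the `search_form` function, and the `surface` function are all hypothetical stand-ins invented for this example.

```python
import re
from collections import deque

# Hypothetical stand-in for a database hidden behind a search form.
RECORDS = {
    1: "asthma inhaler dosage",
    2: "inhaler steroid safety",
    3: "steroid cream eczema",
    4: "migraine triggers caffeine",
}

def search_form(keyword):
    """Simulate submitting one keyword to the form and getting matches back."""
    return {doc_id: text for doc_id, text in RECORDS.items()
            if keyword in text.split()}

def surface(search, seed_keywords, max_queries=20):
    """Probe a form repeatedly, harvesting new keywords from each result."""
    queue = deque(seed_keywords)
    tried = set()   # keywords already submitted to the form
    index = {}      # everything surfaced so far, keyed by document id
    while queue and len(tried) < max_queries:
        keyword = queue.popleft()
        if keyword in tried:
            continue
        tried.add(keyword)
        for doc_id, text in search(keyword).items():
            index[doc_id] = text
            # Words seen in the results become candidate keywords.
            for word in re.findall(r"[a-z]+", text.lower()):
                if word not in tried:
                    queue.append(word)
    return index, tried

index, tried = surface(search_form, ["asthma"])
print(sorted(index))
```

Starting from the single seed keyword, the probe reaches records 1 through 3 but never finds record 4, because no surfaced result mentions any of its words. That is the weakness described above in miniature: coverage depends on guessing the right inputs, and a form with several fields multiplies the number of combinations to try.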
Kosmix recognizes that much web content is structured (i.e. it lives in the deep Web in databases, the data is fielded and often accessible through an API, and there may be metadata) and takes advantage of this structure. Kosmix doesn’t use the surfacing approach at all; it takes a federation approach. Unlike traditional federated search, however, Kosmix has to deal with documents of different media types, which means it must redefine relevance and rank. Depending on the topic area and specific source, relevant documents could be videos, tables, images, sound files, or other types. So, Kosmix incorporates documents of multiple types into its topic pages. The aim is to provide a discovery environment, or what Kosmix calls a 360 degree view, for users on a topic, not to determine which document is “best.”
Rajaraman noted in his presentation that Kosmix received strong validation for its topic page concept in building pages for RightHealth.com. Here are some impressive statistics from the slide show about RightHealth and Kosmix:
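As a rough illustration of the federation approach, the sketch below queries several sources at once and groups what comes back by media type instead of merging everything into one ranked list. It is not Kosmix’s code. The three source functions, the `SOURCES` list, and `build_topic_page` are hypothetical names, and each source returns canned data where a real system would call a source’s API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source adapters; real ones would call each source's API.
def video_source(topic):
    return [{"type": "video", "title": f"{topic} explained"}]

def image_source(topic):
    return [{"type": "image", "title": f"{topic} diagram"}]

def article_source(topic):
    return [
        {"type": "article", "title": f"{topic} overview"},
        {"type": "article", "title": f"Living with {topic}"},
    ]

SOURCES = [video_source, image_source, article_source]

def build_topic_page(topic, sources=SOURCES):
    """Query every source concurrently and group results by media type."""
    page = defaultdict(list)
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # pool.map yields each source's results in the order of `sources`.
        for results in pool.map(lambda source: source(topic), sources):
            for item in results:
                page[item["type"]].append(item["title"])
    return dict(page)

page = build_topic_page("asthma")
for media_type, titles in page.items():
    print(media_type, titles)
```

Because the queries run at request time, the page is only as fast as its slowest source, which is the speed cost of federation mentioned earlier. Grouping by type sidesteps the problem of ranking a video against a table, in keeping with the goal of discovery over picking a single “best” document.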
- RightHealth is the second most visited health site on the Web per Hitwise
- RightHealth gets 6.4 million monthly unique users per comScore
- Kosmix was launched in December of 2008, and has already grown to 4.1 million unique visitors
- Total unique visitors to Kosmix have increased 2,635.1% since August of 2008
- RightHealth gets 10 million unique visitors per month
- RightHealth gets 35 million searches per month
Stay tuned for Part II.
Tags: deep web, federated search
One Response to "Federated search powers Kosmix (Part I)"
March 25th, 2010 at 5:48 am
Really good post, HERE is a good article that adds some additional detail to the topic and a good set of links to the deep web search engines and other helpful sites. Those are amazing numbers for RightHealth, I’ll have to check it out.