[ This is a continuation of Part I. ]
The Kosmix approach to federation relies heavily on APIs to structured data in different domains of specialization. APIs can be searched in real time to generate topic pages with very current information. Slides 28 through 32 give some excellent reasons why “trends favor the federated approach”:
- Social Media (content volume grows rapidly, access controls can prevent indexing, opportunities for personalization)
- Real-Time Information (the 2008 earthquake in China, the 2009 US Airways 1549 Hudson landing, the 2009 Iran elections)
- Specialized search engines (It’s a shame to take all this richness and compress it into 10 results links!)
- Innovative visualizations
- Business Model issues
- Algorithmic Content
- Availability of APIs
Rajaraman notes that Twitter can break news faster than Google can and that Google acknowledges this by including Twitter content in its search results.
The major challenge in building topic pages, of course, is determining which sources are the right ones for any given query. How does Kosmix know which sources to query if the user types “pumpkin pie” vs. “Toyota Prius”? Kosmix knows about thousands of sources and can’t search every source for every query. And, even if it could, how would it decide which sources to include in a topic page? It can’t include content from 1,000 sources.
Kosmix solves the source selection problem by building a complex taxonomy with millions of nodes and the relationships between them. The taxonomy took three years to build and combines machine algorithms with human curation. Interestingly, the taxonomy is refreshed daily because world events require updates that often. Data sources are associated with nodes in the taxonomy.
Kosmix has also created its own categorization service to associate a query with nodes in the taxonomy. In effect, Kosmix maps a query to a number of “near” sources that are relevant to it. Results from the near sources make up the panes of a topic page.
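To make the idea concrete, here is a minimal Python sketch of how source selection over a taxonomy might work. It is my own toy illustration, not Kosmix's implementation: the taxonomy, the source names, the exact-match `categorize` function, and the `max_hops` and `max_panes` parameters are all assumptions made up for the example. The sketch maps a query to a taxonomy node, walks outward breadth-first, and collects the sources attached to nearby nodes, nearest first.

```python
from collections import deque

# Hypothetical toy taxonomy: node -> neighbouring nodes (undirected edges).
TAXONOMY = {
    "food": ["desserts", "recipes"],
    "desserts": ["food", "pumpkin pie"],
    "recipes": ["food", "pumpkin pie"],
    "pumpkin pie": ["desserts", "recipes"],
    "autos": ["hybrid cars"],
    "hybrid cars": ["autos", "toyota prius"],
    "toyota prius": ["hybrid cars"],
}

# Hypothetical data sources attached to taxonomy nodes.
SOURCES = {
    "pumpkin pie": ["pie-recipes-api"],
    "recipes": ["recipe-search-api"],
    "desserts": ["dessert-photos-api"],
    "food": ["nutrition-facts-api"],
    "toyota prius": ["prius-owners-forum-api"],
    "hybrid cars": ["car-reviews-api"],
    "autos": ["dealer-listings-api"],
}


def categorize(query):
    """Stand-in categorizer: exact match of the normalized query against node names."""
    q = query.strip().lower()
    return [q] if q in TAXONOMY else []


def near_sources(query, max_hops=1, max_panes=3):
    """Breadth-first walk from the query's nodes, collecting sources nearest first."""
    start = categorize(query)
    seen = set(start)
    queue = deque((node, 0) for node in start)
    picked = []
    while queue and len(picked) < max_panes:
        node, hops = queue.popleft()
        # Sources on the current node become candidate panes.
        for source in SOURCES.get(node, []):
            if source not in picked and len(picked) < max_panes:
                picked.append(source)
        # Only expand to neighbours while within the hop limit.
        if hops < max_hops:
            for neighbour in TAXONOMY.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
    return picked


if __name__ == "__main__":
    print(near_sources("pumpkin pie"))
    print(near_sources("Toyota Prius", max_hops=2))
```

The point of the sketch is that only a handful of sources near the query's node ever get searched, so a “pumpkin pie” query never touches the car sources and vice versa. A real categorizer would of course need to handle queries that don't exactly match a node name.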
Rajaraman references two scholarly papers in his presentation. The first is Rajaraman’s “Kosmix: High-Performance Topic Exploration using the Deep Web,” available for a fee from ACM. The paper goes into more detail on the content of the slide show. Unfortunately, it doesn’t go into much detail about Kosmix’s hybrid approach (deep Web crawling plus federation), which is the information gap I was trying to fill by reading the paper. For those of you who don’t want to spend an hour listening to the presentation, reading the 5 1/4 page paper is an excellent alternative.
The second paper Rajaraman references is “Google’s Deep Web Crawl.” I could have sworn that I once found this free on the Web. It is available, like the first paper, for a fee from ACM. It may also be the case that ACM members can view both articles for free. The crawling paper goes into a moderate amount of detail as to how Google approaches the surfacing of deep Web content.
Kosmix is doing interesting stuff. I highly recommend listening to the whole presentation as Rajaraman makes some interesting points that are not obvious just by looking at the slides.