Today I wrap up my interview with Erik Selberg, which began with a preview here. Erik answers questions about federated search, about his work at Microsoft and Amazon, and a couple of other questions.
16. What do you think are the major challenges of federated search today?
It’s not clear to me now that there are lots of interesting engineering problems in today’s federated search. I’m sure there are some, but they don’t strike me as the big issues. Rather, what I do see are some of the age-old issues of federated search:
- Resource Discovery. How are search engines discovered and added into a federated search engine? Is there a standard, such as Sherlock from Apple or the Mycroft-style plugins for Mozilla, that can be used to integrate engines appropriately?
- Customer Acquisition. OK, so you have a nice federated search service. How do you get users to use it versus the primary search engines?
- Business Model. Web meta-search, like MetaCrawler, works nicely in that they’re just serving up ads from the primary search engines and doing a rev-share. But fundamentally, the idea behind federated search is that a customer goes to one web site and gets results back from others. The big question here is what the business model is for the primary engines powering the federated search engine. What I have seen over the years is that often there isn’t one, but that’s OK because the driver is more often government or research grants than business (e.g. linking a number of library search systems).
17. What do you think will be different about federated search in ten years?
I think it will become more specialized. The core search engines, such as Google, are busy writing wrappers and incorporating popular deep-web queries into their own search results, such as weather, Wikipedia, sports scores, and so on. They’re also trying to federate queries into their own index of heterogeneous data, meaning news, images, and such, and surface that content on the main results. Given their traffic, I think that will dominate most of the main Web-facing space, leaving federated search to focus on niche areas that require more specialty: health, government, local, and legal are the four that come immediately to mind.
18. University of Washington has a web page about MetaCrawler and two applications based on MetaCrawler: HuskySearch and Grouper. The page explains that HuskySearch was developed to study query refinement and Grouper implemented clustering. What became of those two applications?
HuskySearch was my local version of MetaCrawler that I used to continue to do research. It lasted a few years after I graduated, and then we finally had to shut it down. Grouper was research done by Oren Zamir, now at Google, on top of MetaCrawler to explore clustering using the snippets of web pages returned by search engines. It was great at clustering; the problem we discovered, like others who use clustering, was that as far as users are concerned, clusters aren’t a compelling interface.
19. According to the article about you in Wikipedia, you’re a senior manager now at Amazon.com. What kinds of projects are you involved in?
My main project at Amazon is a specialized search project. One of the key differentiators at Amazon over other e-commerce sites like eBay is that Amazon presents a single page about a given item, whereas eBay highlights a number of pages for the same item, depending on who is selling it. This way, all the item information, such as customer reviews, images, and so on, can be found at one location, and our customer can make a good, informed choice. As Amazon now allows third-party merchants to list items on Amazon, similar to a mall setting, we need to match those third-party items with items in our catalog to ensure there’s only one page. So, the problem is simply taking these merchant listings and finding them in our catalog.
Sounds easy? Well, the problem is that many items are pretty close (e.g. the 4GB iPod vs the 8GB iPod), we often don’t receive a unique key (e.g. the UPC), and the price of a mistake is high: either a false positive is created, and a customer may receive the wrong thing, or a duplicate is created, and thus one page won’t have the reviews and other appropriate data. So, it equates to search with 100% precision and 100% recall, which is quite the challenge, and one I enjoy.
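To make the trade-off concrete, here is a minimal sketch of the matching problem as described above. It is not Amazon's method: the catalog records, the field names, the token-overlap (Jaccard) score, and the `threshold` parameter are all assumptions chosen for illustration. The sketch matches on UPC when one is supplied and falls back to title similarity otherwise.

```python
from typing import Dict, List, Optional

# Hypothetical catalog records; ids, UPCs, and titles are made up for illustration.
CATALOG: List[Dict[str, Optional[str]]] = [
    {"id": "item-4gb", "upc": "000000000001", "title": "Apple iPod nano 4GB Silver"},
    {"id": "item-8gb", "upc": "000000000002", "title": "Apple iPod nano 8GB Silver"},
]


def tokens(title: str) -> set:
    """Lowercase a title and split it into a set of whitespace-separated tokens."""
    return set(title.lower().split())


def jaccard(a: set, b: set) -> float:
    """Token overlap: size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_listing(listing, catalog, threshold: float = 0.8) -> Optional[str]:
    """Return the id of the matching catalog item, or None if no match is safe."""
    # A unique key, when present and known, settles the question directly.
    upc = listing.get("upc")
    if upc:
        for item in catalog:
            if item["upc"] == upc:
                return item["id"]

    # Otherwise fall back to title similarity (an assumed heuristic).
    listing_tokens = tokens(listing["title"])
    best_id, best_score = None, 0.0
    for item in catalog:
        score = jaccard(listing_tokens, tokens(item["title"]))
        if score > best_score:  # ties keep the earlier catalog item
            best_id, best_score = item["id"], score

    # Below the threshold, refuse to match: this risks a duplicate page
    # but avoids sending a customer the wrong item.
    return best_id if best_score >= threshold else None
```

With a listing titled "iPod nano Silver" and no UPC, both catalog items score the same, so a strict threshold returns no match (risking a duplicate page) while a loose one picks the 4GB item arbitrarily (risking a false positive). That is the tension between precision and recall the answer describes.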
20. Before Amazon.com, Wikipedia says you were a senior software developer at Microsoft working on Live Search. What was your contribution to search there?
I was the Program Manager for Relevance for a year and some on MSN / Live Search, and helped drive the initial relevance work. A good portion of this was putting together the people between the product team and research, the end result being RankNet, the neural net ranker that Live Search uses for relevance. I then joined SearchLabs and worked on some personalization aspects of search. These haven’t shipped yet, so I can’t quite say more on that.
21. Does the current search engine at MetaCrawler.com bear any resemblance to your original MetaCrawler?
Surprisingly enough, yes. Fundamentally, the interface is still the same — a customer enters a query, MetaCrawler issues it to the biggest and best search engines, and it returns collated results. The user experience is nicer, the speed is faster, and there are more features, but fundamentally, it’s the same general application I created over a decade ago — which does provide me with some nice validation.
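The flow described above (enter a query, issue it to several engines, return collated results) can be sketched in a few lines. This is only an illustration under stated assumptions: the two engine functions are hypothetical stubs returning made-up URLs, and the reciprocal-rank scoring used to merge the lists is an assumed technique, not MetaCrawler's actual collation algorithm.

```python
from collections import defaultdict
from typing import Callable, List

# An "engine" here is any function that maps a query to a ranked list of URLs.
Engine = Callable[[str], List[str]]


def engine_a(query: str) -> List[str]:
    """Hypothetical stub standing in for a real search engine."""
    return ["http://a.example/1", "http://b.example/2", "http://c.example/3"]


def engine_b(query: str) -> List[str]:
    """Second hypothetical stub with a partially overlapping result list."""
    return ["http://b.example/2", "http://d.example/4"]


def collate(query: str, engines: List[Engine]) -> List[str]:
    """Issue the query to every engine and merge the ranked lists.

    Each URL earns 1/rank from every engine that returns it, so a page
    ranked highly by several engines rises to the top, and duplicates
    collapse into a single entry.
    """
    scores = defaultdict(float)
    for engine in engines:  # sequential for simplicity; a real service would query concurrently
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank
    # Sort by descending score, breaking ties by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


results = collate("federated search", [engine_a, engine_b])
```

In this sketch the URL returned by both stubs ends up first in `results`, since it accumulates score from each engine.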
I hope you’ve enjoyed this interview. Others are coming so please watch these pages.