Yesterday morning a blog article caught my attention. Giv Paraneh wrote a post titled "It's not always about the technology." Here's the part that got me thinking. The emphasis, in bold, is mine:
I was initially hired to work on a federated search tool that would eventually be used by two online learning applications. The search had to query the collections databases of all 9 museums and return the results in a single aggregated list. Having worked on similar applications in the past I estimated no more than 2-3 weeks to complete this tool. How hard is it to pull in 9 collections and stick them in a database?
This statement from the post also got my attention:
These days any average developer can put together a federated search application. There are loads of open source tools, frameworks, databases and scrapers that will let you do this quickly. And if the search is not enough, you can take advantage of a dozen web services and APIs to pretty-up your results.
As someone who has been around federated search for the past six years, I have to wonder what kind of federated search system one could build in three weeks. I'll be the first to admit that my experience is heavily biased toward blog sponsor Deep Web Technologies' framework and that I've not done any federated search development work myself. But I know what a number of the obstacles are, and I seriously doubt that anyone could build more than a very basic system so quickly. I realize that Giv's federated search system was built to meet a very specialized requirement and that the sources were known in advance. In other words, Giv was not building a general-purpose federated search application.
In federated search the devil is in the details, especially when you’re building an engine that deals with numerous special cases. Here are a few of the places where the devil lives.
- Connectors. They can be very difficult to build, especially if a source needs to be screen scraped. The proliferation of AJAX makes screen scraping even harder. Also, connectors need to be monitored for problems and updated periodically.
- Authentication and session management. A significant part of connector building is connecting to the sources. This can involve dealing with sessions, cookies, multi-phase logins, and various authentication mechanisms. This stuff ain’t trivial.
- Sophisticated fielded search. Federated search is only as good as its ability to work with complex search forms. If a connector performs only basic searches against sources that provide advanced (fielded) search, the relevance of the results will be much poorer than it could be.
- Relevance ranking. Not all federated search systems do relevance ranking. That’s because it’s not easy to do well, especially when many sources rank poorly and when only a small amount of metadata is available to the federated search engine.
- De-duplication. Another difficult problem: the same item can come back from several sources with slightly different titles or metadata, so exact matching isn't enough.
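To give a feel for what even the "easy" part looks like, here's a minimal sketch of the merge-and-rank step in Python. Everything in it is hypothetical: the `Connector` class, the title-overlap scoring, and the normalized-title de-duplication key are illustrative stand-ins, not any real product's API, and a real connector would spend most of its code on the sessions, logins, and screen scraping described above.

```python
# A toy federated search loop: query several sources in parallel,
# merge the results, de-duplicate, and re-rank. All names here are
# made up for illustration.
import re
from concurrent.futures import ThreadPoolExecutor


class Connector:
    """Stand-in for a real connector, which would handle cookies,
    authentication, and scraping against a live source."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # fake collection: a list of titles

    def search(self, query):
        # Naive scoring: count query terms that appear in the title.
        terms = query.lower().split()
        hits = []
        for title in self.records:
            score = sum(t in title.lower() for t in terms)
            if score:
                hits.append({"source": self.name, "title": title, "score": score})
        return hits


def normalize(title):
    # Crude de-duplication key: lowercase, strip everything but
    # letters and digits. Real systems need fuzzier matching.
    return re.sub(r"[^a-z0-9]", "", title.lower())


def federated_search(connectors, query):
    # Each source has its own latency, so query them concurrently.
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda c: c.search(query), connectors))
    # Merge, keeping the best-scoring copy of each duplicate.
    best = {}
    for hit in (h for hits in result_lists for h in hits):
        key = normalize(hit["title"])
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    # Rank by our own score rather than trusting the sources' order.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


sources = [
    Connector("MuseumA", ["Ancient Pottery", "Greek Vases"]),
    Connector("MuseumB", ["Ancient  Pottery!", "Roman Coins"]),
]
results = federated_search(sources, "ancient pottery")
```

Even this toy version hints at the hard parts: the scoring only works because every fake source returns comparable metadata, and the de-duplication key collapses `"Ancient Pottery"` and `"Ancient  Pottery!"` only because the variation is trivial punctuation.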
If you’re new to federated search, I recommend “What determines quality of search results.” You’ll get a good feel for where the devil lives.
One other thing that caught my interest about Giv’s article was this:
How hard is it to pull in 9 collections and stick them in a database?
This confused me. Is Giv doing federated search or is he crawling and indexing content? Or some hybrid method? What database is Giv talking about? Hopefully he will comment on this article and clarify what he’s doing because, unless the connectors are well behaved and trivial to build and the functionality of the federated search system is very minimal, I don’t get how he built it so quickly.
Tags: federated search