Yesterday morning I read a blog article that got my attention. Giv Paraneh wrote a post: It’s not always about the technology. Here’s the part that caught my eye. The emphasis, in bold, is mine:
I was initially hired to work on a federated search tool that would eventually be used by two online learning applications. The search had to query the collections databases of all 9 museums and return the results in a single aggregated list. Having worked on similar applications in the past **I estimated no more than 2-3 weeks to complete this tool**. How hard is it to pull in 9 collections and stick them in a database?
This statement from the post also got my attention:
These days any average developer can put together a federated search application. There are loads of open source tools, frameworks, databases and scrapers that will let you do this quickly. And if the search is not enough, you can take advantage of a dozen web services and APIs to pretty-up your results.
As someone who has been around federated search for the past six years, I have to wonder what kind of federated search system one could build in three weeks. I’ll be the first to admit that my experience is heavily biased toward blog sponsor Deep Web Technologies’ framework and that I’ve not done any federated search development work myself. But I know what a number of the obstacles are, and I seriously doubt that anyone could build more than a very basic system so quickly. I realize that Giv’s federated search system was built to meet a very specialized requirement and that the sources were known in advance. In other words, Giv was not building a general-purpose federated search application.
In federated search the devil is in the details, especially when you’re building an engine that deals with numerous special cases. Here are a few of the places where the devil lives.
- Connectors. They can be very difficult to build, especially if a source needs to be screen scraped. The proliferation of AJAX makes screen scraping even harder. Also, connectors need to be monitored for problems and updated periodically. (A bare-bones connector sketch follows this list.)
- Authentication and session management. A significant part of connector building is connecting to the sources. This can involve dealing with sessions, cookies, multi-phase logins, and various authentication mechanisms. This stuff ain’t trivial. (The connector sketch below shows the session and login handling.)
- Sophisticated fielded search. Federated search is only as good as its ability to work with complex search forms. If a connector only does basic searches against sources that provide advanced (fielded) searches, then the relevance of the results is going to be much poorer than desirable. (See the field-mapping sketch below.)
- Relevance ranking. Not all federated search systems do relevance ranking. That’s because it’s not easy to do well, especially when many sources rank poorly and when only a small amount of metadata is available to the federated search engine. (A crude re-ranking sketch appears below.)
- De-duplication. Another difficult problem. (A fuzzy title-matching sketch rounds out the examples below.)
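To make the connector and authentication bullets concrete, here’s a rough sketch in Python of about the simplest screen-scraping connector I can imagine. Every URL, form field, and CSS selector in it is made up; the point is that even a toy connector has to carry a session, log in, and parse markup that can change underneath it without warning.

```python
# A bare-bones screen-scraping connector. The URLs, form field names,
# and HTML classes are all hypothetical; a real source would need its
# own login flow and its own parsing rules.
import requests
from bs4 import BeautifulSoup

def search_museum(query, username, password):
    session = requests.Session()  # carries cookies across requests

    # Step 1: log in so the collection search is accessible.
    session.post(
        "https://museum.example.org/login",
        data={"user": username, "pass": password},
    )

    # Step 2: run the search and scrape results out of the HTML.
    page = session.get(
        "https://museum.example.org/search",
        params={"q": query},
    )
    soup = BeautifulSoup(page.text, "html.parser")

    results = []
    for item in soup.select("div.result"):  # assumed markup
        results.append({
            "title": item.select_one("h3").get_text(strip=True),
            "description": item.select_one("p.summary").get_text(strip=True),
            "source": "museum.example.org",
        })
    return results
```

Multiply that by nine sources, each with its own login quirks and markup, and the maintenance burden starts to become clear.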
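Fielded search is easier to see in code. A federated engine has to translate a single query into each source’s own field names. Here’s a sketch with invented field names for two imaginary sources:

```python
# Translating a generic fielded query into source-specific parameters.
# The field names on both sides are invented for illustration.
FIELD_MAPS = {
    "museum_a": {"title": "ti", "creator": "au", "keyword": "q"},
    "museum_b": {"title": "item_title", "creator": "maker", "keyword": "text"},
}

def build_params(source, fielded_query):
    """Map e.g. {'title': 'teapot', 'creator': 'Wedgwood'} onto one source's form."""
    field_map = FIELD_MAPS[source]
    params = {}
    for field, value in fielded_query.items():
        if field in field_map:
            params[field_map[field]] = value
        else:
            # Fall back to the source's free-text field rather than drop the term.
            key = field_map["keyword"]
            params[key] = (params.get(key, "") + " " + value).strip()
    return params

print(build_params("museum_b", {"title": "teapot", "creator": "Wedgwood"}))
# {'item_title': 'teapot', 'maker': 'Wedgwood'}
```

A connector that skips this mapping and shoves everything into the free-text box will return far noisier results than the source is capable of delivering.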
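And here’s roughly what relevance ranking looks like when the engine only gets back a title and a short description from each source. The scoring scheme below is a stand-in of my own invention, not anyone’s production algorithm; it mostly illustrates why thin metadata makes good ranking hard.

```python
# Re-ranking merged results when only a title and a short description
# are available: a crude term-overlap score.
def score(record, query_terms):
    text = (record["title"] + " " + record["description"]).lower()
    title = record["title"].lower()
    # Title hits count double, on the theory that titles are more telling.
    return sum(
        (2 if term in title else 1)
        for term in query_terms
        if term in text
    )

def merge_and_rank(result_lists, query):
    terms = query.lower().split()
    merged = [r for results in result_lists for r in results]
    return sorted(merged, key=lambda r: score(r, terms), reverse=True)
```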
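Finally, a sketch of de-duplication by fuzzy title matching, again just to show the shape of the problem. Real systems also compare authors, dates, and identifiers; titles alone will both miss duplicates and merge distinct records.

```python
# De-duplication by normalized, fuzzy title comparison.
import difflib
import re

def normalize(title):
    # Lowercase and strip punctuation so trivial differences don't block a match.
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def deduplicate(records, threshold=0.9):
    kept = []
    for record in records:
        title = normalize(record["title"])
        if any(
            difflib.SequenceMatcher(None, title, normalize(k["title"])).ratio() >= threshold
            for k in kept
        ):
            continue  # close enough to something we already kept
        kept.append(record)
    return kept
```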
If you’re new to federated search, I recommend “What determines quality of search results.” You’ll get a good feel for where the devil lives.
One other thing that caught my interest about Giv’s article was this:
How hard is it to pull in 9 collections and stick them in a database?
This confused me. Is Giv doing federated search, or is he crawling and indexing content? Or some hybrid of the two? What database is Giv talking about? Hopefully he will comment on this article and clarify what he’s doing because, unless the connectors are well behaved and trivial to build and the functionality of the federated search system is very minimal, I don’t see how he built it so quickly.
Tags: federated search
3 Responses so far to "Do-it-yourself federated search in 2-3 weeks?"
Stephan
February 16th, 2009 at 8:03 am
These kinds of statements sound very familiar to me. Usually, the people behind them confuse a prototype or proof of concept with a production-ready software system that is scalable and maintainable.
Why would software companies spend years developing robust software frameworks if they could achieve the same results in a much shorter time?
Giv
February 18th, 2009 at 7:17 am
Thanks for the comments regarding my post. I guess what I left out of my post was that the requirements for this particular search project were extremely basic.
The search only grabs a title and short description for each record, collates them into a repository and then displays a ranked list. That’s all.
This is done in real time, so I guess you could call it a hybrid harvest/index search, which works well for this particular project.
In response to Stephan’s message I should add that the final product is in fact production-ready and it really did take me a couple of weeks to put together due to the simplicity of the project.
My main point in the post was that I spent way more time dealing with intellectual property issues than the technical ones.
Regards,
Giv
February 18th, 2009 at 8:47 am
Giv, thanks for responding and for elaborating on the scope of your project.