This is the third part in a series of articles that explore how federated search engines (FSEs), especially those that search the deep web, process search results from search engines. Part I looked at screen scraping of search result data from search engines that only provide HTML intended for human consumption. Part II looked at the more pleasant situation of processing XML that a growing number of search engines are returning. This article looks at the emerging OpenSearch standard and how FSEs can benefit from it.
Wikipedia summarizes OpenSearch pretty well:
OpenSearch is a collection of technologies that allow publishing of search results in a format suitable for syndication and aggregation. It is a way for websites and search engines to publish search results in a standard and accessible format. OpenSearch was developed by Amazon.com subsidiary A9 and the first version, OpenSearch 1.0, was unveiled by Jeff Bezos at the Web 2.0 in March, 2005. Draft versions of OpenSearch 1.1 were released during September and December 2005. The OpenSearch specification is licensed by A9 under the Creative Commons Attribution-ShareAlike 2.5 License.
The “format suitable for syndication and aggregation” mentioned above refers to two standards, RSS 2.0 and Atom 1.0, both of which present their data in XML.
The official OpenSearch website has more in-depth information, including the documentation and specification of the standard.
How does OpenSearch relate to federated search? For one thing, OpenSearch is built on XML, and as we discussed in Part II of this series, XML documents contain highly structured search result data that FSEs find easy to process. The OpenSearch standard also gives a deep web search engine a mechanism for telling clients how to query it. If an OpenSearch-compliant search engine provides this information, creating a search interface to that source is a straightforward task for the federated search connector builder. If the search engine also supports REST-style queries (the search string and other relevant parameters are passed as GET parameters in the URL), building a connector is even easier.
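To make the query mechanism concrete: OpenSearch 1.1 describes a query as a URL template with placeholders such as {searchTerms}, where a trailing ? marks a placeholder as optional. The following is a minimal sketch of how a connector might fill in such a template. The example.com template and the build_search_url helper are assumptions made up for illustration; they are not part of any real source or library.

```python
import re
from urllib.parse import quote

# Matches OpenSearch 1.1 template parameters such as {searchTerms} or {startPage?}.
_PARAM = re.compile(r"\{([A-Za-z:]+)(\??)\}")


def build_search_url(template, values):
    """Fill an OpenSearch URL template (hypothetical helper).

    Required parameters must be present in `values`; optional ones
    (written with a trailing '?') are replaced by an empty string.
    """
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            # Percent-encode the value so it is safe inside a query string.
            return quote(str(values[name]), safe="")
        if optional:
            return ""
        raise KeyError(f"missing required template parameter: {name}")

    return _PARAM.sub(substitute, template)


# Hypothetical template; example.com is a placeholder, not a real OpenSearch source.
template = "http://example.com/search?q={searchTerms}&pw={startPage?}&format=rss"
url = build_search_url(template, {"searchTerms": "deep web"})
print(url)  # http://example.com/search?q=deep%20web&pw=&format=rss
```

Because the whole query is an ordinary GET URL, the connector only needs to fetch it and parse the XML that comes back.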
OpenSearch provides much more value than generic XML. It standardizes how to search a content source, i.e., what parameters to send. Results are also returned in XML documents with standard field names, so the federated search engine does not have to work these out. When a search engine returns generic XML, someone has to determine which fields come back with the search results, what they are called, and what they mean. OpenSearch specifies all of this. OpenSearch also provides a “description document” mechanism to describe a particular search engine. This is somewhat analogous to the WSDL standard in the Web Services model.
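As a rough illustration of both ideas, the sketch below reads a description document and an RSS 2.0 response using Python's standard library. The element names (ShortName, Url, totalResults, startIndex, itemsPerPage) and the namespace come from the OpenSearch 1.1 draft; the sample documents, the example.com URLs, and the function names describe_source and parse_results are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by OpenSearch 1.1 description documents and response elements.
NS = "http://a9.com/-/spec/opensearch/1.1/"

# Hypothetical description document for a made-up source.
DESCRIPTION = """<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example Search</ShortName>
  <Description>Search example.com content.</Description>
  <Url type="application/rss+xml"
       template="http://example.com/search?q={searchTerms}&amp;format=rss"/>
</OpenSearchDescription>"""

# Hypothetical RSS 2.0 response carrying the OpenSearch paging elements.
RESPONSE = """<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <title>Example Search: deep web</title>
    <link>http://example.com/search?q=deep%20web</link>
    <description>Search results for "deep web"</description>
    <opensearch:totalResults>2</opensearch:totalResults>
    <opensearch:startIndex>1</opensearch:startIndex>
    <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
    <item>
      <title>First result</title>
      <link>http://example.com/doc/1</link>
      <description>Summary of the first result.</description>
    </item>
    <item>
      <title>Second result</title>
      <link>http://example.com/doc/2</link>
      <description>Summary of the second result.</description>
    </item>
  </channel>
</rss>"""


def describe_source(description_xml):
    """Return the source's short name and its URL templates keyed by MIME type."""
    root = ET.fromstring(description_xml)
    return {
        "name": root.findtext(f"{{{NS}}}ShortName"),
        "templates": {u.get("type"): u.get("template")
                      for u in root.findall(f"{{{NS}}}Url")},
    }


def parse_results(rss_xml):
    """Extract paging metadata and result items from an OpenSearch RSS response."""
    channel = ET.fromstring(rss_xml).find("channel")

    def os_int(name):
        # OpenSearch elements are namespaced; plain RSS elements are not.
        text = channel.findtext(f"{{{NS}}}{name}")
        return int(text) if text is not None else None

    items = [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "summary": item.findtext("description"),
        }
        for item in channel.findall("item")
    ]
    return {
        "total_results": os_int("totalResults"),
        "start_index": os_int("startIndex"),
        "items_per_page": os_int("itemsPerPage"),
        "items": items,
    }


source = describe_source(DESCRIPTION)
results = parse_results(RESPONSE)
print(source["name"], results["total_results"], len(results["items"]))
```

The point of the sketch is that nothing in parse_results is specific to one source: any engine that returns OpenSearch RSS can be read with the same field names, which is exactly the work a generic-XML connector has to redo for every source.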
So, how prevalent is OpenSearch? A standard is not practical if it hasn’t been widely adopted. Wikipedia provides a list of search engines and software that support OpenSearch. This includes client-side software and browsers as well as a number of search engines. The Wikipedia list contains approximately 40 search engines, with A9 and Wikipedia themselves as noteworthy entries. As of this writing, the A9 search engine reports that it federates 617 sources, although it’s not clear whether all of those are OpenSearch sources. It is worth noting that Firefox 2.0 and Internet Explorer 7 support OpenSearch, as does Microsoft Search Server. There are also a number of applications that produce OpenSearch results. So, it appears that OpenSearch is being adopted widely enough to warrant wider support among federated search engine vendors.
In summary, OpenSearch is gaining traction as a non-proprietary standard, licensed under Creative Commons, that is supported by a growing number of server and client implementations, including popular browsers. A growing number of programmatic interfaces make the standard usable from a wide range of programs. It’s definitely a standard to pay attention to.
[ Update 1/16/08: Part IV of the series is available. It is about SRU/SRW/Z39.50. ]