4 Jan

This is the third part in a series of articles that explore how federated search engines (FSEs), especially those that search the deep web, process the results returned by the search engines they query. Part I looked at screen scraping result data from search engines that only provide HTML intended for human consumption. Part II looked at the more pleasant situation of processing the XML that a growing number of search engines return. This article looks at the emerging OpenSearch standard and how FSEs can benefit from it.

Wikipedia summarizes OpenSearch pretty well:

OpenSearch is a collection of technologies that allow publishing of search results in a format suitable for syndication and aggregation. It is a way for websites and search engines to publish search results in a standard and accessible format. OpenSearch was developed by Amazon.com subsidiary A9 and the first version, OpenSearch 1.0, was unveiled by Jeff Bezos at the Web 2.0 in March, 2005. Draft versions of OpenSearch 1.1 were released during September and December 2005. The OpenSearch specification is licensed by A9 under the Creative Commons Attribution-ShareAlike 2.5 License.

The “format suitable for syndication and aggregation” mentioned above refers to two standards, RSS 2.0 and Atom 1.0, both of which present their data in XML.
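To make that concrete, here is a minimal sketch of how a program might read an OpenSearch response delivered as RSS 2.0. The sample feed, the `example.com` URLs, and the function name `parse_opensearch_rss` are made up for illustration; the `totalResults`, `startIndex`, and `itemsPerPage` elements and their namespace come from the OpenSearch 1.1 drafts. A real FSE connector would also need to handle Atom responses and malformed feeds.

```python
import xml.etree.ElementTree as ET

# Namespace used by the OpenSearch 1.1 response elements.
OPENSEARCH_NS = {"os": "http://a9.com/-/spec/opensearch/1.1/"}

# Hypothetical response: an RSS 2.0 feed extended with OpenSearch elements.
SAMPLE_RSS = """<rss version="2.0"
     xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <title>Example search: deep web</title>
    <opensearch:totalResults>2</opensearch:totalResults>
    <opensearch:startIndex>1</opensearch:startIndex>
    <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
    <item>
      <title>First result</title>
      <link>http://example.com/1</link>
      <description>Snippet for the first result</description>
    </item>
    <item>
      <title>Second result</title>
      <link>http://example.com/2</link>
      <description>Snippet for the second result</description>
    </item>
  </channel>
</rss>"""


def parse_opensearch_rss(xml_text):
    """Return paging metadata and result items from an OpenSearch RSS feed."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")

    def paging_value(name):
        # The OpenSearch elements are optional, so fall back to 0.
        text = channel.findtext("os:" + name, default="0",
                                namespaces=OPENSEARCH_NS)
        return int(text)

    items = [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in channel.findall("item")
    ]
    return {
        "total": paging_value("totalResults"),
        "start": paging_value("startIndex"),
        "per_page": paging_value("itemsPerPage"),
        "items": items,
    }


if __name__ == "__main__":
    parsed = parse_opensearch_rss(SAMPLE_RSS)
    print(parsed["total"], "results")
    for hit in parsed["items"]:
        print(hit["title"], hit["link"])
```

Because every source that speaks OpenSearch uses the same element names, one parser along these lines could in principle serve many sources, instead of one hand-built connector per source.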


30
Dec

Part I of this series on content access basics explained how many federated search engines (FSEs) performing deep web searches use screen scraping to process search results, and described the problems associated with that approach. This article introduces how FSEs process XML-formatted search results.

FSE vendors use jargon such as “XML gateway” or “XML interface” to mean that they can interact with a particular content source using XML. The FSE may generate and submit its query as XML, the remote search engine may return its search results as an XML document, or both. In this article we are going to focus on the processing of XML results.

So, what is XML? Wikipedia has a nice introduction to XML plus a few examples. Here’s a nice simple tutorial on XML. The important idea about XML is that there is no ambiguity about where to find information: each value is wrapped in a named element or attribute. XML is highly structured and intended for consumption by computer programs.
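As a rough illustration of that lack of ambiguity, the sketch below pulls fields out of a small XML result document by name. The document layout (`results`, `record`, `title`, `url`, `author`) and the function `extract_records` are invented for this example; every real source defines its own element names, so treat this as a sketch rather than any particular engine's format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML result document; real sources use their own schemas.
SAMPLE_RESULTS = """<results query="deep web" total="2">
  <record rank="1">
    <title>Searching the deep web</title>
    <url>http://example.com/a</url>
    <author>J. Smith</author>
  </record>
  <record rank="2">
    <title>Federated search basics</title>
    <url>http://example.com/b</url>
  </record>
</results>"""


def extract_records(xml_text):
    """Return one dict per <record>, looking each field up by element name."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall("record"):
        records.append({
            "rank": int(record.get("rank")),
            "title": record.findtext("title", default=""),
            "url": record.findtext("url", default=""),
            # findtext returns None when the element is absent.
            "author": record.findtext("author"),
        })
    return records


if __name__ == "__main__":
    for rec in extract_records(SAMPLE_RESULTS):
        print(rec["rank"], rec["title"], rec["url"], rec["author"])
```

Contrast this with screen scraping: the code never has to guess where a title ends or whether a missing author shifted the other fields, because each field is either present under its own name or absent.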
