Part I of this series on content access basics explained how many federated search engines (FSEs) use screen scraping to process the results of deep web searches, and described the problems associated with that approach. This article introduces how FSEs process XML-formatted search results.
FSEs use jargon such as “XML gateway” or “XML interface” to refer to the fact that they have a way of interacting with a particular content source using XML. It may be that the FSE generates XML and submits an XML query or that search results are generated by the remote search engine and returned as an XML document. In this article we are going to focus on the processing of XML results.
So, what is XML? Wikipedia has a nice introduction to XML plus a few examples. Here’s a nice simple tutorial on XML. The important idea about XML is that there is no ambiguity about where to find information. XML is intended for consumption by computer programs. It is very highly structured.
If you read the first part of this series, the article on screen scraping, you saw that a number of content access problems arise when the FSE cannot easily identify where in the HTML result page a document’s summary information starts, where it ends, which field is the title, which is the author, which fields are missing for specific records, and so on. These problems all disappear when result data is provided in XML format.
A search engine that returns XML will typically return summary information for the user’s query along with well-structured fielded data. The summary section of the XML document might look like this:
This tells the FSE that the content provider’s search engine found 12,350 results matching the query, that it is returning 10 of those records to the FSE, and that the returned records begin with record 0, the first record. If the FSE wants another page of records from this source, it can request 10 more by providing a startIndex value of 10.
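To make the paging arithmetic concrete, here is a minimal Python sketch of how an FSE might read such a summary section and work out the next startIndex to request. Only startIndex is named in this article; the searchResults, totalResults, and itemsReturned element names are assumptions made up for illustration, and a real source will use its own names.

```python
import xml.etree.ElementTree as ET

# Hypothetical summary section. Only "startIndex" comes from the article;
# the other element names are assumptions for illustration.
SUMMARY_XML = """
<searchResults>
  <totalResults>12350</totalResults>
  <itemsReturned>10</itemsReturned>
  <startIndex>0</startIndex>
</searchResults>
"""


def parse_summary(xml_text):
    """Extract the three summary counters as integers."""
    root = ET.fromstring(xml_text)
    return {
        "total": int(root.findtext("totalResults")),
        "returned": int(root.findtext("itemsReturned")),
        "start": int(root.findtext("startIndex")),
    }


def next_start_index(summary):
    """Return the startIndex for the next page, or None when no records remain."""
    nxt = summary["start"] + summary["returned"]
    return nxt if nxt < summary["total"] else None


summary = parse_summary(SUMMARY_XML)
print(next_start_index(summary))  # 10
```

The FSE never has to hunt for a "next page" link: the next request is computed directly from numbers the source has already supplied.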
Here is the XML data for one of the fictitious 10 records:
<title>Federated Search in the Corporate Environment</title>
<summary>This article summarizes trends in Federated Search in 2008</summary>
The one field that may not be obvious in this example is the metaDataUrl field. This is the URL of the metadata record, which typically holds the title, abstract, and other descriptive information about the document. Some search engines might return a full text URL field instead of, or in addition to, the metaDataUrl field. In fact, different search engines return different sets of fields. It is the job of the FSE to sort out the differences and present a unified display of results to the user.
Note that there are two author records in this example. Because each author has its own element, the FSE software does not need to pull apart two author names that are presented in one field.
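As a rough sketch of what this looks like on the FSE side, the Python below parses one record and maps it onto a unified structure. The title and summary text are taken from the example above; the record wrapper element, the placeholder author names, the example.com URL, and the fullTextUrl element name are all assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record. The wrapper element, author names, and URL are
# placeholders; only title, summary, author, and metaDataUrl are fields
# mentioned in the article.
RECORD_XML = """
<record>
  <title>Federated Search in the Corporate Environment</title>
  <author>Author One</author>
  <author>Author Two</author>
  <summary>This article summarizes trends in Federated Search in 2008</summary>
  <metaDataUrl>http://example.com/records/1</metaDataUrl>
</record>
"""


def normalize_record(xml_text):
    """Map one source record onto the FSE's unified result structure."""
    rec = ET.fromstring(xml_text)
    return {
        "title": rec.findtext("title"),
        # Each author is its own element, so no string splitting is needed.
        "authors": [a.text for a in rec.findall("author")],
        "summary": rec.findtext("summary"),
        "metadata_url": rec.findtext("metaDataUrl"),
        # "fullTextUrl" is an assumed name; findtext returns None when the
        # source does not supply the element.
        "fulltext_url": rec.findtext("fullTextUrl"),
    }


record = normalize_record(RECORD_XML)
print(record["authors"])
```

A field that a source does not supply simply comes back as None, which the FSE can handle when it builds the unified display.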
Notice how all of the screen scraping problems don’t exist with XML results:
- There is no extraneous data (e.g. headers, footers, advertisements).
- Missing fields are not the problem they are with the screen scraping approach: an absent element cannot be mistaken for a different field.
- There is no ambiguity about which is the title or any other fields.
- There is no complex page navigation. The FSE simply requests a new set of results by providing the XML interface with a starting record number and a desired record count.
- There is no concern about the HTML changing with the number of result records, because there is no HTML.
- It is very unlikely that a search engine would produce syntactically incorrect XML, because XML parsers are strict about the structure they expect and cannot process bad XML. XML problems therefore tend to be fixed quickly. Problems with HTML, by contrast, often go unfixed because major browsers compensate for bad HTML, leaving the problem unexposed and uncorrected.
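The strictness mentioned in the last point is easy to see in practice. The following sketch uses Python's standard xml.etree.ElementTree parser; the sample documents and the try_parse helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative inputs: the second one never closes its <title> element.
GOOD_XML = "<record><title>Federated Search</title></record>"
BAD_XML = "<record><title>Federated Search</record>"


def try_parse(xml_text):
    """Return the parsed root element, or None if the XML is not well formed."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        # A strict parser refuses the whole document instead of guessing.
        print("rejected:", err)
        return None


print(try_parse(GOOD_XML) is not None)  # True
print(try_parse(BAD_XML) is None)       # True
```

A browser given the equivalent broken HTML would usually render something anyway, which is why such errors tend to linger in HTML result pages.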
Compare the pluses and minuses of XML and screen scraping and you’ll see that XML is always more desirable to the FSE. Fortunately, publishers are moving in the direction of providing more of their content in XML format.
In future articles in this series we’ll look at other content access methods: SRU/SRW, OpenSearch, and Z39.50.