In this multi-part series we will look at a number of different approaches that federated search engines (FSEs) take to access content from remote databases.
FSEs are always at the mercy of the content provider when it comes to searching and retrieving content. FSEs perform deep web searches since they access content that lives inside databases. Read the earlier articles on crawling vs. deep web searching and the introduction to the deep web for background information on deep web searching. Also, read the article about connectors to understand how the query processing and search engine submission process works for deep web searching.
When FSEs search deep web databases they often do so by filling out search forms much like humans do, and they process result lists (summaries of documents generated by the remote search engines) much as humans examine search results in their browsers. Processing a list of search results by reading and dissecting the HTML that a search engine returns is called “screen scraping.” Wikipedia has an article about screen scraping.
Screen scraping is the most difficult way to obtain search results because the result data is not structured in a way that makes it easy to identify the fields in the result records. Unfortunately, screen scraping is also the most prevalent approach to extracting field data, because the majority of electronically published content is meant to be consumed by humans, not by search engines. Fortunately, the use of structured result data is increasing.
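To make the discussion concrete, here is a minimal sketch, in Python, of the naive screen scraping approach, assuming a hypothetical source at example.com that wraps each result in a div with class “result”. Every URL, parameter, and selector here is invented for illustration; a real connector needs this information configured per source.

    import requests
    from bs4 import BeautifulSoup

    def scrape_results(query):
        # Submit the search form the way a browser would (a GET with a query parameter).
        response = requests.get("https://www.example.com/search", params={"q": query})
        soup = BeautifulSoup(response.text, "html.parser")

        records = []
        # Pull out only the result blocks, skipping headers, footers, and ads.
        for div in soup.select("div.result"):
            # Naive extraction: assumes every field is present in every record.
            records.append({
                "title": div.select_one("a.title").get_text(strip=True),
                "author": div.select_one("span.author").get_text(strip=True),
            })
        return records

The list of pitfalls below shows why this naive version breaks in practice.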
Let’s look at what happens when a human searches a deep web database to better understand how screen scraping works. First he enters his search terms into one or more search fields and submits the form. Then he examines the results that are returned by the form. The results are returned as HTML, which the browser renders, or draws, to look nice on the screen, displaying result text in different fonts and styles. To the user the results are nicely structured. One record after another is displayed. It’s obvious to him where a record starts and where it ends.
Software that has to read the HTML doesn’t have as easy a time as humans do, for a number of reasons:
- The screen scraping software has to determine which sections of the HTML document containing the search results it must ignore. Headers, footers, and advertisements are examples of parts of the HTML that need to be skipped. In other words, the screen scraper must isolate the records from everything surrounding them and determine where each record starts and ends.
- Some result fields may be missing in some of the result records. The screen scraper needs to be able to determine when fields are missing and keep processing the results without getting confused about which fields it is extracting (see the first sketch after this list).
- It is not obvious to the screen scraper which field is the title, which is the author, which is the publisher, and so on.
- Some FSEs retrieve multiple pages of results from a source. When result data is returned in XML or another structured format it is a straightforward process to request subsequent sets of results. Screen scraping software, however, must be configured for every possible next-page scenario (see the second sketch after this list). Sometimes there’s a “next” button to press, sometimes there’s an arrow button to go to the next page, sometimes there’s a page number with a link to click. Also, the page navigation elements may be anywhere on an HTML page.
- Data may be returned in an inconsistent format between records. A date may appear as “January 1, 2008” in one record and as “1/1/2008” in another. While this is not very common within results from one publisher, it becomes a major issue when the FSE aggregates results from multiple sources that return date, author, and other fields in different formats; the FSE must normalize the field data (convert it to one format) in order to sort on any of those fields.
- Authentication to access restricted content is often problematic for screen scrapers. When result data is returned in a structured format, the expectation is that a computer program, not a browser, will be processing the data and performing the authentication steps. Thus, there are usually fewer hoops to jump through when authenticating to retrieve structured result data, and the authentication steps are likely to be documented. In the screen scraping approach the FSE typically has to deal with session information, cookies, and perhaps IP-based authentication. The FSE connector developer has to manually reverse engineer the authentication steps (since there’s usually no documentation) and implement them on a per-source basis (see the third sketch after this list).
- The HTML may not be consistent from one record to another. In particular, the first or last record in a result list often has HTML tags around it that differ from the tags around the other records. And if a search returns only one result, the HTML tags around that result are often different from those in the multiple-result case.
- The HTML may not be correct. Small errors in HTML that browsers silently correct can trip up screen scrapers.
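The first sketch: continuing the hypothetical example.com source above, field extraction can be made defensive so that a missing author or publisher yields an empty value instead of derailing the scraper or shifting data into the wrong fields. The selectors are still assumptions.

    def extract_record(div):
        # Return the text of the first element matching the selector, or None
        # when the field is absent from this record.
        def text_or_none(selector):
            element = div.select_one(selector)
            return element.get_text(strip=True) if element else None

        return {
            "title": text_or_none("a.title"),
            "author": text_or_none("span.author"),        # sometimes missing
            "publisher": text_or_none("span.publisher"),  # sometimes missing
        }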
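The second sketch: one heuristic for finding the next page of results, scanning every anchor for a “next”-style label and resolving relative links. In practice no single heuristic covers all sources, which is why connectors carry per-source next-page rules.

    from urllib.parse import urljoin

    def find_next_page(soup, current_url):
        # Look for an anchor whose visible text resembles a next-page control.
        for link in soup.find_all("a", href=True):
            if link.get_text(strip=True).lower() in ("next", "next >", ">>"):
                return urljoin(current_url, link["href"])
        return None  # no recognizable next-page control; stop paging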
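The third sketch: form-based authentication handled the way a browser would handle it, with a session object that carries cookies from the login request to the search request. The login URL and form field names are hypothetical; as noted above, they usually have to be reverse engineered per source.

    import requests

    def authenticated_search(query):
        session = requests.Session()  # replays cookies on subsequent requests
        # Step 1: submit the login form, exactly as the browser's POST would.
        session.post("https://www.example.com/login",
                     data={"username": "user", "password": "secret"})
        # Step 2: run the search inside the authenticated session.
        response = session.get("https://www.example.com/search",
                               params={"q": query})
        return response.text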
There are two important points to make about the issues with screen scraping we just enumerated. First, humans can deal with ambiguous structure and missing fields much more easily than computers can. Humans have no trouble identifying titles, authors, and other fields without being explicitly told what the fields are. Humans can ignore ads, footers, and headers with little effort. Humans can find links to subsequent results. Computers have to work much harder to obtain the same results.
The second point is that the problems of screen scraping are exacerbated when federation occurs. It is no longer enough to identify the date field in a result record. The format of that date field has to be normalized across all results from all sources. Granted, this is not a problem unique to screen scraping, but with screen scraping we must always be on the lookout for data that is inconsistent even within results from a single source. When the data is structured, as XML for example, it is more likely, though never guaranteed, to be consistent in its format.
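As a sketch of what normalization involves, the snippet below tries each date format known to occur across some set of sources and converts whatever matches to one canonical form, so that merged results can be sorted on the date field. The list of formats is, of course, an assumption that grows with every new source.

    from datetime import datetime

    # Formats observed across sources; extend as new sources are added.
    KNOWN_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d")

    def normalize_date(raw):
        for fmt in KNOWN_FORMATS:
            try:
                # Emit a single canonical form suitable for sorting.
                return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        return None  # unrecognized format; flag for manual review

    # Both normalize_date("January 1, 2008") and normalize_date("1/1/2008")
    # return "2008-01-01".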
It is important to note that arguments about whether it is better to screen scrape or not miss the point. Federated search engine vendors do screen scraping when there is no better method to access content from a source. Vendors who brag about not screen scraping are also telling you, indirectly, that there’s a large pool of sources they simply don’t search.
In subsequent parts of this series we will look at XML gateways, SRU/SRW, OpenSearch, Z39.50, and other methods of accessing content.
[Update 1/16/08: Part II of this series, about XML, is available, as are Part III, about OpenSearch, and Part IV, about SRU/SRW/Z39.50.]
Tags: deep web, federated search, screen scraping