Deep web searching is fundamentally a different beast than surface web searching. Surface web crawlers like Google’s follow a known set of links to discover new web pages and to grow their list of links. While they’re following links, the crawlers also grab the content and index it for human search.
Deep web search engines don’t follow links to find content; they fill out and submit search forms much the way humans do. Federated search engines, like deep web search engines, don’t use the crawl approach; they search content sources using either the deep web approach or some other mechanism for accessing the sources’ documents. Each content owner provides its own mechanism for content search and document retrieval.
Central to accessing search results and documents from a content source is the use of a combination of software and data called a connector. Think of a connector as a plug-in that connects the federated search application with the search and retrieval mechanism provided by the content source. Federated search engines must provide a unique connector for each source searched, which makes building connectors a time-consuming process.
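To make the plug-in idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Connector` interface, the `Result` record, the `InMemoryConnector` stand-in source, and the `federated_search` function are hypothetical names, not the API of any particular federated search product.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    """One hit returned by a source (hypothetical record shape)."""
    title: str
    url: str


class Connector(ABC):
    """Hypothetical plug-in interface: one subclass per content source."""

    @abstractmethod
    def search(self, query: str) -> List[Result]:
        """Run the query against this source and return its hits."""


class InMemoryConnector(Connector):
    """Stand-in source for illustration only.

    A real connector would talk to the source's own search mechanism
    instead of filtering a local list.
    """

    def __init__(self, documents: List[Result]):
        self.documents = documents

    def search(self, query: str) -> List[Result]:
        needle = query.lower()
        return [d for d in self.documents if needle in d.title.lower()]


def federated_search(connectors: List[Connector], query: str) -> List[Result]:
    """Send the same query to every connector and concatenate the hits."""
    results: List[Result] = []
    for connector in connectors:
        results.extend(connector.search(query))
    return results
```

The federated search application only ever calls `search`; whatever is peculiar to a source stays inside that source's connector, which is why each source needs its own.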
To understand the complexity of building connectors, consider the tasks each one must perform:
- Translate a user’s query syntax. Different search engines support different search syntaxes. Some require Boolean words (AND, OR, NOT) to be included in the search expression, others consider them optional, and others don’t support them at all. Some search engines support wildcard searches. Others don’t. Some search engines support quoted phrases while others don’t. A major task of a connector is to translate, or rewrite, what the user typed into what a particular search engine requires. Mistakes in this part of the connector will produce too few results, too many results, or results that are less relevant than expected.
- Map the search fields the user filled in to those available in the target source. Many federated search engines support advanced searches (those consisting of multiple fields, e.g. title, author, date). What varies widely from one federated search engine to another is how well it handles the fielded (advanced) search on the target search engine. What happens if the user enters a journal name into the journal field and the remote search engine doesn’t support a journal field but instead supports a collaboration field? What happens if a user enters text into a “full record” search field, intending to search the full text of all documents at the target site, but the site only supports search of title, author, and abstract? A smart federated search engine will allow creation of a connector that translates the user-supplied fields to the relevant remote ones, determined on a case-by-case basis.
- Submit the search. This process is more complicated than it sounds. One component of the submission process is to fill out an HTML search form the way a human would, or otherwise to provide the required query expression in the proper syntax to the target search engine. This task can be difficult if a number of form variables need to be set, especially if it’s not obvious to the person building the connector which variables need to be set and what their values should be. A further complication can arise if a search needs to be performed in multiple stages, or if cookie, session, or authentication information needs to be managed.
- Retrieve the search results page. Depending on the search engine, results may come back as HTML or XML. Retrieving the content is the most straightforward step, although the connector must allow for results that arrive slowly or not at all, and it must handle errors.
- Parse the search results. This process can be easy or difficult depending on the format the source uses to return its content, as well as other factors. XML is much easier to parse for document fields than HTML. Additionally, HTML output may be inconsistent, and the document fields need to be separated from other text and from HTML formatting information on the results page. Missing fields, especially in HTML output, need to be accounted for.
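Several of the tasks above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the function names, the `boolean_mode` values, the `q` form variable, the short field codes, and the `<record>` XML layout are all hypothetical, and a real source will differ on each of them. The network round trip (submission and retrieval, with their timeouts, cookies, and error handling) is deliberately left out; the sketch only builds the request URL.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple
from urllib.parse import urlencode

OPERATORS = {"AND", "OR", "NOT"}


def translate_query(query: str, boolean_mode: str, supports_phrases: bool) -> str:
    """Rewrite a user query for a source.

    boolean_mode is a hypothetical setting: "required" (terms must be
    joined by explicit operators), "unsupported" (operators are dropped),
    or "optional" (the query passes through unchanged).
    """
    # Split into quoted phrases and bare words.
    tokens = re.findall(r'"[^"]*"|\S+', query)
    if not supports_phrases:
        # Fall back to the individual words inside each phrase.
        tokens = [word for tok in tokens for word in tok.strip('"').split()]
    if boolean_mode == "required":
        out: List[str] = []
        for tok in tokens:
            # Insert AND only between two adjacent non-operator terms.
            if out and tok not in OPERATORS and out[-1] not in OPERATORS:
                out.append("AND")
            out.append(tok)
        return " ".join(out)
    if boolean_mode == "unsupported":
        return " ".join(tok for tok in tokens if tok not in OPERATORS)
    return " ".join(tokens)


def map_fields(user_fields: Dict[str, str],
               field_map: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Rename user-facing fields to the source's names; report the rest."""
    mapped: Dict[str, str] = {}
    dropped: List[str] = []
    for name, value in user_fields.items():
        if name in field_map:
            mapped[field_map[name]] = value
        else:
            dropped.append(name)  # the caller decides how to handle these
    return mapped, dropped


def build_search_url(base_url: str, query: str, fields: Dict[str, str]) -> str:
    """Encode the form variables; "q" is an assumed variable name."""
    return base_url + "?" + urlencode({"q": query, **fields})


def parse_results(xml_text: str) -> List[Dict[str, str]]:
    """Extract fields from an assumed <record> layout, tolerating gaps."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": rec.findtext("title", default="").strip(),
            "author": rec.findtext("author", default="").strip(),
            "url": rec.findtext("url", default="").strip(),
        }
        for rec in root.iter("record")
    ]
```

For a source that requires explicit operators, `translate_query('deep web "federated search"', "required", True)` returns `deep AND web AND "federated search"`. Returning the unmapped fields from `map_fields` rather than silently discarding them leaves the case-by-case decision (substitute another field, fold the text into the main query, or warn the user) to whoever builds the connector.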
In many ways the connector is the workhorse of the federated search engine. It is the component that has intimate contact with each underlying search engine. It has to handle the idiosyncrasies of each source and do something intelligent with the source regardless of the user’s input. This is no small task for the underappreciated federated search engine connector.