17
Dec

There are two approaches to gathering content for searching on the Internet. The best-known approach is crawling and indexing: the search engine starts with a list of known web pages, extracts the text from these pages, and follows their links to find new pages and new text to extract. All of this text is indexed for rapid search and retrieval of relevant documents.
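The crawl-and-index loop described above can be sketched in a few lines. This is only an illustration, not how any particular search engine works: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web, and the function names are made up for this example.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/home": ("federated search basics", ["a.example/deep", "a.example/crawl"]),
    "a.example/deep": ("deep web search uses forms", ["a.example/home"]),
    "a.example/crawl": ("crawlers follow links", []),
}

def crawl_and_index(seeds, pages):
    """Breadth-first crawl from the seed URLs, building an inverted index."""
    index = defaultdict(set)   # word -> set of URLs whose text contains it
    queue = deque(seeds)       # pages still to visit
    seen = set(seeds)          # pages already queued, to avoid revisiting
    while queue:
        url = queue.popleft()
        if url not in pages:   # unknown or unreachable page: skip it
            continue
        text, links = pages[url]
        for word in text.lower().split():
            index[word].add(url)
        for link in links:     # follow links to discover new pages
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

def search(index, query):
    """Return the URLs whose text contains every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = crawl_and_index(["a.example/home"], PAGES)
```

Once the index exists, a query such as `search(index, "deep web")` is answered from the index alone, without touching the pages again.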

The second approach is to perform live searches of web site content that lives in databases. This content is typically accessed by filling out web forms, much as humans do when they search databases. Searching via forms, which is largely what federated search does, is also known as deep web searching.
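Filling out a search form programmatically often amounts to encoding the form fields into a request. The sketch below assumes a simple GET form; the endpoint `https://db.example.org/search` and the field names `q` and `max` are hypothetical, and real forms may use POST, sessions, or other fields.

```python
from urllib.parse import urlencode

def build_form_query(action_url, fields):
    """Encode form fields into the URL a browser would request on submit (GET form assumed)."""
    return f"{action_url}?{urlencode(fields)}"

# Hypothetical database search form with a query box and a result limit.
url = build_form_query(
    "https://db.example.org/search",
    {"q": "protein folding", "max": 10},
)
```

A deep web connector would then fetch this URL and parse the result page for each query, which is why the search is always live.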

This article compares the pros and cons of the two approaches in five areas and illustrates why both methods of accessing content are necessary. This assessment shows that the arguments about which approach is better are fallacious. More importantly, we’ll conclude that for many users federated search, which accesses content from different sources in different ways and merges the results together, provides the best of both worlds.


Access to content. Much of the content on the Internet can’t be crawled. Bibliographic databases, for example, which contain a very large percentage of the high quality scientific and technical information available to the public or by subscription, need to be accessed through deep web search technologies. The reality is that much managed content, whose publishers care very much about its quality, is housed in databases. This content is not always accessible to web crawlers.

Speed of searching. Crawled and indexed content can usually be searched much more quickly than deep web content. Google demonstrates this very well. Deep web search engines are at the mercy of the underlying search engines for how quickly they can return results to the user. Crawlers, by contrast, have already extracted the content and can control the speed of delivery through how much hardware and network infrastructure they deploy. Federated search engines may take 30 seconds or longer to return all of their results, depending on how many of the sources are deep web sources and how slowly they respond. To ameliorate the slowness, some federated search engines return results to the user incrementally as they receive them from the underlying search engines.
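Incremental delivery can be sketched by querying all sources in parallel and handing back each source's results as soon as they arrive. The sources below are simulated with artificial delays; their names, delays, and contents are assumptions made for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def make_source(name, delay, documents):
    """Build a simulated source that answers after `delay` seconds."""
    def search(query):
        time.sleep(delay)  # stands in for the remote engine's response time
        return name, [doc for doc in documents if query in doc]
    return search

# Hypothetical sources: one slow deep web database, one fast local index.
SOURCES = [
    make_source("slow-db", 0.3, ["deep web search", "web forms"]),
    make_source("fast-index", 0.05, ["web crawling", "link graphs"]),
]

def federated_search(query, sources):
    """Yield (source name, hits) in the order the sources respond."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [pool.submit(source, query) for source in sources]
        for future in as_completed(futures):  # fastest source first
            yield future.result()

arrivals = list(federated_search("web", SOURCES))
```

The user sees the fast source's hits almost immediately, while the total time is still bounded by the slowest source.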

Age of content. Content accessed from a crawled index is only as current as the index. If a web site is crawled only once per week, then a surface web search may miss content that is up to six days old. Deep web search engines don’t have this problem since they search each source live for every query submitted. As soon as a deep web database is updated to include a new document, the very next search will find it. Age of content is a very significant issue when performing research in rapidly changing fields of study.

Merging of content from different sources and relevance ranking. Searching with Google or another surface crawler yields only the documents found by that particular crawler. Federated search engines can be configured to search multiple sources, including both deep web and crawled content sources. Additionally, the better federated search engines not only merge results and remove duplicates, they also rank the results by relevance in much the same way that surface web crawlers do.
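Merging, de-duplication, and ranking can be illustrated with a deliberately simple sketch. It assumes each result is a dictionary with `url`, `title`, and `snippet` keys, treats URLs that differ only by case or a trailing slash as duplicates, and ranks by how often the query terms occur. Real federated search engines use more sophisticated matching and ranking; everything here is hypothetical.

```python
def merge_and_rank(query, result_lists):
    """Merge result lists, drop duplicate URLs, and sort by a naive relevance score."""
    terms = query.lower().split()
    merged = {}
    for results in result_lists:
        for doc in results:
            key = doc["url"].rstrip("/").lower()  # normalized URL as duplicate key
            merged.setdefault(key, doc)           # keep the first copy seen

    def score(doc):
        # Naive relevance: count query-term occurrences in title and snippet.
        words = (doc["title"] + " " + doc["snippet"]).lower().split()
        return sum(words.count(term) for term in terms)

    return sorted(merged.values(), key=score, reverse=True)

# Hypothetical results from a crawled index and from a deep web source.
crawled = [
    {"url": "http://x.example/a", "title": "Deep web overview", "snippet": "what the deep web is"},
    {"url": "http://x.example/b", "title": "Crawling", "snippet": "how crawlers work"},
]
deep = [
    {"url": "http://x.example/a/", "title": "Deep web overview", "snippet": "what the deep web is"},
    {"url": "http://y.example/c", "title": "Web forms", "snippet": "searching the web"},
]

ranked = merge_and_rank("deep web", [crawled, deep])
```

The duplicate of the first document is dropped, and the remaining three documents are ordered by how well they match the query.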

Maintenance cost. An organization that is maintaining a search engine based on crawling and indexing of its own content needs to consider its ongoing cost. The time to recrawl and reindex content plus the associated network, hardware, CPU and storage costs are the variables to consider. These costs will vary, of course, depending on the size of the collections being crawled and indexed. Deep web search has different maintenance needs, the greatest of which is keeping the connectors current when the underlying search engines change their query (form) interface. Crawling, indexing, and storage are not concerns in the deep web approach.

I hope this article has demonstrated that there are strengths and weaknesses to both search approaches. Fortunately, you don’t have to pick one or the other. A federated search approach can serve the needs of those wanting the best of both worlds. In addition, a federated search engine can often be easily integrated into an existing enterprise search environment, seamlessly increasing the value of search to its users.


This entry was posted on Monday, December 17th, 2007 at 1:07 pm and is filed under basics.

