
There are two approaches to gathering content for searching on the Internet. The best-known approach is crawling and indexing: the search engine starts with a list of known web pages, extracts the text from those pages, and follows their links to find new pages and new text to extract. All of this text is indexed for rapid search and retrieval of relevant documents.
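
To make the crawl-and-index loop concrete, here is a minimal sketch in Python using only the standard library. The seed URL, the page limit, and the very simple text extraction are illustrative assumptions, not a description of any particular engine.

```python
# A minimal sketch of crawl-and-index: fetch a page, keep its text,
# follow its links, repeat. Seed URL and limits are hypothetical.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_data(self, data):
        self.text_parts.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: index each page's text, queue its links."""
    index = {}                      # url -> extracted text
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue                # unreachable page or unusable link
        parser = PageParser()
        parser.feed(html)
        index[url] = " ".join(part for part in parser.text_parts if part)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index
```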

The second approach is to perform live searches of content that lives in databases behind websites. This content is typically reached by filling out web forms, in much the same way that humans fill out forms when they search those databases themselves. Searching via forms, which is what federated search largely does, is also known as deep web searching.
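
As a rough illustration of the difference, a deep web query is essentially a programmatic form submission. In the sketch below, the search URL and form field names are hypothetical stand-ins for whatever a real database's search form expects.

```python
# A sketch of a "deep web" query: instead of crawling, the search engine
# fills out a site's search form programmatically. URL and field names
# are hypothetical.
from urllib.parse import urlencode
from urllib.request import urlopen


def search_database(query):
    # Hypothetical bibliographic database search form.
    form_url = "https://example.org/search"
    form_data = urlencode({"q": query, "max_results": 25}).encode("utf-8")
    # Submitting the form is an HTTP POST, exactly what a browser does
    # when a human clicks "Search".
    with urlopen(form_url, data=form_data, timeout=30) as response:
        return response.read().decode("utf-8", "ignore")
```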

This article compares the pros and cons of the two approaches in five areas and shows why both methods of accessing content are necessary. From that comparison it becomes clear that arguments about which approach is better are misguided. More importantly, we’ll conclude that for many users federated search, which accesses content from different sources in different ways and merges the results together, provides the best of both worlds.


Access to content. Much of the content on the Internet can’t be crawled. Bibliographic databases, for example, which contain a very large percentage of the high quality scientific and technical information available to the public or by subscription, need to be accessed through deep web search technologies. The reality is that much managed content, whose publishers care very much about its quality, is housed in databases. This content is not always accessible to web crawlers.

Speed of searching. Crawled and indexed content can usually be searched much more quickly than deep web content; Google demonstrates this very well. Deep web search engines are at the mercy of the underlying search engines: they can return results only as fast as those engines respond. Crawlers, by contrast, have already extracted the content and can control how quickly results are delivered by scaling their hardware and network infrastructure. A federated search engine may take 30 seconds or longer to return all of its results, depending on how many of its sources are deep web sources and how slowly they respond. To ameliorate this slowness, some federated search engines return results to the user incrementally, as they arrive from the underlying search engines.
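
One way to picture incremental delivery is to query every source concurrently and hand each source's results to the user the moment that source responds. The sketch below simulates this with artificial delays; the source names and timings are invented purely for illustration.

```python
# A sketch of incremental result delivery: results reach the user as each
# underlying source responds, rather than after the slowest one finishes.
import asyncio


async def query_source(name, delay, query):
    """Stand-in for a live query against one deep web source."""
    await asyncio.sleep(delay)          # simulated response time
    return name, [f"{name}: result for '{query}'"]


async def federated_search(query):
    # Hypothetical sources with made-up response times (seconds).
    sources = {"fast_index": 0.2, "slow_database": 3.0, "medium_database": 1.0}
    tasks = [query_source(name, delay, query) for name, delay in sources.items()]
    # as_completed yields each source's results the moment they arrive.
    for finished in asyncio.as_completed(tasks):
        name, results = await finished
        print(f"[{name}] returned {len(results)} result(s)")


asyncio.run(federated_search("federated search"))
```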

Age of content. Content accessed from a crawled index is only as current as the index. If a web site is crawled only once per week, a surface web search for six-day-old content will not find it. Deep web search engines don’t have this problem because they search each source live for every query submitted: as soon as a deep web database is updated to include a new document, the very next search will find it. Age of content is a very significant issue when performing research in rapidly changing fields of study.

Merging of content from different sources and relevance ranking. Searching Google or another surface crawler will only yield documents found by that particular crawler. Federated search engines can be configured to search multiple sources, including deep web and crawled content sources. Additionally, the better federated search engines not only merge results and remove duplicates, they also relevance-rank the combined results much as surface web crawlers do.
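
Here is a hedged sketch of what that merge step might look like: combine result lists from several sources, drop records whose URL has already been seen, and re-rank everything with a deliberately naive term-overlap score. Real engines use much richer relevance models; the field names and scoring below are assumptions for illustration only.

```python
# Merge results from several sources, remove duplicates by URL, and
# re-rank with a simple query-term-overlap score.
def merge_and_rank(query, result_lists):
    """result_lists: list of lists of dicts like {"url": ..., "title": ...}."""
    query_terms = set(query.lower().split())
    seen_urls = set()
    merged = []
    for results in result_lists:
        for record in results:
            if record["url"] in seen_urls:      # duplicate across sources
                continue
            seen_urls.add(record["url"])
            title_terms = set(record["title"].lower().split())
            record["score"] = len(query_terms & title_terms)
            merged.append(record)
    return sorted(merged, key=lambda r: r["score"], reverse=True)


# Tiny usage example with made-up records from two sources.
crawled = [{"url": "https://a.example/1", "title": "Deep web searching basics"}]
deep_web = [{"url": "https://b.example/2", "title": "Crawling the surface web"},
            {"url": "https://a.example/1", "title": "Deep web searching basics"}]
print(merge_and_rank("deep web searching", [crawled, deep_web]))
```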

Maintenance cost. An organization that maintains a search engine based on crawling and indexing its own content needs to consider the ongoing cost: the time to recrawl and reindex content plus the associated network, hardware, CPU, and storage costs. These costs will vary, of course, with the size of the collections being crawled and indexed. Deep web search has different maintenance needs, the greatest of which is keeping connectors current when the underlying search engines change their query (form) interfaces. Crawling, indexing, and storage are not concerns in the deep web approach.
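
To see why connector upkeep dominates, consider what a connector actually encodes. The descriptor below is a hypothetical example of the kind of form mapping that must be updated whenever a source changes its search interface; none of the names reflect a real system.

```python
# A hypothetical connector descriptor: the mapping between the federated
# engine's query and one source's search form.
CONNECTOR = {
    "source": "Example Bibliographic Database",
    "form_url": "https://example.org/search",
    "method": "POST",
    "field_map": {
        "query": "q",            # our query string -> their form field
        "max_results": "num",
        "sort": "order_by",
    },
}
# If the source renames "q" or moves its search form, only this descriptor
# needs to change; there is no index to rebuild and nothing to re-store.
```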

I hope this article has demonstrated that there are strengths and weaknesses to both search approaches. Fortunately, you don’t have to pick one or the other. A federated search approach can serve the needs of those wanting the best of both worlds. Plus a federated search engine can often be easily integrated into an existing enterprise search environment, seamlessly increasing the value of search to its users.




3 Responses so far to "Crawling vs. deep web searching"

  1. Content access basics - Part I - screen scraping » Federated Search Blog
    December 27th, 2007 at 12:51 pm

    [...] web searches since they access content that lives inside of databases. Read the earlier articles on crawling vs. deep web searching and introduction to the deep web for background information on deep web searching. Also, read the [...]

  2. A Federated Search Primer - Part III of III | Deep Web Technologies Blog
    February 24th, 2009 at 1:41 am

    [...] * 99 Resources to Research & Mine the Invisible Web * Beyond Google: The Invisible Web * Crawling vs. deep web searching * Glossary of search industry terms * Introduction to the deep web * Invisible Web: What it is, Why [...]

  3. dave tribbett
    April 12th, 2010 at 10:15 pm

    Great post and comments. HERE is an article that adds additional detail to the topic and a good set of links to deep web search engines and other helpful sites.
    The Internet of Things will add considerable content to the deep web, creating huge information shadows for each device or thing connected. Couple this with the continued growth of mobile computing and you can see where this is going. I think the end result is that specialized search engines will become more and more important, as only they will be able to traverse and catalog the content in a way that makes it accessible beyond a link. Call it consumable results instead of link results.
    Good post.
