Followup on Matt’s deep web crawler journey | Federated Search Blog

Last November I wrote about Matt, a software developer and graduate student in computer science. Matt had blogged about a deep web crawler he was building. Five months later, I’m curious to know how you’re doing, Matt. Please let us know if you’ve done more work on your crawler since your last post mentioning the subject on November 21.

Matt blogged five times about the crawler:

Creating a deep-web crawler with .NET: Background (Nov. 10)

For one of my graduate courses, I’ve decided to tackle the task of creating an intelligent agent for deep-web (AKA hidden-web) crawling. Unlike traditional crawlers, which work by following hyperlinks from a set of seed pages, a deep-web crawler digs through search interfaces and forms on the Internet to access the data underneath, data that is typically not directly accessible via a hyperlink (and would therefore be missed by a traditional crawler).

Deep-web crawling with .NET: Getting Started (Nov. 12)

DeepCrawler.NET is written in C# for Microsoft .NET 3.5. While there is intelligence behind it, at its core it is doing nothing more than automating Internet Explorer. The crawler’s “brain” examines a page in IE, then tells IE what to do, such as populate a form field with a value, click a link or button, or navigate to a new URL. To facilitate this automation, I’m currently using the open-source WatiN API. WatiN is actually designed for creating unit tests for web interfaces, but it’s proving to be a fairly nice abstraction over the alternative method of automating IE from C# (that is, using the raw COM APIs).

DeepCrawler.NET: Alive and Kicking (Nov. 14)

Much to my surprise, getting DeepCrawler.NET up and working with basic functionality was easy. It’s far from finished, and I haven’t exhaustively tested it, but it does work. In this post, I’ll describe the current implementation with respect to how I’ve addressed some of the barriers raised in my last post.

Crawling results in DeepCrawler.NET (Nov. 17)

In the last post, I laid out DeepCrawler.NET’s (primitive) strategy for finding search forms, populating them, and submitting their contents using WatiN and a heuristic search mechanism. As I mentioned at the end of the previous post though, submitting a query is only the first step in a complicated process. Assuming nothing goes wrong, submitting a query will get us back one or more pages of results. The problem now becomes parsing and crawling the results.
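To make the "parsing and crawling the results" step a bit more concrete, here is a rough sketch of my own. It is not taken from Matt's posts, and it is written in Python with only the standard library rather than in C# with WatiN. It pulls the links out of one page of results, resolves relative URLs, drops duplicates and off-site links, and tries to spot a "next page" link. The names `LinkCollector` and `split_result_links`, the same-host filter, and the "link text starts with next" pagination cue are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def split_result_links(page_url, html):
    """Return (result_urls, next_page_url) for one page of results."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    results, next_page, seen = [], None, set()
    for href, text in collector.links:
        url = urljoin(page_url, href)  # resolve relative links
        # Assumption: only follow links that stay on the searched site
        if urlparse(url).netloc != host or url in seen:
            continue
        seen.add(url)
        # Hypothetical pagination cue: link text that starts with "next"
        if next_page is None and text.lower().startswith("next"):
            next_page = url
        else:
            results.append(url)
    return results, next_page
```

A real crawler would also need to tell result links apart from navigation and advertising links, which this sketch does not attempt.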

Friday Grab-Bag (Nov. 21)

Next week I hope to write more about DeepCrawler.NET (there’s a chance that my current employer may even adopt the code, which means I could actually get paid to work on it!), but if anyone would rather see/hear about something else, let me know.

I encourage technically minded readers to read Matt’s articles. Matt has done a substantial amount of development work to get as far as he has, and he has tackled some non-trivial problems. Here are some of the things Matt has achieved with his deep web crawler:

  1. Finds search forms
  2. Populates search forms
  3. Submits query via search form
  4. Crawls links it finds on result pages
  5. Tries to avoid login forms
  6. Tries to guess which text box on a form to put the query terms into
  7. Examines buttons and images to try to identify the “submit” mechanism for the form
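For readers who want a feel for what items 1, 5, 6, and 7 involve, here is a minimal sketch of my own. It is not Matt's code: DeepCrawler.NET is written in C# and drives Internet Explorer through WatiN, while this is plain Python over static HTML using only the standard library. The keyword lists, the "password field means login form" rule, and every function name below are assumptions made for illustration; Matt's posts describe his actual heuristics.

```python
from html.parser import HTMLParser

# Hypothetical keyword lists, chosen for illustration only
QUERY_HINTS = ("query", "search", "keyword", "term")
LOGIN_HINTS = ("login", "signin", "user", "email")


class FormCollector(HTMLParser):
    """Collects each <form> and its input controls from a page."""

    def __init__(self):
        super().__init__()
        self.forms = []        # finished forms, each a dict
        self._current = None   # form currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "", "inputs": []}
        elif tag in ("input", "button") and self._current is not None:
            # <input> defaults to type "text"; <button> defaults to "submit"
            default_type = "text" if tag == "input" else "submit"
            self._current["inputs"].append({
                "type": (attrs.get("type") or default_type).lower(),
                "name": (attrs.get("name") or attrs.get("id") or "").lower(),
            })

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def is_login_form(form):
    """Assumption: a password field marks a login form to be avoided."""
    return any(i["type"] == "password" for i in form["inputs"])


def pick_query_box(form):
    """Guess which text box takes the query terms; None if none looks suitable."""
    boxes = [i for i in form["inputs"] if i["type"] in ("text", "search")]
    if not boxes:
        return None

    def score(box):
        hit = box["name"] == "q" or any(h in box["name"] for h in QUERY_HINTS)
        miss = any(h in box["name"] for h in LOGIN_HINTS)
        return int(hit) - int(miss)

    best = max(boxes, key=score)
    return best["name"] if score(best) >= 0 else None


def pick_submit(form):
    """Return the type of the first button or image that can submit the form."""
    for i in form["inputs"]:
        if i["type"] in ("submit", "image"):
            return i["type"]
    return None


def find_search_forms(html):
    """Return one dict per form that looks like a search form."""
    parser = FormCollector()
    parser.feed(html)
    found = []
    for form in parser.forms:
        if is_login_form(form):
            continue
        box = pick_query_box(form)
        if box is not None:
            found.append({"action": form["action"], "query_box": box,
                          "submit": pick_submit(form)})
    return found
```

Calling `find_search_forms(page_html)` returns a list describing each candidate search form. Even this toy version hints at why the problem is hard: real pages use scripted submit handlers, unnamed fields, and forms that are built dynamically, none of which static parsing like this can see.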

Matt, can we see a demo? And can we get an update on your progress? As someone who does a little software development from time to time and knows how tough it is to write a deep web crawler, I’m fascinated by your effort and want to know how you’re doing.



This entry was posted on Wednesday, April 22nd, 2009 at 9:56 pm and is filed under technology.
