
Although I do much less programming than I once did, I can still relate to the thrill of writing code to solve a fun and challenging problem. Matt, a software developer and graduate student in computer science, is tackling a very difficult problem: he’s going to write software to “crawl” content that lives behind search forms. And he’s blogging about it. The first installment of his journey is in his Try-Catch-FAIL blog.

For those of you who aren’t familiar with what it takes to automatically find content behind a search form, Matt identifies some of the hurdles:

  • No two search engines are the same.
  • There are no standards for coding of web forms.
  • Client-side scripting. This is mainly AJAX.
  • SSL

To Matt’s list I add a few more challenges:

  • Cookies - setting them and reading them
  • Knowing which form parameters to submit and how to set them. Don’t forget the hidden ones.
  • Knowing what query terms to use for automatic searching
  • Parsing of search results - breaking results up into fields and determining what those fields are (e.g. title, author, …)
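
To make the hidden-parameter point concrete, here is a minimal Python sketch of how a crawler might read a search form, keep its hidden fields, and fill in a query term. It is an illustration under stated assumptions, not Matt's approach: the class `FormFieldCollector`, the function `build_search_request`, the sample page, and the example.com URL are all hypothetical, and the sketch only looks at `<input>` elements in the first form on the page (no `<select>`, `<textarea>`, JavaScript, or cookies).

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFieldCollector(HTMLParser):
    """Collects action, method, and <input> fields of the first <form> seen."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.done = False
        self.action = ""
        self.method = "get"
        self.fields = {}  # name -> default value, hidden inputs included

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.done:
            self.in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self.in_form:
            name = attrs.get("name")
            kind = (attrs.get("type") or "text").lower()
            # Skip buttons; keep text and hidden inputs with their defaults.
            if name and kind not in ("submit", "button", "reset", "image"):
                self.fields[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            self.in_form = False
            self.done = True


def build_search_request(page_url, html, query_field, term):
    """Return (method, target URL, parameters) for submitting one query term."""
    collector = FormFieldCollector()
    collector.feed(html)
    params = dict(collector.fields)   # hidden fields carried over unchanged
    params[query_field] = term        # assumption: we already know the query field
    target = urljoin(page_url, collector.action)
    return collector.method, target, params


# Hypothetical search page, used only for illustration.
SAMPLE_PAGE = """
<form action="/search" method="get">
  <input type="hidden" name="session" value="abc123">
  <input type="text" name="q">
  <input type="submit" value="Go">
</form>
"""

method, target, params = build_search_request(
    "http://example.com/catalog/", SAMPLE_PAGE, "q", "deep web"
)
print(method, target + "?" + urlencode(params))
```

Even in this toy case the crawler has to be told which field holds the query (`q` here). Working that out automatically, along with choosing good query terms, is a large part of what makes the real problem hard.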

To learn more about the challenges of automatically searching the deep web, I recommend some of my previous articles:

Matt is implementing an automated browser based on Internet Explorer to get around the AJAX issues. I think Matt’s got exciting and difficult challenges ahead of him. I look forward to watching his progress.


This entry was posted on Wednesday, November 12th, 2008 at 11:39 am and is filed under viewpoints.

One Response to "Journey to building a deep web crawler"

  1. danny
    December 30th, 2008 at 8:34 pm

    Nice article!
    In fact, a person could make his own web crawler via services such as:
    http://www.knowlesys.com/products/custom_web_data_crawler.htm
