Journey to building a deep web crawler

Author: Sol

Although I do much less programming than I once did, I can still relate to the thrill of writing code to solve a fun and challenging problem. Matt, a software developer and graduate student in computer science, is tackling a very difficult problem: writing software to “crawl” content that lives behind search forms. And he’s blogging about it. The first installment of his journey is on his Try-Catch-FAIL blog.

For those of you who aren’t familiar with what it takes to automatically find content behind a search form, Matt identifies some of the hurdles:

No two search engines are the same.
There are no standards for coding of web forms.
Client-side scripting. This is mainly AJAX.
SSL

To Matt’s list I add a few more challenges:

Cookies - setting them and reading them
Knowing which form parameters to submit and how to set them. Don’t forget the hidden ones.
Knowing what query terms to use for automatic searching
Parsing of search results - breaking results up into fields and determining what those fields are (e.g. title, author, …)
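
To make the form-parameter challenge a little more concrete, here is a minimal Python sketch of how a crawler might discover a form's fields, hidden ones included, and build a search submission. It is only an illustration under simplifying assumptions, not Matt's approach: the names FormFieldParser, build_search_request, and SAMPLE_PAGE are hypothetical, the sample HTML and example.com URL are made up, and the sketch looks only at the first form's input elements (no select or textarea elements, no JavaScript).

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFieldParser(HTMLParser):
    """Collect the action, method, and <input> fields of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = {}       # name -> default value, hidden inputs included
        self.text_fields = []  # names of free-text inputs a query could go into
        self._in_form = False
        self._done = False     # set once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            kind = (attrs.get("type") or "text").lower()
            # Buttons are skipped; a real browser sends only the one clicked.
            if kind in ("submit", "button", "image", "reset"):
                return
            # Unchecked checkboxes and radio buttons are not submitted.
            if kind in ("checkbox", "radio") and "checked" not in attrs:
                return
            self.fields[attrs["name"]] = attrs.get("value") or ""
            if kind in ("text", "search"):
                self.text_fields.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_search_request(page_url, html, query):
    """Return (method, target_url, encoded_params) for submitting `query`."""
    parser = FormFieldParser()
    parser.feed(html)
    if not parser.text_fields:
        raise ValueError("no free-text input found in the first form")
    params = dict(parser.fields)            # keep hidden fields as the page set them
    params[parser.text_fields[0]] = query   # assumption: first text box is the query box
    target = urljoin(page_url, parser.action)
    return parser.method, target, urlencode(params)


# Made-up page used only to exercise the sketch.
SAMPLE_PAGE = """
<form action="/search" method="post">
  <input type="hidden" name="session" value="abc123">
  <input type="text" name="q">
  <input type="submit" value="Go">
</form>
"""

method, url, body = build_search_request(
    "http://example.com/index.html", SAMPLE_PAGE, "deep web"
)
print(method, url, body)
```

Even this toy version has to guess which text box takes the query, which hints at why real search forms, with their scripts, cookies, and inconsistent markup, are so much harder.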

To learn more about the challenges of automatically searching the deep web, I recommend some of my previous articles:

Matt is implementing an automated browser based on Internet Explorer to get around the AJAX issues. I think Matt’s got exciting and difficult challenges ahead of him. I look forward to watching his progress.

Tags: deep web, federated search

This entry was posted on Wednesday, November 12th, 2008 at 11:39 am and is filed under viewpoints.

One Response to "Journey to building a deep web crawler"

1 danny
December 30th, 2008 at 8:34 pm
Nice article!
In fact, a person could make his own web crawler via services such as:
http://www.knowlesys.com/products/custom_web_data_crawler.htm
