Although I do much less programming than I once did, I can still relate to the thrill of writing code to solve a fun and challenging problem. Matt, a software developer and graduate student in computer science, is tackling a very difficult problem: he’s going to write software to “crawl” content that lives behind search forms. And he’s blogging about it. The first installment of his journey is on his Try-Catch-FAIL blog.
For those of you who aren’t familiar with what it takes to automatically find content behind a search form, Matt identifies some of the hurdles (a short code sketch follows the list):
- No two search engines are the same.
- There are no standards for how web forms are coded.
- Client-side scripting, mainly AJAX.
- SSL.
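To make the first two hurdles concrete, here’s a minimal Python sketch of querying a search form over plain HTTP. The URL and the “q” parameter are hypothetical stand-ins; every engine names its parameters differently, and a plain HTTP client like this never executes a page’s JavaScript, which is exactly why AJAX-heavy sites defeat it:

```python
import requests

# Hypothetical endpoint and parameter name -- in practice every search
# engine uses its own URL, parameter names, and form layout.
SEARCH_URL = "https://example.com/search"

def run_search(query):
    """Submit a query to the search form and return the raw result HTML."""
    resp = requests.get(SEARCH_URL, params={"q": query}, timeout=30)
    resp.raise_for_status()  # HTTP errors and SSL failures surface here
    # Anything the page builds client-side via AJAX will NOT be in this
    # HTML: requests fetches bytes, it does not run JavaScript.
    return resp.text

# e.g. html = run_search("federated search")
```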
To Matt’s list I add a few more challenges (the sketch after the list illustrates the first two):
- Cookies - setting them and reading them.
- Knowing which form parameters to submit and how to set them. Don’t forget the hidden ones.
- Knowing what query terms to use for automatic searching.
- Parsing of search results - breaking results up into fields and determining what those fields are (e.g. title, author, …).
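Here’s a minimal sketch of the cookie and hidden-parameter challenges, assuming Python’s requests and BeautifulSoup libraries. The URL and the “query” field name are placeholders; the key ideas are that a Session carries cookies across requests, and that walking the form’s input elements, hidden ones included, recovers the parameters that must be echoed back on submit:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def prepare_search(form_page_url):
    """Fetch a page, find its first <form>, and collect every named input
    (hidden ones included) so their values can be echoed back on submit."""
    session = requests.Session()        # a Session stores and resends cookies
    resp = session.get(form_page_url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")
    form = soup.find("form")
    if form is None:
        raise RuntimeError("no <form> found on %s" % form_page_url)
    fields = {inp["name"]: inp.get("value", "")
              for inp in form.find_all("input") if inp.get("name")}
    action = urljoin(form_page_url, form.get("action") or "")
    method = (form.get("method") or "get").lower()
    return session, action, method, fields

# Hypothetical usage -- the URL and the "query" field name are placeholders:
#   session, action, method, fields = prepare_search("https://example.com/search")
#   fields["query"] = "deep web"        # fill in the visible query box
#   if method == "post":
#       results_html = session.post(action, data=fields, timeout=30).text
#   else:
#       results_html = session.get(action, params=fields, timeout=30).text
```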
To learn more about the challenges of automatically searching the deep web, I recommend some of my previous articles:
- Content access basics - Part I - screen scraping
- Crawling vs. deep web searching
- The interplay between AJAX and federated search
- What is a connector?
- More on AJAX and federated search
Matt is implementing an automated browser based on Internet Explorer to get around the AJAX issues. I think Matt’s got exciting and difficult challenges ahead of him. I look forward to watching his progress.
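To give a flavor of the browser-automation approach, here’s a Windows-only sketch that drives Internet Explorer through the pywin32 COM bindings. It’s my illustration, not Matt’s code; the point is that a real browser runs the page’s client-side script, so the DOM you read back includes AJAX-built content that a plain HTTP fetch would miss:

```python
# Windows-only sketch using the pywin32 COM bindings; Matt's actual
# implementation may look nothing like this.
import time
import win32com.client

def fetch_rendered_html(url):
    """Load a page in a real IE instance so its client-side script runs,
    then return the DOM's HTML after rendering settles."""
    ie = win32com.client.Dispatch("InternetExplorer.Application")
    ie.Visible = False
    ie.Navigate(url)
    while ie.Busy or ie.ReadyState != 4:   # 4 == READYSTATE_COMPLETE
        time.sleep(0.5)
    time.sleep(2)                          # crude grace period for late AJAX updates
    html = ie.Document.documentElement.outerHTML
    ie.Quit()
    return html
```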
Tags: deep web, federated search