13
Feb

99 deep web resources

Author: Sol

CollegeDegree.com has just published a nice resource, 99 Resources to Research & Mine the Invisible Web. It’s a list of deep web search engines, databases, catalogs, directories, and social media sites. At the bottom of the list are references to a number of guides about the deep web. The deep web consists of content that usually lives in a database and is not easily accessible to Google and other web crawlers. Humans reach it through web forms, and federated search engines reach it through specialized knowledge of deep web sources.

I’ve added this document to the resources page.

This “99 resources” guide is definitely worth a bookmark.

If you enjoyed this post, make sure you subscribe to the RSS feed!

Also, check out our writing contest with awesome prizes!
30
Dec

Part I of this series on content access basics explained how many federated search engines (FSEs) performing deep web searches use screen scraping to process search results, and described the problems associated with this approach. This article introduces how FSEs process XML-formatted search results.

FSEs use jargon such as “XML gateway” or “XML interface” to mean that they can interact with a particular content source using XML. The FSE may generate and submit an XML query, or the remote search engine may generate the search results and return them as an XML document. In this article we are going to focus on the processing of XML results.

So, what is XML? Wikipedia has a nice introduction to XML plus a few examples. Here’s a nice simple tutorial on XML. The important idea about XML is that there is no ambiguity about where to find information. XML is intended for consumption by computer programs. It is very highly structured.
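
To make the "no ambiguity" point concrete, here is a minimal sketch in Python of reading an XML result list. The document layout and the element names (results, result, title, url) are made up for illustration; every content source defines its own format, so a real XML gateway would be written against that source's documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML result document. The element names are assumptions
# for illustration only; real sources define their own schemas.
SAMPLE_XML = """<results>
  <result>
    <title>Deep web basics</title>
    <url>http://example.org/doc1</url>
  </result>
  <result>
    <title>Federated search overview</title>
    <url>http://example.org/doc2</url>
  </result>
</results>"""


def parse_results(xml_text):
    """Return one dict per <result> element, with its title and url."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "url": item.findtext("url", default=""),
        }
        for item in root.findall("result")
    ]


results = parse_results(SAMPLE_XML)
for hit in results:
    print(hit["title"], "->", hit["url"])
```

Each field is found by name rather than by guessing at page layout, which is what makes XML results easier for a program to consume than HTML.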


27
Dec

In this multi-part series we will look at a number of different approaches that federated search engines (FSEs) take to access content from remote databases.

FSEs are always at the mercy of the content provider when it comes to searching and retrieving content. FSEs perform deep web searches since they access content that lives inside of databases. Read the earlier articles on crawling vs. deep web searching and introduction to the deep web for background information on deep web searching. Also, read the article about connectors to understand how the query processing and search engine submission process works for deep web searching.

When FSEs search deep web databases, they often do so by filling out search forms much as humans do. They also process result lists (summaries of documents generated by the remote search engines) much the way humans examine search results in their browsers. Processing a list of search results by reading and dissecting the HTML that a search engine provides is called “screen scraping.” Wikipedia has an article about screen scraping.
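
As a rough illustration of screen scraping, the Python sketch below pulls titles and links out of a small HTML result page. The page markup, the "hit" class name, and the ResultScraper class are all invented for this example; a real scraper has to be written against whatever HTML a particular search engine happens to produce.

```python
from html.parser import HTMLParser

# Hypothetical result page. The markup and the "hit" class are assumptions;
# every real search engine lays out its results differently.
SAMPLE_HTML = """
<html><body>
<ol>
  <li class="hit"><a href="http://example.org/doc1">Deep web basics</a></li>
  <li class="hit"><a href="http://example.org/doc2">Federated search overview</a></li>
</ol>
<a href="/about">About</a>
</body></html>
"""


class ResultScraper(HTMLParser):
    """Collect the link and link text found inside each <li class="hit">."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._in_hit = False
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "hit":
            self._in_hit = True
        elif tag == "a" and self._in_hit:
            self._current = {"url": attrs.get("href", ""), "title": ""}

    def handle_data(self, data):
        # Text is only collected while we are inside a result link.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.results.append(self._current)
            self._current = None
        elif tag == "li":
            self._in_hit = False


def scrape(html_text):
    scraper = ResultScraper()
    scraper.feed(html_text)
    scraper.close()
    return scraper.results


hits = scrape(SAMPLE_HTML)
for hit in hits:
    print(hit["title"], "->", hit["url"])
```

Note how much the scraper depends on layout details: if the site renames the class or restructures the list, the scraper silently stops finding results.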


24
Dec

Abe discovered “Searching the Deep Web” on YouTube yesterday. This is a nice, professionally produced 6-minute introduction to the Deep Web made by the US Department of Energy Office of Scientific and Technical Information (OSTI). Deep Web Technologies (DWT) is mentioned in the video because DWT has created the search technology for a number of major OSTI applications.

What’s very cool about this video hitting YouTube is that Abe and I think of YouTube as hosting very mainstream videos. We like the idea of the public being exposed to federated search in such a venue.

For your viewing pleasure, here’s the video. Make yourself some popcorn, relax, and enjoy the show!

21
Dec

Marcus Zillman will be speaking about his Deep Web Research 2008 publication in his Awareness Watch program on BlogTalkRadio. The show will air January 13, 2008 at 2:00 PM Eastern Time. Here’s the show description from the show’s information page:

We will be discussing my latest publication Deep Web Research 2008 that describes the many many resources and sites that you can drill deep into the web to discover information that is not available through the traditional search engine! We will also be scrolling through the zillman.us blogs and bringing the latest Net sightings and updates!

Note that the focus of this program is on deep web content sources, not on federating them, so it will not be a federated search program. Nevertheless, those in the federated search world should be interested in knowing about content worth federating.

13
Dec

What is a connector?

Author: Sol

Deep web searching is a fundamentally different beast from surface web searching. Surface web crawlers like Google follow a known set of links to discover new web pages and to grow their list of links. While they’re following links, the surface web crawlers are also grabbing the content and indexing it for human search.
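
The link-following idea can be sketched in a few lines of Python. The LINKS table below is a made-up, in-memory stand-in for real web pages, and the crawl function is only an illustration of the approach, not how any particular search engine is implemented.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
# A real crawler would fetch and parse pages instead of using a table.
LINKS = {
    "/home": ["/about", "/articles"],
    "/about": ["/home"],
    "/articles": ["/articles/1", "/articles/2"],
    "/articles/1": [],
    "/articles/2": ["/home"],
}


def crawl(seed):
    """Visit pages breadth-first, growing the list of known links as we go."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real crawler would index the content here
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/home")
print(visited)
```

A page that no other page links to never enters the queue, which is one reason database content behind search forms stays invisible to crawlers.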

Deep web search engines don’t follow links to find content; they fill out and submit search forms much the way humans do. Federated search engines, like deep web search engines, don’t use the crawl approach. They search content sources using either the deep web approach or some other mechanism for accessing each source’s documents. Each content owner provides its own mechanism for content search and document retrieval.
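
Because each content owner provides its own search mechanism, the same user query has to be translated separately for each source. Here is a minimal Python sketch of that translation step. The source names, URLs, and form field names are all hypothetical, and the sketch only builds the request URLs; it does not send them.

```python
from urllib.parse import urlencode

# Hypothetical source descriptions. Each source has its own form URL,
# its own name for the query field, and its own fixed extra parameters.
SOURCES = {
    "source_a": {
        "action": "http://example.org/a/search",
        "query_field": "q",
        "extra": {"format": "html"},
    },
    "source_b": {
        "action": "http://example.org/b/find",
        "query_field": "terms",
        "extra": {},
    },
}


def build_request_url(source, query):
    """Fill out one source's search form the way a browser would for a GET form."""
    params = {source["query_field"]: query}
    params.update(source["extra"])
    return source["action"] + "?" + urlencode(params)


urls = {name: build_request_url(cfg, "deep web") for name, cfg in SOURCES.items()}
for name, url in urls.items():
    print(name, url)
```

One query goes in and one source-specific request comes out per source, which is the part of the job that has to be tailored to every content provider.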

