deep web | Federated Search Blog

Editor’s Note: This post is re-published with permission from the Deep Web Technologies Blog. This is a guest article by Lisa Brownlee. The 2015 edition of her book, “Intellectual Property Due Diligence in Corporate Transactions: Investment, Risk Assessment and Management”, originally published in 2000, will dive into discussions about using the Deep Web and the Dark Web for Intellectual Property research, emphasizing its importance and usefulness when performing legal due-diligence.

Lisa M. Brownlee is a private consultant and has become an authority on the Deep Web and the Dark Web, particularly as they apply to legal due-diligence. She writes and blogs for Thomson Reuters. Lisa is an internationally-recognized pioneer at the intersection of digital technologies and law.


In this blog post I will delve in some detail into the Deep Web. This expedition will focus exclusively on that part of the Deep Web that excludes the Dark Web.  I cover both Deep Web and Dark Web legal due diligence in more detail in my blog and book, Intellectual Property Due Diligence in Corporate Transactions: Investment, Risk Assessment and Management. In particular, in this article I will discuss the Deep Web as a resource of information for legal due diligence.

When Deep Web Technologies invited me to write this post, I initially intended to delve primarily into the ongoing confusion regarding Deep Web and Dark Web terminology. The misuse of the terms Deep Web and Dark Web, among other related terms, is problematic from a legal perspective if confusion about those terms spills over into licenses and other contracts and into laws and legal decisions. The terms are so hopelessly intermingled that I decided it is not useful to even attempt untangling them here. In this post, as mentioned, I will specifically cover the Deep Web excluding the Dark Web. The definitions I use are provided in a blog post I wrote on the topic earlier this year, entitled The Deep Web and the Dark Web – Why Lawyers Need to Be Informed.

Deep Web: a treasure trove of data and other information

The Deep Web is populated with vast amounts of data and other information that are essential to investigate during a legal due diligence in order to find information about a company that is a target for possible licensing, merger or acquisition. A Deep Web (as well as Dark Web) due diligence should be conducted in order to ensure that information relevant to the subject transaction and target company is not missed or misrepresented. Lawyers and financiers conducting the due diligence have essentially two options: conduct the due diligence themselves by visiting each potentially-relevant database and conducting each search individually (potentially ad infinitum), or hire a specialized company such as Deep Web Technologies to design and set up such a search. Hiring an outside firm to conduct such a search saves time and money.

Deep Web data mining is a science that cannot be mastered by lawyers or financiers in a single transaction, or even in a handful of them. Using a specialized firm such as DWT has the added benefit of being able to replicate the search on-demand and/or have ongoing updated searches performed. Additionally, DWT can bring multilingual search capacities to investigations, a feature that very few, if any, other data mining companies provide and that would most likely be deficient or entirely missing in a search conducted entirely in-house.

What information is sought in a legal due diligence?

A legal due diligence will investigate a wide and deep variety of topics, from real estate to human resources, to basic corporate finance information, industry and company pricing policies, and environmental compliance. Due diligence nearly always also investigates intellectual property rights of the target company, in a level of detail that is tailored to specific transactions, based on the nature of the company’s goods and/or services. DWT’s Next Generation Federated Search is particularly well-suited for conducting intellectual property investigations.

In sum, the goal of a legal due diligence is to identify and confirm basic information about the target company and determine whether there are any undisclosed infirmities with the target company’s assets and information as presented. In view of these goals, the investing party will require the target company to produce a checklist full of items about the various aspects of the business (and more) discussed above. An abbreviated correlation between the information typically requested in a due diligence and the information that is available in the Deep Web is provided in the chart attached below. In the absence of assistance by Deep Web Technologies with the due diligence, either someone within the investor company or its outside counsel will need to search in each of the databases listed, in addition to others, in order to confirm that the information provided by the target company is correct and complete. While representations and warranties are typically given by the target company as to the accuracy and completeness of the information provided, it is also typical for the investing company to confirm all or part of that information, depending on the sensitivities of the transaction and the areas in which the values, and possible risks, might be uncovered.

Deep Web Legal Due-Diligence Resource List (PDF)


Kosmix is the search engine that produces “topic pages” on millions of subjects. Kosmix creates these topic pages by searching APIs of deep web sources (in real time). In other words, Kosmix relies heavily on federated search for the content of their topic pages. (Actually, Kosmix combines federated search with crawling technology. More about this later in this article.)
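For readers new to the idea, the real-time federated pattern described above can be sketched in a few lines of Python. This is only an illustration, not Kosmix's implementation: the `SOURCES` dictionary, `search_source`, and `federated_search` are hypothetical names, and the in-memory lists stand in for real deep web search APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "sources" standing in for real deep web search APIs.
SOURCES = {
    "patents": ["Solar panel mounting bracket", "Wind turbine blade"],
    "news": ["Solar panel mounting bracket", "Solar startup raises funding"],
    "journals": ["Advances in solar cell efficiency"],
}


def search_source(name, query):
    """Return (source, title) hits whose title contains the query, ignoring case."""
    q = query.lower()
    return [(name, title) for title in SOURCES[name] if q in title.lower()]


def federated_search(query):
    """Query every source concurrently, then merge and de-duplicate the hits."""
    with ThreadPoolExecutor() as pool:
        # Submit one search per source; results are read back in submission order.
        futures = [pool.submit(search_source, name, query) for name in SOURCES]
        hits = [hit for future in futures for hit in future.result()]

    # Keep only the first hit for each title (case-insensitive).
    seen = set()
    merged = []
    for source, title in hits:
        key = title.lower()
        if key not in seen:
            seen.add(key)
            merged.append((source, title))
    return merged


if __name__ == "__main__":
    for source, title in federated_search("solar"):
        print(f"{source}: {title}")
```

A production system would also need per-source timeouts, relevance ranking, and error handling, none of which is attempted here.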

Kosmix co-founder, Anand Rajaraman, recently spoke at PARC, the prestigious Palo Alto Research Center. Rajaraman’s talk: “What lies beneath: harnessing the deep web.” A video of the hour-long talk is available at the PARC web-site. The slides are available at the Kosmix Blog.

Rajaraman has very impressive credentials. He is also co-founder of the VC firm Cambrian Ventures, he teaches a class for Stanford’s Computer Science department, and he is former Director of Technology at Amazon.com, where:

he was responsible for technology strategy. Anand helped launch the transformation of Amazon.com from a retailer into a retail platform, enabling third-party retailers to sell on Amazon.com’s website. Third-party transactions now account for over 25% of all US transactions, and represent Amazon’s fastest-growing and most profitable business segment.

Rajaraman has his own Kosmix topic page.

Read the rest of this entry »


If you’ve got a half hour to spare, maybe in your car via iTunes, then you might enjoy this blogtalkradio interview at Friday Traffic Report: Exploring the Deep Web.

Friday Traffic Report host Jack Humphrey interviewed Bill Wardell about the deep Web. Wardell’s site, The CyberHood Watch Blog, aims to keep families and especially children safe on the Web.

While I know quite a bit about the deep Web, I enjoyed the conversational style in which a basic introduction was provided. I recommend this interview to those of you new to the concept of the deep Web and to new LIS students.


I’m incubating a white paper about the Deep Web. The Deep Web is all that content (more than 99%) of the web that Google can’t find by crawling, right? It’s all that stuff that lives inside databases and can only be found by filling out forms, right? The main value add of Deep Web search engines is that they find only Deep Web documents, right? Not all that long ago I would have answered “yes” to all these questions. Today I’m confused.

Today I was chatting with Darcy from (blog sponsor) Deep Web Technologies’ marketing department about the white paper. I’ll refer to her as Deep Web Darcy. Well, Deep Web Darcy is asking me some rather “deep” questions about the Deep Web. We discussed harvesting, crawling, indexing, Deep Web searching, and so much more. If someone’s Deep Web content finds its way to Google has that content become surfaced and does that content no longer qualify as buried treasure? If one’s Deep Web content can be harvested, is it not really Deep Web content? If someone is browsing that content in the forest, with only one hand on the keyboard, does that content make a sound? So many koans. So little time. My brain hurts.

Read the rest of this entry »


I’m always on the lookout for academic articles related to federated search or the deep web to review. I’m embarrassed to not have heard about OAIster until Abe turned me on to it.

If you’re also new to OAIster, here’s a snippet from their About page:

OAIster is a union catalog of digital resources. We provide access to these digital resources by “harvesting” their descriptive metadata (records) using OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting). The Open Archives Initiative is not the same thing as the Open Access movement.

The About page goes on to say:

These resources, often hidden from search engine users behind web scripts, are known as the “deep web.” The owners of these resources share them with the world using OAI-PMH.
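To make the harvesting idea a little more concrete, here is a minimal Python sketch of an OAI-PMH `ListRecords` exchange. It is not OAIster's code: the repository URL, the sample response, and the helper names `list_records_url` and `extract_records` are assumptions made up for illustration. Only the `ListRecords` verb, the `oai_dc` metadata prefix, and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint; a real harvester would use an actual base URL.
BASE_URL = "https://repository.example.org/oai"

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up, heavily trimmed response used in place of a live HTTP call.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample digital resource</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL for a ListRecords request against a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def extract_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    return records


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    print(extract_records(SAMPLE_RESPONSE))
```

A real harvester would fetch the URL over HTTP and follow the protocol's resumption tokens to page through large result sets; both steps are left out of this sketch.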

Read the rest of this entry »


Sometimes I write about things that are not quite related to federated search. This is one of those articles. While I am writing about the deep Web, this article is not about the aspect of the deep Web that the federated search community is focused on. But two of the important people in this article are ones I’ve written about before so there is some relevance here if you read on.

I received no fewer than three emails (and a flurry of Google alerts) about Alex Wright’s article in yesterday’s New York Times: Exploring a ‘Deep Web’ That Google Can’t Grasp. I like it when important publications write about the deep Web and help to spread awareness of it.

Read the rest of this entry »


I recently took a much-needed break. I spent a couple of days with a very dear friend in Colorado. On the drive back to Santa Fe, I called Abe to check in. In our discussion Abe told me that there has been a fair amount of buzz in the blogosphere about Google “surfacing” deep Web content. Last April I first wrote about Google’s efforts to crawl the deep Web. A couple of months later I followed up with Why is Google interested in the deep web. Today there’s more to write about.

Yahoo! Tech News published an article on January 30: Google Researcher Targets Web’s Structured Data (PC World). The article’s first paragraph is ominous, unless you believe that Google is regurgitating old news:

Internet search engines have focused largely on crawling text on Web pages, but Google is knee-deep in research about how to analyze and organize structured data, a company scientist said Friday.

Read the rest of this entry »


[ Editor’s note: Darcy Pedersen, Mednar Product Manager for blog sponsor Deep Web Technologies (DWT), shares her enthusiasm about the latest good press that DWT’s new Mednar medical research portal has received. I welcome stories about good press from any federated search vendor. ]

What’s an alternative search engine, you ask? According to, their motto is: “The most wonderful search engines you’ve never seen.” On, you are not only exposed to eloquent reviews on current and new search engines, you get the low-down on up-and-coming technology in the search world. And guess what just poked its head around the corner?

Read the rest of this entry »


Today I wrap up my interview with Erik Selberg, which began with a preview here. Erik answers questions about federated search, about his work at Microsoft and Amazon, and a couple of other questions.

Erik Selberg joins the ranks of federated search luminaries, standing together with Kate Noerr, Todd Miller, and Michael Bergman.

Read the rest of this entry »


Alissa Miller has produced an impressive list of deep web-related resources for the Online College Blog. I’m particularly impressed at how much time Alissa must have spent researching resources for the list.

The list is divided into nine sections:

  1. Meta-Search Engines
  2. Semantic Search Tools and Databases
  3. General Search Engines and Databases
  4. Academic Search Engines and Databases
  5. Scientific Search Engines and Databases
  6. Custom Search Engines
  7. Collaborative Information and Databases
  8. Tips and Strategies
  9. Helpful Articles and Resources for Deep Searching

Read the rest of this entry »