[ Editor’s note: This article first appeared in the OSTI Blog. Dr. Walt Warnick, Director of the Office of Scientific and Technical Information, part of DOE, and I co-authored the article. This is a visionary piece. It addresses the question of what it would take for federated search to get to “Google speed.” You may agree with our conclusion. Or, you may not. In either case, we hope you enjoy the ride! ]
Many casual users of federated search criticize the technology for being slow to retrieve results. Serious researchers recognize the unique ability of federated search engines to mine the deep Web for quality science information that Google cannot find. These users recognize that there is no practical alternative to federated search for the best information. Still, everyone wants everything faster, and those users who are willing to trade quality for quickness focus on how federated search doesn’t return results in “Google time.”
OSTI begins to address the speed issue by displaying some results as soon as they are available. However, this approach causes results to be delivered in two sequential sets, which many users find less than ideal.
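To make the two-set delivery concrete, here is a minimal sketch of how a federated search engine might return an early batch of results and then a later one. It is an illustration under stated assumptions, not a description of OSTI's implementation: the source names, the simulated delays, the `query_source` stand-in, and the `first_batch_timeout` parameter are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical sources: name -> simulated response delay in seconds.
SOURCES = {"fast_source_a": 0.01, "fast_source_b": 0.02, "slow_source_c": 0.3}


def query_source(name, delay, query):
    """Stand-in for a real network call to one content provider."""
    time.sleep(delay)
    return f"{name}: result for {query!r}"


def federated_search(query, first_batch_timeout=0.15):
    """Yield two batches: results ready by the deadline, then the rest."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [
            pool.submit(query_source, name, delay, query)
            for name, delay in SOURCES.items()
        ]
        # First batch: whatever has finished when the deadline expires.
        done, pending = wait(futures, timeout=first_batch_timeout)
        yield sorted(f.result() for f in done)
        # Second batch: block until the slower sources respond.
        done_late, _ = wait(pending)
        yield sorted(f.result() for f in done_late)


if __name__ == "__main__":
    for number, batch in enumerate(federated_search("solar cells"), start=1):
        print(f"batch {number}: {batch}")
```

The searcher sees the first batch after a short, fixed wait, but the overall search is still only as fast as the slowest source, which is why the bottlenecks discussed below matter.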
The good news for federated search users is that speed is not an insurmountable issue: technology is closing the speed gap.
The major bottlenecks to lightning-fast federated search performance are related to networks, applications, server hardware, and storage. A systematic program to increase the speed of federated search would begin with a much-needed, serious assessment of the relative size of the bottlenecks. Lacking such an assessment, we consider how each bottleneck can be mitigated. In each case, the inexorable advance of technology is steadily speeding up federated search.
Networks move search result metadata and document full text to searchers. Network latency and overall network speed directly impact response times for searching. Network bottlenecks can appear anywhere in the route from the content provider’s search engine to the user’s browser. Fortunately, networks are getting faster. Network giant Cisco Systems just announced its new CRS-3 Carrier Routing System. Cisco boasts that the new system can download the entire printed collection of the Library of Congress in just over one second, stream every motion picture ever created in less than four minutes, and allow every man, woman, and child in China to simultaneously make a video call. While this technology is brand new and will initially be very expensive, and while the technology is aimed at fast video streaming, this new advancement will raise the network performance bar and improve network speeds over time for everybody.
Federated search applications are also limited in performance by software bottlenecks, in particular the software that drives a content provider’s search interface. The Open Source and commercial software industries have made it possible for content owners to make their documents easy and fast to search via modern XML interfaces. Lucene is a powerful, free, and well-supported search engine. Lucene powers search for Wikipedia and for Monster.com, among other high-traffic sites. Performance has also steadily improved for commercial and Open Source database systems, increasing search speed. Of course, it is up to the content owners to embrace new systems and standards and to migrate to newer and better platforms. Newer and faster technology will be there for them when they are ready.
Server hardware is becoming cheaper and faster, in line with Moore’s Law. Raw processor power, especially in modern multi-processor systems, can frequently overcome performance bottlenecks in software components of a system. Virtualization of servers and cloud computing further allow more flexibility to configure powerful systems that scale to meet user load. Faster and more powerful hardware will drive improved search engine performance from content providers.
Storage, like other hardware, is becoming faster. Faster access to content allows for quicker delivery of search results. In the intermediate term, solid-state disk drives are promising and are becoming more popular. Although currently much more expensive than conventional disk drives, solid-state disks will fall in price over time, and the technology can allow a high-volume search application to access storage very quickly. In the near term, faster storage using traditional electro-mechanical technology is available. The Wikipedia article about hard disk drives emphasizes the point:
“The exponential increases in disk space and data access speeds of HDDs [hard disk drives] have enabled the commercial viability of consumer products that require large storage capacities, such as digital video recorders and digital audio players. In addition, the availability of vast amounts of cheap storage has made viable a variety of web-based services with extraordinary capacity requirements, such as free-of-charge web search, web archiving and video sharing (Google, Internet Archive, YouTube, etc.).”
Faster disks, in conjunction with aggressive caching technology, can allow for very quick delivery of pre-computed search results, either by the content provider or by the federated search engine. Caching involves the storing of search results for faster subsequent retrieval. Rather than repeatedly querying a database for content that is frequently requested yet does not change very often, the database is queried once and the results for that query are placed in the cache. Periodically, the cache is refreshed to ensure currency of information. Caching can be implemented with fast storage to minimize access time. Furthermore, the cache can be managed by sophisticated software that identifies the search results most worthy of being cached, maximizing the cache storage available for popular query results.
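The caching approach described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not any particular product's design: the `QueryCache` class and its `fetch`, `ttl_seconds`, `min_hits`, and `clock` parameters are hypothetical names. Two simplifications are worth noting. Entries are refreshed lazily when they are next requested after expiring, rather than by a periodic background job, and "most worthy of being cached" is approximated by a simple request count.

```python
import time


class QueryCache:
    """Cache results for popular queries and refresh them after a time-to-live."""

    def __init__(self, fetch, ttl_seconds=300.0, min_hits=2, clock=time.monotonic):
        self.fetch = fetch              # function that runs the real (slow) search
        self.ttl_seconds = ttl_seconds  # how long a cached entry stays fresh
        self.min_hits = min_hits        # requests needed before a query is cached
        self.clock = clock              # injectable time source (eases testing)
        self.hits = {}                  # query -> number of times requested
        self.entries = {}               # query -> (results, time stored)

    def search(self, query):
        self.hits[query] = self.hits.get(query, 0) + 1
        cached = self.entries.get(query)
        if cached is not None:
            results, stored_at = cached
            if self.clock() - stored_at < self.ttl_seconds:
                return results          # fresh cache hit: no database query
        results = self.fetch(query)     # miss or stale entry: query the source
        if self.hits[query] >= self.min_hits:
            # Only queries requested often enough earn cache space.
            self.entries[query] = (results, self.clock())
        return results
```

Wrapping a slow search function as `QueryCache(slow_search)` and calling `search("fusion energy")` repeatedly would reach the underlying database on the first two requests and answer from the cache afterwards, until the entry is older than `ttl_seconds`.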
The technology gap is narrowing. Hardware is improving: network, disk, and computer speedups bode well for the future of federated search. Software is improving as well, and caching and deep web surfacing technologies will also help to close the speed gap. With speed improvements taking place on a number of fronts, we can look forward to federated search performance so fast that delays are no longer noticeable. OSTI is doing everything it can afford to hasten the arrival of faster federated search.
Dr. Walt Warnick
Director of OSTI
Consultant to OSTI