[ This article is one of a number of articles that explain concepts basic to federated search. You can find other such articles in the "basics" category. ]

OCLC acquired EZproxy a few weeks ago. For those of you unfamiliar with proxy servers, especially in a federated search environment, I thought it’d be (almost) timely to demystify them.

A proxy server, in the broadest sense, is an application that performs an action on behalf of a user. Consider the role of proxies in the business world. Shareholders in corporations will sometimes designate a proxy to vote on their behalf at a shareholders’ meeting. The proxy represents the interests of a shareholder who is not physically present to vote. On the Internet, proxies may not be casting votes, but they perform a myriad of tasks for users.

There are many kinds of proxy servers. Wikipedia provides an overview of a number of them. In every scenario involving a proxy server there is a user, the proxy server, and (usually) the service being accessed by the user. The proxy stands in the middle, communicating with the user and with the service being requested. A simple kind of proxy is a web proxy. A web proxy can serve a number of functions, but one of its simplest roles is to block access to adult or otherwise unwanted websites. The proxy receives requests from users’ browsers to connect to sites and looks up each requested site in its blacklist. If a match is found, it returns a web page notifying the user that the request has been denied by the proxy. If the site has not been blacklisted, the proxy server retrieves the requested URL and sends that page back to the browser that made the request. In this use of a web proxy, the proxy’s job is to keep the user from reaching inappropriate content.
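The blacklist check described above can be sketched in a few lines of Python. This is only an illustration of the decision the proxy makes, not how any particular product implements it. The host names in `BLACKLIST` and the `fetch` callable are assumptions made up for the example.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist; a real proxy would load this from its configuration.
BLACKLIST = {"blocked.example.com", "adult.example.net"}


def is_blocked(url: str, blacklist: set[str] = BLACKLIST) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blacklisted."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check "a.b.example.com", then "b.example.com", then "example.com", and so on.
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))


def handle_request(url: str, fetch) -> str:
    """Decide what the proxy sends back for one browser request.

    `fetch` is any callable that takes a URL and returns the page body.
    It is passed in so the sketch runs without network access.
    """
    if is_blocked(url):
        return "Access denied by proxy: " + url
    return fetch(url)
```

In a real proxy, `fetch` would perform the HTTP request to the remote site; here it can be any stand-in function.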

In the federated search industry proxy servers are commonly used to provide access to search results and documents from subscription sources. Consider the case of an organization that accesses and federates content from a number of subscription databases. The publishers delivering the content (search results and documents) need to restrict access based on their licensing agreement with the organization. IP-based authentication is a very common way to restrict access; content is only provided to computers whose IP addresses are within a specified range. In other words, access is restricted to users within a network, or collection of networks. This works well for the publishers but how does the organization paying for access to content make it available to remote employees or to employees who are traveling or otherwise temporarily away from the authenticated network? This is where a web proxy serves a critical role.
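To make IP-based authentication concrete, here is a minimal sketch of the check a publisher might run on each incoming request. The address ranges are placeholder values taken from the blocks reserved for documentation, and the names `LICENSED_NETWORKS` and `is_licensed` are invented for this example.

```python
import ipaddress

# Hypothetical ranges agreed on in the license (documentation-only addresses).
LICENSED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_licensed(client_ip: str) -> bool:
    """Return True if the client address falls inside a licensed range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in LICENSED_NETWORKS)
```

A remote employee connects from an address outside these ranges and is refused. A proxy server running inside one of the ranges is accepted, which is why routing the remote user’s request through it works.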

There are two components necessary to use a web proxy in a federated search environment. The first is user authentication. The second is use of the proxy to gain access to restricted content. User authentication is typically accomplished through a publicly accessible web form. The user enters a username and password to verify that he has legitimate access to the restricted content. Once authenticated the user is permitted access to restricted URLs. Typically the user also has to configure his browser with the URL of the proxy server. Once he has done that the proxy server intercepts web page requests that would normally be rejected, retrieves those pages, and presents them to the user. Because the proxy server lives on a network that is allowed access to content that the user on a remote network normally cannot access, it is able to retrieve the restricted content on behalf of the user.

While using a web proxy is very useful there are a number of issues that the organization hosting the proxy server needs to consider that will affect their users’ access to restricted content. Useful Utilities, vendor of the very popular EZproxy web proxy, has an excellent discussion of the issues in their support page and a discussion of how the URL rewrite mechanism solves some problems. The following paragraphs, from the EZproxy support page, identify the issues:

There are three major ways that determine if a browser will use a proxy server: transparent proxying, browser configuration, and URL rewriting.

In some network configurations, a network router may be configured to reroute all web traffic through a proxy server. This has the advantage that no browser configuration is required. When such changes are made without prior announcement, they may also cut off access to specific databases if the proxy server’s IP address has not been given to the remote database vendor.

Users may be required to configure their browsers to use a proxy server. For machines at your institution, you may configure all browsers to use your proxy server for all web access. For remote users, it is common to use an autoconfiguration file, which tells the users’ browsers which web sites should use your proxy server, so that only requests to your database vendors are routed through your proxy server. There are three main problems with using a standard proxy server for remote users: browser configuration directions must be provided for every version of every browser on every operating system you want to support, users may be unable to access your proxy server due to proxy servers at their own sites or firewall restrictions, and users who access databases from multiple institutions must change the browser each time they want to use your proxy server instead of another institution’s proxy server.

URL rewriting proxy servers such as EZproxy require no browser configuration. These proxy servers change the URLs in web pages so that requests for web pages from licensed databases are routed back to the proxy server.
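To illustrate the general idea of URL rewriting, the sketch below rewrites links in a page so that links to a licensed database point back at the proxy. This is not EZproxy’s actual rewriting scheme. The proxy address, the `fetch?url=` form, and the host names are all assumptions for the example, and a production proxy would use a proper HTML parser rather than a regular expression.

```python
import re
from urllib.parse import quote, urlsplit

# Hypothetical values chosen for illustration only.
PROXY_PREFIX = "https://proxy.example.edu/fetch?url="
LICENSED_HOSTS = {"db.vendor.example.com"}

# Matches absolute http(s) links in double-quoted href attributes.
HREF_RE = re.compile(r'href="(https?://[^"]+)"')


def rewrite_links(html: str) -> str:
    """Point links to licensed hosts at the proxy; leave other links alone."""

    def _rewrite(match: re.Match) -> str:
        url = match.group(1)
        if urlsplit(url).hostname in LICENSED_HOSTS:
            # Percent-encode the whole original URL so it survives as a parameter.
            return 'href="' + PROXY_PREFIX + quote(url, safe="") + '"'
        return match.group(0)

    return HREF_RE.sub(_rewrite, html)
```

Because every licensed link in the returned page already points at the proxy, the browser needs no special configuration: following a link sends the request to the proxy.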

Something to consider from the federated search engine vendor’s perspective, though it is transparent to the end user, is that federated search connectors must be configured to use a proxy server to reach restricted content whenever the federated search application lives on a different network from the one that is allowed access to that content. This also means that, while developing or maintaining a connector to restricted content, the federated search vendor may need a username and password for the proxy server, or may need the IP address of the development site to be IP-authenticated so that it can use the proxy server automatically.
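As a rough sketch of what configuring a connector to use a proxy can look like, here is how a connector written in Python might route its requests through a conventional forward proxy with a username and password, using the standard library’s `urllib.request.ProxyHandler`. The host, port, and credentials are placeholders, the helper names are invented, and a URL-rewriting proxy would be addressed differently (typically through a login URL).

```python
import urllib.request
from urllib.parse import quote


def proxy_url(host: str, port: int, username: str, password: str) -> str:
    """Build a proxy URL with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


def build_connector_opener(host: str, port: int, username: str, password: str):
    """Return an opener whose http and https requests go through the proxy."""
    url = proxy_url(host, port, username, password)
    handler = urllib.request.ProxyHandler({"http": url, "https": url})
    return urllib.request.build_opener(handler)
```

A connector would then call `opener.open(...)` on the returned object instead of `urllib.request.urlopen(...)`, so that the publisher sees the proxy’s IP address.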

A more complicated authentication issue that proxy servers handle arises when the subscription service’s search engine sets and reads cookies. The proxy server not only submits URL requests and retrieves web pages on the user’s behalf, it also manages cookies, setting and reading them on the federated search engine’s behalf. This is necessary because, in this arrangement, the federated search application itself never has direct contact with the remote database.
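The cookie bookkeeping can be pictured with a small sketch: the proxy remembers the cookies each remote database sets and replays them on later requests to the same host. The class and method names are invented for this example, and a real proxy would also honor cookie attributes such as path and expiry, which this sketch ignores.

```python
from http.cookies import SimpleCookie


class SessionCookies:
    """Cookies a proxy holds for one user session, grouped by remote host."""

    def __init__(self) -> None:
        self._by_host: dict[str, dict[str, str]] = {}

    def store(self, host: str, set_cookie_header: str) -> None:
        """Record the cookie carried by one Set-Cookie response header."""
        parsed = SimpleCookie()
        parsed.load(set_cookie_header)
        jar = self._by_host.setdefault(host, {})
        for name, morsel in parsed.items():
            jar[name] = morsel.value

    def cookie_header(self, host: str) -> str:
        """Build the Cookie request header to send back to that host."""
        jar = self._by_host.get(host, {})
        return "; ".join(f"{name}={value}" for name, value in jar.items())
```

The proxy would call `store` for each `Set-Cookie` header it receives from a database and attach the result of `cookie_header` to its next request to that database.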

An interesting use of proxy servers in a search environment is in the case of wanting to search anonymously. A member of the intelligence community has a strong interest in ensuring that his search requests cannot be traced back to him. People concerned with their online privacy share the same concern. While a search engine user cannot ensure that the search engine is not logging his search request terms and IP address, by using an anonymizing web proxy he can ensure that only the IP address of the web proxy is logged.

MuseGlobal is a federated search vendor that provides its own web proxy. A press release from 2003 gives a nice explanation of how their proxy server works. I hope the information is current but here’s a relevant piece of that press release:

MuseGlobal simplifies administration by including the Muse Proxy Server with this re-writing filter component. The user’s search is sent to MuseSearch and the search is performed as usual. Returned results have their links automatically re-written so they include the Muse proxy in the loop when retrieving the full text of documents. In this way authentication is maintained and the user is not required to individually logon to the content provider servers.

If you are technically savvy and looking for a non-commercial web proxy solution, you may want to look at the Apache httpd server, which can be configured as a web proxy. Wikipedia also provides links to a number of software solutions for all types of proxies.

I hope this article has given you an appreciation for the important roles that the humble web proxy serves in the federated search environment.


This entry was posted on Saturday, February 2nd, 2008 at 6:55 pm and is filed under basics.

One Response to "Proxy servers and federated search"

  1. Jonathan Rochkind
    February 4th, 2008 at 4:04 pm  

    It’s worth pointing out that any federated search product will act as a proxy server itself, too.

    In the sense that when an end user executes a search with the federated search product, it is the federated search software on its server that contacts the actual databases to be searched, not the end user’s computer or browser.

    Of course, federated search is not just a simple ‘transparent proxy’; it is not just ferrying content unaltered between the user and the content/search provider. But it’s still a proxy of a sort, in between the end user and the content/search providers!
