Surfing Web-Caching Technology, Part 2

Ride the wave

As more companies rush into e-business, the demand for Internet bandwidth is growing too fast for the Internet to handle. One key technology that preserves Internet bandwidth is Web caching, which saves frequently requested Web content in a local caching server for users nearby, thus eliminating repetitive Web traffic. Last month, I explored several Web-caching technologies, including browser and network caching, passive and active caching, and transparent and nontransparent caching. I also discussed how you can implement transparent caching with routers and switches. This month, I examine Web-caching scalability, and I introduce several deployment models for using Web caching in different networks. Finally, I discuss how to implement Web caching in Web publishing. The sidebar "Web Caching in Microsoft Proxy Server 2.0," page 108, looks at Microsoft's implementation of Web caching in Proxy Server 2.0.

Cache Cluster and Hierarchy
In some networks, one caching server might not be enough because of the network's size and its performance and availability requirements. Two or more caching servers in such a network can serve more users within the expected response time and provide load balancing and fault tolerance. Several servers that work together to provide efficient and reliable Web caching form a cache cluster. A cache cluster can evenly distribute cached objects among the servers in the cluster. If one server in the cluster fails, the neighboring servers in the cluster continue to provide Web caching for clients. One cluster can also forward a query to a parent cluster if none of the servers in the first cluster have the requested object; this operation establishes a cache hierarchy. Figure 1 shows a logic diagram of a two-layer cache hierarchy. When a user sends a query to Cluster 1, Cluster 1 tries to locate the requested object for the user. If Cluster 1 doesn't have the object, it forwards the user's request to the upstream Cluster 2. If Cluster 2 doesn't have the requested object either, Cluster 2 retrieves the object from the Web server for the user. A cache hierarchy can contain more than two layers. Two major cache cluster and cache hierarchy protocols are Internet Cache Protocol (ICP) and Cache Array Routing Protocol (CARP). Let's take a look at each.
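The two-layer lookup that Figure 1 describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CacheCluster class, the fetch_origin function, and the example.com URL are hypothetical names, each cluster is reduced to a single in-memory store, and I assume that every cluster keeps a copy of what it forwards.

```python
from typing import Callable, Dict, List, Optional


class CacheCluster:
    """Hypothetical stand-in for one cache cluster in a hierarchy."""

    def __init__(self, name: str, parent: Optional["CacheCluster"] = None) -> None:
        self.name = name
        self.parent = parent                # upstream cluster, if any
        self.store: Dict[str, bytes] = {}   # cached objects keyed by URL

    def get(self, url: str, fetch_origin: Callable[[str], bytes]) -> bytes:
        if url in self.store:
            return self.store[url]          # hit in this layer
        if self.parent is not None:
            # Miss: forward the request to the upstream cluster.
            body = self.parent.get(url, fetch_origin)
        else:
            # Top layer: retrieve the object from the Web server.
            body = fetch_origin(url)
        self.store[url] = body              # assumption: keep a copy on the way back
        return body


origin_requests: List[str] = []


def fetch_origin(url: str) -> bytes:
    """Pretend Web server that records every request it receives."""
    origin_requests.append(url)
    return b"content of " + url.encode()


cluster2 = CacheCluster("Cluster 2")
cluster1 = CacheCluster("Cluster 1", parent=cluster2)

url = "http://www.example.com/logo.gif"
first = cluster1.get(url, fetch_origin)    # travels Cluster 1 -> Cluster 2 -> Web server
second = cluster1.get(url, fetch_origin)   # answered by Cluster 1 alone
```

After the two requests, origin_requests holds a single entry, because the second request never leaves Cluster 1.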

The Harvest research project on Internet caching at the University of Southern California developed ICP in 1996. ICP version 2 is documented in Internet Engineering Task Force (IETF) Request for Comments (RFC) 2186 and RFC 2187, both of which are informational documents. ICP works in a straightforward way. When a client requests a Web object, the client's local caching server returns the cached object to the client if the server has the object. If the local caching server doesn't have a local copy of the object, the local server queries each of its neighboring caching servers in the cluster. If a neighbor server has the cached object, it sends the object to the local caching server. The local server saves the copy in its cache and sends the object to the client. If none of the neighbor servers in the cluster have the cached object, the client's local caching server will either forward the query to its upstream parent cache cluster (if it's configured to use a parent cluster) or forward the query directly to the original Web server.

Most caching vendors support ICP in their products. Also, Squid, freeware caching software that derives from the original Harvest project, is available for almost all UNIX OSs and Linux at http://squid.nlanr.net.

ICP has shortcomings. A client's local caching server must query neighbor servers in a cluster and wait for their replies to find out whether its local cluster has the requested object. The querying and replying process takes time and can result in slow client response. In addition, different clients using different caching servers in a cluster might request identical objects. Because a client's local caching server will cache missing objects, the possibility exists that all caching servers in a cluster might eventually store the same cached objects. Such a situation means that the cluster isn't using caching space efficiently.
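A small simulation makes both the ICP lookup and its two shortcomings concrete. The sketch below doesn't speak the real ICP wire protocol; the IcpCache class, the answers_hit method, and the fetch_upstream function are hypothetical, and an ordinary method call stands in for each query and reply.

```python
from typing import Callable, Dict, List


class IcpCache:
    """Hypothetical caching server that asks its neighbors before going upstream."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, bytes] = {}
        self.neighbors: List["IcpCache"] = []
        self.queries_sent = 0

    def answers_hit(self, url: str) -> bool:
        """Stand-in for replying to a neighbor's query."""
        return url in self.store

    def get(self, url: str, fetch_upstream: Callable[[str], bytes]) -> bytes:
        if url in self.store:
            return self.store[url]
        body = None
        for neighbor in self.neighbors:      # every local miss queries every neighbor
            self.queries_sent += 1
            if body is None and neighbor.answers_hit(url):
                body = neighbor.store[url]
        if body is None:
            body = fetch_upstream(url)       # parent cluster or original Web server
        self.store[url] = body               # the local server keeps its own copy
        return body


caches = [IcpCache(f"Cache {i}") for i in (1, 2, 3)]
for cache in caches:
    cache.neighbors = [c for c in caches if c is not cache]

upstream_requests: List[str] = []


def fetch_upstream(url: str) -> bytes:
    upstream_requests.append(url)
    return b"page body"


url = "http://www.example.com/index.html"
for cache in caches:                         # three clients, three different local servers
    cache.get(url, fetch_upstream)

copies = sum(url in cache.store for cache in caches)
total_queries = sum(cache.queries_sent for cache in caches)
```

In this run, only one request goes upstream, but the cluster exchanges six queries and ends up holding three copies of the same object.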

CARP is more efficient and more sophisticated than ICP. Microsoft developed CARP and implemented it in Proxy Server 2.0. CARP uses a deterministic request resolution path to quickly locate and route a request to the caching server that contains the requested object without querying between servers in the cache cluster. (Microsoft refers to cache clusters as cache arrays.) CARP doesn't duplicate cached objects in multiple servers and therefore saves caching space in the cluster or array. CARP uses hash-based functions to reach this optimization.

Here's how CARP works. Suppose a cache array contains three member servers: Cache 1, Cache 2, and Cache 3. CARP generates a hash for each server from the server's name and calculates a hash for each URL from the object label of the URL that the user accesses. For each server, CARP then computes a final hash from the server's hash and the URL's hash. The server with the highest final hash value for a URL caches the content of that URL, which is how CARP spreads objects across the servers in the array. Table 1 is a simplified CARP hash table for the three servers in my example and three URLs. According to the table, Cache 1 serves URL 2, Cache 2 serves URL 3, and Cache 3 serves URL 1.
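The selection rule above (hash the server, hash the URL, combine the two, pick the highest) can be sketched as follows. The hash function here is an assumption: I use a truncated MD5 digest as a stand-in, not the hash and combining functions that the CARP draft defines, and the server names and URLs are illustrative.

```python
import hashlib
from typing import Dict, Sequence


def stand_in_hash(text: str) -> int:
    """64-bit value taken from MD5; a stand-in, not the hash in the CARP draft."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def final_hashes(url: str, servers: Sequence[str]) -> Dict[str, int]:
    """Combine each server's hash with the URL's hash into a final hash."""
    url_hash = stand_in_hash(url)
    return {
        server: stand_in_hash(f"{stand_in_hash(server):016x}{url_hash:016x}")
        for server in servers
    }


def owner(url: str, servers: Sequence[str]) -> str:
    """Return the server with the highest final hash for this URL."""
    scores = final_hashes(url, servers)
    return max(scores, key=lambda server: scores[server])


servers = ["Cache 1", "Cache 2", "Cache 3"]
urls = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://www.example.com/c.html",
]
for url in urls:
    print(url, "->", owner(url, servers))
```

Because every member of the array computes the same values from the same inputs, each one reaches the same answer without asking the others, which is the property that removes the ICP-style queries.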

When you add a new caching server to an array, CARP calculates a final hash for each URL for the new server. CARP then reassigns to the new server only those cached objects for which the new server now has the highest final hash value, which rebalances the load. If an array includes n caching servers, including the new server, about 1/n of the cached objects need relocation. If a server in the array fails, about 1/n of the cached objects will be lost, but the remaining servers will continue to provide caching.
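You can check the 1/n behavior with a quick experiment. The sketch below again assumes a truncated MD5 digest in place of CARP's own hash functions, combines the server name and URL in a single step for brevity, and uses 3000 made-up URLs; it grows a three-server array to four servers and counts how many URLs change owner.

```python
import hashlib
from typing import List, Sequence


def score(server: str, url: str) -> int:
    """Stand-in final hash for a (server, URL) pair; not the CARP draft's function."""
    digest = hashlib.md5(f"{server}|{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(url: str, servers: Sequence[str]) -> str:
    """The server with the highest score caches the URL."""
    return max(servers, key=lambda server: score(server, url))


urls: List[str] = [f"http://www.example.com/object{i}.html" for i in range(3000)]
old_array = ["Cache 1", "Cache 2", "Cache 3"]
new_array = old_array + ["Cache 4"]

# A URL moves only if the new server's score beats its previous winner.
moved = [u for u in urls if owner(u, old_array) != owner(u, new_array)]
moved_fraction = len(moved) / len(urls)
all_moved_to_new = all(owner(u, new_array) == "Cache 4" for u in moved)

print(f"{moved_fraction:.1%} of objects moved; 1/n is {1 / len(new_array):.0%}")
```

Every object that moves lands on the new server, and objects never shuffle among the existing servers, so the rest of the array's cache contents stay valid.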

With CARP's deterministic method, a client's local server can immediately route the client's request to the server that contains the cached object. If the local server doesn't have the cached object in its cache, the server will route the request a maximum of one hop. In this way, CARP produces good response time for users. CARP can also forward a request to a parent cluster or directly to the original Web server.

Although CARP is still an Internet Draft (draft-vinod-carp-v1-03.txt) that Microsoft proposed to the IETF, several vendors have implemented CARP in their caching software. In addition to Proxy Server 2.0, these products include CacheFlow's CacheFlow, InfoLibria's DynaCache, and Netscape's Netscape Proxy Server.

Deploying Web Caching
To deploy Web caching in your network, you can use the techniques I described in Part 1 to decide what kinds of Web caching and which products to use. For example, if you already use Microsoft Proxy Server as your Internet gateway, you can take advantage of that product's built-in caching functions. If you run a large network, you might consider employing transparent caching and a caching appliance such as CacheFlow. If you plan to upgrade your network infrastructure with a well-integrated Web-caching service, you might want to check such products as Lucent Technologies' IPWorX, which uses Web switches for transparent caching.

No rules of thumb exist to help you size a caching server—each vendor's implementation is unique. You need to identify the caching throughput, number of objects served per second, number of concurrent connections, and object hit ratio you expect. Your vendor can then usually recommend an appropriate caching system for you. For example, if you determine that you need 5Mbps throughput, 1000 objects served per second, 5000 simultaneous connections, and a 60 percent hit ratio, a medium-level caching appliance such as CacheFlow 500 (with 18GB of disk space and 384MB of memory) will meet your requirements. So far, I haven't found detailed guidelines from Microsoft for sizing server hardware for Proxy Server, though Microsoft offers some very rough sizing information in the MS Proxy Server 2.0 Data Sheet.

For fault tolerance and load balancing, you need to use a cache cluster or array. Any size network can take advantage of a cache cluster. Figure 2 shows that a small to midsized network can use a cache cluster, such as Microsoft Proxy Server's cache array, in front of its Internet connection. A company with multiple locations (e.g., regional data centers) can host a local cache cluster in each location in addition to the company's central cluster, which connects to the Internet, as Figure 3 shows. In this cluster hierarchy, the cluster in each data center handles local requests first and forwards only unsatisfied requests to the central cluster. If the company's ISP has caching systems in its network, the company can even forward requests to a nearby caching system in the ISP's network, rather than to the Web server. An ISP can deploy caching clusters in its own network, such as one on the link to the Network Access Point (NAP) and one in each Point of Presence (POP). Figure 4, page 110, shows this ISP caching cluster and caching hierarchy architecture.

Caching for Web Publishers
Sometimes Web caching can be a headache for Web content publishers. For example, a publisher can't gather an accurate number of hits if some visitors access Web content in a caching server. If a caching server doesn't update content promptly, it can return expired or stale content to users. However, when Web publishers understand the relationship between Web caching and HTTP header functions, they can design caching-friendly Web sites and reduce their Web servers' workload.

A Web object can carry an HTTP header that tells a browser and a caching server how to cache the object. HTTP 1.0 and 1.1 both provide an expiration header. For a static image, such as a company logo, you can set the expiration header to no expiration, so caching servers can keep the image in the cache indefinitely. If you want to gather the exact number of hits on a specific page (e.g., an advertisement), you can add an object, such as a small image, to the page and set the object to expire immediately, so the caching server won't cache the object. Then, every time a user visits that page, the browser or caching server will retrieve the object from the original Web server, and the Web server can count the exact number of visits. If you update an object at a specific interval, you can set the object to expire at that interval, which tells browsers and caching servers when to fetch a fresh copy from the Web server. You can also set an object to expire at a certain date and time. For example, if a Web page contains a sales promotion that runs until 6:00 p.m. on September 30, 1999, you can set the page to expire on that day and time. Screen 1 shows the expiration header settings in Microsoft Internet Information Server (IIS) 4.0.
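On the wire, the expiration setting is an Expires header that carries an HTTP date. As a minimal sketch, the Python below builds that header for the sales-promotion example and for an object that changes at a fixed interval. The function names are mine, and I assume the 6:00 p.m. deadline is in GMT because the example doesn't state a time zone.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(when: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP Expires header line."""
    return "Expires: " + format_datetime(when.astimezone(timezone.utc), usegmt=True)


def expires_after(served_at: datetime, interval: timedelta) -> str:
    """Expire an object a fixed interval after the server sends it."""
    return expires_header(served_at + interval)


# Assumption: the promotion's 6:00 p.m. deadline is expressed in GMT.
promo_end = datetime(1999, 9, 30, 18, 0, tzinfo=timezone.utc)
print(expires_header(promo_end))    # Expires: Thu, 30 Sep 1999 18:00:00 GMT

# An object that the publisher refreshes every hour.
served_at = datetime(1999, 9, 1, 12, 0, tzinfo=timezone.utc)
print(expires_after(served_at, timedelta(hours=1)))
```

For the hit-counting trick, the same header with a date that isn't in the future marks the object as already expired.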

HTTP 1.1 supports cache-control headers, which HTTP 1.0 doesn't support. Cache-control headers let you define how a caching server and browser handle an object (e.g., only browsers, and not caching servers, can cache the object). HTTP 1.1 also includes validator headers that Web servers and caching servers use to validate an object. For example, if an object has expired, the caching server will use a validator header, such as the last-modified time, to ask the Web server whether the object is still good. If the Web server hasn't changed the object (i.e., the last-modified times in the cached object and the original object match), the caching server doesn't need to download the object again. You can define these advanced headers in an object's custom header settings in IIS 4.0.
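The revalidation exchange can be sketched as a pair of functions: one builds the conditional request that a caching server sends for an expired object, and the other decides how a Web server answers it. The function names are hypothetical, and the logic covers only the last-modified validator. The Cache-Control: private line is the HTTP 1.1 way to say that browsers, but not shared caching servers, may cache an object.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# HTTP 1.1 directive: only a browser's private cache may store the object.
BROWSER_ONLY = "Cache-Control: private"


def if_modified_since(cached_last_modified: datetime) -> str:
    """Header a caching server sends to revalidate an expired object."""
    return "If-Modified-Since: " + format_datetime(cached_last_modified, usegmt=True)


def origin_status(request_header: str, origin_last_modified: datetime) -> int:
    """Return 304 (Not Modified) if the cached copy is still current, else 200."""
    value = request_header.split(":", 1)[1].strip()
    since = parsedate_to_datetime(value)
    return 304 if origin_last_modified <= since else 200


cached_copy_time = datetime(1999, 9, 1, 12, 0, tzinfo=timezone.utc)
header = if_modified_since(cached_copy_time)

unchanged = origin_status(header, cached_copy_time)
changed = origin_status(header, datetime(1999, 9, 2, 9, 30, tzinfo=timezone.utc))
print(header, unchanged, changed)
```

A 304 reply carries no body, so the caching server keeps serving its stored copy and downloads the object again only when the reply is 200.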

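Caching systems won't cache objects that travel over HTTP over Secure Sockets Layer (HTTPS), which encrypts and authenticates the traffic between the browser and the Web server. To keep as much of your site cacheable as possible, apply HTTPS only to Web content that must be secure.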

Another important caching feature is reverse proxy caching, in which a caching server accepts Web requests on behalf of the Web servers behind it, forwards the requests to those servers, and caches their content. For example, you can host a caching server that supports reverse proxy caching outside your Internet firewall for Web servers that sit inside the firewall. Internet users will access your Web servers via the caching server. Reverse proxy caching can improve content-retrieval performance and hide your internal Web servers from the public Internet because only the reverse proxy caching server can directly communicate with your Web servers from outside the firewall.
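As a rough sketch of the idea, the class below maps each public host name to a function that fetches from an internal Web server and keeps the replies in memory. The ReverseProxyCache class, the host name, and the internal_web_server function are all hypothetical, and a real product would also honor the expiration and cache-control headers I described earlier.

```python
from typing import Callable, Dict, List, Tuple


class ReverseProxyCache:
    """Hypothetical reverse proxy that caches replies from internal Web servers."""

    def __init__(self, backends: Dict[str, Callable[[str], bytes]]) -> None:
        self.backends = backends                        # public host name -> internal fetcher
        self.cache: Dict[Tuple[str, str], bytes] = {}   # (host, path) -> cached body

    def handle(self, host: str, path: str) -> bytes:
        key = (host, path)
        if key not in self.cache:
            # Only the proxy ever talks to the internal server.
            self.cache[key] = self.backends[host](path)
        return self.cache[key]


internal_requests: List[str] = []


def internal_web_server(path: str) -> bytes:
    """Pretend Web server inside the firewall that records its workload."""
    internal_requests.append(path)
    return b"<html>" + path.encode() + b"</html>"


proxy = ReverseProxyCache({"www.example.com": internal_web_server})
page1 = proxy.handle("www.example.com", "/products.html")
page2 = proxy.handle("www.example.com", "/products.html")
```

The second request never reaches the internal server, which is where the reduced Web server workload comes from.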

Riding the Wave
Web caching removes unnecessary Web traffic from the Internet and reduces response time for Web surfers and server workload for Web publishers. As more companies conduct e-business over the Internet, Web caching is becoming an important technology that ensures that corporate networks remain up and running. Use Web-caching techniques to design a better network and ride the Web-caching wave.
