D. Li
Cisco
P. Cao
Cisco
M. Dahlin
Univ of Texas

Internet Draft
Document: draft-danli-wrec-wcip-01.txt
March 2001
Category: Experimental

WCIP: Web Cache Invalidation Protocol

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

Cache consistency is a major impediment to scalable content delivery, because periodically revalidating objects one by one is unacceptable in terms of performance and/or cache consistency. This document describes the Web Cache Invalidation Protocol (WCIP). WCIP uses invalidations and updates to keep changing objects up to date in web caches. It thus enables proxy caching and content distribution of large amounts of frequently changing web objects.

WCIP runs between the invalidation server, the participating web caches, and channel relay points (if any). An invalidation server maintains one or more invalidation channels, each of which covers a class of related objects, called an "object volume". E.g., the CNNfn channel may cover an object volume with the day's top financial news and stock quotes. Web caches subscribe to channel(s) they are interested in, while the invalidation server(s) send out invalidations and/or up-to-date objects to the channel(s).
Besides server-driven invalidation, WCIP also supports client-driven validation of object volumes.

Li & Cao & Dahlin Experimental - September 2001 1 Draft-danli-wrec-wcip-01.txt March 2001

WCIP employs heartbeats to guarantee the freshness of the cached objects even under network or server failure. Moreover, WCIP can set up channel relay points via a cache hierarchy or a CDN (content delivery network). A channel relay point performs channel relay (one-to-many) and connection aggregation (many-to-one).

Revision Log

1. Introduce the concept of "object volume" to clear up the common confusion between the transport (invalidation channel) and the unit of consistency (object volume).
2. Remove "targeted service" (for now) as it significantly complicates the protocol without obvious benefit, given that "channel" is already a form of coarse filtering.
3. Describe "client-driven volume validation" as an operation mode as opposed to a special case of channel registration with infinite heartbeat interval.
4. Specify the "channel abstraction". Describe an HTTP-based channel implementation. Other implementations, e.g., Beep, IP multicast, can be future work.
5. Add the protocol state machine.

Table of Contents

1. Introduction
2. Terminology
3. Design Issues
   3.1 Freshness Guarantee
   3.2 Object Volume
   3.3 Channel Abstraction
4. Deployment Issues
   4.1 Channel Relay
   4.2 Detect Changes
   4.3 Discover Channels
   4.4 Join Channels
5. Protocol Specification
   5.1 Object Volume DTD
   5.2 Client-initiated Volume Synchronization
   5.3 Server-initiated Volume Synchronization
   5.4 Serving Content
6. Protocol State Machine
   6.1 Client State Machine
   6.2 Server State Machine
7. Security Concerns
8. References
9. Acknowledgments
10. Authors' Addresses

1. Introduction

In web proxy caching, a document is downloaded once from a web server to a caching proxy, which then serves the document to end-users repeatedly out of the cache. This takes load off the web server, improves the response time to the users, and reduces the bandwidth consumption. When the document seldom changes, everything works out wonderfully. However, the hard part is when the document is popular but also frequently changing.

Frequently changing content is quickly becoming a significant percentage of Web traffic, e.g., news and stock quotes, shopping catalog and prices, product inventory and orders, etc. Because the content is changing, the caching proxy has to frequently poll the web server for a fresh copy and still tends to return stale data to end-users. Specifically, a proxy using "adaptive TTL" is unable to ensure strong cache consistency, and yet "poll every time" is costly [1]. So a content provider usually sets a very short expiration time or marks frequently changing documents as non-cacheable altogether. This defeats the benefit of caching, even though those objects could be cached if the proxy knew when the document becomes obsolete [2]. Moreover, if the proxy can be informed of the change to the underlying data that a web object is generated from, the proxy can re-generate the web object on its own, making it possible to distribute "dynamically computed content".

Addressing this problem, WCIP (Web Cache Invalidation Protocol) provides freshness guarantees to content providers while keeping the cost of doing so low. Using WCIP, a web server can advertise to caching proxies an "object volume" and the corresponding invalidation channel, identified by a URI.
To provide freshness guarantees to objects in the object volume, a caching proxy subscribes to the invalidation channel and obtains an up-to-date view of the object volume -- a process referred to as "volume synchronization". After the initial volume synchronization, to stay synchronized, the invalidation channel operates in either the server-driven mode or the client-driven mode (or a mix of both).

In the server-driven mode, the invalidation server sends invalidations to the channel whenever changes happen to the volume, while the proxy listens passively. The server also generates heartbeats so that the freshness guarantees can be met even upon network partition or server crash. The heartbeat interval is determined by the freshness guarantees required for the object volume.

In the client-driven mode, the invalidation server doesn't proactively send updates to the channel. The caching proxy periodically initiates "volume synchronization" to revalidate the volume, at which time the invalidation server returns all the updates made to the volume since the last time the proxy validated the volume. The revalidation interval is determined by the freshness guarantees required for the object volume.

The two modes are merely the two extremes of a continuum, characterized by how soon the server proactively sends updates/heartbeats and how soon the proxy revalidates the volume. The sooner the revalidation, the quicker objects are invalidated; this results in better consistency but also more load on the server and proxy.

Regardless of the mode, the same messages are exchanged between the invalidation server and the caching proxies; their format is defined by the "ObjectVolume" XML DTD in Section 5.1. Each round of message exchange, whether initiated by the server or the client, is a process of "volume synchronization" and results in an up-to-date view of the object volume.
Based on the up-to-date view, the proxy can provide freshness guarantees to all the objects in the volume.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [3].

Since WCIP makes some extensions to HTTP, please refer to RFC-2616 [4] for HTTP-related terminology. The following are WCIP-related terms.

Cache Consistency
   A property that the replica data item reflects its master copy in a certain fashion. There are at least 3 fashions. (1) Strong consistency -- the replica must always be the same as the master. (2) Delta consistency -- the replica must become the same as the master at most "delta" seconds after the master is updated. (3) Eventual consistency -- the replica must become the same as the master at some unknown point in the future. WCIP provides "delta consistency" where "delta" is the freshness guarantee.
Freshness Guarantee
   A promise that the invalidation client will not serve content (belonging to the object volume) from the cache more than X seconds after a known or presumed update at the origin server, where X is the freshness guarantee and is specified by the content provider. In other words, the invalidation client never delivers cached content that is more than X seconds stale, regardless of network partition, proxy failure, or server failure. A freshness guarantee provides "delta consistency" and also allows "eventual consistency" (i.e., when X is infinite).
Object Volume
   A set of correlated web objects, their consistency state, and their freshness guarantee. The object volume is employed as the unit of consistency as well as the unit of filtering. A consistent view of the object volume implies a consistent view of every object in the object volume. Also called "volume".
Volume Synchronization
   The act of bringing the object volume at the invalidation client up to date with that at the invalidation server. Either the server or the client can initiate volume synchronization. After the volume synchronization, the two mutually agree on the consistency states of objects in the volume (for the duration of the "freshness guarantee").
Last Synchronization Time
   The time of the last volume synchronization.
Invalidation Channel
   A transport abstraction that carries messages between the invalidation server and the invalidation client(s) for the purpose of volume synchronization. Also called "channel".
Invalidation Server
   An application program that provides WCIP services to caching proxies. It maintains the master copy of the object volume and disseminates the volume and changes to the volume to caching proxies. (The invalidation server logically differs from the origin server because a cache may fill a request from a CDN content server or a replica origin server. The cache may not be able to tell these various sources from the origin server. The WCIP service may not reside on each or any of them. "Invalidation server" uniquely identifies the source of the WCIP service.)
Invalidation Client
   A web cache, usually a caching proxy, which subscribes to the invalidation channel and maintains a consistent view of the object volume. Also referred to as the "proxy".
Server-driven mode
   An operation mode where the invalidation server proactively sends changes made to the object volume, as well as heartbeats, to invalidation clients via the invalidation channel, for volume synchronization.
Client-driven mode
   An operation mode where the invalidation client periodically queries the invalidation server via the invalidation channel, for volume synchronization. The server replies with the changes made to the object volume since the last time the client asked.
Channel Address
   Information that a caching proxy needs in order to access the channel, e.g., the name of the channel, the address (domain name) of the invalidation server, the security mode, etc.
Channel Relay
   An intermediary program that subscribes to one or multiple invalidation channels on behalf of its clients (e.g., downstream proxies) and relays the channel messages to its clients. It MUST implement both the invalidation server and the invalidation client.
Heartbeat
   A periodic message sent by the invalidation server to keep the channel from being silent for too long. It allows the invalidation client to verify the channel connectivity and source liveness, so as to confirm that the volume remains synchronized.
Heartbeat Interval
   A property of the server-driven mode. The invalidation server sends a heartbeat to the invalidation channel if the channel has been silent for the last heartbeat interval. The interval MUST be smaller than the freshness guarantees of the objects in the object volume, or the object volume may lose synchronization.
Revalidation Interval
   A property of the client-driven mode. The invalidation client initiates volume synchronization with the invalidation server when the "last synchronization time" was "revalidation interval" ago. The interval SHOULD be smaller than the freshness guarantees of all the objects in the object volume, to avoid unnecessary cache misses.
Invalidation Latency
   The time from when an object is updated at the origin server to when the old copy is treated as stale at all the participating proxies. The goal of a freshness guarantee of X seconds is to guarantee that the invalidation latency is within X seconds at all times.
Content Delivery Network (CDN)
   A self-organizing network of geographically distributed content delivery nodes (reverse proxies) for contracted content providers, capable of directing requests to the best delivery node for global load balancing and best client response time.

3. The Design

Before the specifics, here are some design principles this protocol tries to follow:

(1) Simple and effective: try to design a lightweight client and leave complexity to the server, then use multicast and channel relay points to address the server scalability. Also, try to leverage off-the-shelf components as much as possible. Examples include HTTP, SSL, XML, Beep, etc.
(2) Logical separation of the invalidation server and the origin server: this is because WCIP needs to work with CDNs and distributed data centers. There may be multiple authoritative sources of an object. "Invalidation server" uniquely identifies where the invalidation source is, not where the content is initially fetched from. It also allows for delegation of the invalidation service to a 3rd party, possibly a CDN provider.
(3) Clear separation of the notification transport and the notification semantics: WCIP includes a transport abstraction (invalidation channel) and then the cache consistency semantics (object volume). This separation makes the protocol clearly layered, much more understandable, and extensible. Moreover, the message body is specified in XML, making the protocol extensible to other types of notifications.

3.1 Freshness Guarantee

WCIP provides reliable invalidations and consistency guarantees so that content providers can make their frequently changing content cacheable.
It's important that WCIP guarantees that, in the worst case, a proxy subscribed to an invalidation channel will not serve stale content more than X seconds after the content is updated at the origin server, regardless of network partition or server failure. The content provider can specify the value of X, e.g., 5 minutes.

In the normal case, this is not hard. Using WCIP, the proxy stops delivering a stale object as soon as an invalidation arrives from the server. The invalidation latency only depends on network propagation and queuing delay, which are typically within a second.

In other cases, however, when the network or the invalidation server is down, invalidations cannot reach the proxy in a timely fashion. To ensure an upper bound on the invalidation latency, the proxy MUST invalidate content automatically if it hasn't been able to synchronize the object volume for a certain period of time, assuming the server or network may be down and the volume may have changed.

Therefore, to control the freshness, the content provider specifies a "freshness guarantee" for each object in the volume, while the caching proxy keeps track of the "last synchronization time". Then, upon serving a client HTTP request, the proxy MAY use the cached object only if the time elapsed since the last synchronization time is less than the object's freshness guarantee. Otherwise, the cached object is marked as stale and MUST NOT be served from the cache without HTTP revalidation. The proxy is RECOMMENDED not to remove the object right away, as HTTP revalidation could result in an indication that the object is "Not Modified".

To prevent unnecessary cache misses during normal operation, the "last synchronization time" needs to be kept within the freshness guarantees.
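
The proxy-side serving rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the protocol: the names CachedObject, VolumeView, Decision, and decide are hypothetical, and all times are plain seconds.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SERVE_FROM_CACHE = "serve"
    REVALIDATE = "revalidate"   # stale: not served without HTTP revalidation


@dataclass
class CachedObject:
    url: str
    freshness_guarantee: float  # seconds, specified by the content provider
    invalidated: bool = False   # set when an invalidation for the object arrives


@dataclass
class VolumeView:
    last_sync_time: float       # time of the last volume synchronization


def decide(obj: CachedObject, volume: VolumeView, now: float) -> Decision:
    """Apply the freshness rule when serving a client HTTP request."""
    elapsed = now - volume.last_sync_time
    if not obj.invalidated and elapsed < obj.freshness_guarantee:
        return Decision.SERVE_FROM_CACHE
    # The object stays in the cache: revalidation may answer "Not Modified".
    return Decision.REVALIDATE
```

With a freshness guarantee of 300 seconds and a last synchronization at t=1000, a request at t=1200 is served from the cache, while a request at t=1300 or later triggers HTTP revalidation because the elapsed time is no longer less than the guarantee.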
Hence, in the server-driven mode, the invalidation server sends heartbeats whenever the channel has been silent for the last "heartbeat interval", so as to confirm to the proxies that the volume hasn't changed. Similarly, in the client-driven mode, for every "revalidation interval", the proxy queries the invalidation server to make sure it holds the up-to-date copy of the volume.

The invalidation server picks the heartbeat interval while the invalidation client picks the revalidation interval. Both of them SHOULD be smaller than any of the freshness guarantees of the objects in the volume, to avoid unnecessary cache misses. Moreover, the invalidation server SHOULD send invalidations "reasonably" soon after it learns of an object change, but it MAY delay the synchronization until some time before the subsequent heartbeat. Such a strategy allows the server to batch multiple changes into one update, without inducing unnecessary cache misses.

In essence, there are two consistency concepts, the average-case staleness and the worst-case staleness:

(1) The worst-case staleness is bounded by the freshness guarantee and enforced by the proxy not using a cached object if the time elapsed since the last volume synchronization time is more than the object's freshness guarantee.
(2) The average-case staleness is controlled by the heartbeat interval and the revalidation interval. The more aggressively the server sends invalidations or the proxy revalidates the volume, the better the average-case staleness that can be achieved.

At one extreme, the server sends invalidations immediately when an object is modified. Then the average-case staleness (for clients that are reachable) is on the order of server queue delays plus network delays (typically only a few seconds). At the other extreme, the server doesn't bother with pushing invalidations. Then the average case equals the worst case. Middle ground is feasible.
For example, a server can batch invalidations: every 30 seconds, it sends the clients a list of the invalidations that have happened in the last 30 seconds.

While WCIP uses the freshness guarantee to provide "delta consistency", it also supports a more relaxed form of consistency -- "eventual consistency". This applies when the freshness guarantee is set much larger than the typical object modification interval, or even to infinity. Then WCIP is similar to best-effort invalidation delivery and is subject to network and server failures. As long as the heartbeat interval or revalidation interval is not infinite, the caching proxies achieve eventual consistency.

3.2 Object Volume

An "object volume" is a set of correlated objects that are updated by the same invalidation channel. E.g., an "eBay auction" channel contains the most active auction pages. A "SFO flight schedule" channel contains pages describing various airline flights that are departing from or arriving at SFO. "Object volume" serves two purposes, as the unit of filtering and as the unit of invalidation.

Using volume as the unit of filtering, caching proxies may subscribe to updates for certain object volumes but not others, based on the interests of the population they are serving. The strategy for forming volumes is to group "correlated" objects into one "volume". This way, if a caching proxy is interested in some objects in the volume, it's highly likely that it is or will be interested in the other objects in the volume as well. Examples include CNNfn, ESPN, NBA, etc., similar to TV programming. An object volume of reasonable size and correlation helps to reduce unwanted invalidations and amortize the channel cost [5][6][7].

Object volume also serves as the unit of consistency.
If the caching proxy obtains the up-to-date view of the volume, it follows that the caching proxy has the up-to-date view of every object in the volume (in terms of Last-Modified time and ETag). This allows one "volume synchronization" exchange to (in)validate all the objects in the volume, greatly improving efficiency compared to per-object HTTP validation.

Moreover, when a web site updates its content, often it would like to preserve a consistent view of the site. I.e., it would like the end-users to see either entirely the new content or entirely the old content, not a bit of the new on some web pages and a bit of the old on other pages. By grouping these correlated web pages into one object volume, one can atomically invalidate the entire volume and thus preserve the coherent view.

An object volume is described in XML (see section 5 for the DTD). In essence, it is a collection of object meta-data and can be retrieved incrementally based on its version.

Whenever the caching proxy subscribes to an invalidation channel, the first thing it does is to synchronize the object volume with the server. Before synchronization, the proxy knows nothing about the volume and cannot cache objects that are non-cacheable according to HTTP Cache-Control directives. After synchronization, the proxy knows what objects are covered by the volume, whether its local cache copies are stale or not, and each object's freshness guarantee. The proxy SHOULD ignore the normal HTTP Cache-Control directives for these objects, such as no-store, expires, and max-age. But it SHOULD still honor directives such as "private". [Note: thorough specification on HTTP Cache-Control is needed.]

Once synchronized, volume re-synchronization does not need to return the entire object volume again.
Depending on the last version the proxy synchronized, the server sends the list of changes made to the volume since the last version, which can be substantially smaller than the entire volume. This facilitates quick re-synchronization.

3.3 Channel Abstraction

An invalidation channel may be implemented in many different ways, e.g., using HTTP, Beep, or IP multicast. It's out of the scope of WCIP to design this transport layer. However, specified here is the channel abstraction that a specific implementation ought to provide:

(1) Naming: channels are named as URIs. For interoperability, the channel name MUST indicate the transport implementation. E.g., "wcip://my.net:80/channel/name?proto=http" denotes a channel carried on top of HTTP. A channel implementation MUST be able to translate the channel URI into the addressing information that the implementation is using.
(2) Subscription: provide an interface for channel subscription based on the channel URI, as well as for notifying the upper layer whenever the subscription terminates unexpectedly. Once subscribed, the caching proxy can start to send to and receive from the channel.
(3) Framing: channel messages are self-describing and well-formed XML text. Each "send" and "recv" by the invalidation server or client handles one entire XML message.
(4) Delivery: delivery SHOULD be real-time in that the average latency should be comparable to the network round-trip time from the sender to the receiver. It's RECOMMENDED that the delivery be reliable, full duplex, and in sequence (with respect to the sender) to achieve good performance, although it's not required.
(5) Security: a channel can be configured as clear text, signed for integrity, or encrypted for secrecy. Channel subscription can be either open or authenticated.
(6) Scalability: help to ensure that the invalidation server doesn't become overwhelmed by excessive load, by providing either IP multicast (later in this section) or channel relay (see section 4.1).
(7) Environment: be able to operate across wide-area networks and across administrative domains (i.e., firewalls). Some may be multicast-capable and some may not.

An implementation on top of HTTP (RFC-2616) is as follows:

(1) Naming: An HTTP-based channel is denoted as "wcip://host:port/a/hierarchical/name?proto=http".
(2) Framing: messages are sent as HTTP POST requests with the request body being the message. The request URI is the channel URI. The HTTP response carries the message response (if any).
(3) Subscription: a persistent connection is established to the host and port as specified in the channel URI. If multiple channels have the same server address and port, they can share the same persistent connection. Tear-down of a persistent connection and establishment of a new connection represents a possible loss of synchronization and MUST trigger a volume synchronization.
(4) Reliability: HTTP runs on top of TCP and so is reliable. All server-driven messages are sent on the persistent connection in first-come-first-served order. Pipelining may be used to improve latency and throughput.
(5) Security: use HTTPS when the channel needs to be secure. The URI is "wcip://host:port/a/hierarchical/name?proto=https".
(6) Scalability: when there are too many HTTP connections to the server, the server can instruct the channel implementation to use the HTTP Location header to redirect new connections to a multicast channel or a channel relay point, which can be chosen by static configuration or CDN routing.
(7) Environment: to cross administrative domains, the channel must be on a port allowed by the firewalls. If port 80 is used, it's possible that the traffic will be intercepted by a transparent proxy that doesn't understand WCIP. Depending on its configuration, the transparent proxy may or may not pass on the traffic without interference.
If it doesn't, either the transparent proxy must be reconfigured or a port number other than 80 must be used.

Besides HTTP, channels may also be implemented using, e.g., Beep [8] or PGM [9]. Their channel URIs may be:

   wcip://my.net:80/channel-name?proto=beep&security=tls
   wcip://my.net:80/channel-name?proto=pgm&group=239.1.1.1

[Note: additional work is needed to fully specify them.]

Both HTTP and Beep are unicast, which has scalability limitations. E.g., suppose a machine is capable of 20000 concurrent persistent connections. Then that machine, acting as an invalidation server, can support at most 20000 simultaneously active invalidation clients. Moreover, if it takes 1ms to send out a message, then sending one invalidation to all 20000 clients takes 20 seconds, so the invalidation latency is at least 20 seconds, even under the best network conditions.

An IP-multicast-based channel implementation avoids this scalability problem. An IP multicast group is allocated for the invalidation channel and its address is advertised as part of the channel information. For any update to the object volume, the invalidation server only needs to send one copy, to the multicast group. The invalidation client subscribes to this multicast group to receive the updates. Because the object volume has version numbers, WCIP may not have to run on top of a reliable multicast protocol.

In the absence of IP multicast, a unicast-based channel implementation may employ channel relays to improve scalability, which is the topic of the next section.

4. Deployment Issues

4.1 Channel Relay

An invalidation channel may have tens of thousands of invalidation clients. Channel relay points can improve the scalability of a unicast-based channel. Instead of subscribing directly to the origin invalidation server, some invalidation clients are redirected to a channel relay point. A channel relay point can perform one-to-many channel relay and many-to-one connection aggregation.
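
The unicast estimate from Section 3.3, and the effect of adding one level of relay points, can be worked through numerically. The sketch below is a back-of-the-envelope model rather than anything the protocol specifies: it assumes every node sends its messages one after another at a fixed per-message cost, that a relay point forwards only after it has received the message, and that network delay is negligible. The function names and the choice of 100 relay points are hypothetical.

```python
import math


def direct_fanout_seconds(clients: int, send_cost_s: float) -> float:
    """Time for one server to send one message to every client in turn."""
    return clients * send_cost_s


def relayed_fanout_seconds(clients: int, relays: int, send_cost_s: float) -> float:
    """Worst-case time with one level of relay points.

    The server sends to each relay in turn; the last relay to be reached then
    sends to its share of the clients in turn.
    """
    clients_per_relay = math.ceil(clients / relays)
    return (relays + clients_per_relay) * send_cost_s


# Figures from Section 3.3: 20000 clients, 1 ms per message.
direct = direct_fanout_seconds(20000, 0.001)
relayed = relayed_fanout_seconds(20000, 100, 0.001)
print(f"direct: {direct:.1f} s, via 100 relay points: {relayed:.1f} s")
```

Under these assumptions the direct fan-out takes about 20 seconds, while 100 relay points serving 200 clients each bring the worst case down to roughly 0.3 seconds.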
(1) Channel Relay

The channel relay point may have multiple clients subscribed to the same invalidation channel. It in turn only subscribes once to the original invalidation server. By hierarchically relaying channel messages, it reduces the load on the invalidation server and helps to scale the invalidation channel end-to-end.

            Invalidation Server
                     |
                     |  conn0
                     |
                     |
            Channel Relay Point
                /    |    \
               /     |     \
       conn1  / conn2|      \  conn3
             /       |       \
            /        |        \
       Client1    Client2    Client3

A "dumb" relay point copies all messages from connection "conn0" to "conn1", "conn2" and "conn3", and vice versa. A "smart" relay point also constructs the up-to-date view of the volume as well as the journal of changes to the volume, based on the messages it receives from the invalidation server. Then, it can respond to client-driven volume synchronization requests, instead of forwarding the requests all the way to the invalidation server.

(2) Connection Aggregation

A relay point supports not only multiple clients but also multiple channels. Connection aggregation reduces the number of TCP connections the invalidation client and the relay point have to maintain. See the example below.

       Server1    Server2    Server3
            \        |        /
       conn1 \       |conn2  / conn3
              \      |      /
               \     |     /
            Channel Relay Point
                /    |    \
               /     |     \
       conn4  / conn5|      \  conn6
             /       |       \
            /        |        \
       Client1    Client2    Client3

Without a relay point, a client would have to establish 3 connections to 3 different invalidation servers. Now that all 3 channels are redirected to the same relay point, the client only needs to establish 1 TCP connection, to the relay point, which in turn subscribes to the 3 invalidation servers. This reduces the client's TCP overhead and allows the client to support more channels, as well as reducing the overhead on the invalidation servers and the relay point.

A channel relay point can be set up via a cache hierarchy or a CDN.
Specifically, an invalidation client can discover and then connect to the relay point in one of the following ways.

(1) The origin web server or replica origin web server, being part of a CDN, returns a channel URI with the relay point as the hostname.
(2) The relay point, being a configured outgoing proxy to a potential invalidation client, intercepts and replaces the channel URI in the HTTP response with its own information.
(3) When the invalidation client does a DNS name lookup of the invalidation server hostname, the DNS server of a CDN returns the IP address of a local channel relay point.
(4) When the invalidation client connects to the invalidation server, the invalidation server replies with a redirect message pointing to a channel offered by the relay point.

4.2 Detect Changes

Detecting changes is the job of the origin server and/or invalidation server. Web content may change because of updates from the content owner or updates from the content viewer. E.g., the content owner CNN.com updates its front page every 15 minutes, while eBay updates its content whenever its customers post new auction items or bids. Therefore, changes may be detected in 4 ways.

(1) When the script that generates content and updates the web source file runs (e.g., a news article is updated with the latest financial information), the script notifies the invalidation server, which then sends out invalidations or delta-encoded [10] updates to all participating caches.
(2) When a piece of data in the database is modified via the database interface (e.g., an addition to the inventory of books), a database trigger notifies the invalidation server of the event.
(3) When an HTTP request comes in (e.g., a POST request to add a new auction item), the origin server or its surrogate (reverse proxy) notifies the invalidation server of the event.
(4) The last and simplest way is for the invalidation server to poll the origin server periodically to find out whether the object has changed. Because only one invalidation server is polling, the polling frequency can be very high, e.g., once every minute, which also offers decent cache consistency.

In some cases, an event described above may invalidate multiple URLs. E.g., a database event may trigger the invalidation of hundreds of objects. Instead of listing all those objects and sending the list to the proxies, the server may describe the event itself to the proxies, provided that the proxies know how to interpret the event and figure out which objects have become stale. Integrating such functionality may be future work.

There is software providing user-level notification of changes to web content, e.g., the AIDE system [11]. WCIP could potentially be used to let agents subscribe to change notification, not for the purpose of cache invalidation, but to notify users. E.g., a web crawler could subscribe to WCIP channels instead of periodically crawling web sites for object updates.

4.3 Discover Channels

A caching proxy learns about an invalidation channel in one of three ways: (1) configured by the proxy's administrator, (2) configured by the CDN that controls the proxy, or (3) obtained from the HTTP response when fetching a web object.

Specified here is method 3: in a normal HTTP request-and-response exchange, the caching proxy obtains the channel address from the HTTP entity headers "Invalidated-By" and "Channel-Object".

   Invalidated-By = "Invalidated-By" ":" Channel-URI
   Channel-URI = "wcip:" "//" host ":" port "/" channel-name "?"
                 query
   channel-name = token

Example:

   Invalidated-By: wcip://www.cnn.com:777/allpolitics?proto=http

4.4 Join Channels

The decision to join a channel can be (1) instructed by the proxy's administrator, (2) instructed by the CDN that the proxy is part of, or (3) made dynamically. It is not the job of this protocol to specify the decision algorithms, but some common-sense ones exist. E.g., join a channel when the proxy has cached M objects belonging to that channel, or when the proxy has received N requests for objects belonging to that channel. The proxy's administrator can configure M and N.

Moreover, the proxy can employ a heuristic [12]: consider an object for WCIP service only if (1) it is cached and (2) a subsequent request does use the cached copy without discovering that it has expired or been modified. This heuristic avoids objects that either are not very popular or are modified more frequently than they are accessed, despite being cached in the meantime. This guideline can be applied when calculating M and N.

5. Protocol Specification

This section lays out the message syntax and sequences. Section 6 has the complete rule set (state machine) for the behavior of the server and the client.

Following is a brief description of the WCIP protocol in the most common and simple case:

1) In a normal HTTP request-and-response exchange, a caching proxy obtains invalidation channel information from the HTTP response header "Invalidated-By", returned by the origin server or its surrogate.

2) To join the channel, the caching proxy establishes a persistent HTTP connection with the invalidation server, assuming the channel implementation is based on HTTP.

3) Immediately following connection set-up, the proxy MUST initiate one round of volume synchronization (see section 5.2) to obtain an up-to-date view of the ObjectVolume, and hence the up-to-date view of all the objects in it.
4) After the initial round, the invalidation server MAY initiate volume synchronization when updates are made to the volume or when the channel has been silent for "heartbeat interval" time (see section 5.3).

5) Whenever the proxy notices that the "last synchronization time" is more than "revalidation interval" ago, the proxy MUST initiate a round of volume synchronization.

6) When serving content, the proxy MUST NOT use a cached object if the cached object is marked as stale or the "last synchronization time" is more than the object's "freshness guarantee" time ago. Instead, the proxy MUST perform HTTP revalidation with the origin server before serving the object (see section 5.4).

7) The proxy or the invalidation server MAY terminate communication at any time by closing the connection. The proxy then reverts to HTTP Cache-Control.

5.1 Object Volume DTD

A description of the object volume contains (1) the volume's own information, e.g., its version, date, invalidation channel, Last-Modified time, Etag, etc.; and (2) the volume composition, which enumerates the objects that belong to the volume and their consistency state.

An ObjectVolume MAY list not only objects but also directories. For example, an ObjectVolume entry with uri = "http://www.cnn.com/allpolitics/" represents all the web objects that share this URI prefix. Given a web object, longest prefix match is used to identify the applicable entry in the ObjectVolume. If the matching entry's URI is a filename, Etag and Last-Modified time (if available) SHOULD be used to determine object freshness. If the Etag and Last-Modified time are not available, or if the matching entry's URI is a directory path, the attribute state="stale" indicates that all cached objects with that URI prefix ought to be revalidated.
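The longest-prefix lookup described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not part of the protocol: the VolumeEntry type, its field names, the find_entry helper, and the convention that a directory entry's URI ends in "/" are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class VolumeEntry:
    """Hypothetical in-memory form of one ObjectVolume entry."""
    uri: str
    etag: Optional[str] = None
    last_modified: Optional[float] = None  # seconds since the epoch
    state: str = "unknown"                 # "fresh", "stale" or "unknown"

    @property
    def is_directory(self) -> bool:
        # Assumption: directory entries are written with a trailing "/".
        return self.uri.endswith("/")


def find_entry(entries: Iterable[VolumeEntry], uri: str) -> Optional[VolumeEntry]:
    """Return the applicable entry with the longest URI, or None."""
    best = None
    for entry in entries:
        if entry.is_directory:
            applies = uri.startswith(entry.uri)   # prefix match
        else:
            applies = uri == entry.uri            # exact filename match
        if applies and (best is None or len(entry.uri) > len(best.uri)):
            best = entry
    return best
```

With entries for "http://www.cnn.com/" and "http://www.cnn.com/allpolitics/", a request for an object under /allpolitics/ would resolve to the second, more specific entry. A linear scan keeps the sketch short; a real proxy would more likely index entries in a trie or sorted structure.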
An ObjectVolume MAY also consist of objects from different origin servers, as long as the same invalidation server is used for all the objects. This may be typical in a CDN environment.

Thus, an object volume is described using the following XML DTD:

   ;# the time this xml message is sent by the origin
   ;# invalidation server

   ;# the invalidation channel URI that carries this object
   ;# volume

   ;# the version number of this object volume; it is incremented
   ;# whenever a change is made to the volume.

   ;# the base version number that the following volume info is
   ;# based on; a base of 0 means the following volume info solely
   ;# defines the volume composition; a positive base number means
   ;# that the following info should apply to an existing volume
   ;# of that version number.

   ;# the volume's current Last-Modified time; may be used in
   ;# conjunction with the version number to identify the version.

   ;# the volume's current Etag; may be used in conjunction
   ;# with the version number to identify a volume version.

   ;# whether the following objects are to be included or excluded
   ;# from the volume composition. Also, if the cache doesn't have
   ;# the object, whether it should be prefetched into the cache.
   ;# "prefetch" implies "include".

   ;# whether the enclosed objects have become stale or are still
   ;# fresh relative to the base version, or unknown.

   ;# redirect the receiver to receive updates for the following
   ;# objects from another invalidation channel.

   ;# the following objects are carried by the current channel
   ;# because they are redirected from another channel.

   ;# a name (or ID) for the object, unique within the channel.

   ;# the freshness guarantee of the object in seconds.

   ;# whether content of the new object will be sent.

   ;# the object URI; if the URI is a directory path instead of
   ;# a filename, it can potentially match any object with that
   ;# URI prefix.
   ;# the object's current Last-Modified time

   ;# the object's current Etag

For example,

[Note: specification is needed for sending small objects in full in the ObjectVolume and for sending the delta encoding [10] of a slightly changed object.]

5.2 Client-Initiated Volume Synchronization

Every channel subscription is immediately followed by one round of client-initiated volume synchronization. Subsequent rounds of volume synchronization can be either client-initiated or server-initiated.

Client-initiated volume synchronization is also performed whenever the proxy notices that the current time has passed the "last synchronization time" plus "revalidation interval". The proxy MAY notice this via a timeout, or whenever it cannot use a cached object because the "last synchronization time" is more than the object's "freshness guarantee" time ago.

Four steps take place for client-initiated volume synchronization:

(1) Synchronization request: the caching proxy sends an ObjectVolume message to the invalidation server, describing its own view of the volume, in particular the version number "A". If the proxy has never subscribed to the channel before, the version number is 0.

(2) ObjectVolume update: the invalidation server replies with the journal of changes to the volume since version "A" up until the latest version "B", if the journal of changes since version "A" is still available. The server SHOULD aggregate multiple updates to the same object; it only needs to report the latest one. If the journal of changes is not available, it replies with the full copy of the latest ObjectVolume. If "A" is equal to "B", the server simply echoes back the synchronization request.
(3) ObjectVolume processing: the caching proxy examines each object entry in the update, records its freshness guarantee, and compares the cached object (if any) with the entry. If the cached object's Etag is not equal to that in the entry and the cached object's Last-Modified time is earlier than that in the entry, the proxy marks the cached object as stale. If the entry URI is a directory path instead of a filename, all cached objects with that directory prefix are marked as stale.

(4) Update "last synchronization time": set it to the time the caching proxy sent the synchronization request.

Here are some examples. Suppose the proxy has never subscribed to the channel before. The first synchronization request looks like:

The invalidation server replies with the complete ObjectVolume:

Suppose the proxy later disconnects from the channel and rejoins after 10 minutes. Assuming it still keeps the volume description of version 7, it sends a synchronization request like this:

Suppose the volume has not changed in those 10 minutes. The invalidation server replies:

However, if the volume has indeed changed, the invalidation server sends back the journal of changes since version 7. The reply MUST have a base version equal to or smaller than the version in the synchronization request. Here is an example:

   ;# exclude object(s) from the volume.

But if the server is now at version 20 and no longer has records of changes before version 10, while the client is at version 7, then the invalidation server sends back the complete ObjectVolume information with base="0".
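The server's choice among echo, journal and full volume in step (2) can be sketched as follows. This is a hedged Python illustration only: build_reply, the Journal type, the oldest_journal_base parameter and the use of bare URIs in place of full ObjectVolume entries are assumptions made for the example, not names or structures defined by WCIP.

```python
from typing import Dict, List, Tuple

# Hypothetical journal: maps the version that introduced a change to the
# URI of the changed object. A real journal would carry full entries.
Journal = Dict[int, str]


def build_reply(client_version: int, current_version: int,
                journal: Journal, oldest_journal_base: int,
                full_volume: List[str]) -> Tuple[int, int, List[str]]:
    """Return (base, version, uris) for a synchronization reply.

    oldest_journal_base is the oldest version from which the journal
    can still replay every later change.
    """
    if client_version == current_version:
        # "A" equals "B": echo the request, nothing to report.
        return client_version, current_version, []
    if (client_version == 0
            or client_version < oldest_journal_base
            or client_version > current_version):
        # New subscriber, or the journal no longer reaches back far
        # enough: send the complete volume with base 0. Treating a
        # client version ahead of the server the same way is an
        # assumption of this sketch, not stated in the text.
        return 0, current_version, list(full_volume)
    changed: List[str] = []
    for version in sorted(journal):
        # Aggregate: report each changed object only once.
        if version > client_version and journal[version] not in changed:
            changed.append(journal[version])
    return client_version, current_version, changed
```

In every branch the returned base is equal to or smaller than the client's version, which matches the requirement on the reply's base version stated above.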
5.3 Server-Initiated Volume Synchronization

While the caching proxy is required to initiate volume synchronization whenever necessary, the invalidation server is not required to initiate volume synchronization unless it chooses to operate in server-driven mode. The invalidation server may be configured to operate in either mode. It MAY switch between server-driven mode and client-driven mode after any volume synchronization.

In server-driven mode, the invalidation server initiates volume synchronization when changes are made to the object volume, or when the channel has been silent for more than "heartbeat interval" time.

The invalidation server SHOULD initiate volume synchronization "reasonably" soon after it learns of an object change, but it MAY delay the synchronization until some time before the subsequent heartbeat. It MUST NOT delay further. Such a strategy allows the server to batch multiple changes into one update. It minimizes the number of volume synchronization rounds without inducing unnecessary cache misses.

Three steps take place for server-initiated volume synchronization:

(1) ObjectVolume update: if there are changes to the object volume, the invalidation server updates its view of the object volume, increments the version number, and organizes the changes into an ObjectVolume update message. The server then sends this update out and also stores it in the volume's journal of changes. The server SHOULD aggregate multiple updates to the same object into one. If it is time to generate a heartbeat and there has been no change to the volume since the last update, the server simply sends out an update that reiterates the current volume's version number.

(2) ObjectVolume processing: the caching proxy examines each object entry in the update, records its freshness guarantee, and compares the cached object (if any) with the entry.
If the cached object's Etag is not equal to that in the entry and the cached object's Last-Modified time is earlier than that in the entry, the proxy marks the cached object as stale. If the entry URI is a directory path instead of a filename, all cached objects with that directory prefix are marked as stale.

(3) Update "last synchronization time": in this case, there is no synchronization request, just the server's update. To account for possible clock skew, the proxy MUST convert the "date" in the server's update into the proxy's local time. Suppose t1 is the time the proxy sent out its initial synchronization request (when it established the channel subscription), and t2 is the "date" in the corresponding ObjectVolume update at that time. If t3 is the "date" of the current update from the server, then the "last synchronization time" is set to "t1 + (t3 - t2)".

For example, an object modification moves the volume from version 9 to version 10. The corresponding ObjectVolume update looks like this:

To generate a heartbeat when there has been no change to the volume, the server simply restates the current volume's version number. For example:

[Note: specification is needed for sending small objects in full in the ObjectVolume and for sending the delta encoding [10] of a slightly changed object.]

5.4 Serving Content

When an HTTP request comes in with a URI, the proxy searches its ObjectVolume data structure for a matching entry. If an ObjectVolume entry is a directory path instead of a filename, the entry is applicable to the URI if the URI has that directory path as a prefix. If multiple such directory entries match, the entry with the longest match is used.

The caching proxy MUST NOT use a cached object if the cached object is marked as stale or the current time has passed the "last synchronization time" plus the freshness guarantee of the matching entry.
Instead, the proxy MUST either perform HTTP revalidation with the origin server before serving the object or initiate volume synchronization with the invalidation server.

After the proxy has fetched the new object into its cache (or revalidated the existing one), the proxy MUST compare the cached object with the corresponding entry in the ObjectVolume. If the matching entry is a directory path or if the entry doesn't contain Last-Modified time and Etag, the cached object MUST be marked as not stale. Otherwise, the proxy checks whether the cached object's Etag is equal to that in the entry or the cached object's Last-Modified time is later than that in the entry. If so, the cached object MUST be marked as not stale; otherwise, it remains marked as stale.

6. Protocol State Machine

6.1 Client State Machine

The initial state is "INIT". An action may run a procedure; the procedures are defined at the end of the section.

STATE: INIT
INPUT: subscription is established to the invalidation server.
ACTION: set the "current version number" to 0; create a local ObjectVolume data structure with 0 objects in it; initialize "revalidation interval" to an arbitrary or pre-configured value.
NEXT-STATE: TO-SYNC

STATE: TO-SYNC
INPUT: none
ACTION: send a synchronization request with the "current version number"; record the current time as "sync request time"; reset the REVALIDATION-TIMER to a value equal to or smaller than "revalidation interval".
NEXT-STATE: SYNC-INITIATED

STATE: SYNC-INITIATED or INTERIM
INPUT: receive ObjectVolume update with version number X, base Y.
CONDITION: Y is equal to or smaller than the "current version number" and X is equal to or larger than the "current version number".
ACTION: run procedure "process the ObjectVolume update"; set the "current version number" to X; run procedure "set last synchronization time"; run procedure "set revalidation interval"; reset the REVALIDATION-TIMER to "revalidation interval".
NEXT-STATE: INTERIM

STATE: SYNC-INITIATED or INTERIM
INPUT: receive ObjectVolume update with version number X, base Y.
CONDITION: Y is larger than the "current version number" or X is smaller than the "current version number".
ACTION: discard the message; reset the REVALIDATION-TIMER to a value equal to or smaller than "revalidation interval".
NEXT-STATE: INTERIM

STATE: SYNC-INITIATED or INTERIM
INPUT: REVALIDATION-TIMER times out
NEXT-STATE: TO-SYNC

Procedure "process the ObjectVolume update": for each object entry in the update message: add or update the entry in the internal ObjectVolume data structure; compare the cached object (if any) with the entry. If the cached object's Etag is not equal to that in the entry and the cached object's Last-Modified time is earlier than that in the entry, mark the cached object as stale. If the entry URI is a directory path instead of a filename, all cached objects with that directory prefix are marked as stale.

Procedure "set last synchronization time": set it to "sync request time" if STATE==SYNC-INITIATED; otherwise, set it to "sync request time" + current time - "sync response time".

Procedure "set revalidation interval": this is where the proxy has some liberty and can implement some policy. Picking a large value means less aggressive synchronization, and thus higher invalidation latency. To avoid unnecessary cache misses, the proxy SHOULD pick a value smaller than any of the freshness guarantees of objects in the volume. On the other hand, to limit the load, it is RECOMMENDED that the proxy only revalidate if the volume has recently seen active use.
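The client-side checks above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, times are plain floating-point seconds, and last_sync_time uses the "t1 + (t3 - t2)" conversion given in section 5.3 for server-initiated updates.

```python
def accept_update(current_version: int, version: int, base: int) -> bool:
    """Client CONDITION: process the update only if
    base <= current version <= version; otherwise discard it."""
    return base <= current_version <= version


def is_stale(cached_etag: str, cached_last_modified: float,
             entry_etag: str, entry_last_modified: float) -> bool:
    """Staleness rule for a filename entry that carries both validators:
    the Etag differs AND the cached copy is older than the entry."""
    return (cached_etag != entry_etag
            and cached_last_modified < entry_last_modified)


def last_sync_time(t1: float, t2: float, t3: float) -> float:
    """Translate the server's "date" t3 into the proxy's local clock.

    t1: local time the initial synchronization request was sent
    t2: server "date" in the reply to that initial request
    t3: server "date" in the current update
    """
    return t1 + (t3 - t2)


def may_serve(marked_stale: bool, last_sync: float,
              freshness_guarantee: float, now: float) -> bool:
    """Section 5.4 check: the cached object is usable only if it is not
    marked stale and the freshness guarantee has not run out."""
    return (not marked_stale) and now <= last_sync + freshness_guarantee
```

For instance, a client at version 7 would accept an update with version 10 and base 7, but discard one with base 9, since it has not seen the changes between 7 and 9 that the update builds on.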
6.2 Server State Machine

The server has two types of tasks. One is the Volume Monitor, which keeps track of the up-to-date view of the ObjectVolume and its journal of changes. The other is the per-client Volume Synchronizer, which is in charge of volume synchronization with each client.

Here is the Volume Monitor state machine. The initial state is "INIT".

STATE: INIT
INPUT: the initial ObjectVolume.
ACTION: create an up-to-date ObjectVolume data structure; create an empty journal of changes; set the current version number to 1.
NEXT-STATE: INTERIM

STATE: INTERIM
INPUT: a change to an object in the volume is detected.
ACTION: generate the up-to-date ObjectVolume entry for the object; update the ObjectVolume data structure; increment the current version number; if an entry for the same object exists in the journal of changes, remove it; add the new entry to the journal of changes. Also, send a NEED-SYNC signal to every Volume Synchronizer.
NEXT-STATE: INTERIM

Here is the state machine for the per-client Volume Synchronizer. The initial state is "INIT". An action may run a procedure; the procedure is defined at the end of the section.

STATE: INIT
INPUT: a client subscribed.
ACTION: set "last update version" to 0.
NEXT-STATE: INTERIM

STATE: INTERIM
INPUT: receive client synchronization request with base version X.
ACTION: set "last update version" to X.
NEXT-STATE: TO-SYNC

STATE: TO-SYNC
INPUT: none
ACTION: send the journal of changes since the "last update version". If the journal since then is not available, send the full ObjectVolume. If the journal since then is empty, simply echo the synchronization request. Set the "last update version" to the "current version number"; run procedure "set heartbeat interval"; set the SYNC-TIMER to the "heartbeat interval".
NEXT-STATE: INTERIM

STATE: INTERIM
INPUT: receive NEED-SYNC signal from the Volume Monitor.
CONDITION: the server elects to initiate immediate synchronization.
NEXT-STATE: TO-SYNC

STATE: INTERIM
INPUT: receive NEED-SYNC signal from the Volume Monitor.
CONDITION: the server elects to delay the synchronization.
ACTION: set the SYNC-TIMER to a value equal to or smaller than the time left on the timer. Picking a large timeout means less aggressive synchronization, and thus higher invalidation latency. But the timer MUST NOT be set to a value larger than the time left.
NEXT-STATE: INTERIM

STATE: INTERIM
INPUT: SYNC-TIMER times out
NEXT-STATE: TO-SYNC

Procedure "set heartbeat interval": if the server elects to perform no proactive invalidation, set the "heartbeat interval" to infinite. Otherwise, the server SHOULD pick a value smaller than any of the freshness guarantees of objects in the volume. This is where the server has some liberty and can implement some policy. Picking a large value means less aggressive synchronization, and thus higher invalidation latency. The server MAY set the heartbeat interval very high or infinite in order to reduce load.

7. Security Considerations

In essence, web caches tend to trust the network infrastructure. If one can spoof IP addresses or poison DNS caches, one can poison web caches. In contrast, content providers tend to be concerned about content integrity as well as freshness. With WCIP, web caches should also be concerned about denial-of-service attacks in which a malicious party keeps invalidating objects in a cache, preventing the cache from doing real work.

To accommodate the various security needs of the invalidation servers and clients, WCIP provides three channel security modes:

(1) IP-based weak security, i.e., the invalidation client accepts a channel message if the source IP address of the invalidation message matches the invalidation server name. This is for invalidation servers and clients that do not need strong security.
(2) Public-key-based strong security with mandatory verification, i.e., the invalidation client obtains the public key of the channel during channel subscription (e.g., using SSL). The invalidation server signs or encrypts the channel messages with the channel's private key. The invalidation client MUST verify the signature and discard the message if the signature doesn't match. This is for when the invalidation server requires strong security for the channel; the invalidation clients have to comply. For unicast, the channel can simply be an SSL connection as in HTTPS. To prevent an intermediate node from tampering with the channel information in the first place, the domain name of the channel MUST be identical to that of the object's origin server. Upon channel setup, the origin server MAY then redirect the invalidation client to the true invalidation server via HTTPS.

(3) Public-key-based strong security with optional verification, i.e., the invalidation client obtains the public key of the channel during channel subscription. The invalidation server signs all the channel messages with the channel's private key. However, the invalidation client can choose to verify either the signature (strong) or the source IP address (weak). This is for when the invalidation server doesn't need strong security but wants to accommodate both clients that need strong security and clients that do not.

The authors cannot determine the necessity of this third option. Options 1 and 2 may be easier to support because they fit well in the HTTP and HTTPS model. Option 3 may be easier to support using a Beep implementation.

The above public-key solution ensures message integrity. To guard against message replay attacks, the Etag or Last-Modified of the updated object has to be part of the invalidation material.

8.
References

1  James Gwertzman and Margo Seltzer, "World-Wide Web cache consistency", In Proceedings of 1996 USENIX Technical Conference, pages 141-151, San Diego, CA, January 1996.

2  Cao, P.; Liu, C.; "Maintaining strong cache consistency in the World Wide Web", 17th International Conference on Distributed Computing Systems, 27-30 May 1997. IEEE Transactions on Computers (April 1998) vol.47, no.4, p. 445-57.

3  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

4  R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

5  Edith Cohen and Balachander Krishnamurthy and Jennifer Rexford, "Improving End-to-End Performance of the Web Using Server Volumes and Proxy Filters", Proceedings of the ACM SIGCOMM conference, September 1998.

6  D. Li and D. R. Cheriton, "Scalable Web Caching of Frequently Updated Objects using Reliable Multicast", 2nd USENIX Symposium on Internet Technologies and Systems (USITS'99), October 1999. ftp://ftp.dsg.stanford.edu/pub/papers/mmo.htm

7  Yin, J.; Alvisi, L.; Dahlin, M.; Lin, C.; "Using leases to support server-driven consistency in large-scale systems", Proceedings of 18th International Conference on Distributed Computing Systems, 26-29 May 1998, p. 285-94.

8  M. T. Rose, "The Blocks Extensible Exchange Protocol Framework", IETF Internet Draft draft-ietf-beep-framework-08.

9  Tony Speakman, et al., "PGM Reliable Transport Protocol", IETF Internet Draft draft-speakman-pgm-spec-06.

10 Mogul, J.C.; Douglis, F.; Feldmann, A.; Krishnamurthy, B., "Potential benefits of delta encoding and data compression for HTTP", ACM SIGCOMM 97 Conference.

11 Fred Douglis, Thomas Ball, Yih-Farn Chen, and Eleftherios Koutsofios, "The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web", World Wide Web, January 1998, pp.
27-44. Also appears as AT&T Labs--Research TR 97.23.1, April 1997.

12 Dilley, John; Arlitt, Martin; Perret, Stephane; Jin, Tai, "The Distributed Object Consistency Protocol", HP Labs Technical Report, http://www.hpl.hp.com/techreports/1999/HPL-1999-109.html, September 1999.

9. Acknowledgments

This draft greatly benefited from the valuable comments from Carl Sutton, Ian Cooper, Mark Nottingham, Brad Cain, Hilarie Orman, Fred Douglis and Alex Rousskov.

10. Authors' Addresses

Dan Li
Cisco Systems, Inc.
Email: lidan@cisco.com

Pei Cao
Cisco Systems, Inc.
Email: cao@cisco.com

Mike Dahlin
University of Texas
Email: dahlin@cs.utexas.edu

Full Copyright Statement

"Copyright (C) The Internet Society (date). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC editor function is currently provided by the Internet Society.