Monday, May 26, 2008

HTTP Headers and Caching: Cache-Control, Expires, Last-Modified and Pragma

HTTP Headers

There are a few topics that most web developers don't really get good exposure to early in their career. One of these topics is HTTP headers. There are certain tasks that are only really possible after developing a strong grasp on HTTP headers and how they work.
Not only do HTTP headers open up doors to certain browser behaviors, but I've found that learning them and troubleshooting them really solidifies an understanding of how the browser works and how the HTTP request/response model operates.
This is the second of a two-part article series dealing with HTTP headers. The first article focused on Content-Type and Content-Disposition. In this article we'll be talking about caching content and how browsers react differently to headers like Cache-Control, Expires, Last-Modified and Pragma.

Tools

Since HTTP headers are pretty much invisible without the proper tools, I'd strongly recommend one of the following utilities.
If you prefer using Firefox I'd recommend you download and get familiar with Tamper Data, a Firefox extension. If you prefer working in IE, I'd consider using Fiddler instead. All the screen caps of HTTP headers you see here are from Tamper Data.

What Are Caches and How do they Work?

When we speak of caches, we're generally referring to Browser Caches, Proxy Caches and Gateway Caches. It's important to remember that just because you never intended for any of your content to be cached doesn't mean it's not going to be cached. You rarely have any control over what kinds of caches sit downstream of your web site. You CAN however instruct these caches about how to handle your content to achieve the effect that you want (which may be to NOT cache your content).

Validation Headers

I like to think of validation headers as the kind that you get for free. Validation headers are most often emitted by web servers like IIS and Apache, and they help caches discern whether or not the representation being cached is still valid.
If you put a file on disk and serve it directly off of the web server (IIS 6 in this case) you'll get two validation HTTP headers for free: ETag and Last-Modified.
Examples of Response ETag and Last-Modified headers.
Last-Modified is pretty self-explanatory: it changes whenever your content has been modified. I'm pretty sure it corresponds to the Last Modified Date of the file on disk.
ETag is slightly more interesting. It was introduced with HTTP 1.1 and is used to help caches discern if the content is the same. It is generated by the web server, and different servers generate it in different ways.
Ideally, the browser looks at its own cache and then issues a request carrying If-Modified-Since and If-None-Match headers. If the If-None-Match header matches the ETag and the If-Modified-Since date is still the same as the Last-Modified header, then the web server responds with a 304 - Not Modified response. Below is an example of the browser's (Firefox) request and the web server's (IIS 6) response. I'll repeat: the 304 is sent because the ETags match AND the file's Last-Modified date hasn't changed. The moment you change the file the cached copy becomes invalid, and the web server issues a new 200 response and serves the file in its entirety.
Example of request If-Modified-Since and If-None-Match request headers.
Example of Not Modified - 304 response from IIS when ETag and Modified-Since match the request.
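The validation exchange described above can be sketched as a small decision function. This is a minimal illustration in Python, not how IIS or any particular server implements it; the function name `conditional_status` and its parameters are made up for the example. It follows the behavior described here (a 304 only when both validators agree). Newer HTTP specifications give If-None-Match precedence and skip If-Modified-Since when an ETag condition is present, so treat this as a sketch of the idea rather than a spec-complete implementation.

```python
from email.utils import parsedate_to_datetime


def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still valid, else 200.

    request_headers: dict of request header names to values (hypothetical shape).
    etag: the server's current ETag for the file, including its quotes.
    last_modified: the server's current Last-Modified value as an HTTP date.
    """
    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")

    # No conditional headers at all: this is a plain request, send the file.
    if if_none_match is None and if_modified_since is None:
        return 200

    # The client may send several comma-separated ETags, or "*".
    if if_none_match is not None:
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if etag not in client_tags and "*" not in client_tags:
            return 200

    # The file changed after the date the client has on record.
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        modified = parsedate_to_datetime(last_modified)
        if modified > since:
            return 200

    return 304
```

Calling it with the validators the browser saved earlier gives 304. Changing either the ETag or the modification date on the server side flips the answer to 200.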

Pragma

Pragma is misused pretty often. There's a common misconception that issuing an HTTP header Pragma: no-cache in a response will tell browsers not to cache your content. It doesn't work reliably. The HTTP spec doesn't define what Pragma means in a response at all; it's meant to be used in request headers.
Probably the most common use of Pragma is when you press CTRL-F5 in a web browser. Your request carries Pragma: no-cache and looks a little like this (below). It tells the caches (including intermediary caches) not to serve any cached content.
Example of proper use of Pragma: no-cache in an HTTP request.
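To make the request-side usage concrete, here is a small Python sketch that builds (but does not send) a request carrying Pragma: no-cache. The helper name `build_forced_refresh_request` and the example URL are assumptions for illustration. The sketch also adds Cache-Control: no-cache, which is the HTTP 1.1 way of saying the same thing and which browsers typically send alongside Pragma on a forced refresh.

```python
import urllib.request


def build_forced_refresh_request(url):
    """Build a request that asks caches along the way not to answer from cache."""
    request = urllib.request.Request(url)
    # HTTP 1.0 style, understood by older caches.
    request.add_header("Pragma", "no-cache")
    # HTTP 1.1 equivalent for the same request.
    request.add_header("Cache-Control", "no-cache")
    return request


# Hypothetical URL; nothing is sent over the network here.
forced = build_forced_refresh_request("http://example.com/stem.jpg")
```

Note that urllib normalizes header names when storing them, so the second header is looked up as "Cache-control".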

Expires

The simplest way to instruct caches to cache your content is to set an Expires header. It doesn't come with a lot of options and so isn't all that versatile, but a lot of developers like it for its simplicity. The only valid value for an Expires header is an HTTP date expressed in GMT (Greenwich Mean Time). Google makes good use of this header on their classic Google image. If you look at the HTTP headers on http://www.google.com/intl/en_ALL/images/logo.gif you'll see an Expires header set to January of 2038 (below).
Example of Google setting the Expires response header on an HTTP response.
An example of an HttpHandler that serves a file and tells caches to cache the content for 20 days might look something like below. I checked and it does indeed cache (at least in Firefox v2.0.0.14).
public void ProcessRequest(HttpContext context)
{
    context.Response.ContentType = "image/jpeg";
    // Expires must be an HTTP date in GMT; the "R" format emits RFC 1123 dates.
    context.Response.AddHeader("Expires",
        DateTime.UtcNow.AddDays(20).ToString("R"));
    // Write the image file from disk into the response.
    context.Response.WriteFile(context.Server.MapPath("stem.jpg"));
}
The associated Response looks like this (below):
Setting the Expires response header on an HTTP response.
One trouble with Expires is that it's an absolute date in GMT. If the clocks on the web server and the cache are out of sync, your expiry times may not be honored the way you intended.
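The clock problem is easier to see with numbers. The following Python sketch formats an Expires value and then checks it against a cache whose clock runs fast. The function names `expires_header` and `is_expired` are invented for this example, and the timestamps are arbitrary.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def expires_header(server_now_utc, days):
    """Format an Expires value as an HTTP date in GMT."""
    return format_datetime(server_now_utc + timedelta(days=days), usegmt=True)


def is_expired(expires_value, cache_now_utc):
    """A cache compares the absolute date against its OWN clock."""
    return cache_now_utc >= parsedate_to_datetime(expires_value)


# The server believes it is noon and asks for one hour of caching.
server_now = datetime(2008, 5, 26, 12, 0, 0, tzinfo=timezone.utc)
one_hour = format_datetime(server_now + timedelta(hours=1), usegmt=True)

# A cache whose clock runs two hours fast sees the response as already stale.
skewed_cache_now = server_now + timedelta(hours=2)
stale_on_arrival = is_expired(one_hour, skewed_cache_now)
```

With a correct clock the same response stays fresh for the full hour, which is why a relative lifetime such as max-age is less sensitive to this kind of skew.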

Cache-Control

Cache-Control is the fully featured sibling of Expires. With Cache-Control you have the following possible values.
  • public — marks authenticated responses as cacheable. Normally if HTTP authentication is required (whether it's forms, NT, etc...) responses are not cached. Marking them public will allow them to be cached.
  • no-store — instructs caches not to keep a copy of the representation under any conditions.
  • no-cache — forces caches to submit the request to the origin server for validation before releasing a cached copy, every time. This is useful to assure that authentication is respected (in combination with public), or to maintain rigid freshness, without sacrificing all of the benefits of caching.
  • must-revalidate — tells caches that they must obey any freshness information you give them about a representation. HTTP allows caches to serve stale representations under special conditions; by specifying this header, you’re telling the cache that you want it to strictly follow your rules.
  • proxy-revalidate — similar to must-revalidate, except that it only applies to proxy caches.
  • max-age=[seconds] — specifies the maximum amount of time that a representation will be considered fresh. Similar to Expires, except that this directive is relative to the time of the request rather than an absolute date. [seconds] is the number of seconds from the time of the request you wish the representation to be fresh for.
  • s-maxage=[seconds] — similar to max-age, except that it only applies to shared (e.g., proxy) caches.
As you can see there's a tonne of different scenarios supported by the above values. One example might be:
Cache-Control: must-revalidate, max-age=604800
Which tells the cache to respect your instructions and to treat the representation as fresh for a week (604800 seconds). Or consider:
Cache-Control: public, no-cache
Which makes the cache check back with the origin server before releasing its cached copy. This is popular when caching authenticated content, since the origin server gets a chance to confirm the user is authenticated before they are shown secured content.
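A Cache-Control value is a comma-separated list of directives, some of which take an argument after an equals sign. As a rough illustration of how a cache might read one, here is a minimal Python parser. The function name `parse_cache_control` is an assumption for the example, and the sketch ignores edge cases such as commas inside quoted arguments.

```python
def parse_cache_control(value):
    """Split a Cache-Control value into a dict of directive -> argument.

    Directives without an argument (such as no-cache) map to None.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, separator, argument = part.partition("=")
        # Directive names are case-insensitive, so normalize them.
        key = name.strip().lower()
        directives[key] = argument.strip().strip('"') if separator else None
    return directives


week = parse_cache_control("must-revalidate, max-age=604800")
seconds_in_a_week = 7 * 24 * 60 * 60
```

Here `week["max-age"]` holds the string "604800", which is the same number as `seconds_in_a_week`.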

Rules of Thumb When it Comes to Caching

  1. If the Response tells the cache not to keep the content, it won't.
  2. If the Response is secure (Cache-Control: private) it won't be cached by proxies. Some browsers may still cache this data.
  3. If the Response doesn't have any cache instructions (Cache-Control, Expires) and there's no validator (Last-Modified, ETag) it won't be cached.
  4. A cached representation is considered fresh (that is, able to be sent to a client without checking with the origin server) if:
    • It has an expiry time or other age-controlling header set, and is still within the fresh period.
    • A browser cache has already seen the representation, and has been set to check once a session.
    • A proxy cache has seen the representation recently, and it was modified relatively long ago.
      Fresh representations are served directly from the cache, without checking with the origin server.
  5. If a representation is stale, the origin server will be asked to validate it, or tell the cache whether the copy that it has is still good.
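The rules above can be sketched as one decision function. This is a simplified Python illustration, not how any real browser or proxy is implemented. The names `cache_decision` and `freshness_lifetime` are made up, headers are assumed to arrive as a dict with the exact capitalization shown, the private directive from rule 2 is not modeled, and the heuristic cases in rule 4 (checking once a session, or guessing from how long ago the content was modified) are collapsed into "revalidate".

```python
from email.utils import parsedate_to_datetime


def directive_names(cache_control):
    """Lower-cased directive names from a Cache-Control value."""
    return {part.strip().partition("=")[0].strip().lower()
            for part in cache_control.split(",") if part.strip()}


def freshness_lifetime(headers):
    """Seconds the response may be reused without validation, or None."""
    # max-age wins over Expires when both are present.
    for part in headers.get("Cache-Control", "").split(","):
        name, _, argument = part.strip().partition("=")
        if name.strip().lower() == "max-age" and argument.strip().isdigit():
            return int(argument.strip())
    if "Expires" in headers and "Date" in headers:
        delta = (parsedate_to_datetime(headers["Expires"])
                 - parsedate_to_datetime(headers["Date"]))
        return max(0, int(delta.total_seconds()))
    return None


def cache_decision(headers, age_seconds):
    """Return 'do-not-store', 'fresh' or 'revalidate' for a response."""
    names = directive_names(headers.get("Cache-Control", ""))
    if "no-store" in names:
        return "do-not-store"  # rule 1
    if "no-cache" in names:
        return "revalidate"  # stored, but always checked with the origin
    lifetime = freshness_lifetime(headers)
    if lifetime is None:
        has_validator = "ETag" in headers or "Last-Modified" in headers
        # Rule 3: nothing to go on means nothing is cached.
        return "revalidate" if has_validator else "do-not-store"
    # Rules 4 and 5: fresh copies are served directly, stale ones are validated.
    return "fresh" if age_seconds < lifetime else "revalidate"
```

For example, a response with `Cache-Control: max-age=60` that has been in the cache for 30 seconds comes back as "fresh", and the same response after 90 seconds comes back as "revalidate".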

Summary

I remember back in the day when the stuff I wrote either didn't have a lot of users or was on machines farmed in such a way that caching didn't really matter. More and more these days I find my code is sharing a machine with MANY other CPU/Memory hungry applications and my sites need to be ever more efficient. One of the easiest ways to do this is to move things to the client, and that includes leveraging proven caching infrastructure at the proxy and client side.
Best,
Tyler

References:

http://www.mnot.net/cache_docs/

5 comments:

Sergey said...

really great post & really great blog. I subscribed.

Tyler Holmes said...

Thanks for the feedback Sergey. I'll try not to disappoint!

Best,
Tyler

Guillaume Cretot-Richert said...

Rule of thumb #2 (If the Response is secure (Cache-Control: private) it won't be cached) is incorrect. FireFox will cache pages to disk if Cache-Control is set to anything other than no-store, and so will Chrome. This is according to spec.

Tyler Holmes said...

Thanks Guillaume, I updated that section. Meant to read that it wouldn't be cached by proxies.

Guillaume Cretot-Richert said...

5 years later and you came back to fix! I am impressed. It's a beautiful thing when people take good care of their blogs. Congratulations.