How we hash our Javascript for better caching and less breakage on updates
Posted by david on 01 Sep 2009 at 10:39 pm | Tagged as: technical
One of the problems we used to see frequently on Green Felt happened when we’d update a Javascript API: We’d add some parameters to a library function and then update some other files so that they called the function with the new parameters. But when we’d push the changes to the site, we’d end up with a few users who somehow had the old version of one of the files stuck in their cache. Their browser would then have old code calling new code, or new code calling old code, and the site wouldn’t work for them. We’d have to explain how to reset their cache (and of course every browser has different instructions) and hope that if they didn’t write back, everything had gone OK.
This fragility annoyed us and so we came up with a solution:
- We replaced all of our <script> tags with calls to a custom “script” function in our HTML template system (we use Template::Toolkit; [% script("sht.js") %] is what the new calls look like).
- The “script” function is a native Perl function that does a number of things:
1. Reads the Javascript file into memory. While reading, it understands C-style “#include” directives so we can structure the code nicely (though we don’t actually take advantage of that yet).
2. Uses JavaScript::Minifier::XS to minify the resulting code.
3. Calculates the SHA hash of the minified code.
4. Saves the minified code to a cache directory, where it is named based on its hash value, which makes the name globally unique (it also keeps its original name as a prefix so debugging is sane).
5. Keeps track of the original script name, the minified script’s globally unique name, and the dependencies used to build the minified file. This is stored in a hash table and also saved to disk for future runs.
6. Returns a script tag referring to the globally unique Javascript file back to the template, which ends up going out in the html file. For example,
<script src="js/sht-bfe39ec2e457bd091cb6b680873c4a90.js" type="text/javascript"></script>
- There’s actually a step 0 in there too. If the original Javascript file name is found in the hash table then it quickly stats its saved dependencies to see if they are newer than the saved minified file. If the minified file is up to date then steps 1 through 5 are skipped.
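The whole flow, minus the Perl and Template::Toolkit specifics, fits in a few lines. Here is an illustrative Python sketch under stated assumptions: the #include expansion and the JavaScript::Minifier::XS minification are stubbed out, sha256 stands in for whatever SHA variant the real code uses, and the dependency tracking and step-0 stat check are omitted.

```python
import hashlib
import os

def script_tag(src_path, cache_dir="js"):
    """Return a <script> tag pointing at a content-hashed copy of src_path.

    Sketch only: the real implementation is Perl, expands C-style #include
    directives, and minifies with JavaScript::Minifier::XS; both of those
    steps are stubbed out here.
    """
    with open(src_path, "rb") as f:
        code = f.read()          # step 1: read (no #include expansion here)
    minified = code              # step 2: minification stubbed out
    digest = hashlib.sha256(minified).hexdigest()  # step 3: hash the code
    stem = os.path.splitext(os.path.basename(src_path))[0]
    unique = f"{stem}-{digest}.js"   # step 4: original name kept as a prefix
    out_path = os.path.join(cache_dir, unique)
    if not os.path.exists(out_path):
        os.makedirs(cache_dir, exist_ok=True)
        with open(out_path, "wb") as f:
            f.write(minified)    # save under the globally unique name
    # step 6: return the tag that ends up in the html
    return f'<script src="{cache_dir}/{unique}" type="text/javascript"></script>'
```

The real version also records the dependency list (step 5) and short-circuits via the step-0 freshness check before doing any of this work.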
The advantages of this approach
It solves the original problem.
When the user refreshes the page they will either get it from their browser cache or from our site. No matter where it came from, the Javascript files it references are uniquely named, so it is impossible for the files to be out of date relative to each other.
That is, if you get the old html file you will reference all the old named Javascript files and everything will be mutually consistent (even though it is out of date). If you get the new html file it guarantees you will have to fetch the latest Javascript files because the new html only references the new hashed names that aren’t going to be in your browser cache.
It’s fast.
Everything is cached, so the minification and hash calculations happen only once per file. We’re obviously running FastCGI, so the in-memory cache persists across http requests. More importantly, the js/ dir is statically served by the web server, so it’s exactly as fast as it was before we did this (since we previously served the .js files without any preprocessing). All this technique adds is a couple of filesystem stats per page load, which isn’t much.
It’s automatic.
There’s no script to remember to run when we update the site. We just push our changes up to the site using our version control and the script lazily takes care of rebuilding any files that may have gone out of date.
So you might be thinking, isn’t all that dependency stuff hard and error prone? Well, it’s really only one line of perl code:
use List::Util qw(max);
sub max_timestamp(@) { max map { (stat $_)[9] || 0 } @_ } # element 9 of stat() is mtime
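The step-0 freshness test built on that helper is equally small. Here is the same idea sketched in Python (hypothetical names; like the Perl original, a missing file counts as timestamp 0, which conveniently forces a rebuild):

```python
import os

def max_timestamp(*paths):
    # Newest mtime among the paths; a missing file counts as 0,
    # mirroring the "|| 0" in the Perl one-liner
    return max((os.stat(p).st_mtime if os.path.exists(p) else 0) for p in paths)

def is_stale(minified_path, dependency_paths):
    # Rebuild when any dependency is newer than the cached minified file
    return max_timestamp(*dependency_paths) > max_timestamp(minified_path)
```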
It’s stateless.
It doesn’t rely on incrementing numbers (“js/v10/script.js” or even “js/script-v10.js”). We considered this approach but decided it was actually harder to implement and had no advantages over the way we chose to do it. This may have been colored by our chosen version control system (darcs) where monotonically increasing version numbers have no meaning.
It allows aggressive caching.
Since the files are named by their contents’ hash, you can set the cache time up to be practically infinite.
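For example (an nginx sketch; the post doesn’t say which web server Green Felt uses, so treat the directives as an assumption), the hashed files can be served with a far-future lifetime:

```nginx
# Safe because a file's content can never change without its name changing
location /js/ {
    expires max;                                   # far-future Expires header
    add_header Cache-Control "public, immutable";  # clients never revalidate
}
```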
It’s very simple to understand.
It took less than a page of perl code to implement the whole thing and it worked the first time with no bugs. I believe it’s taken me longer to write this blog post than it took to write the code (granted I’d been thinking about it for a long time before I started coding).
No files are deleted.
The old js files are not automatically deleted (why bother, they are tiny) so people with extremely old html files will not have inconsistent pages when they reload. However:
The js/ dir is volatile.
It’s written so we can rm js/* at any point and it will just recreate what it needs to on the next request. This means there’s nothing to do when you unpack the source into a new directory while developing.
You get a bit of history.
Do a quick ls -lrt of the directory and you can see which scripts have been updated recently and in what order they got built.
What it doesn’t solve
While it does solve the problem of Javascript to Javascript API interaction, it does not help with Javascript to server API interaction–it doesn’t even attempt to solve that issue. The only way I know to solve that is to carefully craft the new APIs in parallel with the old ones so that there is a period of time where both old and new can work while the browser caches slowly catch up with your new world.
And… It seems to work
I’ve seen similar schemes discussed but I’ve not seen exactly what we ended up with. It’s been working well for us–I don’t think I’ve seen a single bug from a user in a couple months that is caused by inconsistent caching of Javascript files by the browser.
At Yahoo, I’ve built something like this twice, for two separate projects. There are a few other solutions floating around, but our environments are so often complicated and distributed, it’s hard to get a good mix of production speed and development convenience that works in a general way.
The problems that I’d see with this approach are that a) you’re not concatenating your scripts in production (or maybe you are, and just didn’t mention that in the article), and b) you’re not loading your scripts from a CDN, meaning that they must be built on every FE box that you deploy on.
On my most recent project, http://apps.yahoo.com, we developed a system consisting of these components:
a) An INI file mapping the “raw” location to the “prod” location. The entries look something like: “/path/to/script.js: /path/to/script-89fn3902ha8ofh3o8ahfao03ha.js”
b) A combo handler script on our CDN server.
c) A PHP class that can be called and either spits out the raw javascript files as multiple script tags, or a single script tag pointing at the concatenated files. It allows the developer to customize the settings, so you can test the concatenated scripts by just running the build and pointing it at your local box.
d) A build program that concatenates all the scripts in the htdocs/js folder, gives them a hash, pushes to the CDN server (or whichever server you point it at), and creates the INI file for deployment.
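Step (d) might look something like this sketch (hypothetical Python; the commenter shows no code, MD5 is an arbitrary choice here, and the concatenation and CDN push are out of scope):

```python
import hashlib
import os

def build_manifest(js_dir):
    """Map each raw script path to a content-hashed production name.

    Hypothetical sketch of the build step described above; minification,
    concatenation, and the CDN push are not modeled.
    """
    manifest = {}
    for name in sorted(os.listdir(js_dir)):
        if not name.endswith(".js"):
            continue
        path = os.path.join(js_dir, name)
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        manifest[path] = os.path.join(js_dir, f"{name[:-3]}-{digest}.js")
    return manifest

def write_ini(manifest, out_path):
    # One "raw: prod" line per script, matching the INI format described above
    with open(out_path, "w") as f:
        for raw, prod in sorted(manifest.items()):
            f.write(f"{raw}: {prod}\n")
```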
So, we’d build away in development, happily using our 15 script tags that point at unconcatenated unminified stuff, and then run the build script on it, push the package to our test box, run our tests against the built JS, and finally push to the several production servers.
None of this is rocket surgery, and of course, it’s unlikely that Green Felt needs to solve the kinds of problems of scale that Yahoo faces. The main point, which I think you’ve done well, is to figure out a system where developers can iterate quickly, and pages can load even more quickly.
To deal with the client-server problem, add an api version number to all api calls and a special failure response that can come from any call telling the client to refresh the whole app.
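That suggestion might be sketched like this (hypothetical names throughout; nothing here is from the post itself):

```python
CURRENT_API_VERSION = 3  # bumped whenever the server API changes incompatibly

def do_work(request):
    # Placeholder for the real API logic
    return request.get("payload")

def handle_api_call(request):
    """Reject calls from stale clients with a special 'please reload' response.

    Hypothetical sketch of the commenter's suggestion: every request carries
    the API version the client was built against, and any handler can answer
    with a response telling the client to refresh the whole app.
    """
    if request.get("api_version") != CURRENT_API_VERSION:
        # The client-side wrapper should treat this as "refresh the whole app"
        return {"ok": False, "error": "stale_client", "action": "reload"}
    return {"ok": True, "result": do_work(request)}
```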
http://particletree.com/notebook/automatically-version-your-css-and-javascript-files/
autoVer works (implemented in your scripting language du jour). Changing the filename is the best cache-buster there is, so do whatever is easiest for you.
I use a similar technique but instead of minifying dynamically I just use git’s hashing:
http://blog.woobling.org/2009/06/git-s3-and-rewritemap.html
I was facing the same problem too. Some said that if you use ?var=xx it doesn’t work in IE8; I need to check. Looking at the source of your blog, I see you use this technique there as well.
Nice article.
Is the code available anywhere?
Cheers.
Sounds very similar to Django 1.4’s offering which I just discovered (https://docs.djangoproject.com/en/dev/ref/contrib/staticfiles/#cachedstaticfilesstorage), which adds to the credibility of your technique.
@mark
Yeah, I published the code on CPAN a while back.
That code is old with respect to what Green Felt is currently using—we really should do an updated release.