Our aging server stopped working last night at about 7pm. We couldn’t fix it remotely, so we had to ask someone at the datacenter where the server lives to hard reboot the machine for us. Because of a logistical issue (no one was at the datacenter at that time), it took about 4 hours to get the machine rebooted. We don’t have any redundancy, so during that time we were just off the air.

Oddly, since the server was rebooted the disks have been acting differently (in a good way) and we haven’t seen any of the “502 gateway timeout” errors that have been plaguing us for the last few weeks. We’ve been working on that (rather unsuccessfully so far), so it’s a little disappointing to see it magically fix itself. To investigate the 502s we gather a bunch of metrics and use them to build neat graphs so that we can see what’s going on:

[Graph: Green Felt Downtime]

This graph shows both the outage and today’s lack of “502 Gateway timeout” errors. The outage is the huge 4-hour gap between the 2 vertical dashed green lines, and the red line shows the 502 errors. Notice that today it’s nice and flat (ahhhhh), while yesterday there were a bunch of ugly spikes during peak hours (US/Pacific time).
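For the curious, here is a rough sketch of how a count like the red 502 line could be computed. It is not our actual metrics setup. It assumes a web server access log in the common log format, and the function name `count_502s_per_hour` and the sample line in the comment are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: access log lines in the common log format, for example
# 1.2.3.4 - - [10/Oct/2016:19:05:12 -0700] "GET /x HTTP/1.1" 502 173
# (the line above is an invented sample, not real traffic).
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3})\b')


def count_502s_per_hour(lines):
    """Return a Counter mapping each hour (a datetime) to its number of 502 responses."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        # Skip lines that don't parse and responses that aren't 502s.
        if match is None or match.group("status") != "502":
            continue
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        # Truncate the timestamp to the start of its hour to form the bucket.
        counts[ts.replace(minute=0, second=0)] += 1
    return counts
```

Feeding it the lines of a log file and plotting the hourly counts would give spikes during busy hours and a flat line on a good day, which is the shape we look for in the graph.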

New Server

On the plus side, we have obtained a fancy new server. It’s got roughly 6 times the number of processor cores (with each core being twice the speed) and more than 30 times the memory. We’re (slowly) getting it ready. The old server was built in such a way that it was hard to move the programs around and keep everything working. We’re taking a more modern approach with the new one, but that means a lot of thought and planning up front so that everything is smooth (and possibly redundant) in the future. Jim and I are both pretty busy with our day jobs right now, so all this is happening in our spare time.