We’ve had some bad downtime this week. On Wednesday (2018-05-30) something got wedged and games stopped being recorded. Jim and I were both busy and not paying close attention, so I didn’t notice till Thursday, when I happened to pull up the page during some downtime while getting my hair cut. I ended up remotely debugging and fixing the problem from my phone, which was a pain but worked, and made me feel like some sort of elite hacker.

Today (2018-06-02) our SSL cert expired for some reason, so things weren’t working until Jim fixed it.

Also today, we hit our 2,147,483,647th game played! If you don’t recognize that number, it’s the largest 32-bit signed integer (0x7fffffff if you’re into hexadecimal notation). That means no more games can be added, because the id number that identifies each game can’t get any bigger. This was kind of a stupid oversight on our part and is the reason you are seeing the euphemistic “The server is undergoing maintenance” message when you finish a game. When we started this site in 2005 we didn’t think it would ever be popular enough for a number that big to come into play. Last year I even read an article about this exact thing happening to someone else and felt pretty smug that we weren’t that dumb. Oops.
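
For the technically curious, here’s a made-up sketch of the problem in PostgreSQL terms (the table name and schema are invented for illustration, not our real ones):

    -- "serial" is a 32-bit integer column fed by a sequence
    CREATE TABLE games (
        id    serial PRIMARY KEY,
        score integer
    );
    -- Fast-forward the id counter to the 32-bit limit...
    SELECT setval('games_id_seq', 2147483647);
    -- ...and the very next insert fails with something like:
    --   ERROR:  integer out of range
    INSERT INTO games (score) VALUES (100);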

The annoying thing is that all those games take up a lot of space. We have our database on 2 disks (volumes, technically): one is a 2TB SSD (the main high score tables live there) and is 77% full; the other is a 4TB hard disk and is 95% full. The only way to permanently fix the issue is to rewrite all the data, which means we need roughly double the space we currently use. That means buying more disks, which will take a couple of days (I don’t think there’s anywhere local I can buy them, so we’ll have to mail order from Amazon).

The long and short of it is that high scores are currently down and it’ll take us a few days to get back up and running again. The maintenance message is true, though: the scores are being written out to a different disk, and when the DB is alive again we’ll import them all. If you read that article I mentioned, you might have noticed they had a quick fix to delay the inevitable. We might do that tonight and get things kind of working, but we’re going to have to do a permanent fix soon-ish, so you’ll probably be seeing more of the “undergoing maintenance” message in the next week.
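
I won’t swear this is exactly the trick from that article, but the usual stopgap is to start handing out ids from the unused negative half of the 32-bit range, which buys roughly another two billion games. Something along these lines (the sequence name is made up, and this is a possibility, not a promise):

    -- Possible stopgap: reuse the negative half of the 32-bit id range
    ALTER SEQUENCE games_id_seq MINVALUE -2147483648;
    SELECT setval('games_id_seq', -2147483648, false);
    -- New games now get ids counting up from -2147483648 toward zero.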

Update (2018-06-03):

Our crusty old server decided to die in the middle of the night. Because it’s housed in a satellite office of our hosting provider, there was no one on staff to reboot it. I had anticipated this and made a backup of the machine about a week ago. Jim and I spent Sunday morning getting the backup restored onto the shiny new server (currently hosted at my house). That is currently what the main site is running on. The blog and the forum are still running on the old server. We’ll be moving those to the new server when the new disks arrive (Amazon says Wednesday).

Update (2018-06-05):

We’ve copied the database to another computer with enough space and updated the DB to the latest version (PostgreSQL 10, if you are curious). We’re currently converting the id column that ran out of room into a representation that can hold bigger numbers (ALTER TABLE for you SQL nerds). This is unfortunately a slow process due to the size of the data. We started it going last night and it looks like it’s maybe 30% done. During this time the DB is completely offline, and that is causing the server code to…not be happy. It can’t authenticate users (because users are stored in the DB, too), so it’s not even saving games. Sorry about that. When the conversion is complete we’ll bring the DB back online and scores should immediately start working again.
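
For the curious, the conversion boils down to a single statement (our real table and column names differ), but it rewrites every row and rebuilds every index on the table, which is why it’s so slow and why nothing can touch the table while it runs:

    -- The gist of the fix; "games" stands in for the real table name.
    ALTER TABLE games ALTER COLUMN id TYPE bigint;
    -- bigint tops out at 9,223,372,036,854,775,807, which should hold us for a while.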

Update (2018-06-08):

The disks arrived and are all installed. We’re still converting the id columns. It’s very slow because the database has to more or less rebuild itself completely. Also, we had a bug in the conversion script: once it got about halfway done (after roughly 20 hours) it died and reset all the progress. :-/ That’s fixed now, so this time it should work (fingers crossed).

Update (2018-06-10):

The database id column conversion finally finished. Took 79 hours to run. I’ve pointed the site to the server where the database is temporarily housed and will begin copying the data back to where it is supposed to be. This temporary database is not stored on SSDs so things might be slow. We’re not sure the disks can handle the full Green Felt load. It still might be a few days before everything is smooth.

Update (2018-06-28):

The database is back where it belongs, on the new SSD. Unfortunately there was a bug in my script that copied the DB over, and I didn’t notice until I had switched over to it. I probably should have switched back to the other DB and recopied, but I didn’t realize the extent of the problem and how long it would take to fix (I thought it’d be quicker this way—I was wrong). I’ve been rebuilding indexes on the DB for the past week and a half. Luckily, indexes can rebuild in the background without causing outages, so high scores could continue during the process. Unfortunately the one I saved for last does require that the database go down while it’s being rebuilt. I started it on 2018-06-17 at around midnight and expected it to be done when I woke up in the morning—it was not. It’s still going, in fact, which is why you haven’t been able to save games, fetch high scores, or log in. I’m not sure how long this one is going to take—I would have expected it to be done already. Until it finishes, things are going to be down.
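
For the SQL nerds: the background rebuilds work by building a brand-new copy of an index while the old one stays in service and then swapping them, which PostgreSQL can do without blocking reads or writes. Roughly like this, with made-up index and column names:

    -- Build the replacement without locking the table against writes
    CREATE INDEX CONCURRENTLY games_user_id_idx_new ON games (user_id);
    -- Swap it in for the old one
    DROP INDEX games_user_id_idx;
    ALTER INDEX games_user_id_idx_new RENAME TO games_user_id_idx;

The one I saved for last isn’t amenable to that trick, hence the downtime.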

In other news, the old server (that the blog and the forum are hosted on) went down yet again for a few hours, prompting me to move those programs and their data over to the new server. This move appears to have worked. If you are reading this, it’s on the new server :-).

Update (2018-07-03):

A few days ago we gave up fixing the indexes (we kept finding more and more things wrong with the DB copy) and decided to re-copy the DB. This finished today, and we (again) switched over to the DB on the SSD. The copy was good this time, and things seem to be running well. Check out this graph:

The red line is the number of 502 errors (the “server is undergoing maintenance” errors that we all love). The vertical green dotted line marks when the SSD DB came online. Since then there have been no 502s (it’s been about 3 hours so far).

When we decided to re-copy the DB, we saved all the new games that were only in the flawed DB and have been re-importing them into the good DB. This is about a quarter of the way done, but has sped up dramatically since we got the DB back on the SSDs. Once the import finishes, we will be done with the DB maintenance! Well, until the next disaster strikes! 🙂

Update (2018-07-05):

Everything is going smoothly. We’ve imported all the games that were saved out-of-band (if you ever saw the “Your game was saved but our database is currently undergoing routine maintenance” message). As far as we know, everything is back online (and hopefully a little better than before this whole mess). Have fun!