Topic: Botched Update, Server Failure, Rollback

Hi everybody,

As you may have noticed, the game has been rolled back.  Specifically, we lost 27 ticks earlier this morning.

What happens now?

This obviously sucks for many players.  A bug was discovered that allowed players to take actions without spending the necessary units/resources.  Contrary to what you might have heard, this was not limited to exploration ships.  Players potentially built for free as well, so between moving forward and rolling back, rolling back was the less bad of two bad options.

Some players have described this as "round ruining", and I can understand the frustration.  Our course of action now is open to player vote.  A vote/discussion thread has been opened in uni news.

How did this happen?

For the sake of transparency, I'll describe what led to this:

Work has been ongoing for a while now to rewrite the underlying code that powers the game.  This is necessary prep work for the mobile app.  Recently, this crossed over with some updated views for planets and systems.  Together, the two groups of changes grew into a substantial change in how the game works behind the scenes.  My testing validated functionality, but I did not adequately account for performance impact.  Something in the new changes introduced a memory leak that did not manifest on my test server.  This is problem #1: I don't sufficiently load test my changes.
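To illustrate the kind of check that would catch a leak like this before release, here is a minimal Python sketch.  It is not IC's actual code: the handlers, the `_cache` list and the `memory_growth` helper are all hypothetical names made up for the example.  The idea is simply to call a request handler many times and measure how much memory is still held afterwards.

```python
import gc
import tracemalloc

# Hypothetical module-level state; a leaky handler keeps appending to it.
_cache = []

def leaky_handler(request_id):
    # Stores a new ~1 KB string on every call and never releases it.
    _cache.append("x" * 1000 + str(request_id))
    return request_id

def clean_handler(request_id):
    # Builds the same string but lets it go out of scope.
    data = "x" * 1000 + str(request_id)
    return len(data)

def memory_growth(handler, iterations=5000):
    """Return the bytes of traced memory still held after repeated calls."""
    tracemalloc.start()
    handler(0)  # warm-up call so one-time allocations are not counted
    gc.collect()
    before, _peak = tracemalloc.get_traced_memory()
    for i in range(1, iterations + 1):
        handler(i)
    gc.collect()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

if __name__ == "__main__":
    print("leaky:", memory_growth(leaky_handler), "bytes retained")
    print("clean:", memory_growth(clean_handler), "bytes retained")
```

A functional test passes for both handlers, since each returns a sensible value.  Only the repeated-call measurement shows that one of them keeps growing, which is why a leak can slip past testing that only validates functionality.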

The memory leak eventually led to intermittent crashes and database failures, which meant that in certain scenarios player actions would only partially complete.  For example, sending exploration ships without having the ships deducted, or building infra without actually spending the resources.  It potentially happened in the other direction as well, meaning players may have spent fleet/resources to no effect.  This is problem #2: IC's data handling is not atomic, so a multi-step action can be left half-applied instead of fully succeeding or fully failing.
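For anyone curious what atomicity looks like in practice, here is a minimal sketch using Python's built-in `sqlite3` module.  This is an illustration only and does not reflect IC's real schema or database: the `players` and `missions` tables, the `send_exploration` function and the `fail_midway` flag are assumptions invented for the example.  Both steps of the action run inside one transaction, so a crash between them undoes the first step instead of leaving a free mission behind.

```python
import sqlite3

def make_db():
    # Hypothetical schema: one player with 10 ships and an empty mission list.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE players ("
        "id INTEGER PRIMARY KEY, "
        "ships INTEGER NOT NULL CHECK (ships >= 0))"
    )
    conn.execute(
        "CREATE TABLE missions ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "player_id INTEGER NOT NULL, "
        "ships INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO players (id, ships) VALUES (1, 10)")
    conn.commit()
    return conn

def send_exploration(conn, player_id, ships, fail_midway=False):
    """Create a mission and deduct the ships as one all-or-nothing step."""
    try:
        # 'with conn' commits if the block finishes, rolls back if it raises.
        with conn:
            conn.execute(
                "INSERT INTO missions (player_id, ships) VALUES (?, ?)",
                (player_id, ships),
            )
            if fail_midway:
                # Simulates the server crashing between the two writes.
                raise RuntimeError("simulated crash")
            # The CHECK constraint rejects this if the player lacks the ships.
            conn.execute(
                "UPDATE players SET ships = ships - ? WHERE id = ?",
                (ships, player_id),
            )
        return True
    except (RuntimeError, sqlite3.IntegrityError):
        return False

def snapshot(conn):
    # Returns (ships remaining for player 1, number of missions).
    ships = conn.execute("SELECT ships FROM players WHERE id = 1").fetchone()[0]
    missions = conn.execute("SELECT COUNT(*) FROM missions").fetchone()[0]
    return ships, missions
```

With this structure, a failed call leaves both tables exactly as they were.  The mission row and the ship deduction either both land or neither does.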

As a result of these two problems, the game's data integrity was compromised, which is a more serious long-term problem than rolling the game back in the short term.

How will this be prevented in the future?

For starters, it's become apparent that I need to slow down with the updates.  I've had my eye on the mobile app for a while now and this has been driving me to aggressively work on the game at the expense of stability.

I'm going to take a step back and reevaluate my dev/test/release processes to ensure that problems like this will not happen again.  The downside there is that development will generally take longer until I have sufficient and well-structured automation in place.

Secondary to this, I'm going to be looking at our application architecture.  IC doesn't require many resources to play in your browser, but it does generate quite a bit of data on the server side.  Our data has never been particularly well structured, and although work has been ongoing to optimize within our current framework, it may be time for us to consider new directions.

Got a few bucks?  The Imperial Tip Jar is accepting contributions!