We are back online. Last night we had our longest network outage so far, lasting about 8 hours. Needless to say, we are truly sorry for any inconvenience this caused.
The long story is that at about 8 PM (UTC) we had to reboot the main server. The last server reboot was 180 days ago, and since then we had been operating without any major issues. Usually a reboot means just a few minutes offline, and we were fully prepared to do this smoothly. Unfortunately the server had a problem starting again: the boot process simply hung at one point, with no apparent reason and no error message. The natural thing to do was to boot the server in "rescue mode" and trace the problem, which we did within the next few minutes. We had a very similar situation when we were testing this server before making it public, but apparently the problem was different this time.
Because we suspected that this would be easier to debug and fix locally (it could have been anything, software or hardware), we asked the support team at SoftLayer.com to handle the issue, since they are great at fixing such things. And they did fix it, but it took about 8 hours altogether to bring the server back.
We had hoped it could be fixed more quickly, but we are glad it is fixed and the service is up, running and stable again. Fortunately it was evening or night in Europe and the US, so hopefully most of our users did not notice the outage, but for us it was a very busy night.
There is certainly one thing we have learned: we will start designing a failover/load-balancing infrastructure as soon as possible. Right now we have live backups to our server in Germany (so your data is never lost), and in theory we could have made it the primary server during the outage, but the amount of traffic we have been getting recently would certainly have been too much for it. Shame on us for not having looked into this before. Anyway, we will do our best to avoid such situations in the future.
Thanks for using Wikidot!
The Wikidot Team