Back on-line
started by: michal frackowiak
on: 28 May 2008, 07:17 UTC
number of posts: 8
Back on-line
michal frackowiak, 28 May 2008, 07:17 UTC

We are back online. Last night we had our longest network outage so far, lasting about 8 hours. Needless to say, we are truly sorry for any inconvenience this caused.

The long story is that at about 8 PM (UTC) we had to reboot the main server. The last server reboot was 180 days ago, and since then we had been operating without any major issues. Usually rebooting means just a few minutes of being off-line, and we were fully prepared to do this operation smoothly. Unfortunately there was a problem with starting again: the boot just hung at one point, without any apparent reason and without any message. The natural thing was to quickly boot the server in "rescue mode" and trace the problem, so we did that within the next few minutes. We had a very similar situation when we were testing this server prior to making it public, but apparently the problem was different this time.

Since we suspected that this could be easier to debug and fix locally (it could be anything, software or hardware), we asked the support team from SoftLayer.com to handle the issue, as they are great at fixing such things. And they did it, but it took about 8 hours altogether to bring the server back.

We hoped it could be fixed more quickly, but we are glad it is fixed after all and the service is up, running and stable again. Fortunately it was evening/night in Europe and the US, so hopefully the outage was not noticed by most of our users, but for us it was a very busy night.

There is certainly one thing we have learned: we will start designing a failover/load-balancing infrastructure as soon as possible. Right now we have live backups to our server in Germany (so your data is never lost), and theoretically we could have made it the primary server during the outage, but the amount of traffic we have been getting recently would certainly be too much for it. Shame on us that we have not looked into this before. Anyway, we will do our best to avoid such situations in the future.

Thanks for using Wikidot!

The Wikidot Team

Re: Back on-line
Craig Macomber, 28 May 2008, 07:37 UTC

Possibly the backup could serve static HTML pages, so the load handling capacity would not have to be as high? Not being able to edit and post forum messages is one thing, but having no access at all is much worse. You could just drop login support entirely and serve totally fixed pages from the backup, and replace the login prompts with a note about the status. This also avoids possible sync issues caused by having edits done at the temporary location.

Of course, a fully functional backup would be wonderful, but it has higher hosting costs.

Re: Back on-line
pieterh, 28 May 2008, 09:28 UTC

A static HTML backup is a really good idea. Michal, could we check with SoftLayer whether it's possible to have a smaller backup machine on their network, so that the IP address can be switched over rapidly if needed?

Re: Back on-line
michal frackowiak, 28 May 2008, 09:55 UTC

To be honest, Gabrys suggested (more or less) the same thing to me via Jabber this morning. A static failover machine could be faster to set up and cheaper to maintain. Of course we still need a place to dump real backups, but this could be independent.

SoftLayer has some load balancers that could possibly do the routing. As always, the complexity is hidden in the details. We will make a quick plan for how to do it within the next few days. The options at the moment are:

  1. Make a static version of the wikis as seen by a visitor who is not logged in. Do it either with a kind of proxy or by creating static HTML dumps.
  2. Use the read-only backup database and the backed-up filesystem to generate a read-only Wikidot. This would require more resources, but it fixes some potential problems.
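
To illustrate option 1, here is a minimal sketch of writing static HTML dumps with the login prompt swapped for a status note, as suggested earlier in the thread. It is only an assumption of how this could look, not Wikidot's actual code: the function name `dump_static`, the `<!--LOGIN-->` marker and the wording of the note are all hypothetical, and the pages are assumed to be already rendered as an anonymous visitor would see them.

```python
from pathlib import Path

# Hypothetical status note shown where the login prompt would normally be.
STATUS_NOTE = (
    '<p class="status">Read-only mode: editing and login are '
    "temporarily unavailable.</p>"
)


def dump_static(pages, out_dir, login_marker="<!--LOGIN-->"):
    """Write pre-rendered pages to out_dir as static .html files.

    pages        -- dict mapping a page name (e.g. "start" or "forum/t-1")
                    to its HTML as seen by a visitor who is not logged in
    out_dir      -- directory that the failover web server would serve
    login_marker -- hypothetical placeholder marking the login prompt
    Returns the list of files written.
    """
    out_dir = Path(out_dir)
    written = []
    for name, html in pages.items():
        target = out_dir / f"{name}.html"
        # Page names may contain slashes, so create parent folders as needed.
        target.parent.mkdir(parents=True, exist_ok=True)
        # Replace the login prompt with a note about the current status.
        target.write_text(html.replace(login_marker, STATUS_NOTE), encoding="utf-8")
        written.append(target)
    return written
```

A dump like this can be served by any plain web server, which is why the failover machine could be much smaller than the primary one.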

Of course it would be better to have redundant servers with real load balancers, and we have been considering this for some time, but it would require a much, much more complicated setup. Once it is up, though, the failure of any single node would not affect the whole service.

Re: Back on-line
santy, 28 May 2008, 14:20 UTC

Hey Guys,

Things like these happen. :-) I used to run a small forum and I know the kind of pressure you are under to get things back in order.
Good on ya for fixing it and doing a good job.

You guys have an awesome site.

Thanks and Good luck!

Santy Raghavan
santy.wikidot.com

Re: Back on-line
Brian Smith, 29 May 2008, 00:27 UTC

I am new to wiki admin and am trying the wiki admin thing with people who are all new to wikis. Yesterday was going to be my rollout date for the site we built. When I could not get online, I wondered "does this happen all the time?" and almost moved my content to a new site. But then this morning I read your news article, which openly shared the problem you had and what you are doing to prevent it in the future, and I decided there is no place I'd rather be.

Re: Back on-line
Vru, 3 Jun 2008, 03:43 UTC

;-) I'm quiet, I trust in the Wikidot project :-)

Re: Back on-line
Kukukiss, 5 Jul 2008, 20:08 UTC

I do too!!!
