Server Downtime

  • jacob1
    July 2015 Developer 16 Permalink
    As I'm sure you know by now, the server has been down for the last 2 days. But don't worry, TPT isn't dead and everything is totally fine. The server is finally back up now, so I'll try to explain what happened here (based on what @Simon was saying).

    At ~6:20PM UTC on July 17th, the server host (OVH, RBX-1 in France) had some kind of power failure. A bunch of servers went down, including the TPT server (fancy image). The power went off 2-3 times over the next few hours. During one of the first power failures a single save was corrupted (it had 100 views and a tag), so Simon shut down the server after this to prevent more damage and to do a full backup of everything.

    The server was supposed to come back up July 18th, after the backup completed overnight. It didn't complete though, because a disk had a hardware failure (apparently unrelated to the power failure). Trying to read from the disk often just crashed the server entirely. Simon spent the entire day trying to back up this disk without it crashing, and eventually got it. This means all your saves and other data are OK, nothing to worry about there. It was actually only a 4KB area of the disk that was causing the entire issue, and later investigation showed that there wasn't actually anything stored there, so nothing was lost.

    As of today, Simon has set up a new disk that hopefully won't fail miserably if there is another hardware failure like that.

    Definitely thank @Simon for getting the server back up, it took a lot of time and work to fix this problem. The server wasn't down for 2 days because nobody was there to fix a simple problem; it was actually quite complicated and just took that long to sort out without losing any important data.



    If there is ever more downtime, I will post about it here once the server is back up. While it is down though, there are several places you can go to get more info:

    • Freenode webchat, the recommended place to go to chat with other ops / talk about TPT.
    • TPT Multiplayer, I was setting MOTDs here about the current status. The multiplayer server is hosted elsewhere by @cracker64, so it should be up when TPT is down
    • https://twitter.com/PowderToy. We don't tweet often, but maybe I'll try to do that more. Minimal information was being posted here and I was replying to people
    • https://www.facebook.com/PowderToy. I don't have access to the Facebook account, but Simon did post an update there
    • admin@powdertoy.co.uk. Of course this wouldn't work if the server was down, but this is the email for the site which Simon can read


    There shouldn't be much downtime though, the host is usually pretty stable.


    One other thing I'll mention is those times when the server just doesn't seem to respond for a few minutes, and things like errors when submitting comments happen, but it turns out the comments all went through and now there are duplicates. That has been a persistent issue for a very long time now, and is likely caused by the database being locked up by something. Hopefully it will be fixed eventually, but it is only an issue for a few minutes once or twice a day, so just ignore these :P
    Edited 5 times by jacob1. Last: July 2015
  • dayday24
    July 2015 Member 2 Permalink

    Thank you for going to such lengths to save such a small game (in comparison, of course). But I really want to know what the color code is on that image. I know red is failed and green is normal, but what about orange, or blue?

  • jacob1
    July 2015 Developer 0 Permalink
    @dayday24
    Here is the page itself: http://travaux.ovh.net/vms/index_rbx.html

    I guess blue is 1-4 servers down, dark red is 15+. The scale is on the bottom. No idea which specific box in there TPT is hosted on.
  • dayday24
    July 2015 Member 0 Permalink

    @jacob1

     Well that was fast, and thank you.

  • Simon
    July 2015 Administrator 0 Permalink
    @dayday24
    Each square is not an individual server, but a rack of servers. The colour codes indicate how many servers in the rack are offline.
  • gbasilva
    July 2015 Member 1 Permalink
    Thanks for explaining this to us. Great that everything is fine!
  • thespazz
    July 2015 Member 0 Permalink

    I'm so glad it's back up. XD

    Weird how a server would go down like that. I have been in a few server rooms to work on their AC units and found all of them using state-of-the-art, stupid-proof equipment.

  • Alt-Factorial
    July 2015 Member 0 Permalink

    so glad that it's available again

  • J23PowderToy
    July 2015 Member 0 Permalink

    I hope there will be no more problems with servers, thank you for fixing that ^^

  • Mrprocom
    July 2015 Member 0 Permalink
    Yeah, thanks for doing that.
    Also, I blame the leap second, though its effect was a little bit late.
    Edited once by Mrprocom. Last: July 2015