Outage and an irritating lack of support

Written By: - Date published: 9:07 am, April 9th, 2013 - 15 comments
Categories: admin, The Standard - Tags:

As I’m sure that many of you are aware, we had an outage commencing at about 2205 NZST last night and finally getting fixed at about 0825 this morning.

I got notified about 10 minutes into the outage by my various ping systems that the system was down. The main server with the database was just completely off the net, including from web consoles. So I checked everything that I could, including requesting that the system reboot, to no avail. Meanwhile the supplier’s own system had noticed that the server was inaccessible and informed me. So I sent off an email and received an automated acknowledgement of receipt.
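For readers wondering what a check like that involves, here is a minimal sketch in Python. It is an illustration only, not the monitoring actually used for this site: the function name `is_reachable`, the default port and the timeout are all hypothetical choices. It simply tries to open a TCP connection and reports whether that worked.

```python
import socket


def is_reachable(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; a real monitor would also retry
    and alert only after several consecutive failures.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

A scheduler would call something like this every minute or so from a machine outside the hosting provider, so that a failure of the whole node (as happened here) does not take the monitor down with it.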

I was questing to find out how far the problem extended. On the network everything was there up to the last step to the server. It was quite clear that at least a bank of other servers at the same node had also gone dark, which pointed to a structural problem that should have been making a big noise on their boards.

I figured that it was nearly midnight, most of the readers would have gone offline, and the supplier would know of the problem from my email at least. I’d check later in the morning to find out when it went back online. So I suppressed the server alarms from my cellphone and went to sleep.

Arggh, woke up and there was no response and no server. So I rang their polite phone support monkey service at 0630, who said that they’d make sure someone knew about it. I finally got a response from support that arrived at 0707, about 8 hours from my first support email and about half an hour after I phoned.

Apologies for not getting back to you earlier. The underlying hosting
environment had to be reinitialized.

Why do I get the distinct impression that was the first time they’d realised that they had a problem? Don’t they read their frigging emails? If they had then they could have rebooted eight hours earlier.

Sure, I’m a penny pinching sysop. But this server still generates several thousand dollars a year in revenue for the supplier and I only expect them to run the infrastructure. I don’t expect gold-plated support, but I do expect people to at least read the frigging emails I send them and to respond to them within hours.

Anyway, by that point I was engaged in warming a backup system and refreshing it with the latest available data from 1700 the previous day. I’m going to have to increase the frequency of database backup deltas. I may even reimplement database replication now that we’re out from under Southern Cross Cable’s horrendously expensive data charges.
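To put a number on why the backup frequency matters, here is a small worked sketch in Python. The times come from this post (last backup at 1700, outage starting at about 2205); the calendar date and the function name `data_loss_window` are assumptions made for the example.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, outage_start: datetime) -> timedelta:
    """Return how much recent data a restore from last_backup would be missing."""
    if outage_start < last_backup:
        raise ValueError("outage_start must not be earlier than last_backup")
    return outage_start - last_backup


# Times from the post; the date (8 April 2013) is assumed from the publish date.
last_backup = datetime(2013, 4, 8, 17, 0)
outage_start = datetime(2013, 4, 8, 22, 5)

window = data_loss_window(last_backup, outage_start)
print(window)  # 5:05:00
```

In other words, a failover to the backup system at that point would have been missing about five hours of posts and comments. With deltas taken every hour, the worst case would shrink to roughly an hour.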

But I’m starting to look for a more support responsive primary server provider.

 

15 comments on “Outage and an irritating lack of support”

  1. karol 1

    I understand your frustration, Lynn.

    Thanks for getting the site back up again so soon. I only noticed it was offline this morning, when I was planning to do my Thatcher post. I actually wrote up the post on a word document while the site was offline, and it came back up when I was ready to post it.

    However, such inaction from the servers’ people gives you unnecessary stress and work, and is a disruption for people wanting to comment on and/or read stuff on TS. And as you say, it’s not good business on behalf of the people who get money from this site.

    Thanks Lynn for your dedication to keeping this blog online.

  2. freedom 2

    my guess is many of us here simply suspected Big Brother took it down because of all the seedlings of change that have sprouted thanks to the tireless efforts of The Standard’s (and others’) green thumb of reality 🙂

  3. infused 3

    Should maybe not host overseas then? You kept going on how cheap it is – well now you know why.

    I’ll host it for cheap 😉

    And I was trying to get to it. Kept getting cloudflare, which I think is useless by the way. It’s generally far out of date. But I guess it’s better than nothing.

    • lprent 3.1

      The servers locally are just as cheap and have service that isn’t that much different (I had an overnight outage last year that lasted about 12 hours because of hardware failure and 8 hours before getting a support response). But only provided you do not have a lot of overseas traffic.

      I wasn’t fleeing the NZ hosts. After all our audience is 95% from NZ (which is why I moved it back here in the first place). I was fleeing the cost of having Google, Bing, Facebook and other bots reading the site and unpredictably ratcheting up our overseas data charges at $3-1/Gb (price keeps coming down but the bots keep getting more hungry) by hundreds of dollars per month.

      If I could get a flat rate for 600+Gb of traffic per month in NZ with someone I trust then I’d happily use it.

      Cloudflare is mostly there to serve the images, css and js as a CDN and to thereby take part of the server load.

      • infused 3.1.1

        Easy. I run local hosting here. Hardware failure shouldn’t be an issue these days, especially not a 12 hour outage.

        Not that you’d probably want to host it with me anyway, but I’d need to know exactly what resources you were paying for, and how much inbound international traffic you have.

        • lprent 3.1.1.1

          Hardware failure shouldn’t be an issue these days…

          … and it wasn’t. However simply getting the attention of the support people accounted for 9 of those twelve hours. Then there was the usual remote testing, getting someone into the rack to look at it, followed by testing there, and then replacement took less than half an hour.

          The problem is seldom getting the problem fixed. Most of the time it is getting someone to look at the frigging thing.

          Not that you’d probably want to host it with me anyway,

          Probably not.

          …but I’d need to know exactly what resources you were paying for, and how much inbound international traffic you have.

          And that is rather the point (which if you’d actually read my previous comment would have been obvious).

          I can tell you that the total international traffic will be between ~15GB/mo and ~200GB/mo because that is the range this site has had since mid 2011. The amount of inbound traffic is probably smaller than that – but probably pretty significant. And what frigging difference does inbound make anyway? The charges are always for international traffic totals as far as I’m aware.

          The interest from overseas isn’t as steady as the local traffic which is pretty damn steady at between 50GB/mo and 100GB/mo.

          To handle the traffic spikes I need between 2 (local) and 4 (local+international bots) CPU cores and 1-2GB of RAM (the latter is mostly to hold PHP files and most recent database queries in APC). Never bothered to look at the diskspace since I’ve never found a resource limit.

          Quite simply the amount of overseas traffic depends on how the webbots are screwing up in a particular month. There was one month when facebook had a problem with their caching of likes and every person seeing a like on every page read reloaded the whole post.

          It’d be easy enough to have a local server and an overseas server. But why bother? Easier to just move out of the range of the dumbarse monopoly pricing from Southern Cross Cable. It does mean that we push the cost indirectly onto the local readers, but it also means that we get foreseeable bills.

          • infused 3.1.1.1.1

            My point is, you shouldn’t need to get the attention of support people. That’s shit service. That’s what System Center and Ops manager are for. Obviously they are not using these tools. Having to reboot their VM Infrastructure? Sounds like they are running it on Xen or something equally as crap.

            We don’t charge for outbound international. Hence my question.

            • lprent 3.1.1.1.1.1

              I agree about the service. Wasn’t that happy about the local services I have used as well. However I’m also extremely interested in keeping costs down. I do have a couple of places that I have used before who are responsive. I suspect I will be heading back to one of them.

              Strategically the idea is that if we have to dump advertising and run completely off donations then the costs need to be *low*. This allows us to not be beholden to anyone, which allows us to write whatever we want to and put in authors who are going to piss people off. So that gives an upper limit on what level of service I’m prepared to pay for.

              We don’t charge for outbound international.

              Interesting. I searched for something like that as that would largely get rid of my issues with overseas traffic costs. Didn’t find it. All of the sites I looked at showed “international traffic” regardless of direction.

    • felix 3.2

      Cloudflare is a pain in the arse.

  4. prism 4

    Now Trade me is out – all I get is a revolting circle but finally, the site! Dah dah. Boomkeesh. No, their logo on the search line but the page says The service is unavailable. This on top of continuing problems since the last ownership change. When the rich kids were passing the Trme parcel at the party some of the goodies must have dropped out.
    Now trade me is up – I don’t know whether I’m Arthur or Martha.

  5. Mark 5

    Could be a few reasons, IMHO..

    Someone is on strike..
    Someone would prefer to be with their family than worry about your little issues..
    The irony of your not using local services became too much and it crashed due to this.
    You haven’t paid your bill
    The volume of bullshit on here overwhelmed everything.
    Dotcon’s theft of IP to sell for untaxed profit has resulted in DOS to his supporters.
    Vast RWNJ conspiracy to shut down free speech.
    You are not the world class SYSOP you think you are.
    🙂

    [lprent: Yeah right.

    And I know that you really are just a myth-making moronic dickhead. It takes a special kind of stupidity to attempt to put boasts in someone else’s mouth, but evidently you are the kind of pathetic little arsehole who does.

    Programming code in C++ is what I’m good at. Running this site is merely something I do in whatever spare time I have that isn’t taken up with more important tasks. ]

    • ghostrider888 5.1

      to be frank, and vernacular, ‘Mark’, why don’t you Fuck, yes, Fuck, Off back to your porn sites!

  6. xtasy 6

    I noticed there were server issues, but as it was late, yes, due to that, I decided to call it a night in front of the computer and prepare for bed.

    Such things happen, and I was relieved this morning, to see The Standard is again (and still) up and running.

    Thanks lprent

  7. lprent 7

    Damn – another 15-20 minutes of outage. Looks like the same reason.

    Update: Looking at the lack of comments it looks like the site has been inaccessible since ~1115 (pingdom says since 1119). I guess I know what I will be doing this weekend. Moving the server to somewhere a bit more stable and responsive.

The server will be getting hardware changes this evening starting at 10pm NZDT.
The site will be off line for some hours.