Routing issues

Written By: - Date published: 11:08 am, April 1st, 2011 - 9 comments
Categories: admin, The Standard - Tags:

We have had some routing problems this morning with several networks used by our provider in San Diego. The problem has been ongoing, with the network links appearing and disappearing from the perspective of NZ and Australia (there is less of an issue from other overseas networks). They’re working on it. It started just before 7am and, as of about 11am, is starting to look stable.

Ironically, I’m just waiting for some due diligence to be done on a local supplier for the election-year server – ie that they won’t just fold if a malicious complaint is made. The current warm backup system can’t handle the load (it was having problems handling the comments being fed to it), so it is currently off. The new server was due to be configured this weekend, but that is looking less likely because it is unlikely to be provisioned before Monday.

And no, this isn’t an April Fools’ joke… But since we last had a failure almost exactly a year ago, at the end of March running over April 1, I’m starting to get superstitious.

9 comments on “Routing issues”

  1. Rich 1

    Access seems fine right now.

    Also, I use Dreamhosts. Cheap, reliable, good customer service, and a reasonable AUP. I’m not sure how robust they are to takedown complaints, but their policies seem quite firm (http://abuse.dreamhost.com/libel/, for instance).

    Also, if you need a backup site, I may be able to provide something for free. Email me?

    • lprent 1.1

      Will do.

      These days the primary server is running on a reasonably recent dual-core dedicated Xeon that started out running well under 20% normal peak and 40% spiking CPU during the day last March, and now runs 40% normal peak and 80% spiking. The spiking typically happens when posts are being made and the SEO notifications kick off demand, while background processes like the search indexing are running and multiple comments are coming in at once. Unfortunately I have to do capacity planning based on the spikes rather than the normal load, and it is difficult to prevent spiking in software (ie it is an OR queuing problem). So I need more cores and more CPU cycles; there is a rough sketch of that spike arithmetic at the end of this comment.

      The idea is to move to a much higher-spec system in NZ and leave our existing US system running in the warm backup role. The latter should have enough grunt in an emergency to carry us through the election-year growth (probably about another 50% in page views) if I turn off some of the background processes.

      If we can’t find something suitable in NZ, then it is pretty easy to get offshore servers with more cores and more CPU cycles and where we don’t have the potential malicious legal issues we have in NZ.
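
      For what it’s worth, the spike-based sizing boils down to rough arithmetic. The sketch below (Python) plugs in the 80% spike figure and the roughly 50% election-year growth from this comment as assumptions, with a 75% headroom target that is just a placeholder, so treat it as an illustration rather than a real capacity plan.

        # Back-of-the-envelope spike-based capacity estimate.
        # Inputs are illustrative: ~80% spiking CPU on a dual-core box and
        # roughly 50% more traffic expected in election year.

        def cores_needed(current_cores, spike_utilisation, growth, headroom=0.75):
            """Size for the spikes, not the average: scale the spike load by
            expected growth and pick enough cores to keep spikes under `headroom`."""
            spike_load = current_cores * spike_utilisation   # core-equivalents used at spike
            projected = spike_load * (1.0 + growth)          # after expected traffic growth
            return projected / headroom                      # cores needed to stay under headroom

        print(cores_needed(current_cores=2, spike_utilisation=0.8, growth=0.5))
        # -> 3.2, i.e. a dual-core box is marginal; four cores leaves some margin.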

  2. lprent 2

    FYI from the hosting company:

    We wanted to provide you with an update about the network connectivity
    issues we were seeing today. As of approximately 3pm PT the network has
    returned to normal and we are receiving assurances from our network
    provider that the situation should remain stable. Latency will also
    improve as additional carriers are brought online.

    As to what happened, we are still not being given a clear picture. The
    engineers working on the issue will provide a detailed explanation as soon
    as the situation has been fully resolved and understood. We are more than
    happy to pass that engineering report along to you (please just let us
    know you’d like to receive it.) What we’ve been told is there was a
    routing issue with one of our carriers (XO) which propagated to our other
    carriers – we have connections from multiple backbone providers to prevent
    problems or outages with one carrier from causing a complete outage. We
    realize that has absolutely not been the case today, and we will be
    receiving an explanation and plan for how this can be avoided in the
    future.

    This type of disruption is somewhat sporadic in that traffic from some
    ISPs and networks was affected, while traffic from others was unaffected.
    Also, as the routes were restored via backup carriers, you may have
    noticed that the connection would work briefly and then go down again. There
    may be some more of these hiccups as this is brought to resolution in the
    next 12-24 hours; however, we expect those to be momentary (on the order of
    a few seconds).

    Please know we will continue to monitor this situation very closely and we
    will be back in touch if further issues arise.

    Not quite as bad as the hosting companies I have had for different operations.

    One was in Florida the day that they had a really good hurricane come through. They had multiple failures in their connection providers, resulting in only a single connection staying active. The volume of traffic was such that it kept overloading. After that we always kept the servers separated across different sites, and the sites were selected for their lack of expected natural disasters.

    But of course we had a different hoster where a contractor working for a company down the street rammed a metal girder straight through their high-voltage underground power line. The power surge was sufficient to fry the switches that were meant to divert them to the battery backup system (and the generators that were meant to feed it). Consequently our servers at that site were out for a day while they rewired everything.

    Disaster planning is a bit of a pain when traffic keeps rising. You usually find out it doesn’t work when the systems fail. Of course you could do what I did last month. I forced a test failure and found out that the warm backup couldn’t even get close to handling normal loads.

    Anyway, I think I’d better order that server today. They may be able to provision it tomorrow. Otherwise it will be a week before I have time to do the setup.

  3. chris 3

    Use linode. Their VPS service is amazing and cheap! Steer clear of dreamhost; as a programmer I would expect nothing less than self-configured servers 😛

    • lprent 3.1

      We can’t survive on VPSes. If you look at the ‘Online’ tab on the upper right you will see why. That is a picture of the number of readers (including the spiders) on the site within a couple of minutes. During our day that usually sits between 40 and 60, but it goes up as high as 150 when comments spike. Overnight it drops down to 10-15 base load because that is when many of the spiders pick up their data steadily reading their way through the 7000 odd posts and 250,000 odd comments that are online.

      Remember that these are unique IPs. There are many, many connections within that count for all of those little graphics, CSS and JS.

      We have been booted off three VPSes for having too much traffic so far. The last one was running purely as a warm backup storing comment and post updates. Right now we’re on the verge of having to start spreading across several servers to handle traffic, which is what I intend to set up when the new server comes online: NZ readers will read the NZ server, and overseas users (mostly spiders) will read the US server. There’s a toy sketch of that split below.
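
      Conceptually it is just a routing decision based on where the reader comes from. The toy sketch below only shows the shape of it; the hostnames are placeholders, and in practice the split would live in GeoDNS or a load balancer rather than in application code.

        # Toy illustration of the planned NZ/US split: NZ readers get the NZ box,
        # everyone else (mostly spiders) gets the US warm backup.
        # Hostnames are placeholders, not real servers.

        NZ_ORIGIN = "nz.origin.example"
        US_ORIGIN = "us.origin.example"

        def pick_origin(country_code: str) -> str:
            """Map a GeoIP country code to the origin server for this request."""
            return NZ_ORIGIN if country_code.upper() == "NZ" else US_ORIGIN

        print(pick_origin("nz"))  # nz.origin.example
        print(pick_origin("au"))  # us.origin.example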

  4. Lanthanide 4

    “Overnight it drops down to 10-15 base load because that is when many of the spiders pick up their data steadily reading their way through the 7000 odd posts and 250,000 odd comments that are online.”

    Isn’t there some way to force the spiders not to re-cache this stuff all the time? Most of those pages and comments won’t be changing very frequently (if ever), so they should be able to index it once and then not need to come back for a couple of months, and only then just to check that the pages are still valid.
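
    (Presumably something like HTTP conditional requests: if the pages send a Last-Modified or ETag header, a well-behaved spider can revalidate with If-Modified-Since and get a cheap 304 back instead of the whole page. A minimal sketch of that handshake, assuming Python’s third-party requests package and a placeholder URL:)

      # How a polite crawler avoids re-downloading an unchanged page:
      # remember the Last-Modified value and revalidate with If-Modified-Since.
      # Needs the third-party 'requests' package; the URL is a placeholder.
      import requests

      url = "http://example.com/some-old-post"

      first = requests.get(url)
      last_modified = first.headers.get("Last-Modified")

      # Second visit: send the validator back; an unchanged page should return
      # 304 Not Modified with no body, which is much cheaper for the server.
      conditional = {"If-Modified-Since": last_modified} if last_modified else {}
      second = requests.get(url, headers=conditional)

      if second.status_code == 304:
          print("Not modified - nothing re-downloaded, cached copy still valid")
      else:
          print("Page changed (or no validators offered) - full download again")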

  5. BLiP 5

    What’s that cdn.topsy thing that can take f o r e v e r to load?

The server will be getting hardware changes this evening, starting at 10pm NZDT.
The site will be offline for some hours.