This site gets rather a lot of comments. Every time a comment is made, the page it appears on has to be regenerated, which limits how much static caching we can do. So on a dynamic site like this one we need rather a lot of raw processing power to generate, on the fly, pages that keep changing throughout the day.
But we like to keep the site as simple, as up to date, and as secure as possible. So when comments arrive, the site updates itself.
Because of the current design of the pages (something that is always under review), whenever a comment is posted, every page carrying the list of recent comments becomes invalid. Since that list appears on every page to assist with navigation, the next time a page is read by anyone it is recalculated and laid out again. The server will cache that page, but only for the particular type of browser it was destined for, because not all pages are served up the same: what gets sent depends on what type of browser asked for it. So a single page may be regenerated many times.
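The caching and invalidation behaviour described above can be sketched roughly like this. This is a hypothetical minimal model, not the site's actual code: pages are cached per URL and per coarse browser class, and a new comment throws the whole cache away because the recent-comments list appears on every page.

```python
def browser_class(user_agent: str) -> str:
    """Collapse the User-Agent string into a coarse rendering class.
    The classes here are illustrative, not the site's real ones."""
    ua = user_agent.lower()
    if "mobile" in ua:
        return "mobile"
    if "msie" in ua:
        return "ie"
    return "standard"

class PageCache:
    """Cache rendered pages keyed on (path, browser class), since the
    same URL is rendered differently for different browser types."""

    def __init__(self):
        self._pages = {}  # (path, browser_class) -> rendered HTML

    def get(self, path, user_agent):
        return self._pages.get((path, browser_class(user_agent)))

    def put(self, path, user_agent, html):
        self._pages[(path, browser_class(user_agent))] = html

    def invalidate_all(self):
        # A new comment changes the recent-comments list on every page,
        # so every cached variant of every page is stale at once.
        self._pages.clear()
```

The key point the sketch captures is the multiplication: one page has one cached copy per browser class, so a single invalidation can trigger many regenerations.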
That requires a significant amount of processing power when a lot of people are commenting at the same time as a lot of people are reading the site. Outside of the middle of the night, those demand levels tend to be quite unpredictable.
It’d be pretty easy to simply get bigger servers that could handle all of the possible demand. But how big is enough? We have had times when a story was breaking and the load on the server was, for short periods, more than 10 times the usual level. We’ve had times when those peaks have killed the site even running on a powerful 8-core system. And during an election year we usually get a lot more traffic than we have had before.
So since late last year we have been running the web-facing parts of this site on cloud servers, and setting up the infrastructure required to run multiple servers with common file, database, and page/query cache services.
This year we’ve been steadily getting an auto-scaling system working, triggered largely by CPU levels. For a single webserver we use something rather like a dual-core laptop: not massively fast, but capable of handling a moderate amount of simultaneous traffic. This is our quantum of capacity.
When the average processing usage across the existing web servers rises above a set value for more than a few minutes, a new server of that type is automatically hired from Amazon, equipped with our standard disk image, and thrown onto the load balancer. If the demand keeps rising, so does the number of servers in use.
Conversely, when the average processing usage drops below a certain value for a long period, a server is dropped, and this keeps happening until the remaining servers are working productively. The dropped servers go off to do some other kind of task for someone else.
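The scale-up and scale-down rule can be sketched as a simple decision function. The thresholds and time windows below are assumptions for illustration, not the site's actual settings: add a server when average CPU stays above a high-water mark for a few minutes, drop one only after a much longer quiet period below a low-water mark.

```python
SCALE_UP_CPU = 70.0    # percent; an assumed high-water mark
SCALE_DOWN_CPU = 25.0  # percent; an assumed low-water mark
UP_SAMPLES = 3         # minutes above threshold before scaling up
DOWN_SAMPLES = 15      # a much longer quiet period before scaling down

def scaling_decision(cpu_history, n_servers, min_servers=1, max_servers=35):
    """Return +1 (add a server), -1 (drop one), or 0 (hold), given
    per-minute average CPU percentages across the fleet."""
    # Scale up quickly: a short sustained burst above the high mark.
    if len(cpu_history) >= UP_SAMPLES and all(
            c > SCALE_UP_CPU for c in cpu_history[-UP_SAMPLES:]):
        return 1 if n_servers < max_servers else 0
    # Scale down slowly: a long sustained lull below the low mark.
    if len(cpu_history) >= DOWN_SAMPLES and all(
            c < SCALE_DOWN_CPU for c in cpu_history[-DOWN_SAMPLES:]):
        return -1 if n_servers > min_servers else 0
    return 0
```

The asymmetry is deliberate: scaling up fast protects the site during a traffic spike, while scaling down slowly avoids thrashing when load hovers near the threshold.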
This kind of on-demand system is damn hard on people because they really don’t like chopping and changing all of the time as it is intensely unproductive to keep retraining people for different tasks. But computers are machines. You give them a new filesystem and they are trained. The hard bit is creatively figuring out what should be on that filesystem – a task that humans are usually pretty good at.
Anyway, I thought I’d share what our webservers have been doing over the last couple of weeks. Bear in mind that the times on this graph are in UTC (Universal time or Greenwich time) so they are currently about 12 hours off NZ time.
You can see the strong diurnal cycle (click through the image for a higher-resolution version if you can’t). The servers drop down to two or even one overnight. You can also see the weekends and public holidays, where during the day we run at three and sometimes four servers. During weekdays we’re usually running at four and sometimes five.
Each server costs about 2.8c per hour to run (with variable costs like bandwidth on top), so while no single server costs much, it adds up over a month, and shedding servers overnight saves real money. With a voluntary site like this, all money saved is pretty useful.
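The arithmetic behind the overnight saving is simple. The figures below use the quoted 2.8c per server-hour and an assumed usage pattern (four servers by day, one overnight); they are illustrative only and exclude bandwidth and other variable costs.

```python
RATE = 0.028  # dollars per server-hour, the quoted rate

def monthly_cost(servers_by_hour):
    """Total cost for a list of per-hour server counts."""
    return sum(servers_by_hour) * RATE

# Running four servers around the clock for 30 days:
flat = monthly_cost([4] * 24 * 30)                 # about $80.64
# Versus dropping to one server for 8 overnight hours each day:
scaled = monthly_cost(([1] * 8 + [4] * 16) * 30)   # about $60.48
```

At these assumed numbers, autoscaling overnight saves roughly $20 a month on the web tier alone, which matters on a volunteer budget.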
But you can also see short bursts where we are running at six, seven or even eight servers when a story goes active and everyone piles into the site. And this fortnight has been a quiet period, news-wise.
I’ve currently capped the maximum number of web servers at 35, based on the potential load on the database and file servers. Hopefully that will be enough to cope with the upcoming demand. Besides, I can always start autoscaling those if I really have to…