Written By: lprent
Date published: 4:52 pm, April 18th, 2014 - 8 comments
Categories: admin, The Standard
Tags: technology
This site gets rather a lot of comments. Every time a comment is made, the page it is on has to be regenerated, which limits the amount of static caching that can be done. So on a dynamic site like this one we require rather a lot of raw processing power to dynamically generate pages that keep changing all day.
There are a number of solutions to this issue: offloading the comments to an external supplier like Disqus (with the consequent issues around privacy and control), aggressively caching pages even when new comments come in (causing out-of-date pages), or separating the comments into a dynamically loaded system (requiring extra client-side javascript).
But we like to make the site as simple as possible, as up to date as possible, and as secure as possible. So when comments arrive, the site updates its pages.
Because of the current design of the pages (something that is always under review), this means that whenever a comment is posted, every page carrying the list of recent comments becomes invalid. Because that list is on every page to assist with navigation, the next time a page is read by anyone it is recalculated and laid out again. While the server will cache that page, it does so for the particular type of browser the page was destined for, because not all pages are served up the same; it depends on what type of browser asked for them. So a single page may be regenerated many times.
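To make that concrete, here is a rough sketch in TypeScript of the idea (not our actual cache code – the device classes and key format are just illustrative): the cache key includes a device class derived from the browser, so one URL ends up cached as several variants, and a new comment has to knock out all of them.

```typescript
// Minimal sketch, not the site's actual cache code: a page-cache key that
// varies by device class, so one logical URL is cached as several variants.
type DeviceClass = "desktop" | "mobile" | "tablet";

function deviceClass(userAgent: string): DeviceClass {
  // Crude, purely illustrative classification.
  if (/iPad|Tablet/i.test(userAgent)) return "tablet";
  if (/Mobile|Android|iPhone/i.test(userAgent)) return "mobile";
  return "desktop";
}

function cacheKey(url: string, userAgent: string): string {
  return `${url}|${deviceClass(userAgent)}`;
}

// When a comment is posted, every cached variant of an affected page has to go.
function invalidate(cache: Map<string, string>, url: string): void {
  for (const key of cache.keys()) {
    if (key.startsWith(`${url}|`)) cache.delete(key);
  }
}
```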
That requires a significant amount of processing power when there are a lot of people commenting at the same time as there are a lot of people reading the site. Outside of the middle of the night, those demand levels tend to be quite unpredictable.
It’d be pretty easy to simply get bigger servers that could handle all of the possible demand. But how big is enough? We have had times when a story was breaking and the load on the server was, for short periods, more than 10 times the usual load. We’ve had times when those peaks have killed the site running on a powerful 8-core system. During an election year we usually get a lot more traffic than we have had previously.
So since late last year we have been running the web facing parts of this site on cloud servers and setting up the required infrastructure to allow us to run multiple servers with common file, database, and page/query cache services.
This year we’ve been steadily getting an auto-scaling system working, triggered largely by CPU levels. For a single webserver we use something that is rather like a laptop with a dual-core CPU. Not massively fast, but capable of handling a moderate amount of traffic simultaneously. This is our quantum.
When the average processing usage across the existing web servers rises above a set value for more than a few minutes, a new server of that type is automatically hired from Amazon, equipped with our standard disk image, and thrown onto the load balancer. If the demand keeps rising, so does the number of servers in use.
Conversely, when the average processing usage drops below a certain value for a long period, a server is dropped. This keeps happening until the remaining servers are working productively. The dropped servers go off to do some other type of task for someone else.
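Roughly speaking, the scaling rules look like the sketch below (TypeScript, purely illustrative – the thresholds, timings, image name, and the `Cloud` interface are stand-ins rather than our real configuration):

```typescript
// Illustrative sketch of the scaling rules described above. The thresholds,
// timings, and the Cloud interface are assumptions, not our real config.
interface Cloud {
  runningServers(): number;
  averageCpuPercent(): number;              // averaged across the web servers
  addServerFromImage(image: string): void;  // hire, image, attach to load balancer
  dropServer(): void;                       // detach and release one server
}

const SCALE_UP_CPU = 70;    // % average CPU that triggers growth
const SCALE_DOWN_CPU = 25;  // % average CPU that allows shrinking
const MIN_SERVERS = 1;
const MAX_SERVERS = 35;     // capped by what the database/file servers can take

// minutesHigh/minutesLow: how long the average CPU has been continuously
// above or below the thresholds.
function evaluate(cloud: Cloud, minutesHigh: number, minutesLow: number): void {
  const cpu = cloud.averageCpuPercent();
  const n = cloud.runningServers();

  if (cpu > SCALE_UP_CPU && minutesHigh >= 5 && n < MAX_SERVERS) {
    cloud.addServerFromImage("standard-web-image"); // hypothetical image name
  } else if (cpu < SCALE_DOWN_CPU && minutesLow >= 30 && n > MIN_SERVERS) {
    cloud.dropServer();
  }
}
```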
This kind of on-demand system is damn hard on people, because they really don’t like chopping and changing all of the time, and it is intensely unproductive to keep retraining people for different tasks. But computers are machines. You give them a new filesystem and they are trained. The hard bit is creatively figuring out what should be on that filesystem – a task that humans are usually pretty good at.
Anyway, I thought I’d share what our webservers have been doing over the last couple of weeks. Bear in mind that the times on this graph are in UTC (Universal time or Greenwich time) so they are currently about 12 hours off NZ time.
You can see the strong diurnal cycle (click through the image for a higher-resolution one if you cannot). The servers drop down to two or even one overnight. You can also see the weekends and public holidays, where during the day we run at three and sometimes four servers. During weekdays we’re usually running at four and sometimes five servers.
Each server costs about 2.8c per hour to have (with variable costs like bandwidth on top), so while it doesn’t cost much per hour, it adds up over a month, which makes saving the overnight costs worthwhile. With a voluntary site like this, all money saved is pretty useful.
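As a back-of-envelope example (the overnight hours and server counts here are rough assumptions based on the graph, not exact figures), dropping from four servers to one for eight hours a night at 2.8c per server-hour works out to something like this:

```typescript
// Back-of-envelope savings from scaling down overnight, using the 2.8c/hour
// figure above. The hours and server counts are rough assumptions.
const hourlyRate = 0.028;      // per server-hour, bandwidth extra
const daytimeServers = 4;
const overnightServers = 1;
const overnightHours = 8;

const nightlySaving =
  (daytimeServers - overnightServers) * overnightHours * hourlyRate;
const monthlySaving = nightlySaving * 30;

console.log(nightlySaving.toFixed(2));  // ≈ 0.67 per night
console.log(monthlySaving.toFixed(2));  // ≈ 20 per month
```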
But you can also see peaks for short bursts where we are running at six, seven or even eight servers when a story goes active and everyone piles into the site. And this fortnight has been a quiet period news-wise.
I’ve currently capped the maximum number of web servers at 35, based on the potential loads on the database and file servers. Hopefully that will be enough to cope with the upcoming demand. Besides, I can always start autoscaling those if I really have to…
Sorry, sometimes I hit F5 too much. Mostly when felix or Sanc are trolling Syrlands or Pete G.
It’s my online crack habit…
It’s irresistible.
I don’t care. It is literally what the system was designed for.
BUT don’t do it more often than 15 times in a minute. If you do then you will find the site abruptly slows down on you for the next two hours because Wordfence now treats you as a pain-in-the-arse. I find that this seems to induce the change in behaviour that I’m after.
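For the curious, the throttle behaves roughly like this sketch (not Wordfence’s actual code – just the same penalty-box idea): more than 15 hits inside a minute and the IP gets slowed down for the next two hours.

```typescript
// Minimal sketch of the kind of throttle described above, NOT Wordfence's
// actual implementation: more than 15 hits inside a minute puts an IP in
// the penalty box for two hours. The caller then slows or blocks requests.
const WINDOW_MS = 60_000;
const LIMIT = 15;
const PENALTY_MS = 2 * 60 * 60_000;

const hits = new Map<string, number[]>();     // ip -> recent hit timestamps
const penalised = new Map<string, number>();  // ip -> penalty expiry time

function isThrottled(ip: string, now: number = Date.now()): boolean {
  const until = penalised.get(ip);
  if (until !== undefined) {
    if (now < until) return true;
    penalised.delete(ip);                     // penalty has expired
  }

  const recent = (hits.get(ip) ?? []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  hits.set(ip, recent);

  if (recent.length > LIMIT) {
    penalised.set(ip, now + PENALTY_MS);
    return true;
  }
  return false;
}
```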
Great work Lynn. And thank goodness I don’t have to use Disqus – which simultaneously destroys the small measure of privacy I use and slows comments to a crawl.
So if I read the above correctly it sounds as if the majority of your problems in terms of scalability are due to the frequent cache invalidations caused by new comments? I would be interested in hearing more about why you don’t want to load them dynamically – the additional client coding sounds like a relatively minor tradeoff. It wouldn’t solve the problem of caching multiple versions of the page for different clients but you would be able to hold them in cache longer.
Oh I agree that it’d appear to make the site simpler to run. However the coding/testing complexity of doing it is a pain when you look at the detail. And it wouldn’t make much difference in terms of actual operations because of the site layout. We’d have to do a lot more coding to make it work.
Depends what we’re talking about of course….
But if I dynamically loaded the comments on the right tab then I’d get benefits for the pages that weren’t getting comments added to them and weren’t the front page.
However at any given time between a quarter and a half of our page views are of the front page. A prominent feature on the front page is the comment count. Every comment causes that page to regenerate. Of course that count could also become dynamic – but that just doubles the dynamic code to test and maintain.
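To give an idea of what “dynamic” means in practice, making just that count dynamic would need something like the client-side sketch below (the /comment-counts endpoint and markup selectors are hypothetical, and a matching server-side handler would also have to be written and maintained):

```typescript
// Rough sketch of a dynamic front-page comment count on the client side.
// The /comment-counts endpoint and the data attributes are hypothetical;
// a matching server-side handler would also need writing and maintaining.
async function refreshCommentCounts(): Promise<void> {
  const response = await fetch("/comment-counts?posts=front-page");
  if (!response.ok) return;  // fail quietly and keep the stale counts
  const counts: Record<string, number> = await response.json();

  for (const [postId, count] of Object.entries(counts)) {
    const el = document.querySelector(`[data-post-id="${postId}"] .comment-count`);
    if (el) el.textContent = String(count);
  }
}

// Poll every minute; each poll is another request the servers must answer.
setInterval(refreshCommentCounts, 60_000);
```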
At any given time between a quarter and a half of our page views are usually on a single post. This is usually because that particular post is getting a lot of comments, a lot of reads by people commenting, and a lot of reads by the substantial lurker population who read comments closely.
During the day, about a fifth of the page views are from search engines plowing their way through the nearly 14k posts and 700k comments. We help them as much as we can by telling them where the site is active. But they are a very important source of new readers to the site.
This means that most of the pages being read through the day and evening are the most dynamic pages anyway. Usually inside a full day something like 75% of the page views will fall on fewer than 10 pages. These will also be the pages that have the most comments being made (or, in the case of the front page, comments being reported).
To be effective, we’d have to make the same code changes work across ALL of the browsers that access this site. As well as the browsers for the desktop versions of Windows, Mac and Linux, that includes the ever-increasing number of variants being used on mobiles (14% of page views). Just testing those on every one of the frequent code updates for WordPress and plugins beggars belief. The only real way to do it would be to write automated testers – which would be a hell of a coding chore.
And of course WordPress directly has no support for dynamic comments, which means that we’d be on our own for testing updates across all of those target platforms. The testing would have to be done with every WordPress update.
There are a number of possible alternative systems that do get their own testing done (Disqus for instance).
But this is a site that gets quite a lot of comments, many of them large and just as interesting as the posts. There is a considerable SEO benefit in optimizing for them – which we have done. So we’d also have to do something about how the search engines access the comments. For instance, I ran your first sentence through Google and got an immediate hit. If you do the same on a Disqus site like Whaleoil, the same will not happen. As far as I can tell (a notoriously difficult thing to measure), about a third of the retained new readers come in after querying something that got a strike on a comment.
There are a whole lot of other issues as well (like the ever-shifting jQuery/Ajax systems). Suffice it to say that doing it with servers appeared to be a lot simpler than doing it with code, because the returns from doing it with code are a lot less than you’d expect, and the cost of the code is a whole lot higher and less transportable.
Roll on HTML5, which will make some of this easier to do inside WordPress itself.
Interesting looking at the Shane Jones effect on the servers today.
Click the image for a larger view. The times are in UTC, so they are 12 hours out. So the Shane Jones story went up just after 06:00.
The load peaked at about 7pm at approximately 3.5x the usual weekday midday maximum, started falling back to under twice the usual load at about 9ish, then spiked again with the evening news, and dropped rapidly after that.
No page errors were reported from the client to the load balancer, but the latency was somewhat spiky as web servers were being added.
I need a bigger story to really test the spike loading on the site. It didn’t really touch the database or file server. Comments kept pouring in without any problem apart from some moronic trolls unaware of the first comment policy…
I’m happy with the performance. Especially for a peak load costing about 17-18c per hour…
The traffic is certainly heating up.
The direct traffic out last month for webpages (excluding the static images, javascript, and CSS) was 309GB (the short month of Feb was 319.996 GB). So far this month, as at the 22nd, we’re at 338GB. I guess election year is underway.
Good to see that the changes I made at the end of last month are bringing all of the other costs down.