Written By:
lprent - Date published:
6:00 am, June 21st, 2015 - 16 comments
Categories: blogs, David Farrar -
Tags: dodgy statistics, kiwiblog
I was just running through this site's numbers for the last few weeks of the first real month of winter*. Then, as usual, I went off to have a look at how the other political sites are going.
Either Kiwiblog had a very large social media blip starting on the 19th, running into the 20th, and still going in the early hours of the 21st, or something is really wrong with Sitemeter, or there is an irritating bug somewhere bloating the numbers. I'd like to know one way or another, as I've had to deal with strange spikes like this in the past.
Looking at the rapid increase in the ratio of page views to “visits”, I’d expect it is a bug.
Normally when you get a real link flood from people coming in from reddit or digg or facebook or twitter or any of the international services, you’d expect the pages per visit to drop. People come in to have a look at the page that has been linked to and they don’t bother much with the rest of the site.
In Sitemeter, Kiwiblog usually runs within a long-term range of about 1.3-2.0 pages per visit. At this point in the year and the election cycle, the ratio goes up to 1.8 for weekend traffic, when there are fewer casual readers, and down to 1.5. During election month it will get over 2.0 on weekdays. During January it will approach 1.1, because there isn't much interest in holiday snaps even amongst the sewer-rats. Getting a 2.6 ratio and a million page views in a day (about 50 times the usual) tends to make me suspect that this is a glitch. Lots of casual reading of lots of pages? Not likely.
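To make the point concrete, here is a minimal sketch of the kind of sanity check that ratio supports. It isn't anything we actually run; the figures and thresholds are made-up placeholders.

```python
# Rough sketch of the pages-per-visit sanity check described above.
# The figures and thresholds are hypothetical, not real Sitemeter data.

def pages_per_visit(page_views, visits):
    """Ratio of page views to visits for a day."""
    return page_views / visits if visits else 0.0

# Typical long-term band for the site in question (roughly 1.3-2.0 per the post).
NORMAL_RANGE = (1.3, 2.0)

def looks_like_a_glitch(page_views, visits, usual_daily_views):
    """Flag a day whose ratio is outside the normal band and whose volume
    is wildly above the usual daily page views."""
    ratio = pages_per_visit(page_views, visits)
    out_of_band = not (NORMAL_RANGE[0] <= ratio <= NORMAL_RANGE[1])
    huge_volume = page_views > 10 * usual_daily_views
    return out_of_band and huge_volume

# Example: a million page views at roughly a 2.6 ratio against a usual ~20k/day.
print(looks_like_a_glitch(1_000_000, 385_000, 20_000))  # True
```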
Besides, I can’t see any post or comment in the last week that would cause a major influx (No doubt David Farrar will tell us if there was).
But I’ve seen something before that was a bug and looks like this.
Back in 2012 2011 I changed this site to use the asynchronous Facebook javascript rather than the synchronous one.
But there was some kind of bug on the Facebook side with their caching. Rather than Facebook storing the link, excerpt, and graphic for a 'liked' link on our site as it was meant to, Facebook would request the whole of the same page every time someone scrolled past the link on Facebook. As people tend to open and scroll down their Facebook page a lot, that generated a lot of requests on the site.
What was worse, Facebook appeared to be grabbing the whole page and executing the javascript at the end of the page that tells StatCounter and SiteMeter that a human has read the page, and it was doing it from the end-user's browser.
It took me a little while and a chunk of after-work analysis to figure out why I was getting five times as many page views per post on some posts. The IPs were all different and in our usual IP ranges (i.e. mostly NZ), so it didn't look like a spambot attack. At least not unless someone had made botnet zombies out of an awful lot of kiwi computers (and I was worrying about that scenario for a few hours).
After a few days of filtering and sorting the millions of lines of website log looking for common patterns, I realised where the excess page views were coming from. There is a lot of chaff in a website log file, because it logs 'hits' to every image file, CSS file, javascript file, async jQuery call, and a lot of internal stuff. Much of the time the server doesn't even do much work for those; it just returns the machine version of "you already have that file, use it" for client-side caching. The actual dynamic pages of a commented website are a smallish fraction of the lines in the web log, and those are the ones you have to filter for.
Eventually I figured out that the common feature of the excess page views was Facebook. I confirmed it by watching just my own machine talking to Facebook and seeing the effect on the site. So I turned off the asynchronous Facebook javascript and our numbers went back down to normal.
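For anyone curious, the filtering involved looks roughly like the sketch below. It assumes the standard Apache/nginx combined log format and a hypothetical access.log path; it is an illustration of the approach, not the actual scripts I used.

```python
# Rough sketch: strip the chaff out of a web-server access log and see which
# referring hosts are driving the dynamic page requests. The log path is a
# hypothetical placeholder.
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the common Apache/nginx "combined" log format.
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Chaff: static assets and cache revalidations ("you already have that file").
STATIC = ('.css', '.js', '.png', '.jpg', '.gif', '.ico')

referring_hosts = Counter()
with open('access.log') as log:          # hypothetical log path
    for raw in log:
        m = LINE.match(raw)
        if not m:
            continue
        if m['status'] == '304' or m['path'].lower().endswith(STATIC):
            continue                      # skip the chaff, keep dynamic page requests
        host = urlparse(m['referer']).netloc or '(no referer)'
        referring_hosts[host] += 1

# Who is driving the dynamic page requests? In the incident described above,
# facebook.com would have floated to the top of a list like this.
for host, count in referring_hosts.most_common(10):
    print(f'{count:8d}  {host}')
```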
A few months later I tested it again, and it repeated the same pattern. The third time, a few months after that, it worked the way I expected it to (with no changes in my code).
My guess is that DPF has had something similar happen to his site. It'd be interesting to see what shows up in Google Analytics (I had some significant variations in page views between Sitemeter, StatCounter, and Analytics), and how much effect it had on the server. With my interesting Facebook bug, the reason I noticed it was that I was getting warnings that the CPU on the server was getting close to maxing out, and readers were complaining that the site was running really slow.
But it really does show you how easy it is to get interesting statistics in the web world. Some people deliberately cultivate this kind of bug, or even induce it.
* The winter cycle… We usually peak around April or May, and then have a significant drop in page views over June and July. Things start picking up again at the end of August. You'd think that winter was when people had long evenings to while away on blogs. But that isn't what happens.
Updated: I noticed an error in my post. The Facebook issues were in 2011, not 2012. See the post about how close we were getting to Kiwiblog's page views. That post probably made DPF realise that his site wasn't collecting page views from the post pages.
Interesting. Any sign of this happening in other months?
TS seems to be catching up to KB views and if this sort of blip has happened before it could be quite close.
Not that I have observed and certainly not to this degree. Unfortunately the public stats on sitemeter don’t allow you to walk back as far as the stats on statcounter. But I’m pretty sure that I’d have noticed abnormalities in the stats.
Those systematic shifts were pretty noticeable when Cameron Slater shifted his stats.
In less than half of the time that it took you to write this post, you could have just emailed him.
Re-read the final two sentences of this post, and think about what they might mean. Everything else is purely to illustrate the point they make.
Offhand, I can’t recall ever initiating a email to any blogger outside of TS except when it is related to reposting one of their posts, trying to recruit them, or requesting information related to TS (often – where is your RSS?) or where they have requested technical help in public.
I have replied to a number of emails from them on roughly the same topics. But with the very occasional email from other bloggers pointing out defamatory comments that escaped moderation.
I can’t be sure without searching my email system, but I think that in nearly 8 years I have never had a reason to initiate sending a email to David Farrar. I can’t think of any particular reason to start now.
Mostly bloggers should do everything in public except what they explicitly state they protect. In our case that is just the names and details of people under pseudonyms (because there are always arseholes like Slater out there). But there are also some mundane bits of reasonably innocuous organising chatter between authors at TS, like when people have fixed errors in posts, or are explaining to other authors why they wrote a particular post a particular way, or, these days increasingly, discussion about keeping moderation consistent.
You don’t think that readers and commenters and readers don’t need to be involved in the public conversation about the blogs that they are involved with? Why?
For that matter there are hundreds of local blogs out there. Pointing these things in public allows something that could affect other sites to be known about.
And don’t you think that they should know things that showed up in #dirtypolitics of unacknowledged paid-for infomercial posts and the monetary and influence payments for background organising of collusion attacks by multiple blogs at political foes? If not then why?
That is why we do almost all of our commentary in public and have always done so.
The overall rankings are on openparachute, if anybody is keen to have a look. While WO is probably the most visited/read, the actual numbers are unknown, as WO relies on spurious generated traffic to manipulate the results.
TS continues to build numbers month on month, so pats on backs to all the authors and commenters who make the site such a vital part of the political scene in Aotearoa.
https://openparachute.wordpress.com/2015/06/01/may-15-nz-blogs-sitemeter-ranking/
https://openparachute.wordpress.com/nz-blog-ranks/
Funny, given the recent shrill warnings from a few regular righties and Wayne Mapp that this site is becoming more radical left and out of touch with ordinary New Zealanders.
well observed
Plus the traffic KB will be getting from Farrar’s high profile compared to the standard’s relative lack of publicity outside of the blogosphere (although that’s changing a bit with the MSM now taking notice of political blogs). I think that means the standard is doing very well.
The Right reckons this is Radical Left? haha. Out-of-touch, maybe, sometimes, but it’s really quite easy to be out-of-touch with the blinkered mainstream narrow-mindedness in NZ.
Shit, as soon as any working class guy or gal gets schnippy with their oppressed reality and voices an angry opinion, mainstream views start trending towards complaints that the person speaking is a "radical". Being out-of-touch with normal is how people tell their own specific stories from a real world that doesn't naturally comply with cultural myths, from the fringe of a culture, and how they examine ideas new-to-their-way-of-thinking… but nowhere near radical left. In my opinion, "normal", "mainstream" and "culturally correct" are words to describe a method to propagate mental illness in perfectly sane people whose personalities fall outside the range of activities that a society values at any given time.
I’d be happy to read “more radical” here, and by radical I don’t mean conspiracy or hard-ideological radical, just ideas that really shamelessly push the known limits. Radical ideas don’t have to be practical or easily applied, just have to inspire and promote widening of perspectives, to reveal something not previously noticed. If our culture’s norms only purpose is to avoid cognitive dissonance in it’s ruling and aspiring classes, I’d say it’s about ready to unavoidably fracture.
My ISP mentioned to me on the 19th there was a fairly serious DDOS attack from China. They blocked a dozen or so IP addresses. Looks like it has been ongoing.
A new site called minds.com claims to be more secure than Facebook whilst offering similar services. Maybe worth a look for the security conscious?
What does that have to do with anything?
A DDOS that is executing the javascript that Sitemeter, Statcounter, Analytics, etc all run to count the page views?
That would be a bit weird. Usually the bots completely ignore all of the javascript. But they could be doing it to try to bypass possible javascript protections on the login, comments, or registration.
For the interior defence against most kinds of bots, try something like the Wordfence plugin. It has a number of defences, but the one that is probably most useful is the one that keeps track of the number of calls for pages from each IP and allows you to set a limit per minute for humans and known bots. It is targeted, because Wordfence lives inside WordPress and is aware of what a page is.
The only thing to avoid is turning on the logging of incoming data so that you can look at it later. For some strange reason this is done using the database, and it knocks DB performance during an attack.
Wordfence will then either throttle or, in my case, block bots for hours if they exceed those limits. It is efficient doing the latter because it uses .htaccess. You still get a lot of requests, but they are small and you don't wind up processing them beyond .htaccess or the nginx equivalent.
But it does mean that any bot-contaminated system has to read the site very slowly, at almost human speeds, which is what it should be doing anyway if it is a legit system. Known systems like Googlebot, the National Library, the Wayback Machine and others are let through for faster scans. But things just using our bandwidth and CPU get sorted out, killed, and reported.
The Standard automatically blocks hundreds of previously unknown bot systems worldwide every day. It also blocks machine-speed attempts to log in, register, and post comments.
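To illustrate the general idea, here is a minimal sketch of per-IP rate limiting. It is not Wordfence's code; the limit and block duration are made-up examples.

```python
# Minimal sketch of per-IP page-request rate limiting, the general idea behind
# the Wordfence option described above. Not Wordfence's actual code; the limit
# and block duration are made-up examples.
import time
from collections import defaultdict, deque

PAGE_LIMIT_PER_MINUTE = 30     # hypothetical ceiling for a "human" visitor
BLOCK_SECONDS = 4 * 60 * 60    # block offenders for a few hours

recent_hits = defaultdict(deque)   # ip -> timestamps of recent page requests
blocked_until = {}                 # ip -> time when the block expires

def allow_page_request(ip, now=None):
    """Return False if this IP is blocked or has exceeded the per-minute limit."""
    now = time.time() if now is None else now

    if blocked_until.get(ip, 0.0) > now:
        return False                          # still serving out a block

    hits = recent_hits[ip]
    hits.append(now)
    while hits and hits[0] < now - 60:        # keep only the last minute of hits
        hits.popleft()

    if len(hits) > PAGE_LIMIT_PER_MINUTE:
        blocked_until[ip] = now + BLOCK_SECONDS   # in practice: emit a deny rule
        return False
    return True
```

In practice the block gets pushed out to .htaccess or the nginx equivalent, so blocked requests never reach PHP at all, which is where the efficiency mentioned above comes from.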
BTW: Looks like it is ongoing.
You’re up to 250+ thousand for today so far.
Good luck cleaning it up.