Dealing with the DIA website-harvest parasite

Written By: lprent - Date published: 2:27 am, February 3rd, 2011 - 13 comments
Categories: admin, blogs, humour, inoculation, interweb, scoundrels, Spying, The Standard - Tags: spambots, spiders

Periodically I run a scan to identify network parasites that are sucking up our bandwidth and processing resources in excess. Of course I leave alone the benign parasites that provide search facilities, like the ubiquitous googlebot and its brethren. But I stomp on the nasties and inoculate the system against them.

It could be spambots trying to insert comments, requiring us to expend effort cleaning up the garbage. There, denying from .htaccess is your friend, and vast swathes of IPs from Russia, China and some other regions are excluded from reading anything at all.
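The actual rules live in .htaccess, but the underlying test is just "is this address inside a denied range?". A minimal sketch of that check in Python, assuming a hypothetical deny list (the networks below are reserved documentation ranges, and the name `is_denied` is made up for illustration):

```python
import ipaddress

# Hypothetical deny list: reserved documentation ranges, not real rules.
DENIED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_denied(ip: str) -> bool:
    """Return True if the address falls inside any denied network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in DENIED_NETWORKS)
```

Blocking by network rather than by single address is what makes it practical to exclude whole regions with a short list.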

It could be virus-slaved botnets mindlessly trying to exploit our “contact us” and “contribute” pages to send e-mails. There I find that a relatively simple captcha is sufficient to deter them.

Feedbots picking up RSS feeds used to be a problem. But since I put some limits on how much they can pick up and how old it can be, that problem has disappeared.

It could be an out-of-control spider trying to crawl all over our web pages in a single cataclysmic event. I find that putting restrictions on how often an IP number can connect within a given period of time tends to squelch them. Because of the nature of the other beasts on the net, I have that set to harm only the really out-of-control processes, using grey-listing. After a while the system starts refusing to accept new connections, while very slowwwly feeding the bot a dribble of data and holding it up.
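A minimal sketch of that kind of per-IP throttle, assuming a sliding time window. The class name, the thresholds and the "tarpit" label are all illustrative assumptions, not the actual setup on the server:

```python
import time
from collections import defaultdict, deque


class ConnectionThrottle:
    """Sliding-window connection counter per IP (illustrative values only)."""

    def __init__(self, max_hits=30, window_seconds=60.0, clock=time.monotonic):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # ip -> timestamps of recent connections

    def check(self, ip):
        """Record one connection and return 'serve' or 'tarpit'."""
        now = self.clock()
        recent = self.hits[ip]
        # Forget connections that have aged out of the window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        recent.append(now)
        # Over the limit: the caller should slow-feed or refuse this client.
        return "tarpit" if len(recent) > self.max_hits else "serve"
```

A well-behaved bot never reaches the limit and is served normally; only a client hammering the site inside one window gets the slow treatment, and it recovers once its old connections age out.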

But that leaves the processes that don’t really go out of control but are rather badly mannered. Specifically they don’t appear to obey the rules of my robots.txt particularly well. Usually I just put in rules to ban them permanently.
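For comparison, this is roughly what obeying robots.txt looks like from the crawler's side, sketched with Python's standard `urllib.robotparser`. The robots.txt content and the bot name `examplebot` are hypothetical, not this site's actual file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not the site's real rules.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 20
Disallow: /wp-admin/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# A polite crawler asks before each fetch and waits between requests.
allowed = parser.can_fetch("examplebot", "/2011/02/some-post/")
delay = parser.crawl_delay("examplebot")
```

A polite bot skips any path where `can_fetch` is false and sleeps for `crawl_delay` seconds between requests. Note that `Crawl-delay` is a widely used extension that not every crawler honours.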

Tonight the biggest parasite appears to have been a New Zealand government department – the Department of Internal Affairs. Processes mounted on IP numbers 192.122.171.75 and 192.122.171.11 ran a very high number of queries on the site over a couple of hours. It advertised itself in the user agent as being “heritrix/1.14.1 + http://www.dia.govt.nz/website-harvest”.

Well ok, I looked at the web page and it doesn’t exist. I googled for website harvest at the DIA – nothing. I looked on the DIA website to no avail. I looked up heritrix, which asserts that it follows the robots.txt. That’s not what I saw, especially when it was picking up from two IPs.

At that point I was tired of being nice to whoever hadn’t put up the web page explaining why they were tapping our website. If anyone does know then put a comment here. I also didn’t know where to send my irritated e-mail.

I’m hesitant to exclude a branch of the government from archiving our site for posterity and historians (which is what heritrix is designed for). But I sure as hell want to be able to know what they’re doing. I’d also like them to moderate their behavior and obey the rules in the robots.txt somewhat better.

Rather than doing what I would have usually done and simply denying them access, I decided to draw their attention to their lack of an explanatory webpage.

I used a RewriteRule in .htaccess to permanently redirect every query from them to the missing webpage. I’m now happily off to bed contented that somewhere in the DIA there is a file system or database slowly filling up with copies of their missing page.
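The real thing is a RewriteRule in .htaccess. As a hedged sketch of the same logic in Python: match the harvester and answer every request with a permanent redirect to the page it advertises. Matching on the user-agent string (rather than the two IPs) is an assumption here, and `respond` is a made-up function name:

```python
# Substring taken from the user agent quoted above.
HARVESTER_TOKEN = "dia.govt.nz/website-harvest"
ADVERTISED_URL = "http://www.dia.govt.nz/website-harvest"


def respond(user_agent):
    """Return (status, headers) for a request, based only on its user agent."""
    if HARVESTER_TOKEN in user_agent.lower():
        # 301 = moved permanently: point the crawler at its own missing page.
        return 301, {"Location": ADVERTISED_URL}
    return 200, {}
```

Everyone else gets served normally; only the harvester is sent chasing its own URL.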

13 comments on “Dealing with the DIA website-harvest parasite ”

Jenny 1

3 February 2011 at 6:17 am

Good for you Lynne, with the $millions dished out to our secretive spooks to prod and pry and fish for our private details and invest our private files.

It is about time someone punctured their arrogant smug bubble again. (name check Waihopai here).

capcha – “sequence” (as in code)
- lprent 1.1
  
  3 February 2011 at 8:25 am
  
  I don’t think it is the spooks. Too overt for a starter. I figured it out to be a part of something like the wayback engine.
Tim H 2

3 February 2011 at 6:44 am

Don’t know much about it, but you might want to look at natlib.govt.nz/website-harvest. What with them being subsumed into DIA, I reckon that might have something to do with it. Cock up rather than evil spooks, possibly.

There’s contact addresses and all there.
- lprent 2.1
  
  3 February 2011 at 8:30 am
  
  That will be it. I did this very early this morning. I was more interested in getting their pickup off my screen so I could see the effect of other parasites. Dropping what my site delivered to them from tens of k per page to 596 bytes did it – and amused me as well.
- lprent 2.2
  
  3 February 2011 at 8:43 am
  
  Google for natlib website-harvest gets it. They need to look at their SEO. website-harvest alone should have been enough for me to locate it.
  
  http://natlib.govt.nz/website-harvest
  
  I can’t find a mention of them running a sweep at present. In fact the whole website looks out of date. I guess that the DIA is doing what is expected and not doing the required work for natlib.
  - Idiot/Savant 2.2.1
    
    3 February 2011 at 11:44 am
    
    They run a sweep fairly regularly – they have a program to archive the top blogs for posterity. Though IIRC they asked permission (which obviously I granted).
    - lprent 2.2.1.1
      
      3 February 2011 at 12:46 pm
      
      It isn’t running through the site that bugs me. I think it is a worthy project.
      
      It is that they were pulling it too fast whilst people were still active on the site. In particular authors were writing posts after work (it was after Marty commented that he’d had a few outages that I went and looked at what was using more than usual resources). Also that when I went to look for whoever it was, the URL was invalid – that tends to really annoy sysops because it wastes our time.
      
      They didn’t ask me, but that is because we’re in a .nz domain space rather than .com. There is some legislation covering that.
prism 3

3 February 2011 at 10:15 am

Is this a follow-on from the political meddling with NZ Archives etc? Is it part of the results from trying to reduce quality of service and the separateness of our old records to make short-term savings? We saw how uninterested, even hostile, previous Nat politicians have been to the cultural needs of us as an advanced, educated society, such as in pushing RadioNZ out of their premises, specially built for their purposes.
- lprent 3.1
  
  3 February 2011 at 11:36 am
  
  “Is this a follow-on from the political meddling with NZ Archives etc?”
  
  I suspect so. What annoys me is that the page pickup rate is too high and it was happening prior to midnight. We get a lot of traffic in the evenings and it runs all the way up to the wee hours of the morning.
  
  A bot picking up pages virtually every second may do great things for their efficiency. But it is a pain in the arse for sites with a lot of traffic and a very large number of pages. I suspect that it isn’t an issue for something like the NZ Herald site where they have quite a lot of hardware redundancy. But for us where we run on quite limited servers it is annoying.
  
  The norm for bots like baidubot or googlebot is to pick up a page every 20-30 seconds on any single site during a full scan and do that continuously. That pushes it neatly into the background processing. Of course there are usually a couple of dozen bots running sweeps at any particular point in time. So with the DIA picking up virtually every second it effectively doubles our background processing. I tend to block out servers that pick up every second.
  
  In practice we use a sitemap (listed in the virtual robots.txt) and push notifications for the major bots to tell them when a page has been altered and if they should pick it up. But that just covers googlebot, bing, yahoo, and ask. Bearing in mind that the bots are well over 50% of the daytime traffic (and 95% of the overnight), I’m starting to consider excluding all bots apart from the ones we choose to inform.
Rex Widerstrom 4

3 February 2011 at 7:53 pm

I had exactly this happen with a political candidate site, except it was the National Library of Australia, and yes, for a wayback / archive operation.

We were paying a commercial host and within an hour their spider had used our entire month’s bandwidth allocation and got the site shut down – in the middle of a campaign.

Figuring they must have the site by now (it was only a few dozen pages!) and they’d leave us alone and intending to complain later, I just paid for additional bandwidth. It came back, did it again. Within 24 hours we were off air again.

I wrote them a very angry letter (even by my standards) and instead of saying “sorry, we’ll look at what it’s doing” they got lawyers involved. Assholes.

Bill them for the bandwidth, time and inconvenience and send them off to the debt collectors if they don’t pay up.
- lprent 4.1
  
  3 February 2011 at 10:10 pm
  
  At present they are sucking up nothing much apart from instructions to read their own site. But the server has an allocation of about 700GB for bandwidth and we’re well under that. Just inconvenient and a bit stupid.
lprent 5

3 February 2011 at 10:37 pm

Finally caught up with my e-mail (I only read it a couple of times per day). There was a good e-mail of which the relevant bit was:-

We have harvested the site before but unfortunately last night there was a glitch, which set off two harvests instead of one. This would explain the extra traffic on your site last night.

Compounding that there was a mistake in our website address, which is why you couldn’t find the page that explains our web harvesting process. I’ve corrected that now. The correct URL is http://www.natlib.govt.nz/website-harvest which will appear in future log files.

I can live with a single process scanning. I’ve released the redirects.