Periodically I run a scan to identify network parasites that are sucking up an excessive share of our bandwidth and processing resources. Of course I leave alone the benign parasites that provide search facilities, like the ubiquitous googlebot and its brethren. But I stomp on the nasties and inoculate the system against them.
It could be spambots trying to insert comments and requiring us to expend effort in cleaning up the garbage. There, a deny rule in .htaccess is your friend, and vast swathes of IPs from Russia, China and some other regions are excluded from reading anything at all.
It could be virus-enslaved botnets mindlessly trying to exploit our “contact us” and “contribute” pages to send e-mails. There I find that a relatively simple captcha is sufficient to deter them.
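As a rough illustration of what a range-based deny amounts to, here is a minimal Python sketch. It is not the actual .htaccess configuration used on this site: the function name `is_denied` is made up for the example, and the networks listed are reserved documentation ranges standing in for whatever real ranges a site might block.

```python
import ipaddress

# Hypothetical blocked ranges. These are reserved documentation networks,
# not the real ranges denied on this site.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_denied(ip_text):
    """Return True if the address falls inside any blocked network."""
    address = ipaddress.ip_address(ip_text)
    return any(address in network for network in BLOCKED_NETWORKS)
```

A request from 203.0.113.42 would be refused outright, while one from 192.0.2.1 would pass through to the site.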
Feedbots picking up RSS feeds used to be a problem. But since I put some limits on how much they can pick up and how old it can be, that problem has disappeared.
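The feed limits above boil down to two caps: how many items go out and how old they may be. A minimal Python sketch of that idea follows. The function name `limit_feed` and the default values (10 items, 14 days) are assumptions for illustration only, not the limits actually configured here.

```python
from datetime import datetime, timedelta


def limit_feed(
    items: list[tuple[datetime, str]],
    now: datetime,
    max_items: int = 10,                      # assumed cap, for illustration
    max_age: timedelta = timedelta(days=14),  # assumed cap, for illustration
) -> list[tuple[datetime, str]]:
    """Return the newest items that are no older than max_age.

    Each item is a (published, title) tuple.
    """
    # Drop anything older than the age limit.
    fresh = [item for item in items if now - item[0] <= max_age]
    # Newest first, then cut to the size limit.
    fresh.sort(key=lambda item: item[0], reverse=True)
    return fresh[:max_items]
```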
It could be an out-of-control spider trying to crawl all over our web pages in a single cataclysmic event. I find that putting restrictions on how often an IP number can connect within a given period of time tends to squelch them. Because of the nature of the other beasts on the net, I have tuned that to harm only the really out-of-control processes, using grey-listing. After a while the system starts refusing to accept new connections, while very slowwwly feeding the bot a dribble of data and holding it up.
But that leaves the processes that don’t really go out of control but are rather badly mannered. Specifically they don’t appear to obey the rules of my robots.txt particularly well. Usually I just put in rules to ban them permanently.
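The per-IP connection limit can be sketched as a sliding-window counter. The Python below is only an illustration under stated assumptions, not the mechanism running on this server: the class name `ConnectionLimiter`, the thresholds, and the three verdicts ("serve", "tarpit", "refuse") are all invented for the example.

```python
import time
from collections import defaultdict, deque


class ConnectionLimiter:
    """Count recent connections per IP and decide how to treat the next one."""

    def __init__(self, max_hits=30, window_seconds=60.0, clock=time.monotonic):
        self.max_hits = max_hits            # assumed threshold
        self.window_seconds = window_seconds
        self.clock = clock                  # injectable so it can be tested
        self.hits = defaultdict(deque)      # ip -> timestamps of recent hits

    def verdict(self, ip):
        now = self.clock()
        recent = self.hits[ip]
        # Forget connections that have fallen out of the window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        recent.append(now)
        if len(recent) <= self.max_hits:
            return "serve"    # well-behaved client
        if len(recent) <= 2 * self.max_hits:
            return "tarpit"   # grey-listed: answer, but very slowly
        return "refuse"       # far over the limit: no new connections
```

Keeping the "tarpit" band wide means an ordinary visitor who briefly bursts over the limit is only slowed down, and only a client hammering well past it is refused.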
Tonight the biggest parasite appears to have been a New Zealand government department, the Department of Internal Affairs. Processes mounted on IP numbers 22.214.171.124 and 126.96.36.199 ran a very high number of queries on the site over a couple of hours. It advertised itself in the user agent as being “heritrix/1.14.1 + http://www.dia.govt.nz/website-harvest”.
Well ok, I looked at the web page and it doesn’t exist. I googled for website harvest at the DIA and found nothing. I looked on the DIA website to no avail. I looked up heritrix, which asserts that it follows robots.txt. That’s not what I saw, especially when it was picking up from two IPs.
At that point I was tired of being nice to whoever hadn’t put up the web page explaining why they were tapping our website. If anyone does know, put a comment here. I also didn’t know where to send my irritated e-mail.
I’m hesitant to exclude a branch of the government from archiving our site for posterity and historians (which is what heritrix is designed for). But I sure as hell want to be able to know what they’re doing. I’d also like them to moderate their behavior and obey the rules in the robots.txt somewhat better.
Rather than doing what I would have usually done and simply denying them access, I decided to draw their attention to their lack of an explanatory webpage.
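One way to check a crawler's claim against what it actually requested is to replay the logged paths through a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser`. The helper name `find_violations` and the example rules in the note after it are hypothetical, not this site's real robots.txt or log data.

```python
from urllib.robotparser import RobotFileParser


def find_violations(robots_lines, user_agent, requested_paths):
    """Return the requested paths that robots.txt disallows for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules supplied as a list of text lines
    return [
        path for path in requested_paths
        if not parser.can_fetch(user_agent, path)
    ]
```

Given hypothetical rules that say `User-agent: heritrix` followed by `Disallow: /private/`, any logged request by that agent for a path under `/private/` would show up in the returned list.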
I used a RewriteRule in .htaccess to permanently redirect every query from them to the missing webpage. I’m now happily off to bed contented that somewhere in the DIA there is a file system or database slowly filling up with copies of their missing page.
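For anyone wanting to see the shape of the trick, here is a rough Python equivalent written as a WSGI wrapper. It is not the RewriteRule itself. Matching on the word "heritrix" in the user agent is an assumption (the real rule could just as well match on the IP numbers), and the name `redirect_harvester` is made up for the example.

```python
# The page the crawler's user agent pointed at, taken from the post above.
HARVEST_PAGE = "http://www.dia.govt.nz/website-harvest"


def redirect_harvester(app, needle="heritrix", target=HARVEST_PAGE):
    """Wrap a WSGI app so matching user agents get a permanent redirect."""

    def wrapped(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if needle in agent.lower():
            # Assumed matching rule: send the crawler to the target page.
            start_response(
                "301 Moved Permanently",
                [("Location", target), ("Content-Length", "0")],
            )
            return [b""]
        # Everyone else gets the real site.
        return app(environ, start_response)

    return wrapped
```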