March 18, 2003

Out, out, damned bot

I've been maintaining websites since 1997, but only recently have I come to realize just how prevalent obnoxious robots and spiders have become. These are basically software applications that act like automated browsers, following links and downloading whole sites in order to extract information for their owners. Not a big deal when the bots are sent out by legitimate search engines such as Google, and when they obey the standard instructions webmasters use to control what is to be catalogued and what is off-limits. Many bots, however, are programmed to ignore these instructions, and their purpose is often less than benign -- the worst being the spambots, sent out to find email addresses for spammers. Bots can also suck up large amounts of bandwidth, bandwidth that site owners have to pay for, and since they often masquerade as ordinary browsers, they can also distort the traffic statistics of a site -- which is how I first noticed their pernicious presence at Cronaca.
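For the curious, the "standard instructions" mentioned above usually live in a robots.txt file at the root of a site. Here is a rough sketch, in Python, of the check a well-behaved bot is supposed to make before fetching a page. The robots.txt contents, the bot names (GoodBot, BadBot) and the example.com URLs are all made up for illustration; only the standard library's urllib.robotparser module is real. Note that the check runs on the bot's side, so it only works if the bot chooses to run it, which is exactly the problem with the rude ones.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    checks = [
        ("GoodBot", "http://example.com/index.html"),
        ("GoodBot", "http://example.com/private/notes.html"),
        ("BadBot", "http://example.com/index.html"),
    ]
    for agent, url in checks:
        verdict = "allowed" if may_fetch(parser, agent, url) else "off-limits"
        print(f"{agent:8} {url:40} {verdict}")
```

A polite bot would skip anything reported as off-limits. Nothing on the server enforces this, which is why the server-side defenses discussed below are needed for bots that ignore the file.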

You will have more options if your site is hosted on a Unix server. A good intro is available here; an article devoted to the .htaccess file is here; articles on building bot traps are here and here; and once you get your defenses up, you can test them at Wannabrowser. You can keep up to date through the forums at WebmasterWorld, and with this very long thread devoted to blocking bots (free registration required; highly recommended).

For those using IIS, options are more limited; I've found this but haven't yet had time to test it.

Posted by David on March 18, 2003 11:56 AM
