Web Robot Detection
part of the ArsDigita Community System
by Michael Yoon
The Big Picture
Many of the pages on an ACS-based website are hidden from robots
(a.k.a. search engines) because login is required to access them. A
generic way to expose login-required content to robots is to redirect
all requests from robots to a special URL that is designed to give the
robot what appear, at least, to be ordinary linked files.
You might want to use this software for situations where public (not
password-protected) pages aren't getting indexed by a specific robot:
many robots won't visit pages that look like CGI scripts, e.g., URLs
containing question marks and form variables (this is discussed in
Chapter 7 of Philip and Alex's Guide to Web Publishing).
The Medium-sized Picture
In order for this to work, we need a way to distinguish robots from
human beings. Fortunately, the Web Robots Database maintains a list of
active robots, which it kindly publishes as a text file. By loading
this list into the database, we can implement the following algorithm:
- Check the User-Agent of each HTTP request against those of known
robots (which are stored in the robot_useragent column of the robots
table).
- If there is a match, redirect to the special URL.
- This special URL can be either a static page or a dynamic script that
dumps lots of juicy text from the database, for the robot's indexing
pleasure.
This algorithm is implemented by a postauth filter proc: ad_robot_filter.
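To make the mechanics concrete, here is a minimal sketch of what such a
postauth filter could look like, written against the AOLserver Tcl API
and the ACS 3.x helpers database_to_tcl_string and DoubleApos; the proc
name robot_filter_sketch is purely illustrative, and the real
ad_robot_filter may be organized differently:

# Illustrative sketch only; the actual ad_robot_filter may differ.
proc robot_filter_sketch { why } {
    # User-Agent header of the incoming request
    set useragent [ns_set iget [ns_conn headers] "User-Agent"]

    # Does this User-Agent belong to a known robot?
    set db [ns_db gethandle]
    set n_matches [database_to_tcl_string $db "
        select count(*) from robots
        where robot_useragent = '[DoubleApos $useragent]'"]
    ns_db releasehandle $db

    if { $n_matches > 0 } {
        # Send the robot to the special URL and stop normal processing
        ns_returnredirect "/robot-heaven/"
        return filter_return
    }

    # Not a robot: let the request proceed as usual
    return filter_ok
}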
(Note: For now, we are only storing the minimum number of fields needed
to detect robots, so many of the columns in the robots table will be
empty. Later, if the need presents itself, we can enhance the code to
parse out and store all fields.)
Configuration Parameters
[ns/server/yourservername/acs/robot-detection]
; the URL of the Web Robots DB text file
WebRobotsDB=http://info.webcrawler.com/mak/projects/robots/active/all.txt
; which URLs should ad_robot_filter check (uncomment to turn system on)
; FilterPattern=/members-only-stuff/*
; FilterPattern=/members-only-stuff/*
; the URL where robots should be sent
RedirectURL=/robot-heaven/
; How frequently (in days) the robots table
; should be refreshed from the Web Robots DB
RefreshIntervalDays=30
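As an illustration of how these settings might be picked up at server
startup (the variable names here are made up for the example; the
module's actual parameter-reading code may differ):

# Illustrative sketch of reading the module's parameters at startup.
set section "ns/server/[ns_info server]/acs/robot-detection"

set robots_db_url [ns_config $section WebRobotsDB]
set redirect_url  [ns_config $section RedirectURL "/robot-heaven/"]
set refresh_days  [ns_config $section RefreshIntervalDays 30]

# FilterPattern may appear more than once, so walk the whole section.
set settings [ns_configsection $section]
set filter_patterns [list]
if { $settings != "" } {
    for { set i 0 } { $i < [ns_set size $settings] } { incr i } {
        if { [string tolower [ns_set key $settings $i]] == "filterpattern" } {
            lappend filter_patterns [ns_set value $settings $i]
        }
    }
}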
Notes for the Site Administrator
- Though admin pages exist for this module, there should be no need to
use them in normal operation. This is because the ACS automatically
refreshes the contents of the robots table at startup, if it is empty
or if its data is older than the number of days specified by the
RefreshIntervalDays configuration parameter (see above).
- If no FilterPatterns are specified in the configuration, then the
robot detection filter will not be installed (see the sketch below).
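A rough sketch of that second point, reusing the robot_filter_sketch
proc and filter_patterns variable from the earlier examples (again, not
the module's actual code):

# Illustrative sketch: install the postauth filter only if patterns exist.
if { [llength $filter_patterns] == 0 } {
    ns_log Notice "robot-detection: no FilterPattern configured; filter not installed"
} else {
    foreach pattern $filter_patterns {
        ns_register_filter postauth GET  $pattern robot_filter_sketch
        ns_register_filter postauth HEAD $pattern robot_filter_sketch
    }
}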
Set Up
- Build a non-password-protected site starting at /robot-heaven/
(that's the default destination), using ns_register_proc if necessary
to give dynamic pages the appearance of static HTML files (see the
sketch after this list).
- Specify the directories and file types you want filtered and bounced
into /robot-heaven/ (via the FilterPattern settings in the ad.ini
file).
- Restart AOLserver.
- Visit the /admin/robot-detection/ admin page to see whether your
configuration took effect.
- Check your server error log to make sure that the filters are getting
registered.
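For the first step, here is a sketch of how ns_register_proc could be
used to serve database-backed content under /robot-heaven/; the table
and column names (interesting_content, title) are placeholders, not
part of the ACS data model:

# Illustrative sketch: dynamic content at /robot-heaven/ that a robot
# sees as an ordinary page of links and text.
ns_register_proc GET /robot-heaven/ robot_heaven_index

proc robot_heaven_index { args } {
    set db [ns_db gethandle]
    # interesting_content and title are placeholder names
    set selection [ns_db select $db "select title from interesting_content"]
    set items ""
    while { [ns_db getrow $db $selection] } {
        append items "<li>[ns_set get $selection title]\n"
    }
    ns_db releasehandle $db
    ns_return 200 text/html "<html><body><ul>$items</ul></body></html>"
}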
Testing
See the ACS Acceptance Test.
michael@arsdigita.com