Web Robot Detection
part of the ArsDigita Community System
by Michael Yoon
The Big Picture
Many of the pages on an ACS-based website are hidden from robots
(a.k.a. search engines) because login is required to access them. A
generic way to expose login-required content to robots is to redirect
all requests from robots to a special URL that is designed to give the
robot what appear, at least, to be ordinary linked files.
You might want to use this software for situations where public (not
password-protected) pages aren't getting indexed by a specific robot:
many robots won't visit pages that look like CGI scripts, e.g., URLs
containing question marks and form variables (this is discussed in
Chapter 7 of Philip and Alex's Guide to Web Publishing).
The Medium-sized Picture
In order for this to work, we need a way to distinguish robots from
human beings. Fortunately, the Web Robots Database maintains a list of
active robots, which it kindly publishes as a text file. By loading
this list into the database, we can implement the following algorithm:
- Check the User-Agent of each HTTP request against those of known
robots (which are stored in the robot_useragent column of the robots
table).
- If there is a match, redirect to the special URL.
- This special URL can be either a static page or a dynamic script that
dumps lots of juicy text from the database, for the robot's indexing
pleasure.
This algorithm is implemented by a postauth filter proc: ad_robot_filter.
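To make the mechanics concrete, here is a minimal sketch of what such a
postauth filter could look like, written against the AOLserver Tcl API
and the ACS 3.x helpers database_to_tcl_string and DoubleApos; the proc
name robot_filter_sketch is purely illustrative, and the real
ad_robot_filter may be organized differently:

# Illustrative sketch only; the actual ad_robot_filter may differ.
proc robot_filter_sketch { why } {
    # User-Agent header of the incoming request
    set useragent [ns_set iget [ns_conn headers] "User-Agent"]

    # Does this User-Agent belong to a known robot?
    set db [ns_db gethandle]
    set n_matches [database_to_tcl_string $db "
        select count(*) from robots
        where robot_useragent = '[DoubleApos $useragent]'"]
    ns_db releasehandle $db

    if { $n_matches > 0 } {
        # Send the robot to the special URL and stop normal processing
        ns_returnredirect "/robot-heaven/"
        return filter_return
    }

    # Not a robot: let the request proceed as usual
    return filter_ok
}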
(Note: For now, we are only storing the minimum number of fields needed
to detect robots, so many of the columns in the robots table will be
empty. Later, if the need presents itself, we can enhance the code to
parse out and store all fields.)
Configuration Parameters
[ns/server/yourservername/acs/robot-detection]
; the URL of the Web Robots DB text file
WebRobotsDB=http://info.webcrawler.com/mak/projects/robots/active/all.txt
; which URLs should ad_robot_filter check (uncomment to turn system on)
; FilterPattern=/members-only-stuff/*
; FilterPattern=/members-only-stuff/*
; the URL where robots should be sent
RedirectURL=/robot-heaven/
; How frequently (in days) the robots table
; should be refreshed from the Web Robots DB
RefreshIntervalDays=30
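As an illustration of how these settings might be picked up at server
startup (the variable names here are made up for the example; the
module's actual parameter-reading code may differ):

# Illustrative sketch of reading the module's parameters at startup.
set section "ns/server/[ns_info server]/acs/robot-detection"

set robots_db_url [ns_config $section WebRobotsDB]
set redirect_url  [ns_config $section RedirectURL "/robot-heaven/"]
set refresh_days  [ns_config $section RefreshIntervalDays 30]

# FilterPattern may appear more than once, so walk the whole section.
set settings [ns_configsection $section]
set filter_patterns [list]
if { $settings != "" } {
    for { set i 0 } { $i < [ns_set size $settings] } { incr i } {
        if { [string tolower [ns_set key $settings $i]] == "filterpattern" } {
            lappend filter_patterns [ns_set value $settings $i]
        }
    }
}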
Notes for the Site Administrator
- Though admin pages exist for this module, there should be no need to
use them in normal operation. This is because the ACS automatically
refreshes the contents of the robots table at startup, if it is empty
or if its data is older than the number of days specified by the
RefreshIntervalDays configuration parameter (see above).
- If no FilterPatterns are specified in the configuration, then the
robot detection filter will not be installed (see the sketch below).
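A rough sketch of that second point, reusing the robot_filter_sketch
proc and filter_patterns variable from the earlier examples (again, not
the module's actual code):

# Illustrative sketch: install the postauth filter only if patterns exist.
if { [llength $filter_patterns] == 0 } {
    ns_log Notice "robot-detection: no FilterPattern configured; filter not installed"
} else {
    foreach pattern $filter_patterns {
        ns_register_filter postauth GET  $pattern robot_filter_sketch
        ns_register_filter postauth HEAD $pattern robot_filter_sketch
    }
}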
Set Up
- Build a non-password-protected site starting at /robot-heaven/
(that's the default destination), using ns_register_proc if necessary
to give dynamic pages the appearance of static HTML files (see the
sketch after this list).
- Specify the directories and file types you want filtered and bounced
into /robot-heaven/ (via the FilterPattern settings in the ad.ini
file).
- Restart AOLserver.
- Visit the /admin/robot-detection/ admin page to see whether your
configuration took effect.
- Check your server error log to make sure that the filters are getting
registered.
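For the first step, here is a sketch of how ns_register_proc could be
used to serve database-backed content under /robot-heaven/; the table
and column names (interesting_content, title) are placeholders, not
part of the ACS data model:

# Illustrative sketch: dynamic content at /robot-heaven/ that a robot
# sees as an ordinary page of links and text.
ns_register_proc GET /robot-heaven/ robot_heaven_index

proc robot_heaven_index { args } {
    set db [ns_db gethandle]
    # interesting_content and title are placeholder names
    set selection [ns_db select $db "select title from interesting_content"]
    set items ""
    while { [ns_db getrow $db $selection] } {
        append items "<li>[ns_set get $selection title]\n"
    }
    ns_db releasehandle $db
    ns_return 200 text/html "<html><body><ul>$items</ul></body></html>"
}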
Testing
See the ACS Acceptance Test.
michael@arsdigita.com