Ars Digita Site Profiler
One of the Ars Digita Free Tools
Software testing is boring and repetitive. As a result, people aren't
very good at it - we tend to carry out the same actions over and over, and
thus miss errors that are outside the path of our normal activities. So
testing is often neglected, and it can be quite easy for minor (and
sometimes major) defects that detract from a piece of software's utility
and appearance of 'solidness' to make their way into production code.
Automating The Process
Because websites have fairly consistent and well-defined mechanisms through
which all user interaction occurs (a GET or POST for input, HTML for output),
and through which most possible actions are communicated to the user (i.e.
links), we have an opportunity to easily automate the testing of a website.
Crawlers have been taking advantage of these consistent interfaces to index
sites for years. Why not use a crawler of sorts to test a site for broken
links and some other easily-distinguished errors, and to profile its
performance? That's the purpose of this system.
What This System Can Do
- Tell you which pages are slow.
- Find pages with broken links, internal server errors, and any other problem
that is accompanied by an HTTP response code other than 200.
What This System Can't Do
We have no way of testing the semantic correctness of a page's content.
Your .TCL script can return complete gibberish, and the crawler will happily
report that all is well as long as it gets an HTTP return code of 200.
Also, forms are currently ignored (meaning that a page only reachable via
a POST will not get profiled), and no validation is done on the returned HTML.
Both these things may change, though. See 'Possible Improvements', below.
The Mid-Level Details
The system consists of two parts:
The Crawler
The crawler is a fairly dumb program that takes a few arguments specifying
hostname, count of hits to perform, login information, and a few other things.
It then goes through a fairly straightforward sequence of actions:
1. Log on to the site using the specified login information.
2. Retrieve a specified starting path from the site.
3. Record the time required to get a response from the server and any error
conditions returned.
4. Extract all links in the HTML retrieved by the last hit that don't point
to a different host.
5. Randomly choose one of the URLs from the previous step and fetch it from
the server.
6. Repeat steps 3-5 until the specified number of hits has been performed.
7. Write all recorded logging information to a file.
(Actually, it's a bit more complicated than this, when you get into
issues like reacting to error codes returned by the server and preventing
the crawler from getting 'trapped' in a small subset of the site. If you
want more detail, browse the source, with
particular attention to the profile_site
function.)
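To make the flow concrete, here is a minimal sketch of that loop in modern
Python 3 (the actual crawler targets Python 1.5.2 and its code differs; the
function name echoes profile_site, but everything else is illustrative and
omits the login, redirect-tracking, error-code, and trap-avoidance handling
described above):

import csv
import random
import time
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def profile_site(host, start_path, count, outfile):
    samples = []
    path = start_path
    for seq in range(count):
        url = "http://%s%s" % (host, path)
        start = time.time()
        try:
            with urlopen(url, timeout=60) as resp:
                code = resp.status
                html = resp.read().decode("latin-1", "replace")
        except HTTPError as err:
            code, html = err.code, ""     # non-2xx response: record its status code
        except Exception:
            code, html = -1, ""           # connection failure, no further info
        duration = time.time() - start
        # A simplified record, not the full ten-field format described later.
        samples.append((seq, path, code, duration))

        # Extract links and keep only those pointing at the host being profiled.
        extractor = LinkExtractor()
        extractor.feed(html)
        candidates = []
        for link in extractor.links:
            parsed = urlparse(urljoin(url, link))
            if parsed.netloc in ("", host):
                rel = parsed.path or "/"
                if parsed.query:
                    rel += "?" + parsed.query
                candidates.append(rel)
        # Pick the next page at random; on a dead end, start over from the top.
        path = random.choice(candidates) if candidates else start_path

    with open(outfile, "w", newline="") as f:
        csv.writer(f).writerows(samples)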
The Profile Log Analysis Pages
The log file stored by the crawler contains mostly just raw data about
response codes and times for the pages it fetches. Making effective use of
this data requires tools for browsing and generating summaries of the data.
This is where a new set of pages at /admin/profile comes in. On these pages,
developers can upload log files generated by the Python crawler. The data from
the logs is loaded into a database table, and views that summarize the data
by fairly arbitrary criteria can be created.
How To Use The System
Running The Crawler
The first step in using the profiling system is to record a dataset using
the crawler script, 'profile_acs_site'. This script is written in Python;
running it requires that you have Python v1.5.2 installed on your system.
The default Red Hat install seems to include Python. If your system doesn't
already have it, you can download it from
the official Python website. All library
modules needed by the script should be included in the standard Python
distribution.
The format of the command line to record a profile is:
profile_acs_site -profile (host) (path) (count) (outfile) [(logon)] [(illegal_urls)]
Let's take this apart:
- -profile - this flag indicates that the script should
record a profile log. There's also a -report option, which takes an
existing log and generates a plain text report on the data it contains.
However, that option is mostly a legacy from the early stages of this
tool, before the web/Oracle analysis system was created.
- host - hostname of the site to profile. If the service
is running on a port other than 80, specify the port using the
hostname:port syntax.
- path - 'start path' from which the crawler begins
exploring the site. Usually you'll want to use '/'.
- count - number of samples (i.e. hits) to record
- outfile - name of the file in which profile data will
be stored. More details on the file format below.
- logon - a logon specification. This can take 2 forms:
- -acs - indicates an ACS-type login. This flag should be
immediately followed by an email/username and password. The crawler
then executes a 'GET' on /register/user-login.tcl with these values
passed as URL variables, and follows the redirect chain to get the
authentication cookie.
- -generic - indicates a 'generic' logon. This flag should
be immediately followed by a URL (presumably containing username and
password or somesuch as URL variables) which the crawler will GET in
order to log on. This option is mostly a leftover from an earlier
incarnation of the crawler, but I kept it around thinking it might be
useful some other time.
The logon may also be omitted entirely.
- illegal_urls - the last argument to the script is a
regular expression (Perl style, I believe - but go to the
documentation
for the Python 're' module for the final word on the syntax). The
crawler will not follow any link whose URL (minus scheme and hostname)
matches this regexp. This can be used to prevent the crawler from wasting
time profiling pages you aren't interested in, or from following links that
will cause it to be logged out. (When you specify a -acs login,
'/register' is automatically added to this regexp to prevent logouts
from occurring.) Please note that this filter only applies to links
parsed out of a page's HTML. The only restriction on redirect
locations is that they must point to the host being profiled. A small
sketch of how this matching works appears just after this list.
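For illustration, here is a small sketch (modern Python; the function and
variable names are mine, not the crawler's) of how a link can be reduced to
its host-relative form and tested against such a pattern:

import re
from urllib.parse import urljoin, urlparse

def may_follow(base_url, href, illegal_urls):
    """True if the crawler may follow the link, per the rules described above."""
    parsed = urlparse(urljoin(base_url, href))
    if parsed.netloc and parsed.netloc != urlparse(base_url).netloc:
        return False                          # points at a different host
    relative = parsed.path + ("?" + parsed.query if parsed.query else "")
    return re.search(illegal_urls, relative) is None

# Example: refuse anything under /static or /register.
print(may_follow("http://www.wineaccess.com/", "/static/logo.gif",
                 r"/static|/register"))       # prints False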
Here's a sample command line I would use to crawl the WineAccess
website starting from the path '/', for 10000 hits, recording output in the
file 'walog', logging in to the site as 'burdell@gatech.edu', and avoiding
links to pages under /static (or, implicitly, /register):
profile_acs_site -profile www.wineaccess.com / 10000 walog \
-acs burdell@gatech.edu thegoodword "/static"
WARNING! When the crawler runs as a
logged-in user, it can do anything that the user it connects as can
(with the exception of actions requiring a POST). If you have it
connect as a user with significant administrative privileges, it can
randomly delete content, nuke users, wipe out significant parts of
the site, and do all sorts of other bad stuff. Think carefully about
who the crawler will be connecting as.
Analyzing The Crawler Log
Once the log has been recorded, it needs to be uploaded to the database.
This requires that the log analysis data model and pages be
available on an Oracle-backed AOLserver somewhere. Installation requires
two or three steps:
- Load the data model in profile.sql into SQL*Plus.
- Drop the files under the pages directory of the profile distribution
into their own directory somewhere under the server's pageroot. If
this is an ACS site, I recommend /admin/profile.
- If the server is running ACS, you're done. If not, you'll also need to
download the Ars
Digita AOLserver utilities and place the file somewhere that your
AOLserver can load it on start up (i.e. in either the shared or
private TCL directory).
In a browser, go to index.tcl in the directory where you installed the
contents of the pages directory. Select "Upload New Dataset" and fill in the
information requested by the form. All fields other than the short name are
optional, but some links generated elsewhere in the profile pages will be
broken if you don't provide an exact hostname for the site being profiled.
(This hostname should be the same as the one you passed to the crawler
script.) The upload will take a while (5-10 minutes for a 10,000 record
log file) - the TCL code has to dissect a CSV file and insert a record into
the database for each sample taken, and this is not a fast process.
After the upload is complete, go back to index.tcl, and you should
see an entry for the dataset you just uploaded. Click on it, and you get
a summary view that groups samples by stripping off the URL variables from
the paths and computes statistics on these groups. You can build your own
summary views by specifying filtering and grouping expressions in the form
on this page. The expressions provided here end up being put into WHERE and
GROUP BY clauses in a select on the profile_samples table, so you may
reference any columns from this table (with the exception of dataset_id,
which is already determined).
Log File Format
For those of you interested in writing your own analysis tools, the
logfile is a simple comma-separated value file. Each line in the file
represents a single hit on the site and is in this format (a short parsing
sketch follows the field descriptions below):
<seq>,<path>,<form>,<redirect>,<prevpath>,<result_cd>,<result_msg>,<html_valid>,<dur>,<timestamp>
- <seq>
- A simple sequence number recording the order in which hits occurred
- <path>
- The initial URL requested from the site, minus "http://<host>"
- <form>
- Form data sent to the site. If this is present, then a POST to the site
was done with this data. If absent, a simple GET. Currently no POSTs
are done by the script - this was simply put in for future expansion.
- <redirect>
- A semicolon-separated list of redirects. If the initial path results in
a redirect, the new location and any ensuing redirects will be listed
here. Note that the duration given for the page load includes the time
required to load all pages in the redirect chain.
- <prevpath>
- The page on which the link to path was found. This
path should probably always be either the path or last entry in the
redirect list (if any) of the previous sample.
- <result_cd>
- Normally, this will be the HTTP response code. However, certain errors
generate other error codes. Codes that aren't standard HTTP result codes
will always be negative:
- -1, -11:
An error occurred while connecting; no further info
- -12:
A timeout occurred (> 60 seconds) while fetching the page
- -13:
Encountered redirect to another host. (Note that this isn't
exactly an error, but we don't want the crawler wandering off
and profiling some other site.)
- -14:
Bad location header: The server returned a Location header that
had a relative or host-relative URL. These violate the HTTP
spec.
- -15:
The server returned 302, the HTTP redirect code, but did not
provide a Location header.
- <result_msg>
- A text result message
- <html_valid>
- 0 if HTML failed some sort of validation, 1 if it passed, empty string
if no validation was performed. Currently the only 'validation' done is a
check for unevaluated '<% %>' ADP tags.
- <dur>
- the length of time it took to fetch the page, including all pages in the
redirect chain, if any
- <timestamp>
- a timestamp telling when the fetch was initiated, in the format
"YYYY-MM-DD HH:MI:SS", with a 24-hour clock. Greenwich Mean Time is
used.
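As a starting point for such a tool, here is a small sketch in modern Python
that reads a log in this format and prints the slowest paths by average
duration (the function name and report layout are mine, not part of the
distribution; field positions follow the list above):

import csv
from collections import defaultdict

def slowest_paths(logfile, top=20):
    """Group samples by path (URL variables stripped) and rank by mean duration."""
    totals = defaultdict(lambda: [0.0, 0])    # bare path -> [total duration, samples]
    with open(logfile, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 10:
                continue                      # skip malformed lines
            path, dur = row[1], row[8]        # the <path> and <dur> fields
            bare = path.split("?", 1)[0]      # strip URL variables
            try:
                totals[bare][0] += float(dur)
                totals[bare][1] += 1
            except ValueError:
                continue
    ranked = sorted(((total / n, n, p) for p, (total, n) in totals.items() if n),
                    reverse=True)
    for avg, n, p in ranked[:top]:
        print("%10.3f %6d  %s" % (avg, n, p))

slowest_paths("walog")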
Frequently Asked Questions:
- Why do I get a "Bad Location Header" when I try to get
the crawler to log in to an ACS site?
The Crawler is really anal-retentive about some HTTP specs.
In this case, it refuses to follow a 302 redirect if the
location header is not a complete URL (i.e. including
"http://<hostname>"). ns_returnredirect, if given a
relative URL as input, generates such a Location header. Some
of the cookie-chain and login pages on older ACS versions use
ns_returnredirect this way, and as a result generate bad
location headers. The crawler sees these, reports an error,
and does not complete the login process. You can fix this
problem by modifying the offending pages to pass ns_returnredirect a
complete URL (including the "http://<hostname>" portion).
- My location headers are okay, but I'm still having
trouble getting the crawler to log in.
When you
specify the -acs logon option, the crawler attempts to log on
by doing a GET of /register/user-login.tcl with the email
address and password encoded as URL variables. The specific
field names used are "email" and "password_from_form". If
your user login page is not at this location, or expects different field
names, the login won't work. If you want to quickly modify
the crawler to make it work with your site (which I recommend
over changing the site to work with the crawler), look in the
crawler source at the function named "process_logon_args"
(it's the block of code that starts out with
"def process_logon_args"). Hopefully, even
those unfamiliar with Python will be able to figure out what
needs to be changed to get things working. Alternatively,
figure out how to use the -generic logon option. A rough sketch
of the login request the crawler sends appears below.
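For reference, here is that rough sketch, in modern Python (the "email" and
"password_from_form" field names come from the text above; the cookie and
redirect handling is simplified and is not the crawler's actual code):

from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener
from http.cookiejar import CookieJar

def acs_login(host, email, password):
    """GET /register/user-login.tcl and keep whatever cookies come back."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))   # follows redirects, stores cookies
    query = urlencode({"email": email, "password_from_form": password})
    opener.open("http://%s/register/user-login.tcl?%s" % (host, query), timeout=60)
    return opener          # later opener.open() calls send the session cookie

opener = acs_login("www.wineaccess.com", "burdell@gatech.edu", "thegoodword")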
Possible Improvements
Here are a few things I'd like to see but haven't had time to implement
just yet:
- I'd like to make the crawler smart enough to generate random data to send
to forms so that it can do POSTs as well as GETs. Also, it would be nice
to test how the POST recipients behave in response to random inputs.
I've made room for this in the data model and the CSV files that the crawler
outputs, but haven't had time to write the code to parse forms and work
out what inputs need to be sent, just yet.
- Run the HTML returned by each page through an HTML validator, and record
any errors found.
- Extract anything that appears to be English text and run it through a
spelling/grammar checker. (Might be better to just hire a proofreader,
though, if we can't come up with an automated system that doesn't give
a lot of false hits.)
- Handle SSL. This will probably have to wait until the Python httplib
module that I'm using to talk to the site gains this support.
- Integration with other ACS components? It seems like there may be some
opportunities to, for example, streamline the generation of tickets
for defects located by the crawler.
- Clean up the data load process so that aborting halfway doesn't leave
fragments of datasets lying around.
- A faster means of importing datasets into Oracle. TCL parsing of the data
files is slow. Maybe exec an external script, or use SQL*Loader?
- Implement PostgreSQL versions of the data model and pages, to make this
tool useful for people who don't have access to Oracle.
elorenzo@arsdigita.com