(Originally published in March, 1999. This article resides in the Historical Archive section of the ArsDigita System Journal)
The ArsDigita Server Architecture is a way of building and delivering
Web services cheaply and reliably. Since I've written a whole book about building sites, the focus of
this document is on maintaining Web services. However, since you can't
maintain something unless you build it first, we'll start with a little
bit of background.
Web services are usually built with loose specs, to tight deadlines, by
people who aren't the world's best educated or most experienced
programmers. Once online, Web applications are subject to severe
stresses. Operations get aborted due to network failure. So if your
transaction processing house isn't in order, your database will become
corrupt. You might get dozens of simultaneous users in a second. So if
your concurrency control house isn't in order, your database will become
corrupt.
A workable solution?
- leave the hard stuff of concurrency control and transaction
atomicity to a standard relational database management system (RDBMS)
- develop pages in a safe interpreted language
Note that Oracle + Perl DBI scripts running as CGI fulfill these
requirements. However, this isn't really fast enough for our kinds of
sites (e.g., www.scorecard.org;
featured on ABC World News and the recipient of 40 requests per
second, each requiring up to five database queries), so we
add a couple more elements:
- the Web server program is itself the database client
- the Web server program pools connections to the database and makes
them available to threads that need them
- the underlying operating system is Unix rather than NT
Given these requirements, we settled on the following technology
infrastructure for all the stuff we've built at ArsDigita:
- Solaris or HP-UX
- Oracle RDBMS
- AOLserver
- AOLserver Tcl API
Note that the choice of commercial Unices is not intended to disparage
Linux; Linux barely existed back in 1993 when we started building Web
apps and it wasn't able to run Oracle until late 1998.
Aren't we done then? This is proven stuff. Unix has barely changed
since the late 1970s, Oracle since the late 1980s, and AOLserver
(formerly NaviServer) since 1995. We've launched more than 100 public
Web services using this infrastructure.
We are not done. What if a disk drive fills up? Are we notified in
advance? Are we notified when it happens? How long does it take to
restore service?
That's what this document is about.
Making the Unix layer reliable
Unix per se is remarkably reliable, capable of handling more
than 100 million Web transactions between reboots. Unix is also very
complicated and, if it isn't running reliably, you probably need help
from a bona fide wizard. Unless you're behind a corporate firewall, you
have to be concerned with security. You don't want someone breaking in
and, if someone were to, you want to know immediately. This requires
arcane and up-to-the-minute knowledge of new security holes so we rely
on Scott Blomquist of techsquare.com
to tell us what to do.
Assuming a secure Unix machine that starts up and shuts down cleanly,
you need to think about the file systems. So that a hard disk failure
won't interrupt Web services, the ArsDigita Server Architecture mandates
mirrored disk drives using Solstice DiskSuite on Solaris or Mirror/UX on
HP. The Architecture mandates straight mirroring (RAID 1) in order to
retain high performance for the RDBMS. These disk drives should be on
separate SCSI chains.
Mirrored disks aren't all that useful when they are 100% full. Unix
programs love to write logs. Mailers log every action (see below). Web
servers log accesses, errors, and, the way we typically configure
AOLserver, database queries. The problem with these logging
operations is that they fill up disk drives but not so fast that
sysadmins are absolutely forced to write cron jobs to deal with the
problem.
For example, in December 1998, the Web services maintained by ArsDigita
were generating about 250 MB of log files every day. Some of these were
access logs that folks wanted to keep around and these generally get
gzipped. Some were error/query logs that had to be separately rolled
and removed. In any case, if left unattended, the logs will fill up a 9
GB disk drive sufficiently slowly that a human will delude himself into
thinking "I'll get to that when the disk is becoming full." The reality
is that every 6-12 months, the disks will fill up and the server will
stop serving pages reliably. The monitors described later in this
chapter will alert the sysadmins who will log in, rm and gzip a bunch of
files, and restart the Web servers. The outage might only last 15
minutes but it is embarrassing when it happens for the fourth or fifth
time in three years.
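The rolling itself is easy to automate; the failure is in never getting around to it. Here is a minimal sketch of the sort of nightly job involved, in AOLserver Tcl (the log directory, file naming, and 30-day retention are assumptions, not part of any ArsDigita tool):

# at 04:30 every night, compress rolled access logs and discard
# compressed logs that are more than 30 days old
ns_schedule_daily -thread 4 30 roll_old_logs

proc roll_old_logs {} {
    set log_dir "/web/foobar/log"
    foreach f [glob -nocomplain "$log_dir/foobar.log.*"] {
        if { [string match "*.gz" $f] } {
            if { [file mtime $f] < [expr [ns_time] - 30*86400] } {
                file delete $f
            }
        } else {
            catch { exec gzip $f }
        }
    }
}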
The naive approach to this problem is to build a db-backed Web
monitoring service on each Unix machine. Why a Web service? You don't
want to have to log into a machine to check it. The first thing the Web
service has to do is show a nicely formatted and explained display of
free disk space, CPU load, and network status. Tools such as HP
GlancePlus, which require shell access and X Windows, are generally
going to be better for looking at an instantaneous snapshot of the system.
So if we just have a page that shows a summary of what you'd see by
typing "df -k" and running top or GlacePlus, then that's nothing to
write home about.
Let's try to think of something that a Web service can do that a human
running shell tools can't. Fundamentally, what computers are great at
is remembering to do something every night and storing the results in a
structured form. So we want our Unix monitor to keep track of disk
space usage over time. With this information, the monitor can
distinguish between a safe situation (/usr is 95% full) and an unsafe
situation (/mirror8/weblogs/ is 85% full but it was only 50% full
yesterday). Then it can know to highlight something in red or scream at
the sysadmins via email.
Finishing up the naive design, what we need is an AOLserver that will
prepare Web pages for us showing current and historic free disk space,
current and historic CPU load, and current and historic RAM usage. In
order to keep this history, it will rely on an RDBMS running on
the same computer.
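For the disk-space piece, the naive monitor might amount to a single scheduled proc (a sketch only; the table name and the df parsing are assumptions, not the actual Cassandrix code):

# assumed table:
#   create table disk_space_history (
#       sample_date  date,
#       file_system  varchar(200),
#       pct_used     integer
#   );
ns_schedule_daily -thread 23 45 record_disk_space

proc record_disk_space {} {
    set db [ns_db gethandle]
    # Solaris df -k columns: filesystem kbytes used avail capacity mounted-on
    foreach line [lrange [split [exec df -k] "\n"] 1 end] {
        set pct_used   [string trimright [lindex $line 4] "%"]
        set mounted_on [lindex $line 5]
        ns_db dml $db "insert into disk_space_history
                       (sample_date, file_system, pct_used)
                       values (sysdate, '$mounted_on', $pct_used)"
    }
    ns_db releasehandle $db
}

With a few days of rows in a table like this, a report page can compare today's pct_used against yesterday's and flag the file systems that are both fairly full and growing quickly.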
Why do I keep referring to this as "the naive design"? First, you might
not be running an RDBMS on every computer that is important to your Web
operation. You probably have at least one RDBMS somewhere but you might
not want anyone connecting to it over the network. Second, you might
have 10 computers. Or 100. Are you really diligent enough to check
them all out periodically, even if it is as simple as visiting their
respective Unix monitors?
Each particular machine needs to run an AOLserver-based monitor.
However, the monitor needs to be configurable to either keep its history
locally or not. To handle monitors that don't keep local history and to
handle the case of the sysadmin who needs to watch 100 machines, the
monitor can offer up its current statistics via an XML page. A monitor
that is keeping history locally can then be configured to periodically
pull XML pages from the monitors that aren't. Thus, a sysadmin can look
at one page and see a summary of
dozens or even hundreds of Unix systems (critical problems on top,
short summaries of each host below, drill-down detail pages linked from
the summaries).
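A sketch of the two halves (the URL, element names, host list, and 15-minute interval are assumptions, not the actual Cassandrix code):

# on a monitored machine: serve the current statistics as XML
ns_register_proc GET /SYSTEM/stats.xml serve_stats_xml

proc serve_stats_xml {ignore} {
    set xml "<host name=\"[ns_info hostname]\">\n"
    foreach line [lrange [split [exec df -k] "\n"] 1 end] {
        append xml "  <filesystem mount=\"[lindex $line 5]\" pct_used=\"[string trimright [lindex $line 4] %]\"/>\n"
    }
    append xml "</host>\n"
    ns_return 200 text/xml $xml
}

# on the history-keeping monitor: pull each host's XML every 15 minutes
ns_schedule_proc -thread 900 poll_remote_monitors

proc poll_remote_monitors {} {
    foreach host {web1.foobar.com web2.foobar.com} {
        if { [catch { ns_httpget "http://$host/SYSTEM/stats.xml" 30 } page] } {
            ns_log Error "poll_remote_monitors: $host unreachable: $page"
        } else {
            # parse $page and record the numbers in the history tables
        }
    }
}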
We can't think of a good name for this so we call it ArsDigita
Cassandrix ("Cassandra for Unix"). It is
available from http://arsdigita.com/free-tools/cassandrix.
Making the Oracle layer reliable
What you'd naively think you were buying from Oracle is abstraction.
You give Oracle bytes and Oracle figures out where to put them, how to
index them, and how to reclaim their space once you've deleted them. In
fact, this is not how Oracle works at all. You have to tell Oracle in
which Unix files to put data. You have to tell Oracle how big a segment
in each file to give a new table. If you build up a million-row table
to perform a calculation and then delete all of those rows, Oracle will not
recover the space until you drop or truncate the table.
In theory, providing humans with many opportunities to tell Oracle what
to do will yield higher performance. The human is the one who knows how
data are to be accessed and how fast data sets will grow. In practice,
humans are lazy, sloppy, and easily distracted by more interesting
projects.
What happens is that Oracle becomes unable to update tables or insert
new information because a tablespace is full. Each disk drive on which
Oracle is installed might have 17 GB of free space but Oracle won't try
to use that space unless explicitly given permission. Thus the
canonical Oracle installation monitor watches available room in
tablespaces. There are lots of things like this out there. There is
even the meta-level monitor that will query out a bunch of Oracle
databases and show you a summary of many servers (see Chapter 6 of Oracle8
DBA Handbook). Oracle Corporation itself has gradually
addressed this need over the years with various incarnations of their
Enterprise Manager product.
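The core of that canonical monitor is one query against the DBA views. A sketch in AOLserver Tcl (the 90% threshold and the use of ns_log rather than a formatted page are my own simplifications):

set sql "select t.tablespace_name,
                round(100 * (1 - nvl(f.free_bytes, 0) / t.total_bytes)) as pct_used
         from (select tablespace_name, sum(bytes) as total_bytes
               from dba_data_files group by tablespace_name) t,
              (select tablespace_name, sum(bytes) as free_bytes
               from dba_free_space group by tablespace_name) f
         where t.tablespace_name = f.tablespace_name (+)
         order by 2 desc"

set db [ns_db gethandle]
set selection [ns_db select $db $sql]
while { [ns_db getrow $db $selection] } {
    # column 0 is the tablespace name, column 1 the percent used
    set tablespace [ns_set value $selection 0]
    set pct_used   [ns_set value $selection 1]
    if { $pct_used > 90 } {
        ns_log Warning "tablespace $tablespace is ${pct_used}% used"
    }
}
ns_db releasehandle $db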
Personally, I haven't found Enterprise
Manager useful because you need to have a Windows machine and have
SQL*Net enabled on your Oracle server (an expensive-to-manage security
risk). Furthermore, once you get all of that stuff set up, Enterprise
Manager won't actually answer many of the most important questions, e.g., those
having to do with deadlocks or the actual queries being run by
conflicting users. As with everything else in the RDBMS world, help is
just around the corner, at least if you like to read press releases.
Oracle is coming out with a new version of Enterprise Manager. It has
an immensely complicated architecture. You can connect via "thin
clients" (as far as I can tell, this is how Fortune 500 companies spell
"Web browser"). The fact of the matter is that if Oracle or anyone else
really understood the problem, there wouldn't be a need for new
versions. Perhaps it is time for an open-source movement that will let
us add the customizations we need.
What if, instead of installing packaged software and figuring out 6
months from now that it won't monitor what we need to know, we set up an
AOLserver with select privileges on the DBA views? To better manage
security, we run it with a recompiled version of our Oracle driver that
won't send INSERT or UPDATE statements to the database being monitored.
DBAs and customers can connect to the server via HTTPS from any browser
anywhere on the Internet ("thin client"!). We start with the scripts
that we can pull out of the various Oracle tuning, DBA, and SQL-script
books you can find at a bookstore. Then we add the capability to answer
some of the questions that have tortured us personally (see
my Oracle tips page). Then we
make sure the whole thing is cleanly extendable and open-source so that
we and the rest of the world will eventually have something much better
than any commercial monitoring tool.
We can't think of a good name for this monitor either but at this point
in the discussion we have to give it a name. So let's call it
ArsDigita Cassandracle ("Cassandra for Oracle").
Here's a minimal subset of questions that Cassandracle needs to be able
to answer:
- whether there are any deadlocks in the database and who is causing
them
- whether any tablespaces are getting dangerously full
- whether any particular user or query or table is becoming a pig
- who is connected to the database and from where
- whether there are any performance bottlenecks in the database, e.g.,
whether we have enough cache blocks configured
- what PL/SQL procedures and functions are defined (list of objects,
hyperlinked to their definitions pulled from the database)
- what tables are defined (list of objects, hyperlinked to their
definitions pulled from the database)
The system has to assume that people aren't drilling down just for fun.
So the pages should be annotated with knowledge and example SQL for
fixing problems. For example, on a tablespace page, the system should
notice that a datafile isn't marked to autoextend and offer the syntax
to change that. Or notice that maxextents isn't unlimited and offer the
syntax to change that (explaining the implications).
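For instance, the datafile page might be built by something like the following sketch (not the actual Cassandracle code; the red font markup and the NEXT/MAXSIZE values are arbitrary):

set db [ns_db gethandle]
set selection [ns_db select $db "select file_name, tablespace_name, autoextensible
                                 from dba_data_files
                                 order by tablespace_name"]
set html "<ul>\n"
while { [ns_db getrow $db $selection] } {
    set file_name  [ns_set value $selection 0]
    set tablespace [ns_set value $selection 1]
    set autoextend [ns_set value $selection 2]
    append html "<li>$tablespace: $file_name"
    if { $autoextend == "NO" } {
        # annotate the problem and offer the SQL that fixes it
        append html " <font color=red>not set to autoextend</font><br>
<code>alter database datafile '$file_name' autoextend on next 50M maxsize unlimited;</code>"
    }
    append html "\n"
}
append html "</ul>\n"
ns_db releasehandle $db
ns_return 200 text/html $html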
ArsDigita Cassandracle is available from http://arsdigita.com/free-tools/cassandracle.
Making AOLserver reliable
Although AOLserver is extremely reliable, we don't rely on it being
reliable. In fact, we engineer our services so that the Web server
could crash every 10 minutes and users need never be aware. Here are the
elements of a Web service that does not depend on a Web server program
being reliable:
- keep user session state either in cookies on the user's browser or
in the RDBMS on the server; do not keep session state (except perhaps
for cached passwords) in the server
- run the Web server program from /etc/inittab so that if it crashes,
init will immediately restart it. Here's an example line from
/etc/inittab:
nsp:34:respawn:/home/nsadmin/bin/nsd -i -c /home/nsadmin/philg.ini
This tells init, when the system is in run state 3 or 4 (i.e., up and
running), to run the NaviServer daemon (nsd) in interactive mode, with
the config file /home/nsadmin/philg.ini.
- in the event that AOLserver gets stuck or clogged with threads
waiting for a system resource (see the MTA example below), run ArsDigita
Keepalive on the same machine to make sure that the public server is
delivering files or scripts. If it isn't, Keepalive will kill a hung
AOLserver and notify the sysadmins.
Note that this same set of techniques and open-source software would
work with any other Web server program, especially easily with a
threaded Web server that only requires one Unix process.
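As an illustration of the idea (this is not the Keepalive source; the test URL, pid file location, check interval, and alert address are assumptions), the check amounts to fetching a known page with a timeout and taking drastic action if the fetch fails:

# every two minutes, ask the public server for a trivial page
ns_schedule_proc -thread 120 check_public_server

proc check_public_server {} {
    if { [catch { ns_httpget "http://www.foobar.com/SYSTEM/test.text" 30 } result] } {
        ns_log Error "check_public_server: fetch failed: $result"
        # kill the hung server; init (via /etc/inittab) will respawn it
        # (the pid file path is an assumption)
        if { ![catch { exec cat /home/nsadmin/log/nspid.foobar } pid] } {
            catch { exec kill -9 $pid }
        }
        ns_sendmail "sysadmins@foobar.com" "keepalive@foobar.com" \
            "www.foobar.com restarted" \
            "The public server did not answer within 30 seconds and was killed."
    }
}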
Notifications of problems with your Web scripts
After launching a new Web service, it is common practice to check the
error logs every hour or so to make sure that users aren't getting
slammed. It is also good practice to install ArsDigita
Reporte, which will run every night and build you a report that
shows which URLs are generating server errors.
There are a couple of problems with these common practices. First,
users and the publisher might not be too thrilled about server errors
piling up for 24 hours before anyone notices. Second, server errors
that occur infrequently are likely to go unnoticed forever. The site is
launched so the developers aren't regularly going through the error
log. Users aren't used to perfection on their desktops or from their
ISP. So they won't necessarily send complaint email to webmaster. And
even if they do, such emails are likely to get lost amongst the
thousands of untraceable complaints from Win 3.1 and Macintosh users.
The ArsDigita Server Architecture mandates running
the ArsDigita
Watchdog server on whichever computer is running the Web service (or
another machine that has access to the public server's log files).
Watchdog is a separate simple AOLserver that does not rely on the Oracle
database (so that an Oracle failure won't also take down the Watchdog
server).
Watchdog is a collection of AOLserver Tcl scripts that, every 15
minutes, checks the portion of the log file that it hasn't seen
already. If there are errors, these are collected up and emailed to the
relevant people. There are a few interesting features to note
here. First, the monitored Web services need not be AOLserver-based;
Watchdog can look at the error log for any Web server program. Second,
if you're getting spammed with things that look like errors but in fact
aren't serious, you can specify ignore patterns. Third, Watchdog lets
you split up errors by directory and mail notifications to the right
people (e.g., send /bboard errors to joe@yourdomain.com and /news errors
to jane@yourdomain.com).
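The core loop is simple enough to sketch (not the actual Watchdog code; the offset file, log path, ignore patterns, and addresses are assumptions): remember how far into the error log the previous scan got, read only the new bytes, drop lines matching the ignore patterns, and mail whatever is left.

ns_schedule_proc -thread 900 scan_error_log

proc scan_error_log {} {
    set log_file    "/home/nsadmin/log/foobar-error.log"
    set offset_file "/home/nsadmin/watchdog/foobar.offset"
    set ignore_patterns {"*Broken pipe*"}

    # read the byte offset where the previous scan stopped
    set offset 0
    catch {
        set f [open $offset_file r]
        set offset [string trim [read $f]]
        close $f
    }

    # read only the portion of the log we haven't seen already
    set f [open $log_file r]
    seek $f $offset
    set new_text [read $f]
    set new_offset [tell $f]
    close $f

    # keep error lines that don't match an ignore pattern
    set report ""
    foreach line [split $new_text "\n"] {
        if { ![string match "*Error:*" $line] } { continue }
        set ignorable 0
        foreach pattern $ignore_patterns {
            if { [string match $pattern $line] } { set ignorable 1 }
        }
        if { !$ignorable } { append report "$line\n" }
    }

    if { $report != "" } {
        ns_sendmail "errors@foobar.com" "watchdog@foobar.com" \
            "errors on www.foobar.com" $report
    }

    # remember where we stopped for the next scan
    set f [open $offset_file w]
    puts $f $new_offset
    close $f
}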
Watchdog is an open-source software application, available for download
from http://arsdigita.com/free-tools/watchdog.html.
Verifying Network Connection and DNS
You want to know if the Internet connection to your server is down. You
want to know if the DNS servers for your Internet domain are down. You
want to know if the InterNIC, in its monopolistic wisdom, has decided
to shut off your domain at the root servers.
It is impossible to test Internet connectivity from within your server
cluster. It is impossible to test the InterNIC's root servers from
a machine that uses your DNS servers as its name server. Thus, you
really want to monitor end-to-end connectivity from a machine outside of
your network that requests "http://www.yourdomain.com/SYSTEM/test.text"
or something (i.e., the request is for a hostname, not an IP address).
The ArsDigita Server Architecture mandates monitoring from
our own Uptime
service, either the public installation or a private copy.
How to avoid outages of this nature in the first place? Pay your
InterNIC bills in advance. We know countless domain owners, including
some multi-$billion companies, who've been shut off by the InterNIC
because they were a few days late in paying a bill. Sometimes InterNIC
sends you a letter saying "if you don't pay by Date X, we'll shut you
down" but then you find that they've shut you down on Date X-4.
In 1999, Microsoft failed to pay their Network Solutions renewal for a
domain used by the Hotmail service that they acquired. On Christmas Eve
the service became unreachable. Michael Chaney, a Linux programmer,
debugged the problem and paid the $35 so that he could read his email
again. Thus we find the odd situation of a Microsoft-owned service that
runs without any Microsoft software (Hotmail runs on Solaris and FreeBSD
Unix) and with financial backing from the Linux community.
You can avoid service loss due to power outage by buying a huge,
heavy, and expensive uninterruptible power supply. Or you can simply
co-locate your server with a company that runs batteries for everyone
with a backup generator (above.net and Exodus do this).
Most Internet service providers (ISPs) are terribly sloppy. Like any
fast-growth industry, the ISP biz is a magnet for greedy people with a
dearth of technical experience. That's fine for them. They'll all go
public and make $100 million. But it leaves you with an invisible
server and a phone conversation with someone they hired last week.
Moreover, even if your ISP were reliable, they are connected to the
Internet by only one Tier 1 provider, e.g., Sprint. If the Sprint
connection to your town gets cut, your server is unreachable from the
entire Internet.
There is an alternative: co-locate at AboveNet or Exodus.
These guys have peering arrangements with virtually all the Tier 1
providers. So if Sprint has a serious problem, their customers won't be
able to see your server, but the other 99% of the Internet will be able
to get to your pages just fine.
For public Internet services, the ArsDigita Server Architecture mandates
co-location at AboveNet or Exodus.
For Intranet services, the ArsDigita Server Architecture mandates
location as close as possible to a company's communications hub.
E-mail Server: It can take your Web services down
The mail transfer agent (MTA) on a Unix machine is not commonly
considered mission-critical. After all, if email can't be delivered the
attempting server will usually retry in four hours and continue retrying
for four or five days. In fact, a failure in the MTA can take down Web
services. Consider the following scenario:
- your Web server is sending out lots of email alerts to users
- many of those alerts are bouncing back because folks have changed
their email addresses
- you don't have any special scripts in place to handle bounces
programmatically
- bounces are saved on disk for postmaster@yourbox.com to read
- nobody ever checks the postmaster@yourbox.com inbox
- the MTA is installed in its default directory (in the typically
small /var partition)
- /var fills up after a few months or years of this
- the MTA stops accepting mail from the Web server
Doesn't sound so bad, does it? Perhaps users get a few error messages.
In fact, MTA failure can lead to a complete loss of Web service. At greenspun.com, we used to run Netscape
Messaging Server. Periodically, the server would get confused and
attempted connections to Port 25 would hang and/or time out. The
problem here is that we had a program to insert a message into a
bulletin board that worked the following way:
- get a db connection from AOLserver
- check user input
- insert message into database
- query the database to see if there are any registered alerts for new
postings of this nature
- while looping through rows coming back from the database, try to
send email if appropriate
Upon termination, AOLserver would return the database connection to its
pool, just as with any other program that sits behind a dynamic page.
This worked fine as long as sending email was reasonably quick. Even
with 80 or 100 alerts registered, the thread would terminate after 10 or
15 seconds. Suppose that each attempt to send email requires waiting 30
seconds for a wedged Netscape Messaging Server and then timing out.
That means a thread trying to send 100 alerts will take at least 50
minutes to execute. During those 50 minutes, a connection to Oracle
will be tied up and unavailable to other threads. As AOLserver is
typically configured to open a maximum of 8 or 12 connections to a
database, that means the entire pool will be occupied if more than 12
messages are posted to discussion forums in a 50-minute period. Anyone
else who requests a database-generated page from greenspun.com will find
that his or her thread blocks because no database connections are
available. Within just a few seconds, the server might be choked with
100 or more stacked-up threads, all waiting for a database connection.
The bottom line? Users get a "server busy" page.
How to defend against this situation? First, build a robust MTA
installation that doesn't get wedged and, if it does get wedged, make sure
that you find out about it quickly. A wedged MTA would generally result in errors being written
to the error log and therefore you'd expect to get email from the
ArsDigita Watchdog monitor... except that the MTA is wedged so the
sysadmins wouldn't see the email until days later. So any monitoring of
the MTA needs to be done by an external machine whose MTA is presumably
functional.
The external machine needs to connect to the SMTP port (25) of the
monitored server every five minutes. This is the kind of monitoring
that the best ISPs do. However, I don't think it is adequate because it
usually only tests one portion of the average MTA. The part of the MTA
that listens on port 25 and accepts email doesn't have anything to do
with the queue handler that delivers mail. Sadly, it is the queue
handler that usually blows chunks. If you don't notice the problem
quickly, you find that your listener has queued up 500 MB of mail and it
is all waiting to be delivered but then your /var partition is full and
restarting the queue handler won't help... (see beginning of this
section).
What you really need to do is monitor SMTP throughput. You need a
program that connects to the monitored server on port 25 and specifies
some mail to be sent to itself. The monitor includes a script to
receive the sent mail and keep track of how long the round-trip took.
If mail takes more than a minute to receive, something is probably wrong
and a sysadmin should be alerted.
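A sketch of the sending half (the probe addresses, the table, and the delivery mechanism for the received mail are assumptions): every fifteen minutes, push a message through the monitored MTA with the send time recorded, and have the script that receives the message fill in the arrival time so that slow or missing deliveries can raise an alert.

ns_schedule_proc -thread 900 send_smtp_probe

proc send_smtp_probe {} {
    set now [ns_time]
    set db [ns_db gethandle]
    # assumed table: smtp_probes(sent_time integer, received_time integer)
    ns_db dml $db "insert into smtp_probes (sent_time) values ($now)"
    ns_db releasehandle $db
    # assuming this AOLserver's configured mail host is the machine under
    # test, ns_sendmail hands the probe to the MTA we want to measure
    ns_sendmail "probe@monitor.foobar.com" "probe@www.foobar.com" \
        "smtp-probe $now" "round-trip probe"
}
# the script that receives mail for probe@monitor.foobar.com updates
# received_time for the matching sent_time; a report page then alerts if
# received_time - sent_time exceeds 60 seconds or if probes stop arriving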
We are going to build a database-backed Web service to do this
monitoring, mostly by adapting email handling scripts we needed for the Action
Network. In a fit of imagination, we've called it ArsDigita MTA
Monitor. It is available from http://arsdigita.com/free-tools/mmon.html.
Even with lots of fancy monitors, it is unwise and unnecessary to rely
on any MTA working perfectly. Moreover, if an MTA is stuck, it is hard
to know how to unstick it quickly without simply dropping everyone's
accumulated mail. This might not be acceptable.
My preferred solution is to reprogram applications so that threads
release database handles before trying to send email. In the case of
the insert-into-bboard program above, the thread uses the database
connection to accumulate the email messages to send in an ns_set data
structure. After releasing the database connection back to AOLserver,
the thread proceeds to send the accumulated messages.
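In AOLserver Tcl, the restructured code looks roughly like this (a sketch with hypothetical table and column names, not the actual bulletin board module):

set db [ns_db gethandle]
# ... check user input, insert the message, then collect the alert
# recipients while we still hold the handle ($topic and $message_body
# come from the form data checked above) ...
set alert_emails [ns_set create alert_emails]
set selection [ns_db select $db "select email from bboard_email_alerts
                                 where topic = '$topic'"]
while { [ns_db getrow $db $selection] } {
    ns_set put $alert_emails email [ns_set value $selection 0]
}
# give the connection back to the pool *before* touching the MTA
ns_db releasehandle $db

# even if the MTA is wedged, no Oracle connection is tied up now
for {set i 0} {$i < [ns_set size $alert_emails]} {incr i} {
    catch { ns_sendmail [ns_set value $alert_emails $i] "bboard@foobar.com" \
                "New posting in $topic" $message_body }
}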
One minor note: you don't want someone using your server to relay
spam (i.e., connect to your machine and send email with a bogus return
address to half the Internet). Configuring your machine to deny relay
is good Internet citizenship but it can also help keep your service up
and running. If your MTA is working to send out 300,000 spam messages
and process the 100,000 bounces that come back, it will have a hard time
delivering email alerts that are part of your Web service. It might
get wedged and put you back into some of the horror scenarios discussed
at the beginning of this section.
For email, the ArsDigita Server Architecture mandates
- anti-relay configuration of MTA
- bounce handling via scripts
- identification and monitoring of free space in the MTA's partition
(via ArsDigita Cassandrix)
- monitoring of SMTP port response every five minutes
- monitoring of SMTP throughput (connection in, mail received back on
monitoring server) every fifteen minutes
- Web scripts that release database connections back to the pool
before trying to send email
Load Testing
Generally Web traffic builds gradually and you'll have months of warning
before running out of server capacity. However, before going public
with something like real-time race score results for the Boston
Marathon, you probably want to do some load testing.
We have built a tool called ArsDigita Traffic Jamme that is
really an AOLserver repeatedly calling ns_httpget. It is a
quick and dirty little tool that can generate about 10 requests per
second from each load machine.
The software is available from
http://www.arsdigita.com/free-tools/tj.html.
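The heart of such a tool fits in a few lines (a sketch of the approach, not the Traffic Jamme source; the URL list and per-second request count are assumptions):

# roughly 10 requests per second from this load machine; if the target
# slows down, the scheduled runs pile up, which is part of the test
ns_schedule_proc -thread 1 hammer_once_a_second

proc hammer_once_a_second {} {
    set urls {
        "http://test.foobar.com/index.tcl"
        "http://test.foobar.com/bboard/"
    }
    for {set i 0} {$i < 10} {incr i} {
        set url [lindex $urls [expr $i % [llength $urls]]]
        if { [catch { ns_httpget $url 30 } result] } {
            ns_log Error "load test: $url failed: $result"
        }
    }
}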
If you are ordering a computer and can't figure out how big a machine to
get, our practical experience is that you can handle about 10 db-backed
requests per second with each 400 MHz SPARC CPU (using AOLserver
querying into indexed Oracle8 tables). Our benchmarks show that a 160
MHz HP-PA RISC CPU is just as fast (!). If you're serving static files
with a threaded server program like AOLserver, you can saturate a 10
Mbit link with a ridiculously tiny computer. So don't worry much about
the extra server overhead from delivering photos, illustrations, and
graphics.
What if your one big computer fails the load test?
Some of the folks using this architecture have services that get close
to 200 million hits per day. That's 40 million page loads, which is
about 1000 page loads per second during peak afternoon hours. By the
standard above, it would seem that in mid-1999 you could handle this
with 40 of the fastest HP PA CPUs or 100 SPARC CPUs ... Oops! The
biggest Sun E10000 can only hold 64 CPUs.
Another way to run out of juice is if you're running a public Web
collaboration service with private data. In this case, all the pages
will be SSL-encrypted. This puts a tremendous CPU load on the Web
server, especially if your service revolves around images or, even
worse, PDF files with embedded print-resolution images.
The solution? Buy the big computer, but only use it to run Oracle. Buy
a rack of small Unix machines and run AOLserver, including the SSL
module, on these. Suppose that you end up with 21 computers total.
Haven't you violated the fundamental tenets of the philosophy expounded
here? What kind of idiot would build a service that depends on 21
computers all being up and running 24x7? Well, let's not go into that
right now... but anyway, that's not what the ArsDigita Server
Architecture proposes.
Generally we rely on only one source of server-side persistence: Oracle.
As noted in the preceding section, we "keep user session state either in
cookies on the user's browser or in the RDBMS on the server". Sometimes
data are cached in AOLserver's virtual memory but the service doesn't
fail if a request misses the cache, it is only slowed down by enough
time to do an Oracle query. Given this fact, it doesn't really matter
if Joe User talks to Server 3 on his first request and Server 17 on his
second. So we can use load-balancing network hardware to give all the
machines in a rack of Web servers the same IP address. If one of the
servers dies, all future requests will be routed to the other servers.
Joe User doesn't depend on 21 computers all being up. Joe User depends
on the database server being up, at least 1 out of 20 of the Web servers
being up, and the fancy network hardware being up. We've solved our
scalability problem without dramatically reducing site reliability.
(Sources of fancy network hardware: Alteon, Foundry Networks, and Cisco
(Local Director).)
What if you're breaking some of the rules and relying on AOLserver to
maintain session state? You have to make sure that if Joe User starts
with Server 3 every subsequent request from Joe will also go to Server
3. You don't need fancy network hardware anymore. Give each server a
unique IP address. Give each user a different story about what the IP
address corresponding to www.yourdomain.com is. This is called
"round-robin DNS". A disadvantage of this approach is that if a server
dies, the users who've been told to visit that IP address will be denied
service. You'll have to be quick on your toes to give that IP address
to another machine. Also, you won't get the best possible load
balancing among your Web servers. It is possible that the 10,000 people
whom your round-robin DNS server directed to Server 3 are much more
Web-interested than the 10,000 people whom your round-robin DNS server
directed to Server 17. These aren't strong arguments against
round-robin DNS, though. You need some kind of DNS server and DNS is
inherently redundant. You've eliminated the fancy network hardware (a
potential source of failure). You'll have pretty good load balancing.
You'll have near state-of-the-art reliability.
[An astute reader will note that I didn't address the issue of what
happens when the computer running Oracle runs out of power. This is
partly because modern SMP Unix boxes are tremendously powerful. The
largest individual HP and Sun machines are probably big enough for any
current Web service. The other reason that I didn't address the issue
of yoking many computers together to run Oracle is that it is
tremendously difficult, complex, and perilous. The key required elements
are software from Oracle (Parallel Server), a disk subsystem that can
be addressed by multiple computers (typically an EMC disk array), and a
few $million to pay for it all.]
How to administer a configuration like this? See
"Web Server Cluster
Management".
Making it work for five years
Design, engineering, and programming are trivial. Maintenance and
support are hard.
This might sound like an odd perspective coming from ArsDigita, a company whose reputation is
based on design, engineering, and programming. But think about your own
experience. Have you ever had a flash of inspiration? A moment where
you had a great and clever idea that made a project much more elegant?
Congratulations.
Have you ever gone for five straight years without making any mistakes?
Neither have we. Here's a letter that I sent a customer after we were
all patting ourselves on the back for building an ecommerce system that
successfully processed its first few thousand orders:
We are now running a fairly large-scale software and hardware system.
The components include
1) software at factory
2) factory bridge
3) Web server software
4) CyberCash interface
5) data warehouse software
6) disaster recovery software and hardware
A lot of valuable knowledge is encapsulated inside of various folks'
heads. This wouldn't be such a bad problem if
a) we all worked in the same building
b) we only needed the knowledge 9 to 5 when everyone was at work
But in practice we have people separated in time and space and therefore
knowledge separated in time and space.
What we therefore need is an on-line community! (Who would have guessed
that I would suggest this?)
Most Web services are necessarily maintained by people scattered in
space and time. There is no practical way to assemble everyone with the
required knowledge in one place and then keep them there 24x7 for five
years.
Note that corporate IT departments face the same sorts of problems.
They need to keep knowledge and passwords where more than one person can
get to them. They need to make sure that backups and testing are
happening without merely relying on an individual's word and memory.
They need to make sure that bugs get identified, tracked, and fixed.
The corporate IT folks have it easy, though, in many ways. Typically
their operation need not run 24x7. An insurance company probably has
overnight analysis jobs but a database that is down at 2:00 am won't
affect operations. By contrast, at an ecommerce site, downtime at 2:00
am Eastern time means that customers in Europe and Asia won't be able to
order products.
Another way in which corporate IT folks have it easy is that there is a
logical physical place for written system logs and password files. The
mainframe is probably located in a building where most of the IT workers
have offices. There are elaborate controls on who gets physical access
to the mainframe facility and therefore logs may be safely stored in a
place accessible to all IT workers.
What do we need our virtual community to do? First, it has to run on a
separate cluster and network from the online service. If www.foobar.com
is unreachable, any staff member ought to be able to visit the
collaboration server and at least get the phone number for the sysadmins
and ISP responsible for the site. Thus the only information that a
distraught staffer needs to remember is his or her email address, a
personal password, and the hostname of the collaboration server. We can
standardize on a convention of staff as the hostname, e.g.,
http://staff.foobar.com.
Second, the community needs to distinguish among different levels of
users. You want a brand-new staff person involved with the service to
be able to report a bug ("open a ticket"). But you don't want this
person to be able to get information that would appropriately be given
only to those with the root password on the online server.
Third, the community needs to make the roles of different members
explicit. Anyone should be able to ask "Who is responsible for
backups?" or "Who are the Oracle dbas?"
Fourth, the community needs to keep track of what routine tasks have
been done, by whom, and when. If tasks are becoming overdue, these need
to be highlighted and brought to everyone's attention. For example, the
community should have visible policies about how often backup tapes are
transferred off-site, about how often backup tapes are verified, and
about how often Oracle dumps are restored for testing. This is the
function closest to the "virtual logbook" idea. It is essential that
this be high quality software that is easy for people to use. Consider
the corporate IT department where everyone can see the tapes, see people
changing them, see people verifying some of the tapes, see people taking
some of the tapes off-site. If some of these processes were being
ignored, the staff would notice. However, with a Web service, the
machine is probably at a co-location service such as AboveNet or Exodus.
Most staffs will probably never even lay eyes on the server or the
backup tapes. So a miscommunication among staffers or between staff and
the ISP could lead to backup tapes never getting changed or verified.
Fifth, the community site needs to keep track of what important admin
tasks are going on, who is doing them, and why. For example, suppose
that some file systems have become corrupt and Joe Admin is restoring
them from tape. Joe has commented out the normal nightly backup script
from root's crontab because otherwise there is a danger that the cron
job might write bad new data over the important backup tape (remember
that Joe will not be physically present at the server; he will have to
rely on the ISP to change tapes, set the write-protect switch on the
tape, etc.). If Jane Admin does not know this, she might note the
absence of the backup script in root's crontab with horror and rush to
put it back in. Remember that Jane and Joe may each be working from
homes in San Francisco and Boston, respectively. There needs to be a
prominent place at staff.foobar.com where Jane will be alerted to Joe's
activity.
Sixth, the community software needs to be able to track bug reports and
new feature ideas. In order to do a reasonable job on these, the
software needs to have a strong model of the software release cycle
behind the site. For example, a programmer ought to be able to say, in
a structured manner, "fixed in Release 3.2." Separately the community
database keeps track of when Release 3.2 is due to go live. Feature
request and bug tracking modules tie into the database's user groups and
roles tables. If a bug ticket stays open too long, email alerts are
automatically sent to managers.
Seventh, the community software needs to at least point to the major
documentation of the on-line service and also serve as a focal point for
teaching new customer service employees, programmers, db admins, and
sysadmins. A question about the service ("how do I give a
customer an arbitrary $5 refund?") or the software ("what cron job is
responsible for the nightly Oracle exports?") should be archived with its
answer.
How do we accomplish all this? With the ArsDigita
Community System! Plus a few extra modules that are adapted to
ticket tracking and have special columns for "backup tapes verified
date" and so forth.
Taking it down for five hours
Suppose that you're upgrading from Oracle 8.1.5 to Oracle 8.1.6. You
don't want your Web server trying to connect to Oracle during this
upgrade. Nor if the database is unavailable do you want users to be
confronted with server errors. You'd like them to get a nice "we're
upgrading some software; please come back after 2:00 am eastern time"
message.
The naive approach:
- shut off main server
- bring up temporary server rooted in a Unix directory empty save for
an index page that says "come back later"
- shut off temporary server
- bring up main server
This works great for people who come to http://www.foobar.com but not
for those who've bookmarked
http://www.foobar.com/yow/some-internal-page.html. People following
bookmarks or links will get a "file not found" message. With AOLserver
2.3, you could configure the file-not-found message to say "come back
later" but that really isn't elegant.
A better way is to prepare by keeping a server rooted at
/web/comebacklater/www/ with its own /home/nsadmin/comebacklater.ini
file, ready to go at all times. This server is configured as follows:
- same hostname, IP address, and port as the production server
- Private and Shared Tcl libraries both pointing to
/web/comebacklater/tcl/ (this avoids any ns_register_proc commands that
might be invoked by the shared library, e.g., those that feed *.tcl URLs
to the Tcl interpreter), e.g.,
[ns/server/comebacklater/tcl]
Library=/web/comebacklater/tcl
SharedLibrary=/web/comebacklater/tcl
- one file in the Tcl library, /web/comebacklater/tcl/whole-site.tcl
containing the following code:
ns_register_proc POST / comeback
ns_register_proc GET / comeback

# return the "come back later" page no matter what URL was requested
proc comeback {ignore} {
    ns_returnfile 200 text/html "[ns_info pageroot]/index.html"
}
Given this configured comebacklater server, and a commented-out entry in
/etc/inittab to invoke it, here's how to gracefully take a site down for
maintenance:
- visit /web/comebacklater/www/index.html and edit the file until it
contains the message that you'd like users to see
- go into /etc/inittab and comment out the line that invokes the
production server
- init q or kill -HUP 1 to instruct init to reread /etc/inittab
- kill the production server (ps -ef | grep 'foobar.ini' for the
production server's PID, then kill PID)
- grep for the production server to make sure that init hasn't
restarted it for some reason
- go into /etc/inittab and uncomment the line that invokes the
comebacklater.ini server
- init q or kill -HUP 1 to instruct init to reread /etc/inittab
- grep for the comebacklater.ini server to make sure that init has
started it; also visit /home/nsadmin/log/comebacklater-error.log to make
sure that the server started without problems
- verify from a Web browser that the comebacklater server is operating
properly
- **** do your database maintenance (or whatever) ****
- go into /etc/inittab and comment out the comebacklater.ini line
- init q or kill -HUP 1 to instruct init to reread /etc/inittab
- kill the comebacklater.ini server
- go into /etc/inittab and uncomment the line that invokes the
production (foobar.ini) server
- init q or kill -HUP 1 to instruct init to reread /etc/inittab
- verify that the server started up without problems, from
/home/nsadmin/log/foobar-error.log
- verify from a Web browser that the production server is operating
properly
Summary
Here's a review of the main elements of the ArsDigita Server
Architecture:
- monitor of free disk space and disk space growth
- monitor of Oracle tablespace freespace and freespace shrinkage
- use of ArsDigita Keepalive to monitor live Web servers and restart
them if necessary
- frequent (every 15 minutes) automated checking of Web server error
logs and email notification to appropriate sysadmins and programmers
- daily log analysis via ArsDigita Reporte to find pages that
consistently produce errors or 404 Not Found results
- frequent automated checking of network connectivity and DNS
resolution of your hostname from outside your network (via ArsDigita
Uptime); prepayment of InterNIC bills; co-location at AboveNet or Exodus.
- careful attention to the mail transfer agent configuration,
automated bounced email handling, special monitoring of free disk space,
monitoring of SMTP response and throughput, and programming in such a
way that a hung MTA won't bring down the entire service
- staff collaboration server on a separate cluster where sysadmins get
together to log activities, where problem tickets are opened, discussed,
and closed, where contact and role information is available in emergencies
Another way to look at this is in terms of the infrastructure layers
that we introduced at the beginning of this document:
- Solaris or HP-UX. Disks mirrored with vendor tools. Monitored by
ArsDigita Cassandrix.
- Oracle RDBMS. Monitored by ArsDigita Cassandracle.
- AOLserver. Run from /etc/inittab. Monitored by ArsDigita Keepalive.
- AOLserver Tcl API. Monitored by ArsDigita Watchdog and Reporte.
plus the hidden layers
- Network, DNS, and power. Monitored externally by ArsDigita Uptime service.
- Email. Monitored by ArsDigita MTA Monitor.
Text and photos Copyright
1998 Philip Greenspun.