(Originally published in March, 1999. This article resides in the Historical Archive section of the ArsDigita System Journal)
The ArsDigita Server Architecture is a way of building and delivering
Web services cheaply and reliably. Since I've written a whole book about building sites, the focus of
this document is on maintaining Web services. However, since you can't
maintain something unless you build it first, we'll start with a little
bit of background.
Web services are usually built with loose specs, to tight deadlines, by
people who aren't the world's best educated or most experienced
programmers. Once online, Web applications are subject to severe
stresses. Operations get aborted due to network failure. So if your
transaction processing house isn't in order, your database will become
corrupt. You might get dozens of simultaneous users in a second. So if
your concurrency control house isn't in order, your database will become
corrupt.
A workable solution?
- leave the hard stuff of concurrency control and transaction
atomicity to a standard relational database management system (RDBMS)
- develop pages in a safe interpreted language
Note that Oracle + Perl DBI scripts running as CGI fulfill these
requirements. However, this isn't really fast enough for our kinds of
sites (e.g., www.scorecard.org;
featured on ABC World News and the recipient of 40 requests per
second, each requiring up to five database queries), so we
add a couple more elements:
- the Web server program is itself the database client
- the Web server program pools connections to the database and makes
them available to threads that need them
- the underlying operating system is Unix rather than NT
Given these requirements, we settled on the following technology
infrastructure for all the stuff we've built at ArsDigita:
- Solaris or HP-UX
- Oracle RDBMS
- AOLserver
- AOLserver Tcl API
Note that the choice of commercial Unices is not intended to disparage
Linux; Linux barely existed back in 1993 when we started building Web
apps and it wasn't able to run Oracle until late 1998.
Aren't we done then? This is proven stuff. Unix has barely changed
since the late 1970s, Oracle since the late 1980s, and AOLserver
(formerly NaviServer) since 1995. We've launched more than 100 public
Web services using this infrastructure.
We are not done. What if a disk drive fills up? Are we notified in
advance? Are we notified when it happens? How long does it take to
restore service?
That's what this document is about.
Making the Unix layer reliable
Unix per se is remarkably reliable, capable of handling more
than 100 million Web transactions between reboots. Unix is also very
complicated and, if it isn't running reliably, you probably need help
from a bona fide wizard. Unless you're behind a corporate firewall, you
have to be concerned with security. You don't want someone breaking in
and, if someone were to, you want to know immediately. This requires
arcane and up-to-the-minute knowledge of new security holes so we rely
on Scott Blomquist of techsquare.com
to tell us what to do.
Assuming a secure Unix machine that starts up and shuts down cleanly,
you need to think about the file systems. So that a hard disk failure
won't interrupt Web services, the ArsDigita Server Architecture mandates
mirrored disk drives using Solstice DiskSuite on Solaris or Mirror/UX on
HP. The Architecture mandates straight mirroring (RAID 1) in order to
retain high performance for the RDBMS. These disk drives should be on
separate SCSI chains.
Mirrored disks aren't all that useful when they are 100% full. Unix
programs love to write logs. Mailers log every action (see below). Web
servers log accesses, errors, and, the way we typically configure
AOLserver, database queries. The problem with these logging
operations is that they fill up disk drives but not so fast that
sysadmins are absolutely forced to write cron jobs to deal with the
problem.
For example, in December 1998, the Web services maintained by ArsDigita
were generating about 250 MB of log files every day. Some of these were
access logs that folks wanted to keep around and these generally get
gzipped. Some were error/query logs that had to be separately rolled
and removed. In any case, if left unattended, the logs will fill up a 9
GB disk drive sufficiently slowly that a human will delude himself into
thinking "I'll get to that when the disk is becoming full." The reality
is that every 6-12 months, the disks will fill up and the server will
stop serving pages reliably. The monitors described later in this
chapter will alert the sysadmins who will log in, rm and gzip a bunch of
files, and restart the Web servers. The outage might only last 15
minutes but it is embarrassing when it happens for the fourth or fifth
time in three years.
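The rolling itself is easy to automate; the failure is in never getting around to it. Here is a minimal sketch of the sort of nightly job involved, in AOLserver Tcl (the log directory, file naming, and 30-day retention are assumptions, not part of any ArsDigita tool):

# at 04:30 every night, compress rolled access logs and discard
# compressed logs that are more than 30 days old
ns_schedule_daily -thread 4 30 roll_old_logs

proc roll_old_logs {} {
    set log_dir "/web/foobar/log"
    foreach f [glob -nocomplain "$log_dir/foobar.log.*"] {
        if { [string match "*.gz" $f] } {
            if { [file mtime $f] < [expr [ns_time] - 30*86400] } {
                file delete $f
            }
        } else {
            catch { exec gzip $f }
        }
    }
}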
The naive approach to this problem is to build a db-backed Web
monitoring service on each Unix machine. Why a Web service? You don't
want to have to log into a machine to check it. The first thing the Web
service has to do is show a nicely formatted and explained display of
free disk space, CPU load, and network status. Tools such as HP
GlancePlus, which require shell access and X Windows, are generally
going to be better for looking at an instantaneous snapshot of the system.
So if we just have a page that shows a summary of what you'd see by
typing "df -k" and running top or GlacePlus, then that's nothing to
write home about.
Let's try to think of something that a Web service can do that a human
running shell tools can't. Fundamentally, what computers are great at
is remembering to do something every night and storing the results in a
structured form. So we want our Unix monitor to keep track of disk
space usage over time. With this information, the monitor can
distinguish between a safe situation (/usr is 95% full) and an unsafe
situation (/mirror8/weblogs/ is 85% full but it was only 50% full
yesterday). Then it can know to highlight something in red or scream at
the sysadmins via email.
Finishing up the naive design, what we need is an AOLserver that will
prepare Web pages for us showing current and historic free disk space,
current and historic CPU load, and current and historic RAM usage. In
order to keep this history, it will rely on an RDBMS running on
the same computer.
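For the disk-space piece, the naive monitor might amount to a single scheduled proc (a sketch only; the table name and the df parsing are assumptions, not the actual Cassandrix code):

# assumed table:
#   create table disk_space_history (
#       sample_date  date,
#       file_system  varchar(200),
#       pct_used     integer
#   );
ns_schedule_daily -thread 23 45 record_disk_space

proc record_disk_space {} {
    set db [ns_db gethandle]
    # Solaris df -k columns: filesystem kbytes used avail capacity mounted-on
    foreach line [lrange [split [exec df -k] "\n"] 1 end] {
        set pct_used   [string trimright [lindex $line 4] "%"]
        set mounted_on [lindex $line 5]
        ns_db dml $db "insert into disk_space_history
                       (sample_date, file_system, pct_used)
                       values (sysdate, '$mounted_on', $pct_used)"
    }
    ns_db releasehandle $db
}

With a few days of rows in a table like this, a report page can compare today's pct_used against yesterday's and flag the file systems that are both fairly full and growing quickly.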
Why do I keep referring to this as "the naive design"? First, you might
not be running an RDBMS on every computer that is important to your Web
operation. You probably have at least one RDBMS somewhere but you might
not want anyone connecting to it over the network. Second, you might
have 10 computers. Or 100. Are you really diligent enough to check
them all out periodically, even if it is as simple as visiting their
respective Unix monitors?
Each particular machine needs to run an AOLserver-based monitor.
However, the monitor needs to be configurable to either keep its history
locally or not. To handle monitors that don't keep local history and to
handle the case of the sysadmin who needs to watch 100 machines, the
monitor can offer up its current statistics via an XML page. A monitor
that is keeping history locally can then be configured to periodically
pull XML pages from the monitors that aren't. Thus, a sysadmin can look
at one page and see a summary of
dozens or even hundreds of Unix systems (critical problems on top,
short summaries of each host below, drill-down detail pages linked from
the summaries).
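A sketch of the two halves (the URL, element names, host list, and 15-minute interval are assumptions, not the actual Cassandrix code):

# on a monitored machine: serve the current statistics as XML
ns_register_proc GET /SYSTEM/stats.xml serve_stats_xml

proc serve_stats_xml {ignore} {
    set xml "<host name=\"[ns_info hostname]\">\n"
    foreach line [lrange [split [exec df -k] "\n"] 1 end] {
        append xml "  <filesystem mount=\"[lindex $line 5]\" pct_used=\"[string trimright [lindex $line 4] %]\"/>\n"
    }
    append xml "</host>\n"
    ns_return 200 text/xml $xml
}

# on the history-keeping monitor: pull each host's XML every 15 minutes
ns_schedule_proc -thread 900 poll_remote_monitors

proc poll_remote_monitors {} {
    foreach host {web1.foobar.com web2.foobar.com} {
        if { [catch { ns_httpget "http://$host/SYSTEM/stats.xml" 30 } page] } {
            ns_log Error "poll_remote_monitors: $host unreachable: $page"
        } else {
            # parse $page and record the numbers in the history tables
        }
    }
}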
We can't think of a good name for this so we call it ArsDigita
Cassandrix ("Cassandra for Unix"). It is
available from http://arsdigita.com/free-tools/cassandrix.
Making the Oracle layer reliable
What you'd naively think you were buying from Oracle is abstraction.
You give Oracle bytes and Oracle figures out where to put them, how to
index them, and how to reclaim their space once you've deleted them. In
fact, this is not how Oracle works at all. You have to tell Oracle in
which Unix files to put data. You have to tell Oracle how big a segment
in each file to give a new table. If you build up a million-row table
to perform a calculation and then delete all of those rows, Oracle will not
recover the space until you drop or truncate the table.
In theory, providing humans with many opportunities to tell Oracle what
to do will yield higher performance. The human is the one who knows how
data are to be accessed and how fast data sets will grow. In practice,
humans are lazy, sloppy, and easily distracted by more interesting
projects.
What happens is that Oracle becomes unable to update tables or insert
new information because a tablespace is full. Each disk drive on which
Oracle is installed might have 17 GB of free space but Oracle won't try
to use that space unless explicitly given permission. Thus the
canonical Oracle installation monitor watches available room in
tablespaces. There are lots of things like this out there. There is
even the meta-level monitor that will query out a bunch of Oracle
databases and show you a summary of many servers (see Chapter 6 of Oracle8
DBA Handbook). Oracle Corporation itself has gradually
addressed this need over the years with various incarnations of their
Enterprise Manager product.
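The core of that canonical monitor is one query against the DBA views. A sketch in AOLserver Tcl (the 90% threshold and the use of ns_log rather than a formatted page are my own simplifications):

set sql "select t.tablespace_name,
                round(100 * (1 - nvl(f.free_bytes, 0) / t.total_bytes)) as pct_used
         from (select tablespace_name, sum(bytes) as total_bytes
               from dba_data_files group by tablespace_name) t,
              (select tablespace_name, sum(bytes) as free_bytes
               from dba_free_space group by tablespace_name) f
         where t.tablespace_name = f.tablespace_name (+)
         order by 2 desc"

set db [ns_db gethandle]
set selection [ns_db select $db $sql]
while { [ns_db getrow $db $selection] } {
    # column 0 is the tablespace name, column 1 the percent used
    set tablespace [ns_set value $selection 0]
    set pct_used   [ns_set value $selection 1]
    if { $pct_used > 90 } {
        ns_log Warning "tablespace $tablespace is ${pct_used}% used"
    }
}
ns_db releasehandle $db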
Personally, I haven't found Enterprise
Manager useful because you need to have a Windows machine and have
SQL*Net enabled on your Oracle server (an expensive-to-manage security
risk). Furthermore, once you get all of that stuff set up, Enterprise
Manager won't actually answer many of the most important questions, e.g., those
having to do with deadlocks or the actual queries being run by
conflicting users. As with everything else in the RDBMS world, help is
just around the corner, at least if you like to read press releases.
Oracle is coming out with a new version of Enterprise Manager. It has
an immensely complicated architecture. You can connect via "thin
clients" (as far as I can tell, this is how Fortune 500 companies spell
"Web browser"). The fact of the matter is that if Oracle or anyone else
really understood the problem, there wouldn't be a need for new
versions. Perhaps it is time for an open-source movement that will let
us add the customizations we need.
What if, instead of installing packaged software and figuring out 6
months from now that it won't monitor what we need to know, we set up an
AOLserver with select privileges on the DBA views? To better manage
security, we run it with a recompiled version of our Oracle driver that
won't send INSERT or UPDATE statements to the database being monitored.
DBAs and customers can connect to the server via HTTPS from any browser
anywhere on the Internet ("thin client"!). We start with the scripts
that we can pull out of the various Oracle tuning, DBA, and SQL-script
books you can find at a bookstore. Then we add the capability to answer
some of the questions that have tortured us personally (see
my Oracle tips page). Then we
make sure the whole thing is cleanly extendable and open-source so that
we and the rest of the world will eventually have something much better
than any commercial monitoring tool.
We can't think of a good name for this monitor either but at this point
in the discussion we have to give it a name. So let's call it
ArsDigita Cassandracle ("Cassandra for Oracle").
Here's a minimal subset of questions that Cassandracle needs to be able
to answer:
- whether there are any deadlocks in the database and who is causing
them
- whether any tablespaces are getting dangerously full
- whether any particular user or query or table is becoming a pig
- who is connected to the database and from where
- whether there are any performance bottlenecks in the database, e.g.,
whether we have enough cache blocks configured
- what PL/SQL procedures and functions are defined (list of objects,
hyperlinked to their definitions pulled from the database)
- what tables are defined (list of objects, hyperlinked to their
definitions pulled from the database)
The system has to assume that people aren't drilling down just for fun.
So the pages should be annotated with knowledge and example SQL for
fixing problems. For example, on a tablespace page, the system should
notice that a datafile isn't marked to autoextend and offer the syntax
to change that. Or notice that maxextents isn't unlimited and offer the
syntax to change that (explaining the implications).
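For instance, the datafile page might be built by something like the following sketch (not the actual Cassandracle code; the red font markup and the NEXT/MAXSIZE values are arbitrary):

set db [ns_db gethandle]
set selection [ns_db select $db "select file_name, tablespace_name, autoextensible
                                 from dba_data_files
                                 order by tablespace_name"]
set html "<ul>\n"
while { [ns_db getrow $db $selection] } {
    set file_name  [ns_set value $selection 0]
    set tablespace [ns_set value $selection 1]
    set autoextend [ns_set value $selection 2]
    append html "<li>$tablespace: $file_name"
    if { $autoextend == "NO" } {
        # annotate the problem and offer the SQL that fixes it
        append html " <font color=red>not set to autoextend</font><br>
<code>alter database datafile '$file_name' autoextend on next 50M maxsize unlimited;</code>"
    }
    append html "\n"
}
append html "</ul>\n"
ns_db releasehandle $db
ns_return 200 text/html $html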
ArsDigita Cassandracle is available from http://arsdigita.com/free-tools/cassandracle.
Making AOLserver reliable
Although AOLserver is extremely reliable, we don't rely on it being
reliable. In fact, we engineer our services so that the Web server
could crash every 10 minutes and users need never be aware. Here are the
elements of a Web service that does not depend on a Web server program
being reliable:
- keep user session state either in cookies on the user's browser or
in the RDBMS on the server; do not keep session state (except perhaps
for cached passwords) in the server
- run the Web server program from /etc/inittab so that if it crashes,
init will immediately restart it. Here's an example line from
/etc/inittab:
nsp:34:respawn:/home/nsadmin/bin/nsd -i -c /home/nsadmin/philg.ini
This tells init, when the system is in run state 3 or 4 (i.e., up and
running), to run the NaviServer daemon (nsd) in interactive mode, with
the config file /home/nsadmin/philg.ini.
- in the event that AOLserver gets stuck or clogged with threads
waiting for a system resource (see the MTA example below), run ArsDigita
Keepalive on the same machine to make sure that the public server is
delivering files or scripts. If it isn't, Keepalive will kill a hung
AOLserver and notify the sysadmins.
Note that this same set of techniques and open-source software would
work with any other Web server program, especially easily with a
threaded Web server that only requires one Unix process.
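As an illustration of the idea (this is not the Keepalive source; the test URL, pid file location, check interval, and alert address are assumptions), the check amounts to fetching a known page with a timeout and taking drastic action if the fetch fails:

# every two minutes, ask the public server for a trivial page
ns_schedule_proc -thread 120 check_public_server

proc check_public_server {} {
    if { [catch { ns_httpget "http://www.foobar.com/SYSTEM/test.text" 30 } result] } {
        ns_log Error "check_public_server: fetch failed: $result"
        # kill the hung server; init (via /etc/inittab) will respawn it
        # (the pid file path is an assumption)
        if { ![catch { exec cat /home/nsadmin/log/nspid.foobar } pid] } {
            catch { exec kill -9 $pid }
        }
        ns_sendmail "sysadmins@foobar.com" "keepalive@foobar.com" \
            "www.foobar.com restarted" \
            "The public server did not answer within 30 seconds and was killed."
    }
}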
Notifications of problems with your Web scripts
After launching a new Web service, it is common practice to check the
error logs every hour or so to make sure that users aren't getting
slammed. It is also good practice to install ArsDigita
Reporte, which will run every night and build you a report that
shows which URLs are generating server errors.
There are a couple of problems with these common practices. First,
users and the publisher might not be too thrilled about server errors
piling up for 24 hours before anyone notices. Second, server errors
that occur infrequently are likely to go unnoticed forever. The site is
launched so the developers aren't regularly going through the error
log. Users aren't used to perfection on their desktops or from their
ISP. So they won't necessarily send complaint email to webmaster. And
even if they do, such emails are likely to get lost amongst the
thousands of untraceable complaints from Win 3.1 and Macintosh users.
The ArsDigita Server Architecture mandates running
the ArsDigita
Watchdog server on whichever computer is running the Web service (or
another machine that has access to the public server's log files).
Watchdog is a separate simple AOLserver that does not rely on the Oracle
database (so that an Oracle failure won't also take down the Watchdog
server).
Watchdog is a collection of AOLserver Tcl scripts that, every 15
minutes, checks the portion of the log file that it hasn't seen
already. If there are errors, these are collected up and emailed to the
relevant people. There are a few interesting features to note
here. First, the monitored Web services need not be AOLserver-based;
Watchdog can look at the error log for any Web server program. Second,
if you're getting spammed with things that look like errors but in fact
aren't serious, you can specify ignore patterns. Third, Watchdog lets
you split up errors by directory and mail notifications to the right
people (e.g., send /bboard errors to joe@yourdomain.com and /news errors
to jane@yourdomain.com).
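The core loop is simple enough to sketch (not the actual Watchdog code; the offset file, log path, ignore patterns, and addresses are assumptions): remember how far into the error log the previous scan got, read only the new bytes, drop lines matching the ignore patterns, and mail whatever is left.

ns_schedule_proc -thread 900 scan_error_log

proc scan_error_log {} {
    set log_file    "/home/nsadmin/log/foobar-error.log"
    set offset_file "/home/nsadmin/watchdog/foobar.offset"
    set ignore_patterns {"*Broken pipe*"}

    # read the byte offset where the previous scan stopped
    set offset 0
    catch {
        set f [open $offset_file r]
        set offset [string trim [read $f]]
        close $f
    }

    # read only the portion of the log we haven't seen already
    set f [open $log_file r]
    seek $f $offset
    set new_text [read $f]
    set new_offset [tell $f]
    close $f

    # keep error lines that don't match an ignore pattern
    set report ""
    foreach line [split $new_text "\n"] {
        if { ![string match "*Error:*" $line] } { continue }
        set ignorable 0
        foreach pattern $ignore_patterns {
            if { [string match $pattern $line] } { set ignorable 1 }
        }
        if { !$ignorable } { append report "$line\n" }
    }

    if { $report != "" } {
        ns_sendmail "errors@foobar.com" "watchdog@foobar.com" \
            "errors on www.foobar.com" $report
    }

    # remember where we stopped for the next scan
    set f [open $offset_file w]
    puts $f $new_offset
    close $f
}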
Watchdog is an open-source software application, available for download
from http://arsdigita.com/free-tools/watchdog.html.
Verifying Network Connection and DNS
You want to know if the Internet connection to your server is down. You
want to know if the DNS servers for your Internet domain are down. You
want to know if the InterNIC, in its monopolistic wisdom, has decided
to shut off your domain at the root servers.
It is impossible to test Internet connectivity from within your server
cluster. It is impossible to test the InterNIC's root servers from
a machine that uses your DNS servers as its name server. Thus, you
really want to monitor end-to-end connectivity from a machine outside of
your network that requests "http://www.yourdomain.com/SYSTEM/test.text"
or something (i.e., the request is for a hostname, not an IP address).
The ArsDigita Server Architecture mandates monitoring from
our own Uptime
service, either the public installation or a private copy.
How to avoid outages of this nature in the first place? Pay your
InterNIC bills in advance. We know countless domain owners, including
some multi-$billion companies, who've been shut off by the InterNIC
because they were a few days late in paying a bill. Sometimes InterNIC
sends you a letter saying "if you don't pay by Date X, we'll shut you
down" but then you find that they've shut you down on Date X-4.
In 1999, Microsoft failed to pay their Network Solutions renewal for a
domain used by the Hotmail service that they acquired. On Christmas Eve
the service became unreachable. Michael Chaney, a Linux programmer,
debugged the problem and paid the $35 so that he could read his email
again. Thus we find the odd situation of a Microsoft-owned service that
runs without any Microsoft software (Hotmail runs on Solaris and FreeBSD
Unix) and with financial backing from the Linux community.
You can avoid service loss due to power outage by buying a huge,
heavy, and expensive uninterruptible power supply. Or you can simply
co-locate your server with a company that runs batteries for everyone
with a backup generator (above.net and Exodus do this).
Most Internet service providers (ISPs) are terribly sloppy. Like any
fast-growth industry, the ISP biz is a magnet for greedy people with a
dearth of technical experience. That's fine for them. They'll all go
public and make $100 million. But it leaves you with an invisible
server and a phone conversation with someone they hired last week.
Moreover, even if your ISP were reliable, they are connected to the
Internet by only one Tier 1 provider, e.g., Sprint. If the Sprint
connection to your town gets cut, your server is unreachable from the
entire Internet.
There is an alternative: co-locate at AboveNet or Exodus.
These guys have peering arrangements with virtually all the Tier 1
providers. So if Sprint has a serious problem, their customers won't be
able to see your server, but the other 99% of the Internet will be able
to get to your pages just fine.
For public Internet services, the ArsDigita Server Architecture mandates
co-location at AboveNet or Exodus.
For Intranet services, the ArsDigita Server Architecture mandates
location as close as possible to a company's communications hub.
E-mail Server: It can take your Web services down
The mail transfer agent (MTA) on a Unix machine is not commonly
considered mission-critical. After all, if email can't be delivered the
attempting server will usually retry in four hours and continue retrying
for four or five days. In fact, a failure in the MTA can take down Web
services. Consider the following scenario:
- your Web server is sending out lots of email alerts to users
- many of those alerts are bouncing back because folks have changed
their email addresses
- you don't have any special scripts in place to handle bounces
programmatically
- bounces are saved on disk for postmaster@yourbox.com to read
- nobody ever checks the postmaster@yourbox.com inbox
- the MTA is installed in its default directory (in the typically
small /var partition)
- /var fills up after a few months or years of this
- the MTA stops accepting mail from the Web server
Doesn't sound so bad, does it? Perhaps users get a few error messages.
In fact, MTA failure can lead to a complete loss of Web service. At greenspun.com, we used to run Netscape
Messaging Server. Periodically, the server would get confused and
attempted connections to Port 25 would hang and/or time out. The
problem here is that we had a program to insert a message into a
bulletin board that worked the following way:
- get a db connection from AOLserver
- check user input
- insert message into database
- query the database to see if there are any registered alerts for new
postings of this nature
- while looping through rows coming back from the database, try to
send email if appropriate
Upon termination, AOLserver would return the database connection to its
pool, just as with any other program that sits behind a dynamic page.
This worked fine as long as sending email was reasonably quick. Even
with 80 or 100 alerts registered, the thread would terminate after 10 or
15 seconds. Suppose that each attempt to send email requires waiting 30
seconds for a wedged Netscape Messaging Server and then timing out.
That means a thread trying to send 100 alerts will take at least 50
minutes to execute. During those 50 minutes, a connection to Oracle
will be tied up and unavailable to other threads. As AOLserver is
typically configured to open a maximum of 8 or 12 connections to a
database, that means the entire pool will be occupied if more than 12
messages are posted to discussion forums in a 50-minute period. Anyone
else who requests a database-generated page from greenspun.com will find
that his or her thread blocks because no database connections are
available. Within just a few seconds, the server might be choked with
100 or more stacked-up threads, all waiting for a database connection.
The bottom line? Users get a "server busy" page.
How to defend against this situation? First, build a robust MTA
installation that doesn't get wedged and, if it does get wedged, make sure
that you find out about it quickly. A wedged MTA would generally result in errors being written
to the error log and therefore you'd expect to get email from the
ArsDigita Watchdog monitor... except that the MTA is wedged so the
sysadmins wouldn't see the email until days later. So any monitoring of
the MTA needs to be done by an external machine whose MTA is presumably
functional.
The external machine needs to connect to the SMTP port (25) of the
monitored server every five minutes. This is the kind of monitoring
that the best ISPs do. However, I don't think it is adequate because it
usually only tests one portion of the average MTA. The part of the MTA
that listens on port 25 and accepts email doesn't have anything to do
with the queue handler that delivers mail. Sadly, it is the queue
handler that usually blows chunks. If you don't notice the problem
quickly, you find that your listener has queued up 500 MB of mail and it
is all waiting to be delivered but then your /var partition is full and
restarting the queue handler won't help... (see beginning of this
section).
What you really need to do is monitor SMTP throughput. You need a
program that connects to the monitored server on port 25 and specifies
some mail to be sent to itself. The monitor includes a script to
receive the sent mail and keep track of how long the round-trip took.
If mail takes more than a minute to receive, something is probably wrong
and a sysadmin should be alerted.
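A sketch of the sending half (the probe addresses, the table, and the delivery mechanism for the received mail are assumptions): every fifteen minutes, push a message through the monitored MTA with the send time recorded, and have the script that receives the message fill in the arrival time so that slow or missing deliveries can raise an alert.

ns_schedule_proc -thread 900 send_smtp_probe

proc send_smtp_probe {} {
    set now [ns_time]
    set db [ns_db gethandle]
    # assumed table: smtp_probes(sent_time integer, received_time integer)
    ns_db dml $db "insert into smtp_probes (sent_time) values ($now)"
    ns_db releasehandle $db
    # assuming this AOLserver's configured mail host is the machine under
    # test, ns_sendmail hands the probe to the MTA we want to measure
    ns_sendmail "probe@monitor.foobar.com" "probe@www.foobar.com" \
        "smtp-probe $now" "round-trip probe"
}
# the script that receives mail for probe@monitor.foobar.com updates
# received_time for the matching sent_time; a report page then alerts if
# received_time - sent_time exceeds 60 seconds or if probes stop arriving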
We are going to build a database-backed Web service to do this
monitoring, mostly by adapting email handling scripts we needed for the Action
Network. In a fit of imagination, we've called it ArsDigita MTA
Monitor. It is available from http://arsdigita.com/free-tools/mmon.html.
Even with lots of fancy monitors, it is unwise and unnecessary to rely
on any MTA working perfectly. Moreover, if an MTA is stuck, it is hard
to know how to unstick it quickly without simply dropping everyone's
accumulated mail. This might not be acceptable.
My preferred solution is to reprogram applications so that threads
release database handles before trying to send email. In the case of
the insert-into-bboard program above, the thread uses the database
connection to accumulate the email messages to send in an ns_set data
structure. After releasing the database connection back to AOLserver,
the thread proceeds to send the accumulated messages.
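In AOLserver Tcl, the restructured code looks roughly like this (a sketch with hypothetical table and column names, not the actual bulletin board module):

set db [ns_db gethandle]
# ... check user input, insert the message, then collect the alert
# recipients while we still hold the handle ($topic and $message_body
# come from the form data checked above) ...
set alert_emails [ns_set create alert_emails]
set selection [ns_db select $db "select email from bboard_email_alerts
                                 where topic = '$topic'"]
while { [ns_db getrow $db $selection] } {
    ns_set put $alert_emails email [ns_set value $selection 0]
}
# give the connection back to the pool *before* touching the MTA
ns_db releasehandle $db

# even if the MTA is wedged, no Oracle connection is tied up now
for {set i 0} {$i < [ns_set size $alert_emails]} {incr i} {
    catch { ns_sendmail [ns_set value $alert_emails $i] "bboard@foobar.com" \
                "New posting in $topic" $message_body }
}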
One minor note: you don't want someone using your server to relay
spam (i.e., connect to your machine and send email with a bogus return
address to half the Internet). Configuring your machine to deny relay
is good Internet citizenship but it can also help keep your service up
and running. If your MTA is working to send out 300,000 spam messages
and process the 100,000 bounces that come back, it will have a hard time
delivering email alerts that are part of your Web service. It might
get wedged and put you back into some of the horror scenarios discussed
at the beginning of this section.
For email, the ArsDigita Server Architecture mandates
- anti-relay configuration of MTA
- bounce handling via scripts
- identification and monitoring of free space in the MTA's partition
(via ArsDigita Cassandrix)
- monitoring of SMTP port response every five minutes
- monitoring of SMTP throughput (connection in, mail received back on
monitoring server) every fifteen minutes
- Web scripts that release database connections back to the pool
before trying to send email
Load Testing
Generally Web traffic builds gradually and you'll have months of warning
before running out of server capacity. However, before going public
with something like real-time race score results for the Boston
Marathon, you probably want to do some load testing.
We have built a tool called ArsDigita Traffic Jamme that is
really an AOLserver repeatedly calling ns_httpget. It is a
quick and dirty little tool that can generate about 10 requests per
second from each load machine.
The software is available from
http://www.arsdigita.com/free-tools/tj.html.
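The heart of such a tool fits in a few lines (a sketch of the approach, not the Traffic Jamme source; the URL list and per-second request count are assumptions):

# roughly 10 requests per second from this load machine; if the target
# slows down, the scheduled runs pile up, which is part of the test
ns_schedule_proc -thread 1 hammer_once_a_second

proc hammer_once_a_second {} {
    set urls {
        "http://test.foobar.com/index.tcl"
        "http://test.foobar.com/bboard/"
    }
    for {set i 0} {$i < 10} {incr i} {
        set url [lindex $urls [expr $i % [llength $urls]]]
        if { [catch { ns_httpget $url 30 } result] } {
            ns_log Error "load test: $url failed: $result"
        }
    }
}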
If you are ordering a computer and can't figure out how big a machine to
get, our practical experience is that you can handle about 10 db-backed
requests per second with each 400 MHz SPARC CPU (using AOLserver
querying into indexed Oracle8 tables). Our benchmarks show that a 160
MHz HP-PA RISC CPU is just as fast (!). If you're serving static files
with a threaded server program like AOLserver, you can saturate a 10
Mbit link with a ridiculously tiny computer. So don't worry much about
the extra server overhead from delivering photos, illustrations, and
graphics.
What if your one big computer fails the load test?
Some of the folks using this architecture have services that get close
to 200 million hits per day. That's 40 million page loads, which is
about 1000 page loads per second during peak afternoon hours. By the
standard above, it would seem that in mid-1999 you could handle this
with 40 of the fastest HP PA CPUs or 100 SPARC CPUs ... Oops! The
biggest Sun E10000 can only hold 64 CPUs.
Another way to run out of juice is if you're running a public Web
collaboration service with private data. In this case, all the pages
will be SSL-encrypted. This puts a tremendous CPU load on the Web
server, especially if your service revolves around images or, even
worse, PDF files with embedded print-resolution images.
The solution? Buy the big computer, but only use it to run Oracle. Buy
a rack of small Unix machines and run AOLserver, including the SSL
module, on these. Suppose that you end up with 21 computers total.
Haven't you violated the fundamental tenets of the philosophy expounded
here? What kind of idiot would build a service that depends on 21
computers all being up and running 24x7? Well, let's not go into that
right now... but anyway, that's not what the ArsDigita Server
Architecture proposes.
Generally we rely on only one source of server-side persistence: Oracle.
As noted in the preceding section, we "keep user session state either in
cookies on the user's browser or in the RDBMS on the server". Sometimes
data are cached in AOLserver's virtual memory but the service doesn't
fail if a request misses the cache, it is only slowed down by enough
time to do an Oracle query. Given this fact, it doesn't really matter
if Joe User talks to Server 3 on his first request and Server 17 on his
second. So we can use load-balancing network hardware to give all the
machines in a rack of Web servers the same IP address. If one of the
servers dies, all future requests will be routed to the other servers.
Joe User doesn't depend on 21 computers all being up. Joe User depends
on the database server being up, at least 1 out of 20 of the Web servers
being up, and the fancy network hardware being up. We've solved our
scalability problem without dramatically reducing site reliability.
(Sources of fancy network hardware: Alteon, Foundry Networks, and Cisco
(Local Director).)
What if you're breaking some of the rules and relying on AOLserver to
maintain session state? You have to make sure that if Joe User starts
with Server 3 every subsequent request from Joe will also go to Server
3. You don't need fancy network hardware anymore. Give each server a
unique IP address. Give each user a different story about what the IP
address corresponding to www.yourdomain.com is. This is called
"round-robin DNS". A disadvantage of this approach is that if a server
dies, the users who've been told to visit that IP address will be denied
service. You'll have to be quick on your toes to give that IP address
to another machine. Also, you won't get the best possible load
balancing among your Web servers. It is possible that the 10,000 people
whom your round-robin DNS server directed to Server 3 are much more
Web-interested than the 10,000 people whom your round-robin DNS server
directed to Server 17. These aren't strong arguments against
round-robin DNS, though. You need some kind of DNS server and DNS is
inherently redundant. You've eliminated the fancy network hardware (a
potential source of failure). You'll have pretty good load balancing.
You'll have near state-of-the-art reliability.
[An astute reader will note that I didn't address the issue of what
happens when the computer running Oracle runs out of power. This is
partly because modern SMP Unix boxes are tremendously powerful. The
largest individual HP and Sun machines are probably big enough for any
current Web service. The other reason that I didn't address the issue
of yoking many computers together to run Oracle is that it is
tremendously difficult, complex, and perilous. The key required elements
are software from Oracle (Parallel Server), a disk subsystem that can
be addressed by multiple computers (typically an EMC disk array), and a
few $million to pay for it all.]
How to administer a configuration like this? See
"Web Server Cluster
Management".
Making it work for five years
Design, engineering, and programming are trivial. Maintenance and
support are hard.
This might sound like an odd perspective coming from ArsDigita, a company whose reputation is
based on design, engineering, and programming. But think about your own
experience. Have you ever had a flash of inspiration? A moment where
you had a great and clever idea that made a project much more elegant?
Congratulations.
Have you ever gone for five straight years without making any mistakes?
Neither have we. Here's a letter that I sent a customer after we were
all patting ourselves on the back for building an ecommerce system that
successfully processed its first few thousand orders:
We are now running a fairly large-scale software and hardware system.
The components include
1) software at factory
2) factory bridge
3) Web server software
4) CyberCash interface
5) data warehouse software
6) disaster recovery software and hardware
A lot of valuable knowledge is encapsulated inside of various folks'
heads. This wouldn't be such a bad problem if
a) we all worked in the same building
b) we only needed the knowledge 9 to 5 when everyone was at work
But in practice we have people separated in time and space and therefore
knowledge separated in time and space.
What we therefore need is an on-line community! (Who would have guessed
that I would suggest this?)
Most Web services are necessarily maintained by people scattered in
space and time. There is no practical way to assemble everyone with the
required knowledge in one place and then keep them there 24x7 for five
years.
Note that corporate IT departments face the same sorts of problems.
They need to keep knowledge and passwords where more than one person can
get to them. They need to make sure that backups and testing are
happening without merely relying on an individual's word and memory.
They need to make sure that bugs get identified, tracked, and fixed.
The corporate IT folks have it easy, though, in many ways. Typically
their operation need not run 24x7. An insurance company probably has
overnight analysis jobs but a database that is down at 2:00 am won't
affect operations. By contrast, at an ecommerce site, downtime at 2:00
am Eastern time means that customers in Europe and Asia won't be able to
order products.
Another way in which corporate IT folks have it easy is that there is a
logical physical place for written system logs and password files. The
mainframe is probably located in a building where most of the IT workers
have offices. There are elaborate controls on who gets physical access
to the mainframe facility and therefore logs may be safely stored in a
place accessible to all IT workers.
What do we need our virtual community to do? First, it has to run on a
separate cluster and network from the online service. If www.foobar.com
is unreachable, any staff member ought to be able to visit the
collaboration server and at least get the phone number for the sysadmins
and ISP responsible for the site. Thus the only information that a
distraught staffer needs to remember is his or her email address, a
personal password, and the hostname of the collaboration server. We can
standardize on a convention of staff as the hostname, e.g.,
http://staff.foobar.com.
Second, the community needs to distinguish among different levels of
users. You want a brand-new staff person involved with the service to
be able to report a bug ("open a ticket"). But you don't want this
person to be able to get information that would appropriately be given
only to those with the root password on the online server.
Third, the community needs to make the roles of different members
explicit. Anyone should be able to ask "Who is responsible for
backups?" or "Who are the Oracle dbas?"
Fourth, the community needs to keep track of what routine tasks have
been done, by whom, and when. If tasks are becoming overdue, these need
to be highlighted and brought to everyone's attention. For example, the
community should have visible policies about how often backup tapes are
transferred off-site, about how often backup tapes are verified, and
about how often Oracle dumps are restored for testing. This is the
function closest to the "virtual logbook" idea. It is essential that
this be high quality software that is easy for people to use. Consider
the corporate IT department where everyone can see the tapes, see people
changing them, see people verifying some of the tapes, see people taking
some of the tapes off-site. If some of these processes were being
ignored, the staff would notice. However, with a Web service, the
machine is probably at a co-location service such as AboveNet or Exodus.
Most staffs will probably never even lay eyes on the server or the
backup tapes. So a miscommunication among staffers or between staff and
the ISP could lead to backup tapes never getting changed or verified.
Fifth, the community site needs to keep track of what important admin
tasks are going on, who is doing them, and why. For example, suppose
that some file systems have become corrupt and Joe Admin is restoring
them from tape. Joe has commented out the normal nightly backup script
from root's crontab because otherwise there is a danger that the cron
job might write bad new data over the important backup tape (remember
that Joe will not be physically present at the server; he will have to
rely on the ISP to change tapes, set the write-protect switch on the
tape, etc.). If Jane Admin does not know this, she might note the
absence of the backup script in root's crontab with horror and rush to
put it back in. Remember that Jane and Joe may each be working from
homes in San Francisco and Boston, respectively. There needs to be a
prominent place at staff.foobar.com where Jane will be alerted to Joe's
activity.
Sixth, the community software needs to be able to track bug reports and
new feature ideas. In order to do a reasonable job on these, the
software needs to have a strong model of the software release cycle
behind the site. For example, a programmer ought to be able to say, in
a structured manner, "fixed in Release 3.2." Separately the community
database keeps track of when Release 3.2 is due to go live. Feature
request and bug tracking modules tie into the database's user groups and
roles tables. If a bug ticket stays open too long, email alerts are
automatically sent to managers.
Seventh, the community software needs to at least point to the major
documentation of the on-line service and also serve as a focal point for
teaching new customer service employees, programmers, db admins, and
sysadmins. A question about the service ("how do I give a
customer an arbitrary $5 refund?") or the software ("what cron job is
responsible for the nightly Oracle exports?") should be archived with its
answer.
How do we accomplish all this? With the ArsDigita
Community System! Plus a few extra modules that are adapted to
ticket tracking and have special columns for "backup tapes verified
date" and so forth.
Taking it down for five hours
Suppose that you're upgrading from Oracle 8.1.5 to Oracle 8.1.6. You
don't want your Web server trying to connect to Oracle during this
upgrade. Nor if the database is unavailable do you want users to be
confronted with server errors. You'd like them to get a nice "we're
upgrading some software; please come back after 2:00 am eastern time"
message.
The naive approach:
- shut off main server
- bring up temporary server rooted in a Unix directory empty save for
an index page that says "come back later"
- shut off temporary server
- bring up main server
This works great for people who come to http://www.foobar.com but not
for those who've bookmarked
http://www.foobar.com/yow/some-internal-page.html. People following
bookmarks or links will get a "file not found" message. With AOLserver
2.3, you could configure the file-not-found message to say "come back
later" but that really isn't elegant.
A better way is to prepare by keeping a server rooted at
/web/comebacklater/www/ with its own /home/nsadmin/comebacklater.ini
file, ready to go at all times. This server is configured as follows:
- same hostname, IP address, and port as the production server
- Private and Shared Tcl libraries both pointing to
/web/comebacklater/tcl/ (this avoids any ns_register_proc commands that
might be invoked by the shared library, e.g., those that feed *.tcl URLs
to the Tcl interpreter), e.g.,
[ns/server/comebacklater/tcl]
Library=/web/comebacklater/tcl
SharedLibrary=/web/comebacklater/tcl
- one file in the Tcl library, /web/comebacklater/tcl/whole-site.tcl
containing the following code:
ns_register_proc POST / comeback
ns_register_proc GET / comeback

# return the "come back later" page no matter what URL was requested
proc comeback {ignore} {
    ns_returnfile 200 text/html "[ns_info pageroot]/index.html"
}
Given this configured comebacklater server, and a commented-out entry in
/etc/inittab to invoke it, here's how to gracefully take a site down for
maintenance:
- visit /web/comebacklater/www/index.html and edit the file until it
contains the message that you'd like users to see
- go into /etc/inittab and comment out the line that invokes the
production server
- init q or kill -HUP 1 to instruct init to reread /etc/inittab
- kill the production server (ps -ef | grep 'foobar.ini' for the
production server's PID, then kill PID)
- grep for the production server to make sure that init hasn't
restarted it for some reason
- go into /etc/inittab and uncomment the line that invokes the
comebacklater.ini server
- init q or kill -HUP 1 to instruct init to reread /etc/inittab
- grep for the comebacklater.ini server to make sure that init has
started it; also visit /home/nsadmin/log/comebacklater-error.log to make
sure that the server started without problems
- verify from a Web browser that the comebacklater server is operating
properly
- **** do your database maintenance (or whatever) ****
- go into /etc/inittab and comment out the comebacklater.ini line
- init q or kill -HUP 1 to instruct init to reread /etc/inittab
- kill the comebacklater.ini server
- go into /etc/inittab and uncomment the line that invokes the
production (foobar.ini) server
- init q or kill -HUP 1 to instruct init to reread /etc/inittab
- verify that the server started up without problems, from
/home/nsadmin/log/foobar-error.log
- verify from a Web browser that the production server is operating
properly
Summary
Here's a review of the main elements of the ArsDigita Server
Architecture:
- monitor of free disk space and disk space growth
- monitor of Oracle tablespace freespace and freespace shrinkage
- use of ArsDigita Keepalive to monitor live Web servers and restart
them if necessary
- frequent (every 15 minutes) automated checking of Web server error
logs and email notification to appropriate sysadmins and programmers
- daily log analysis via ArsDigita Reporte to find pages that
consistently produce errors or 404 Not Found results
- frequent automated checking of network connectivity and DNS
resolution of your hostname from outside your network (via ArsDigita
Uptime); prepayment of InterNIC bills; co-location at AboveNet or Exodus.
- careful attention to the mail transfer agent configuration,
automated bounced email handling, special monitoring of free disk space,
monitoring of SMTP response and throughput, and programming in such a
way that a hung MTA won't bring down the entire service
- staff collaboration server on a separate cluster where sysadmins get
together to log activities, where problem tickets are opened, discussed,
and closed, where contact and role information is available in emergencies
Another way to look at this is in terms of the infrastructure layers
that we introduced at the beginning of this document:
- Solaris or HP-UX. Disks mirrored with vendor tools. Monitored by
ArsDigita Cassandrix.
- Oracle RDBMS. Monitored by ArsDigita Cassandracle.
- AOLserver. Run from /etc/inittab. Monitored by ArsDigita Keepalive.
- AOLserver Tcl API. Monitored by ArsDigita Watchdog and Reporte.
plus the hidden layers
- Network, DNS, and power. Monitored externally by ArsDigita Uptime service.
- Email. Monitored by ArsDigita MTA Monitor.
Text and photos Copyright
1998 Philip Greenspun.