Using CVS for Web Development
by Philip Greenspun (philg@mit.edu)
ArsDigita Systems Journal
If you have a very clear publishing objective, specs that never change,
and one very smart developer, you don't need version control. If you
have evolving objectives, changing specifications, and multiple
contributors, you need version control.
The Solution
- three Web servers (can be on one physical computer)
- two Oracle users/tablespaces (can be in one Oracle instance)
- one Concurrent Versions System (CVS) root
- two people trained to understand CVS
Let's go through these item by item.
Item 1: Three Web Servers
Suppose that your overall objective is to serve a Web service accessible
at "foobar.com". You need a production server, rooted at /web/foobar
(Server 1). You don't want your programmers making changes on
the live production site. That's sort of the whole point of this
document. So you need a development server, rooted at /web/foobar-dev/
(Server 2). You might think that this is enough. When everyone
is happy with the dev server, have a code freeze, test a bit, then copy
the dev code over to the production directory and restart.
What's wrong with the two-server plan? Nothing if you are running
photo.net circa 1997. The development team consisted of me and Jin.
The testing team... me and Jin! Note that there was no possibility of
simultaneous development and testing. ArsDigita.com customers, however, usually have
enough budget to pay for four or five programmers plus 20 or 30 internal
staffers who may be updating content, testing changes, and sometimes
contributing code. For a complex site, the publisher may wish to spend
a week testing before launching a revision. It isn't acceptable to idle
authors and developers while a handful of testers bangs away at the
development server. The solution? A staging server, rooted at
/web/foobar-staging/ (Server 3).
Here's how the three are used:
- developers work continuously in /web/foobar-dev/
- when the publisher is mostly happy with the development site, a
named version is created and installed at /web/foobar-staging
- the testers bang away at the /web/foobar-staging server
- when the testers and publishers sign off on the staging server's
performance, the site is released to /web/foobar/ (production)
- any fixes made to the staging server are merged back into the
development server
Item 2: Two Oracle Users/Tablespaces
Suppose that you have a working production site. You could connect your
/web/foobar-dev/ to the production Oracle user. After all, Oracle's
raison d'être is concurrency control. It will be happy to run
eight simultaneous connections to your production site plus two or three
to the development server. The fly in this ointment is that one of your
developers might get a little sloppy and write a program that sends
"drop table users" rather than
"drop table users_experimental_extra_table" to the database.
So it would seem that we'll need at least one new Oracle playground.
Here are the steps:
- create a new Oracle user and tablespace, named "foobardev" (assuming
the production user is "foobar")
- import a recent Oracle export.dmp file to populate your tablespace
with what was on the production site (if you're following the tenets of
the ArsDigita Server Architecture you'll always have one from the
previous night anyway). Cry with pain as you discover that Oracle
imports don't work with LOB columns unless you're importing into an
installation that has a tablespace with the same name as the one from
which the tables were exported.
- every time you alter a table, add a table, or populate a new table,
record the operation in /web/foobar-dev/www/doc/sql/patches.sql
- when you're ready to move from staging to production, hastily
apply all the data model modifications from patches.sql to the
production Oracle user
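The patch log only works if recording a change is less effort than skipping it. A minimal sketch of a helper that appends each statement with a dated comment follows; the function name record_patch and the PATCH_FILE variable are made up for illustration, and only the default path comes from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: append one data model change to the patch log.
# PATCH_FILE defaults to the path named in the text; override it to experiment.
PATCH_FILE=${PATCH_FILE:-/web/foobar-dev/www/doc/sql/patches.sql}

record_patch() {
    # $1 = one SQL statement, without the trailing semicolon
    {
        # SQL comment line saying when the change was made and by whom
        echo "-- $(date +%Y-%m-%d) by ${USER:-unknown}"
        # printf '%s' keeps backslashes and percent signs in the SQL intact
        printf '%s;\n\n' "$1"
    } >> "$PATCH_FILE"
}
```

A call such as record_patch "alter table users add (nickname varchar(100))" (the column is invented) would leave a dated, replayable entry at the end of the file, in the order the changes were made.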
Shouldn't we have three Oracle users? One for dev, one for staging, one
for production? No. It usually isn't worth it. Adding a column to a
relational database table seldom breaks queries. Until Oracle 8.1.5,
you weren't able to drop a column. And anyway the radical data model
changes tend to take place when a site has yet to be launched.
The bottom line is that it takes work to keep three Oracle users'
objects in sync. It is half as much work to sync two and almost as
useful. How to deploy these two Oracle users? Park one behind the
production server. Use the other one behind the dev and staging
servers.
Item 3: One Concurrent Versions System (CVS) Root
The Concurrent Versions System (CVS) is a powerful file system-based tool that
can do the following things:
- remember what all the previous checked-in versions of a file
contained, using its repository
- show you the difference between what's in your tree and what's in
the repository
- help you merge changes made simultaneously by multiple authors who
might have been unaware of each other's work
- group a snapshot of currently checked-in versions of files as
"Release 2.1" or "JuneIssue"
CVS is free and open-source.
CVS does all of this via its repository or "CVS root". This is a
directory, typically /usr/local/cvsroot/. Most Unix machines don't have
enough space in the /usr partition to store all Web content. Remember
that the CVS root will be at least as large as all of the files under
source control. Thus we will use /cvsweb as our CVS root and, if need
be, migrate it to a separate disk subsystem.
Create a project from your development Web sources (from
/web/foobar-dev/) so that they will end up at /cvsweb/foobar/.
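A sketch of that import, assuming a stock cvs binary on the PATH. The function name import_site and the vendor and release tags ("arsdigita", "start") are placeholders; cvs import requires both tags but nothing else in this article depends on them.

```shell
#!/bin/sh
# Sketch: create the repository and import the dev tree as one module.
# Set CVS=echo to print the cvs arguments instead of running cvs.
CVS=${CVS:-cvs}

import_site() {
    cvsroot=$1   # e.g. /cvsweb
    src=$2       # e.g. /web/foobar-dev
    module=$3    # e.g. foobar, which ends up at /cvsweb/foobar/

    # "cvs init" creates the administrative files and does not overwrite
    # an existing repository, so running it twice is harmless.
    "$CVS" -d "$cvsroot" init || return 1

    # Import everything below $src. "arsdigita" and "start" are the
    # placeholder vendor and release tags that "cvs import" insists on.
    (cd "$src" && "$CVS" -d "$cvsroot" import -m "initial import of $src" \
        "$module" arsdigita start)
}
```

Calling import_site /cvsweb /web/foobar-dev foobar would populate /cvsweb/foobar/. Two caveats: an import does not turn the source directory into a working copy, so the usual next step is a cvs checkout of the module, and binary files such as images normally need to be stored with the -kb option so that CVS does not try keyword expansion on them.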
Item 4: Two Trained CVS Users
Don't plan to teach all of your contributors the arcana of CVS. The
ones who use GNU Emacs will need to learn to type C-x C-q and C-c C-c to
contribute change comments. But the contributors who use primitive
tools (FTP, HTTP PUT, vi) can remain blissfully unaware of the fact that
CVS is in use.
Who is really using CVS then? A cron job. Every day just before
midnight the cron job should check in all changes from the dev server to
the main branch, with the change comment 'nightly check-in YYYY-MM-DD'.
The cron job should notify the Release Master if any files that are in
the repository have been deleted so that he or she can decide whether
the removal was a mistake or if typing "cvs remove" is warranted
(the files don't really go away; they go into an "attic").
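Here is a rough sketch of what such a cron job could look like. The function names and the crontab path are hypothetical. The deleted-file check reads the CVS/Entries bookkeeping files that CVS keeps in every working directory, which is one way to do it and not necessarily what ArsDigita ran.

```shell
#!/bin/sh
# Sketch of the nightly job. A crontab entry might look like this
# (the script path is hypothetical):
#   55 23 * * * /usr/local/bin/nightly-checkin /web/foobar-dev
# Set CVS=echo to print the cvs arguments instead of running cvs.
CVS=${CVS:-cvs}

# Print files that CVS/Entries still lists but that are gone from disk.
# File lines look like /name/revision/timestamp/options/tag. A revision
# starting with "-" means "cvs remove" was already run for that file.
lost_files() {
    find "$1" -type f -path '*/CVS/Entries' | while IFS= read -r entries; do
        dir=${entries%/CVS/Entries}
        awk -F/ '/^\// && $3 !~ /^-/ { print $2 }' "$entries" |
        while IFS= read -r name; do
            [ -e "$dir/$name" ] || printf '%s\n' "$dir/$name"
        done
    done
}

# Commit every change to files CVS already knows about, using the
# dated comment described in the text.
nightly_checkin() {
    (cd "$1" && "$CVS" -q commit -m "nightly check-in $(date +%Y-%m-%d)")
}
```

Cron mails whatever a job prints to the owner of the crontab, so if the crontab belongs to the Release Master, printing the output of lost_files may be all the notification that is needed. Note that cvs commit only covers files CVS already tracks; brand-new files still need a cvs add from someone before the nightly job will pick them up.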
One person is designated the Release Master. Normally this person does
nothing. When the publisher is happy with the behavior of the
development server, the Release Master creates a CVS branch named
"199909Launch" or whatever. The Release Master updates the staging
server from CVS with this branch. Development proceeds with checkins to
the main CVS branch. These won't affect updates from the 199909Launch
branch.
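The Release Master's two commands can be sketched as below, assuming that both the dev and staging directories are already CVS working copies of the same module. The function names are invented. One practical wrinkle: CVS rejects tag names that do not start with a letter, so a name like "199909Launch" would have to become something like "Launch199909" (that substitution is an assumption here, not something the article specifies).

```shell
#!/bin/sh
# Set CVS=echo to print the cvs arguments instead of running cvs.
CVS=${CVS:-cvs}

# Create a branch from the revisions currently checked out in the dev tree.
make_branch() {
    dev=$1
    branch=$2
    case $branch in
        [A-Za-z]*) ;;   # CVS tag names must begin with a letter
        *) echo "CVS tag names must start with a letter: $branch" >&2
           return 1 ;;
    esac
    (cd "$dev" && "$CVS" -q tag -b "$branch")
}

# Point a server's working copy at the branch. -d picks up new
# directories, -P prunes empty ones. The tag becomes "sticky", so later
# plain "cvs update" runs in that directory keep following the branch.
follow_branch() {
    (cd "$1" && "$CVS" -q update -d -P -r "$2")
}
```

A typical sequence would be make_branch /web/foobar-dev Launch199909 followed by follow_branch /web/foobar-staging Launch199909. The same follow_branch call, pointed at /web/foobar, would serve for the production update at go-live.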
Once the staging server has been thoroughly tested, the Release Master
checks in any changes that have been made. The check-in happens twice,
once to the 199909Launch branch (there won't be any conflicts since
nobody has been touching this) and once to the main branch (conflicts
may need to be resolved).
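One way to get the two check-ins is to commit on the staging server (whose working copy follows the branch) and then let CVS merge the branch into the dev tree. This is a sketch under that assumption; the function names are hypothetical and the branch name is whatever the Release Master chose.

```shell
#!/bin/sh
# Set CVS=echo to print the cvs arguments instead of running cvs.
CVS=${CVS:-cvs}

# Check-in 1: commit fixes made on the staging server. Its working copy
# carries the sticky branch tag, so the commit lands on the branch.
commit_staging_fixes() {
    (cd "$1" && "$CVS" -q commit -m "fixes from testing on $2")
}

# Check-in 2, first half: merge everything changed on the branch into
# the dev tree (main branch). Conflicting hunks are left in the files
# between <<<<<<< and >>>>>>> markers for a human to resolve.
merge_branch_into_dev() {
    (cd "$1" && "$CVS" -q update -j "$2")
}
```

After merge_branch_into_dev /web/foobar-dev Launch199909 the merged files are only modified locally; once any conflicts have been resolved by hand, an ordinary cvs commit in the dev tree completes the second check-in.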
When the publisher decides to go live, the Release Master takes the
following steps:
- manually update the /parameters/foobar.ini file as necessary
- update production server from the CVS branch 199909Launch
- apply any data model changes (quickly!) from /doc/sql/patches.sql.
If there are significant data model changes, do this in the middle of
the night and consider bringing up a "comebacklater" server for a few
minutes!
If the Release Master is doing all of this hard work, why do we need to
train anyone else in CVS? A Web service is 24x7 but one person can't
work 24x7. So we need a Release Apprentice for each Web service who
knows everything that there is to know about this system.
Exactly which directories do we control?
A programmer's intuitions about which directories to control will
generally be 180 degrees off. For example, a programmer might think
that it isn't worth controlling graphics files. After all, CVS can't
really do much with these besides compare them byte by byte and tag them
with dates.
The ArsDigita Community System generally contains the following under
/web/foobar:
- /www -- the main Web server root; we must control this
- /tcl -- private Tcl library; we must control this
- /parameters -- server personality; we'd like to control this but we
can't unless we're careful to make sure that each server has a uniquely
named auxconfig .ini file, e.g., foobar.ini, foobar-dev.ini, and
foobar-staging.ini. Remember that, if nothing else, the server name in
each section of this .ini must be different (e.g., "foobar" and
"foobar-dev"). So it would be disastrous to update the production
server's aux .ini file with the dev server's aux .ini file.
- /bin -- email handling scripts forked by the mailer (generally
qmail); no real reason to control this unless you're running dev and
production on separate computers
- /templates -- for sites with fancy graphics... the fancy graphics;
we must control this
- (most servers) misc directories containing files uploaded by users,
not kept under the Web server root due to security concerns; can't
control this or we risk rolling back months of user uploads!
The bottom line is that it would be nice to just say "all of
/web/foobar-dev" but we can't do this unless we're careful with the
auxconfigdir (/parameters) and make sure to keep user-uploaded files out
of the /web/foobar/ directory.
Do you need a farm of big fancy servers to implement this?
How big and how many computers do you need to adopt the procedures
described in this document? Three Web servers, two Oracle users, the
CVS package, ... Sounds complicated. Actually you can run it all on a
$2000 Linux box.
If you're worried about your developers being sloppy and editing files
in /web/foobar/ when they thought they were in /web/foobar-dev/, remember
that you can always use "cvs update" to revert the production site to
the most recent approved version.
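A caveat worth knowing: a plain cvs update merges repository changes into locally edited files rather than discarding the edits. A sketch of a stricter revert follows, with invented function names, assuming CVS 1.11 or later (which added the -C option) and a production directory that is a working copy following the release branch.

```shell
#!/bin/sh
# Set CVS=echo to print the cvs arguments instead of running cvs.
CVS=${CVS:-cvs}

# Show what has drifted without changing anything (-n = dry run).
# "M" marks a locally modified file, "C" a file with conflicts.
show_drift() {
    (cd "$1" && "$CVS" -q -n update 2>/dev/null | grep '^[MC] ')
}

# Overwrite locally modified files with clean repository copies.
# CVS saves each edited file as .#name.revision before replacing it.
revert_drift() {
    (cd "$1" && "$CVS" -q update -C -d -P)
}
```

Running show_drift /web/foobar first gives the Release Master a chance to rescue an accidental edit that was actually wanted; revert_drift /web/foobar then restores the approved version. Because the branch tag is sticky, the update stays on the release branch and does not pull in main-branch development.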
Suppose that you've ample money for server hardware, co-location fees,
and sysadmin resources. You probably want to split the production
machine out and only give the Release Master and Release Apprentice
access to that box. Let the developers and staging/testing folks fight
it out on a development server.
Why not one development area per developer?
Classically, CVS is used by C developers and each C programmer works
from his or her own directory. This makes sense because there is no
persistence in the C world. You compile your code, run a binary that
builds data structures in RAM and when the program terminates it doesn't
leave anything behind (except maybe a core file). Checking out a CVS
tree and working on it isn't a big deal.
Compare this to the world of db-backed Web servers. If you want to
check out a copy of the tree and play with it, you have to create an
Oracle user and tablespace, import a recent Oracle export.dmp file to
populate your tablespace with what was on the production site, find a
free IP address or port and set up a Web server, and then keep your
Oracle table definitions in sync with any alterations other developers
may be making.
In the C world, developers live to satisfy themselves. More than
likely, not another soul on the planet will ever run the code that they
are authoring. So it is fine for them to work alone. In the Web world,
developers always work with the publisher and users. Those
collaborators will need to be alerted to this new server so that they
can offer criticism and advice. They might need special passwords or
firewall access since most publishers don't like to let the public see
their unfinished development efforts.
In the C world, you've got the luxury of one or two years between
product releases. All the work is done by people with at least four
years of training. In the Web world, a significant new release may need
to be produced in four weeks. Much of the work may be done by people
with no formal training of any kind, e.g., designers and content authors
editing templates or static .html pages. Given the chronic shortage of
personnel in this industry, do you want to limit yourself to being able
to hire only those who've been through a CVS training course? To those
who are formally minded enough to read the CVS man pages? Remember that
most of the contributors on your site will not be programmers.
The bottom line? It is just too much work to set up each contributor
with his or her own little server.
Good Things About This System
To end this article on a positive note, let's summarize the good things
about this system:
- if something is screwy with the production server, you can easily
revert to a known and tested version
- a programmer who is a trained CVS user can protect and comment his
or her changes by explicitly doing a cvs commit
- a contributor who is ignorant of CVS is protected by the nightly
cron job against losing more than one day of work
Reader's Comments
If you are setting up a new CVS server, spend a few extra minutes to configure CVS using the client-server ("pserver") mode, instead of the older file system mode. This will save you pain later and may keep you out of hot water. Pain, because moving the repository (your old one dies, your company IPOs and your boss wants to buy a big fancy server farm, you want to hide the repository behind a firewall) is a matter of changing an environment variable. You get immediate access control (developers can be protected from updating the production environment). CVS in file system mode can "hang" because it leaves a lock file around for each file and directory. Then you need a CVS guru to dive in and fix it. One note: you can't live in a mixed environment. It is either one mode or the other.
An expert tip on using client-server: CVS uses gzip for compressing data across the network. The default setting is -z3, which is a pitiful waste of time. Recompile CVS to use -z9 by default (the network is the bottleneck, not CPU resources), or add it to everyone's .cvsrc configuration file (it lives in each user's home directory).
I've had some extremely painful experiences with CVS and large binary files (large meaning 32 MB or more). When CVS checks a file out of the repository, even if it is doing nothing more than a straight copy (no diff'ing, merging, etc.), the program brings the whole file into contiguous memory. This bloats the CVS process resident set size to at least the size of the file, plus about 6 MB for the program, give or take. The process is inefficient, so subsequent large files don't reuse the space well. CVS bloats even more. Make sure that your server is configured with a lot of swap space (it should have a lot of memory anyway). Even so, performance will drag down into the ground until CVS is finished (could be 30 minutes for a large working set), then things will "mysteriously" return to normal.
-- Ken Mayer, July 23, 1999
Your proposed once-per-day automatic check-in of everything is a nice idea for a group such as your ArsDigita company with its fairly nonstandard mission statement.
In more mundane companies, however, you usually have at least one mid-level manager who will see the amount of code checked in every day as a measurement of individual employee efficiency, and wreak all sorts of havoc with this misguided "knowledge".
I'm sure some of you have experienced mid-level managers who were too dumb to even figure out how to do this, but I have never been that unlucky ;-)
Apart from this, your proposed method sounds remarkably similar to what I have been doing for various db-backed websites over the last few years. It has proven itself to me to be a great time saver, and I don't even want to calculate how many near disasters, with their associated all-night fix-up sessions, it has saved me or my co-workers from.
The pserver is surely the only way to share CVS among a group of people without running into all sorts of non-interesting problems with nfs etc. You can also tunnel it through ssh for secure over-the-net operations.
-- Kristian Sørensen, July 24, 1999
Regarding putting the stuff in /parameters (the .ini files) under CVS, and requiring different .ini files for your three servers: this is a darn good reason to use Tcl configuration files in AOLserver 3.0 instead of .ini files. Then the config file can use Tcl to determine whether it's a production, dev, or staging server (based on an environment variable, or the server home, etc.), and use the appropriate config values.
-- Rob Mayoff, February 26, 2000
Although my company does not use CVS, we have used Microsoft's Visual SourceSafe and Intersolv's PVCS Version Manager. Both were a pain to set up and to get people to use. All complaints usually go away after the first time that version control saves your day after some screwup. As for the managers, they usually don't care. Obviously some misguided soul is going to use this tool to gather information on who worked on what and for how long, but around the office 99% of the people are interested in it because it saves us from many headaches.
I don't think I ever want to work on a project without some kind of version control.
-- Pedro Vera-Perez, March 14, 2000
When dealing with large teams of developers, using CVS can be a real headache. One alternative would be BitKeeper, which solves most (if not all) of CVS's problems. It was written by the guys who did Sun's TeamWare source management system.
-- Petru Paler, April 18, 2000
The Mozilla project is doing well with CVS and a quite large number of developers.
Also, CVS pserver mode is rather unsafe, but it works really well over SSH. I made a small Perl script that serves as the login shell for accounts that should only be used for CVS. It checks whether the user is about to run CVS, and shows a message if not.
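#!/usr/bin/perl -Tw
# Restricted login shell: sshd invokes it as <shell> -c <command>
use strict;
# perlsec: clear these before exec, or taint mode refuses to run the command
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
$ENV{PATH} = '/usr/bin:/bin:/usr/sbin:/sbin';
# $ARGV[1] holds the requested command; it is undefined on an interactive login
unless (defined $ARGV[1] && $ARGV[1] eq 'cvs server') {
    print STDERR "This account can only be used for CVS access\n";
    exit(1);
}
# List form of exec bypasses the shell entirely
exec('cvs', 'server') or die "cannot exec cvs: $!\n";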
-- Pierre Phaneuf, November 6, 2000
The whole purpose of having a source control system is so
multiple developers can work on the same set of source files.
Suggesting only 2 persons need to know CVS while others still
check stuff in with FTP shows that the author does not appreciate
the full potential of CVS or a similar source control system.
-- jay teo, March 9, 2001
I have to disagree with your comments on having one development area per developer. I think it's mandatory; doing otherwise would defeat the whole purpose of CVS. Why would you have two developers working in the same directory at the same time? The file system doesn't allow for two users to edit the same file at the same time. Also, there is no need to create another instance of Oracle for this purpose. The same development instance can be used for all developers. Database changes happen less frequently than code changes.
Where I work, the process of creating a separate development space has been packaged into an RPM, or you can use whatever packaging system you prefer. There is no need for a different IP address; just use a different port. All development areas can reside on the same machine.
-- Thai Nguyen, February 14, 2002
Related Links
- CVS guide by Mark D: Useful introduction and reference for CVS. (contributed by Walter T.E. McGinnis)
- The CVS Book: Sections of a CVS book released under the GNU GPL. (contributed by Rob Campbell)
- PVCS: A version control package required when producing software for some government agencies (especially DoD). (contributed by Pedro Vera-Perez)
- StarTeam: Revision control system that runs on Windows and Unix (very nice to use and easy to maintain). (contributed by Chris Spears)
- CVS and the Microsoft Source Code Control interface: A project to integrate CVS repositories through the Microsoft SCC interface, so you can use CVS within various Windows IDEs. (contributed by Stephane Boisson)
- Perforce SCM tool: Very nice cross-platform tool; has some similarities to CVS but offers nicer branching and atomic changesets. Free for open source development (e.g., used by the Perl porters group). (contributed by Robert Cowham)
- WinCVS GUI: A great CVS frontend with a very active community and great documentation. I now have our entire IT team using CVS without them ever touching a command line. (contributed by Phillip Thurmond)