Clickstream Data Warehousing
by Richard Li (richardl@arsdigita.com), Jon Salz (jsalz@mit.edu)
Submitted on: 2000-07-01
Last updated: 2000-08-14
Planning, development, and promotion are the most obvious tasks
involved in building a web service. But analyzing how end users use a
service may be more important to that service's success. Understanding
how users reach your site and how they behave once they are there
allows you to improve future users' experiences with the site. It also
allows you to improve your ROI by increasing sales and increasing the
"stickiness" of user visits.
We turn to clickstreams, the streams of requests (clicks)
users generate as they move from page to page within a web site, to
provide this kind of information. Clickstreams contain a tremendous
amount of quantitative data that can be used to improve users'
experiences with a site. When properly mined, clickstream data can
provide answers to very specific questions such as:
- What are the most popular pages on a site? The most popular products?
- How do people arrive at the site? Where do they go when they
leave? How long do they stay?
- What percentage of sessions resulted in a sales transaction?
- Which pages on our web site seem to be "session killers," where
the remote user stops the session and leaves? Does a significant
percentage of visitors leave the site with loaded shopping carts at one
particular page?
- Which browsers and operating systems do visitors use?
- What are the most common paths users take throughout the site?
- How successful are various banner ad campaigns? How much do
people spend on my site if they arrived via a particular banner ad?
The answers tell us where to direct development and promotion
efforts in the future. If we find that particular services we've put a
lot of effort into are being neglected by users, we want to feature
them near the top of the site to improve their popularity. If we find
that users are leaving their shopping carts full at a certain point in
the E-commerce checkout process, we may want to hire a usability
consultant to simplify the UI so users will complete their
transactions.
The standard approach today to analyzing user behavior on a web site
is log file analysis. Web servers log each HTTP request to a text file
in the filesystem; a log file analyzer can then process these log files
and present the data in a variety of useful ways. However, web server
log files can only answer
a subset of the questions that can be answered by clickstream analysis
(see Chapter 9: User Tracking in Philip and Alex's Guide to Web
Publishing, http://www.arsdigita.com/books/panda/user-tracking).
Standard log file analyzers cannot provide the level of analysis
made possible by clickstream data warehousing for several
reasons. First, a standard log file does not contain all the
information that the web server knows about a visitor, such as user
identity. In addition, building a data warehouse for clickstream data
makes it possible to correlate clickstream data with data from other
sources. For example, we can correlate the clickstream behavior of a
particular group of users with the dollar value currently in their
shopping carts.
Introduction to Data Warehousing
A data warehouse is a relational database whose goal is the
analysis of large amounts of data. This goal is distinct from the
other common use of databases, transaction processing, whose goal is
to maintain records of transactions. A data warehouse contains data
that is derived from transaction data, but may also include data from
other sources. A data warehouse is used to analyze data for patterns
and to answer questions that may not be asked until after the data are
actually available. In other words, a data warehouse must be a
flexible structure for storing vast quantities of data so that it can
be prepared to answer any question that may be posed to it, questions
that inevitably will not be anticipated at design time.
A star schema is the standard proven design for a data
warehouse. The heart of the star schema is the fact table, which has
one row for each fact in the system. Choosing the level of detail in
the information that goes into a particular row is known as
determining the granularity of the data warehouse. Each column of the
fact table is a foreign key that references a particular dimension;
data about a particular dimension is stored in dimension tables that
are joined with the fact table as needed. For instance, a data
warehouse for a consumer products retailer might have a
granularity of one product sale per row. The dimensions of the data
warehouse may include which product was sold, where it was sold, the
price at which it was sold, and the date when it was sold. When a
query requires data from multiple dimensions, these dimensions are
joined by the database to form what appears to be one large table with
the necessary data.
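To make this concrete, here is a minimal sketch of a star schema for
the retailer example; the table and column names are hypothetical and
are not part of the clickstream data model described later in this
article.

    create table product_dim (
        product_id   integer primary key,
        product_name varchar(200)
    );

    create table store_dim (
        store_id     integer primary key,
        region       varchar(100)
    );

    create table date_dim (
        date_id      integer primary key,
        month        integer,
        year         integer
    );

    -- one row per product sale; every column except the measures is a
    -- foreign key into a dimension table
    create table sales_fact (
        product_id   integer references product_dim,
        store_id     integer references store_dim,
        date_id      integer references date_dim,
        sale_price   number,
        quantity     integer
    );

    -- "which products sold best in each region in December 1999?"
    select p.product_name, s.region, sum(f.sale_price * f.quantity) as revenue
      from sales_fact f, product_dim p, store_dim s, date_dim d
     where f.product_id = p.product_id
       and f.store_id   = s.store_id
       and f.date_id    = d.date_id
       and d.month = 12 and d.year = 1999
     group by p.product_name, s.region;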
The process of building a data warehouse can be broken down into
three phases: gathering data, data transformation and data warehouse
population, and querying the database. In the example above, the data
gathering process would collect data about every sale transaction,
store location, and any other data required by the data warehouse. The
data gathering process usually requires that data be gathered
from multiple sources: sales information is collected from each
retailer, geographic information comes from a company database, and so
forth. The data transformation phase takes all the data gathered in
the previous step and loads the data into the warehouse. This process
reconciles any inconsistencies in the data (for example, data from one
source might use code BLU to indicate an item with the color blue,
while another source may use the code BLUE to mark the same
attribute), computes any necessary values from the raw data, and
populates the data warehouse with the processed data. The third phase
is targeted at the end user, and involves providing user interfaces
that allow users to query and analyze the data in the warehouse. Each
of these phases is actually an ongoing process; new data are
continually gathered, transformed, and analyzed.
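As a sketch of the transformation phase, the following statement
normalizes the BLU/BLUE color codes mentioned above while copying rows
from an assumed staging table (raw_sales) into an assumed color
dimension (color_dim); these table and column names exist only for
illustration.

    insert into color_dim (color_code)
    select distinct decode(r.color, 'BLU', 'BLUE', r.color)   -- reconcile code variants
      from raw_sales r
     where not exists (select 1
                         from color_dim c
                        where c.color_code = decode(r.color, 'BLU', 'BLUE', r.color));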
Design Goals
Four key considerations drive the design of the clickstream data
warehouse architecture.
- Performance and scalability: On a heavily trafficked site,
tens or hundreds of millions of clicks can occur every day. This
enormous amount of data requires that the data model be robust enough
to support queries involving billions of rows of data, and that the
data population scheme be efficient and scalable.
- Transparency: Once installed and properly configured, the system
should be transparent to both the end user and the web developer. If
the system requires developers to adopt additional coding practices to
support clickstream logging, it will likely provide inaccurate data or
go unused.
- Tools-agnostic: Different web sites use different
technologies; the clickstream system should not preclude the use of
any particular software. Moreover, integrating the clickstream system
on a live site should not require rearchitecting substantial parts of
the original web site. The particular implementation we discuss here
relies on AOLserver and Oracle.
- Extensibility: The system should support ad hoc queries into
a very large data warehouse without incurring substantial performance
penalties, even if the query is written by an end user inexperienced in
tuning SQL.
Collecting and Analyzing Data
As mentioned before, clickstreams provide a tremendous amount of
data. We have one data point per hit to a site -- potentially
millions per day and billions per year -- so if we're not careful,
analyzing our data could be very expensive. Typically, the
life cycle of any particular data point is as follows:
- Immediately after the user requests a page on the web site, the
server writes a line to a log file containing very detailed
information about the request.
- Periodically, a separate data warehousing server requests the
latest chunk of the log file and loads it into an event log table
(cs_event_log) in the Oracle database.
- Entries in the event log table are analyzed by the
clickstream core code in Oracle, which consists of a few thousand
lines of PL/SQL. This involves populating the dimensions -- page,
session, user agent, etc. -- and inserting a row into the data
warehousing fact table for each row in the event log table.
- Someone queries the data warehouse, obtaining
some sort of report.
Let's take a look under the hood at each of these steps.
Step 1: Event Logging
We use AOLserver filters (the ns_register_filter API) to trap
requests to HTML pages so we can log information about them to files
in the filesystem. Hits are generally logged into files with names
like servername-cs.log.2000-04-25.14:00 (this example
is for the period of time between 2:00 and 3:00 p.m. on 25 April
2000). Each line has the following fields:
- The time at which the page was requested
- The time at which the download of the page was completed
- The URL requested
- The IP address of the request
- The user ID of the user, if known
- The query string in the request URL
- The content type of the page delivered
- The number of bytes delivered
Every 24 hours, the individual clickstream logs are
concatenated into one large log file and compressed.
Step 2: Loading data into the warehouse
We use HTTP to transfer the compressed clickstream log files from the
production web server(s) to a dedicated data warehouse machine. We use
a dedicated data warehouse machine because we don't want to slow the
production web server down while we're populating the database, and
because we configure Oracle differently for data warehousing. Once the
clickstream logs have been transferred, we generate a SQL*LOADER .ctl
file and do a direct path load of the data into the
cs_event_log table.
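The generated control file is along the following lines; the field
delimiter, column list, and file name shown here are assumptions for
illustration, not the exact file we generate.

    -- hypothetical SQL*Loader control file for one hour of clickstream data
    OPTIONS (DIRECT=TRUE)
    LOAD DATA
    INFILE 'servername-cs.log.2000-04-25.14:00'
    APPEND
    INTO TABLE cs_event_log
    FIELDS TERMINATED BY '|'
    TRAILING NULLCOLS
    (
        event_time,
        end_time,
        url,
        user_ip,
        user_id,
        query,
        content_type,
        bytes
    )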
Step 3: Populating the warehouse
Once there is data to be processed inside the cs_event_log
table, we run a PL/SQL procedure to populate the data warehouse. We
use PL/SQL to maximize performance, since a considerable amount of
computation occurs in this step. The PL/SQL procedure performs
several major operations:
- First, it calls PL/SQL procedures that populate the pages,
referrers, and user_agents dimensions of the warehouse. These
procedures obtain a list of all entries in the appropriate dimension,
and compare this list of entries to the incoming data set. If there
is any data in the cs_event_log table that is not in the
dimension, the procedures insert the additional data. For instance,
there is one row in the referrer dimension for each unique
referrer. The cs_create_pages_and_referrers procedure scans
the cs_event_log table and inserts any new referrers into the
cs_dim_referrers table. Since SQL queries in PL/SQL against large
dimensions are quite slow (because of context-switching overhead), we
build dynamic PL/SQL procedures that cache all the necessary data
directly inside the PL/SQL procedure. A simplified sketch of this step
appears after this list.
- Once the dimensions are properly populated, the fact table is
updated with any new facts.
- Once the necessary rows have been copied into the fact table, the
dynamic sitemap tree is created in the cs_sitemap table (see
the Sitemap section below).
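Here is a simplified sketch of the dimension-population step from the
first item above, using the referrer dimension. The sequence
cs_dim_referrer_id_seq and the referrer_url column are assumed names,
and the row-at-a-time style shown here is exactly the kind of per-row
SQL that is slow in PL/SQL; the real cs_create_pages_and_referrers
procedure instead caches dimension entries inside dynamically
generated PL/SQL.

    create or replace procedure cs_populate_referrers_sketch
    is
        v_count integer;
    begin
        -- for each distinct referring URL in the incoming event log...
        for e in (select distinct referring_url
                    from cs_event_log
                   where referring_url is not null) loop
            -- ...add it to the referrer dimension if it is not already there
            select count(*) into v_count
              from cs_dim_referrers r
             where r.referrer_url = e.referring_url;
            if v_count = 0 then
                insert into cs_dim_referrers (referrer_id, referrer_url)
                values (cs_dim_referrer_id_seq.nextval, e.referring_url);
            end if;
        end loop;
        commit;
    end;
    /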
Step 4: Analysis
Analyzing the data in the warehouse is extremely expensive -- not only
are there millions of rows to scan and aggregate, but joins against
one or more of the dimension tables are usually required as well. We
attempted to address these performance issues in several ways.
We attempted to maximize the amount of computation that was done
prior to querying, and cache the results of this precomputation. The
primary way we performed precomputation was through the use of
materialized views. Materialized views are views that precompute
aggregates and joins, storing the results in tables. As new rows are
inserted into the underlying tables, the materialized view can be
incrementally refreshed. Unfortunately, restrictions imposed by Oracle
on the use of materialized views preclude using materialized views for
every aggregate and join needed for queries. In addition, we had
substantial problems in getting materialized views to work
consistently, and we were forced to periodically recreate materialized
views from scratch, an extremely time-consuming process.
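As an illustration of this approach (not necessarily a view we
actually shipped), a materialized view that precomputes page views per
page per day might look like the following; fast refresh assumes a
materialized view log on the fact table, as shown.

    create materialized view log on cs_fact_table
        with rowid (page_id, date_id) including new values;

    -- precomputed page-view counts per page per day
    create materialized view cs_page_views_by_day
        build immediate
        refresh fast on demand
        enable query rewrite
    as
    select page_id, date_id, count(*) as n_views
      from cs_fact_table
     group by page_id, date_id;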
We also attempted to optimize the queries with data warehouse hints
such as /*+ STAR */, but, upon further analysis, we determined that
the performance of these queries was I/O bound, and that these hints
were not substantially improving the performance of the queries. We
believe that there is room in this area for substantial improvement,
since we were developing on a shared development machine with an
Oracle installation configured for OLTP.
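For reference, the hints were applied along these lines; the
day_of_week and page_type column names are assumptions used for
illustration.

    -- page views broken down by page type and day of week,
    -- with a hint requesting a star query plan
    select /*+ STAR */ d.day_of_week, p.page_type, count(*) as n_views
      from cs_fact_table f, cs_dim_dates d, cs_dim_pages p
     where f.date_id = d.date_id
       and f.page_id = p.page_id
     group by d.day_of_week, p.page_type;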
Since the queries we were executing against the data warehouse grew
exponentially slower as more rows were loaded into the warehouse, we
decided to store only the results of analyses against the data. In
this new scheme, all the data in the data warehouse is deleted on a
daily basis, and clickstream data for the most recent 24-hour period is
loaded. These data are then analyzed, and the results of each analysis
are stored in an HTML file for the given day. Thus the data warehouse
holds only the clickstream data for one day at any given time. The
price paid for the scalability, reliability, and performance that this
approach offers is that aggregation across multiple days becomes more
complex, and new reports cannot be performed on previously collected
data.
The Data Model
The cs_event_log table contains all the data that is directly
loaded into the warehouse with SQL*LOADER, and is identical in format
to the clickstream log files:
create table cs_event_log (
    event_id            integer,
    event_time          integer,
    end_time            integer,
    date_id             integer,
    url                 varchar(4000),
    instance_id         varchar(4000),
    user_ip             varchar(50),
    user_id             integer,
    query               varchar(4000),
    bytes               integer,
    content_type        varchar(200),
    session_id          integer,
    browser_id          integer,
    user_agent_string   varchar(4000),
    accept_language     varchar(10),
    referring_url       varchar(4000),
    method              varchar(10),
    status              integer,
    secure_p            char(1)
);
The data warehouse itself is a standard star schema, consisting of a
single fact table and multiple dimensions. Note that the fact table
definition and the dimension list below show only a subset of the
dimensions in the data model.
The granularity of the fact table is one page view. Each column of the
fact table that references a particular dimension is a foreign
key. The only way that data are entered into the data warehouse is via
the PL/SQL procedures. Since the PL/SQL guarantees the preservation of
the constraints, we disable all constraints to enhance performance.
We maintain the constraints in the data model in the disabled state so
that the Oracle cost-based optimizer can take advantage of this
information when it optimizes queries.
create table cs_fact_table (
    cs_id                integer primary key,
    -- load start/end, dwell time if able to determine
    load_start           integer,
    load_time            integer,
    dwell_time           integer,
    page_id              integer
                         constraint cs_ft_page_id_fk references cs_dim_pages
                         not null disable,
    -- which instance of the page is it?
    instance_id          varchar(3000),
    instance_provided_p  char(1) default 'f' not null
                         check(instance_provided_p in ('t','f')) disable,
    referrer_id          integer
                         constraint cs_ft_referrer_id_fk references cs_dim_referrers
                         disable,
    date_id              integer
                         constraint cs_ft_date_id_nn not null disable
                         constraint cs_ft_date_id_fk references cs_dim_dates disable,
    session_id           integer
                         constraint cs_ft_session_id_fk references cs_dim_sessions
                         disable,
    user_agent_id        integer
                         constraint cs_ft_user_agent_id_fk references cs_dim_user_agents
                         disable,
    user_ip              varchar(50)
                         constraint cs_ft_user_ip_nn not null disable,
    browser_id           integer,
    user_id              integer,
    -- HTTP method (GET, POST, HEAD)
    method               varchar(10)
                         constraint cs_ft_method_nn not null disable,
    -- HTTP status code
    status               integer
                         constraint cs_ft_status_nn not null disable,
    bytes                integer,
    content_type         varchar(200),
    -- might want to turn accept_language into a dimension?
    accept_language      varchar(10),
    secure_p             char(1)
                         constraint cs_ft_secure_p_nn not null disable,
    constraint cs_ft_secure_p_ck check(secure_p in ('t','f')) disable,
    -- how many clicks into the session are we? 1, 2, 3...
    cs_within_session    integer,
    user_state_id        integer
                         constraint cs_ft_user_state_id_fk references cs_dim_user_state
                         disable
);
The cs_dim_user_agents dimension is an example of a typical
dimension in the clickstream data warehouse. This dimension records
information about the particular user agent for each particular HTTP
request:
create table cs_dim_user_agents (
    user_agent_id            integer
                             constraint cs_dua_user_agent_id_pk primary key,
    -- The string returned by the browser
    user_agent_string        varchar(3000) not null,
    -- Mozilla, Opera, IE
    browser_type             varchar(3000),
    browser_version          varchar(3000),
    -- The major part of the browser version. For MSIE and Mozilla, this is just
    -- the first three characters.
    browser_version_major    varchar(3000),
    -- MacOS, Win32, Unix
    operating_system         varchar(50),
    -- MacOS 8, Windows 95, Windows 98, Windows NT, Linux, Solaris, etc.
    operating_system_variant varchar(50)
);
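A report answering "which browsers and operating systems do visitors
use?" then reduces to a join between the fact table and this
dimension; a query in this spirit:

    -- page views by browser and operating system
    select ua.browser_type, ua.operating_system, count(*) as page_views
      from cs_fact_table f, cs_dim_user_agents ua
     where f.user_agent_id = ua.user_agent_id
     group by ua.browser_type, ua.operating_system
     order by page_views desc;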
The current data model includes the following dimensions:
- Page dimension: What was the name of the page requested?
What type of page was requested? What was the URL? What was the
content type?
- Referrer dimension: What kind of referrer was it (e.g., a
search engine)? What was the referring URL? What search text, if any,
was used? If the referrer was inside the site, what is the referring
page ID?
- Date dimension: Which day of the week was the request? Day
in the month? Day in the year? Which week in the year?
- User agent dimension: Which browser is the user using?
Which version? Which operating system?
- Session dimension: When did a particular session start?
When did it end? Which user? How many clicks were part of the session?
What was the total value of the sales in the session?
- User state dimension: What is the current value of the
shopping cart?
- User demographics dimension: What is the income level of
the user? What is his/her race?
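These dimensions combine to answer the questions listed in the
overview. For instance, a sketch of the banner-ad question ("how much
do people spend if they arrived via a particular referrer?") might
join the session and referrer dimensions as follows; the column names
referrer_url and sales_value are assumptions, since the actual
definitions of those dimensions are not reproduced here.

    select r.referrer_url,
           count(distinct s.session_id) as n_sessions,
           avg(s.sales_value)           as avg_session_sales
      from cs_fact_table f, cs_dim_sessions s, cs_dim_referrers r
     where f.session_id  = s.session_id
       and f.referrer_id = r.referrer_id
       and f.cs_within_session = 1        -- look only at each session's entry click
     group by r.referrer_url
     order by avg_session_sales desc;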
An example analysis: the dynamic sitemap
The dynamic site map illustrates the most popular paths users follow
when navigating the site. For each page in the site, we determine
which page is the most common referrer to that page. If this "best
referrer" is external, the page shows up as a root-level node in the
sitemap (since people typically get to the page from some external
source); if the best referrer is internal, the page shows up beneath
that referrer in the sitemap tree.
cs_create_sitemap
builds this tree breadth-first:
first it inserts into the sitemap all pages which belong in the top
level (i.e., have an external "best referrer"). Then it inserts into
the second level of the sitemap any pages that have a top-level node
as a best referrer; then it inserts any pages which have a
second-level node as best referrer, etc. It loops until it can't
insert any more nodes.
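The following anonymous PL/SQL block sketches that breadth-first
loop. The cs_sitemap columns (page_id, parent_id, depth) and the
helper view cs_best_referrers (page_id, best_referrer_page_id) are
assumed names for illustration; the real cs_create_sitemap procedure
also handles the cycle cases described next.

    declare
        v_level integer := 0;
    begin
        -- level 0: pages whose best referrer is external
        insert into cs_sitemap (page_id, parent_id, depth)
        select page_id, null, 0
          from cs_best_referrers
         where best_referrer_page_id is null;

        loop
            v_level := v_level + 1;
            -- insert pages whose best referrer was placed at the previous level
            insert into cs_sitemap (page_id, parent_id, depth)
            select br.page_id, br.best_referrer_page_id, v_level
              from cs_best_referrers br, cs_sitemap sm
             where br.best_referrer_page_id = sm.page_id
               and sm.depth = v_level - 1
               and not exists (select 1
                                 from cs_sitemap s2
                                where s2.page_id = br.page_id);
            exit when sql%rowcount = 0;
        end loop;
        -- any pages still missing are involved in cycles (handled separately)
    end;
    /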
When this process is complete, there may still be some pages which
haven't been inserted into the tree: consider the case where two pages
have each other as best referrers. When cycles such as these occur, we
pick the node in the cycle which is most commonly hit (call it
X), and we delete as potential parents all other nodes in the
cycle. We then continue the loop described in the previous
paragraph. The node X is guaranteed at this point to be
inserted somewhere in the site map, since none of the nodes in the
cycle can possibly be X's best referrer any more. We repeat
this process until all cycles are resolved.
The end result is a dynamically built sitemap illustrating the paths
users most frequently take navigating the site.
An example query: Entry Pages
How users typically enter a web site is a common question for web site
operators. We can easily answer this question with a single query:
select p.page_id, p.page_url name, count(*) value
from cs_fact_table f, cs_dim_pages p, cs_dim_sessions s
where f.page_id = p.page_id
and f.session_id = s.session_id
and f.cs_within_session = 1
and p.exclude_p = 'f'
and s.clicks > 1
group by p.page_id, p.page_url
order by 3 desc
This query performs a join between the main fact table and two
dimension tables, the cs_dim_pages table and the cs_dim_sessions
table. Let's dissect the remaining where clauses:
- f.cs_within_session = 1 ensures that the page_id retrieved is the
first page clicked in a given session.
- p.exclude_p = 'f' excludes requests that should not be counted
(e.g., requests for images).
- s.clicks > 1 ensures that we are only analyzing sessions that have
registered more than one page request. We throw away one-hit sessions
because we cannot distinguish between first-time visitors who hit the
web site once and visitors with non-cookied browsers. If we included
one-hit sessions, our data would be skewed by non-cookied visitors who
are browsing the site, since every page request from these visitors
would look like an entry page.
Future work
Several of our clients currently use our clickstream data warehouse
architecture, and rely on it to answer many of the important business questions
described in the overview. We plan on improving the architecture by:
- Refining and developing a comprehensive set of queries that answer
most of the questions users want answered.
- Extending the data population code to include all dimensions in
the data model and determining what additional dimensions should be
added.
- Defining a standard way to extend the data model by adding custom
dimensions or augmenting existing ones, and developing a mechanism to
transfer the additional data necessary to populate these extra fields
from the production web server to the database.
- Validating the data collected and analyzed by comparing the
results against other log analyzers.
A development snapshot of the clickstream software discussed in this
article is
available for download.