Data Warehousing for Cavemen
by Philip Greenspun (philg@mit.edu), Jin S. Choi (jsc@arsdigita.com)
"Another segment of society that has constructed a language of its own
is business. ... [The businessman] is speaking a language that is
familiar to him and dear to him. Its portentous nouns and verbs invest
ordinary events with high adventure; the executive walks among ink
erasers caparisoned like a knight. This we should be tolerant of--every
man of spirit wants to ride a white horse. ... A good many of the
special words of business seem designed more to express the user's
dreams than to express his precise meaning."
-- last chapter of The Elements of Style, Strunk and White
This document is intended to make you buzzword-compliant with the MIS
world. In terms simple enough even for an MIT computer science
Ph.D. to understand, we're going to explain OLTP, data warehousing,
and OLAP. Kiss that ghetto post-doc
goodbye and watch big companies line up to pay you $300/hour to
romance their most critical data.
Let's imagine a conversation between the Chief Information Officer of
WalMart and a sales guy from Sybase. We've picked these
companies for concreteness but they stand for "big Management
Information System (MIS) user" and "big relational database management
system (RDBMS) vendor".
Walmart: "I want to keep track of sales in all of my stores
simultaneously."
Sybase: "You need our wonderful RDBMS software. You can stuff data in as
sales are rung up at cash registers and simultaneously query data out
right here in your office. That's the beauty of concurrency control."
So Walmart buys a $1 million Sun E10000 multi-CPU server and a
$500,000 Sybase license. They buy Database Design for Smarties and
build themselves a nice normalized SQL data model:
SALES table

| product id | store id | quantity sold | date/time of sale |
|---|---|---|---|
| 567 | 17 | 1 | 1997-10-22 09:35:14 |
| 219 | 16 | 4 | 1997-10-22 09:35:14 |
| 219 | 17 | 1 | 1997-10-22 09:35:17 |
| ... | | | |

PRODUCTS table

| product id | product name | product category | manufacturer id |
|---|---|---|---|
| 567 | Colgate Gel Pump 6.4 oz. | 1 | 68 |
| 219 | Diet Coke 12 oz. can | 2 | 5 |
| ... | | | |

PRODUCT_CATEGORIES table

| product category id | product category name |
|---|---|
| 1 | toothpaste |
| 2 | soda |
| ... | |

MANUFACTURERS table

| manufacturer id | manufacturer name |
|---|---|
| 68 | Colgate |
| 5 | Coca Cola |
| ... | |

STORES table

| store id | city id | store location | phone number |
|---|---|---|---|
| 16 | 34 | 510 Main Street | 415-555-1212 |
| 17 | 58 | 13 Maple Avenue | 914-555-1212 |
| ... | | | |

CITIES table

| city id | city name | state | population |
|---|---|---|---|
| 34 | San Francisco | California | 700,000 |
| 58 | East Fishkill | New York | 30,000 |
| ... | | | |
After a few months of stuffing data into these tables, a WalMart
executive, call her Jennifer Amolucre, asks "I noticed that there
was a Colgate promotion recently, directed at people who live in small
towns. How much Colgate toothpaste did we sell in those towns
yesterday? And how much on the same day a month ago?"
At this point, reflect that because the data model is normalized, this
information can't be obtained from scanning one table. A normalized
data model is one in which all the information in a row depends only
on the primary key. For example, the city population is not contained
in the STORES table. That information is stored once per city in the
CITIES table and only CITY_ID is kept in the STORES table. This
ensures efficiency for transaction processing. If Walmart has to
update a city's population, only one record on disk need be touched.
As computers get faster, though, the efficiency argument matters less;
what is more interesting is the consistency of this approach. With the
city population kept in only one place,
there is no risk that updates will be applied to some records and not
to others. If there are multiple stores in the same city, the
population will be pulled out of the same slot for all the stores all
the time.
Ms. Amolucre's query will look something like this...
select sum(sales.quantity_sold)
from sales, products, product_categories, manufacturers, stores, cities
where manufacturer_name = 'Colgate'
and product_category_name = 'toothpaste'
and cities.population < 40000
and trunc(sales.date_time_of_sale) = trunc(sysdate-1) -- restrict to yesterday
and sales.product_id = products.product_id
and sales.store_id = stores.store_id
and products.product_category_id = product_categories.product_category_id
and products.manufacturer_id = manufacturers.manufacturer_id
and stores.city_id = cities.city_id;
If you find this query tough to read, you might want to refresh your
knowledge of SQL by browsing through SQL for Web Nerds at
http://www.arsdigita.com/books/sql/. Anyway,
the basic idea is that you have to do a 6-way JOIN of some fairly
good-sized tables. Moreover, these tables are being updated as
Ms. Amolucre's query is executed.
Soon after the establishment of Jennifer Amolucre's quest for
marketing information, store employees notice that there are times
during the day when it is impossible to ring up customers. Any
attempt to update the database results in the computer freezing up for
20 minutes. Eventually the database administrators realize that the
system collapses every time Ms. Amolucre's toothpaste query gets run.
They complain to Sybase tech support.
Walmart: "We type in the toothpaste query and our system wedges."
Sybase: "Of course it does! You built an on-line transaction processing
(OLTP) system. You can't feed it a decision support system (DSS) query
and expect things to work!"
Walmart: "But I thought the whole point of SQL and your RDBMS was that
users could query and insert simultaneously."
Sybase: "Uh, not exactly. If you're reading from the database, nobody
can write to the database. If you're writing to the database, nobody
can read from the database. So if you've got a query that takes 20
minutes to run and don't specify special locking instructions, nobody
can update those tables for 20 minutes."
Walmart: "That sounds like a bug."
Sybase: "Actually it is a feature. We call it pessimistic
locking."
Walmart: "Can you fix your system so that it doesn't lock up?"
Sybase: "No. But we made this great loader tool so that you can copy
everything from your OLTP system into a separate DSS system at 100
GB/hour."
There are database management systems that achieve consistency among
concurrent users via versioning rather than locking, notably Oracle
and Postgres. However, even if you were using Oracle, where readers
never wait for writers and writers never wait for readers, you still
might not want the transaction processing operation to slow down in
the event of a marketing person entering an expensive query.
Basically what IT vendors want Walmart to do is set up another RDBMS
installation on a separate computer. Walmart needs to buy another $1
million of computer hardware. They need to buy another RDBMS license.
They also need to hire programmers to make sure that the OLTP data is
copied out nightly and stuffed into the DSS system--data
extraction. Walmart is now building the data warehouse.
Insight 1
A data warehouse is a separate RDBMS installation that contains copies
of data from on-line systems. A physically separate data warehouse is
not absolutely necessary if you have a lot of extra computing
horsepower. With a DBMS that achieves consistency via versioning
rather than locking, as noted above, you might even be able to get
away with keeping only one copy of your data.
As long as we're copying...
As long as you're copying data from the OLTP system into the DSS
system ("data warehouse"), you might as well think about organizing
and indexing it for faster retrieval. Extra indices on production
tables are bad because they slow down inserts and updates. Every time
you add or modify a row to a table, the RDBMS has to update the
indices to keep them consistent. But in a data warehouse, the data
are static. You build indices once and they take up space and
sometimes make queries faster and that's it.
If you know that Jennifer Amolucre is going to do the toothpaste query
every day, you can denormalize the data model for her. If you add a
TOWN_POPULATION column to the STORES table and copy in data from the
CITIES table, for example, you sacrifice some cleanliness of data
model but now Ms. Amolucre's query only requires a 5-way JOIN. If you
add MANUFACTURER and PRODUCT_CATEGORY columns to the SALES table,
you don't need to JOIN in the PRODUCTS table.
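The first of these denormalizations, for example, is a one-time
operation in the warehouse copy of the tables (a minimal sketch,
assuming the Appendix A definitions):

alter table stores add (town_population integer);

-- fill in the new column from CITIES; in the static warehouse this is
-- done once, so the usual update-anomaly worries do not apply
update stores s
set town_population = (select c.population
                       from cities c
                       where c.city_id = s.city_id);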
Where does denormalization end?
Once you give up the notion that the data model in the data warehouse
need bear some resemblance to the data model in the OLTP system, you
begin to think about reorganizing the data model further. Maybe it
would be nice to achieve the following:
- New questions can be asked by people with limited SQL experience,
i.e., many different questions can be answered with morphologically
similar SQL. Ideally the task of constructing SQL queries can be
simplified enough to be doable from a menu system.
- Response time is predictable. A minor change in a question will
not result in a thousand-fold increase in system response time.
The irreducible problem with the OLTP data model is that it is tough
for novices to construct queries. Given that computer systems are
not infinitely fast, a practical problem is inevitably that the
response times of a query into the OLTP tables will vary in a way that
is unpredictable to the novice.
Suppose, for example, that Bill Novice wants to look at sales on
holidays versus non-holidays with the OLTP model. Bill will need to
go look at the data model, which on a production system will contain
hundreds of tables, to find out if any of them contain information on
whether or not a date is a holiday. Then he will need to use it in a
query, something that isn't obvious given the peculiar nature of the
Oracle date
data type:
select sum(sales.quantity_sold)
from sales, holiday_map
where trunc(sales.date_time_of_sale) = trunc(holiday_map.holiday_date)
That one was pretty simple because JOINing to the
holiday_map
table knocks out sales on days that aren't
holidays. To compare to sales on non-holidays, he will need to come
up with a different query strategy, one that knocks out sales on days
that are holidays. Here is one way:
select sum(sales.quantity_sold)
from sales
where trunc(sales.date_time_of_sale)
not in
(select holiday_date from holiday_map)
Note that the morphology (structure) of this query is completely
different from the one asking for sales on holidays.
Suppose now that Bill is interested in unit sales just at those stores
where the unit sales tended to be high overall. First Bill has to
experiment to find a way to ask the database for the big-selling
stores. Probably this will involve grouping the sales
table by the store_id
column:
select store_id
from sales
group by store_id
having sum(quantity_sold) > 1000
Now we know how to find stores that have sold more than 1000 units
total, so we can add this as a subquery:
select sum(quantity_sold)
from sales
where store_id in
(select store_id
from sales
group by store_id
having sum(quantity_sold) > 1000)
Morphologically this doesn't look very different from the preceding
non-holiday query. Bill has had to figure out how to use the GROUP BY
and HAVING constructs but otherwise it is a single table query with a
subquery. Think about the time to execute, however. The
sales
table may contain millions of rows. The
holiday_map
table probably only contains 50 or 100 rows,
depending on how long the OLTP system has been in place. The most
obvious way to execute these subqueries will be to perform the
subquery for each row examined by the main query. In the case of the
"big stores" query, the subquery requires scanning and sorting the
entire sales
table. So the time to execute this query
might be 10,000 times longer than the time to execute the "non-holiday
sales" query. Should Bill Novice expect this behavior? Should he
have to think about it? Should the OLTP system grind to a halt
because he didn't think about it hard enough?
Virtually all the organizations that start by trying to increase
similarity and predictability among decision support queries end up
with a dimensional data warehouse. This necessitates a new
data model that shares little with the OLTP data model.
Dimensional Data Modeling: First Steps
Dimensional data modeling starts with a fact table. This is
where we record what happened, e.g., someone bought a Diet Coke in
East Fishkill. What you want in the fact table are facts about the
sale, ideally ones that are numeric, continuously valued, and
additive. The last two properties are important because typical fact
tables grow to a billion rows or more. People will be much happier
looking at sums or averages than detail. An important decision to
make is the granularity of the fact table. If Walmart doesn't care
about whether or not a Diet Coke was sold at 10:31 AM or 10:33 AM,
recording each sale individually in the fact table is too granular.
CPU time, disk bandwidth, and disk space will be needlessly consumed.
Let's aggregate all the sales of any particular product in one store
on a per-day basis. So we will only have one row in the fact table
recording that 200 cans of Diet Coke were sold in East Fishkill on
November 30, even if those 200 cans were sold at 113 different times
to 113 different customers.
create table sales_fact (
sales_date date not null,
product_id integer,
store_id integer,
unit_sales integer,
dollar_sales number
);
So far so good, we can pull together this table with a query JOINing
the sales, products, and product_prices (to fill the dollar_sales
column) tables. This JOIN will group by product_id, store_id, and the
truncated date_time_of_sale. Constructing this query will require a
professional programmer but keep in mind that this work only need be
done once. The marketing experts who will be using the data warehouse
will be querying from the sales_fact table.
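Here is a sketch of what that one-time load might look like, under the
assumption that the applicable price is the most recent product_prices
row dated on or before the moment of sale (the PRODUCTS table turns out
not to be needed because product_id is already carried in SALES):

insert into sales_fact (sales_date, product_id, store_id, unit_sales, dollar_sales)
select trunc(s.date_time_of_sale), s.product_id, s.store_id,
       sum(s.quantity_sold),
       sum(s.quantity_sold * pp.price)
from sales s, product_prices pp
where pp.product_id = s.product_id
  and pp.from_date = (select max(pp2.from_date)
                      from product_prices pp2
                      where pp2.product_id = s.product_id
                        and pp2.from_date <= s.date_time_of_sale)
group by trunc(s.date_time_of_sale), s.product_id, s.store_id;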
In building just this one table, we've already made life easier for
marketing. Suppose they want total dollar sales by product. In the
OLTP data model this would have required tangling with the
product_prices
table and its different prices for the
same product on different days. With the sales fact table, the query
is simple:
select product_id, sum(dollar_sales)
from sales_fact
group by product_id
We have a fact table. In a dimensional data warehouse there
will always be just one of these. All of the other tables will define
the dimensions. Each dimension contains extra information
about the facts, usually in a human-readable text string that can go
directly into a report. For example, let us define the time
dimension:
create table time_dimension (
time_key integer primary key,
-- just to make it a little easier to work with; this is
-- midnight (TRUNC) of the date in question
oracle_date date not null,
day_of_week varchar(9) not null, -- 'Monday', 'Tuesday'...
day_number_in_month integer not null, -- 1 to 31
day_number_overall integer not null, -- days from the epoch (first day is 1)
week_number_in_year integer not null, -- 1 to 52
week_number_overall integer not null, -- weeks start on Sunday
month integer not null, -- 1 to 12
month_number_overall integer not null,
quarter integer not null, -- 1 to 4
fiscal_period varchar(10),
holiday_flag char(1) default 'f' check (holiday_flag in ('t', 'f')),
weekday_flag char(1) default 'f' check (weekday_flag in ('t', 'f')),
season varchar(50),
event varchar(50)
);
Why is it useful to define a time dimension? If we keep the date of
the sales fact as an Oracle date column, it is still just about as
painful as ever to ask for holiday versus non-holiday sales: we still
need to know about the existence of the holiday_map table and how to
use it. Suppose we redefine the fact table as follows:
create table sales_fact (
time_key integer not null references time_dimension,
product_id integer,
store_id integer,
unit_sales integer,
dollar_sales number
);
Instead of storing an Oracle date in the fact table, we're keeping
an integer key pointing to an entry in the time dimension. The time
dimension stores, for each day, the following information:
- whether or not the day was a holiday
- into which fiscal period this day fell
- whether the day was part of the "Christmas season" or not
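With this design, the holiday/non-holiday comparison that required two
structurally different queries in the OLTP model collapses into a
single morphologically simple query (a sketch):

select td.holiday_flag, sum(f.unit_sales)
from sales_fact f, time_dimension td
where f.time_key = td.time_key
group by td.holiday_flag;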
If we want a report of sales by season, the query is
straightforward:
select td.season, sum(f.dollar_sales)
from sales_fact f, time_dimension td
where f.time_key = td.time_key
group by td.season
If we want to get a report of sales by fiscal quarter or sales by day
of week, the SQL is structurally identical to the above. If we want
to get a report of sales by manufacturer, however, we realize that we
need another dimension: product. Instead of storing the product_id
that references the OLTP products table, much better to use a
synthetic product key that references a product dimension where data
from the OLTP products, product_categories, and manufacturers tables
are aggregated.
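A product dimension for this example might look something like the
following (a sketch; the column names are illustrative):

create table product_dimension (
	product_key integer primary key,
	product_name varchar(100),
	-- denormalized from the OLTP PRODUCT_CATEGORIES and MANUFACTURERS tables
	product_category_name varchar(100),
	manufacturer_name varchar(100)
);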
Since we are Walmart, a multi-store chain, we will want a
stores dimension. This table will aggregate information from
the stores
and cities
tables in the OLTP
system. Here is how we would define the stores dimension in an Oracle
table:
create table stores_dimension (
stores_key integer primary key,
name varchar(100),
city varchar(100),
county varchar(100),
state varchar(100),
zip_code varchar(100),
date_opened date,
date_remodeled date,
-- 'small', 'medium', 'large', or 'super'
store_size varchar(100),
...
);
This new dimension gives us the opportunity to compare sales for large
versus small stores, for new and old ones, and for stores in different
regions. We can aggregate sales by geographical region, starting at
the state level and drilling down to county, city, or ZIP code. Here
is how we'd query for sales by city:
select sd.city, sum(f.dollar_sales)
from sales_fact f, stores_dimension sd
where f.stores_key = sd.stores_key
group by sd.city
Dimensions can be combined. To report sales by city on a
quarter-by-quarter basis, we would use the following query:
select sd.city, td.fiscal_period, sum(f.dollar_sales)
from sales_fact f, stores_dimension sd, time_dimension td
where f.stores_key = sd.stores_key
and f.time_key = td.time_key
group by sd.city, td.fiscal_period
(the only additions relative to the previous query are the JOIN to
time_dimension and the references to td.fiscal_period).
The final dimension in a generic Walmart-style data warehouse is
promotion. The marketing folks will want to know how much a
price reduction boosted sales, how much of that boost was permanent,
and to what extent the promoted product cannibalized sales from other
products sold at the same store. Columns in the promotion dimension
table would include a promotion type (coupon or sale price), full
information on advertising (type of ad, name of publication, type of
publication), full information on in-store display, the cost of the
promotion, etc.
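A sketch of what such a promotion dimension might look like (the
column names are illustrative):

create table promotion_dimension (
	promotion_key integer primary key,
	-- 'coupon' or 'sale price'
	promotion_type varchar(30),
	ad_type varchar(100),
	ad_publication_name varchar(100),
	ad_publication_type varchar(100),
	in_store_display varchar(100),
	promotion_cost number
);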
At this point it is worth stepping back from the details to notice
that the data warehouse contains less information than the OLTP system
but it can be more useful in practice because queries are easier to
construct and faster to execute. Most of the art of designing a good
data warehouse is in defining the dimensions. Which aspects of the
day-to-day business may be condensed and treated in blocks? Which
aspects of the business are interesting?
In hopes of retaining the more business-minded readers while still
satisfying the technical drones, we've pushed our example of a real
data warehouse and the SQL code that we used to populate it into
Appendix B. In the next section we will consider OLAP.
What if?
Suppose that a new MBA, Giovanni Giovanericco, arrives at Walmart and
asks the DSS system "How many toothpaste tubes were sold as a function
of advertising dollars spent per person in towns with populations less
than 40,000?" No problem. Suppose that the query is changed to "How
would we predict toothpaste sales to change if we doubled spending on
advertising?" Can you formulate that in SQL?
Suppose you ask "Show me the sales in stores with the best-paid
managers." Quite easy in SQL. "Find the correlation between management
pay and sales." Quite difficult.
It turns out that SQL is the wrong language for many purposes. Just
as a trivial example, consider how painful has been your experience
with Web sites that directly expose SQL-style set queries to the
users. If you specify what you want, you get zero rows back. If you
loosen up your choices to be sure of getting at least one row, you
find yourself wading through 1000 results. The SQL query processor
finds sets of rows that exactly satisfy the WHERE clauses. It then
returns "0 rows selected". It does not return "0 rows selected but
the last 17 were killed off by the hour_of_travel constraint and 98
before that were killed off by the cheapest_fare clause."
Walmart: "You said SQL was great and would solve our problems. But we
can't ask our most important questions in SQL."
Sybase: "We only sold you an OLTP system and then a DSS system. These
questions you've brought to us are online analytical processing (OLAP)
queries. You can't expect to run these against a relational
database. You need an OLAP system. It will only cost you another $1
million in hardware and $500,000 in software licenses."
If your data set isn't too huge, you can be more flexible and do more
interesting calculations after sucking all the data out of an RDBMS
into virtual memory data structures and exploring from there. That's
more or less what OLAP systems do. If you make some pretense at
finding patterns automatically or semi-automatically, you can call
what you are doing data mining. This is one of the great tech
jobs in a big company. The farther you get from the OLTP system, the
better off you are. In the OLTP world, if you're down for one minute
the entire company knows about it. In the DSS world, you can buy all
the hardware that suits your fancy, be down for a day or two, and
nobody will complain too much. In the OLAP world, you can buy the
biggest computers on the planet, put your feet up on your desk for
months and months, and tell anyone who asks that you're still mining
the data.
What do you in fact find at a big company?
A big company these days will have three copies of the same data.
There will be a huge RDBMS installation for OLTP. Every night data
from the OLTP system will be copied into the even bigger RDBMS
installation for DSS queries. Every night subsets of data from the
DSS system will be copied into various kinds of OLAP systems. This is
how more and more computers get sold even though companies aren't
really accomplishing much more than they were in 1965.
Buzzwords that you should recognize
If you want to make $500,000 per year without learning any skills,
mastering the argot of the data warehousing world is not a bad place
to start. Here is a primer:
- datamart: a view or a copy of a subset of the data in a data
warehouse. In a company with lots of divisions, it might be
overwhelming for "the toothpaste guys" to see all the tables. So a
collection of views and tables specific to toothpaste is produced. If
the toothpaste guys want to do a lot of DSS queries, they sometimes
will get their very own RDBMS installation (more money for the
hardware and software vendors).
- dimensional data model: see "star schema"
- EIS: executive information system (same as decision support system (DSS))
- ODS: operational data store -- an archive of operational data in its
original raw form
- OLAP: online analytical processing
- OLTP: online transaction processing; the information systems that run
the underlying business
- SKU: stock keeping unit. A unique key for a product sold by a store.
- star schema: see "dimensional data model"
Shopping list
Now that you know the buzzwords, you are ready to start shopping. The
world of IT vendors stands ready to sell you tools that will make
building and operating your data warehouse much easier.
The first tool that you need is intelligence and thought. If you pick
the right dimensions and put the required data into them, your data
warehouse will be useful. If you don't get your dimensions right, you
won't even be able to ask the interesting questions. If you're not
smart or thoughtful, probably the best thing to do is find a boutique
consulting firm with expertise in building data warehouses for your
industry. Get them to lay out the initial star schema. They won't
get it right but it should be close enough to live with for a few
months. If you can't find an expert, The
Data Warehouse Toolkit (Ralph Kimball 1996) contains
example schemata for 10 different kinds of businesses.
You will need some place to store your data and query parts back out.
Since you are using SQL your only choice is a relational database
management system. There are specialty vendors that have historically
made RDBMSes with enhanced features for data warehousing, such as the
ability to compute a value based on information from the current row
compared to information from a previously output row of the report.
This gets away from the strict unordered set-theoretic way of looking
at the world that E.F. Codd sketched in 1970 but has proven to be
useful. Starting with version 8.1.6, Oracle has added most of the
useful third-party features into their standard product. Thus all but
the very smallest and very largest modern data warehouses tend to be
built using Oracle (see the
"SQL for Analysis" chapter in the Oracle8i
Data Warehousing Guide volume of the Oracle documentation).
Oracle contains two features that may enable you to construct and use
your data warehouse without investing in separate hardware. First is
the multi-version read consistency (versioning) mechanism that Oracle
has employed since the late 1980s. If someone is doing a complex
query, it will not affect
transactions that need to update the same tables. Essentially each
query runs in its own snapshot of the database as it existed when the
query was started. The second Oracle feature is materialized
views or summaries. It is possible to instruct the
database to keep a summary of sales by quarter, for example. If
someone asks for a query involving quarterly sales, the small summary
table will be consulted instead of the comprehensive sales table.
This could be 100 to 1000 times faster.
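A minimal sketch of such a summary, using the Oracle 8i materialized
view syntax and the fact and time dimension tables above (the
optimizer will substitute the summary automatically only if query
rewrite is also enabled for the session or instance):

create materialized view sales_by_quarter
build immediate
refresh complete
enable query rewrite
as
select td.quarter, sum(f.dollar_sales) as dollar_sales
from sales_fact f, time_dimension td
where f.time_key = td.time_key
group by td.quarter;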
One typical goal of a data warehousing project is to provide a unified
view of a company's disparate information systems. The only way to do
this is to extract data from all of these information systems and
clean up those data for consistency and accuracy. This is purportedly
a challenging task when RDBMSes from different vendors are involved,
though it might not seem so on the surface. After all, every RDBMS
comes with a C library. You could write a C program to perform
queries on the Brand X database and do inserts on the Brand Y
database. Perl and Tcl have convenient facilities for transforming
text strings and there are db connectivity interfaces from these
scripting languages to DBMS C libraries. So you could write a Perl
script. Most databases within a firm are accessible via the Web, at
least within a company's internal network. Oracle includes a Java
virtual machine and Java libraries to fetch Web pages and parse XML.
So you could write a Java or PL/SQL program running inside your data
warehouse Oracle installation to grab the foreign information and
bring it back.
If you don't like to program or have a particularly knotty
connectivity problem involving an old mainframe, various companies
make software that can help. For high-end mainframe stuff, Oracle
Corporation itself offers some useful layered products. For low-end
"more-convenient-than-Perl" stuff, Data Junction (www.datajunction.com) is
useful.
Given an already-built data warehouse, there are a variety of useful
query tools. The theory is that if you've organized your data model
well enough, a non-technical user will be able to navigate around via
a graphic user interface or a Web browser.
The best known query tool is Crystal Reports (www.seagatesoftware.com).
The free open-source ArsDigita Community System contains a primitive
HTML-form based query tool for a dimensional data warehouse. See http://www.arsdigita.com/doc/dw
for details.
Is there a bottom line to all of this? If you can think sufficiently
clearly about your organization and its business to construct the
correct dimensions and program SQL reasonably well, you will be
successful with the raw RDBMS. Extra software tools can potentially
make the project a bit less painful or a bit shorter but they won't
be of critical importance.
More Information
The construction of data warehouses is a guild-like activity. Most of
the expert knowledge is contained within firms that specialize not in
data warehousing but in data warehousing for a particular kind of
company. For example, there are firms that do nothing but build data
warehouses for supermarkets. There are firms that do nothing but
build data warehouses for department stores. Part of what keeps this
a tight guild is the poor quality of textbooks and journal articles on
the subject. Most of the books on data warehousing are written by and
for people who do not know SQL. The books focus on (1) stuff that you
can buy from a vendor, (2) stuff that you can do from a graphical user
interface after the data warehouse is complete, and (3) how to
navigate around a large organization to get all the other suits to
agree to give you their data, their money, and a luxurious schedule.
The only worthwhile introductory book that we've found on data warehousing in
general is
Ralph
Kimball's
The Data Warehouse Toolkit. Kimball is
also the author of an inspiring book on clickstream data warehousing:
The
Data Webhouse Toolkit. The latter book is good if you are
interested in applying classical dimensional data warehousing
techniques to user activity analysis.
It isn't exactly a book and it isn't great for beginners but the Oracle8i
Data Warehousing Guide volume of the official Oracle server documentation
is extremely useful.
Data on consumer purchasing behavior are available from A.C. Nielsen (www.acnielsen.com),
Information Resources Incorporated (IRI; www.infores.com), and a bunch of other
companies listed in
http://dir.yahoo.com/Business_and_Economy/Business_to_Business/Marketing_and_Advertising/Market_Research/.
Appendix A: Data Model for Walmart
Here's the data model for our Walmart example (Oracle 8.1.6 syntax):
create table product_categories (
product_category_id integer primary key,
product_category_name varchar(100) not null
);
create table manufacturers (
manufacturer_id integer primary key,
manufacturer_name varchar(100) not null
);
create table products (
product_id integer primary key,
product_name varchar(100) not null,
product_category_id references product_categories,
manufacturer_id references manufacturers
);
create table cities (
city_id integer primary key,
city_name varchar(100) not null,
state varchar(100) not null,
population integer not null
);
create table stores (
store_id integer primary key,
city_id references cities,
store_location varchar(200) not null,
phone_number varchar(20)
);
create table sales (
product_id not null references products,
store_id not null references stores,
quantity_sold integer not null,
-- the Oracle "date" type is precise to the second
-- unlike the ANSI date datatype
date_time_of_sale date not null
);
-- put some data in
insert into product_categories values (1, 'toothpaste');
insert into product_categories values (2, 'soda');
insert into manufacturers values (68, 'Colgate');
insert into manufacturers values (5, 'Coca Cola');
insert into products values (567, 'Colgate Gel Pump 6.4 oz.', 1, 68);
insert into products values (219, 'Diet Coke 12 oz. can', 2, 5);
insert into cities values (34, 'San Francisco', 'California', 700000);
insert into cities values (58, 'East Fishkill', 'New York', 30000);
insert into stores values (16, 34, '510 Main Street', '415-555-1212');
insert into stores values (17, 58, '13 Maple Avenue', '914-555-1212');
insert into sales values (567, 17, 1, to_date('1997-10-22 09:35:14', 'YYYY-MM-DD HH24:MI:SS'));
insert into sales values (219, 16, 4, to_date('1997-10-22 09:35:14', 'YYYY-MM-DD HH24:MI:SS'));
insert into sales values (219, 17, 1, to_date('1997-10-22 09:35:17', 'YYYY-MM-DD HH24:MI:SS'));
-- keep track of which dates are holidays
-- the presence of a date (all dates will be truncated to midnight)
-- in this table indicates that it is a holiday
create table holiday_map (
holiday_date date primary key
);
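-- an illustrative sample holiday row (Christmas 1997), in the style of
-- the other sample data above
insert into holiday_map values (to_date('1997-12-25', 'YYYY-MM-DD'));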
-- where the prices are kept
create table product_prices (
product_id not null references products,
from_date date not null,
price number not null
);
insert into product_prices values (567, to_date('1997-01-01', 'YYYY-MM-DD'), 2.75);
insert into product_prices values (219, to_date('1997-01-01', 'YYYY-MM-DD'), 0.40);
Appendix B: Data Warehouse for Levis Strauss
In 1998, ArsDigita Corporation built a Web service as a front end to
an experimental custom clothing factory operated by Levi Strauss. Users
would visit our site to choose a style of khaki pants, enter their
waist, inseam, height, weight, and shoe size, and finally check out
with their credit card. Our server would attempt to authorize a
charge on the credit card through CyberCash. The factory IT system
would poll our server's Oracle database periodically so that it could
start cutting pants within 10 minutes of a successfully authorized
order.
The whole purpose of the factory and Web service was to test and
analyze consumer reaction to this method of buying clothing.
Therefore, a data warehouse was built into the project almost from the
start.
We did not buy any additional hardware or software to support the data
warehouse. The public Web site was supported by a mid-range
Hewlett-Packard Unix server that had ample leftover capacity to run
the data warehouse. We created a new "dw" Oracle user, GRANTed SELECT
on the OLTP tables to the "dw" user, and wrote procedures to copy all
the data from the OLTP system into a star schema of tables owned by
the "dw" user. For queries, we added an IP address to the machine and
ran a Web server program bound to that second IP address.
Here is how we explained our engineering decisions to our customer
(Levi Strauss):
We employ a standard star join schema for the following reasons:
* Many relational database management systems, including Oracle 8.1,
are heavily optimized to execute queries against these schemata.
* This kind of schema has been proven to scale to the world's
largest data warehouses.
* If we hired a data warehousing nerd off the street, he or she
would have no trouble understanding our schema.
In a star join schema, there is one fact table ("we sold a pair of
khakis at 1:23 pm to Joe Smith") that references a bunch of dimension
tables. As a general rule, if we're going to narrow our interest
based on a column, it should be in the dimension table. I.e., if
we're only looking at sales of grey dressy fabric khakis, we should
expect to accomplish that with WHERE clauses on columns of a product
dimension table. By contrast, if we're going to be aggregating
information with a SUM or AVG command, these data should be stored in
the columns of the fact table. For example, the dollar amount of the
sale should be stored within the fact table. Since we have so few
prices (essentially only one), you might think that this should go in
a dimension. However, by keeping it in the fact table we're more
consistent with traditional data warehouses.
After some discussions with Levi's executives, we designed in the
following dimension tables:
- time
for queries comparing sales by season, quarter, or holiday
- product
for queries comparing sales by color or style
- ship to
for queries comparing sales by region or state
- promotion
for queries aimed at determining the relationship between discounts and sales
- consumer
for queries comparing sales by first-time and repeat buyers
- user experience
for queries looking at returned versus exchanged versus accepted items
(most useful when combined with other dimensions, e.g., was a
particular color more likely to lead to an exchange request)
These dimensions allow us to answer questions such as
- In what regions of the country are pleated pants most popular?
(fact table joined with the product and ship-to dimensions)
- What percentage of pants were bought with coupons and how has that
varied from quarter to quarter?
(fact table joined with the promotion and time dimensions)
- How many pants were sold on holidays versus non-holidays?
(fact table joined with the time dimension)
The Dimension Tables
The time_dimension
table is identical to the example
given above.
create table time_dimension (
time_key integer primary key,
-- just to make it a little easier to work with; this is
-- midnight (TRUNC) of the date in question
oracle_date date not null,
day_of_week varchar(9) not null, -- 'Monday', 'Tuesday'...
day_number_in_month integer not null, -- 1 to 31
day_number_overall integer not null, -- days from the epoch (first day is 1)
week_number_in_year integer not null, -- 1 to 52
week_number_overall integer not null, -- weeks start on Sunday
month integer not null, -- 1 to 12
month_number_overall integer not null,
quarter integer not null, -- 1 to 4
fiscal_period varchar(10),
holiday_flag char(1) default 'f' check (holiday_flag in ('t', 'f')),
weekday_flag char(1) default 'f' check (weekday_flag in ('t', 'f')),
season varchar(50),
event varchar(50)
);
We populated the time_dimension table with a single INSERT statement.
The core work is done by Oracle date formatting functions. A helper
table, integers, is used to supply a series of numbers to add to a
starting date (we picked July 1, 1998, a few days before our first
real order).
-- Uses the integers table to drive the insertion, which just contains
-- a set of integers, from 0 to n.
-- The 'epoch' is hardcoded here as July 1, 1998.
-- d below is the Oracle date of the day we're inserting.
insert into time_dimension
(time_key, oracle_date, day_of_week, day_number_in_month,
day_number_overall, week_number_in_year, week_number_overall,
month, month_number_overall, quarter, weekday_flag)
select n, d, rtrim(to_char(d, 'Day')), to_char(d, 'DD'), n + 1,
to_char(d, 'WW'),
trunc((n + 3) / 7), -- July 1, 1998 was a Wednesday, so +3 to get the week numbers to line up with the week
to_char(d, 'MM'), trunc(months_between(d, to_date('1998-07-01', 'YYYY-MM-DD')) + 1),
to_char(d, 'Q'), decode(to_char(d, 'D'), '1', 'f', '7', 'f', 't')
from (select n, to_date('1998-07-01', 'YYYY-MM-DD') + n as d
from integers);
A bit of Oracle minutia is helpful in understanding this transaction.
If you add a number to an Oracle date, you get another Oracle date.
So adding 3 to "1998-07-01" will yield "1998-07-04".
There are several fields left to be populated that we cannot derive
using Oracle date functions: fiscal period, holiday flag, season, and
event. Fiscal period depended on Levi's choice of fiscal
year. The event
column was set aside for arbitrary blocks
of time that were particularly interesting to the Levi's marketing
team, e.g., a sale period. In practice, it was not used.
To update the holiday_flag
field, we used two helper
tables, one for "fixed" holidays (those which occur on the same day
each year), and one for "floating" holidays (those which move around).
create table fixed_holidays (
month integer not null check (month >= 1 and month <= 12),
day integer not null check (day >= 1 and day <= 31),
name varchar(100) not null,
primary key (month, day)
);
-- Specifies holidays that fall on the Nth DAY_OF_WEEK in MONTH.
-- Negative means count backwards from the end.
create table floating_holidays (
month integer not null check (month >= 1 and month <= 12),
day_of_week varchar(9) not null,
nth integer not null,
name varchar(100) not null,
primary key (month, day_of_week, nth)
);
Some example holidays:
insert into fixed_holidays (name, month, day)
values ('New Year''s Day', 1, 1);
insert into fixed_holidays (name, month, day)
values ('Christmas', 12, 25);
insert into fixed_holidays (name, month, day)
values ('Veteran''s Day', 11, 11);
insert into fixed_holidays (name, month, day)
values ('Independence Day', 7, 4);
insert into floating_holidays (month, day_of_week, nth, name)
values (1, 'Monday', 3, 'Martin Luther King Day');
insert into floating_holidays (month, day_of_week, nth, name)
values (10, 'Monday', 2, 'Columbus Day');
insert into floating_holidays (month, day_of_week, nth, name)
values (11, 'Thursday', 4, 'Thanksgiving');
insert into floating_holidays (month, day_of_week, nth, name)
values (2, 'Monday', 3, 'President''s Day');
insert into floating_holidays (month, day_of_week, nth, name)
values (9, 'Monday', 1, 'Labor Day');
insert into floating_holidays (month, day_of_week, nth, name)
values (5, 'Monday', -1, 'Memorial Day');
An extremely clever person who'd recently read SQL
for Smarties would probably be able to come up with an SQL
statement to update the holiday_flag
in the
time_dimension
rows. However, there is no need to work
your brain that hard. Oracle includes two procedural languages, Java
and PL/SQL. You can simply implement the following pseudocode in the
procedural language of your choice:
foreach row in "select name, month, day from fixed_holidays"
update time_dimension
set holiday_flag = 't'
where month = row.month and day_number_in_month = row.day;
end foreach
foreach row in "select month, day_of_week, nth, name from floating_holidays"
if row.nth > 0 then
# If nth is positive, put together a date range constraint
# to pick out the right week.
ending_day_of_month := row.nth * 7
starting_day_of_month := ending_day_of_month - 6
update time_dimension
set holiday_flag = 't'
where month = row.month
and day_of_week = row.day_of_week
and starting_day_of_month <= day_number_in_month
and day_number_in_month <= ending_day_of_month;
else
# If it is negative, get all the available dates
# and get the nth one from the end.
i := 0;
foreach row2 in "select day_number_in_month from time_dimension
where month = row.month
and day_of_week = row.day_of_week
order by day_number_in_month desc"
i := i - 1;
if i = row.nth then
update time_dimension
set holiday_flag = 't'
where month = row.month
and day_number_in_month = row2.day_number_in_month
break;
end if
end foreach
end if
end foreach
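For concreteness, here is a minimal PL/SQL rendering of the
fixed-holiday half of that pseudocode (a sketch; the floating-holiday
half follows the same pattern with the extra date arithmetic):

begin
    -- mark every date that matches a fixed holiday
    for h in (select month, day from fixed_holidays) loop
        update time_dimension
        set holiday_flag = 't'
        where month = h.month
          and day_number_in_month = h.day;
    end loop;
end;
/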
The product dimension
The product dimension contains one row for each unique combination of
color, style, cuffs, pleats, etc.
create table product_dimension (
product_key integer primary key,
-- right now this will always be "ikhakis"
product_type varchar(20) not null,
-- could be "men", "women", "kids", "unisex adults"
expected_consumers varchar(20),
color varchar(20),
-- "dressy" or "casual"
fabric varchar(20),
-- "cuffed" or "hemmed" for pants
-- null for stuff where it doesn't matter
cuff_state varchar(20),
-- "pleated" or "plain front" for pants
pleat_state varchar(20)
);
To populate this dimension, we created a one-column table for each
field in the dimension table and used a multi-table join without a
WHERE clause. This generates the cartesian product of all the
possible values for each field:
create table t1 (expected_consumers varchar(20));
create table t2 (color varchar(20));
create table t3 (fabric varchar(20));
create table t4 (cuff_state varchar(20));
create table t5 (pleat_state varchar(20));
insert into t1 values ('men');
insert into t1 values ('women');
insert into t1 values ('kids');
insert into t1 values ('unisex adults');
[etc.]
insert into product_dimension
(product_key, product_type, expected_consumers,
color, fabric, cuff_state, pleat_state)
select
product_key_sequence.nextval,
'ikhakis',
t1.expected_consumers,
t2.color,
t3.fabric,
t4.cuff_state,
t5.pleat_state
from t1,t2,t3,t4,t5;
Notice that an Oracle sequence, product_key_sequence, is used to
generate unique integer keys for each row as it is inserted into the
dimension.
The promotion dimension
The art of building the promotion dimension is dividing the world of
coupons into broad categories, e.g., "between 10 and 20 dollars".
This categorization depended on learning that the marketing
executives did not care about the difference between a $3.50 and a
$3.75 coupon.
create table promotion_dimension (
promotion_key integer primary key,
-- can be "coupon" or "no coupon"
coupon_state varchar(20),
-- a text string such as "under $10"
coupon_range varchar(20)
);
The separate coupon_state and coupon_range columns allow sales figures
to be reported broken down into full-price versus discounted, or into
a bunch of rows, one for each range of coupon size.
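Illustrative sample rows (the actual ranges were whatever marketing
asked for):

insert into promotion_dimension (promotion_key, coupon_state, coupon_range)
values (1, 'no coupon', null);
insert into promotion_dimension (promotion_key, coupon_state, coupon_range)
values (2, 'coupon', 'under $10');
insert into promotion_dimension (promotion_key, coupon_state, coupon_range)
values (3, 'coupon', '$10 to $20');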
The consumer dimension
We did not have access to a lot of demographic data about our
customers. We did not have a lot of history since this was a new
service. Consequently, our consumer dimension is extremely simple.
It is used to record whether or not a sale in the fact table was to a
new or a repeat customer.
create table consumer_dimension (
consumer_key integer primary key,
-- 'new customer' or 'repeat customer'
repeat_class varchar(20)
);
The user experience dimension
If we are interested in building a report of the average amount of
time spent contemplating a purchase versus whether the purchase was
ultimately kept, the user_experience_dimension
table
will help.
create table user_experience_dimension (
user_experience_key integer primary key,
-- 'shipped on time', 'shipped late'
on_time_status varchar(20),
-- 'kept', 'returned for exchange', 'returned for refund'
returned_status varchar(30)
);
The ship-to dimension
Classically one of the most powerful dimensions in a data warehouse,
our ship_to_dimension
table allows us to group sales by
region or state.
create table ship_to_dimension (
ship_to_key integer primary key,
-- e.g., Northeast
ship_to_region varchar(30) not null,
ship_to_state char(2) not null
);
create table state_regions (
state char(2) not null primary key,
region varchar(50) not null
);
-- to populate:
insert into ship_to_dimension
(ship_to_key, ship_to_region, ship_to_state)
select ship_to_key_sequence.nextval, region, state
from state_regions;
Notice that we've thrown out an awful lot of detail here. Had this
been a full-scale product for Levi Strauss, they would probably have
wanted at least extra columns for county, city, and zip code. These
columns would allow a regional sales manager to look at sales within a
state.
(In a data warehouse for a manufacturing wholesaler, the ship-to
dimension would contain columns for the customer's company name, the
division of the customer's company that received the items, the sales
district of the salesperson who sold the order, etc.)
The Fact Table
The granularity of our fact table is one order. This is finer-grained
than the canonical Walmart-style data warehouse as presented in
Appendix A, where a fact is the quantity of a particular SKU sold in
one store on one day (i.e., all orders in one day for the same item
are aggregated). We decided that we could afford this because the
conventional wisdom in the data warehousing business in 1998 was that
up to billion-row fact tables were manageable. Our retail price was
$40 and it was tough to foresee a time when the factory could make
more than 1,000 pants per day. So it did not seem extravagant to
budget one row per order.
Given the experimental nature of this project we did not delude
ourselves into thinking that we would get it right the first time.
Since we were recording one row per order we were able to cheat by
including pointers from the data warehouse back into the OLTP
database: order_id and consumer_id. We
never had to use these but it was nice to know that if we couldn't get
a needed answer for the marketing executives the price would have been
some custom SQL coding rather than rebuilding the entire data
warehouse.
create table sales_fact (
-- keys over to the OLTP production database
order_id integer primary key,
consumer_id integer not null,
time_key not null references time_dimension,
product_key not null references product_dimension,
promotion_key not null references promotion_dimension,
consumer_key not null references consumer_dimension,
user_experience_key not null references user_experience_dimension,
ship_to_key not null references ship_to_dimension,
-- time stuff
minutes_login_to_order number,
days_first_invite_to_order number,
days_order_to_shipment number,
-- this will be NULL normally (unless order was returned)
days_shipment_to_intent number,
pants_id integer,
price_charged number,
tax_charged number,
shipping_charged number
);
After defining the fact table, we populated it with a single insert
statement:
-- find_product, find_promotion, find_consumer, and find_user_experience
-- are PL/SQL procedures that return the appropriate key from the dimension
-- tables for a given set of parameters
insert into sales_fact
select o.order_id, o.consumer_id, td.time_key,
find_product(o.color, o.casual_p, o.cuff_p, o.pleat_p),
find_promotion(o.coupon_id),
find_consumer(o.pants_id),
find_user_experience(o.order_state, o.confirmed_date, o.shipped_date),
std.ship_to_key,
minutes_login_to_order(o.order_id, usom.user_session_id),
decode(sign(o.confirmed_date - gt.issue_date), -1, null, round(o.confirmed_date - gt.issue_date, 6)),
round(o.shipped_date - o.confirmed_date, 6),
round(o.intent_date - o.shipped_date, 6),
o.pants_id, o.price_charged, o.tax_charged, o.shipping_charged
from khaki.reportable_orders o, ship_to_dimension std,
khaki.user_session_order_map usom, time_dimension td,
khaki.addresses a, khaki.golden_tickets gt
where o.shipping = a.address_id
and std.ship_to_state = a.usps_abbrev
and o.order_id = usom.order_id(+)
and trunc(o.confirmed_date) = td.oracle_date
and o.consumer_id = gt.consumer_id;
As noted in the comment at top, most of the work here is done by
PL/SQL procedures such as find_product
that dig up the
right row in a dimension table for this particular order.
The preceding insert will load an empty data warehouse from the
on-line transaction processing system's tables. Keeping the data
warehouse up to date with what is happening in OLTP land requires a
similar INSERT with an extra WHERE clause restricting it to orders
whose order ID is larger than the maximum of the order IDs currently
in the warehouse. This is a safe transaction to execute as many times
per day as necessary--even two simultaneous INSERTs would not corrupt
the data warehouse with duplicate rows because of the primary key
constraint on order_id. A daily update is
traditional in the data warehousing world so we scheduled one every 24
hours using the Oracle dbms_job
package
(http://www.oradoc.com/ora816/server.816/a76956/jobq.htm#750).
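Here is a sketch of how that nightly job might be submitted;
load_sales_fact is a hypothetical PL/SQL procedure wrapping the INSERT
above plus the extra restriction "and o.order_id > (select
max(order_id) from sales_fact)":

declare
    v_job number;
begin
    dbms_job.submit(v_job,
                    'load_sales_fact;',  -- hypothetical wrapper around the INSERT
                    sysdate,             -- first run immediately
                    'sysdate + 1');      -- then once every 24 hours
    commit;
end;
/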
Sample Queries
We have (1) defined a star schema, (2) populated the dimension tables,
(3) loaded the fact table, and (4) arranged for periodic updating of
the fact table. Now we can proceed to the interesting part of our
data warehouse: getting information back out.
Using only the sales_fact
table, we can ask for
- the total number of orders, total revenue to date, tax paid,
shipping costs to date, the average price paid for each item sold, and
the average number of days to ship:
select count(*) as n_orders,
round(sum(price_charged)) as total_revenue,
round(sum(tax_charged)) as total_tax,
round(sum(shipping_charged)) as total_shipping,
round(avg(price_charged),2) as avg_price,
round(avg(days_order_to_shipment),2) as avg_days_to_ship
from sales_fact;
- the average number of minutes from login to order (we exclude user
sessions longer than 30 minutes to avoid skewing the results from
people who interrupted their shopping session to go out to lunch or
sleep for a few hours):
select round(avg(minutes_login_to_order), 2)
from sales_fact
where minutes_login_to_order < 30
- the average number of days from first being invited to the site by
email to the first order (excluding periods longer than 2 weeks to
remove outliers):
select round(avg(days_first_invite_to_order), 2)
from sales_fact
where days_first_invite_to_order < 14
Joining against the ship_to_dimension
table lets us ask
how many pants were shipped to each region of the United States:
select ship_to_region, count(*) as n_pants
from sales_fact f, ship_to_dimension s
where f.ship_to_key = s.ship_to_key
group by ship_to_region
order by n_pants desc
| Region | Pants Sold |
|---|---|
| New England Region | 612 |
| NY and NJ Region | 321 |
| Mid Atlantic Region | 318 |
| Western Region | 288 |
| Southeast Region | 282 |
| Southern Region | 193 |
| Great Lakes Region | 177 |
| Northwestern Region | 159 |
| Central Region | 134 |
| North Central Region | 121 |
Note: these data are based on a random subset of orders from the
Levi's site and we have also made manual changes to the report values.
The numbers are here to give you an idea of what these queries do, not
to provide insight into the Levi's custom clothing business.
Joining against the time_dimension
, we can ask how many
pants were sold for each day of the week:
select day_of_week, count(*) as n_pants
from sales_fact f, time_dimension t
where f.time_key = t.time_key
group by day_of_week
order by n_pants desc
| Day of Week | Pants Sold |
|---|---|
| Thursday | 3428 |
| Wednesday | 2823 |
| Tuesday | 2780 |
| Monday | 2571 |
| Friday | 2499 |
| Saturday | 1165 |
| Sunday | 814 |
We were able to make pants with either a "dressy" or "casual" fabric.
Joining against the product_dimension
table can tell us
how popular each option was as a function of color:
select color, count(*) as n_pants, sum(decode(fabric,'dressy',1,0)) as n_dressy
from sales_fact f, product_dimension p
where f.product_key = p.product_key
group by color
order by n_pants desc
| Color | Pants Sold | % Dressy |
|---|---|---|
| dark tan | 486 | 100 |
| light tan | 305 | 49 |
| dark grey | 243 | 100 |
| black | 225 | 97 |
| navy blue | 218 | 61 |
| medium tan | 209 | 0 |
| olive green | 179 | 63 |
Note: 100% and 0% indicate that those colors were available only in
one fabric.
Here is a good case of how the data warehouse may lead to a practical
result. If these were the real numbers from the Levi's warehouse,
what would pop out at the manufacturing guys is that 97% of the black
pants sold were in one fabric style. It might not make sense to keep
an inventory of casual black fabric if there is so little consumer
demand for it.
Query Generation: The Commercial Closed-Source Route
The promise of a data warehouse is not fulfilled if all users must
learn SQL syntax and how to run SQL*PLUS. From 10
years of exposure to advertising for query tools, we decided that the state of
forms-based query tools must be truly advanced. We thus suggested to
Levi Strauss that they use Seagate Crystal Reports and Crystal Info to
analyze their data. These packaged tools, however, ended up not
fitting very well with what Levi's wanted to accomplish. First,
constructing queries was not semantically simpler than coding SQL.
The Crystal Reports consultant that we brought in said that most of
his clients ended up having a programmer set up the report queries and
the business people would simply run the report every day against new
data. If professional programmers had to construct queries, it seemed
just as easy to write more admin pages using our standard Web
development tools, which required about 15 minutes per page. Second,
it was impossible to ensure availability of data warehouse queries to
authorized users anywhere on the Internet. Finally there were
security and social issues associated with allowing a SQL*Net
connection from a Windows machine running Crystal Reports out through
the Levi's firewall to our Oracle data warehouse on the Web.
Not knowing if any other commercial product would work better and not
wanting to disappoint our customer, we extended the ArsDigita
Community System with a data warehouse query module that runs as a
Web-only tool. This is a free open-source system and comes with the
standard ACS package that you can download from http://www.arsdigita.com/download/.
Query Generation: The Open-Source ACS Route
The "dw" module in the ArsDigita Community System is designed with the
following goals:
- naive users can build simple queries by themselves
- professional programmers can step in to help out the naive users
- a user with no skill can re-execute a saved query
We keep one row per query in the queries
table:
create table queries (
query_id integer primary key,
query_name varchar(100) not null,
query_owner not null references users,
definition_time date not null,
-- if this is non-null, we just forget about all the query_columns
-- stuff; the user has hand-edited the SQL
query_sql varchar(4000)
);
Unless the query_sql
column is populated with a
hand-edited query, the query will be built up by looking at several
rows in the query_columns
table:
-- this specifies the columns we will be using in a query and
-- what to do with each one, e.g., "select_and_group_by" or
-- "select_and_aggregate"
-- "restrict_by" is tricky; value1 contains the restriction value, e.g., '40'
-- or 'MA' and value2 contains the SQL comparison operator, e.g., "=" or ">"
create table query_columns (
query_id not null references queries,
column_name varchar(30),
pretty_name varchar(50),
what_to_do varchar(30),
-- meaning depends on value of what_to_do
value1 varchar(4000),
value2 varchar(4000)
);
create index query_columns_idx on query_columns(query_id);
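For concreteness, a hypothetical saved query "pants sold per color" (query_id 42, invented for this example, with a corresponding row already in queries and assuming the aggregate function name is stored in value1) would be represented by two rows:
insert into query_columns (query_id, column_name, pretty_name, what_to_do)
values (42, 'color', 'Color', 'select_and_group_by');

insert into query_columns (query_id, column_name, pretty_name, what_to_do, value1)
values (42, 'pants_id', 'Pants Sold', 'select_and_aggregate', 'count');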
The query_columns definition appears strange at first.
It specifies the name of a column but not a table. This module is
predicated on the simplifying assumption that we have one enormous
view, ad_hoc_query_view, that contains all the dimension
tables' columns alongside the fact table's columns.
Here is how we create the view for the Levi's data warehouse:
create or replace view ad_hoc_query_view
as
select minutes_login_to_order, days_first_invite_to_order,
days_order_to_shipment, days_shipment_to_intent, pants_id,
price_charged, tax_charged, shipping_charged,
oracle_date, day_of_week,
day_number_in_month, week_number_in_year, week_number_overall,
month, month_number_overall, quarter, fiscal_period,
holiday_flag, weekday_flag, season, color, fabric, cuff_state,
pleat_state, coupon_state, coupon_range, repeat_class,
on_time_status, returned_status, ship_to_region, ship_to_state
from sales_fact f, time_dimension t, product_dimension p,
promotion_dimension pr, consumer_dimension c,
user_experience_dimension u, ship_to_dimension s
where f.time_key = t.time_key
and f.product_key = p.product_key
and f.promotion_key = pr.promotion_key
and f.consumer_key = c.consumer_key
and f.user_experience_key = u.user_experience_key
and f.ship_to_key = s.ship_to_key;
At first glance, this looks like a passport to sluggish Oracle
performance. We'll be doing a seven-way JOIN for every data warehouse
query, even when we need information from only one or two of the
dimension tables.
We can test this assumption as follows:
-- tell SQL*Plus to turn on query tracing
set autotrace on
-- let's look at how many pants of each color
-- were sold in each region
SELECT ship_to_region, color, count(pants_id)
FROM ad_hoc_query_view
GROUP BY ship_to_region, color;
Oracle will return the query results first...
ship_to_region | color | count(pants_id) |
Central Region | black | 46 |
Central Region | dark grey | 23 |
Central Region | dark tan | 39 |
.. |
Western Region | medium tan | 223 |
Western Region | navy blue | 245 |
Western Region | olive green | 212 |
... and then explain how those results were obtained:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=181 Card=15 Bytes=2430)
1 0 SORT (GROUP BY) (Cost=181 Card=15 Bytes=2430)
2 1 NESTED LOOPS (Cost=12 Card=2894 Bytes=468828)
3 2 HASH JOIN (Cost=12 Card=885 Bytes=131865)
4 3 TABLE ACCESS (FULL) OF 'PRODUCT_DIMENSION' (Cost=1 Card=336 Bytes=8400)
5 3 HASH JOIN (Cost=6 Card=885 Bytes=109740)
6 5 TABLE ACCESS (FULL) OF 'SHIP_TO_DIMENSION' (Cost=1 Card=55 Bytes=1485)
7 5 NESTED LOOPS (Cost=3 Card=885 Bytes=85845)
8 7 NESTED LOOPS (Cost=3 Card=1079 Bytes=90636)
9 8 NESTED LOOPS (Cost=3 Card=1316 Bytes=93436)
10 9 TABLE ACCESS (FULL) OF 'SALES_FACT' (Cost=3 Card=1605 Bytes=93090)
11 9 INDEX (UNIQUE SCAN) OF 'SYS_C0016416' (UNIQUE)
12 8 INDEX (UNIQUE SCAN) OF 'SYS_C0016394' (UNIQUE)
13 7 INDEX (UNIQUE SCAN) OF 'SYS_C0016450' (UNIQUE)
14 2 INDEX (UNIQUE SCAN) OF 'SYS_C0016447' (UNIQUE)
As you can see from the table names in the execution plan, Oracle was smart
enough to examine only the tables relevant to our query:
product_dimension, because we asked about color;
ship_to_dimension, because we asked about region; and
sales_fact, because we asked for a count of pants sold.
Bottom line: Oracle did a 3-way JOIN instead of the 7-way JOIN
specified by the view.
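Incidentally, the SYS_C... names in the plan are system-generated names, most likely the dimension tables' primary-key indexes. If you are curious which table each belongs to, Oracle's data dictionary will tell you; a quick sketch, with constraint names that will of course differ on any other installation:
select constraint_name, table_name, constraint_type
from user_constraints
where constraint_name in ('SYS_C0016394', 'SYS_C0016416', 'SYS_C0016447', 'SYS_C0016450');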
Generating an SQL query against ad_hoc_query_view from the
information stored in query_columns is most easily done
with a function in a procedural language such as Java, PL/SQL, Perl,
or Tcl (here is pseudocode):
proc generate_sql_for_query(a_query_id)
    select_list_items list;
    group_by_items list;
    where_clauses list;
    order_clauses list;
    foreach row in "select column_name, pretty_name
                    from query_columns
                    where query_id = a_query_id
                    and what_to_do = 'select_and_group_by'"
        -- grouped columns go into both the select list and the GROUP BY clause
        if row.pretty_name is null then
            append_to_list(select_list_items, row.column_name)
        else
            append_to_list(select_list_items, row.column_name || ' as "' || row.pretty_name || '"')
        end if
        append_to_list(group_by_items, row.column_name)
    end foreach
    foreach row in "select column_name, pretty_name, value1
                    from query_columns
                    where query_id = a_query_id
                    and what_to_do = 'select_and_aggregate'"
        -- value1 holds the aggregate function name, e.g., 'count' or 'sum'
        if row.pretty_name is null then
            append_to_list(select_list_items, row.value1 || '(' || row.column_name || ')')
        else
            append_to_list(select_list_items, row.value1 || '(' || row.column_name || ') as "' || row.pretty_name || '"')
        end if
    end foreach
    foreach row in "select column_name, value1, value2
                    from query_columns
                    where query_id = a_query_id
                    and what_to_do = 'restrict_by'"
        -- value2 is the comparison operator; value1 is the (already quoted) literal
        append_to_list(where_clauses, row.column_name || ' ' || row.value2 || ' ' || row.value1)
    end foreach
    foreach row in "select column_name
                    from query_columns
                    where query_id = a_query_id
                    and what_to_do = 'order_by'"
        append_to_list(order_clauses, row.column_name)
    end foreach
    sql := "SELECT " || join(select_list_items, ', ') ||
           " FROM ad_hoc_query_view"
    if list_length(where_clauses) > 0 then
        append(sql, ' WHERE ' || join(where_clauses, ' AND '))
    end if
    if list_length(group_by_items) > 0 then
        append(sql, ' GROUP BY ' || join(group_by_items, ', '))
    end if
    if list_length(order_clauses) > 0 then
        append(sql, ' ORDER BY ' || join(order_clauses, ', '))
    end if
    return sql
end proc
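Fed the hypothetical query_id 42 rows shown earlier, this procedure would return roughly
SELECT color as "Color", count(pants_id) as "Pants Sold" FROM ad_hoc_query_view GROUP BY color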
How well does this work in practice? Suppose that we were going to
run regional advertisements. Should the models be pictured with
pleated or plain front pants? We need to look at recent sales by
region. With the ACS query tool, a user can use HTML forms to specify
the following:
- pants_id : select and aggregate using count
- ship_to_region : select and group by
- pleat_state : select and group by
The preceding pseudocode turns that into
SELECT ship_to_region, pleat_state, count(pants_id)
FROM ad_hoc_query_view
GROUP BY ship_to_region, pleat_state
which is going to report sales going back to the dawn of time. If we
weren't clever enough to anticipate the need for time windowing in our
forms-based interface, the "hand edit the SQL" option saves us. A
professional programmer can be grabbed for a few minutes to add a date
restriction:
SELECT ship_to_region, pleat_state, count(pants_id)
FROM ad_hoc_query_view
WHERE oracle_date > sysdate - 45
GROUP BY ship_to_region, pleat_state
Now we're limiting results to the last 45 days:
ship_to_region | pleat_state | count(pants_id) |
Central Region | plain front | 8 |
Central Region | pleated | 26 |
Great Lakes Region | plain front | 14 |
Great Lakes Region | pleated | 63 |
Mid Atlantic Region | plain front | 56 |
Mid Atlantic Region | pleated | 162 |
NY and NJ Region | plain front | 62 |
NY and NJ Region | pleated | 159 |
New England Region | plain front | 173 |
New England Region | pleated | 339 |
North Central Region | plain front | 7 |
North Central Region | pleated | 14 |
Northwestern Region | plain front | 20 |
Northwestern Region | pleated | 39 |
Southeast Region | plain front | 51 |
Southeast Region | pleated | 131 |
Southern Region | plain front | 13 |
Southern Region | pleated | 80 |
Western Region | plain front | 68 |
Western Region | pleated | 120 |
If we strain our eyes and brains a bit, we can see that plain front
pants are very unpopular in the Great Lakes and Southern regions but more
popular in New England and the West. It would be nicer to see percentages
within each region, but standard SQL does not make it easy to relate a
row's result to values in surrounding rows. We will need to refer to the
"SQL for Analysis" chapter in the Oracle data warehousing
documentation to read up on the extensions to SQL that make this possible:
SELECT
ship_to_region,
pleat_state,
count(pants_id),
ratio_to_report(count(pants_id))
over (partition by ship_to_region) as percent_in_region
FROM ad_hoc_query_view
WHERE oracle_date > sysdate - 45
GROUP BY ship_to_region, pleat_state
We've asked Oracle to window the results ("partition by
ship_to_region") and compare the number of pants in each row to the
sum across all the rows within a regional group. Here's the result:
ship_to_region | pleat_state | count(pants_id) | percent_in_region |
... |
Great Lakes Region | plain front | 14 | .181818182 |
Great Lakes Region | pleated | 63 | .818181818 |
... |
New England Region | plain front | 173 | .337890625 |
New England Region | pleated | 339 | .662109375 |
... |
This isn't quite what we want. The "percents" are fractions of 1,
reported with far too much precision. We tried inserting the Oracle
built-in round function in various places in this SQL
statement, but all we got for our trouble was "ERROR at line 5:
ORA-30484: missing window specification for this function". We had to
add an extra layer of SELECT, a view-on-the-fly, to get the report
that we wanted:
select ship_to_region, pleat_state, n_pants, round(percent_in_region*100)
from
(SELECT
ship_to_region,
pleat_state,
count(pants_id) as n_pants,
ratio_to_report(count(pants_id))
over (partition by ship_to_region) as percent_in_region
FROM ad_hoc_query_view
WHERE oracle_date > sysdate - 45
GROUP BY ship_to_region, pleat_state)
returns
ship_to_region | pleat_state | n_pants | percent_in_region |
... |
Great Lakes Region | plain front | 14 | 18 |
Great Lakes Region | pleated | 63 | 82 |
... |
New England Region | plain front | 173 | 34 |
New England Region | pleated | 339 | 66 |
... |
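If you are stuck on an Oracle too old for analytic functions, or on another RDBMS entirely, the same report can be produced in plain SQL by joining per-region/pleat counts to per-region totals. A sketch against the same view, at the cost of scanning it twice:
select d.ship_to_region, d.pleat_state, d.n_pants,
       round(100 * d.n_pants / t.region_total) as percent_in_region
from (select ship_to_region, pleat_state, count(pants_id) as n_pants
      from ad_hoc_query_view
      where oracle_date > sysdate - 45
      group by ship_to_region, pleat_state) d,
     (select ship_to_region, count(pants_id) as region_total
      from ad_hoc_query_view
      where oracle_date > sysdate - 45
      group by ship_to_region) t
where d.ship_to_region = t.ship_to_region
order by d.ship_to_region, d.pleat_state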
asj-editors@arsdigita.com
Reader's Comments
There actually are some fairly good data warehouse tutorials on the net. But I'd rather read a good book (like philg's, for example), so go out and buy Ralph Kimball's. He's the one who really understood that a *big* business opportunity existed where Oracle and DB2 (not to mention older "legacy" databases) feared to tread -- producing sales reports. Actually his stuff is good and readable and full of useful examples and even a little mini-warehouse system he wrote in Microsoft Access (a feat worth a medal in itself).
I went to a technical presentation at the Oregon Graduate Institute about a year ago -- their public seminars used to be on things like the internals of Chorus (an early-90s French variant of Mach) or n-dimensional object-oriented whoozis or whatever. Now that it's the Era of the Net, of course, these talks have turned into thinly disguised marketing pitches.
And at this one, the pitch was for a local company developing a "data cube" front-end analysis tool for the aforementioned Moby Data Warehouses, like Kimball's Red Brick or the equivalent in Oracle or what-have-you. And the example the guy used about how great their tool is was that Walmart has used data warehouses to figure out that their sales of diapers and beer go up at 5 pm, so they put them in the display areas at the front of the store. Why? Because dad goes home and gets a call from mom on the cell phone to remind him about picking up the diapers, so of course he then picks up a six pack at the same time. And this takes $20 million and a cafeteria full of DBAs, sysadmins, financial specialists, data extracting/mining/tuning/cleaning/ignoring specialists and so forth to figure out.
I'm being a bit unkind about the corporate value of data warehouses. The presenter also, after vigorous questioning, admitted *another* result of data warehouses. It seems that corporate America has never been able to relate the selling price of a retail item to its production cost. I know that is a big shock -- it was to me -- but it's true. In other words, if you have an EEE-wide shoe size, you can find it in the big stores because they know people require all kinds of shoe sizes and stock accordingly, even though some sizes move slowly or not at all.
Now come the data warehouse reports to inform regional sales managers that the marginal cost of producing those EEEs is higher than the regular ones, in fact it cuts significantly into the overall margins for shoes. So . . . they stop carrying your size and you have to go elsewhere at a much increased direct and indirect cost in time, car fumes and annoyance, not to mention the much larger lifetime sales loss to Big Mall Box, Inc. because they lost you as a customer forever.
This reminds me of the other efficient market created by computers that we know about -- the airline reservation racket, where highly-advanced theorists and programmers have created the yield management system to maximize airline ticket profits. This allows me to sit next to you and pay $400 for a round trip that cost you $1200, even though we bought our seats less than 24 hours apart. It has worked so well that the airlines are now being impelled to force small travel agents -- the bread and butter providers of business travelers and the backbone of their cash flow -- out of business. You say you prefer to book your flights with your friendly local travel agent? Sorry, go with Mega Agency or book 'em online . . . if you dare.
The moral of the story: no doubt data warehouses and yield management systems have their place, but never forget what their place is, especially when the well-dressed sales-, er, "corporate executive" is pitching you the latest Pet Rock, er, data cube presentation technology.
-- Fred Heutte, November 2, 1997
Only a couple days after my last comment, I was paging through the October 27 issue of Information Week and came across a Tandem ad, running double-truck across pages 54-55:
[First, you have to imagine a large thirty-ish guy standing in a doorway, wearing a diaper. Ok, got that?]
"AT 6:32 PM EVERY WEDNESDAY, OWEN BLY BUYS DIAPERS AND BEER. DO NOT JUDGE OWEN. ACCOMMODATE HIM."
"If a data mining query discovers that between 6 and 8pm men buy diapers and beer, chances are you'll see more diapers and beer. It's with this kind of valuable -- and sometimes odd -- information that Tandem is helping people in retail, banking, telecommunications and insurance uncover business opportunities."
This also uncovers a recurring theme in modern data processing, one that changed the traditional acronym of GIGO from "garbage in/garbage out" to "garbage in/gospel out".
http://www.journalism.wisc.edu/jargon/jargon_21.html
-- Fred Heutte, November 6, 1997
Whenever I see someone trying to explain or justify data warehousing, almost inevitably the Parable of the Beer and Diapers is brought up. It has entered the apocrypha of the data warehousing community as the single compelling example of the utility of data warehousing. In every case, it is attributed to a different corporation, often to some nameless "large retail store." It has all the markings of an Urban Legend, like the one about the guy who wakes up in a hotel in a foreign city in a bathtub full of ice, a splitting pain in his side, and a note written on the mirror saying "Call 911," his kidneys having been removed by organ bootleggers. Does anyone know if the Beer and Diapers story actually happened, or if some enterprising Tom Vu of the data warehousing world made it up?
-- Jin Choi, February 2, 1999
The diapers and beer example was created to illustrate the potential of data mining. It probably was never really found in any data. Its creation has been attributed to "Tom Blishok who ran a retail consulting group for NCR" sometime around 1992.
Check: http://www.kdnuggets.com/news/2000/n13/23i.html and http://www.kdnuggets.com/news/2000/n14/8i.html for a more detailed discussion.
-- Doug McIver, September 27, 2000
Philip Greenspun's article is really good. However, data warehousing does not have to be that expensive.
In the year 2000 I created a data warehouse system for EPM (Enseval Putera Megatrading) using Visual Basic 6, SQL Server 7, DataDynamics DynamiCube (DynamiCube is the front end for the data warehouse), and ActiveReport. For reference, EPM is a distributor for Kalbe Farma (the largest pharmaceutical group of companies in Indonesia).
The software development fee I charged was very low, under US$20,000. The hardware is an HP server. The whole system, including software from Microsoft, cost less than US$50,000.
The system is now in production and they are able to analyze 5 years' worth of data (5 million transactions).
Of course this system is small compared to the Wal-Mart or Levi Strauss systems. However, take a look at the price I charged for software development. Data warehousing does not have to be expensive or hard to build.
Philip's trouble with percentages in Oracle made me laugh, since the ability to display values as percentages is a standard feature of DynamiCube.
-- Samuel Franklyn, November 22, 2000
The article was really good, but I think it would have been better if Philip hadn't used any names (Sybase, WalMart, etc.).
-- Arun Shanmugam, April 25, 2001
Errr, actually, Wal*Mart has a datawarehouse to track all of the sales in all of its stores. And it doesn't run on Sybase, or Oracle or DB2. It isn't kept on a mainframe. It runs on NCR's Teradata database. And it is really a mind bogglingly large installation. I worked with it back in the early nineties when it kept stuff on an item by store by week basis (the days were kept as columns in the weekly row). At that time they had something on the order of four billion rows online for 18 months of data in 4 terabytes. Last I heard (3 years ago) they had 24 terabytes. Also now they have an identical backup system (in Tulsa I believe) for disaster recovery. And now they have some large fraction of this data kept (in a different system) at the item by store, by (register) scan level of detail for more sophisticated data mining.
-- Charles Eubanks, September 24, 2001