An Airline Data Warehouse Bus Template
Data Marts on Demand

Steve Elkins
Business Intelligence Practice Manager
Technium, Inc.
December, 1998

Current Airline Practice

At present, general management reporting in the marketing, planning and finance departments of most major airlines leaves much to be desired. The typical reporting environment consists of a mixture of two types of architectures. In the first, data is held in flat file databases (or proprietary SAS or FOCUS data structures). Programmers or departmental "power users" write reports to retrieve information from these databases using "4GL" programming languages like SAS or FOCUS. For the most part, report output is in the form of hardcopy paper reports. "Ad hoc" reporting consists of asking the departmental power user to write a new custom report, which may take several days to produce.

The second environment is based upon the normalized relational databases that were the product of the first generation of airline "data warehouses". In this environment, non-technical managers are limited to executing pre-defined queries that have been programmed in advance by experienced SQL programmers using a "managed query environment" type user interface. Non-technical users are still denied access to the database for true "ad hoc" reporting because the design of the database makes it too easy for them to either: a) compose a "query from hell" that runs forever and brings the computer to its knees, or b) compose a query in such a way that the wrong answer is unknowingly retrieved. Average query response times range from minutes to hours, preventing managers from maintaining their train of thought as they analyze the data.

These two basic foundation architectures are supplemented by the widespread use of Excel and Lotus 1-2-3 spreadsheets. Data from various sources is tediously entered into numerous standalone spreadsheets, by hand, to conduct planning analysis such as macro level demand and yield forecasting, or to collect budgeting data.

At some airlines, the use of "desktop OLAP" tools has begun to proliferate as end users attempt to circumvent the limitations of the basic architecture on their own, without IT assistance. In these cases, the departmental power users now write programs against the flat files and/or normalized relational databases to stage data for loading into small-scale multi-dimensional databases that reside on each manager's desktop PC. While this represents an improvement over the production of hardcopy reports in terms of the quality and flexibility of the analysis that the manager can conduct, desktop OLAP tools lack the scalability required to conduct many types of analysis. As desktop "cubes" proliferate, their maintenance becomes an IT nightmare. In addition, each desktop database represents an independent version of the truth according to its owner, with definitions and terminology that are unique to that desktop. Unproductive arguments result when different executives each bring their own reports to a meeting and the numbers don't match up because of differences in data sources and definitions.

The use of dimensional database designs and scalable OLAP reporting tools is rare. In those cases where they have been adopted, it is invariably at the departmental level, on a project by project basis. As a result, there are inconsistencies in definitions and terminology across the departmental databases that make it difficult to cross-utilize data from different applications and sources without confusion. Departmental databases usually have their own unique front-end presentations, which makes them inaccessible to potential users in other departments. This is even more true in the case of packaged decision support applications, such as yield management systems, that have their own proprietary data structures and user interfaces.

Business Intelligence Goals

In an ideal environment, all managers, even non-technical managers, would be able to compose their own requests for information, as and when necessary, from a complete library of all available data sources, without programmer assistance. Information, including responses to follow-up queries, would be returned to their desktops in seconds so that managers would be able to maintain their train of thought as they conducted interactive analysis to diagnose problems and identify opportunities. Data definitions and terminology would be consistent throughout the "library": there would be no confusion about the interpretation of data and no arguments among executives about whose numbers are right.

The data warehousing profession has finally progressed to the point where it is possible to deliver on this promise through careful, but not protracted, advance planning and the use of appropriate software tools. In this white paper, I will outline the basic architecture required for an airline to achieve these objectives and provide simple examples of how this architecture would be implemented.

Enabling Concepts

Airlines have been attempting, with very little success, to achieve these objectives for a number of years. What has changed is the now widespread recognition that all business queries are essentially of the form:

"I want to see such-and-such a statistical measure, arranged by this, by that or the other business perspective, or dimension."

For example,

"I want to see Passenger Journeys, by Month, by Journey Origin & Destination and by Point of Sale."

The recognition that business analysis is essentially dimensional in nature has, in turn, spawned the development of new database technologies and designs that are built specifically to answer this style of query. Earlier data warehousing attempts often failed because they employed outdated database technologies, normalized database designs and development methodologies that were originally developed to build operational transaction processing systems such as reservations systems and accounting systems. The business and technical requirements of operational and analytical applications are fundamentally different and these "OLTP" technologies and approaches turned out to be badly suited to business intelligence applications.

The Airline Data Warehouse Bus Template is a "data warehouse" blueprint that would explicitly allow all planning managers and analysts to use this new technology to analyze all of the relevant statistical measures (e.g., Passenger Segments, RPKs, Revenue, etc.) available from all of the various sources of data (Reservations Inventory and PNR records, Revenue Accounting, General Ledger, MIDT, etc.) according to a full range of consistently defined business dimensions (e.g., Date, City Pair, Cabin/Class, Point-of-Sale, etc.).

In addition, by using a combination of relational and multidimensional "OLAP" technology with both graphical and spreadsheet user interfaces, dependence on paper reports will be reduced and the tedious practice of manually entering data into spreadsheets in order to conduct analysis will be eliminated, while the considerable expertise in spreadsheet technology that exists at most airlines will continue to be leveraged.

The Data Warehouse Bus

In order to achieve our stated objectives, the design of the "data warehouse" must be based upon "conformed" definitions of business dimensions and consistent definitions of statistical measures that apply regardless of the data source. Something as minor as using the "Day-of-Week" code "1" in one database, "Mo" in another and "Monday" in a third can render the task of combining data from these sources into a single analysis problematic. More serious is the common situation where three executives enter a meeting, each holding a report purporting to show counts of "Passengers" for the same period and market, except that none of the numbers match. The first 30 minutes of the meeting are then spent arguing about whose numbers are right when, in fact, they are all right. The first executive's reports show the correct count of all Positive Space Passenger Legs extracted from Reservation System Inventory records; the second's show counts of Total Passenger Boardings (segments) from the Departure Control System; while the third's show the number of Revenue Passenger Journeys from the Revenue Accounting System.

Organizing the data into a corporate data warehouse catalogue where consistent business dimension definitions and field naming conventions (for both codes and descriptions) are enforced, and where all derived measures are calculated according to consistent, pre-defined business rules, will prevent these kinds of problems and misunderstandings from occurring. For example, in this framework, the "Day-of-Week" code for Monday should always be "1" and its description, or alias, should always be "Monday". The derived measure of "Yield" should be defined in advance as either "Passenger Revenue divided by Revenue Passenger Kilometers" or as "Passenger Revenue divided by Total Passenger Kilometers" and then used consistently in that way in all databases and applications from that day forward. The measure "Passengers" should not appear anywhere in the warehouse. It should always be "Passenger Legs (Res)" or "Passenger Journeys (RevAcctng)", etc. These naming conventions and business rules are part of the "metadata" (data definitions) stored in the data warehouse metadata catalogue, or repository. In his new book, The Data Warehouse Lifecycle Toolkit, data warehousing expert Ralph Kimball has coined the term "Data Warehouse Bus Architecture" to describe this framework. (See Attachment A.)
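
To make the idea concrete, here is a minimal SQL sketch of what such a conformed "Date" dimension might look like. The table and field names are hypothetical, invented for illustration rather than taken from the template itself:

    CREATE TABLE dim_date (
        date_key        INTEGER PRIMARY KEY,   -- the code, e.g., 19981123
        date_desc       VARCHAR(20),           -- the alias, e.g., '23 November 1998'
        day_of_week_cd  CHAR(1),               -- always '1' for Monday, in every database
        day_of_week_nm  VARCHAR(9),            -- always 'Monday', in every database
        week_no         INTEGER,
        month_nm        VARCHAR(9),            -- e.g., 'November'
        quarter_nm      CHAR(2),               -- e.g., 'Q4'
        year_no         INTEGER,
        fiscal_year_no  INTEGER,
        holiday_flag    CHAR(1)                -- 'Y' or 'N'
    );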

Technical Architecture

Delivering on these promises will involve a multi-tiered client-server technical architecture that will include both relational and multidimensional databases and both spreadsheet and graphical user interfaces.

At the bottom of the architecture lie the staging tables, which hold the raw data in the original form in which it is first extracted from the source operational systems. In the airline industry, most source data comes in the form of flat files extracted from legacy mainframe systems such as a reservations system. There will be one staging table created for each major source of data (e.g., one holding reservations system PNR data, one holding revenue accounting system coupon data, etc.) at each level of granularity at which it is collected (e.g., one for leg level inventory data from the res system and another for segment level inventory data).
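
As an illustration, a staging table that mirrors the record layout of a leg level inventory flat-file extract might be declared as follows (a sketch only; the field names are hypothetical):

    CREATE TABLE stg_res_leg_inventory (
        -- raw fields exactly as they arrive in the mainframe flat-file extract
        airline_cd        CHAR(2),
        flight_no         CHAR(4),
        leg_orig_airport  CHAR(3),
        leg_dest_airport  CHAR(3),
        leg_dep_date      CHAR(8),   -- YYYYMMDD, still plain text at this stage
        tariff_class_cd   CHAR(1),
        days_in_advance   INTEGER,   -- data collection point
        pax_legs          INTEGER,
        rev_pax_legs      INTEGER,
        seats             INTEGER
    );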

From the staging tables, the data is then transformed into "star schema" relational databases that reflect the need to conduct dimensional business analysis. In our "data warehouse", there will be one star schema relational "data supermart" for each major data source's staging table. In a star schema relational database, a central "fact table" containing the statistical measures, or "facts", to be analyzed is linked to surrounding "dimension tables" that hold the "attributes" of a business perspective, or dimension. For example, the fact table holding leg level reservations system inventory data would have a field for each of the facts, or measures (e.g., passenger legs, seats, etc.), and fields with links to the relevant business dimensions (e.g., tariff booking class (Class Dimension), leg departure date (Departure Date Dimension), data collection point (Days in Advance Dimension), and airline/flight number/leg origin city/leg destination city (Leg Dimension)). The linked dimension table for the business perspective "Departure Date" would have fields holding both the hierarchical attributes of a date, i.e., the natural levels of summarization (e.g., Month, Quarter, Year or, alternately, Week or Fiscal Year, etc.), and the characteristic or descriptive attributes of each date (e.g., Day-of-Week or Holiday?, etc.). It would hold both the codes for the dimension (e.g., 19981123) and their descriptions, or aliases (e.g., 23 November 1998).
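
A minimal SQL sketch of this star follows. It reuses the conformed dim_date table sketched earlier (here doing duty as the Departure Date dimension), stubs out the remaining dimension tables, and keeps all names illustrative:

    -- Stub dimension tables for the other business perspectives.
    CREATE TABLE dim_class (
        class_key  INTEGER PRIMARY KEY,
        class_cd   CHAR(1),       -- tariff booking class, e.g., 'Y'
        cabin_cd   CHAR(1)        -- cabin that the class maps to
    );

    CREATE TABLE dim_days_in_advance (
        days_adv_key     INTEGER PRIMARY KEY,
        days_in_advance  INTEGER  -- data collection point, days before departure
    );

    CREATE TABLE dim_leg (
        leg_key    INTEGER PRIMARY KEY,
        airline_cd CHAR(2),
        flight_no  CHAR(4),
        leg_orig   CHAR(3),       -- leg origin airport
        leg_dest   CHAR(3),       -- leg destination airport
        leg_od_cd  CHAR(7)        -- origin-destination pair, e.g., 'SCL-MIA'
    );

    -- The central fact table: one field per measure, one foreign key per dimension.
    CREATE TABLE fact_res_leg_inventory (
        dep_date_key  INTEGER NOT NULL REFERENCES dim_date,
        class_key     INTEGER NOT NULL REFERENCES dim_class,
        days_adv_key  INTEGER NOT NULL REFERENCES dim_days_in_advance,
        leg_key       INTEGER NOT NULL REFERENCES dim_leg,
        pax_legs      INTEGER,    -- measure: passenger legs booked
        rev_pax_legs  INTEGER,    -- measure: revenue passenger legs
        seats         INTEGER     -- measure: seats offered
    );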

The movement of data from the staging tables to the star schema relational databases should be made using an automated Extract, Mapping and Transformation tool that is able to capture and store all of the "metadata" defining our dimensions, fields and derived measures in a format that is accessible to the tools that we will use to deliver data to our managers. The EMT tool's metadata repository is needed to provide a single "official" source for all definitions and terminology throughout the warehouse. This repository must be seamlessly accessible to a wide range of tools used to deliver information to manager desktops. The EMT tool should largely automate the process of updating and synchronizing the tables in the data warehouse bus.

To query the star schema data supermart holding data from the reservation system leg inventory record source for the total of all "Revenue Passenger Legs" for, say, the month of November, 1998, you would select "November" from a list of all of the months contained in the "Month" field in the "Date" dimension table, and "Revenue Passenger Legs" from a list of all of the measures in the reservation system Leg Inventory fact table. The database would then find all of the dates that had the attribute of "November" in the "Month" field in the "Date" dimension table, and then would link to the fact table via the matching dates to find and sum the measure "Revenue Passenger Legs" for all of the records in the fact table that had one of those dates in the "Date" field.
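
Expressed in SQL against the illustrative tables sketched above, the query the tool would generate looks roughly like this:

    -- Total Revenue Passenger Legs for November 1998: the Date dimension
    -- resolves 'November', and the fact table is summed over the matching dates.
    SELECT SUM(f.rev_pax_legs) AS revenue_passenger_legs
    FROM   fact_res_leg_inventory f
    JOIN   dim_date d ON d.date_key = f.dep_date_key
    WHERE  d.month_nm = 'November'
    AND    d.year_no  = 1998;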

There are tools available that would enable someone with a passing knowledge of how relational databases work to relatively easily and safely compose such a query and receive a response within a time frame that would be an improvement over the standards to which most business managers are currently accustomed (depending on the size of the underlying database). However, most business managers don't have even a passing knowledge of how relational databases work, and for those managers, technology that requires such an understanding places the information that they need just out of reach.

To reach the typical business manager, one additional layer of database technology is required to bridge the gap. An OLAP database stores the data in a multi-dimensional matrix that even more accurately, efficiently and transparently mimics the dimensional nature of business analysis. In an OLAP database, there are no tables to be linked; just a single (logical) many-sided "Rubik's Cube" full of data. The "hypercube" has one side for each business perspective. The data is found at the intersections of the vectors of the matrix formed by the cube. Thus, if the cube has three sides, corresponding, say, to the business perspectives of "Date", "Leg" and "Class", then the data for the measure "Revenue Passenger Legs" for "January", "SCL-MIA", "Y" would be found, pre-summarized (or very quickly calculated on demand according to a simple, predefined definition) within the Rubik's Cube at the cell where the vectors for Date, Leg and Class intersect at the addresses of "January", "SCL-MIA" and "Y".
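
An OLAP server handles this storage and indexing invisibly, but the cells it holds correspond to ordinary aggregates. As a sketch, the GROUP BY CUBE extension offered by several relational databases computes every cell and subtotal plane of such a three-sided cube in one pass (again using the illustrative tables from earlier):

    -- All combinations of Month x Leg x Class totals for Revenue Passenger Legs,
    -- i.e., the pre-summarized cells of the three-sided hypercube.
    SELECT d.month_nm, l.leg_od_cd, c.class_cd,
           SUM(f.rev_pax_legs) AS rev_pax_legs
    FROM   fact_res_leg_inventory f
    JOIN   dim_date  d ON d.date_key  = f.dep_date_key
    JOIN   dim_leg   l ON l.leg_key   = f.leg_key
    JOIN   dim_class c ON c.class_key = f.class_key
    GROUP  BY CUBE (d.month_nm, l.leg_od_cd, c.class_cd);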

When the data is stored in an OLAP hypercube, even non-technical business managers are able to compose their own queries by picking the desired report categories from easy-to-understand "pick lists", and query response times are very fast; under 5 seconds is the standard for OLAP databases, and most answers are returned in the blink of an eye. OLAP databases allow managers and analysts to maintain their train of thought as they conduct ad hoc analysis "on-line", double-clicking to drill down to lower levels of detail to diagnose a problem or opportunity, and "slicing and dicing" across different business perspectives, on demand and without programmer assistance. Many OLAP databases allow you to pull data directly into spreadsheets for further analysis, thereby avoiding all manual keying of data into spreadsheets for analysis. Some of them allow you to save the results of your analysis back to the OLAP database for later viewing or revision.

Many companies now use this capability to interactively build their annual budgets by having contributors to the budgeting process develop, submit and revise their inputs via spreadsheets linked to OLAP databases. The result is a better budget, produced in a fraction of the time required to prepare budget proposals manually and submit them on paper. Budget variance reporting is subsequently performed against the same OLAP database, using the same spreadsheet reporting tools, which facilitates the identification of cost and revenue variations for corrective action.

Given the advantages of OLAP databases in terms of ease of use, flexibility and query response time, a business manager might well ask why relational databases are needed at all. The answer is that there are limits to how much data can be stored in a single OLAP "hypercube". A data source like the revenue accounting system, or the GDS "marketing information data tapes" (MIDT), might have tens of millions of records of individual customer transactions and as many as two dozen different business dimensions. With today's technology, this volume of detailed data won't fit into a single OLAP hypercube.

Thus, in our architecture, the star schema relational database acts as a comprehensive "library" of all of the available data, at its lowest available level of detail, and with all of the potential business dimensions. Facts and dimensions are "checked out" of the library, as needed, to build more narrowly defined OLAP hypercubes holding data that has been summarized to address specific analysis requirements. It is seldom necessary to include more than 8 to 10 business dimensions in a hypercube designed to conduct a specific type of analysis, and rarely is it necessary to analyze individual customer transactions. For example, to conduct scheduling analysis we might need to "check out" the dimensions of Date, Segment, Time-of-Day, Equipment Type and Number of Stops. For Pricing we might build a hypercube with the dimensions of Date, Segment, Day-of-Week, Tariff Class and Days in Advance. For Sales, the most useful dimensions might be Date, Journey, Tariff Class and Point-of-Sale. Including superfluous facts and business dimensions in a hypercube data mart makes it more difficult for business analysts to find the useful information that they need to perform their particular task.
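
For example, a summarized extract for the Pricing hypercube might be checked out of the leg inventory supermart as follows (a sketch only; a production pricing mart would more likely draw on segment level data):

    -- Check out five dimensions' worth of summarized data for a Pricing cube.
    CREATE TABLE mart_pricing_extract AS
    SELECT d.month_nm,
           d.day_of_week_nm,
           l.leg_od_cd,
           c.class_cd,
           a.days_in_advance,
           SUM(f.rev_pax_legs) AS rev_pax_legs
    FROM   fact_res_leg_inventory f
    JOIN   dim_date            d ON d.date_key     = f.dep_date_key
    JOIN   dim_leg             l ON l.leg_key      = f.leg_key
    JOIN   dim_class           c ON c.class_key    = f.class_key
    JOIN   dim_days_in_advance a ON a.days_adv_key = f.days_adv_key
    GROUP  BY d.month_nm, d.day_of_week_nm, l.leg_od_cd,
           c.class_cd, a.days_in_advance;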

Because there is a basic correspondence between the dimension tables in the star schema relational databases and the dimensional vectors in the OLAP databases, it is relatively easy to map data from the relational environment to the OLAP environment to create new focused OLAP "data marts" from the comprehensive relational "data supermarts" in response to new or changed business analysis requirements. When the dimensional metadata that we need to define our OLAP hypercubes already exists in the metadata repository created with our EMT tool, the process of creating new OLAP data marts in this manner is rendered trivial: we can create new data marts on demand. We can even afford to create one-time temporary hypercubes to conduct custom analysis, such as the analysis of unique promotions, and then throw them away when the analysis is complete.

Moreover, if we take care to re-use the same "conformed" dimension tables in all of our data supermarts (e.g., the same "Date" table), and the same definitions of derived measures in all of our relational data supermart fact tables (e.g., Yield = Passenger Revenue ÷ Revenue Passenger Kilometers), we can rest assured that in all of our relational data supermarts, and in all of our OLAP data marts, all of our terminology and all of our business rules will be consistent, regardless of a business manager's point of access (or means of access) to the information in our data warehouse. We will have created a "common prayer book", and will never again have to waste time arguing about whose numbers are right.
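
As a sketch of how such a business rule can be enforced physically, the derived measure can be defined once, in a view, so that no application ever computes it a second way. The revenue accounting fact table and its fields below are hypothetical:

    -- 'Yield' is defined exactly once: Passenger Revenue divided by
    -- Revenue Passenger Kilometers (columns assumed to be DECIMAL).
    CREATE VIEW v_yield AS
    SELECT dep_date_key,
           leg_key,
           SUM(pax_revenue) / NULLIF(SUM(rev_pax_km), 0) AS yield
    FROM   fact_rev_acctng_coupon   -- hypothetical revenue accounting fact table
    GROUP  BY dep_date_key, leg_key;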

At this point, we can define our "data warehouse" as comprising our entire collection of conformed star schema relational "data supermarts", the conformed OLAP "data marts" carved from them, and the infrastructure that surrounds them. As we bring new sources of data into our warehouse, we must take care that these new data supermarts are kept conformed to the existing standards. To the extent that the new sources allow us to enrich the content of our conformed dimension tables, we must retrofit our existing dimension tables with the improvements in order to maintain conformance throughout the data warehouse.

Development Methodology

It is impossible to completely define the entire data warehouse in detail in the initial requirements definition phase of the project. There has been a high rate of failure among data warehouse projects that attempt to discern all of the requirements in advance using the same project development methodologies that are used to develop operational systems. In these "Structured Development Methodologies" (SDM) or "System Development Life Cycle" (SDLC) methodologies (also referred to as "waterfall" methodologies), there is a protracted "requirements definition" phase at the beginning of the project where business managers are interviewed repeatedly in order to learn the business process in great detail before proceeding to the design phase. This design process doesn't work for data warehouses because it is impossible for business managers to articulate their thought processes in the same way that they can articulate the steps of a business process (such as the steps of processing a ticket coupon), which is what you need to know to model and design an operational system that automates a business process (such as a revenue accounting system). As Bill Inmon, the "father" of Data Warehousing, says: "The pervasive and valid attitude of business managers is: Build me what I say I want; once I've seen that I'll be able to tell you what I really want!" Thus, the design and development of a data warehouse must be iterative: you design and build prototype data marts; the business managers tell you how to improve them; you incorporate their suggestions into new designs, and so on. Inmon now refers to this as the "CLDS" design and development process, because it is just the opposite of the rigid SDLC process used to design operational systems.

In addition, Inmon describes the process of designing a data warehouse as being "data driven". That is to say, if you walk through the content of the available source data files with the business managers and ask them how they would like to use the data, you will learn the true analytical business requirements much faster than you will by conducting SDLC style open-ended interviews. If you commence your walkthrough with the explicit goal of mapping the available source data to a dimensional format (i.e., of identifying your "such and suches" (measures) and "bys" (dimensions)) you will be able to succeed even faster.

However, there must be some balance. When individual data marts are built in isolation without any advance planning or framework at all, the result is "multiple prayer books": executives arguing about whose data mart has the correct numbers because of differences in terminology and business rules between the standalone departmental marts. To strike this balance, the foremost data warehousing experts all now recommend beginning the data warehouse project with a short (2 to 3 week) effort to draft a "data warehouse bus architecture", a basic set of conformed fact definitions and conformed dimension definitions that will span the most important data sources that are to be brought into the warehouse. As additional data sources are brought into the data warehouse, and as the existing data and business requirements become better understood, the exact details of the data warehouse bus design will evolve, but, at any given point in time, they will always be consistent across the entire data warehouse. Once the data warehouse bus design has been drafted, it is safe to proceed with the design and development of individual relational data supermarts and OLAP data marts within the general framework that has been provided.
 

The Draft Design of the Data Warehouse Bus Architecture

The draft design for a major Latin American airline's marketing and planning data warehouse bus resulting from such an effort is reflected in an accompanying Visio document, also entitled "The Airline Data Warehouse Bus".

Fact Tables

The document currently contains six draft fact table designs for relational data supermarts holding data from the following operational sources:

  1. JAC Segment Market Share Data (government compiled market share data)
  2. Gabriel (Reservations System) Leg Inventory Data
  3. Gabriel (Reservations System) Segment Inventory Data
  4. Gabriel (Reservations System) Segment PNR Data (Both interim and long term)
  5. GDS Marketing Information Data Tapes (MIDT) Segment Data
  6. Revenue Accounting System (Avropax) Segment Data
Each fact table design shows the statistical measures contained in that fact table, along with the "foreign keys" that link them to specific conformed dimension tables.

A sample data supermart fact table design, the one for the Gabriel (Reservation System) Leg Inventory Data, is shown in the following figure.

Sample Data Warehouse Bus Fact Table

Gabriel (Reservation System) Inventory Record Extract Fact Table

Flight Leg Grain

The "grain" of the fact table defines its level of detail. In this example, there is one record in the fact table for each unique combination of tariff class, leg departure date, data collection point and airline/flight number/leg origin airport/leg destination airport.

Dimension Tables

The document contains 13 groups of conformed dimension tables, arranged according to general business perspectives that include:

  1. Date & Time
  2. Number of Days in Advance
  3. Airport Pair & Flight
  4. Booking and Sales Location
  5. Airline
  6. Class & Cabin
  7. Fare Basis
  8. Reduction Code
  9. Booking Status
  10. OW/RT Itinerary Indicator
  11. Dominant Segment Indicator
  12. Aircraft
  13. Frequent Traveler
Each Dimension Table is portrayed both in "OLAP view" (on the right side of each page), which shows the various logical relationships in hierarchies and in characteristic attributes for the "member" categories in that dimension, and in the corresponding "RDBMS view" (on the left side of each page), which reflects the physical design of the dimension tables in the star schema relational data supermarts. A sample dimension table group, the one for the date and time dimensions, is shown below.

Sample Data Warehouse Bus Dimension Table Group

Date and Time Dimensions

On these diagrams, each attribute in a dimension is categorized as either "hierarchical" or "characteristic" by the shape of its box, and is given both a logical name and a corresponding RDBMS field name for both its code name and its description/alias (not all of these are filled in yet). Field names should all conform to the field naming conventions shown in the chart at the back of the Visio document.

A data supermart "model" is composed of a data supermart fact table plus the appropriate dimension tables from the data warehouse bus that connect to it to form the "star". An example of a data supermart model, one for a Reservations System Leg Inventory data supermart, follows.

Sample Data Warehouse Bus Data Mart


Yield Management (Leg Level)

In this case, data is extracted from the reservation system seat inventory records. This data is conveniently pre-summarized within the reservation system, where it is used to compare seat reservations with seat supply and compute real-time seat availability. This data extract already feeds the yield management decision support system at many airlines. Its dimensionality and "granularity" are appropriate, as is, to support the existing yield management business analysis process. Consequently, the same model would also suffice to describe the structure and content of the OLAP hypercube data mart that would be built from this data to support yield management analysis.

In many cases, the OLAP data mart model would actually be a subset of the data and/or dimensionality in a larger data supermart. For example, the relational data supermart holding segment level PNR data will have about 2 dozen dimensions, no more than a dozen of which would ever be included in a single OLAP data mart.

In other cases, a single OLAP data mart will include a subset of data from more than one data supermart. For example, in order to perform route profitability analysis, it would be necessary to combine cost data from the general ledger, revenue data from the revenue accounting systems and statistical data from the reservations and departure control systems. As long as the data from these sources is all part of the data warehouse bus, with links to the same common dimension tables, it will be reasonably easy to combine data from these separate data supermarts into an integrated OLAP data mart to perform sophisticated multidimensional profitability analysis. The ease with which data from disparate sources can be combined when dimension definitions are conformed across sources is one of the major benefits of the data warehouse bus architecture.
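
A sketch of such a "drill-across" query follows, combining the hypothetical revenue accounting fact table from earlier with the reservations leg inventory fact table purely through their shared conformed keys:

    -- Each fact table is first summarized to the same grain, then the two
    -- result sets are joined on the conformed Date and Leg keys. This avoids
    -- double counting and is possible only because the dimensions conform.
    SELECT d.month_nm,
           l.leg_od_cd,
           SUM(r.pax_revenue)  AS pax_revenue,
           SUM(t.rev_pax_legs) AS rev_pax_legs
    FROM  (SELECT dep_date_key, leg_key, SUM(pax_revenue) AS pax_revenue
           FROM   fact_rev_acctng_coupon
           GROUP  BY dep_date_key, leg_key) r
    JOIN  (SELECT dep_date_key, leg_key, SUM(rev_pax_legs) AS rev_pax_legs
           FROM   fact_res_leg_inventory
           GROUP  BY dep_date_key, leg_key) t
          ON  t.dep_date_key = r.dep_date_key
          AND t.leg_key      = r.leg_key
    JOIN   dim_date d ON d.date_key = r.dep_date_key
    JOIN   dim_leg  l ON l.leg_key  = r.leg_key
    GROUP  BY d.month_nm, l.leg_od_cd;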

Some tables in the Visio design document are annotated with symbols that connote a particular design issue that must be addressed in the detailed physical implementation of the data supermarts that include that table (e.g., "Slowly Changing Dimension"). Rather than include a detailed discussion of these issues here, a Glossary of Symbols and Icons is included at the beginning of the Visio document. It includes a reference to the page in Ralph Kimball's book on dimensional database design, "The Data Warehouse Toolkit", where each issue is discussed in detail. This book is the standard reference source on dimensional database design in the data warehousing profession.


While this design was commenced prior to the publication of Ralph Kimball's new book, "The Data Warehouse Lifecycle Toolkit", you will find that the form of the Visio documentation follows quite closely the recommended notation for documenting dimensional database designs described in Chapter 7 of that book.

The Visio design is a "living document". Additional data sources (e.g., General Ledger costs, Official Airline Guide flight schedules, Airline Tariff Publishing Company fares, USDOT databases, etc.) are already being incorporated into the design, which will be continuously refined and enhanced. Additional data field information, such as data types, field lengths and "cardinalities" (the approximate number of potential unique values in a field), will be included in future versions of the design to assist with database sizing estimates. The intent is to develop this document into a complete template that will dramatically shortcut the design and implementation of a planning data warehouse at any airline in the world.

The purpose of this white paper is to demonstrate that the design and development of a robust data warehouse architecture based upon Ralph Kimball's data warehouse bus concept has significant practical benefits to business managers, and that it need not be a multi-year development project. The draft design template upon which this white paper is based was created, from scratch, over the course of 3 weeks by a team of 3 people. During those three weeks, the first data supermart and mart in the bus were put into production, and two additional data marts and supermarts were designed and prototyped (and will go into production as soon as the necessary disk space is obtained). Once the necessary physical infrastructure is in place, it should be possible to roll out additional data supermarts at a rate of at least one per month until the data warehouse is complete. With this experience under our belts, similar efforts at other airlines should take even less time.

Steve Elkins is currently Business Intelligence Practice Manager for Technium, Inc. Steve has over 25 years of airline industry experience, including stints as the Pricing and Yield Management department head at three major airlines. Steve was the first Marketing Systems Development department head at Northwest Airlines, where he sponsored a first generation marketing data warehousing effort. An authority on On-Line Analytical Processing technology, Steve is a regular writer for Intelligent Enterprise magazine.

Technium is a full service client-server consulting company with practices in Business Intelligence, Databases, Application Development, Web Development, Operating Systems and Networks. Business partners include Microsoft, Compaq, IBM, Hyperion Solutions and Informatica. Principal offices are in Chicago and Dallas.

Attachment A

The Data Warehouse Bus:

An Adaptive Architecture for Your Data Warehouse

Ralph Kimball

One of the most widely debated issues in data warehousing is how to go about planning the warehouse construction. Do we build the whole data warehouse all at once from a central, planned perspective (the monolithic approach) or do we opportunistically build separate subject areas whenever we feel like it (the stovepipe approach)? We should demolish two pervasive myths. The first myth is that the designer is forced to choose one of two extremes: either build a monolithic enterprise data warehouse, or build an isolated stovepipe. In truth, nobody believes in a totally monolithic approach and nobody defends a totally stovepipe approach. All the leading data warehouse practitioners use some kind of architected step-by-step approach to build an enterprise data warehouse. This represents a pragmatic middle ground between the two extremes. In this paper we sketch a specific variation of that step-by-step approach called the data warehouse bus architecture.

The second myth is that a data mart must be restricted to being a highly aggregated subset of a non-queryable data warehouse. This view of data marts is the source of many problems and misunderstandings. A data mart must be a natural complete subset of the overall data warehouse, and the collection of queryable data marts must be the queryable data warehouse itself. Finally, every useful data mart should be based on the most granular (atomic) data that can possibly be collected and stored.

The Planning Crisis

The task of planning an enterprise data warehouse is daunting. The newly appointed manager of the data warehousing effort in a large enterprise is faced with two huge and seemingly unrelated challenges. On the one hand the manager is supposed to understand the content and location of the most complicated asset owned by the enterprise: the legacy data. Somehow (usually overnight) the new data warehouse manager is supposed to become an authority on exactly what is contained in all those VSAM, ISAM, IMS, DB2, and Oracle tables. Every field in every table must be understood. The data warehouse manager must be able to retrieve any such element of data, and if necessary must be able to clean it up and correct it. If all this wasn't enough, on the other hand the data warehouse manager is supposed to understand exactly what keeps management awake at night. The data warehouse is expected to contain exactly the data needed to answer the Burning Questions. Of course, the data warehouse manager is "free" to drop in on senior management at any time to discuss current corporate priorities. Just make sure you get this data warehouse done pretty soon.

The pressure of this daunting task has built up to the point where it has a name. The name is data mart. Regardless of specific definitions, the phrase "data mart" means avoiding the impossibility of tackling the enterprise data warehouse planning job all at once. Data warehouse planners take refuge in carving off a little piece of the whole data warehouse and bringing it to completion, and calling it a data mart.

Unfortunately, in many cases the building of separate data marts rather than a single data warehouse has become an excuse for ignoring any kind of design framework that might tie the data marts together. Vendors' marketing claims for a "data mart in a box" and a "15 minute data mart" are pandering to the marketplace's need for a simple solution, but these claims are a real disservice to the manager of data warehousing who must make these data marts fit together into a coherent whole.

Isolated stovepipe data marts that cannot usefully be tied together are the bane of the data warehouse movement. They are much worse than a simple lost opportunity for analysis. Stovepipe data marts perpetuate incompatible views of the enterprise. Stovepipe data marts enshrine reports that cannot be compared with each other. And stovepipe data marts become legacy implementations in their own right, where, by their very existence, they block the development of an integrated enterprise data warehouse.

So if building the data warehouse all at once is too daunting and building it as isolated pieces defeats the overall goal, what is to be done?

Data Marts With a Bus Architecture

The answer to the dilemma is to start the data warehouse planning task with a short overall data architecture phase that has very finite and specific goals and then to follow this architecture phase with a step-by-step implementation of separate data marts where each implementation step closely adheres to the architecture. In this way the data warehouse manager gets the best of both worlds. The architecture phase produces specific guidelines that the separate data mart development teams can follow, and the data mart development teams can work fairly independently and asynchronously. As the separate data marts come on line, they will fit together like the pieces of a puzzle. At some point, enough data marts exist to make good on the promise of an integrated enterprise data warehouse.

Anyone who succeeds in implementing an enterprise data warehouse will inevitably perform the following two steps: 1) create a surrounding architecture that defines the scope and implementation of the complete data warehouse, and 2) oversee the construction of each of the pieces of the complete data warehouse. Now stop and consider step 2. The biggest task in constructing the data warehouse is designing the extract system that gets the data from a specific legacy system into the data staging area, where it can be transformed into the various load record images needed by the final database that presents the data for querying. Since the implementation of the extract logic is largely specific to each original data source, it really doesn't matter whether you think of the task as one task or whether you break it into pieces. Either way, you have to put your pants on one leg at a time. You will in effect be implementing your data marts one at a time no matter how you plan your project.

Conformed Dimensions and Facts


In the architecture phase that precedes the implementation of any of the data marts, the goals are to produce a master suite of conformed dimensions and facts. The resulting set of standards is called the data warehouse bus architecture. We are assuming here a proper dimensional (star join) design for all of the data marts. Please refer to Figure 1.

Figure 1: The Data Warehouse Bus Architecture

A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined. Generally this means that a conformed dimension is identically the same dimension in each data mart. Examples of obvious conformed dimensions include customer, product, location, deal (promotion), and calendar (time). A major responsibility of the central data warehouse design team is to establish, publish, maintain, and enforce the conformed dimensions.

The establishment of a conformed dimension is a very significant step. A conformed customer dimension is a master table of customers with a clean customer key and many well maintained attributes describing each customer. It is likely that the conformed customer dimension is an amalgamation and a distillation of data from several legacy systems and possibly outside sources. The address fields in the customer dimension should constitute the best mailable address that is known for each customer anywhere within the enterprise. It is often the responsibility of the central data warehouse team to create the conformed customer dimension and provide it as a resource to the rest of the enterprise, both for legacy use and for data warehouse use.
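
As a concrete sketch (illustrative only, and not taken from the book), such a conformed customer dimension might be declared once and then supplied, unchanged, to every data mart:

    -- One clean surrogate key per customer; attributes amalgamated and
    -- distilled from several legacy systems and outside sources.
    CREATE TABLE customer_dimension (
        customer_key        INTEGER PRIMARY KEY,  -- clean warehouse-assigned key
        customer_name       VARCHAR(60),
        mail_address        VARCHAR(80),          -- best mailable address known
        city                VARCHAR(40),
        state_province      VARCHAR(30),
        postal_code         VARCHAR(10),
        country             VARCHAR(30),
        customer_type       VARCHAR(20),
        first_purchase_date DATE
    );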

We have talked thus far about the central task of setting up conformed dimensions to tie our data marts together. This is 80% of the up front data architecture effort. The remaining 20% is establishing the conformed facts.

A conformed fact is a business measure that means the same thing in every fact table in which it appears.

Fortunately, the task of identifying the conformed facts is done at the same time as the identification of the conformed dimensions. We need conformed facts when we use the same terminology across data marts, and when we build single reports that drill across the data marts.

Examples of facts that must be conformed include revenue, profit, standard prices, and standard costs. The underlying equations that derive these facts must be the same if they are to be called the same thing. These conformed fact definitions need to be defined in the same dimensional context and with the same units of measurement from data mart to data mart. Revenues and profits need to be reported in the same time periods and in the same geographies.

Benefits of Using the Bus Architecture

Using the Data Warehouse Bus Architecture means successfully dividing and conquering. After the architectural phase of defining the conformed dimensions and conformed facts, the separate data mart development teams can proceed more or less independently. The separate data marts can be based on different technologies, as long as they have the same basic logical (dimensional) structure. Applications can be built that "drill across" these separate data marts to produce a high level report. In this way, the overall enterprise data warehouse can be built in a distributed, incremental fashion.

The clear and simple definition of a bus architecture consisting of conformed dimensions and conformed facts is both good news and bad news. It is good news because this architecture provides an understandable framework for IT organizations to build their data warehouses. It is bad news because it clearly illuminates the central challenge: getting the surrounding business to agree on a consistent set of labels and measures across the enterprise. The bus architecture becomes a kind of agenda for some important business meetings. At the end of these business meetings, the data warehouse team will have the guidance it needs to build a set of data marts that will fit together into a functioning whole.

Author's note: parts of this white paper were excerpted from Ralph Kimball's book, The Data Warehouse Lifecycle Toolkit, John Wiley Publishers, 1998, ISBN 0-471-25547-5. Permission from John Wiley to use this material is gratefully acknowledged.