What is a Dimension?

A measure or dimension is normally an additive numerical value that represents a business metric.  You are also not limited to a single measure.  You can have multiple measures within a fact table.  For example, if your fact table is used to track purchases internationally you might have measures for each type of currency.  If you are building a fact table for the retail industry you might also have the following measures cost, list price, average sale price.

Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects. For example, in the retail scenario, we discussed that purchases, returns and calls are facts. On the other hand, customers, employees, items and stores are dimensions and should be contained in dimension tables.

 

Dimension tables contain details about each instance of an object. For example, the items dimension table would contain a record for each item sold in the store. It might include information such as the cost of the item, the supplier, color, sizes, and similar data.

Dimension tables are typically small, ranging from a few to several thousand rows. Occasionally dimensions can grow fairly large, however. For example, a large credit card company could have a customer dimension with millions of rows. Dimension table structure is typically very lean, for example customer dimension could look like following:

Customer_key
Customer_full_name
Customer_city
Customer_state
Customer_country

Fact tables and dimension tables are related to each other. Again returning to our retail model, the fact table for a customer transaction would likely contain a foreign key reference to the item dimension table, where the entry corresponds to a primary key in that table for a record describing the item purchased.

Most data warehouses will have one or multiple time dimensions. Since the warehouse will be used for finding and examining trends, data analysts will need to know when each fact has occurred. The most common time dimension is calendar time. However, your business might also need a fiscal time dimension in case your fiscal year does not start on January 1st as the calendar year.

Most data warehouses will also contain product or service dimensions since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension.

Change data Capture

IBM WebSphere DataStage Change Data Capture

This is not to be confused with the stage Change Data Capture (CDC)

The following CDC companion products are available to work with IBM Information Server, these need to be installed separately:

  • IBM WebSphere® DataStage® Changed Data Capture for Microsoft® SQL Server
  • IBM WebSphere DataStage Changed Data Capture for Oracle
  • IBM WebSphere DataStage Changed Data Capture for DB2® for z/OS®
  • IBM WebSphere DataStage Changed Data Capture for IMS™

The product, based on which database is used, is installed to capture changes from source data and pass on the changes to target data.

Change data Capture

Change data Capture

It can be used in 2 modes: PUSH and PULL modes.

  • PUSH – changes are published as it happens
  • PULL – changes are captures at regular intervals – say once a day or every 5 minutes

CDC uses the native services of the database architecture, adheres to the database vendor’s documented formats and APIs, and minimizes the invasive impact on any operational systems.

Please refer to IBM product documentation here

What is a Fact Table?

A fact table is a table that joins dimension tables with measures.  For example, Lets say you wanted to know the time worked by employees, by location, by project and by task.  If you had a dimension for employees, location, project and task you would create a composite primary key using these foreign keys and add an additional column for the time worked measure. (more on measures in a little bit)

Keep in mind that fact tables are huge unlike dimension tables.  Fact tables are usually built to contain a hundred thousand records on the low side up to billions of records on the high side.  Therefore, these tables must be normalized to be efficient.

A fact table is normalized when it consists of the following two things:

  1. A composite primary key using the foreign keys to the dimension tables.
  2. Measure(s) used for numerical analysis.

 

Fact tables contain the data corresponding to a particular business process. Each row represents a single event associated with that process and contains the measurement data associated with that event. For example, a retail organization might have fact tables related to customer purchases, customer service telephone calls and product returns. The customer purchases table would likely contain information about the amount of the purchase, any discounts applied and the sales tax paid.

The information contained within a fact table is typically numeric data and it is often data that can be easily manipulated, particularly by summing together many thousands of rows. For example, the retailer described above may wish to pull a profit report for a particular store, product line or customer segment. The retailer can do this by retrieving information from the fact table that relates to those transactions meeting the specific criteria and then adding those rows together.

Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records manufacturing of various baked goods.

Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business this is often referred to as fact table “grain”. For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, the assembly line warehouse analysts might be very concerned in number of defective goods that were manufactured each hour. Similarly a marketing data warehouse might be concerned by the activity of a consumer group with a specific income-level rather than purchases made by each individual.

Fact Table Grain

When designing a fact table, developers must pay careful attention to the grain of the table — the level of detail contained within the table. The developer designing the purchase fact table described above would need to decide, for example, whether the grain of the table is a customer transaction or an individual item purchase. In the case of an individual item purchase grain, each customer transaction would generate multiple fact table entries, corresponding to each item purchased. The choice of grain is a fundamental decision made during the design process that can have significant impact on the business intelligence effort down the road.

Different Types of Dimensions and Facts in Data Warehouse

Dimension –

A dimension table typically has two types of columns, primary keys to fact tables and textual\descreptive data.

Fact –
A fact table typically has two types of columns, foreign keys to dimension tables and measures those that contain numeric facts. A fact table can contain fact’s data on detail or aggregated level.

 

Types of Dimensions –

Slowly Changing Dimensions:

 Attributes of a dimension that would undergo changes over time. It depends on the business requirement whether particular attribute history of changes should be preserved in the data warehouse. This is called a Slowly Changing Attribute and a dimension containing such an attribute is called a Slowly Changing Dimension.

 

Rapidly Changing Dimensions:

A dimension attribute that changes frequently is a Rapidly Changing Attribute. If you don’t need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension.

 

Junk Dimensions:

A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions.

 

Inferred Dimensions:

While loading fact records, a dimension record may not yet be ready. One solution is to generate an surrogate key with Null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.

 

Conformed Dimensions:

A Dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.

 

Degenerate Dimensions:

 A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

 

Role Playing Dimensions:

A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both Ship Date and Delivery Date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of Role Playing dimension.

 

Shrunken Dimensions:

A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the Shrunken Dimension.

 

Static Dimensions:

Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.

 

Types of Facts –

 

Additive:

Additive facts are facts that can be summed up through all of the dimensions in the fact table. A sales fact is a good example for additive fact.

Semi-Additive:

Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.

Non-Additive:

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts which have percentages, ratios calculated.

 

Factless Fact Table:

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called “Factless Fact tables”.

Eg: A fact table which has only product key and date key is a factless fact. There are no measures in this table. But still you can get the number products sold over a period of time.

Based on the above classifications, fact tables are categorized into two:

Cumulative:

This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.

Snapshot:

This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Difference between ER modeling and dimension modeling

  Dimension Modelling ER modelling
Redundancy More Less Redundancy
Optimization Querying, can vary and can have huge volumes Pre-determined set of updates, inserts and select queries
Data Current/recent data Archived and historical data
Size MB to GBs GBs to TBs
Physical Model Denormalized Normalized
Complexity Dimensional model has logical grouped set of star-schemas An E-R diagram is a complex maze of hundreds of entities linked with each other
Split split as per the dimensions and facts split as per the entities

 

 

Why is DM better than ER for DW?

1) There are standard type of joins and framework.

All dimensions can be thought of as symmetrically equal entry points into the fact table.

The logical design is independent of expected query patterns. The user interfaces are symmetrical, the query strategies are symmetrical, and the SQL generated against the dimensional model is symmetrical.

In other words,

  • You will never find attributes in fact tables and facts in dimension tables.
  • If you see a non-fact field in the fact table, you can assume that it is a key to a dimension table

2) it is smoothly extensible to accommodate unexpected new data elements and new design decisions.

First, all existing tables (both fact and dimension) can be changed in place by simply adding new data rows in the table. Data should not have to be reloaded. Typically, No query tool OR reporting tool needs to be reprogrammed to accommodate the change. All old applications continue to run without yielding different results. You can, respectively, make the following graceful changes to the design after the data warehouse is up and running by:

  • Adding new unanticipated facts (that is, new additive numeric fields in the fact table), as long as they are consistent with the fundamental grain of the existing fact table.
  • Adding completely new dimensions, as long as there is a single value of that dimension defined for each existing fact record
  • Adding new, unanticipated dimensional attributes.
  • Breaking existing dimension records down to a lower level of granularity from a certain point in time forward.

3) There is a body of standard approaches for handling common modeling situations in the business world.

Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces. These modeling situations include:

  • Slowly changing dimensions, where a ‘constant’ dimension such as Product OR Customer actually evolves slowly and asynchronously. Dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment.
  • Heterogeneous products, where a business such as a bank needs to:
    • track a number of different lines of business together within a single common set of attributes and facts, but at the same time..
    • it needs to describe and measure the individual lines of business in highly idiosyncratic ways using incompatible measures.

What is Dimensional Modelling

Dimensional Modelling is the set of techniques and concepts used in data warehouse design. Different Modelling techniques give different advantages in the way we store data. Dimensional modelling aims to make is relatively easy to retrieve data, especially querying large data sets.

The other popular data modelling technique, Entity-Relationship (ER) modelling, is used for OLTP and aims at reducing redundancy and better performance of both change and retrieval of data.

At a high level, effective dimensional models reflect the natural performance measurements and descriptive attributes of a business.

Goals of Dimensional modelling

  • Understandability: grouping by business categories to understand better
  • Query performance: Denormalized for query performance
  • Extensibility: can add new unexpected data

How to do dimensional modelling?

In dimensional modelling, everything is divided into 2 categories – facts/measures and dimensions/contexts.

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Step 1: Identify the dimensions

Step 2: Identify the measures

Step 3: Identify the attributes or properties of dimensions

Step 4: Identify the granularity of the measures

Step 5: History Preservation (Optional)

 

Operational Data Store or ODS

The ODS can be seen as an intermediate stage between the sources and the DW, although authors also propose that it can be used as a database for operational processing. It contains integrated data, but this data is at detail level and it is only current data. Therefore we can think that with this architecture we are dividing the transformation work into two phases: in the first phase the main task is integration, and in the second phase the rest of the data transformation work is done.

Target Data

Target data can be stored in:

Single data warehouse Data marts with logical data chunks Both data warehouse and data marts
The data warehouse has all the information of the enterprise in a historical archival. Data Marts are proposed as logical subsets of the complete DW. They should be consistent in their data representation in order to assure DW robustness. Data warehouses are used where interdependency of data marts is needed. Data marts are used for analysis of components

 

The data models that are used for DWs are Multidimensional Model and Relational Model.

 

Two different approaches for Relational DW design: one that applies dimensional modelling techniques, and another that bases mainly in the concept of materialized views.

Dimensional models represent data with a “cube” structure , making more compatible logical data representation with OLAP data management.

The objectives of dimensional modelling are:

  • To produce database structures that are easy for end-users to understand and write queries against, and
  • To maximise the efficiency of queries. It achieves these objectives by minimising the number of tables and relationships between them.

Data Staging Area

What is the Data Staging Area?

The data staging area sits between the data source(s) and the data target(s).

This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.  Here data gains its “intelligence”, as logic is applied to transform the data from a transactional nature to an analytical nature. This area is also where data cleansing happens. The ETL design phase is often the most time-consuming phase in a data warehousing project, and an ETL tool is often used in this area.

Sometimes also referred to as staging layer.

Why Data Staging Area?

Staging areas can be designed to provide many benefits, but the primary motivations for their use are to increase efficiency of ETL processes, ensure data integrity and support data quality operations. Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time.  Not all businesses require a Data Warehouse Staging Area.  Most businesses use ETL to copy data directly from operational databases into the Data Warehouse.

 

Staging Area

Staging Area

 

The functions of the staging area include the following:

  • Consolidation
  • Alignment
  • Minimize contention
  • Change detection
  • Data cleansing
  • Data Archival
  • troubleshooting

More details on Data Staging Area…

Data staging areas are often transient in nature, with their contents being erased prior to running an ETL process or immediately following successful completion of an ETL process.

Staging areas can be implemented in the form of tables in relational databases, text-based flat files (or XML files) stored in file systems or proprietary formatted binary files stored in file systems. Staging area architectures range in complexity from a set of simple relational tables in a target database to self-contained database instances or file systems.