Datastage | Datastage

October 23, 2014

Different Types of Dimensions and Facts in Data Warehouse

Dimension –

A dimension table typically has two types of columns, primary keys to fact tables and textual\descreptive data.

Fact –
A fact table typically has two types of columns, foreign keys to dimension tables and measures those that contain numeric facts. A fact table can contain fact’s data on detail or aggregated level.

Types of Dimensions –

Slowly Changing Dimensions:

Attributes of a dimension that would undergo changes over time. It depends on the business requirement whether particular attribute history of changes should be preserved in the data warehouse. This is called a Slowly Changing Attribute and a dimension containing such an attribute is called a Slowly Changing Dimension.

Rapidly Changing Dimensions:

A dimension attribute that changes frequently is a Rapidly Changing Attribute. If you don’t need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension.

Junk Dimensions:

A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions.

Inferred Dimensions:

While loading fact records, a dimension record may not yet be ready. One solution is to generate an surrogate key with Null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.

Conformed Dimensions:

A Dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.

Degenerate Dimensions:

A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

Role Playing Dimensions:

A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both Ship Date and Delivery Date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of Role Playing dimension.

Shrunken Dimensions:

A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the Shrunken Dimension.

Static Dimensions:

Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.

Types of Facts –

Additive:

Additive facts are facts that can be summed up through all of the dimensions in the fact table. A sales fact is a good example for additive fact.

Semi-Additive:

Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.

Non-Additive:

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts which have percentages, ratios calculated.

Factless Fact Table:

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called “Factless Fact tables”.

Eg: A fact table which has only product key and date key is a factless fact. There are no measures in this table. But still you can get the number products sold over a period of time.

Based on the above classifications, fact tables are categorized into two:

Cumulative:

This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.

Snapshot:

This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

October 23, 2014

Difference between ER modeling and dimension modeling

	Dimension Modelling	ER modelling
Redundancy	More	Less Redundancy
Optimization	Querying, can vary and can have huge volumes	Pre-determined set of updates, inserts and select queries
Data	Current/recent data	Archived and historical data
Size	MB to GBs	GBs to TBs
Physical Model	Denormalized	Normalized
Complexity	Dimensional model has logical grouped set of star-schemas	An E-R diagram is a complex maze of hundreds of entities linked with each other
Split	split as per the dimensions and facts	split as per the entities

October 23, 2014

Why is DM better than ER for DW?

1) There are standard type of joins and framework.

All dimensions can be thought of as symmetrically equal entry points into the fact table.

The logical design is independent of expected query patterns. The user interfaces are symmetrical, the query strategies are symmetrical, and the SQL generated against the dimensional model is symmetrical.

In other words,

You will never find attributes in fact tables and facts in dimension tables.
If you see a non-fact field in the fact table, you can assume that it is a key to a dimension table

2) it is smoothly extensible to accommodate unexpected new data elements and new design decisions.

First, all existing tables (both fact and dimension) can be changed in place by simply adding new data rows in the table. Data should not have to be reloaded. Typically, No query tool OR reporting tool needs to be reprogrammed to accommodate the change. All old applications continue to run without yielding different results. You can, respectively, make the following graceful changes to the design after the data warehouse is up and running by:

Adding new unanticipated facts (that is, new additive numeric fields in the fact table), as long as they are consistent with the fundamental grain of the existing fact table.
Adding completely new dimensions, as long as there is a single value of that dimension defined for each existing fact record
Adding new, unanticipated dimensional attributes.
Breaking existing dimension records down to a lower level of granularity from a certain point in time forward.

3) There is a body of standard approaches for handling common modeling situations in the business world.

Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces. These modeling situations include:

Slowly changing dimensions, where a ‘constant’ dimension such as Product OR Customer actually evolves slowly and asynchronously. Dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment.
Heterogeneous products, where a business such as a bank needs to:
- track a number of different lines of business together within a single common set of attributes and facts, but at the same time..
- it needs to describe and measure the individual lines of business in highly idiosyncratic ways using incompatible measures.

October 23, 2014

What is Dimensional Modelling

Dimensional Modelling is the set of techniques and concepts used in data warehouse design. Different Modelling techniques give different advantages in the way we store data. Dimensional modelling aims to make is relatively easy to retrieve data, especially querying large data sets.

The other popular data modelling technique, Entity-Relationship (ER) modelling, is used for OLTP and aims at reducing redundancy and better performance of both change and retrieval of data.

At a high level, effective dimensional models reflect the natural performance measurements and descriptive attributes of a business.

Goals of Dimensional modelling

Understandability: grouping by business categories to understand better
Query performance: Denormalized for query performance
Extensibility: can add new unexpected data

How to do dimensional modelling?

In dimensional modelling, everything is divided into 2 categories – facts/measures and dimensions/contexts.

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Step 1: Identify the dimensions

Step 2: Identify the measures

Step 3: Identify the attributes or properties of dimensions

Step 4: Identify the granularity of the measures

Step 5: History Preservation (Optional)

October 22, 2014

Operational Data Store or ODS

The ODS can be seen as an intermediate stage between the sources and the DW, although authors also propose that it can be used as a database for operational processing. It contains integrated data, but this data is at detail level and it is only current data. Therefore we can think that with this architecture we are dividing the transformation work into two phases: in the first phase the main task is integration, and in the second phase the rest of the data transformation work is done.

October 22, 2014

Target Data

Target data can be stored in:

Single data warehouse	Data marts with logical data chunks	Both data warehouse and data marts
The data warehouse has all the information of the enterprise in a historical archival.	Data Marts are proposed as logical subsets of the complete DW. They should be consistent in their data representation in order to assure DW robustness.	Data warehouses are used where interdependency of data marts is needed. Data marts are used for analysis of components

The data models that are used for DWs are Multidimensional Model and Relational Model.

Two different approaches for Relational DW design: one that applies dimensional modelling techniques, and another that bases mainly in the concept of materialized views.

Dimensional models represent data with a “cube” structure , making more compatible logical data representation with OLAP data management.

The objectives of dimensional modelling are:

To produce database structures that are easy for end-users to understand and write queries against, and
To maximise the efficiency of queries. It achieves these objectives by minimising the number of tables and relationships between them.

October 22, 2014

Data Staging Area

What is the Data Staging Area?

The data staging area sits between the data source(s) and the data target(s).

This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration. Here data gains its “intelligence”, as logic is applied to transform the data from a transactional nature to an analytical nature. This area is also where data cleansing happens. The ETL design phase is often the most time-consuming phase in a data warehousing project, and an ETL tool is often used in this area.

Sometimes also referred to as staging layer.

Why Data Staging Area?

Staging areas can be designed to provide many benefits, but the primary motivations for their use are to increase efficiency of ETL processes, ensure data integrity and support data quality operations. Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time. Not all businesses require a Data Warehouse Staging Area. Most businesses use ETL to copy data directly from operational databases into the Data Warehouse.

Staging Area

The functions of the staging area include the following:

Consolidation
Alignment
Minimize contention
Change detection
Data cleansing
Data Archival
troubleshooting

More details on Data Staging Area…

Data staging areas are often transient in nature, with their contents being erased prior to running an ETL process or immediately following successful completion of an ETL process.

Staging areas can be implemented in the form of tables in relational databases, text-based flat files (or XML files) stored in file systems or proprietary formatted binary files stored in file systems. Staging area architectures range in complexity from a set of simple relational tables in a target database to self-contained database instances or file systems.

October 22, 2014

OLTP Vs OLAP or Data Warehousing systems

A tabular comparison of OLTP or transaction processing system with Data Warehouse or a OLAP with all major criteria considered with some examples

This compares OLTP vs OLAP or OLTP vs DW (Data Warehouse)

	OLTP	OLAP
Indexes	Few	Too many
Joins	Many	Some
Data duplication	Normalized database (usually 3NF or 3^rd Normalized form – optimize update/insert/delete performance, and to guarantee data consistency	De-normalized or partially de-normalized schemas (such as a star schema) to optimize query performance.
Derived data & aggregates	Rare	Common
Workload	Only specifically designed queries will execute	Supports ad hoc queries, but workload can be planned ahead
Tuning	Optimized based on predetermined queries	Optimized for wide array of queries
Data modifications	In OLTP systems, end users routinely issue individual data modification statements to the database.	A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) from data from the sources using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.
Typical operations	A typical OLTP operation accesses one or a handful of records. For example, “Retrieve the current order for this customer.”	A typical data warehouse query scans thousands or millions of rows. For example, “Find the total fees charged for all customers in the last year.”
Historical data	OLTP systems usually store only the current version of the data or recent ones as needed to successfully meet the requirements of the current transaction	Data warehouses usually store many months or years of data. This is to support historical analysis.

October 21, 2014

Types of ETL tools

If you are being asked to make a choice on the ETL tool to be used, there are various considerations we need to make. First we will understand the types of ETL tools that are available:

In-House Versus Ready-Made	In-House is developed in the enterprise based on requirements, whereas ready made tools make use of tools available in the market and configure them as per the requirement.
Commercial license Versus Freeware	Commercial licensed software come with a price tag but with promise of support. But freeware drastically reduce the cost though support may not be easily available
Code-Based Versus GUI	Code based requires manual code changes to build software code whereas GUI based would create code by drag and drop and configuration changes
ETL versus ELT	The order of processing, especially when and where the data is transformed
Engine Types	The processing can happen either in one or multiple CPU depending on ETL tool used and capabilities provided in that version.

You can go through ETL vs ELT comparison to understand how they differ.

But, today, ETL is much more than that

Most ETL tools also cover:

data profiling and data quality control
meta data management
monitoring and cleansing of the data
real-time and on-demand data integration
extraction of Big Data using Hadoop
master data management

The considerations which are needed to be checked for types of ETL are:

Ease of use
Support
Speed
Data Quality
Connectivity
Hardware/Software platform required
Risk
Cost effectiveness

October 21, 2014

Data warehouse Sources

The OLTP systems, transaction systems, live databases, operational systems – whatever they are known as in various fora, these are are all source data for the data warehouse. The source data cannot be normally used for analytic information as they are:

Not Optimized for Reporting
Data schema optimized for transactions not queries
Difficult to share data
Typically do not maintain historical data

Data warehouse architecture 1