October 23, 2014

What is a Dimension?

A measure is normally an additive numerical value that represents a business metric. You are not limited to a single measure; a fact table can carry several. For example, if your fact table is used to track purchases internationally you might have a measure for each currency, and a fact table for the retail industry might have measures such as cost, list price, and average sale price.

Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects. In the retail scenario, purchases, returns, and calls are facts; customers, employees, items, and stores are dimensions and should be contained in dimension tables.

Dimension tables contain details about each instance of an object. For example, the items dimension table would contain a record for each item sold in the store, including information such as the cost of the item, the supplier, color, sizes, and similar data. Dimension tables are typically small, ranging from a few to several thousand rows, although they can occasionally grow fairly large: a large credit card company could have a customer dimension with millions of rows. The structure of a dimension table is typically very lean; a customer dimension could look like the following:

Customer_key
Customer_full_name
Customer_city
Customer_state
Customer_country

Fact tables and dimension tables are related to each other. Returning to our retail model, the fact table for a customer transaction would contain a foreign key reference to the item dimension table, where that key corresponds to the primary key of the record describing the item purchased.

Most data warehouses have one or more time dimensions. Since the warehouse is used for finding and examining trends, data analysts need to know when each fact occurred. The most common time dimension is calendar time, but your business might also need a fiscal time dimension if your fiscal year does not start on January 1st the way the calendar year does. Most data warehouses also contain product or service dimensions, since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension as well.
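The lean customer dimension sketched above can be illustrated with a quick SQLite session from Python's standard library. This is a minimal sketch: the `dim_customer` table name and the sample row are assumptions, while the column names come from the post.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        Customer_key       INTEGER PRIMARY KEY,  -- surrogate key
        Customer_full_name TEXT,
        Customer_city      TEXT,
        Customer_state     TEXT,
        Customer_country   TEXT
    )
""")
# One record per customer instance, exactly as a dimension table stores it.
conn.execute(
    "INSERT INTO dim_customer VALUES (1, 'Jane Doe', 'Austin', 'TX', 'USA')"
)
row = conn.execute(
    "SELECT Customer_full_name, Customer_state FROM dim_customer"
).fetchone()
print(row)  # ('Jane Doe', 'TX')
conn.close()
```

Note how the table holds only a surrogate key plus descriptive attributes; the numeric measures live in the fact table, which references `Customer_key`.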
October 23, 2014

What is a Fact Table?

A fact table is a table that joins dimension tables with measures. For example, let's say you wanted to know the time worked by employees, by location, by project, and by task. If you had a dimension each for employees, location, project, and task, you would create a composite primary key from these foreign keys and add an additional column for the time-worked measure (more on measures in a little bit).

Keep in mind that fact tables, unlike dimension tables, are huge. They are usually built to contain from a hundred thousand records on the low side up to billions of records on the high side, so these tables must be normalized to be efficient. A fact table is normalized when it consists of the following two things:

A composite primary key made up of the foreign keys to the dimension tables
One or more measures used for numerical analysis

Fact tables contain the data corresponding to a particular business process. Each row represents a single event associated with that process and contains the measurement data associated with that event. For example, a retail organization might have fact tables for customer purchases, customer service telephone calls, and product returns. The customer purchases table would likely contain information about the amount of the purchase, any discounts applied, and the sales tax paid.

The information within a fact table is typically numeric data, and it is often data that can be easily manipulated, particularly by summing together many thousands of rows. For example, the retailer described above may wish to pull a profit report for a particular store, product line, or customer segment. The retailer can do this by retrieving the fact table rows that meet the specific criteria and then adding those rows together.

Fact tables contain keys to dimension tables as well as the measurable facts that data analysts want to examine. For example, a store selling automotive parts might have a fact table recording the sale of each item, the fact table of an educational institution could track credit hours awarded to students, and a bakery could have a fact table that records the manufacture of various baked goods.

Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, assembly-line analysts might be very concerned with the number of defective goods manufactured each hour. Similarly, a marketing data warehouse might be concerned with the activity of a consumer group with a specific income level rather than with purchases made by each individual.

Fact Table Grain

When designing a fact table, developers must pay careful attention to the grain of the table, the level of detail contained within it. The developer designing the purchase fact table described above would need to decide, for example, whether the grain of the table is a customer transaction or an individual item purchase. With an individual-item-purchase grain, each customer transaction would generate multiple fact table entries, one for each item purchased. The choice of grain is a fundamental decision made during the design process that can have significant impact on the business intelligence effort down the road.
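The time-worked example above, a composite primary key of dimension foreign keys plus a measure column, can be sketched in SQLite. All table, column, and key names here are illustrative assumptions, and the simple `date_key` integer stands in for a full time dimension.

```python
import sqlite3

# Grain of the fact table: one row per employee / project / day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_employee (employee_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_project  (project_key  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_time_worked (
        employee_key INTEGER REFERENCES dim_employee(employee_key),
        project_key  INTEGER REFERENCES dim_project(project_key),
        date_key     INTEGER,   -- would reference a time dimension
        hours_worked REAL,      -- the additive measure
        PRIMARY KEY (employee_key, project_key, date_key)  -- composite key
    );
    INSERT INTO dim_employee VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO dim_project  VALUES (10, 'Warehouse build');
    INSERT INTO fact_time_worked VALUES
        (1, 10, 20141001, 6.5),
        (2, 10, 20141001, 8.0),
        (1, 10, 20141002, 7.0);
""")
# Analysis is mostly summing the measure across many fact rows.
total = conn.execute(
    "SELECT SUM(hours_worked) FROM fact_time_worked WHERE project_key = 10"
).fetchone()[0]
print(total)  # 21.5
conn.close()
```

The composite primary key also enforces the grain: a second row for the same employee, project, and day would be rejected rather than silently double-counted.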
October 23, 2014

Difference between ER modelling and dimensional modelling

Criterion       | Dimensional modelling                                  | ER modelling
Redundancy      | More                                                   | Less
Optimization    | Querying, which can vary and involve huge volumes      | A pre-determined set of updates, inserts, and select queries
Data            | Archived and historical data                           | Current/recent data
Size            | GBs to TBs                                             | MBs to GBs
Physical model  | Denormalized                                           | Normalized
Complexity      | A logically grouped set of star schemas                | A complex maze of hundreds of entities linked to each other
Split           | Split by dimensions and facts                          | Split by entities
October 23, 2014

What is Dimensional Modelling?

Dimensional modelling is a set of techniques and concepts used in data warehouse design. Different modelling techniques give different advantages in the way we store data. Dimensional modelling aims to make it relatively easy to retrieve data, especially when querying large data sets. The other popular data modelling technique, Entity-Relationship (ER) modelling, is used for OLTP and aims to reduce redundancy and to perform well for both changes and retrieval of data. At a high level, effective dimensional models reflect the natural performance measurements and descriptive attributes of a business.

Goals of dimensional modelling:

Understandability: data is grouped by business categories, so it is easier to understand
Query performance: denormalized for query performance
Extensibility: new, unexpected data can be added

How to do dimensional modelling? In dimensional modelling, everything is divided into two categories: facts/measures and dimensions/contexts.

Dimension: a category of information, for example the time dimension.
Attribute: a unique level within a dimension; for example, Month is an attribute of the time dimension.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Step 1: Identify the dimensions
Step 2: Identify the measures
Step 3: Identify the attributes or properties of the dimensions
Step 4: Identify the granularity of the measures
Step 5: History preservation (optional)
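The fact-to-lookup-table relationships described above can be sketched as a tiny star schema in SQLite. This is a sketch under assumptions: all table names and sample values are invented for illustration.

```python
import sqlite3

# Star schema: the fact table joins to each lookup (dimension) table,
# and the lookup tables never join directly to one another.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_time  (time_key  INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city  TEXT);
    CREATE TABLE fact_sales (
        time_key    INTEGER REFERENCES dim_time(time_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        sale_amount REAL
    );
    INSERT INTO dim_time  VALUES (1, '2014-09'), (2, '2014-10');
    INSERT INTO dim_store VALUES (1, 'Austin');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 250.0), (2, 1, 50.0);
""")
# Group the measure by an attribute (month) of the time dimension.
rows = conn.execute("""
    SELECT t.month, SUM(f.sale_amount)
    FROM fact_sales f JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY t.month ORDER BY t.month
""").fetchall()
print(rows)  # [('2014-09', 100.0), ('2014-10', 300.0)]
conn.close()
```

Here `month` plays the role of an attribute: a non-key column of the lookup table used to group and label the measure.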
October 22, 2014

Data Staging Area

What is the data staging area? The data staging area sits between the data source(s) and the data target(s). This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart, and having one common area makes subsequent data processing / integration easier. Here data gains its "intelligence", as logic is applied to transform it from a transactional nature to an analytical nature. This area is also where data cleansing happens. The ETL design phase is often the most time-consuming phase in a data warehousing project, and an ETL tool is often used in this area. The staging area is sometimes also referred to as the staging layer.

Why a data staging area? Staging areas can be designed to provide many benefits, but the primary motivations for their use are to increase the efficiency of ETL processes, to ensure data integrity, and to support data quality operations. Due to varying business cycles, data processing cycles, hardware and network resource limitations, and geographical factors, it is not feasible to extract all the data from all operational databases at exactly the same time. Not every business requires a staging area, however; many use ETL to copy data directly from operational databases into the data warehouse.

The functions of the staging area include the following:

Consolidation
Alignment
Minimizing contention
Change detection
Data cleansing
Data archival
Troubleshooting

Data staging areas are often transient in nature, with their contents erased either prior to running an ETL process or immediately after its successful completion. Staging areas can be implemented as tables in relational databases, as text-based flat files (or XML files) stored in file systems, or as proprietary binary files stored in file systems. Staging area architectures range in complexity from a set of simple relational tables in a target database to self-contained database instances or file systems.
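A transient staging flow of the kind described above (land raw extracts, cleanse, load, then erase) might look like the following SQLite sketch. The staging and target table names and the cleansing rules are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (name TEXT, state TEXT);  -- staging area
    CREATE TABLE dw_customer  (name TEXT, state TEXT);  -- warehouse target
""")

# 1. Extract: land the raw source rows, warts and all, in staging.
raw = [("  jane doe ", "tx"), ("Bob Ray", " CA")]
conn.executemany("INSERT INTO stg_customer VALUES (?, ?)", raw)

# 2. Cleanse + load: trim whitespace and normalise case on the way in.
conn.execute("""
    INSERT INTO dw_customer
    SELECT TRIM(name), UPPER(TRIM(state)) FROM stg_customer
""")

# 3. Erase staging after a successful load (transient contents).
conn.execute("DELETE FROM stg_customer")

loaded = conn.execute("SELECT * FROM dw_customer").fetchall()
staged_left = conn.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]
print(loaded, staged_left)  # [('jane doe', 'TX'), ('Bob Ray', 'CA')] 0
```

Keeping the raw rows in staging until the load succeeds is what makes troubleshooting possible: a failed load can be rerun from staging without re-extracting from the operational system.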
October 22, 2014

OLTP vs OLAP (Data Warehousing) systems

A tabular comparison of an OLTP (transaction processing) system with a data warehouse / OLAP system, covering the major criteria, with some examples.

Indexes
OLTP: Few
OLAP: Many

Joins
OLTP: Many
OLAP: Some

Data duplication
OLTP: Normalized database (usually 3NF, or third normal form) to optimize update/insert/delete performance and to guarantee data consistency.
OLAP: De-normalized or partially de-normalized schemas (such as a star schema) to optimize query performance.

Derived data & aggregates
OLTP: Rare
OLAP: Common

Workload
OLTP: Only specifically designed queries will execute.
OLAP: Supports ad hoc queries, but the workload can be planned ahead.

Tuning
OLTP: Optimized for a predetermined set of queries.
OLAP: Optimized for a wide array of queries.

Data modifications
OLTP: End users routinely issue individual data modification statements against the database.
OLAP: The warehouse is updated on a regular basis by the ETL process (run nightly or weekly) from the source data using bulk data modification techniques; end users do not update the warehouse directly.

Typical operations
OLTP: A typical operation accesses one or a handful of records, e.g. "Retrieve the current order for this customer."
OLAP: A typical query scans thousands or millions of rows, e.g. "Find the total fees charged for all customers in the last year."

Historical data
OLTP: Usually stores only the current version of the data, or as many recent versions as needed to meet the requirements of the current transaction.
OLAP: Usually stores many months or years of data, to support historical analysis.
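The two "typical operations" above can be contrasted concretely in SQLite. The `orders` table, its columns, and the sample values are all illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,"
    " year INTEGER, fee REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, 101, 2013, 10.0), (2, 101, 2014, 12.5), (3, 202, 2014, 7.5),
])

# OLTP-style: fetch the current order for one customer (a point lookup
# touching a handful of rows).
current = conn.execute(
    "SELECT order_id FROM orders"
    " WHERE customer_id = 101 ORDER BY year DESC LIMIT 1"
).fetchone()[0]

# OLAP-style: total fees for all customers in the last year (a scan
# and aggregate over many rows).
total = conn.execute(
    "SELECT SUM(fee) FROM orders WHERE year = 2014"
).fetchone()[0]
print(current, total)  # 2 20.0
conn.close()
```

The difference in access pattern is why the two systems are indexed, normalized, and tuned so differently: the first query wants a fast key lookup, the second wants cheap sequential scans and aggregation.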
October 21, 2014

Data Warehouse Sources

The OLTP systems, transaction systems, live databases, operational systems, whatever they are known as in various fora, these are all source data for the data warehouse. The source data cannot normally be used for analytic information because the source systems are:

Not optimized for reporting
Built on data schemas optimized for transactions, not queries
Difficult to share data across
Typically not keepers of historical data

[Figure 1: Data warehouse architecture]
October 21, 2014

Data Warehouse Lifecycle

The data warehouse lifecycle involves the creation of an efficient system for storing and retrieving computer data. Time has shown that simply dumping data into vast computer storage does not work. Instead, it is best to create a storage system, test it, and alter it as needed to fit the ever-changing data needs of the company.

Design: the development, from both available data inventories and DSS analyst requirements and analytic needs, of robust star-schema-based dimensional data models. Practically speaking, the best data warehousing practitioners today work simultaneously from (a) the available data inventory in the organization and (b) the often incomplete expression of needs on the part of multiple heterogeneous analytic communities, making data warehouse and data mart design a complex and intricate process more akin to architecture and diplomacy than to traditional IT systems design. Key activities in this phase typically include end-user interview cycles, source system cataloging, definition of key performance indicators and other critical business metrics, mapping of the decision-making processes underlying information needs, and logical and physical schema design tasks, which feed the prototyping phase of the lifecycle model quite directly.

Prototype: the deployment, for a select group of opinion-makers and leading practitioners in the end-user analytic communities, of a populated, working model of a data warehouse or data mart design, suitable for actual use. The purpose of prototyping shifts as the design team moves back and forth between design and prototype. In early prototypes, the primary objective is to constrain, and in some cases re-frame, end-user requirements by showing opinion-leaders and heavyweight analysts in the end-user communities precisely what they had asked for in the previous iteration. As the gap between stated needs and actual needs closes over the course of two or more design-prototype iterations, the purpose of the prototype shifts toward diplomacy: gaining commitment to the design from opinion leaders in the end-user communities, and soliciting their assistance in winning similar commitment for the project at hand.

Deploy: the formalization of a user-approved prototype for actual production use, including the development of documentation, training, O&M processes, and the host of activities traditionally associated with enterprise IT system deployment. Deployment typically involves at least two separate deployments: of a prototype into a production-test environment, and of a stress-tested, performance-tested production configuration into the actual production environment. It is at this phase that the single most often-neglected component of a successful production warehouse, documentation, stalls both the transition to operations and management personnel (who cannot manage what they cannot understand) and to end-user organizations (who, after all, have to be taught at some level about the data and metadata the warehouse or mart contains prior to roll-out).

Operation: the day-to-day maintenance of the data warehouse or mart, of the data delivery services and client tools that give DSS analysts access to warehouse and mart data, and of the ongoing extraction, transformation, and loading processes (from traditional drop-and-reload bulk updates to cutting-edge near-real-time trickle-charging processes) that keep the warehouse or mart current with respect to the authoritative transaction source systems.

Enhancement: the modification of (a) physical technological components, (b) operations and management processes (including ETL regimes and scheduling), and (c) logical schema designs in response to changing business requirements. In cases where external business conditions change discontinuously, or organizations themselves undergo discontinuous change (as with asset sales, mergers, and acquisitions), enhancement moves seamlessly back into fundamental design.
October 18, 2014

Types of Data Warehouse

To understand the various data warehouse architectures, we first need to understand the various components in the enterprise that contribute to the data warehouse. Depending on how the data is stored and accessed, the commonly noted architectures are:

Independent data marts (IDM)
Data mart bus architecture with linked dimensional data marts (DBA)
Hub and spoke architecture (HAS)
Federated architecture (FED)
Centralized architecture

Another way of classifying data warehouse architectures is by the number of middleware layers between the operational systems and the analytical tools: the single-layer architecture and the N-layer architecture.
April 29, 2014

Why Data Warehouse?

Why do we need a data warehouse? Heterogeneities are everywhere: enterprises tend to rely on different databases, files, external sources, and even the internet and social media for all their information. This means:

Different interfaces
Different data representations
Duplicate and inconsistent information

Data management in large enterprises suffers from vertical fragmentation of informational systems ("vertical stove pipes"), the result of application (user)-driven development of operational systems.

The goal of building a warehouse is to have a system that:

Collects and combines information
Provides an integrated view and a uniform user interface
Supports sharing