October 23, 2014

What is a Dimension?

A measure is normally an additive numerical value that represents a business metric. You are not limited to a single measure; a fact table can carry several. For example, if your fact table is used to track purchases internationally you might have a measure for each currency, and a fact table for the retail industry might have measures such as cost, list price, and average sale price.

Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects. In the retail scenario, purchases, returns, and calls are facts; customers, employees, items, and stores are dimensions and should be contained in dimension tables.

Dimension tables contain details about each instance of an object. For example, the items dimension table would contain a record for each item sold in the store, including information such as the cost of the item, the supplier, color, sizes, and similar data. Dimension tables are typically small, ranging from a few to several thousand rows, although they can occasionally grow fairly large: a large credit card company could have a customer dimension with millions of rows. The structure of a dimension table is typically very lean; a customer dimension could look like the following:

Customer_key
Customer_full_name
Customer_city
Customer_state
Customer_country

Fact tables and dimension tables are related to each other. Returning to our retail model, the fact table for a customer transaction would contain a foreign key reference to the item dimension table, where that key corresponds to the primary key of the record describing the item purchased.

Most data warehouses have one or more time dimensions. Since the warehouse is used for finding and examining trends, data analysts need to know when each fact occurred. The most common time dimension is calendar time, but your business might also need a fiscal time dimension if your fiscal year does not start on January 1st the way the calendar year does. Most data warehouses also contain product or service dimensions, since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension as well.
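The lean customer dimension sketched above can be illustrated with a quick SQLite session from Python's standard library. This is a minimal sketch: the `dim_customer` table name and the sample row are assumptions, while the column names come from the post.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        Customer_key       INTEGER PRIMARY KEY,  -- surrogate key
        Customer_full_name TEXT,
        Customer_city      TEXT,
        Customer_state     TEXT,
        Customer_country   TEXT
    )
""")
# One record per customer instance, exactly as a dimension table stores it.
conn.execute(
    "INSERT INTO dim_customer VALUES (1, 'Jane Doe', 'Austin', 'TX', 'USA')"
)
row = conn.execute(
    "SELECT Customer_full_name, Customer_state FROM dim_customer"
).fetchone()
print(row)  # ('Jane Doe', 'TX')
conn.close()
```

Note how the table holds only a surrogate key plus descriptive attributes; the numeric measures live in the fact table, which references `Customer_key`.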
October 23, 2014

What is a Fact Table?

A fact table is a table that joins dimension tables with measures. For example, let's say you wanted to know the time worked by employees, by location, by project, and by task. If you had a dimension each for employees, location, project, and task, you would create a composite primary key from these foreign keys and add an additional column for the time-worked measure (more on measures in a little bit).

Keep in mind that fact tables, unlike dimension tables, are huge. They are usually built to contain from a hundred thousand records on the low side up to billions of records on the high side, so these tables must be normalized to be efficient. A fact table is normalized when it consists of the following two things:

A composite primary key made up of the foreign keys to the dimension tables
One or more measures used for numerical analysis

Fact tables contain the data corresponding to a particular business process. Each row represents a single event associated with that process and contains the measurement data associated with that event. For example, a retail organization might have fact tables for customer purchases, customer service telephone calls, and product returns. The customer purchases table would likely contain information about the amount of the purchase, any discounts applied, and the sales tax paid.

The information within a fact table is typically numeric data, and it is often data that can be easily manipulated, particularly by summing together many thousands of rows. For example, the retailer described above may wish to pull a profit report for a particular store, product line, or customer segment. The retailer can do this by retrieving the fact table rows that meet the specific criteria and then adding those rows together.

Fact tables contain keys to dimension tables as well as the measurable facts that data analysts want to examine. For example, a store selling automotive parts might have a fact table recording the sale of each item, the fact table of an educational institution could track credit hours awarded to students, and a bakery could have a fact table that records the manufacture of various baked goods.

Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, assembly-line analysts might be very concerned with the number of defective goods manufactured each hour. Similarly, a marketing data warehouse might be concerned with the activity of a consumer group with a specific income level rather than with purchases made by each individual.

Fact Table Grain

When designing a fact table, developers must pay careful attention to the grain of the table, the level of detail contained within it. The developer designing the purchase fact table described above would need to decide, for example, whether the grain of the table is a customer transaction or an individual item purchase. With an individual-item-purchase grain, each customer transaction would generate multiple fact table entries, one for each item purchased. The choice of grain is a fundamental decision made during the design process that can have significant impact on the business intelligence effort down the road.
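The time-worked example above, a composite primary key of dimension foreign keys plus a measure column, can be sketched in SQLite. All table, column, and key names here are illustrative assumptions, and the simple `date_key` integer stands in for a full time dimension.

```python
import sqlite3

# Grain of the fact table: one row per employee / project / day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_employee (employee_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_project  (project_key  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_time_worked (
        employee_key INTEGER REFERENCES dim_employee(employee_key),
        project_key  INTEGER REFERENCES dim_project(project_key),
        date_key     INTEGER,   -- would reference a time dimension
        hours_worked REAL,      -- the additive measure
        PRIMARY KEY (employee_key, project_key, date_key)  -- composite key
    );
    INSERT INTO dim_employee VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO dim_project  VALUES (10, 'Warehouse build');
    INSERT INTO fact_time_worked VALUES
        (1, 10, 20141001, 6.5),
        (2, 10, 20141001, 8.0),
        (1, 10, 20141002, 7.0);
""")
# Analysis is mostly summing the measure across many fact rows.
total = conn.execute(
    "SELECT SUM(hours_worked) FROM fact_time_worked WHERE project_key = 10"
).fetchone()[0]
print(total)  # 21.5
conn.close()
```

The composite primary key also enforces the grain: a second row for the same employee, project, and day would be rejected rather than silently double-counted.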
October 23, 2014

Difference between ER modelling and dimensional modelling

Criterion       | Dimensional modelling                                  | ER modelling
Redundancy      | More                                                   | Less
Optimization    | Querying, which can vary and involve huge volumes      | A pre-determined set of updates, inserts, and select queries
Data            | Archived and historical data                           | Current/recent data
Size            | GBs to TBs                                             | MBs to GBs
Physical model  | Denormalized                                           | Normalized
Complexity      | A logically grouped set of star schemas                | A complex maze of hundreds of entities linked to each other
Split           | Split by dimensions and facts                          | Split by entities
October 23, 2014

What is Dimensional Modelling?

Dimensional modelling is a set of techniques and concepts used in data warehouse design. Different modelling techniques give different advantages in the way we store data. Dimensional modelling aims to make it relatively easy to retrieve data, especially when querying large data sets. The other popular data modelling technique, Entity-Relationship (ER) modelling, is used for OLTP and aims to reduce redundancy and to perform well for both changes and retrieval of data. At a high level, effective dimensional models reflect the natural performance measurements and descriptive attributes of a business.

Goals of dimensional modelling:

Understandability: data is grouped by business categories, so it is easier to understand
Query performance: denormalized for query performance
Extensibility: new, unexpected data can be added

How to do dimensional modelling? In dimensional modelling, everything is divided into two categories: facts/measures and dimensions/contexts.

Dimension: a category of information, for example the time dimension.
Attribute: a unique level within a dimension; for example, Month is an attribute of the time dimension.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Step 1: Identify the dimensions
Step 2: Identify the measures
Step 3: Identify the attributes or properties of the dimensions
Step 4: Identify the granularity of the measures
Step 5: History preservation (optional)
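The fact-to-lookup-table relationships described above can be sketched as a tiny star schema in SQLite. This is a sketch under assumptions: all table names and sample values are invented for illustration.

```python
import sqlite3

# Star schema: the fact table joins to each lookup (dimension) table,
# and the lookup tables never join directly to one another.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_time  (time_key  INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city  TEXT);
    CREATE TABLE fact_sales (
        time_key    INTEGER REFERENCES dim_time(time_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        sale_amount REAL
    );
    INSERT INTO dim_time  VALUES (1, '2014-09'), (2, '2014-10');
    INSERT INTO dim_store VALUES (1, 'Austin');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 250.0), (2, 1, 50.0);
""")
# Group the measure by an attribute (month) of the time dimension.
rows = conn.execute("""
    SELECT t.month, SUM(f.sale_amount)
    FROM fact_sales f JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY t.month ORDER BY t.month
""").fetchall()
print(rows)  # [('2014-09', 100.0), ('2014-10', 300.0)]
conn.close()
```

Here `month` plays the role of an attribute: a non-key column of the lookup table used to group and label the measure.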
October 22, 2014

Data Staging Area

What is the data staging area? The data staging area sits between the data source(s) and the data target(s). This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart, and having one common area makes subsequent data processing / integration easier. Here data gains its "intelligence", as logic is applied to transform it from a transactional nature to an analytical nature. This area is also where data cleansing happens. The ETL design phase is often the most time-consuming phase in a data warehousing project, and an ETL tool is often used in this area. The staging area is sometimes also referred to as the staging layer.

Why a data staging area? Staging areas can be designed to provide many benefits, but the primary motivations for their use are to increase the efficiency of ETL processes, to ensure data integrity, and to support data quality operations. Due to varying business cycles, data processing cycles, hardware and network resource limitations, and geographical factors, it is not feasible to extract all the data from all operational databases at exactly the same time. Not every business requires a staging area, however; many use ETL to copy data directly from operational databases into the data warehouse.

The functions of the staging area include the following:

Consolidation
Alignment
Minimizing contention
Change detection
Data cleansing
Data archival
Troubleshooting

Data staging areas are often transient in nature, with their contents erased either prior to running an ETL process or immediately after its successful completion. Staging areas can be implemented as tables in relational databases, as text-based flat files (or XML files) stored in file systems, or as proprietary binary files stored in file systems. Staging area architectures range in complexity from a set of simple relational tables in a target database to self-contained database instances or file systems.
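A transient staging flow of the kind described above (land raw extracts, cleanse, load, then erase) might look like the following SQLite sketch. The staging and target table names and the cleansing rules are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (name TEXT, state TEXT);  -- staging area
    CREATE TABLE dw_customer  (name TEXT, state TEXT);  -- warehouse target
""")

# 1. Extract: land the raw source rows, warts and all, in staging.
raw = [("  jane doe ", "tx"), ("Bob Ray", " CA")]
conn.executemany("INSERT INTO stg_customer VALUES (?, ?)", raw)

# 2. Cleanse + load: trim whitespace and normalise case on the way in.
conn.execute("""
    INSERT INTO dw_customer
    SELECT TRIM(name), UPPER(TRIM(state)) FROM stg_customer
""")

# 3. Erase staging after a successful load (transient contents).
conn.execute("DELETE FROM stg_customer")

loaded = conn.execute("SELECT * FROM dw_customer").fetchall()
staged_left = conn.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]
print(loaded, staged_left)  # [('jane doe', 'TX'), ('Bob Ray', 'CA')] 0
```

Keeping the raw rows in staging until the load succeeds is what makes troubleshooting possible: a failed load can be rerun from staging without re-extracting from the operational system.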
October 22, 2014

OLTP vs OLAP (Data Warehousing) systems

A tabular comparison of an OLTP (transaction processing) system with a data warehouse / OLAP system, covering the major criteria, with some examples.

Indexes
OLTP: Few
OLAP: Many

Joins
OLTP: Many
OLAP: Some

Data duplication
OLTP: Normalized database (usually 3NF, or third normal form) to optimize update/insert/delete performance and to guarantee data consistency.
OLAP: De-normalized or partially de-normalized schemas (such as a star schema) to optimize query performance.

Derived data & aggregates
OLTP: Rare
OLAP: Common

Workload
OLTP: Only specifically designed queries will execute.
OLAP: Supports ad hoc queries, but the workload can be planned ahead.

Tuning
OLTP: Optimized for a predetermined set of queries.
OLAP: Optimized for a wide array of queries.

Data modifications
OLTP: End users routinely issue individual data modification statements against the database.
OLAP: The warehouse is updated on a regular basis by the ETL process (run nightly or weekly) from the source data using bulk data modification techniques; end users do not update the warehouse directly.

Typical operations
OLTP: A typical operation accesses one or a handful of records, e.g. "Retrieve the current order for this customer."
OLAP: A typical query scans thousands or millions of rows, e.g. "Find the total fees charged for all customers in the last year."

Historical data
OLTP: Usually stores only the current version of the data, or as many recent versions as needed to meet the requirements of the current transaction.
OLAP: Usually stores many months or years of data, to support historical analysis.
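The two "typical operations" above can be contrasted concretely in SQLite. The `orders` table, its columns, and the sample values are all illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,"
    " year INTEGER, fee REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, 101, 2013, 10.0), (2, 101, 2014, 12.5), (3, 202, 2014, 7.5),
])

# OLTP-style: fetch the current order for one customer (a point lookup
# touching a handful of rows).
current = conn.execute(
    "SELECT order_id FROM orders"
    " WHERE customer_id = 101 ORDER BY year DESC LIMIT 1"
).fetchone()[0]

# OLAP-style: total fees for all customers in the last year (a scan
# and aggregate over many rows).
total = conn.execute(
    "SELECT SUM(fee) FROM orders WHERE year = 2014"
).fetchone()[0]
print(current, total)  # 2 20.0
conn.close()
```

The difference in access pattern is why the two systems are indexed, normalized, and tuned so differently: the first query wants a fast key lookup, the second wants cheap sequential scans and aggregation.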
October 21, 2014

Data Warehouse Sources

The OLTP systems, transaction systems, live databases, operational systems, whatever they are known as in various fora, these are all source data for the data warehouse. The source data cannot normally be used for analytic information because the source systems are:

Not optimized for reporting
Built on data schemas optimized for transactions, not queries
Difficult to share data across
Typically not keepers of historical data

[Figure 1: Data warehouse architecture]
October 21, 2014

Data Warehouse Lifecycle

The data warehouse lifecycle involves the creation of an efficient system for storing and retrieving computer data. Time has shown that simply dumping data into vast computer storage does not work. Instead, it is best to create a storage system, test it, and alter it as needed to fit the ever-changing data needs of the company.

Design: the development, from both available data inventories and DSS analyst requirements and analytic needs, of robust star-schema-based dimensional data models. Practically speaking, the best data warehousing practitioners today work simultaneously from (a) the available data inventory in the organization and (b) the often incomplete expression of needs on the part of multiple heterogeneous analytic communities, making data warehouse and data mart design a complex and intricate process more akin to architecture and diplomacy than to traditional IT systems design. Key activities in this phase typically include end-user interview cycles, source system cataloging, definition of key performance indicators and other critical business metrics, mapping of the decision-making processes underlying information needs, and logical and physical schema design tasks, which feed the prototyping phase of the lifecycle model quite directly.

Prototype: the deployment, for a select group of opinion-makers and leading practitioners in the end-user analytic communities, of a populated, working model of a data warehouse or data mart design, suitable for actual use. The purpose of prototyping shifts as the design team moves back and forth between design and prototype. In early prototypes, the primary objective is to constrain, and in some cases re-frame, end-user requirements by showing opinion-leaders and heavyweight analysts in the end-user communities precisely what they had asked for in the previous iteration. As the gap between stated needs and actual needs closes over the course of two or more design-prototype iterations, the purpose of the prototype shifts toward diplomacy: gaining commitment to the design from opinion leaders in the end-user communities, and soliciting their assistance in winning similar commitment for the project at hand.

Deploy: the formalization of a user-approved prototype for actual production use, including the development of documentation, training, O&M processes, and the host of activities traditionally associated with enterprise IT system deployment. Deployment typically involves at least two separate deployments: of a prototype into a production-test environment, and of a stress-tested, performance-tested production configuration into the actual production environment. It is at this phase that the single most often-neglected component of a successful production warehouse, documentation, stalls both the transition to operations and management personnel (who cannot manage what they cannot understand) and to end-user organizations (who, after all, have to be taught at some level about the data and metadata the warehouse or mart contains prior to roll-out).

Operation: the day-to-day maintenance of the data warehouse or mart, of the data delivery services and client tools that give DSS analysts access to warehouse and mart data, and of the ongoing extraction, transformation, and loading processes (from traditional drop-and-reload bulk updates to cutting-edge near-real-time trickle-charging processes) that keep the warehouse or mart current with respect to the authoritative transaction source systems.

Enhancement: the modification of (a) physical technological components, (b) operations and management processes (including ETL regimes and scheduling), and (c) logical schema designs in response to changing business requirements. In cases where external business conditions change discontinuously, or organizations themselves undergo discontinuous change (as with asset sales, mergers, and acquisitions), enhancement moves seamlessly back into fundamental design.
October 18, 2014

Types of Data Warehouse

To understand the various data warehouse architectures, we first need to understand the various components in the enterprise that contribute to the data warehouse. Depending on how the data is stored and accessed, the commonly noted architectures are:

Independent data marts (IDM)
Data mart bus architecture with linked dimensional data marts (DBA)
Hub and spoke architecture (HAS)
Federated architecture (FED)
Centralized architecture

Another way of classifying data warehouse architectures is by the number of middleware layers between the operational systems and the analytical tools: the single-layer architecture and the N-layer architecture.
April 29, 2014

Why Data Warehouse?

Why do we need a data warehouse? Heterogeneities are everywhere: enterprises tend to rely on different databases, files, external sources, and even the internet and social media for all their information. This means:

Different interfaces
Different data representations
Duplicate and inconsistent information

Data management in large enterprises suffers from vertical fragmentation of informational systems ("vertical stove pipes"), the result of application (user)-driven development of operational systems.

The goal of building a warehouse is to have a system that:

Collects and combines information
Provides an integrated view and a uniform user interface
Supports sharing