Data Warehouse lifecycle

The data warehouse lifecycle involves the creation of an efficient system for storing and retrieving computer data. Time has shown that simply dumping data into vast computer storage does not work. Instead, it is best to create a storage system, test it, and alter it as needed to fit the ever-changing data needs of the company.


Design:

 the development, from both available data inventories and DSS analyst requirements and analytic needs, of robust star-schema-based dimensional data models. Practically speaking, we believe that the best data warehousing practitioners today are working simultaneously from (a) the available data inventory in the organization and (b) the often incomplete expression of needs on the part of multiple heterogeneous analytic communities, making data warehouse and data mart design a complex and intricate process more akin to architecture and diplomacy than to traditional IT systems design. Key activities in this phase typically include end-user interview cycles, source system cataloging, definition of key performance indicators and other critical business metrics, mapping of decision-making processes underlying information needs, and logical and physical schema design tasks, which feed the prototyping phase of the life-cycle model quite directly.

Prototype:

the deployment, for a select group of opinion-makers and leading practitioners in the end-user analytic communities, of a populated, working model of a data warehouse or data mart design, suitable for actual use. The purpose of prototyping shifts as the design team moves back and forth between design and prototype. In early prototypes, the primary objective is to constrain and in some cases re-frame end-user requirements by showing opinion leaders and heavyweight analysts in the end-user communities precisely what they had asked for in the previous iteration. As the gap between stated needs and actual needs closes over the course of two or more design-prototype iterations, the purpose of the prototype shifts toward diplomacy – gaining commitment to the design from opinion leaders in the end-user communities, and soliciting their assistance in gaining similar commitment from the rest of those communities.

Deploy:

the formalization of a user-approved prototype for actual production use, including the development of documentation, training, O&M processes and the host of activities traditionally associated with enterprise IT system deployment. Deployment typically involves at least two separate deployments – the deployment of a prototype into a production-test environment, and the deployment of a stress-tested, performance-tested production configuration into an actual production environment. It is at this phase that the single most often-neglected component of a successful production warehouse – documentation – stalls both the transition to operations and management personnel (who cannot manage what they cannot understand) and the roll-out to end-user organizations (who, after all, have to be taught at some level about the data and metadata the warehouse or mart contains prior to roll-out).

Operation:

the day-to-day maintenance of the data warehouse or mart, the data delivery services and client tools that provide DSS analysts with their access to warehouse and mart data, and the management of ongoing extraction, transformation and loading processes (from traditional drop-and-reload bulk updates to cutting-edge near-real-time trickle-charging processes) that keep the warehouse or mart current with respect to the authoritative transaction source systems.
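
As a rough illustration of the difference between a drop-and-reload bulk update and a near-real-time trickle feed, here is a minimal Python sketch. It is not tied to any particular ETL tool; the row structure and the updated_at watermark column are hypothetical.

from datetime import datetime

def full_reload(source_rows, target):
    # Drop-and-reload: empty the target and bulk-load everything from the source.
    target.clear()
    target.extend(source_rows)

def trickle_feed(source_rows, target, last_loaded_at):
    # Incremental load: append only rows changed since the previous watermark,
    # then advance the watermark for the next run.
    new_rows = [r for r in source_rows if r["updated_at"] > last_loaded_at]
    target.extend(new_rows)
    return max((r["updated_at"] for r in new_rows), default=last_loaded_at)

source = [
    {"id": 1, "amount": 10.0, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "amount": 25.0, "updated_at": datetime(2024, 1, 3)},
]
warehouse = []
watermark = trickle_feed(source, warehouse, datetime(2024, 1, 2))
print(len(warehouse), watermark)  # only the row updated after the watermark is loaded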

Enhancement:

the modification of (a) physical technological components, (b) operations and management processes (including ETL regimes and scheduling), and (c) logical schema designs in response to changing business requirements.

In cases where external business conditions change discontinuously, or organizations themselves undergo discontinuous changes (as in the case of asset sales, mergers and acquisitions), enhancement moves seamlessly back into fundamental design.

Types of Data Warehouse

To understand the various Data Warehouse architectures, we first need to understand the various components in the enterprise that contribute to the Data Warehouse.

Components of Data Warehouse Architecture


Depending on how the data is stored and accessed, the various architectures are:

  • Independent data marts (IDM)
  • Data mart bus architecture with linked dimensional data marts (DBA)
  • Hub-and-spoke architecture (HAS)
  • Federated architecture (FED)
  • Centralized architecture

Another way of classifying Data Warehouse architectures is by the number of middleware layers between the operational systems and the analytical tools: the single-layer architecture and the N-layer architecture.

Other Types:

  • Types of Dimension Tables
  • Types of Hashed Files
  • Types of Dimensional Modeling

What is Merge stage?

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links (according to the DataStage documentation).

The Merge stage combines a master data set with one or more update data sets based on the key columns. The output record contains all the columns from the master record plus any additional columns from each update record that are required.

A master record and an update record are merged only if both have the same key column values.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.

As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.

Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.
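
To make the behaviour concrete, here is a minimal Python sketch of the same idea – not DataStage code, just an illustration with one master data set, one update data set and hypothetical column names. Both inputs are assumed to be de-duplicated on the key, as described above.

def merge_stage(master, update, key):
    # One row per key is assumed in the update data set (duplicates already removed).
    update_by_key = {row[key]: row for row in update}
    merged, matched = [], set()
    for m in master:
        u = update_by_key.get(m[key])
        if u is not None:
            matched.add(m[key])
            # All master columns, plus any additional columns from the update record.
            merged.append({**m, **{k: v for k, v in u.items() if k not in m}})
        else:
            merged.append(dict(m))  # unmatched master rows kept (or optionally dropped)
    # Update rows that failed to match a master row go to that link's reject output.
    rejects = [u for k, u in update_by_key.items() if k not in matched]
    return merged, rejects

master = [{"cust_id": 1, "name": "A"}, {"cust_id": 2, "name": "B"}]
update = [{"cust_id": 2, "city": "Pune"}, {"cust_id": 9, "city": "Delhi"}]
out, rej = merge_stage(master, update, "cust_id")
# out[1] == {"cust_id": 2, "name": "B", "city": "Pune"}; rej == [{"cust_id": 9, "city": "Delhi"}]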

Popular ETL Tools companies use


There are many ETL tools on the market, and the task is to choose the one that is appropriate for you. Some of them are quite expensive, and some may be too complex if you do not need to transform a lot of data, use many sources, or use sophisticated features.

Check out the Informatica vs DataStage comparison.

Top picks out in the market –
1. Informatica
2. IBM DataStage
3. Ab Initio
4. Oracle
5. SAP Business Objects Data Integrator
6. Microsoft SSIS

Top Challengers –
1. SAS/DataFlux
2. iWay Software
3. Talend
4. Syncsort
5. Pervasive Software
6. Pitney Bowes Insight

Competitive price vendors –
1. Informatica Cloud edition
2. expressor-software (pricing based on channels i.e. multiple processing)

Open Source vendors –
1. Talend
2. Apatar
3. Pentaho Kettle
4. SnapLogic

Cloud/SaaS Based vendors –
1. Informatica
2. SnapLogic
Both were started by entrepreneur Gaurav Dhillon.

Top Pipeline Partitioning vendors –
1. IBM DataStage (process based)
2. Informatica (multi thread based)
3. expressor-software (hybrid based)

Top Message oriented/Real-time processing vendors –
1. IBM DataStage
2. Informatica

Best salesforce.com Integration vendors –
1. Informatica (both Cloud edition based and adaptor based support)
2. IBM DataStage (Adaptor/Pack based support)

Top ELT architecture based vendors –
1. Talend (excellent ELT based objects to drag and drop in the designer)
2. IBM DataStage (provides options to create tables before loading)

When to use join stage vs lookup stage?

The major consideration is the volume of the data. One of the most common mistakes developers make is not doing a volumetric analysis before deciding whether to use a Join or a Lookup stage.

There are two data sets being combined. One is the primary or driving data set, sometimes called the left side of the join. The other is the reference data set, or the right side of the join.

If the reference data sets take up a large amount of memory relative to the physical RAM of the machine DataStage is running on, a Lookup stage might fail because the reference data sets may not fit in RAM along with everything else that has to be there; in that case a Join is preferred. Conversely, if the reference data sets are small enough not to cause trouble, use a Lookup. Large reference data sets also make Lookups slow, since each lookup operation can, and typically will, cause a page fault and an I/O operation.

Physical architecture Vs Logical architecture in Oracle Data Integrator

The Physical Architecture is where we define the data server (database) name and connection details for the particular database we are connecting to.

A Logical Architecture is a logical name for a Physical Architecture. Physical Architectures are defined only at design time, while Logical Architectures are what get used at run time; for example, when creating models and interfaces we work with Logical Architectures.

At run time the Logical Architecture is resolved through a context: at design time we create the logical schema name and the physical schema name, which are internally connected by a context. The context points to the particular Physical Architecture, and the Physical Architecture connects to the defined server (database).
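
The resolution chain described above (logical schema + context → physical schema → data server) can be pictured as a simple lookup. The names below are purely hypothetical, for illustration only.

# Hypothetical topology definitions.
physical_schemas = {
    "ORA_SALES_DEV":  {"data_server": "dev_db.example.com",  "schema": "SALES"},
    "ORA_SALES_PROD": {"data_server": "prod_db.example.com", "schema": "SALES"},
}

# Each context maps a logical schema to one physical schema.
contexts = {
    "DEVELOPMENT": {"LOG_SALES": "ORA_SALES_DEV"},
    "PRODUCTION":  {"LOG_SALES": "ORA_SALES_PROD"},
}

def resolve(logical_schema, context):
    # Design time works with the logical name; run time picks the physical schema via the context.
    return physical_schemas[contexts[context][logical_schema]]

print(resolve("LOG_SALES", "PRODUCTION"))  # -> prod_db.example.com / SALES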

 

Creating a Data Server

A Data Server corresponds, for example, to a database, a JMS server instance, a scripting engine, or a file system accessed by Oracle Data Integrator in the integration flows. Under a data server, subdivisions are created in the form of Physical Schemas.

A context is used to connect the Physical Architecture and the Logical Architecture.

Adding zeroes or spaces to strings and get fixed length

Using the Transformer stage, how do we get fixed-length strings in the output stream? In the example below, we see how to add zeroes to make each string a given length.

The source has the following data:

AB1

Aim2

Abnv5

1An8bx

and the target/output must have:

 

000AB1

00Aim2

0Abnv5

1An8bx

 

How do we achieve this?

Str('0', 6 - Len(string)) : string
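
Outside the Transformer, the same fixed-length padding can be sketched in a few lines of Python (assuming, as in the example, a target width of 6):

def pad_to_fixed_length(value, width=6, pad_char="0"):
    # Left-pad with zeroes so the result is exactly `width` characters,
    # analogous to Str('0', 6 - Len(string)) : string in the Transformer.
    return value.rjust(width, pad_char)

for s in ["AB1", "Aim2", "Abnv5", "1An8bx"]:
    print(pad_to_fixed_length(s))  # 000AB1, 00Aim2, 0Abnv5, 1An8bx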

Find if SMP or MPP

Look in the configuration file: if the "fastname" has the same value for all nodes, then you are running on SMP.

If there are different "fastname" values, then it is MPP.

For Server jobs it does not matter whether the system is SMP or MPP, because Server jobs execute only on the machine where DataStage is installed.
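
The same check can be scripted. The sketch below is a rough Python illustration that assumes the usual fastname "hostname" syntax of the parallel configuration file; the file path in the usage comment is only an example.

import re

def smp_or_mpp(config_path):
    # Collect every fastname in the configuration file:
    # one distinct value means SMP, more than one means MPP.
    with open(config_path) as f:
        fastnames = set(re.findall(r'fastname\s+"([^"]+)"', f.read()))
    return "SMP" if len(fastnames) <= 1 else "MPP"

# Example (path is illustrative): print(smp_or_mpp("default.apt"))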

Troubleshoot Datastage compilation errors

 


If you get an error during compile, check the following:

Compilation problems:

  • If a Transformer stage is used, check the C++ compiler and LD_LIBRARY_PATH
  • If there are Buildop errors, try running buildop from the command line
  • Some stages may not support RCP – this can cause a column mismatch
  • Use the Show Error and More buttons
  • Examine the generated OSH
  • Check the environment variable settings