What is the Merge stage?

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links (according to the DataStage documentation).

The Merge stage combines a master data set with one or more update data sets based on the key columns. The output record contains all the columns from the master record plus any additional columns from each update record that are required.

A master record and an update record will be merged only if both have the same key column values.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.

As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.

Unlike the Join and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific to that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.
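
To make these semantics concrete, here is a minimal Python sketch of what a merge with one update link does conceptually; the column names and data are made up, and this is an illustration of the idea rather than DataStage code:

```python
# Conceptual sketch of Merge stage behaviour (illustrative only, not DataStage code).
# Master and update rows are matched on a key; unmatched update rows go to a
# per-update-link reject list, and unmatched master rows can be kept or dropped.

master = [
    {"cust_id": 1, "name": "Ms. Brown"},
    {"cust_id": 2, "name": "Mr. Green"},
]
updates = [  # one update data set (one update link)
    {"cust_id": 1, "city": "Houston"},
    {"cust_id": 3, "city": "Dallas"},   # no matching master row -> rejected
]

update_by_key = {row["cust_id"]: row for row in updates}

merged = []
matched_keys = set()

for m in master:
    u = update_by_key.get(m["cust_id"])
    if u is not None:
        matched_keys.add(m["cust_id"])
        # master columns plus any additional columns from the update record
        merged.append({**m, **{k: v for k, v in u.items() if k != "cust_id"}})
    else:
        merged.append(m)  # unmatched master kept (or dropped, if so configured)

rejects = [u for k, u in update_by_key.items() if k not in matched_keys]

print(merged)   # merged output records
print(rejects)  # update rows that failed to match a master row
```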

When to use the Join stage vs. the Lookup stage?

The major consideration is the volume of the data. One of the most common mistakes developers make is not doing a volumetric analysis before deciding between the Join and Lookup stages.

There are two data sets being combined. One is the primary or driving data set, sometimes called the left side of the join. The others are the reference data sets, or the right side of the join.

Join vs. Lookup
If the reference data sets take up a large amount of memory relative to the physical RAM of the machine DataStage is running on, then a Lookup stage might crash because the reference data sets may not fit in RAM along with everything else that has to be in RAM. In this case, Join is the preferred choice. If the reference data sets are small enough not to cause trouble, use a Lookup. Large reference data sets also cause slow Lookup performance, since each lookup operation can, and typically will, cause a page fault and an I/O operation.
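
As a rough Python analogy (made-up data, not DataStage internals), a Lookup builds an in-memory table from the whole reference data set, while a Join works through two key-sorted streams and never needs the full reference data set in RAM:

```python
# Rough analogy of the memory trade-off between Lookup and Join.

driving   = [(1, "a"), (2, "b"), (4, "d")]   # primary / left input
reference = [(1, "X"), (2, "Y"), (3, "Z")]   # reference / right input

# Lookup-style: the entire reference data set is loaded into memory first.
ref_map = dict(reference)                     # memory grows with reference size
lookup_result = [(k, v, ref_map.get(k)) for k, v in driving]

# Join-style: both inputs are sorted on the key and merged as streams,
# so only the current row from each side has to be held at once.
def merge_join(left, right):
    li = ri = 0
    out = []
    while li < len(left) and ri < len(right):
        lk, rk = left[li][0], right[ri][0]
        if lk == rk:
            out.append((lk, left[li][1], right[ri][1]))
            li += 1
        elif lk < rk:
            li += 1
        else:
            ri += 1
    return out

join_result = merge_join(sorted(driving), sorted(reference))
print(lookup_result)
print(join_result)
```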

Why a Data Warehouse?

Why do we need a data warehouse?

Heterogeneities are everywhere

Enterprises tend to rely on different databases, files, external sources and even the internet and social media for all the information. This means:

Different interfaces
Different data representations
Duplicate and inconsistent information

 
Data Management in Large Enterprises
Vertical fragmentation of informational systems (vertical stove pipes)
Result of application (user)-driven development of operational systems

The goal of building a warehouse is to have a system that:
 
Collects and combines information
Provides integrated view, uniform user interface
Supports sharing



Slowly Changing Dimension (SCD) in DataStage

To understand what a Slowly Changing Dimension is, we first need to understand how fact and dimension tables change:
Most frequently, we insert into fact tables. Very infrequently we update facts that were loaded incorrectly. It is even less likely that we delete rows from a fact table; about the only time this happens is when archiving old data that is no longer relevant to the current portion of the data warehouse.
Dimension tables, on the other hand, are likely to see frequent updates. A classic example is a customer dimension: people change names, phone numbers, addresses, etc. regularly.

Dimensions that are changing over time are referred to as slowly changing dimensions (SCD).


Type 1: overwrite the existing value

Customer_key Customer_name Customer_city Customer_state
123 Ms. Brown Houston Texas
123 Ms. Brown Dallas Texas
Slowly Changing Dimension – SCD Type 1

Type 2: create a new record with effective dates

Customer_key Customer_name Customer_city Customer_state Effective_date Obsolete_date
123 Ms. Brown Houston Texas 1/1/2000 1/1/2005
234 Ms. Brown Dallas Texas 1/1/2005 NULL
Slowly Changing Dimension – SCD Type 2

Type 3: add a column for the new value alongside the old one; no new rows

Customer_key Customer_name Customer_city Customer_state Change_date New_city
123 Ms. Brown Houston Texas 1/1/2005 Dallas
Slowly Changing Dimension – SCD Type 3
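
As a rough illustration of the Type 1 and Type 2 behaviour shown in the tables above, here is a minimal Python sketch; the column names, dates and surrogate-key handling are simplifying assumptions, not the DataStage SCD stage itself:

```python
from datetime import date

# Current dimension row for customer 'Ms. Brown' (illustrative columns).
dimension = [
    {"customer_key": 123, "customer_name": "Ms. Brown",
     "customer_city": "Houston", "customer_state": "Texas",
     "effective_date": date(2000, 1, 1), "obsolete_date": None},
]

def apply_scd_type1(dim, name, new_city):
    """Type 1: overwrite the existing value; history is lost."""
    for row in dim:
        if row["customer_name"] == name and row["obsolete_date"] is None:
            row["customer_city"] = new_city

def apply_scd_type2(dim, name, new_city, change_date, next_key):
    """Type 2: expire the old row and insert a new row with effective dates."""
    for row in dim:
        if row["customer_name"] == name and row["obsolete_date"] is None:
            row["obsolete_date"] = change_date           # close the old version
            dim.append({**row,
                        "customer_key": next_key,        # new surrogate key
                        "customer_city": new_city,
                        "effective_date": change_date,
                        "obsolete_date": None})
            break

# Type 1 is shown above only for contrast; here we apply Type 2.
apply_scd_type2(dimension, "Ms. Brown", "Dallas", date(2005, 1, 1), next_key=234)
for row in dimension:
    print(row)
```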

What is a Data Warehouse?

A data warehouse is a central, integrated database containing data from all the operational sources and archive systems in an organization. It contains a copy of transaction data specifically structured for query and analysis. This database can be accessed by all users, ensuring that each group in the organization is accessing valuable, stable data.
Data Warehouse
A data warehouse is a structured repository of historic data.
It is developed in an evolutionary process by integrating data from non-integrated legacy systems.
It is usually:
• Subject oriented
• Integrated
• Time variant
• Non-volatile


DataStage History

In the beginning DataStage was available under the name Ardent DataStage; its name was then changed to Ascential DataStage.


In 2005 IBM acquired Ascential and added the DataStage product to its WebSphere family. The October 2006 release integrated DataStage into the new IBM Information Server platform, but it was still called WebSphere DataStage.


IBM WebSphere DataStage is one of the products in IBM’s WebSphere Information Integration suite and Information Server. The tool uses mostly graphical notation to present data integration solutions. DataStage is delivered in various versions, such as Server and Enterprise Edition. In 2008 the suite was renamed InfoSphere Information Server and the product was renamed InfoSphere DataStage.

For more detailed early history, see below:

  • DataStage was conceived at VMark, a spin-off from Prime Computers that developed two notable products: the UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in 1996.
  • Lee Scheffler presented the DataStage product overview to the board of VMark in 1996 and it was approved for development. The product was in alpha testing in October, beta testing in November, and was generally available in 1997.
  • VMark acquired UniData in 1997 and renamed itself Ardent Software. In 1999 Ardent Software was acquired by Informix, the database software vendor.
  • In 2001 IBM acquired Informix and took just the database business, leaving the data integration tools to be spun off as an independent software company called Ascential Software.

People refer to various DataStage options using these edition names:

 

Enterprise Edition: a name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
Server Edition: the name of the original version of DataStage, representing Server Jobs. Early DataStage versions contained only Server Jobs; DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.
MVS Edition: mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
DataStage for PeopleSoft: a Server Edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopleSoft and Oracle Corporation.
DataStage TX: for processing complex transactions and messages; formerly known as Mercator.
DataStage SOA: the Real Time Integration pack, which can turn server or parallel jobs into SOA services.

What are Stage Variables, Derivations and Constants?

Stage Variable – an intermediate processing variable that retains its value while rows are read and does not pass its value to a target column.
Derivation – an expression that specifies the value to be passed on to the target column.
Constant – a condition that is either true or false and that controls the flow of data along a link.
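
As a rough Python analogy (not DataStage expression syntax), the three concepts can be sketched like this: a stage variable keeps its value from row to row, a derivation computes the value written to the target column, and the true/false condition decides whether a row flows down the link:

```python
# Rough analogy only -- real jobs use DataStage expressions, not Python.
rows = [{"amount": 100}, {"amount": 250}, {"amount": 40}]

running_total = 0                      # "stage variable": retained across rows,
                                       # never written to the target by itself
output = []
for row in rows:
    running_total += row["amount"]     # stage variable updated on every row
    derived = row["amount"] * 1.1      # "derivation": value passed to the target column
    if row["amount"] > 50:             # "constant": condition controlling the link
        output.append({"amount_with_tax": round(derived, 2),
                       "running_total": running_total})

print(output)
```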

What is the IIS Suite?

IIS =  IBM InfoSphere Information Server

IIS from IBM could mean:

  • InfoSphere Information Server Workgroup Edition (IIS WE) = data integration or ETL (DataStage, QualityStage, Information Analyzer, and Metadata Workbench)
  • InfoSphere Information Server for Data Warehousing (IIS DW) = data warehousing for the Smart Analytics System

IIS consists of:

  • DataStage: IBM’s main ETL and data integration tool
  • QualityStage: IBM’s main data quality tool (needs to be licensed separately)
  • FastTrack: write the mapping spec to generate DataStage and QualityStage jobs and reports
  • Business Glossary: to edit business meaning/data dictionary
  • Blueprint Director: links everything together: metadata, ETL, data model
  • Information Analyzer: to understand the content, structure and quality of the data
  • Information Services Director: to deploy DataStage/QualityStage jobs as web services or EJBs
  • Metadata Workbench: creates data lineage between databases, files and BI models
  • Metadata Server: stores operational metadata, such as how many rows were written


OLAP cubes

OLAP – What is OLAP?

OLAP stands for On-Line Analytical Processing. OLAP is a technology used to organize large business databases and support business intelligence. OLAP databases are divided into one or more cubes, and each cube is organized and designed by a cube administrator to fit the way you retrieve and analyze data, so that reports are easier to create and use.


 

OLAP database & cubes

Online Analytical Processing (OLAP) databases facilitate business-intelligence queries. OLAP is a database technology that has been optimized for querying and reporting, instead of processing transactions. The source data for OLAP is Online Transactional Processing (OLTP) databases that are commonly stored in data warehouses. OLAP data is derived from this historical data, and aggregated into structures that permit sophisticated analysis. OLAP data is also organized hierarchically and stored in cubes instead of tables. It is a sophisticated technology that uses multidimensional structures to provide rapid access to data for analysis.


 

 

OLAP – volumes, complexity

Online Analytical Processing (OLAP) is characterized by a relatively low volume of transactions, but the queries are often very complex, so response times in an OLAP system are longer than in an OLTP system. An OLAP database stores aggregated, historical data in multi-dimensional schemas.

The first attempt to provide a definition of OLAP was by Dr. Codd, who proposed 12 rules for OLAP.

Not to be confused with OLTP (Online Transaction Processing).


For people on the business side, the key feature of OLAP is that it is multidimensional: the ability to analyze metrics across different dimensions such as time, geography, gender, product, etc. For example, sales for the company are up. Which region is most responsible for this increase? Which store in this region is most responsible for the increase? Which product category or categories contributed the most to the increase? Answering these types of questions in order means that you are performing an OLAP analysis.
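
A toy Python example (with made-up sales rows) of what analyzing a metric across dimensions means: the same fact, sales amount, is rolled up by region, then drilled down to store and category:

```python
from collections import defaultdict

# Made-up sales facts with region / store / category dimensions.
sales = [
    {"region": "West", "store": "W1", "category": "Toys",  "amount": 120},
    {"region": "West", "store": "W2", "category": "Games", "amount": 300},
    {"region": "East", "store": "E1", "category": "Toys",  "amount": 150},
    {"region": "West", "store": "W2", "category": "Toys",  "amount": 80},
]

def rollup(rows, *dims):
    """Aggregate the 'amount' measure over the given dimension columns."""
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["amount"]
    return dict(totals)

print(rollup(sales, "region"))              # Which region drives sales?
print(rollup(sales, "region", "store"))     # Drill down: which store in each region?
print(rollup(sales, "region", "category"))  # Slice by product category instead
```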

Depending on the underlying technology used, OLAP can be broadly divided into two different camps: MOLAP and ROLAP.


Sort Stage Basics

As in many data-related processes, sorting plays an important part in ETL transformations using DataStage. In fact, sorted data is a prerequisite for a number of stages such as Remove Duplicates, Join, etc. There are two ways in which you can sort your data in DataStage. One way is to do inline sorts in any stage you want: for example, if you want your data sorted before using it in your Transformer, you simply go to the Input -> Partitioning tab and check the ‘Perform sort’ option. Even though sorting is achievable this way, in some cases people specifically go for the Sort stage because of the flexibility, power and variety of options it offers.
You can find the Sort stage in the Processing section of the palette. In order to perform the sort you must provide the keys on which you want the sort performed. If you want to perform the sort sequentially in a parallel job, you must set the Sort stage to run in sequential mode rather than parallel mode; this option is set in the Advanced tab of the Sort stage. The choice of running the sort in sequential or parallel mode is entirely up to you, based on your business requirement.
Where the sort is carried out is entirely dependent on how you have configured your APT_CONFIG_FILE. If you have set up a separate pool for the sort operation, then the sort will only use the scratch disks specified in the sort pool. If such a sort pool is not specified, then the scratch disks in the default pool are used for sorting. As a last option it will use the temporary disk space specified by the TMPDIR environment variable, and if that is not set, the /tmp folder is used.
Since most of you already know how to configure the Sort stage for basic use, I’ve decided to explain certain options in the Sort stage that might be useful when you design your job.
Sort Key mode
This option is present under each sort key. It is important mainly because it allows us to use the Sort stage efficiently, so that we have an optimally running sort operation. By default the value is set to ‘Sort’. The other two options are ‘Don’t Sort (Previously Sorted)’ and ‘Don’t Sort (Previously Grouped)’. If, out of the set of columns we are sorting on, we already know that the data is sorted on a particular column, then we can set the sort key mode for that key to Don’t Sort (Previously Sorted), which indicates that the input records have already been sorted on that column prior to this stage. This ensures that we don’t do any unnecessary sorts, which in turn improves job performance.
Suppose you have data as below:
A                             B                             C
1                              127                         AAA
1                              124                         BBB
1                              156                         CCC
2                              09                           DDD
2                              540                         EEE
Now, if we specify the sorting keys as A and B, and we are only interested in seeing the data within each group (specified by column A in this case) suitably sorted, then it makes sense to set the sort key mode for key ‘A’ to Don’t Sort (Previously Grouped) to indicate that the input records are already grouped by this particular column, but not sorted. This makes the Sort stage sort only on column B.
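
In Python terms, a rough analogy of what this setting saves is shown below; the data matches the example above, and this is an illustration of the idea rather than the tsort operator itself:

```python
from itertools import groupby

# Input is already grouped on column A (but not sorted within each group).
rows = [
    {"A": 1, "B": 127, "C": "AAA"},
    {"A": 1, "B": 124, "C": "BBB"},
    {"A": 1, "B": 156, "C": "CCC"},
    {"A": 2, "B": 9,   "C": "DDD"},
    {"A": 2, "B": 540, "C": "EEE"},
]

# "Don't Sort (Previously Grouped)" on A: only sort on B inside each existing group.
sorted_within_groups = [
    row
    for _, group in groupby(rows, key=lambda r: r["A"])
    for row in sorted(group, key=lambda r: r["B"])
]
for row in sorted_within_groups:
    print(row)
```
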
Sort Utility
There are two different sort utilities you can use in DataStage; they are mentioned below:
– DataStage – this is the default sort used by DataStage and uses the tsort operator. It does not require any other application to work.
– UNIX – as the name indicates, this carries out the sort operation using the UNIX sort command. For this, the psort operator is used by DataStage.
Cluster Key Change Column and Create Key Change Column
There will be requirements in your design where you want to know the first record in a group (specified by the sort keys) in your data set. This can easily be achieved by setting the key change option to true. Even though both of the above options do the same thing, which one you must use depends on the sort key mode specified. If you used the sort key mode Don’t Sort for the keys during sorting, then you have to use the ‘Cluster Key Change Column’ option. If you used the default sort key mode, then you need only use the ‘Create Key Change Column’ option. Setting these options ensures that an extra column is added to your output records; this column contains the value 1 if the record is the first record in its group (where a group is indicated by a change in the sort key value). All following records in that group have the value 0.
e.g. if the data shown above was sorted on key A and the key change option was set, then the output data would look like this:
A                             B                             C               ClusterKeyChange
1                              127                         AAA          1
1                              124                         BBB          0
1                              156                         CCC          0
2                              09                           DDD         1
2                              540                         EEE         0
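
A small Python sketch (illustrative only, not the stage itself) of how the key change column in the table above can be derived: the first row of each group of the sort key gets 1, all following rows get 0:

```python
# Derive a keyChange flag: 1 on the first row of each group of sort key A, else 0.
sorted_rows = [
    {"A": 1, "B": 127, "C": "AAA"},
    {"A": 1, "B": 124, "C": "BBB"},
    {"A": 1, "B": 156, "C": "CCC"},
    {"A": 2, "B": 9,   "C": "DDD"},
    {"A": 2, "B": 540, "C": "EEE"},
]

previous_key = object()            # sentinel that never equals a real key value
for row in sorted_rows:
    row["keyChange"] = 1 if row["A"] != previous_key else 0
    previous_key = row["A"]
    print(row)
```
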
Restrict memory usage
By default the value is 20 MB. This restricts the Sort stage to carrying out its operation within a certain amount of virtual memory on the node on which it is processed. You can increase this to increase your sort rate. However, you should remember that the size should never be more than the amount of physical memory actually available on the processing node.