Merge Stage Example

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links (according to the DataStage documentation).

The Merge stage combines a master dataset with one or more update datasets based on the key columns. The output record contains all the columns from the master record plus any additional columns from each update record that are required.
A master record and an update record are merged only if both have the same key column values.
The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.
As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.
Example:

Master dataset:
CUSTOMER_ID  CUSTOMER_NAME
1            UMA
2            POOJITHA

Update dataset 1:
CUSTOMER_ID  CITY     ZIP_CODE  SEX
1            CYPRESS  90630     M
2            CYPRESS  90630     F

Output:
CUSTOMER_ID  CUSTOMER_NAME  CITY     ZIP_CODE  SEX
1            UMA            CYPRESS  90630     M
2            POOJITHA       CYPRESS  90630     F
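
Before moving on to the configuration steps, here is a minimal Python sketch (illustrative only, not DataStage code) of the Merge stage semantics described above: a keyed merge of a master dataset with one update dataset, an unmatched-masters switch, and a reject list for update rows that find no master. The column values follow the example above; the merge() helper and its handling of duplicates are assumptions made for illustration.

# Minimal sketch of the Merge stage semantics (illustrative only, not DataStage code).
# Assumes both inputs are already partitioned and sorted on the key, as described above.
def merge(master_rows, update_rows, key, keep_unmatched_masters=True):
    """Merge one update dataset into the master dataset on `key`.
    Returns (output_rows, rejected_update_rows)."""
    updates_by_key = {}
    for row in update_rows:
        # For duplicate update keys, keep the first record seen and drop the rest.
        updates_by_key.setdefault(row[key], row)

    output, rejects = [], []
    seen_master_keys = set()
    for m in master_rows:
        k = m[key]
        if k in seen_master_keys:          # duplicate master rows are dropped
            continue
        seen_master_keys.add(k)
        u = updates_by_key.pop(k, None)
        if u is not None:
            output.append({**u, **m})      # master columns plus additional update columns
        elif keep_unmatched_masters:       # Unmatched Masters Mode = Keep
            output.append(dict(m))

    # Update rows with no matching master go to that update link's reject link.
    rejects.extend(updates_by_key.values())
    return output, rejects

master = [{"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
          {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJITHA"}]
update1 = [{"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "M"},
           {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "F"}]

out, rejected = merge(master, update1, key="CUSTOMER_ID")
for row in out:
    print(row)   # two merged rows, matching the Output table above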
Merge stage configuration steps:
Options:
Unmatched Masters Mode: Keep means that unmatched rows (those without any updates) from the master link are output; Drop means that unmatched rows are dropped instead.
Warn On Reject Updates: True to generate a warning when bad records from any update links are rejected.
Warn On Unmatched Masters: True to generate a warning when there are unmatched rows from the master link.
Partitioning: Hash on both the master input and the update input.
Compile and run the job.
Scenario 2:
Remove a record from updateds1 and check the output.
Check for the DataStage warnings in the job log, since we selected Warn On Unmatched Masters = True:
stg_merge,0: Master record (0) has no updates.
stg_merge,1: Update record (1) of data set 1 is dropped; no masters are left.
Scenario 3: Drop unmatched master records and capture reject records from updateds1.
Scenario 4: Insert a duplicate record with the same customer id into the master dataset and check the results.
Looking at the output, it is clear that the Merge stage automatically dropped the duplicate record from the master dataset.
Scenario 5: Add a new update dataset, updateds2, which contains the following data.
Update dataset 2:
CUSTOMER_ID  CITIZENSHIP
1            INDIAN
2            AMERICAN
We still have a duplicate row in the master dataset. If you compile the job with the above design you will get a compilation error like the one below, because the number of reject links must match the number of update links.
After correcting the link count and running the job, the output shows 2 rows, because we have a matching row for customer_id = 2 in updateds2.
Scenario 6: Add a duplicate row for customer_id = 1 in the updateds1 dataset.
Now we have a duplicate record in both the master dataset and updateds1. Run the job and check the results and the warnings in the job log.
No change in the results: the Merge stage automatically dropped the duplicate rows.
Scenario 7: Modify the duplicate row for customer_id = 1 in the updateds1 dataset with ZIP_CODE 90620 instead of 90630.
Run the job and check the output results.
I ran the same job multiple times and found that the Merge stage takes the first record coming in from updateds1 and drops the subsequent records with the same customer id.
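
Continuing the Python sketch from the example above (it reuses the merge() helper and the master list defined there), this reproduces the observed first-record-wins behavior for duplicate update keys; the behavior is modeled on the job runs described here, not on official documentation.

# Duplicate update rows for CUSTOMER_ID = 1: per the runs described above, the first
# record from updateds1 is kept and later records with the same key are dropped.
update1_dup = [
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "M"},
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90620", "SEX": "M"},
    {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "F"},
]

out, rejected = merge(master, update1_dup, key="CUSTOMER_ID")
print(out[0]["ZIP_CODE"])   # 90630 -- the first duplicate wins, the 90620 row is dropped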
This post covered most of the merge scenarios.

Starting DataStage from a batch (.bat) file in Windows

Code:

REM Regenerate and restart the DataStage server engine
REM Use your DSHOME path in the pushd below
pushd C:\IBM\InformationServer\Server\DSEngine
bin\uvregen
popd
net start dsrpc
net start dsengine
net start dstelnet


Put this .bat file in DSHOME, or add a line to the script to cd to DSHOME first.

Release job locks in DataStage

There are three methods to unlock DataStage jobs:
– Using the DataStage Administrator tool
– Using the UV utility
– Using DataStage Director


Unlock jobs – Using the DataStage Administrator tool:

– Log into Administrator (as dsadm)
– Open the Command line for the project
– Execute "LIST.READU EVERY"
– Identify the values in the INODE and USER columns for the job whose locks need to be released
– Execute UNLOCK INODE <inode number> ALL
– Execute UNLOCK USER <user number> ALL

Unlock jobs – Using the UV utility:

– Log on to Unix
– su to dsadm
– Go to the corresponding project directory
– Type "UV"
– Type "DS.TOOLS"
– Select option "4" (4. Administer processes/locks >>)
– Again select option "4" (4. List all locks)
– Identify the value in the PID column for the job whose locks need to be released
– Then select option "7" (7. Clear locks held by a process)
– Give the PID value at the prompt (Enter pid#=) and press Enter

Unlock jobs – Using DataStage Director:

– Open DataStage Director
– Go to the job
– Select Cleanup Resources
– In the Job Resources window, select "Show All" (Processes)
– Find your user name and click "Logout"

MDM Solutions

Top MDM (Master Data Management) solutions (not in any particular order):

1. Oracle Siebel UCM / Oracle CDH
2. IBM MDM through InfoSphere
3. SAP MDM
4. Informatica Siperian (Informatica acquired Siperian in January 2010)
5. SAS DataFlux
6. Initiate Systems
7. D&B Purisma
8. Tibco
9. VisionWare
10. Sun
11. Talend MDM (Open Source MDM)

MDM generates a "golden record" (a single version of the truth) for an organization's essential entities such as Customer, Product, Account, etc.


DataStage vs Informatica

Note: The following is based on differences between Informatica 8.6.1 and IBM DataStage 8.0.1. Informatica's standard industry term for a source-target pipeline is a mapping; IBM DataStage's is a job.

Pipeline Partitioning – 

IBM's thrust is DataStage's pipeline partitioning. With IBM DataStage, data can be segmented into multiple partitions, processed, and then re-collected. IBM DataStage lets you control partitioning in the job design based on the logic of the processing, instead of defaulting the whole pipeline flow to one partition type. IBM DataStage offers 7 different types of multi-processing partitions.

Informatica offers dynamic partitioning, which is applied by default at the workflow level rather than at every stage/object level in a mapping/job. Informatica offers other partitioning choices at the workflow level as well.
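
As a rough illustration of the partition-process-collect pattern (a conceptual Python sketch only, not tied to either product's engine or APIs), the following hash-partitions rows on a key, processes each partition independently, and then collects the results:

# Conceptual sketch of pipeline partitioning: hash-partition rows on a key,
# process each partition independently, then re-collect the results.
from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

def process(partition):
    # Placeholder per-partition transformation (a Transformer stage would go here).
    return [dict(row, PROCESSED=True) for row in partition]

rows = [{"CUSTOMER_ID": i, "AMOUNT": i * 10} for i in range(8)]
partitions = hash_partition(rows, key="CUSTOMER_ID", num_partitions=4)

# Each partition could run on its own node; here they are processed sequentially.
collected = [out_row for part in partitions.values() for out_row in process(part)]
print(len(collected))   # 8 -- all rows re-collected after partitioned processing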

Designer/Monitoring GUIs – 

Informatica offers access to the development and monitoring effort through its 4 GUIs – PowerCenter Designer, Repository Manager, Workflow Manager, and Workflow Monitor.

IBM DataStage caters to development and monitoring of its jobs through 3 GUIs – IBM DataStage Designer (for development), Job Sequence Designer (for workflow design), and Director (for monitoring).

Version Control –

Informatica offers instant version control through its repository server, managed with the "Repository Manager" GUI console. A work-in-progress mapping cannot be opened until it has been saved and checked back into the repository.

Version control was offered as a component up to Ascential DataStage 7.5.x. Ascential was acquired by IBM, and when DataStage was integrated into IBM Information Server at version 8.0.1, support for version control as a component was discontinued.

Repository based flow – 

Informatica offers a step-by-step approach to creating a data integration solution. Each object created while mapping a source to a target is saved into the repository project folder, categorized as Sources, Targets, Transformations, Mappings, Mapplets, User-defined functions, Business Components, Cubes, and Dimensions. Each object created can be shared and dropped into a mapping across cross-functional development teams, thus increasing re-usability. Projects are folder based and inter-viewable.

IBM DataStage offers a project-based integration solution; projects are not inter-viewable, and every project needs role-based access. The step-by-step effort of mapping a source to a target is contained within a job. To share objects across jobs, separate objects called containers (local/shared) need to be created.

Creating a source-target pipeline – 

Within Informatica's PowerCenter Designer, first a source definition is created using the "Source Analyzer", which imports the metadata; then a target definition is created using the "Target Designer"; then a transformation is created using the "Transformation Developer"; and finally source, transformation, and target are connected using the "Mapping Designer".

IBM lets you drag and drop functionality, i.e. stages, within one canvas area for a pipeline source-target job. With IBM DataStage, both source and target metadata must be imported within the "DataStage Designer", and the job is then built from the variety of stages offered, such as database stages, transformation stages, etc.

The biggest difference between the two vendors' offerings in this area is that Informatica forces you to be organized through a step-by-step design process, while IBM DataStage leaves the organization up to you and gives you flexibility in dragging and dropping objects based on the logic flow.

Code Generation and Compilation – 

Informatica's thrust is its auto-generated code. A mapping is created by dropping a source-transformation-target pipeline and does not need to be compiled.

IBM DataStage requires a job to be compiled in order to run it successfully. Changing business requirements affect change control management with IBM DataStage jobs, because re-compilation is required for every change.

Reusability – 

Informatica offers ease of re-usability through Mapplets and Worklets for re-using mappings and workflows. 

IBM DataStage offers re-usability of a job through containers (local & shared). To re-use a Job Sequence (workflow), you need to make a copy, compile, and run it.

Change Data Capture (CDC) – 

Informatica offers CDC through a separate edition – Real-time Edition. CDC is a drag and drop object within IBM DataStage Designer. 

Data Encryption/Masking – 
Data masking or encryption needs to be done before the data reaches the IBM DataStage server. Informatica has an offering within PowerCenter Designer as a separate transformation called the "Data Masking Transformation".

Variety of Transformations –

Informatica offers about 30 general transformations for processing incoming data. 

IBM offers about 40 data transforming stages/objects. 

Impact Analysis – 

Informatica offers a separate edition – Advanced edition that helps with data lineage and impact analysis. 

IBM DataStage offers this through the Designer: right-click on a job to perform dependency or impact analysis.

Real-Time Integration – 

IBM DataStage, within the Designer, offers out-of-the-box real-time solutions for WISD, XML, Web Services, WebSphere MQ, and Java-based services.

Informatica offers SOA/Real-time integration through Real-Time edition. 

Monitoring – 

Informatica's Workflow Monitor offers different levels of run-statistics information. Four tracing levels are offered – Normal, Terse, Verbose Initialization, and Verbose Data. These tracing levels control the amount of information captured about source/target rows, caching, and transformation statistics for each mapping.

IBM DataStage offers operational statistics from DataStage Director. The start, elapsed, and end times can be viewed within the Director GUI. Row statistics can be obtained for every processing stage/object through the monitor option within the Director.

Data Integration or ETL products

Top picks out in the market –
1. Informatica
2. IBM DataStage
3. Ab Initio
4. Oracle
5. SAP Business Objects Data Integrator
6. Microsoft SSIS

Top Challengers
1. SAS/DataFlux
2. iWay Software
3. Talend
4. Syncsort
5. Pervasive Software
6. Pitney Bowes Insight

Competitive price vendors –
1. Informatica Cloud edition
2. expressor-software (pricing based on channels i.e. multiple processing)

Open Source vendors –
1. Talend
2. Apatar
3. Pentaho Kettle
4. SnapLogic

Cloud/SaaS Based vendors –
1. Informatica
2. SnapLogic
Both were started by entrepreneur Gaurav Dhillon.

Top Pipeline Partitioning vendors –
1. IBM DataStage (process based)
2. Informatica (multi thread based)
3. expressor-software (hybrid based)

Top Message oriented/Real-time processing vendors –
1. IBM DataStage
2. Informatica

Best salesforce.com Integration vendors – 
1. Informatica (both Cloud edition based and adaptor based support)
2. IBM DataStage (Adaptor/Pack based support)


Top ELT architecture based vendors –
1. Talend (excellent ELT based objects to drag and drop in the designer)
2. IBM DataStage (provides options to create tables before loading)

Data Lineage in Metadata workbench

Metadata Workbench illustrates relationships between processes, business concepts, people, databases, columns, data files, and much more. Combined with Business Glossary, it gives you reporting capabilities for the casual business user as well as the (often) more technical dedicated metadata researcher.
Log into the Metadata Workbench. At the left is the list of DataStage servers and their projects. Open up the project and its folders, and click on your job. A detailed page appears. Click on the job to get an expanded view, in a new window, of what the job looks like. The metadata you are viewing is current as of the last moment you or a developer saved the job in the DataStage Designer. There is also a very important listing of the stage types in the job, along with their icons, plus many expandable sections for things like job operational metadata.
For any stage, click "Data Lineage" at the upper right.
Click "Create Report" – this shows where the data came from.
At the bottom of the page, click the button labeled "Display Final Assets".

What is IIS Suite

IIS =  IBM InfoSphere Information Server

IIS from IBM could mean:

  • InfoSphere Information Server Workgroup Edition (IIS WE) = Data Integration or ETL (DataStage, QualityStage, Information Analyzer, and Metadata Workbench)
  • InfoSphere Information Server for Data Warehousing (IIS DW) = Data Warehouse for the Smart Analytics System

IIS consists of:

  • DataStage: IBM's main ETL & data integration tool
  • QualityStage: IBM's main data quality tool (needs to be licensed separately)
  • FastTrack: write the mapping spec to generate DataStage & QualityStage jobs & reports
  • Business Glossary: to edit business meaning/data dictionary
  • Blueprint Director: links everything: metadata, ETL, data model
  • Information Analyzer: to understand the content, structure and quality of the data
  • Information Services Director: to deploy DataStage/QualityStage jobs as web services or EJBs
  • Metadata Workbench: to create data lineage between databases, files and BI models
  • Metadata Server: stores operational metadata such as how many rows were written


OLAP cubes

OLAP – What is OLAP?

OLAP stands for On-Line Analytical Processing. OLAP is a technology that is used to organize large business databases and support business intelligence. OLAP databases are divided into one or more cubes, and each cube is organized and designed by a cube administrator to fit the way that you retrieve and analyze data.


OLAP database & cubes

Online Analytical Processing (OLAP) databases facilitate business-intelligence queries. OLAP is a database technology that has been optimized for querying and reporting, instead of processing transactions. The source data for OLAP is Online Transactional Processing (OLTP) databases that are commonly stored in data warehouses. OLAP data is derived from this historical data, and aggregated into structures that permit sophisticated analysis. OLAP data is also organized hierarchically and stored in cubes instead of tables. It is a sophisticated technology that uses multidimensional structures to provide rapid access to data for analysis.


OLAP – volumes, complexity

Online Analytical Processing (OLAP) is characterized by a relatively low volume of transactions, but the queries are often very complex and involve aggregations, so response times are longer than in an OLTP system. An OLAP database stores aggregated, historical data in multi-dimensional schemas.

The first attempt to provide a definition of OLAP was by Dr. Codd, who proposed 12 rules for OLAP.

Not to be confused with OLTP (Online Transaction processing)


For people on the business side, the key feature of OLAP is that it is "multidimensional": the ability to analyze metrics along different dimensions such as time, geography, gender, product, etc. For example, sales for the company are up. Which region is most responsible for this increase? Which store in this region is most responsible for the increase? Which particular product category or categories contributed the most to the increase? Answering these types of questions in order means that you are performing an OLAP analysis.
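
To make the idea concrete, here is a minimal sketch (plain Python, with made-up sales rows) of rolling the same measure up along different dimensions, the kind of drill-down described above:

# Minimal sketch of multidimensional analysis: roll up one measure (sales amount)
# along different dimensions (region, store, product). The data is invented.
from collections import defaultdict

sales = [
    {"region": "West", "store": "S1", "product": "Laptops", "amount": 1200},
    {"region": "West", "store": "S2", "product": "Phones",  "amount":  800},
    {"region": "East", "store": "S3", "product": "Laptops", "amount":  400},
]

def rollup(rows, dimension):
    """Aggregate the 'amount' measure along a single dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["amount"]
    return dict(totals)

print(rollup(sales, "region"))   # which region drives the increase?
print(rollup(sales, "store"))    # drill down: which store?
print(rollup(sales, "product"))  # drill down: which product category?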

Depending on the underlying technology used, OLAP can be broadly divided into two different camps: MOLAP and ROLAP.


Create Business lineage reports in Metadata workbench


What is a business lineage report?

It displays the flow of information between assets.

How?
In the task list on the asset information page for an asset, click Business Lineage.

OR

In a results list from a find or query action, or in the Manage Business Lineage window, right-click the name of an asset and choose Business Lineage.

What is included?
By default, application, BI report, and data file assets can be included in business lineage reports.

To change the default behavior, the Metadata Workbench administrator must configure which assets are to be excluded from the business lineage report.




Troubleshooting:
Check that you have the Business Glossary User role.