Merge Stage Example

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links (according to the DataStage documentation).

The Merge stage combines a master dataset with one or more update datasets based on the key columns. The output record contains all the columns from the master record plus any additional columns from each update record that are required.
A master record and an update record are merged only if both have the same values in the key columns.
The datasets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements, because fewer rows need to be held in memory at any one time.
As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master dataset. If you have more than one update dataset, you must remove duplicate records from the update datasets as well.
Unlike the Join and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific to that update link; you must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows or output them on the output data link.
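To make the "key partitioned and sorted" requirement concrete, here is a minimal plain-Python sketch (illustrative only, not DataStage syntax and not how the engine is implemented): rows are assigned to partitions by hashing the key column, and each partition is then sorted on that key.

def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing its key, then sort each partition on the key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    for p in partitions:
        p.sort(key=lambda r: r[key])   # Merge expects each input sorted on the key
    return partitions

master = [{"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJITHA"},
          {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"}]
print(hash_partition(master, "CUSTOMER_ID", 2))

Because every row with a given key lands in the same partition, each node can merge its own slice of the data independently.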
Example:
Master dataset:
CUSTOMER_ID CUSTOMER_NAME
1 UMA
2 POOJITHA
Update dataset 1:
CUSTOMER_ID CITY ZIP_CODE SEX
1 CYPRESS 90630 M
2 CYPRESS 90630 F
Output:
CUSTOMER_ID CUSTOMER_NAME CITY ZIP_CODE SEX
1 UMA CYPRESS 90630 M
2 POOJITHA CYPRESS 90630 F
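For illustration only, the following rough Python sketch models the merge semantics shown above (a conceptual model, not the actual stage): each master row is combined with the first matching update row on CUSTOMER_ID, unmatched masters are kept or dropped depending on the Unmatched Masters Mode, and update rows that never match a master go to that link's reject link.

def merge(master, updates, key, keep_unmatched_masters=True):
    """Merge one update dataset into the master on the key column (conceptual only)."""
    lookup = {}
    for row in updates:
        lookup.setdefault(row[key], row)   # first update row per key wins; duplicates dropped
    output, matched = [], set()
    for m in master:
        u = lookup.get(m[key])
        if u is not None:
            output.append({**m, **u})
            matched.add(m[key])
        elif keep_unmatched_masters:       # Unmatched Masters Mode = Keep
            output.append(dict(m))
    rejects = [u for k, u in lookup.items() if k not in matched]   # sent to the reject link
    return output, rejects

master = [{"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
          {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJITHA"}]
update1 = [{"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "M"},
           {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "F"}]
out, rej = merge(master, update1, "CUSTOMER_ID")
print(out)   # two merged rows, matching the Output table above
print(rej)   # [] -- every update row found a master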
Merge stage configuration steps:
Options:
Unmatched Masters Mode: Keep means that unmatched rows (those without any updates) from the master link are output; Drop means that unmatched rows are dropped instead.
Warn On Reject Updates: True to generate a warning when bad records from any update link are rejected.
Warn On Unmatched Masters: True to generate a warning when there are unmatched rows from the master link.
Partitioning: Hash on both the master input and the update input, as shown below.
Compile and run the job:
Scenario 2:
Remove a record from updateds1 and check the output.
Check for the DataStage warnings in the job log, since we selected Warn On Unmatched Masters = True:
stg_merge,0: Master record (0) has no updates.
stg_merge,1: Update record (1) of data set 1 is dropped; no masters are left.
Scenario 3: Drop unmatched master records and capture reject records from updateds1.
Scenario 4: Insert a duplicate record with the same customer_id into the master dataset and check the results.
Looking at the output, it is clear that the Merge stage automatically dropped the duplicate record from the master dataset.
Scenario 5: Add a new update dataset (updateds2) which contains the following data.
Update dataset 2:
CUSTOMER_ID CITIZENSHIP
1 INDIAN
2 AMERICAN
We still have a duplicate row in the master dataset. If you compile the job with the above design, you will get a compilation error like the one below.
If you look at the above figure, you can see two rows in the output, because we have a matching row for customer_id = 2 in updateds2.
Scenario 6: Add a duplicate row for customer_id = 1 in the updateds1 dataset.
Now we have a duplicate record in both the master dataset and updateds1. Run the job and check the results and the warnings in the job log.
No change in the results; the Merge stage automatically dropped the duplicate row.
Scenario 7: Modify the duplicate row for customer_id = 1 in the updateds1 dataset so that its zip code is 90630 instead of 90620.
Run the job and check output results.
I ran the same job multiple times and found that the Merge stage takes the first record coming in from updateds1 and drops subsequent records with the same customer_id.
This post covered most of the common Merge stage scenarios.

Release Job locks in Datastage

There are three methods to unlock DataStage jobs:
– Using the DataStage Administrator Tool
– Using the UV Utility
– Using DataStage Director


Unlock jobs – Using DataStage Administrator Tool:

– Log into the Administrator (as dsadm)
– Open the Command line for the project
– Execute “LIST.READU EVERY”
– Identify the values in the INODE and USER columns for the job whose locks need to be released
– Execute UNLOCK INODE <inodenumber> ALL or UNLOCK USER <user number> ALL

Unlock jobs – Using UV Utility:

– Log on to Unix
– su to dsadm
– Go to the corresponding project directory
– Type “UV”
– Type “DS.TOOLS”
– Select option “4” (4. Administer processes/locks >>)
– Again select option “4” (4. List all locks)
– Identify the value in the PID column for the job whose locks need to be released
– Then select option “7” (7. Clear locks held by a process)
– Give the PID value at the prompt (Enter pid#=) and press Enter

Unlock jobs – Using DataStage Director:

– Open DataStage Director
– Go to the job
– Select Cleanup Resources
– In the Job Resources window, select “Show All” (Processes)
– Find your user name and click “Logout”

MDM Solutions

Top MDM (Master Data Management) solutions (not in any particular order):

1. Oracle Siebel UCM / Oracle CDH
2. IBM MDM through InfoSphere
3. SAP MDM
4. Informatica Siperian (Informatica acquired Siperian in January 2010)
5. SAS DataFlux
6. Initiate Systems
7. D&B Purisma
8. Tibco
9. VisionWare
10. Sun
11. Talend MDM (open source MDM)

MDM generates a “golden record” of an organization’s essential entities, such as Customer, Product, and Account: a single version of the truth.


Datastage vs Informatica

Note: The following is based on differences between Informatica 8.6.1 and IBM DataStage 8.0.1. Informatica’s standard industry term for a source-target pipeline is a mapping; IBM DataStage’s is a job.

Pipeline Partitioning – 

IBM’s thrust is DataStage’s pipeline partitioning. Data can be segmented across multiple partitions, processed, and then re-collected with IBM DataStage. IBM DataStage lets you control a job design based on the logic of the processing instead of defaulting the whole pipeline flow to one partition type. IBM DataStage offers 7 different partitioning methods for multi-processing.

Informatica offers dynamic partitioning, which is applied by default at the workflow level rather than at every stage/object level in a mapping/job. Informatica offers other partitioning choices at the workflow level as well.

Designer/Monitoring GUIs – 

Informatica offers access to the development and monitoring effort through its 4 GUIs – PowerCenter Designer, Repository Manager, Workflow Manager, and Workflow Monitor.

IBM DataStage caters to development and monitoring of its jobs through 3 GUIs – IBM DataStage Designer (for development), Job Sequence Designer (workflow design), and Director (for monitoring).

Version Control –

Informatica offers instant version control through its repository server, managed with the “Repository Manager” GUI console. A work-in-progress mapping cannot be opened until it has been saved and checked back into the repository.

Version control was offered as a component up to Ascential DataStage 7.5.x. Ascential was acquired by IBM, and when DataStage was integrated into IBM Information Server at version 8.0.1, support for version control as a component was discontinued.

Repository based flow – 

Informatica offers a step-by-step effort of creating a data integration solution. Each object created while mapping a source to a target is saved into the repository project folder, categorized as Sources, Targets, Transformations, Mappings, Mapplets, User-defined functions, Business Components, Cubes, and Dimensions. Each object created can be shared and dropped into a mapping across cross-functional development teams, thus increasing re-usability. Projects are folder-based and viewable across one another.

IBM DataStage offers a project-based integration solution; projects are not viewable across one another, and every project needs role-based access. The step-by-step effort of mapping a source to a target becomes a job. To share objects within or across jobs, separate objects called containers (local or shared) need to be created.

Creating a source-target pipeline – 

Within Informatica’s PowerCenter Designer, first a source definition is created using the “Source Analyzer”, which imports the metadata; then a target definition is created using the “Target Designer”; then a transformation is created using the “Transformation Developer”; and finally the source-transformation-target pipeline is mapped using the “Mapping Designer”.

IBM lets you drag and drop functionality, i.e. a stage, within one canvas area to build a source-target job. With IBM DataStage, both source and target metadata must be imported within the “DataStage Designer”, and the job is then built from the variety of stages offered: database stages, transformation stages, and so on.

The biggest difference between the two vendors’ offerings in this area is that Informatica forces you to be organized through a step-by-step design process, while IBM DataStage leaves the organization to you and gives you flexibility in dragging and dropping objects based on the logic flow.

Code Generation and Compilation – 

Informatica’s thrust is auto-generated code. A mapping is created by dropping in a source, a transformation, and a target, and it does not need to be compiled.

IBM DataStage requires a job to be compiled in order to run it successfully. Changing business requirements affect change control management of IBM DataStage jobs, since re-compilation is required for every change.

Reusability – 

Informatica offers ease of re-usability through Mapplets and Worklets for re-using mappings and workflows. 

IBM DataStage offers re-usability of a job through containers (local and shared). To re-use a Job Sequence (workflow), you need to make a copy, compile, and run it.

Change Data Capture (CDC) – 

Informatica offers CDC through a separate edition – the Real-time Edition. In IBM DataStage, CDC is a drag-and-drop object within the Designer.

Data Encryption/Masking – 
Data masking or encryption needs to be done before data reaches the IBM DataStage server. Informatica has an offering within PowerCenter Designer as a separate transformation called the “Data Masking Transformation”.

Variety of Transformations –

Informatica offers about 30 general transformations for processing incoming data. 

IBM offers about 40 data transforming stages/objects. 

Impact Analysis – 

Informatica offers a separate edition – the Advanced Edition – that helps with data lineage and impact analysis.

IBM DataStage offers this through the Designer: right-click on a job to perform dependency or impact analysis.

Real-Time Integration – 

IBM DataStage, within the Designer, offers building built-in real-time solutions for WISD, XML, Web Services, WebSphere MQ, and Java-based services.

Informatica offers SOA/Real-time integration through Real-Time edition. 

Monitoring – 

The Informatica Workflow Monitor offers different levels of run-statistics information. Tracing is offered at 4 levels – Normal, Terse, Verbose Initialization, and Verbose Data. These tracing levels determine the degree of information captured on source/target rows, caching, and transformation statistics for each mapping.

IBM DataStage offers Operational Statistics from DataStage Director. The start, elapsed, end times can be viewed within the Director GUI. Row statistics can be obtained at every processing stage/object through the monitor option within the Director.

Data Lineage in Metadata workbench

Metadata Workbench illustrates relationships between processes, business concepts, people, databases, columns, data files, and much, much more. Combined with Business Glossary, it gives you reporting capabilities for the casual business user as well as the (often) more technical dedicated metadata researcher.
Log into the Metadata Workbench. At the left is the list of DataStage servers and their projects. Open up the project and its folders, and click on your job. A detailed page appears. Click on the job to get an expanded view, in a new window, of what the job looks like. The metadata you are viewing is up to date as of the last moment you or a developer saved the job in the DataStage Designer. There is also a very important listing of the stage types in the job, along with their icons. Note that below you have many “expandable” sections for things like job operational metadata.
For any stage, click “Data Lineage” at the upper right.
Click “Create Report” – this shows where the data came from.
At the bottom of the page, click the button labeled “Display Final Assets”.

FTP Stage for VB file from Mainframe to Datastage

FTP Stage: 
———— 
Record Level >> Record Type = implicit
Field Defaults >> Delimiter = | <pipe>
Type Defaults >> General >> Character set = EBCDIC, Data format = binary




Delimiter = none for no delimiter at the end of each record.


Specify all the input fields with their lengths; in my case I had to use SQL Type = Varchar and give each field's length. This works as long as it conforms to the LRECL setting.
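As a quick local sanity check of the layout (outside DataStage, and assuming the file has already been transferred in binary mode), a short Python snippet like the following can decode one EBCDIC record into fields. The field names and lengths here are hypothetical and should be replaced with your actual copybook layout:

import codecs

# Hypothetical layout: (field name, length in bytes) -- replace with your copybook fields.
LAYOUT = [("CUSTOMER_ID", 10), ("CUSTOMER_NAME", 20)]

def parse_record(raw: bytes):
    """Decode an EBCDIC (cp037) record and cut it into fixed-width fields."""
    text = codecs.decode(raw, "cp037")
    fields, pos = {}, 0
    for name, length in LAYOUT:
        fields[name] = text[pos:pos + length].rstrip()
        pos += length
    return fields

# Example record built here only for demonstration.
sample = ("1".ljust(10) + "UMA".ljust(20)).encode("cp037")
print(parse_record(sample))

If the fields come out shifted or garbled, the character set, data format, or field lengths in the FTP stage definition usually need another look.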

ETL disadvantages

  • Data from different sources would be different, poorly documented and dirty. 
  • Standardization
    • Name and address standardization is not very easy
    • Specialized software for standardization – specific to geography
  • Perfecting the data is costly

Aggregator Stage and Filter Stage with example

If we have data as below

table_data

Sno name
10 Jeff
10 Mike
10 Sam
20 Tom
30 Amy
20 Ron
40 John



We need to send records that occur multiple times for the same Sno to one target,
and records whose Sno is not repeated to another target.

Take the job design as

 

Read and load the data using a Sequential File stage.

In the Aggregator stage, select Group = Sno,

Aggregator type = Count rows,

Count output column = sno_count (user defined).

In the output, drag and drop the required columns, then click OK.

In the Filter stage:

– First where clause: sno_count > 1, output link = 0
– Second where clause: sno_count <= 1, output link = 1

Drag and drop the outputs to the two targets, give the target file names, then compile and run the job. You will get the required data in the targets.
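Conceptually (plain Python, not DataStage syntax), the Aggregator-plus-Filter logic above routes rows like this; the sketch just counts rows per Sno and splits the stream on that count:

from collections import Counter

rows = [(10, "Jeff"), (10, "Mike"), (10, "Sam"), (20, "Tom"),
        (30, "Amy"), (20, "Ron"), (40, "John")]

# Aggregator: count rows per Sno group.
counts = Counter(sno for sno, _ in rows)

# Filter: route on the count, like the two where clauses above.
target_repeated = [r for r in rows if counts[r[0]] > 1]    # sno_count > 1
target_single   = [r for r in rows if counts[r[0]] <= 1]   # sno_count <= 1

print(target_repeated)  # rows whose Sno appears more than once
print(target_single)    # [(30, 'Amy'), (40, 'John')]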

What can Datastage do?

  • Design jobs for Extraction, Transformation, and Loading (ETL)
  • Ideal tool for data integration projects – such as, data warehouses, data marts, and system migrations
  • Import, export, create, and manage metadata for use within jobs
  • Schedule, run, and monitor jobs all within DataStage
  • Administer your DataStage development and execution environments
 

Complex Flat File Stage

  • The CFF stage now processes Multi Format Flat (MFF) files
  • Constraints can be specified on the output links to filter data and/or define when a record should be sent down the link