What is Merge stage?

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links.(according to DS documentation)

Merge stage combines a master dataset with one or more update datasets based on the key columns. The output record contains all the columns from master record plus any additional columns from each update record that are required.

A master record and update record will be merged only if both have same key column values.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.

As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.

Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.

Popular ETL Tools companies use

Click on the image to read better:

ETL Tools

ETL Tools


As you can see there are many types of ETL Tools and all you have to do right now is to choose appropriate one for you. Some of them are relatively quite expensive, some may be too complex, if you don’t want to transform a lot of information or use many sources or use sophisticated features.

Check out informatica vs datastage comparison

Top picks out in the market –
1. Informatica
2. IBM DataStage
3. Ab Initio
4. Oracle
5. SAP Business Objects Data Integrator
6. Microsoft SSISTop Challengers
1. SAS/DataFlux
2. iWay Software
3. Talend
4. Syncsort
5. Pervasive Software
6. Pitney Bowes Insight

Competitive price vendors –
1. Informatica Cloud edition
2. expressor-software (pricing based on channels i.e. multiple processing)

Open Source vendors –
1. Talend
2. Apatar
3. Pentaho Kettle
4. SnapLogic

Cloud/SaaS Based vendors –
1. Informatica
2. SnapLogic
Both started by entrepreneur Gaurav Dhillon.

Top Pipeline Partitioning vendors –
1. IBM DataStage (process based)
2. Informatica (multi thread based)
3. expressor-software (hybrid based)

Top Message oriented/Real-time processing vendors –
1. IBM DataStage
2. Informatica

Best salesforce.com Integration vendors –
1. Informatica (both Cloud edition based and adaptor based support)
2. IBM DataStage (Adaptor/Pack based support)

Top ELT architecture based vendors –
1. Talend (excellent ELT based objects to drag and drop in the designer)
2. IBM DataStage (provides options to create tables before loading)

When to use join stage vs lookup stage?

The major consideration is the volume of the data. One of the most important mistakes that developers make is to not have volumetric analyses done before deciding to use Join or  Lookup stages.

There are two data sets being combined. One is the primary or driving data set, sometimes called the left of the join. The other dataset are the reference data set or the right of the join.

Join Lookup
If the reference datasets take up a large amount of memory relative to the physical RAM memory size of the computer DataStage is running on, then a Lookup stage might crash because the reference datasets might not fit in RAM along with everything else that has to be in RAM. In this case, Join is most preferred. So, if the reference datasets are not big enough to cause trouble, use a Lookup.  Large reference datasets can cause slow performance since each lookup operation can, and typically will, cause a page fault and an I/O operation.

Physical architecture Vs Logical architecture in Oracle Data Integrator

Physical Architecture is used to create the server name (database) which we are connecting to particular DATABASE.

Logical Architecture is a logical name of Physical Architecture, Only in the designing time we are using Physical Architectures but Logical Architecture used in run time ex:-  when creating models we are using Logical Architecture, and creating interfaces .

In the run time logical Architecture connects to context, In the designing time we have created logical schema name and pyscical schema name which are internally connected by using context, Now the context connects to the particular PhysicalArchitecture, the Physical Architecture connects to the designed Server (DATABASE)


Creating a Data Server

A Data Server corresponds for example to a Database, JMS server instance, a scripting engine or a file system accessed with Oracle Data Integrator in the integration flows. Under a data server, subdivisions are created in the form of Physical Schema.

Context is used to connect  Physical Architecture and Logical Architecture

Adding zeroes or spaces to strings and get fixed length

Using the transformer stage, how do we get fixed length strings in output stream? In the below example, we see how to add zeroes and make the string of a given length.

Source having following data as below,





as in my target/ output must have:







How do we achieve this?

Str(‘0’,6-Len(string)): string 

Find if SMP or MPP

Look in the Configuration file if the “fastname” has the same value for all given nodes then you are running on SMP.

If there are different “fastname’s” then its MPP.

For Server jobs it won’t matter if it is SMP or MPP because Server jobs execute only on the machine where Datastage installed.

Troubleshoot Datastage compilation errors


datastage compilation error

datastage compilation error

If you get an error during compile, check the following:

œCompilation problems

–If Transformer used, check C++ compiler, LD_LIRBARY_PATH

–If Buildop errors try buildop from command line

–Some stages may not support RCP – can cause column mismatch .

–Use the Show Error and More buttons

–Examine Generated OSH

–Check environment variables settings