What is the Merge stage?

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links (according to the DataStage documentation).

The Merge stage combines a master data set with one or more update data sets based on the key columns. The output record contains all the columns from the master record plus any additional columns from each update record that are required.

A master record and an update record will be merged only if both have the same key column values.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.

As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.

Unlike the Join and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific to that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.
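
To make these semantics concrete, here is a minimal Python sketch of what a merge with one update link does conceptually; the column names and data are made up, and this is an illustration of the idea rather than DataStage code:

```python
# Conceptual sketch of Merge stage behaviour (illustrative only, not DataStage code).
# Master and update rows are matched on a key; unmatched update rows go to a
# per-update-link reject list, and unmatched master rows can be kept or dropped.

master = [
    {"cust_id": 1, "name": "Ms. Brown"},
    {"cust_id": 2, "name": "Mr. Green"},
]
updates = [  # one update data set (one update link)
    {"cust_id": 1, "city": "Houston"},
    {"cust_id": 3, "city": "Dallas"},   # no matching master row -> rejected
]

update_by_key = {row["cust_id"]: row for row in updates}

merged = []
matched_keys = set()

for m in master:
    u = update_by_key.get(m["cust_id"])
    if u is not None:
        matched_keys.add(m["cust_id"])
        # master columns plus any additional columns from the update record
        merged.append({**m, **{k: v for k, v in u.items() if k != "cust_id"}})
    else:
        merged.append(m)  # unmatched master kept (or dropped, if so configured)

rejects = [u for k, u in update_by_key.items() if k not in matched_keys]

print(merged)   # merged output records
print(rejects)  # update rows that failed to match a master row
```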

When to use the Join stage vs. the Lookup stage?

The major consideration is the volume of the data. One of the most common mistakes developers make is not doing a volumetric analysis before deciding between the Join and Lookup stages.

There are two data sets being combined. One is the primary or driving data set, sometimes called the left side of the join. The others are the reference data sets, or the right side of the join.

Join vs. Lookup
If the reference data sets take up a large amount of memory relative to the physical RAM of the machine DataStage is running on, then a Lookup stage might crash because the reference data sets may not fit in RAM along with everything else that has to be in RAM. In this case, Join is the preferred choice. If the reference data sets are small enough not to cause trouble, use a Lookup. Large reference data sets also cause slow Lookup performance, since each lookup operation can, and typically will, cause a page fault and an I/O operation.
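
As a rough Python analogy (made-up data, not DataStage internals), a Lookup builds an in-memory table from the whole reference data set, while a Join works through two key-sorted streams and never needs the full reference data set in RAM:

```python
# Rough analogy of the memory trade-off between Lookup and Join.

driving   = [(1, "a"), (2, "b"), (4, "d")]   # primary / left input
reference = [(1, "X"), (2, "Y"), (3, "Z")]   # reference / right input

# Lookup-style: the entire reference data set is loaded into memory first.
ref_map = dict(reference)                     # memory grows with reference size
lookup_result = [(k, v, ref_map.get(k)) for k, v in driving]

# Join-style: both inputs are sorted on the key and merged as streams,
# so only the current row from each side has to be held at once.
def merge_join(left, right):
    li = ri = 0
    out = []
    while li < len(left) and ri < len(right):
        lk, rk = left[li][0], right[ri][0]
        if lk == rk:
            out.append((lk, left[li][1], right[ri][1]))
            li += 1
        elif lk < rk:
            li += 1
        else:
            ri += 1
    return out

join_result = merge_join(sorted(driving), sorted(reference))
print(lookup_result)
print(join_result)
```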

Why a Data Warehouse?

Why do we need a data warehouse?

Heterogeneities are everywhere

Enterprises tend to rely on different databases, files, external sources and even the internet and social media for all the information. This means:

Different interfaces
Different data representations
Duplicate and inconsistent information

 
Data Management in Large Enterprises
Vertical fragmentation of informational systems (vertical stove pipes)
Result of application (user)-driven development of operational systems

The goal of building a warehouse is to have a system that:
 
Collects and combines information
Provides integrated view, uniform user interface
Supports sharing



Slowly Changing Dimension (SCD) in DataStage

To understand what a Slowly Changing Dimension is, we first need to understand how fact and dimension tables change:
Most frequently, we insert into fact tables. Very infrequently we update facts that were loaded incorrectly. It is even less likely that we delete rows from a fact table; about the only time this happens is when archiving old data that is no longer relevant to the current portion of the data warehouse.
Dimension tables, on the other hand, are likely to see frequent updates. A classic example is a customer dimension: people change names, phone numbers, addresses, etc. regularly.

Dimensions that are changing over time are referred to as slowly changing dimensions (SCD).


Type 1: overwrite the existing value

Customer_key Customer_name Customer_city Customer_state
123 Ms. Brown Houston Texas
123 Ms. Brown Dallas Texas
Slowly Changing Dimension – SCD Type 1

Type 2: create a new record with effective dates

Customer_key Customer_name Customer_city Customer_state Effective_date Obsolete_date
123 Ms. Brown Houston Texas 1/1/2000 1/1/2005
234 Ms. Brown Dallas Texas 1/1/2005 NULL
Slowly Changing Dimension – SCD Type 2

Type 3: add a column for the new value alongside the old one; no new rows

Customer_key Customer_name Customer_city Customer_state Change_date New_city
123 Ms. Brown Houston Texas 1/1/2005 Dallas
Slowly Changing Dimension – SCD Type 3
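
As a rough illustration of the Type 1 and Type 2 behaviour shown in the tables above, here is a minimal Python sketch; the column names, dates and surrogate-key handling are simplifying assumptions, not the DataStage SCD stage itself:

```python
from datetime import date

# Current dimension row for customer 'Ms. Brown' (illustrative columns).
dimension = [
    {"customer_key": 123, "customer_name": "Ms. Brown",
     "customer_city": "Houston", "customer_state": "Texas",
     "effective_date": date(2000, 1, 1), "obsolete_date": None},
]

def apply_scd_type1(dim, name, new_city):
    """Type 1: overwrite the existing value; history is lost."""
    for row in dim:
        if row["customer_name"] == name and row["obsolete_date"] is None:
            row["customer_city"] = new_city

def apply_scd_type2(dim, name, new_city, change_date, next_key):
    """Type 2: expire the old row and insert a new row with effective dates."""
    for row in dim:
        if row["customer_name"] == name and row["obsolete_date"] is None:
            row["obsolete_date"] = change_date           # close the old version
            dim.append({**row,
                        "customer_key": next_key,        # new surrogate key
                        "customer_city": new_city,
                        "effective_date": change_date,
                        "obsolete_date": None})
            break

# Type 1 is shown above only for contrast; here we apply Type 2.
apply_scd_type2(dimension, "Ms. Brown", "Dallas", date(2005, 1, 1), next_key=234)
for row in dimension:
    print(row)
```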

What is a Data Warehouse?

A data warehouse is a central, integrated database containing data from all the operational sources and archive systems in an organization. It contains a copy of transaction data specifically structured for query and analysis. This database can be accessed by all users, ensuring that each group in the organization is accessing valuable, stable data.
Data Warehouse
A data warehouse is a structured repository of historic data.
It is developed in an evolutionary process by integrating data from non-integrated legacy systems.
It is usually:
• Subject oriented
• Integrated
• Time variant
• Non-volatile


DataStage History

In the beginning DataStage was available under the name Ardent DataStage; its name was then changed to Ascential DataStage.


In 2005 IBM acquired Ascential and added the DataStage product to its WebSphere family. The October 2006 release integrated DataStage into the new IBM Information Server platform, but it was still called WebSphere DataStage.


IBM WebSphere DataStage is one of the products in IBM’s WebSphere Information Integration suite and Information Server. The tool uses mostly graphical notation to present data integration solutions. DataStage is delivered in various versions, such as Server and Enterprise Edition. In 2008 the suite was renamed InfoSphere Information Server and the product was renamed InfoSphere DataStage.

For more detailed early history, see below:

  • DataStage was conceived at VMark, a spin-off from Prime Computers that developed two notable products: the UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in 1996.
  • Lee Scheffler presented the DataStage product overview to the board of VMark in 1996 and it was approved for development. The product was in alpha testing in October, beta testing in November, and was generally available in 1997.
  • VMark acquired UniData in 1997 and renamed itself Ardent Software. In 1999 Ardent Software was acquired by Informix, the database software vendor.
  • In 2001 IBM acquired Informix and took just the database business, leaving the data integration tools to be spun off as an independent software company called Ascential Software.

People refer to various DataStage options using these edition names:

 

Enterprise Edition: a name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
Server Edition: the name of the original version of DataStage, representing Server Jobs. Early DataStage versions contained only Server Jobs; DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.
MVS Edition: mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
DataStage for PeopleSoft: a Server Edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopleSoft and Oracle Corporation.
DataStage TX: for processing complex transactions and messages; formerly known as Mercator.
DataStage SOA: the Real Time Integration pack, which can turn server or parallel jobs into SOA services.

What are Stage Variables, Derivations and Constants?

Stage Variable – an intermediate processing variable that retains its value while rows are read and does not pass its value to a target column.
Derivation – an expression that specifies the value to be passed on to the target column.
Constant – a condition that is either true or false and that controls the flow of data along a link.
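
As a rough Python analogy (not DataStage expression syntax), the three concepts can be sketched like this: a stage variable keeps its value from row to row, a derivation computes the value written to the target column, and the true/false condition decides whether a row flows down the link:

```python
# Rough analogy only -- real jobs use DataStage expressions, not Python.
rows = [{"amount": 100}, {"amount": 250}, {"amount": 40}]

running_total = 0                      # "stage variable": retained across rows,
                                       # never written to the target by itself
output = []
for row in rows:
    running_total += row["amount"]     # stage variable updated on every row
    derived = row["amount"] * 1.1      # "derivation": value passed to the target column
    if row["amount"] > 50:             # "constant": condition controlling the link
        output.append({"amount_with_tax": round(derived, 2),
                       "running_total": running_total})

print(output)
```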

What is the IIS Suite?

IIS =  IBM InfoSphere Information Server

IIS from IBM could mean:

  • InfoSphere Information Server Workgroup Edition (IIS WE) = data integration or ETL (DataStage, QualityStage, Information Analyzer, and Metadata Workbench)
  • InfoSphere Information Server for Data Warehousing (IIS DW) = data warehousing for the Smart Analytics System

IIS consists of:

  • DataStage: IBM’s main ETL and data integration tool
  • QualityStage: IBM’s main data quality tool (needs to be licensed separately)
  • FastTrack: write the mapping spec to generate DataStage and QualityStage jobs and reports
  • Business Glossary: to edit business meaning/data dictionary
  • Blueprint Director: links everything together: metadata, ETL, data model
  • Information Analyzer: to understand the content, structure and quality of the data
  • Information Services Director: to deploy DataStage/QualityStage jobs as web services or EJBs
  • Metadata Workbench: creates data lineage between databases, files and BI models
  • Metadata Server: stores operational metadata, such as how many rows were written


OLAP cubes

OLAP – What is OLAP?

OLAP stands for On-Line Analytical Processing. OLAP is a technology used to organize large business databases and support business intelligence. OLAP databases are divided into one or more cubes, and each cube is organized and designed by a cube administrator to fit the way you retrieve and analyze data, so that reports are easier to create and use.


 

OLAP database & cubes

Online Analytical Processing (OLAP) databases facilitate business-intelligence queries. OLAP is a database technology that has been optimized for querying and reporting, instead of processing transactions. The source data for OLAP is Online Transactional Processing (OLTP) databases that are commonly stored in data warehouses. OLAP data is derived from this historical data, and aggregated into structures that permit sophisticated analysis. OLAP data is also organized hierarchically and stored in cubes instead of tables. It is a sophisticated technology that uses multidimensional structures to provide rapid access to data for analysis.


 

 

OLAP – volumes, complexity

Online Analytical Processing (OLAP) is characterized by a relatively low volume of transactions, but the queries are often very complex, so response times in an OLAP system are longer than in an OLTP system. An OLAP database stores aggregated, historical data in multi-dimensional schemas.

The first attempt to provide a definition of OLAP was by Dr. Codd, who proposed 12 rules for OLAP.

Not to be confused with OLTP (Online Transaction Processing).


For people on the business side, the key feature of OLAP is that it is multidimensional: the ability to analyze metrics across different dimensions such as time, geography, gender, product, etc. For example, sales for the company are up. Which region is most responsible for this increase? Which store in this region is most responsible for the increase? Which product category or categories contributed the most to the increase? Answering these types of questions in order means that you are performing an OLAP analysis.
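
A toy Python example (with made-up sales rows) of what analyzing a metric across dimensions means: the same fact, sales amount, is rolled up by region, then drilled down to store and category:

```python
from collections import defaultdict

# Made-up sales facts with region / store / category dimensions.
sales = [
    {"region": "West", "store": "W1", "category": "Toys",  "amount": 120},
    {"region": "West", "store": "W2", "category": "Games", "amount": 300},
    {"region": "East", "store": "E1", "category": "Toys",  "amount": 150},
    {"region": "West", "store": "W2", "category": "Toys",  "amount": 80},
]

def rollup(rows, *dims):
    """Aggregate the 'amount' measure over the given dimension columns."""
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["amount"]
    return dict(totals)

print(rollup(sales, "region"))              # Which region drives sales?
print(rollup(sales, "region", "store"))     # Drill down: which store in each region?
print(rollup(sales, "region", "category"))  # Slice by product category instead
```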

Depending on the underlying technology used, OLAP can be broadly divided into two different camps: MOLAP and ROLAP.


Sort Stage Basics

As in many data-related processes, sorting plays an important part in ETL transformations using DataStage. In fact, sorted data is a prerequisite for a number of stages such as Remove Duplicates, Join, etc. There are two ways in which you can sort your data in DataStage. One way is to do inline sorts in any stage you want: for example, if you want your data sorted before using it in your Transformer, you simply go to the Input -> Partitioning tab and check the ‘Perform sort’ option. Even though sorting is achievable this way, in some cases people specifically go for the Sort stage because of the flexibility, power and variety of options it offers.
You can find the Sort stage in the Processing section of the palette. In order to perform the sort you must provide the keys on which you want the sort performed. If you want to perform the sort sequentially in a parallel job, you must set the Sort stage to run in sequential mode rather than parallel mode; this option is set in the Advanced tab of the Sort stage. The choice of running the sort in sequential or parallel mode is entirely up to you, based on your business requirement.
Where the sort is carried out is entirely dependent on how you have configured your APT_CONFIG_FILE. If you have set up a separate pool for the sort operation, then the sort will only use the scratch disks specified in the sort pool. If such a sort pool is not specified, then the scratch disks in the default pool are used for sorting. As a last option it will use the temporary disk space specified by the TMPDIR environment variable, and if that is not set, the /tmp folder is used.
Since most of you already know how to configure the Sort stage for basic use, I’ve decided to explain certain options in the Sort stage that might be useful when you design your job.
Sort Key mode
This option is present under each sort key. It is important mainly because it allows us to use the Sort stage efficiently, so that we have an optimally running sort operation. By default the value is set to ‘Sort’. The other two options are ‘Don’t Sort (Previously Sorted)’ and ‘Don’t Sort (Previously Grouped)’. If, out of the set of columns we are sorting on, we already know that the data is sorted on a particular column, then we can set the sort key mode for that key to Don’t Sort (Previously Sorted), which indicates that the input records have already been sorted on that column prior to this stage. This ensures that we don’t do any unnecessary sorts, which in turn improves job performance.
Suppose you have data as below:
A                             B                             C
1                              127                         AAA
1                              124                         BBB
1                              156                         CCC
2                              09                           DDD
2                              540                         EEE
Now, if we specify the sorting keys as A and B, and we are only interested in seeing the data within each group (specified by column A in this case) suitably sorted, then it makes sense to set the sort key mode for key ‘A’ to Don’t Sort (Previously Grouped) to indicate that the input records are already grouped by this particular column, but not sorted. This makes the Sort stage sort only on column B.
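
In Python terms, a rough analogy of what this setting saves is shown below; the data matches the example above, and this is an illustration of the idea rather than the tsort operator itself:

```python
from itertools import groupby

# Input is already grouped on column A (but not sorted within each group).
rows = [
    {"A": 1, "B": 127, "C": "AAA"},
    {"A": 1, "B": 124, "C": "BBB"},
    {"A": 1, "B": 156, "C": "CCC"},
    {"A": 2, "B": 9,   "C": "DDD"},
    {"A": 2, "B": 540, "C": "EEE"},
]

# "Don't Sort (Previously Grouped)" on A: only sort on B inside each existing group.
sorted_within_groups = [
    row
    for _, group in groupby(rows, key=lambda r: r["A"])
    for row in sorted(group, key=lambda r: r["B"])
]
for row in sorted_within_groups:
    print(row)
```
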
Sort Utility
There are two different sort utilities you can use in DataStage; they are mentioned below:
– DataStage – this is the default sort used by DataStage and uses the tsort operator. It does not require any other application to work.
– UNIX – as the name indicates, this carries out the sort operation using the UNIX sort command. For this, the psort operator is used by DataStage.
Cluster Key Change Column and Create Key Change Column
There will be requirements in your design where you want to know the first record in a group (specified by the sort keys) in your data set. This can easily be achieved by setting the key change option to true. Even though both of the above options do the same thing, which one you must use depends on the sort key mode specified. If you used the sort key mode Don’t Sort for the keys during sorting, then you have to use the ‘Cluster Key Change Column’ option. If you used the default sort key mode, then you need only use the ‘Create Key Change Column’ option. Setting these options ensures that an extra column is added to your output records; this column contains the value 1 if the record is the first record in its group (where a group is indicated by a change in the sort key value). All following records in that group have the value 0.
e.g. if the data shown above was sorted on key A and the key change option was set, then the output data would look like this:
A                             B                             C               ClusterKeyChange
1                              127                         AAA          1
1                              124                         BBB          0
1                              156                         CCC          0
2                              09                           DDD         1
2                              540                         EEE         0
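
A small Python sketch (illustrative only, not the stage itself) of how the key change column in the table above can be derived: the first row of each group of the sort key gets 1, all following rows get 0:

```python
# Derive a keyChange flag: 1 on the first row of each group of sort key A, else 0.
sorted_rows = [
    {"A": 1, "B": 127, "C": "AAA"},
    {"A": 1, "B": 124, "C": "BBB"},
    {"A": 1, "B": 156, "C": "CCC"},
    {"A": 2, "B": 9,   "C": "DDD"},
    {"A": 2, "B": 540, "C": "EEE"},
]

previous_key = object()            # sentinel that never equals a real key value
for row in sorted_rows:
    row["keyChange"] = 1 if row["A"] != previous_key else 0
    previous_key = row["A"]
    print(row)
```
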
Restrict memory usage
By default the value is 20 MB. This restricts the Sort stage to carrying out its operation within a certain amount of virtual memory on the node on which it is processed. You can increase this to increase your sort rate. However, you should remember that the size should never be more than the amount of physical memory actually available on the processing node.