DataStage History

DataStage was initially available as Ardent DataStage; its name was later changed to Ascential DataStage.


In 2005 IBM acquired Ascential and added the DataStage product to its WebSphere family. The October 2006 release integrated DataStage into the new IBM Information Server platform, though the product was still called WebSphere DataStage.


IBM WebSphere DataStage is one of the products in IBM's WebSphere Information Integration suite and Information Server. The tool uses mostly graphical notation to present data integration solutions. DataStage is delivered in various versions, such as Server and Enterprise Edition. In 2008 the suite was renamed InfoSphere Information Server and the product was renamed InfoSphere DataStage.

For more detailed early history, see below:

  • DataStage was conceived at VMark, a spin-off from Prime Computer that developed two notable products: the UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in 1996.
  • Lee Scheffler presented the DataStage product overview to the board of VMark in 1996 and it was approved for development. The product was in alpha testing in October 1996, beta testing in November, and became generally available in 1997.
  • VMark acquired UniData in 1997 and renamed itself Ardent Software. In 1999 Ardent Software was acquired by Informix, the database software vendor.
  • In 2001 IBM acquired Informix and took just the database business, leaving the data integration tools to be spun off as an independent software company called Ascential Software.

People refer to various DataStage options using the following edition names:

 

Edition Details
• Enterprise Edition – a name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
• Server Edition – the name of the original version of DataStage, representing server jobs. Early DataStage versions contained only server jobs. DataStage 5 added sequence jobs and DataStage 6 added parallel jobs via Enterprise Edition.
• MVS Edition – mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
• DataStage for PeopleSoft – a server edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopleSoft and Oracle Corporation.
• DataStage TX – for processing complex transactions and messages, formerly known as Mercator.
• DataStage SOA – the Real Time Integration pack, which can turn server or parallel jobs into SOA services.

Datastage TX overview

IBM WebSphere DataStage TX delivers the ability to easily and seamlessly automate high-volume, complex transactions without the need for additional coding, resulting in a quick return on investment.

IBM WebSphere DataStage TX 6.7 delivers rapid ROI through a highly scalable, open architecture. 

IBM WebSphere DataStage TX allows you to:
  • Quickly and seamlessly integrate many of the most popular applications, databases, and messaging systems
  • Respond quickly and with flexibility to rapidly evolving, strategic business requirements
  • Leverage the value of your enterprise systems and applications
  • Meet your project-based or full integration solution needs

IBM WebSphere DataStage Features & Benefits
IBM WebSphere DataStage provides these unique capabilities:

The Most Powerful ETL Solution
Supports the collection, integration and transformation of high volumes of data, with data structures ranging from simple to highly complex. DataStage manages data arriving in real time as well as data received daily, weekly or monthly.

The Most Scalable Platform
Enables companies to solve large-scale business problems through high-performance processing of massive data volumes. By leveraging the parallel processing capabilities of multiprocessor hardware platforms, DataStage Enterprise Edition can scale to satisfy the demands of ever-growing data volumes, stringent real-time requirements, and ever-shrinking batch windows.

The Most Comprehensive Source and Target Support
Supports a virtually unlimited number of heterogeneous data sources and targets in a single job, including: text files; complex data structures in XML; ERP systems such as SAP and PeopleSoft; almost any database (including partitioned databases); web services; and SAS. 

Real-time Data Integration Support
Operates in real time, capturing messages from Message Oriented Middleware (MOM) queues using JMS or WebSphere MQ adapters to seamlessly combine data into conforming operational and historical analysis perspectives. IBM WebSphere RTI Services is service-oriented architecture (SOA) enabling middleware that delivers the benefits of the Ascential Enterprise Integration Suite enterprise-wide, across a continuum of time constraints, application suites, interface protocols and integration technologies.

Advanced Maintenance and Development
Gives developers maximum speed, flexibility and effectiveness in building, deploying, updating and managing their data integration infrastructure. Full data integration reduces the development and maintenance cycle for data integration projects by simplifying administration and maximizing development resources. 

Complete Connectivity Between Any Data Source and Any Application
Ensures that the most relevant, complete and accurate data is integrated and used by the most popular enterprise application software brands, including SAP, Siebel, Oracle, PeopleSoft and J.D. Edwards.

How to perform a Lookup in Datastage?

Data in DataStage can be looked up from a hashed file or from a database (ODBC/Oracle) source. Lookups are always managed by the transformer stage.

A hashed file is a reference table based on key fields which provides fast access for lookups. Hashed files are very useful as a temporary or non-volatile program storage area. An advantage of hashed files is that they can be populated locally with remote data for better performance.
To increase performance further, hashed files can be preloaded into memory for fast reads, and they support write-caching for fast writes.

There are also situations where loading a hashed file and using it for lookups is much more time-consuming than accessing a database table directly. This usually happens when there is a need to access more complex data than a simple key-value mapping, for example when the data comes from multiple tables, or must be grouped or processed in a database-specific way. In such cases it is worth considering an ODBC or Oracle stage.
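The trade-off above can be sketched in plain Python terms: a hashed file behaves like a dictionary keyed on the lookup fields, while a sequential file forces a linear scan per probe. This is only an illustration of the access pattern, not DataStage code:

```python
# Illustrative only: a hashed-file lookup behaves like a dict keyed on the
# lookup key, while a sequential file must be scanned row by row per probe.

country_rows = [("PL", "Poland"), ("DE", "Germany"), ("FR", "France")]

# "Hashed file": built once, then O(1) probes afterwards.
country_hash = dict(country_rows)

def lookup_hashed(code):
    # Mirrors the transformer behaviour described below: emit the matched
    # value, or a "not found" string when the key is absent.
    return country_hash.get(code, "not found")

# "Sequential file": every probe scans the rows from the top.
def lookup_sequential(code):
    for key, name in country_rows:
        if key == code:
            return name
    return "not found"

print(lookup_hashed("DE"))        # Germany
print(lookup_sequential("XX"))    # not found
```

The dictionary wins once the same reference data is probed many times; the scan can still be cheaper for a handful of probes against a large file, which is the same trade-off described above for hashed files versus direct database access.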
Please refer to the examples below to see how lookups are used in DataStage.
In the transformer depicted below there is a lookup into a country dictionary hash file. If a country is matched it is written to the right-hand side column; if not, a "not found" string is generated.

Design of a datastage transformer with lookup

In the job depicted below there is a sequential file lookup, linked together with a hash file which stores the temporary data.
Sequential file lookup

SCD type 2 implementation in Datastage

Slowly changing dimension Type 2 is a model where the whole history is stored in the database: an additional dimension record is created for each change, so the boundary between the old record values and the new (current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in this kind of dimension, and the fact table usually stores the dimension key and version number.
SCD 2 implementation in Datastage
The job described and depicted below shows how to implement SCD Type 2 in DataStage. It is one of many possible designs which can implement this dimension.
For this example we will use a table with customer data (its name is D_CUSTOMER_SCD2) which has the following structure and data:
D_CUSTOMER dimension table before loading

Datastage SCD2 job design

The most important facts and stages of the CUST_SCD2 job processing:
• The dimension table with customers is refreshed daily, and one of the data sources is a text file. For the purpose of this example the record with CUST_ID=ETIMAA5 differs from the one stored in the database and is the only record with changed data. The file has the following structure and data:
SCD 2 – Customers file extract:

• There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text file.
• A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to separate columns.
SCD 2 lookup transformer

• A T002_Check_Discrepacies_exist transformer compares old and new values of records and passes through only records that differ.
SCD 2 check discrepancies transformer

• A T003 transformer handles the UPDATE and INSERT actions for a record. The old record is updated with the current indicator flag set to no, and the new record is inserted with the current indicator flag set to yes, the record version increased by 1, and the current date.
SCD 2 insert-update record transformer

• ODBC update stage (O_DW_Customers_SCD2_Upd) – update action 'Update existing rows only'; the selected key columns are CUST_ID and REC_VERSION, so they will appear in the WHERE clause of the constructed SQL statement.
• ODBC insert stage (O_DW_Customers_SCD2_Ins) – insert action 'Insert rows without clearing'; the key column is CUST_ID.
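The stage flow above (lookup, discrepancy check, update/insert) can be sketched outside the tool. Below is a minimal Python sketch of the same logic under stated assumptions: the field names REC_VERSION, CURRENT_IND and EFFECTIVE_DATE are illustrative, not necessarily the job's actual column names, and the real job performs the UPDATE/INSERT through the two ODBC stages:

```python
from datetime import date

# Existing dimension rows keyed by CUST_ID (current version of each customer).
# Field names here are illustrative assumptions.
dimension = {
    "ETIMAA5": {"CUST_ID": "ETIMAA5", "CITY": "New York",
                "REC_VERSION": 1, "CURRENT_IND": "Y",
                "EFFECTIVE_DATE": date(2005, 1, 1)},
}

def apply_scd2(incoming, dim, today):
    updates, inserts = [], []
    for row in incoming:
        old = dim.get(row["CUST_ID"])           # T001: look up the old values
        if old and old["CITY"] != row["CITY"]:  # T002: pass only changed rows
            # T003: close the old record...
            updates.append({**old, "CURRENT_IND": "N"})
            # ...and insert the new one with version + 1 and the current date.
            inserts.append({"CUST_ID": row["CUST_ID"], "CITY": row["CITY"],
                            "REC_VERSION": old["REC_VERSION"] + 1,
                            "CURRENT_IND": "Y", "EFFECTIVE_DATE": today})
    return updates, inserts

updates, inserts = apply_scd2(
    [{"CUST_ID": "ETIMAA5", "CITY": "Florida"}], dimension, date(2006, 6, 1))
```

Only changed records produce output, exactly as in the T002 discrepancy check; unchanged records fall out of the stream and the dimension keeps its full history.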
D_CUSTOMER dimension table after Datawarehouse refresh

DataStage FAQ

1) What is a job sequence used for? What are batches? What is the difference between a job sequence and a batch?
A job sequence allows you to specify a sequence of server or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job. A batch is simply a group of jobs run together from the Director; unlike a sequence, it offers no conditional control flow.
2) Why is a hashed file faster than a sequential file or an ODBC stage?
A hashed file is indexed: it works under a hashing algorithm applied to the key, which is why a search is faster in a hashed file.
What is the complex flat file stage? In which situations do we use it?
The CFF stage is used to read files in EBCDIC format, mainly mainframe files with REDEFINES.
3) What are the main differences between a server job and a parallel job in DataStage?
Server jobs have few stages and are mainly logic-intensive: we use the transformer for most things, and they do not use MPP systems.
Parallel jobs have many stages and are stage-intensive: for each particular task there is a built-in stage, and they can use MPP systems.
4) Differentiate between pipeline and partition parallelism.
Pipeline parallelism: consider three CPUs connected in series. While data is being fed into the first one, it starts processing; simultaneously the data is being transferred into the second CPU, and so on. You can compare this with three sections of a pipe: as water enters the pipe it starts moving through all the sections.
Partition parallelism: consider three CPUs connected in parallel and fed with data at the same time, spreading the load. You can compare this with a single big pipe containing three inbuilt pipes: as water is fed to them, a large quantity is consumed in less time.
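As an analogy only, the two forms of parallelism can be sketched in plain Python: generators stand in for the pipelined CPUs (each stage consumes rows as the previous stage produces them), and list slicing stands in for the partitions. This illustrates the data flow, not the DataStage engine itself:

```python
# Pipeline parallelism: rows flow stage to stage as they are produced,
# like water moving through sections of one pipe.
def extract():
    for n in range(6):
        yield n

def transform(rows):
    for n in rows:
        yield n * 10

def load(rows):
    return list(rows)

pipelined = load(transform(extract()))

# Partition parallelism: the data is split and each partition is processed
# independently, like three pipes fed at the same time.
def partition(rows, n):
    # Round-robin split into n partitions.
    return [rows[i::n] for i in range(n)]

parts = partition(list(range(6)), 3)
partitioned = [x * 10 for p in parts for x in p]
```

A real parallel job combines both: each partition runs its own pipeline, which is why adding CPUs scales throughput.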
5) How to read data from Excel files? My problem is that my data file has some commas in the data, but the delimiter we use is |. How to read the data, explain with steps?
1. Create a DSN for your Excel file by picking the Microsoft Excel Driver.
2. Take ODBC as the source stage.
3. Configure the ODBC stage with the DSN details.
4. While importing metadata for the Excel sheet, make sure you select the system tables check box.
5. Note: in the Excel sheet the first line should contain the column names.
6) What is meant by performance tuning techniques? Give examples.
Performance tuning means taking action to increase the performance of a slowly running job, for example:
1) Use the link partitioner and link collector to speed up performance.
2) Use sorted data for aggregation.
3) Use a sorter at the source side and aggregation at the target side.
4) Tune the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
5) Do not use an IPC stage at the target side.
Note that this mainly concerns server jobs; in Parallel Extender these things are taken care of by the stages themselves.

7) How do we distinguish the surrogate key in different dimension tables?
The surrogate key is the key field in each dimension.
8) How to read data from Excel files? Explain with steps.
One way:
* Save the file as .csv (comma-separated values).
* Use a flat file stage in the DataStage job.
* Double-click on the flat file stage and assign the .csv file (which you stored) as the input file.
* Import metadata for the file. (Once you have imported or typed the metadata, click View Data to check the data values.)
Then do the rest of the transformations as needed.
Alternatively, create a new DSN for the Excel driver and choose the workbook from which you want data, then use the ODBC stage to access Excel, i.e. import the Excel sheet using the new DSN created for it.
9) How can we generate a surrogate key in server/parallel jobs?
In parallel jobs we can use the Surrogate Key Generator stage.
In server jobs we can use an inbuilt routine called KeyMgtGetNextValue.
You can also generate the surrogate key in the database using a sequence generator.
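Whichever mechanism is used, the idea behind a surrogate key generator is just a monotonically increasing counter. A minimal Python sketch of the concept (the stage, routine, and sequence named above are the real DataStage/database mechanisms; this is only an illustration):

```python
import itertools

def make_surrogate_key_generator(start=1):
    # Mimics a database sequence or KeyMgtGetNextValue conceptually:
    # each call hands out the next unused integer key.
    counter = itertools.count(start)
    return lambda: next(counter)

next_key = make_surrogate_key_generator()

# Assign a surrogate key to each incoming dimension row.
dim_rows = [{"CUST_ID": cid, "SKEY": next_key()} for cid in ("A", "B", "C")]
```

In practice the counter's state must survive job restarts, which is exactly what the database sequence or the key-management routine provides.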

10) What is an environment variable?
Basically an environment variable is a predefined variable that we can use while creating a DS job. We can set it at either project level or job level. Once we set a specific variable, that variable is available in the project/job.
11) Can we define new environment variables?
Yes, through the DS Administrator; for further details refer to the DS Administrator guide.
Environment variables are used at the project or job level. We can use them to configure the job, e.g. to associate the configuration file (without this you cannot run your job) or to increase the sequential file or dataset read/write buffer.
Ex: $APT_CONFIG_FILE
There are many such environment variables. Go to job properties, click on the Parameters tab, then click on 'Add Environment Variable' to see most of them.

12) How can we test jobs?
Testing of jobs can be performed at many different levels: unit testing, SIT and UAT phases.
Testing basically involves functionality and performance tests.
First, data for the job needs to be created to test the functionality. By changing the data we can see whether the requirements are met by the existing code. Every iteration of code change should be accompanied by a testing iteration.
Performance tests basically involve load tests to see how well the existing code performs within a finite period of time. Performance tuning can be applied to the SQL, the job design, or the BASIC/OSH code for faster processing times.
In addition, all job designs should include error correction and failover support so that the code is robust.
13) What is the use of a hash file? Instead of a hash file, why can't we use a sequential file?
A hash file is used to eliminate duplicate rows based on the hash key, and is also used for lookups. DataStage does not allow a sequential file to be used as a lookup directly: you can use a sequential file for a lookup, but you would need to write your own routine to match the columns, and both coding time and execution time would be more expensive. When you generate a hash file, the file indexes the key using an inbuilt hashing algorithm, so a lookup against it is much faster. It also eliminates duplicate rows. Hash files can be held in memory, hence faster performance than a sequential file.
14) What is a routine?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are the different types of routines:
1) Transform functions
2) Before/after job subroutines
3) Job control routines
A routine is a user-defined function that can be reused within the project.
15) How can we create environment variables in DataStage?
This mostly comes under the Administrator. As a designer, we can add one directly via Designer > View > Job Properties > Parameters > Add Environment Variable > User Defined.
16) How to eliminate duplicate rows in DataStage?
You can remove duplicate rows in more than one way:
1. In DS there is a stage called Remove Duplicates where you can specify the key.
2. Alternatively, you can specify the key in another stage; the stage itself then removes duplicate rows based on the key at processing time.
17) What is the Pivot stage? Why and for what purpose is it used?
The Pivot stage is used to turn horizontal columns into vertical rows.
The Pivot stage supports only horizontal pivoting (columns into rows).
The Pivot stage does not support vertical pivoting (rows into columns).
Example: in the source table below there are two columns holding the quarterly sales of a product, but the business requirement is that the target should contain a single column representing quarterly sales. We can achieve this using the Pivot stage, i.e. horizontal pivoting.
Source Table
ProdID Q1_Sales Q2_Sales
1010 123450 234550
Target Table
ProdID Quarter_Sales Quarter
1010 123450 Q1
1010 234550 Q2
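The horizontal pivot shown in the tables above can be sketched in Python, using the example's column names (this is only an illustration of what the stage does, not the stage itself):

```python
# Source row matching the example table above.
source = [{"ProdID": 1010, "Q1_Sales": 123450, "Q2_Sales": 234550}]

def pivot_horizontal(rows, value_cols):
    # Each (row, quarter-column) pair becomes its own output row:
    # columns are turned into rows, i.e. horizontal pivoting.
    out = []
    for row in rows:
        for col in value_cols:
            out.append({"ProdID": row["ProdID"],
                        "Quarter_Sales": row[col],
                        "Quarter": col.split("_")[0]})  # "Q1_Sales" -> "Q1"
    return out

target = pivot_horizontal(source, ["Q1_Sales", "Q2_Sales"])
```

The output matches the target table above: one row per product per quarter, with the quarter name carried in its own column.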
18) What are the various processes which start when the DataStage engine starts?
What changes need to be done on the database side if I have to use the DB2 stage?
Is the DataStage engine responsible for compilation, execution, or both?
Three processes start when the DataStage engine starts:
1. DSRPC
2. DataStage Engine Resources
3. DataStage Telnet Services
19) What is the difference between a reference link and a straight link?
A straight link is one where data is passed directly to the next stage; a reference link is one which shows that the stage has a reference (a reference key) to a lookup table.
For example, in Oracle the EMP table has a reference to the DEPT table.
In DataStage:
two table stages as sources (one on a straight link, the other on a reference link) feeding one transformer stage as the process;
or two file stages as sources (one on a straight link, the other on a reference link via a hash file used as the reference) and one transformer stage.

20) What is Runtime Column Propagation and how is it used?
If the data has more columns than are defined in the metadata, and runtime column propagation is enabled, DataStage will propagate those extra columns through the rest of the job.
Can both the source system (Oracle, SQL Server, etc.) and the target data warehouse (which may be Oracle, SQL Server, etc.) be on a Windows environment, or should one of the systems be on UNIX/Linux?
Your source systems can be anything (Oracle, SQL Server, DB2, flat files, etc.), but your target system for the complete data warehouse should be a single database (Oracle, SQL Server, DB2, etc.).
21) What is the difference between the OCI stage and the ODBC stage?
Oracle OCI:
We can write the source query in this stage, but we cannot write a lookup query in it; instead we use a hash file stage for the lookup.
We have the facility to run additional statements before (Oracle OCI/Output/SQL/Before) or after (Oracle OCI/Output/SQL/After) executing the actual query.
We do not have a multi-row lookup facility in this stage.

ODBC:
We can write both the source query and the lookup query in this stage itself.
We do not have the facility to run multiple queries in this stage.
We have a multi-row lookup facility in this stage.
22) How to find the process ID? Explain with steps.
In UNIX you can use the ps -ef command; it displays all the processes currently running on the system along with their process IDs.
From the DS Director, follow the path:
Job > Cleanup Resources.
There you can also see the PID; it displays all the currently running processes.
Depending on your environment, you may have lots of process IDs. From one of the DataStage docs: you can try this on any given node: $ ps -ef | grep dsuser, where dsuser is the account for DataStage. If the ps command does not make sense, you will need some background theory about how processes work in UNIX (or the MKS environment when running on Windows). Also from the DataStage docs: APT_PM_SHOW_PIDS – if this variable is set, players will output an informational message upon startup, displaying their process ID.
23) How can I schedule the cleaning of the file &PH& by dsjob?
Create a job with a dummy transformer and a sequential file stage. In the before-job subroutine, use ExecTCL to execute the following command:
CLEAR.FILE &PH&
24) If we are using two sources having the same metadata, how do we check whether the data in the two sources is the same or not? And if the data is not the same, how do we abort the job?
Use a Change Capture stage and output it into a transformer.
Write a routine that aborts the job, initiated on the condition
@INROWNUM = 1.
So if the data does not match, a row is passed to the transformer and the job is aborted.
25) What is the difference between ETL and ELT?
ETL usually scrubs the data and then loads it into the data mart or data warehouse, whereas ELT loads the data first and then uses the RDBMS to scrub it and reload it into the data mart or data warehouse.
ETL = Extract >>> Transform >>> Load
ELT = Extract >>> Load >>> Transform
In ETL, transformation takes place in the staging area; in ELT, transformation takes place at either the source side or the target side.
26) For what purpose are .dsx files used in DataStage?
.dsx is the standard file extension for exported DataStage jobs. Whenever we export a job or a sequence, the file is exported in the .dsx format. A standard usage is that we develop a job in our test environment and, after testing, export it and save it as x.dsx. This can be done using the DataStage Manager.

27) How do you remove duplicates without using the Remove Duplicates stage?
In the target, make the column a key column and run the job.
Use a Sort stage and set the property Allow Duplicates to false.
Or hash-partition the input data and check the Sort and Unique options.
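The sort-and-unique approach amounts to keeping only the first row of each key group. A Python sketch of that behaviour (illustrative field names; the effect matches a Sort stage with Allow Duplicates set to false):

```python
def remove_duplicates(rows, key):
    # Sort by key, then keep only the first row of each key group --
    # the same effect as sorting with the Unique/Allow Duplicates=false option.
    seen, out = set(), []
    for row in sorted(rows, key=lambda r: r[key]):
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out

rows = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}, {"id": 2, "v": "c"}]
unique = remove_duplicates(rows, "id")
```

Note that which duplicate survives depends on the sort order, which is why the stage options (and any secondary sort keys) matter in the real job.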
28) How do you call procedures in DataStage?
Use the Stored Procedure Stage
29) What is an environment variable? What is the use of it?
Basically an environment variable is a predefined variable that we can use while creating a DS job. We can set it at either project level or job level. Once we set a specific variable, that variable is available in the project/job.
We can also define new environment variables; for that we use the DS Administrator. For further details refer to the DS Administrator guide.
30) How can we create read-only jobs in DataStage?
During export there is an option: click on the Options tab, and under the include options you will find a read-only DataStage job setting; just enable it.
31) How to run a job from the command prompt in UNIX?
Use the dsjob command, e.g.:
dsjob -run -jobstatus <project name> <job name>
32) What is the difference between a Transform and a Routine in DataStage?
A transform converts data from one form to another, whereas routines describe the business logic.
33) How do you clean the DataStage repository?
Remove log files periodically, and clear the &PH& file:
CLEAR.FILE &PH&
34) What are the transaction size and array size in the OCI stage? How are they used?
Transaction Size – this field exists for backward compatibility, but it is ignored for release 3.0 and later of the plug-in. The transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab of the Input page.
Rows per transaction – the number of rows written before a commit is executed for the transaction. The default value is 0, that is, all the rows are written before being committed to the table.
Array Size – the number of rows written to or read from the database at a time. The default value is 1, that is, each row is written in a separate statement.
35) How to know the number of records in a sequential file before running a server job?
If your environment is UNIX, you can check with the wc -l filename command.
36) Other than round robin, what algorithm is used in the link collector? Explain how it works.
The other algorithm is sort/merge.
Using the sort/merge method, the stage reads multiple sorted inputs and writes one sorted output.
37) How to improve the performance of a hash file?
You can improve the performance of a hashed file by:
1. Preloading the hash file into memory – this can be done by enabling the preload option in the hash file output stage.
2. Write caching – data is written into a cache before being flushed to disk. You can enable this to ensure that hash files are written onto the cache in order, and flushed to disk in that order rather than in the order in which individual rows are written.
3. Preallocating – estimating the approximate size of the hash file so that the file does not need to be split too often after write operations.
38) What is the size of a flat file?
The flat file size depends on the amount of data the file contains.
39) What is the DataStage engine? What is its purpose?
The DataStage server contains the DataStage engine; the DS server interacts with the client components and the repository. The DS engine is used to develop and run jobs, and jobs can be developed only while the engine is running.
40) How to implement slowly changing dimensions in DataStage?
Slowly changing dimensions are a data warehousing concept, not a DataStage feature as such.
In Informatica PowerCenter there is a wizard to implement slowly changing dimensions; DataStage does not have that type of wizard, so SCDs must be implemented with manual job logic.
41) What is the difference between symmetric multiprocessing and massively parallel processing?
Symmetric multiprocessing (SMP) – some hardware resources may be shared by the processors. The processors communicate via shared memory and run a single operating system.
Cluster or massively parallel processing (MPP) – known as shared-nothing, in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed. The processors run their own operating systems and communicate via a high-speed network.
42) Give one real-time situation where the link partitioner stage is used.
If we want to send data from the source to the targets quickly, we use the link partitioner stage in server jobs; it can create a maximum of 64 partitions. It is an active stage. Normally we cannot connect two active stages, but connecting the link partitioner to a transformer or aggregator stage is an accepted exception. The data sent from the link partitioner is collected by the link collector, at a maximum of 64 partitions. The link collector is also an active stage, so to avoid an active-to-active connection from the transformer to the link collector we use an inter-process (IPC) stage. As the IPC stage is passive, the data can then be collected by the link collector. Note that inter-process communication can be used only when the target is a passive stage.
43) What does the separation option in a static hash file mean?
The different hashing algorithms are designed to distribute records evenly among the groups of the file, based on characters and their position in the record IDs.
When a hashed file is created, separation and modulo respectively specify the group buffer size and the number of groups allocated for the file. When a static hash file is created, DataStage creates a file that contains the number of groups specified by the modulo.
Size of hash file = modulo (number of groups) × separation (group buffer size)
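Under that formula, the static file's size can be estimated directly. A small Python sketch; note that treating the separation as counted in 512-byte blocks is an assumption here (it is the conventional UniVerse-style block size), as are the example modulo and separation values:

```python
def static_hash_file_bytes(modulo, separation, unit=512):
    # modulo     = number of groups in the file
    # separation = group buffer size, in `unit`-byte blocks
    #              (512 is assumed as the conventional block size)
    return modulo * separation * unit

# Hypothetical example: modulo 199, separation 4
size = static_hash_file_bytes(199, 4)   # 199 * 4 * 512 bytes
```

The point of the estimate is sizing: choosing modulo and separation so the expected data volume fits without the file overflowing its groups.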
44) What is the purpose of the exception activity in DataStage 7.5?
It is used to catch exceptions raised while running the job.
The stages following the exception activity are executed whenever an unknown error occurs while running the job sequencer.
45) What is the difference between a sequential file and a dataset? When should the Copy stage be used?
A sequential file stores a small amount of data, with any extension such as .txt, whereas a dataset stores a huge amount of data and is opened only with the .ds extension; a dataset also stores its data in the engine's internal format. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop columns or change the order of columns.

46) Where do we use the link partitioner in a DataStage job? Explain with an example.
We use the link partitioner in DataStage server jobs. The link partitioner stage is an active stage which takes one input and allows you to distribute partitioned rows to up to 64 output links.
47) How to kill a job in DataStage?
By killing the respective process ID.
You should use kill -14 so the job ends cleanly; using kill -9 sometimes leaves things in a bad state.
48) How to parameterize a field in a sequential file? (I am using DataStage as the ETL tool, with a sequential file as the source.)
We cannot parameterize a particular field in a sequential file; instead we can parameterize the source file name, e.g.:
#FILENAME#
49) How to drop the index before loading data into the target, and how to rebuild it, in DataStage?
This can be achieved with the 'Direct Load' option of the SQL*Loader utility.
50) What happens if the size of a hash file exceeds 2 GB? Does it overwrite the current rows?
It overwrites the file.
51) Is it possible for two users to access the same job at a time in DataStage?
No, it is not possible for two users to access the same job at the same time. DS will produce the following error: "Job is accessed by other user".
It becomes possible only after you first kill that job's processes in the DataStage Director using the 'Cleanup Resources' option.
52) How to find errors in a job sequence?
Using the DataStage Director we can find the errors in a job sequence.
53) What is job control? How can it be used? Explain with steps.
JCL stands for Job Control Language; it is used to run a number of jobs at a time, with or without loops. Steps: click on Edit in the menu bar and select 'Job Properties', then enter the parameters (for example STEP_ID, SRC, DSN, Username and Password, each as a string parameter). After editing these, go to the job control tab, select the jobs from the list box, and run the job.
54) What is job control? How is it developed? Explain with steps.
Job control means controlling DataStage jobs from another DataStage job. Example: consider two jobs, XXX and YYY. Job YYY can be executed from job XXX by using DataStage macros in routines.
To execute one job from another, the following steps need to be followed in the routine:
1. Attach the job using the DSAttachJob function.
2. Run the job using the DSRunJob function.
3. Stop the job using the DSStopJob function.

55) How can we call a routine in a DataStage job? Explain with steps.
Routines are used for implementing business logic. There are two types: 1) before-subroutines and 2) after-subroutines. Steps: double-click on the transformer stage, right-click on any one of the mapping fields, select the [ds routines] option, give the business logic within the edit window, and select either of the options (before/after subroutine).
56) What are the most important aspects that a beginner must consider when doing his first DS project?
He should be good at data warehousing concepts and should be familiar with all the stages.
57) What are the different types of lookups in DataStage?
– Look-up file stage – Generally used with Look Up stage
– Hash Look-up
– You can also implement a “look up” using Merge stage
There are two types of lookups: the Lookup stage and the Lookup File Set.
Lookup: references another stage or a database to get data from it and transforms it to the other database.
Lookup File Set: allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating lookup file sets, one file will be created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs.
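The reference-link semantics can be sketched with a plain dict standing in for the hashed or lookup file (illustration only, not a DataStage API):

```python
# Normal lookup, modeled with a dict: rows whose key is missing from the
# reference data go to a reject list, mirroring the Lookup stage's reject link.
reference = {"PL": "Poland", "DE": "Germany"}  # reference (lookup) data
stream = [{"cust": 1, "country": "PL"}, {"cust": 2, "country": "XX"}]

matched, rejected = [], []
for row in stream:
    name = reference.get(row["country"])
    if name is None:
        rejected.append(row)                      # no match -> reject link
    else:
        matched.append({**row, "country_name": name})

print(matched)
print(rejected)
```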
58) Where are flat files actually stored? What is the path?
Flat files store the data, and the path can be given in the General tab of the Sequential File stage.
Normally flat files are stored on FTP servers or in local folders; .CSV, .XLS and .TXT are commonly used flat-file formats.
59) How to find the number of rows in a sequential file?
Using Row Count system variable
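If you just need to verify the count outside a job, it is a one-liner; here StringIO stands in for a real sequential file:

```python
# Counting rows in a sequential (flat) file. Inside a job you would use the
# link's row-count system variable instead; this is only an external check.
from io import StringIO

flat_file = StringIO("1,Lisa,New York\n2,John,Boston\n")  # stands in for a real file
row_count = sum(1 for _ in flat_file)
print(row_count)  # prints 2
```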
60) How to implement a Type 2 slowly changing dimension in DataStage? Explain with an example.
Slowly changing dimensions are a common problem in data warehousing. For example: a customer called Lisa in company ABC lives in New York, and later moves to Florida. The company must now modify her address. In general there are 3 ways to solve this problem:

Type 1: The new record replaces the original record; no trace of the old record remains.
Type 2: A new record is added into the customer dimension table; the customer is essentially treated as two different people.
Type 3: The original record is modified to reflect the changes.

In Type 1 the new value overwrites the existing one, which means no history is maintained; the record of where the person lived before is lost. It is simple to use.

In Type 2 a new record is added, so both the original and the new record are present, and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the size of the dimension table grows, so storage and performance can become a concern.
Type 2 should only be used if it is necessary for the data warehouse to track historical changes.

In Type 3 there are 2 columns: one to indicate the original value and the other to indicate the current value. For example, a new column is added which shows the original address as New York and the current address as Florida. This keeps some part of the history and the table size is not increased. The problem is that when the customer moves from Florida to Texas, the New York information is lost, so Type 3 should only be used if changes will occur only a finite number of times.
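The Type 2 behavior for the Lisa example can be sketched in a few lines (column names such as "sk" and "current" are invented for the illustration, not a fixed DataStage convention):

```python
# Minimal SCD Type 2 sketch: expire the current row for a natural key and
# append a new versioned row with a fresh surrogate key.
def scd_type2(dim, natural_key, new_row, next_sk):
    """Close the open version for natural_key, then add the new version."""
    for row in dim:
        if row["cust"] == natural_key and row["current"]:
            row["current"] = False          # close the old version
    dim.append({**new_row, "sk": next_sk, "current": True})

dim = [{"sk": 1, "cust": "Lisa", "city": "New York", "current": True}]
scd_type2(dim, "Lisa", {"cust": "Lisa", "city": "Florida"}, next_sk=2)
print(dim)
```

Both rows remain in the dimension, so history is preserved at the cost of table growth, exactly the trade-off described above.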

61) What is difference between Merge stage and Join stage?
Join can have a maximum of two input datasets; Merge can have more than two input datasets.
Merge and Join stage differences:
1. Merge has reject links.
2. Merge can take multiple update links.
3. If you use Merge for comparison, the first matching data will be the output, because Merge uses the update links to extend the primary details coming from the master link.
Some say that Join does not support more than two inputs, while Merge supports two or more inputs (one master and one or more update links). That is highly incomplete information: Join does support two or more input links (left, right, and possibly intermediate links). However, for a full outer join, more than two links are not supported.
Coming back to main question of difference between Join and Merge Stage, the other significant differences that I have noticed are:
1) Number Of Reject Link
(Join) does not support reject link.
(Merge) has as many reject links as update links (if there are n input links, one will be the master link and n-1 will be update links).

2) Data Selection
(Join) There are various ways in which data is being selected. E.g. we have different types of joins, inner, outer (left, right, full), cross-join, etc. So, you have different selection criteria for dropping/selecting a row.
(Merge) Data in Master record and update records are merged only when both have same value for the merge key columns.
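The data-selection difference can be sketched with plain Python dicts (an illustration of the semantics only, not a DataStage API):

```python
# Inner join keeps only key matches (no reject side); merge extends master
# rows with update rows and can reject unmatched update rows.
master  = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
updates = [{"id": 1, "city": "NY"}, {"id": 3, "city": "LA"}]

# Inner-join style: only matching keys survive, unmatched rows are dropped.
joined = [{**m, **u} for m in master for u in updates if m["id"] == u["id"]]

# Merge style: master drives the output; unmatched update rows are rejected.
by_id = {u["id"]: u for u in updates}
merged = [{**m, **by_id.get(m["id"], {})} for m in master]
rejects = [u for u in updates if u["id"] not in {m["id"] for m in master}]

print(joined)
print(merged)
print(rejects)
```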
62) What is the difference between "Validated OK" and "Compiled" in DataStage?
When you compile a job, it ensures that basic things like all the important stage parameters have been set, mappings are correct, etc. and then it creates an executable job.
You validate a compiled job to make sure that all the connections are valid. All the job parameters are set and a valid output can be expected after running this job. It is like a dry run where you don’t actually play with the live data but you are confident that things will work.
When we say, “Validating a Job”, we are talking about running the Job in the “check only” mode. The following checks are made:
– Connections are made to the data sources or data warehouse.
– SQL SELECT statements are prepared.
– Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that use the local data source are created, if they do not already exist.
63) What are the environment variables in DataStage? Give some examples?
These are variables used at the project or job level. We can use them to configure the job, i.e. associate the configuration file (without this you cannot run the job) or increase the sequential-file or dataset read/write buffer.
Ex: $APT_CONFIG_FILE
Like above we have so many environment variables. Please go to job properties and click on “add environment variable” to see most of the environment variables.
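A job process picks such a variable up from its environment like any other; the path below is invented for the example:

```python
# Reading an environment variable the way a parallel job resolves
# $APT_CONFIG_FILE. The path is a made-up example, not a real default.
import os

os.environ["APT_CONFIG_FILE"] = "/opt/ds/configs/default.apt"  # e.g. set per job
config_file = os.environ.get("APT_CONFIG_FILE")
print(config_file)
```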
64) What is the purpose of using keys, and what is the difference between a surrogate key and a natural key?
We use keys to provide relationships between the entities (Tables). By using primary and foreign key relationship, we can maintain integrity of the data.
The natural key is the one coming from the OLTP system.
The surrogate key is an artificial key which we create in the target data warehouse. We can use surrogate keys instead of the natural key; in SCD Type 2 scenarios surrogate keys play a major role.
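Surrogate-key assignment during a dimension load can be sketched as follows (the counter stands in for whatever key generator the warehouse uses, e.g. a database sequence):

```python
# Assigning surrogate keys to incoming natural keys: each new natural key
# gets the next artificial key; repeats reuse the existing mapping.
from itertools import count

key_gen = count(start=1)
sk_map = {}  # natural key -> surrogate key

for natural_key in ["CUST-017", "CUST-042", "CUST-017"]:
    if natural_key not in sk_map:
        sk_map[natural_key] = next(key_gen)

print(sk_map)
```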
65) How do you do Usage analysis in DataStage?
1. If you want to know whether a job is part of a sequence, right-click the job in the Manager and select Usage Analysis. It will show all the job's dependents.
2. To find how many jobs are using a particular table.
3. To find how many jobs are using a particular routine.
In this way you can find all the dependents of a particular object. The analysis is nested: you can move forward and backward and see all the dependents.
66) How to remove duplicates in a server job?
1) Use a hashed file stage or
2) If you use the sort command in UNIX (in a before-job subroutine), you can reject duplicate records using the -u parameter, or
3) using a Sort stage
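All three options boil down to keep-first-per-key deduplication, which can be sketched as:

```python
# Duplicate removal equivalent to a hashed file or Sort stage with unique
# rows: keep the first row seen for each key (UNIX "sort -u" is similar).
rows = [("A", 10), ("B", 20), ("A", 30)]

seen, unique_rows = set(), []
for key, value in rows:
    if key not in seen:       # first occurrence wins, later duplicates dropped
        seen.add(key)
        unique_rows.append((key, value))

print(unique_rows)
```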
67) Will DataStage consider the second constraint in the transformer if the first constraint is satisfied (if link ordering is given)?”
Answer: Yes.
68) What are constraints and derivation?
Constraints are used to check a condition and filter the data; only records meeting the condition are processed further. Example: Cust_Id<>0 is set as a constraint, so only records with a non-zero customer id pass through.
A derivation is an expression used to derive a field's value, for example when you need to calculate a SUM, an AVG, or an interest rate.
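In miniature, a constraint filters rows and a derivation computes a value from them:

```python
# Constraint vs derivation: the constraint filters rows (cust_id <> 0),
# the derivation computes a new value (SUM of salary over passing rows).
rows = [{"cust_id": 1, "salary": 100}, {"cust_id": 0, "salary": 50},
        {"cust_id": 2, "salary": 200}]

passed = [r for r in rows if r["cust_id"] != 0]   # constraint
total_salary = sum(r["salary"] for r in passed)   # derivation

print(total_salary)  # prints 300
```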
69) What is a project? Specify its various components?
You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains:
DataStage jobs.
Built-in components. These are predefined components used in a job.
User-defined components. These are customized components created using the DataStage Manager or DataStage Designer

Datastage 7.5 certification

According to the official IBM website (http://www-03.ibm.com/certify/tests/obj415.shtml), below are the areas to prepare in order to pass the test:

Test 415 – IBM WebSphere IIS DataStage Enterprise Edition v7.5 

Section 1 – Installation and Configuration (5%) 

Describe how to properly install and configure DataStage EE 
Describe users and groups 
Describe the environment (e.g., dsenv, ODBC) 
Setup database connectivity 
Describe OS configuration/kernel 
Describe USS configuration 
Identify required components for server 
C++ compiler 
Identify the tasks required to create and configure a project to be used for EE jobs. 
Project location 
Assign DataStage EE roles 
Environment defaults 
Given a configuration file, identify its components and its overall intended purpose. 
Project location 
Assign DataStage EE roles 
Environment defaults 
List the steps necessary to start/stop DataStage EE properly. 
netstat -a|grep ds 

Section 2 – Metadata (5%) 

Demonstrate knowledge of Orchestrate schema. 
Distinguish internal data type (Orchestrate schema) vs external data type 
Describe how to set extended properties for table definition 
Import metadata using plug-ins vs orchdbutil 
Explain nullable mapping rules (e.g., source vs target) 
NLS data types 
Identify the method of importing metadata. 
Flat sources (e.g., sequential file, Orchestrate schema, ODBC, etc.) 
COBOL CopyBook 
XML 
Given a scenario, demonstrate knowledge of runtime column propagation. 
Usage 
Impact on stage mapping and target databases 

Section 3 – Persistent Storage (10%) 

Given a scenario, explain the process of importing/exporting data to/from framework (e.g., sequential file, external source/target). 
Explain use of various file stages (e.g., file, CFF, fileset, dataset) and where appropriate to use 
If USS, define the native file format (e.g., EBCDIC, VSAM) 
Given a scenario, describe proper use of a sequential file. 
Read in parallel (e.g., reader per node, multiple files) 
Handle various formats (e.g., fixed vs variable, delimited vs non-delimited, etc.) 
Describe how to import and export nullable data 
Explain how to identify and capture rejected records (e.g., log counts, using reject link, options for rejection)
Given a scenario, describe proper usage of CFF (native not plug-in). 
Explain how to import data from a file that has multiple varying record types (e.g., COBOL CopyBook, EBCDIC to ASCII) 
Describe proper usage of FileSets and DataSets. 
Explain differences and similarities of FileSet and DataSet (e.g., header data file segments, internal DS vs external format (FS)) 
Determine which tools can be used to manage FileSets and DataSets (GUI and CLI) 
Describe use of FTP stage for remote data (e.g., how to parallel, plug-in vs enterprise). 
Restructure stages (e.g., column import/export) 
Identify importing/exporting of XML data. 
XML stage options and usage 
XPath and XSLT 

Section 4 – Parallel Architecture (10%) 

Given a scenario, demonstrate proper use of data partitioning and collecting. 
Partitioning goals (e.g., data distribution, meeting data grouping requirements) 
Explain the differences and usage of round robin, entire, same, and hash partitioning 
Explain the differences between auto, ordered, and sort collector 
Identify partitioning type, parallel/sequential by analyzing a DataStage EE screen shot 
Explain the differences between partitioning keys and sorting (stage) keys through a scenario where these do not match 
Demonstrate proper use of partitioning methods based on given business requirements and DataStage EE technical requirements 
Explain the differences between a funnel state and a collector 
Describe impact of partitioning and re-partitioning in an MPP/cluster environment 
Given a scenario, demonstrate knowledge of parallel execution. 
Given a job design and configuration file, provide estimates of the number of processes generated at runtime 
Explain the purpose and use of resource and node pools 
Given a source DataSet, describe the degree of parallelism using auto and same partitioning 

Section 5 – Databases (15%) 

Given a scenario, demonstrate proper selection of database stages and database specific stage properties. 
DB2 
Based on DB2 connectivity requirements, select appropriate stage (e.g., DB2 API, DB2 Enterprise) 
List environment variables needed to successfully run a DB2 job 
Teradata 
Based on functional requirements, select appropriate Teradata stage (e.g., TD Enterprise, TD MultiLoad (target only), TD API) 
Requested sessions and total sessions 
DRS 
ODBC 
Informix 
Sybase 
Oracle 
Remote Server 
Partition table 
Given a scenario using Oracle, load target and proper use of index mode and $API_ORA_LOAD_OPTIONS 
Identify source database options. 
Identify which stages/options read in parallel 
Explain the degree of parallelism for Oracle Enterprise and DB2 Enterprise 
Identify the use of “sparse” vs “normal” lookup 
When to use (e.g., 1:100) 
How to construct sparse lookup and SQL statements 
Given a scenario, demonstrate knowledge of target database options. 
Articulate benefits and limitations of using parallel load method (e.g., constraints, speed, exclusive locks, etc.) 
Explain the differences between upsert and load methods on target database stages 
Describe how to control restart of DB target (separate the “L” from ETL) 
Use OPEN, CLOSE, create temp, Load, SELECT FROM… INSERT INTO, DROP 
Identify the impact of RCP when target has fewer columns than job flow 
Separate update and insert records before target upsert 
Given a scenario, describe how to design EE ETL jobs that will extract data from DB2 (or any DBMS), combine with data from another source and load to another DBMS target. 
Demonstrate knowledge of working with NLS database sources and targets. 

Section 6 – Data Transformation (15%) 

Given a scenario, demonstrate knowledge of default type conversions, output mappings, and associated warnings. 
Demonstrate appropriate use of modify for NULL handling, string trim, non-default conversions 
Given a scenario, demonstrate proper selections of Transformer stage vs other stages. 
Copy stage/output mapping 
No BASIC Transformer 
Filter vs Transformer constraint (e.g., filter vs interpreted) 
Given a scenario, describe Transformer stage capabilities (including stage variables, link variables, DataStage macros, constraints, system variables, link ordering, @PART NUM, functions). 
Explain NULL handling within parallel transformer (e.g., reject rule, NULL functions) 
Demonstrate the use of Transformer stage variables (e.g., to identify key grouping boundaries on incoming data) 
Use a Transformer only to generate a sequence of numbers in parallel 
Identify processes to add functionality not provided by existing DataStage stages (e.g., wrapper, buildops, user def functions/routines). 
Given a scenario, demonstrate no closed loop scenario when updating a reference table. 

Section 7 – Combining and Sorting Data (10%) 

Demonstrate knowledge of Join, Lookup, and Merge stages. 
Explain the differences between Lookup, Join, and Merge stages and demonstrate when to use each
Explain link (e.g., input/output/reject) requirements 
Explain sorting requirements 
Explain full outer join usage 
Demonstrate understanding of link ordering with JOIN and MERGE stages 
Lookup – continue (also outer JOIN) with non-nullable input columns 
Explain what happens with duplicate input column names on JOIN and MERGE 
Given a scenario, demonstrate knowledge of SORT stage. 
Describe the differences between parallel SORT, SORT, and sequential SORT 
Describe the differences between the unique option in Sort and the RemDup stage 
Demonstrate understanding of SORT key column properties (e.g., don’t sort, previously sorted)
Demonstrate understanding of framework inserted sorts 
Examine score 
Set $APT_SORT_INSERTION_CHECK_ONLY 
Distinguish between SQL ORDER BY vs parallel sort 
Explain NLS collation sequences 
Given a scenario, demonstrate understanding of Aggregator stage. 
Memory impact of sort vs hash 
Demonstrate use of Aggregator properties 
Count vs calculation 
Explicitly create output columns and map 
Take output of Aggregator and join with detail records 
Describe the proper usage of change capture/change apply. 
Separate delta rows (e.g., inserts vs updates) 

Section 8 – Automation and Production Deployment (10%) 

Given a scenario, articulate the change control process. 
“Dev to QA to prod” 
Explain the differences between compile and force compile 
Use of protected projects 
Import/Export 
Including compiled Transformer 
Managing custom components (e.g., buildops, wrappers) 
USS deployment topics (e.g., remote deployment, how to deploy) 
Identify the use of the dsjob command line utility. 
Explain how a DataStage job may be run using a third-party scheduler (e.g., AutoSys, Control-M, cron, etc.) 
Explain how performance run statistics from a job run can be captured 
Capture log records to a text file (XML format) 
Explain how to use parameters to simplify the creation and maintenance of job designs. 
Explain the processes necessary to run multiple copies of the source (job multi-instance) 
Given a scenario, demonstrate the ability to use job sequencers (e.g., exception handling, re-startable sequences, dependencies, passing a return value from a routine, parameter passing and job status). 
Create different designs using Job Sequencer 

Section 9 – Monitoring and Troubleshooting (10%) 

Given a scenario, demonstrate knowledge of parallel job score. 
Show how it is captured and interpret its results 
Identify and interpret its contents 
Number of processes 
Number of nodes 
Number of DataSets 
Combined operators 
Parallel/Sequential execution 
Inserted buffer and sort operators 
Given a scenario, identify and define environment variables that control EE with regard to added functionality and reporting. 
Articulate ability to identify runtime metadata vs design metadata 
APT_PM_SHOW_PIDS, etc. 
Given a process list, identify conductor, section leader, and player processes. 
Given a scenario, identify areas that may improve performance (e.g., buffer size, repartitioning, config files, operator combination, etc.). 

Section 10 – Job Design (10%) 

Demonstrate knowledge of shared containers. 
Reuse using RCP 
Using usage analysis and multi-job compile to recompile jobs that use a modified shared container 
Given a scenario, describe how to minimize SORTS and repartitions.
Demonstrate knowledge of creating restart points across jobs using intermediate DataSets.
Given a scenario, demonstrate proper use of standards. 
Naming conventions for stages and links 
Differences between annotation and description annotation 

Datastage TX Transformational Extender

IBM WebSphere DataStage TX provides support for industry standards and connectivity requirements so you can solve critical business problems in real time. Its Solutions-Oriented Architecture is open and scalable, which means the technology can rapidly adapt to meet specific industry needs, so you can accelerate implementation, reduce risks, and increase operational efficiencies.

IBM® WebSphere® TX stands for Transformation Extender. It extends WebSphere Message Broker capabilities with additional data transformation and validation capabilities for complex, variable, multi-transaction batch document EDI formats as used in financial services, insurance, healthcare, retail distribution and transport.
WebSphere Transformation Extender is a multi-input, multi-output transform engine with a proven track record across a wide range of industries, from finance and global commerce to pharmaceuticals. IBM customers can now leverage the scalability, universal connectivity and mediation capabilities of WebSphere Message Broker with WebSphere Transformation Extender delivered as a fully integrated subsystem of IBM's universal ESB.
Important Note: IBM is merging this offering into a combined edition with installations for deployment with WebSphere Process Server, WebSphere ESB, or WebSphere Message Broker. This offering is called WebSphere Transformation Extender for Integration Servers v8.2.

Withdrawal Announcement July 27, 2007

Effective September 30, 2008, IBM will no longer offer support for IBM WebSphere DataStage Transformation Extender Versions 6.0, 6.5 and 6.7, also known as Ascential DataStage TX, DataStage TX and Mercator Integration Broker. 




Withdrawn product → WebSphere Transformation Extender V8.1 equivalent

TX 6.0 and 6.5 → WTX 8.1
V6.5 Integration Broker → WebSphere TX with Command Server / WebSphere TX with Launcher / WebSphere TX with Command Server for z/OS / WebSphere TX with Launcher for z/OS
V6.5 Mercator Software Development Kit → WebSphere TX SDK
V6.5 Servlet Integration Plug-in → *
V6.5 EJB API → *
V6.5 IBM WebSphere MQ Integrator Plug-in → WebSphere TX for Message Broker
V6.5 CICS Execution Option → WebSphere TX with Command Server for z/OS
V6.5 IMS/DC Execution Option → WebSphere TX with Command Server for z/OS
V6.5 Communications Adapter → Included in Base Edition
V6.5 Mercator Design Studio → WebSphere TX Design Studio Client
V6.5 Importer: COBOL → Included in Base Edition
V6.5 Importer: Text → Included in Base Edition
V6.5 Importer Collection: XML → Included in Base Edition
V1.x and V2.0 EDI Integration Package: X12 → WebSphere TX Pack for X12
V1.x and V2.0 EDI Integration Package: EDIFACT → WebSphere TX Pack for EDIFACT
V2.x Healthcare Integration Package: HIPAA X12 → WebSphere TX Pack for HIPAA EDI
V2.x Healthcare Integration Package: HL7 → WebSphere TX Pack for HL7
V2.x Healthcare Integration Package: NCPDP → WebSphere TX Pack for NCPDP
V1.x and V2.0 EDI Integration Package: TRADACOMS → WebSphere TX Pack for TRADACOMS
V2.3 FS Manager Package → no replacement available
V2.3 Swift Integration Package → WebSphere TX Pack for SWIFTNet FIN
V2.3 Business Metadata Repository → no replacement available
V7.x SAP Integration Package → WebSphere TX Pack for SAP
V2.2 PeopleSoft Integration Package → WebSphere TX Pack for PeopleSoft
V2.0.0 and V2.0.1 Siebel Integration Package → WebSphere TX Pack for Siebel
V6.1 Commerce Manager → WebSphere TX Trading Manager

TX 6.7 → WTX 8.1
V6.7 Mercator Software Development Kit → WebSphere TX SDK
V6.7 Servlet Integration Plug-in → *
V6.7 EJB API → *
V6.7 BEA WebLogic Integration Plug-in → *
V6.7 IBM WebSphere MQ Integrator Plug-in → WebSphere TX for Message Broker
V6.7 CICS Execution Option → WebSphere TX with Command Server for z/OS
V6.7 IMS/DC Execution Option → WebSphere TX with Command Server for z/OS
V6.7 Communications Adapter → Included in Base Edition
V6.7 Integration Broker → WebSphere TX with Command Server / WebSphere TX with Launcher / WebSphere TX with Command Server for z/OS / WebSphere TX with Launcher for z/OS
V6.7 Mercator Design Studio → WebSphere TX Design Studio Client
V6.7 Importer: COBOL → Included in Base Edition
V6.7 Importer: Text → Included in Base Edition
V6.7 Importer Collection: XML → Included in Base Edition
V1.x and V2.0 Mercator Inside EDI Integration Package for X12 → WebSphere TX Pack for X12
V2.0 Mercator Inside EDI Integration Package for EDIFACT → WebSphere TX Pack for EDIFACT
V3.x Mercator InsideHIPAA → WebSphere TX Pack for HIPAA EDI
V3.x Mercator InsideHL7 → WebSphere TX Pack for HL7
V3.x Mercator InsideNCPDP → WebSphere TX Pack for NCPDP
V1.x and V2.0 Mercator EDI Integration Package for TRADACOMS → WebSphere TX Pack for TRADACOMS
V1.x and V2.0 Mercator EDI Integration Package for ODETTE → WebSphere TX Pack for ODETTE
V1.x and V2.0 Mercator EDI Integration Package for EANCOM → WebSphere TX Pack for EANCOM
V2.4 Mercator Inside SWIFT FIN → WebSphere TX Pack for SWIFTNet FIN
V2.4 Mercator InsideFSManager → no replacement available
V2.4 Mercator InsideSWIFT Advanced Integration Package → WebSphere TX Pack for SWIFTNet FIN
V2.4 Mercator InsideManager → no replacement available
V2.4 Mercator InsideOmgeo CTM → no replacement available
V2.4 Mercator InsidePayments → no replacement available
V7.2 Mercator Inside Integration Package for SAP → WebSphere TX Pack for SAP
V2.2 Mercator Inside Integration Package for PeopleSoft → WebSphere TX Pack for PeopleSoft
V2.0.1 Mercator Inside Integration Package for Siebel → WebSphere TX Pack for Siebel
V7.0 Commerce Manager → WebSphere TX Trading Manager
V6.7 Mercator InsideDataExchange → WebSphere TX with Launcher / WebSphere TX with Trading Manager / WebSphere TX Design Studio / WebSphere TX Pack for X12 / WebSphere TX Pack for EDIFACT

How to Transform and Filter data in datastage

It’s a very common situation and a good practice to design data stage jobs in which data flow goes in the following way:


 EXTRACT SOURCE -> DATA VALIDATION, REFINING, CLEANSING -> MAPPING -> DESTINATION

The data refining, validation and mapping part of the process is mainly handled by a transformer stage. Transformer stage doesn’t extract or write data to a target database. It handles extracted data, performs conversions, mappings, validations, passes values and controls the data flow. Transformer stages can have any number of input and output links. Input links can be primary or reference (used for lookups) and there can only be one primary input and any number of reference inputs. Please refer to the examples below to find out what is the use of transformers. In the job design depicted below there is a typical job flow implemented. The job is used for loading customers into the datawarehouse. The data is extracted from an ODBC data source, then filtered, validated and refined in the first transformer. Rejected (not validated) records are logged into a sequential file. The second transformer performs a lookup (into a country dictionary hash file) and does some other data mappings. The data is loaded into an Oracle database.