Transformer Stage to filter the data

Suppose our requirement is to filter the data department-wise from the file below 

samp_tabl 
1,sam,clerk,10 
2,tom,developer,20 
3,jim,clerk,10 
4,don,tester,30 
5,zeera,developer,20 
6,varun,clerk,10 
7,luti,production,40 
8,raja,production,40 

And our requirement is to get the target data as below 

In Target1 we need dept 10 & dept 40 employees. 

In Target2 we need dept 30 employees. 

In Target3 we need dept 20 & dept 40 employees. 

Take Job Design as below 

Seq.File——Tx——D.s (three output links, one per target) 

Read and Load the data in Source file 

In Transformer Stage just Drag and Drop the data to the target tables. 

Write expression in constraints as below 

dept_no=10 or dept_no=40 for Target1 

dept_no=30 for Target2 

dept_no=20 or dept_no=40 for Target3 

Click ok 

Give a file name for each target file, and 

Compile and Run the Job to get the Output 
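The constraints above can be sketched in plain Python as a stand-in for the job (using the sample table from above; this is an illustration, not DataStage code):

```python
import csv
from io import StringIO

# Sample source data (samp_tabl): e_id, e_name, designation, dept_no
data = """1,sam,clerk,10
2,tom,developer,20
3,jim,clerk,10
4,don,tester,30
5,zeera,developer,20
6,varun,clerk,10
7,luti,production,40
8,raja,production,40"""

target1, target2, target3 = [], [], []

for e_id, e_name, designation, dept_no in csv.reader(StringIO(data)):
    dept = int(dept_no)
    if dept == 10 or dept == 40:   # constraint for Target1
        target1.append(e_name)
    if dept == 30:                 # constraint for Target2
        target2.append(e_name)
    if dept == 20 or dept == 40:   # constraint for Target3
        target3.append(e_name)

print(target1)  # ['sam', 'jim', 'varun', 'luti', 'raja']
print(target2)  # ['don']
print(target3)  # ['tom', 'zeera', 'luti', 'raja']
```

Note that one record can satisfy more than one constraint (dept 40 employees land in both Target1 and Target3), just as a Transformer Stage evaluates each output link's constraint independently.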


Brief introduction to Datastage

DataStage was introduced in 1997 under the name DataIntegrator, by a UK company called VMark. VMark positioned the product as a conventional ETL tool with strong capabilities. The product later changed hands several times: it was taken over by a company called Informix, which renamed the product DataStage; the company itself was later renamed Ascential, and the product became Ascential DataStage. Ascential developed the product further by integrating it with the Orchestrate tool and the MKS Toolkit. In 2005, IBM acquired DataStage, renamed it IBM DataStage, and fixed many bugs.
IBM is a brand well known around the world and markets the product widely. Informatica is the main competing ETL tool to DataStage.
Now that we have covered the history of DataStage, read on for some basic concepts, such as stages.

How to create Data connection in Datastage

A data connection is a technique used for the reusability of mandatory stage property values. 
Steps to create Data Connection 

Click on Oracle Enterprise, click on Stage, then click on Data Connection 

You will get two options like 

a) Load Data Connection 

b) Save Data Connection 

Click on Save Data Connection 

Now give a Data Connection name (user defined) 

Give a description for the connection 

In Parameters, give the server name of your database, and: 

User name = your user name in the database 

Password = the password you are using for that ID 

Then click OK and Save 
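Conceptually, a saved data connection is just a named, reusable bundle of property values that any stage can load. A rough Python sketch of the idea (function and property names here are hypothetical, not a DataStage API):

```python
# In-memory registry of saved connections (illustrative only)
saved_connections = {}

def save_data_connection(name, server, username, password, description=""):
    """Save a reusable set of mandatory connection properties under a name."""
    saved_connections[name] = {
        "server": server,
        "username": username,
        "password": password,
        "description": description,
    }

def load_data_connection(name):
    """Load the saved property values, e.g. into an Oracle Enterprise stage."""
    return saved_connections[name]

# Save the connection once...
save_data_connection("ora_dev", server="devdb01", username="etl_user",
                     password="secret", description="Dev Oracle connection")

# ...then reuse the same values wherever they are needed
props = load_data_connection("ora_dev")
print(props["server"])  # devdb01
```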

Transformer Stage using CompactWhiteSpaces

CompactWhiteSpace is a function used in the Transformer Stage. It reduces runs of whitespace in a string to single spaces. 

If we have a data as below 

e_id,e_name 
11,ja mes 
22,lea na 
33,lacky 
44,to m 
55,ve nu 
66,emy 
77,t iny 

Take Job design as 

Seq.File——Tx—-d.S 

Read and load the data in Sequential file 

In Transformer Stage, drag and drop the records to the output file and create 

one column as new_names. In the derivation, write the expression as below: 

CompactWhiteSpace(e_name) and click OK 

Give file name to the target file and 

compile and run the job to get the required output. 
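Outside DataStage, the same transformation can be sketched with a regular expression in Python (a stand-in for CompactWhiteSpace, not its actual implementation):

```python
import re

def compact_whitespace(s):
    # Collapse every run of whitespace into a single space,
    # mirroring what CompactWhiteSpace does in the derivation above
    return re.sub(r"\s+", " ", s)

names = ["ja  mes", "lea   na", "lacky", "to m"]
print([compact_whitespace(n) for n in names])
# ['ja mes', 'lea na', 'lacky', 'to m']
```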

DataSet in DataStage

Inside a InfoSphere DataStage parallel job, data is moved around in data sets. These carry meta data with them, both column definitions and information about the configuration that was in effect when the data set was created. If for example, you have a stage which limits execution to a subset of available nodes, and the data set was created by a stage using all nodes, InfoSphere DataStage can detect that the data will need repartitioning.
If required, data sets can be landed as persistent data sets, represented by a Data Set stage. This is the most efficient way of moving data between linked jobs. Persistent data sets are stored in a series of files linked by a control file (note that you should not attempt to manipulate these files using UNIX tools such as rm or mv; always use the tools provided with InfoSphere DataStage).
There are two groups of Datasets: persistent and virtual.
The first type, persistent Datasets, are marked with the *.ds extension, while the *.v extension is reserved for the second type, virtual Datasets. (It is worth mentioning that no *.v files may be visible in the Unix file system, since they exist only virtually, inhabiting RAM. The *.v extension itself is characteristic strictly of OSH, the Orchestrate scripting language.)
Further differences are much more significant. Primarily, persistent Datasets are stored in Unix files using an internal DataStage EE format, while virtual Datasets are never stored on disk: they exist within links, in EE format, but in RAM. Finally, persistent Datasets are readable and rewritable with the Data Set stage, while virtual Datasets can only be passed through in memory.
A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments.
Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single job. So a segment can contain files from many partitions, and a partition has files from many segments.
Firstly, as a single Dataset contains multiple records, it is obvious that all of them must undergo the same processes and modifications. In a word, all of them must go through the same successive stage.
Secondly, it should be expected that different Datasets usually have different schemas, therefore they cannot be treated commonly.
Alias names of Datasets are:

1) Orchestrate File
2) Operating System file

A Dataset consists of multiple files:
a) Descriptor file
b) Data file
c) Control file
d) Header files

In the descriptor file, we can see the schema details and the address of the data.
In the data file, we can see the data in native format.
The control and header files reside in the operating system.


Starting a Dataset Manager
Choose Tools > Data Set Management; a Browse Files dialog box appears:
  1. Navigate to the directory containing the data set you want to manage. By convention, data set files have the suffix .ds.
  2. Select the data set you want to manage and click OK. The Data Set Viewer appears. From here you can copy or delete the chosen data set. You can also view its schema (column definitions) or the data it contains.

How to Find Numerical Records in the Column

If we have a table as below 

all_num 
e_id,e_name 
11,ram 
22,tom 
33@#,asli 
44,king 
54%,remo 

Take Job Design as 

Seq.File——Tx——Filter——D.s 



Read and load the data in Sequential file stage. 

In Transformer stage, drag and drop all the columns we have and create one column as num_avl 

In the derivation, write the expression as 

AlNum(e_id) and click OK 

In Filter Stage—- In Where Clause 

num_avl=1 

That’s it Compile and Run the job 

You will get the required output. 

We get all the numerical records in the column. 
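The same logic can be sketched in Python (num_avl below mimics the role of the numeric-check derivation plus the Filter stage; it is an illustration, not DataStage code):

```python
import csv
from io import StringIO

# Sample data (all_num): e_id, e_name
data = """11,ram
22,tom
33@#,asli
44,king
54%,remo"""

def num_avl(value):
    # 1 if the value is purely numeric, otherwise 0
    return 1 if value.isdigit() else 0

# Filter stage equivalent: keep only rows where num_avl = 1
kept = [(e_id, e_name)
        for e_id, e_name in csv.reader(StringIO(data))
        if num_avl(e_id) == 1]
print(kept)  # [('11', 'ram'), ('22', 'tom'), ('44', 'king')]
```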

Finding changes made by a particular user in DataStage

  • Go to DataStage Designer
  • From the menu bar, choose Tools –> Advanced Find



Then select fields based upon your choice:

  • Name to find : * or part of the job name
  • Folder to search : select the particular folder you are looking in
  • Modified by user : enter the user name of the DataStage user
  • Date : select the date type



Then click Find.

It will give you the list of jobs modified by that particular USER.

Specifying C++ compiler settings for DataStage in Windows

This post shows how to specify the C++ compiler settings for DataStage in a Windows environment. Over the last few days I faced a lot of issues with this 😛 so I thought I needed to share it with you all.
For :  DataStage 8.x

1)            Simply download Visual C++ 2008 Express Edition with SP1 from the links below
2)            Extract the ISO files and click on Setup.exe.
3)            A window will open; in it, select “Visual C++ 2008 Express Edition” (3rd from the top)
4)            Install it
5)            Now open CMD and execute the commands below
a)            setx -m INCLUDE "%INCLUDE%;C:\Program Files (x86)\MKS Toolkit\include;C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include;C:\Program Files\Microsoft SDKs\Windows\v6.0A\Include"
b)            setx -m LIB "%LIB%;C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\lib;C:\Program Files\Microsoft SDKs\Windows\v6.0A\Lib;C:\Program Files (x86)\MKS Toolkit\lib;C:\Program Files (x86)\MKS Toolkit\usr\lib"
c)            setx -m PATH "%PATH%;C:\Program Files (x86)\MKS Toolkit\mksnt;C:\PROGRA~2\MKSTOO~1\bin64;C:\PROGRA~2\MKSTOO~1\bin;C:\PROGRA~2\MKSTOO~1\bin\X11;C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin;"
6)            Please change the path according to your Installation
7)            Now open DS Administrator, select Project > Properties > General > Environment and check these parameters.
Meaning: NO CHANGE to the default parameters.
8)            Save the settings and restart the computer.
9)            If all goes fine, C++ compiler Setting is done for your system.

FileSet in DataStage

DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. The data files themselves are stored as plain Unix files, possibly in diverse locations, and they are human-readable.
This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns. The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available.
The number of files created by a file set depends on:
·         The number of processing nodes in the default node pool
·         The number of disks in the export or default disk pool connected to each processing node in the default node pool
·         The size of the partitions of the data set
The File Set stage enables you to create and write to file sets, and to read data back from a file set.
Unlike data sets, file sets carry formatting information that describes the format of the files to be read or written.
File sets are similar to data sets in that they are:
1. Partitioned
2. Implemented with a header file and data files
File sets differ from data sets in that:
1. The data files of file sets are text files, and hence readable by other applications, whereas the data files of data sets are stored in a native internal format and are readable only by DataStage

DataStage Naming Conventions

DataStage Naming Conventions follows the guidelines of ETL Naming Conventions.
Contents
  • 1 Job Name Prefixes
  • 2 Stage Names
  • 3 Link Names
Job Name Prefixes
Job prefixes are optional but they help to quickly identify the type of job and can make job navigation and job reporting easier.
  • Parallel jobs – par
  • Server jobs – ser
  • Sequence jobs – seq
  • Batch jobs – bat
  • Mainframe jobs
Stage Names
The stage type prefix is used on all stage names so it appears on metadata reports that do not include a diagram of the stage or a description of the stage type. The name alone can be used to indicate the stage type.
Source and target stage names identify the name of the entity, such as a table name or a sequential file name. The stage name strips out any dynamic part of the name, such as a timestamp, as well as file extensions. 
  • Database stage – db_table name
  • Dataset – ds_datasetname
  • Hash file – hf_hashfilename
  • Sequential file stage – sf_filename
The prefix identifies the source type, the rest of the name indicates how to find that source outside of DataStage or how to refer to that source in another DataStage job.
Transformation stages 
  • Aggregation – AG_CalculatedContent (Prices, SalesAmounts, YTDPrices)
  • Changed Data Capture – CD
  • Funnel – FO_FunnelType (Continuous, round robin)
  • Lookup – LU
  • Pivot – PI
  • Remove Duplicates – RD
  • Sort – SO_SortFields
  • Transformer – TR_PrimaryFunction (HandleNulls, QA, Map)
Link Names
The link name describes what data is travelling down the link. Link names turn up in process metadata via the link count statistics so it is very important to use names that make process reporting user friendly.
Only some links in a job are important to project administrators. The link naming convention has two types of link names:
  • Links of importance have a five-letter prefix followed by a double underscore followed by link details.
  • Intermediate links have a link name without a double underscore.
Links of importance:
  • The first primary link in a job consists of SourceType(char2)pri(primary).
  • Any link from a reference source consists of SourceType(char2)ref(reference).
  • Any link loading to a target consists of TargetType(char2)UpdateAction(char3).
  • Any reject link consists of SourceType(char2)rej(reject).
Any project can add new links of importance, such as the output count of a remove duplicates or aggregation stage.
Example: dbpri__stockitem is the first link in a job. dbups__stockitem is the link loading to a target database table with an upsert option. dbref__orgcodes is a reference lookup of orgcodes against a database table. dbrej__stockitems is a reject of upserts to the stockitem table.
You can then produce a pivot report against the link row count statistics to show the row counts for a particular job using the five letter prefix as for each type of row count.
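For illustration, such a pivot over link row-count statistics might look like this in Python (the link names and counts below are invented for the example):

```python
from collections import defaultdict

# Hypothetical row counts collected from one job's links
link_counts = {
    "dbpri__stockitem": 1000,   # primary source read
    "dbref__orgcodes": 50,      # reference lookup
    "dbups__stockitem": 940,    # upsert load to the target
    "dbrej__stockitems": 60,    # rejected upserts
    "sorted_stockitem": 1000,   # intermediate link (no double underscore)
}

# Pivot by the five-letter prefix; intermediate links are left out
pivot = defaultdict(int)
for link, count in link_counts.items():
    if "__" in link:
        prefix = link.split("__", 1)[0]
        pivot[prefix] += count

print(dict(pivot))  # {'dbpri': 1000, 'dbref': 50, 'dbups': 940, 'dbrej': 60}
```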