Change Capture Stage

This is not to be confused with the companion product IBM InfoSphere (formerly WebSphere) Change Data Capture.

The Change Capture stage (also called the Change Data Capture or CDC stage) is used to capture the changes between an after data set and a before data set.

Take example data as below:

Change_after_data 
e_id,e_name,e_add 
11,kim,syd 
22,jim,canb 
33,pim,syd 
44,lim,canb 
55,pom,perth 

Change_before_data 

e_id,e_name,e_add 
11,kim,syd 
22,jim,mel 
33,pim,perth 
44,lim,canb 
55,pom,adelaide
66,shila,bris 

Take the job design as below.


Read and load the data into the two files as below.


In the Change Capture stage, select the key column as e_id
and the value column as e_add, because e_add is the column we are going to update.
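The comparison the stage performs can be sketched in plain Python. This is illustrative only, not the stage itself; the change_code values used (1 = insert, 2 = delete, 3 = edit) follow the stage's usual convention, with unchanged rows (copies, code 0) dropped by default.

```python
# Sketch of what Change Capture computes: compare the before and after
# rows on the key column (e_id) and the value column (e_add), and emit
# a change_code per differing row.
before = {11: "syd", 22: "mel", 33: "perth", 44: "canb", 55: "adelaide", 66: "bris"}
after  = {11: "syd", 22: "canb", 33: "syd", 44: "canb", 55: "perth"}

changes = []
for e_id, e_add in after.items():
    if e_id not in before:
        changes.append((e_id, e_add, 1))   # insert: key only in after data
    elif before[e_id] != e_add:
        changes.append((e_id, e_add, 3))   # edit: value column changed
for e_id, e_add in before.items():
    if e_id not in after:
        changes.append((e_id, e_add, 2))   # delete: key gone from after data

for row in sorted(changes):
    print(row)
```

With the sample data above, rows 22, 33, and 55 come out as edits and row 66 as a delete, which is exactly what the stage would flag.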

Failed to invoke GenRuntime using phantom process helper

When
When compiling a job

Causes and resolutions

A) The server's /tmp space was full. Clean up space in /tmp.
B) The job's status is incorrect. In DataStage Director, select Job -> Clear Status File.
C) Format problem with the project's uvodbc.config file. Confirm that uvodbc.config has the following entry/format:
[ODBC DATA SOURCES]

DBMSTYPE = UNIVERSE
network = TCP/IP
service = uvserver
host = 127.0.0.1

D) Corrupted DS_STAGETYPES file. Connect to the DataStage server, change directory to DSEngine, and source dsenv ( . ./dsenv):
$ bin/uvsh
>LOGTO projectname (case sensitive)
Set a file pointer RTEMP to the template DS_STAGETYPES file:
>SETFILE /Template/DS_STAGETYPES RTEMP
Check that all of the entries in the template DS_STAGETYPES file are present in the project's DS_STAGETYPES file:
>SELECT RTEMP
* this returns a count of records found in the template DS_STAGETYPES file
>COUNT DS_STAGETYPES
* this returns a count of records found in the project's DS_STAGETYPES file
* These numbers should be the same
If the numbers differ and some records are missing from the project's DS_STAGETYPES file:
>COPY FROM RTEMP TO DS_STAGETYPES ALL OVERWRITING
Exit the Universe shell:
>Q
E) Internal locks. Connect to the DataStage server, change directory to DSEngine, and source dsenv ( . ./dsenv).
Change directory to the project directory that has the job generating the error.
Execute the following, supplying the actual job name.
$ $DSHOME/bin/uvsh "DS.PLADMIN.CMD NOPROMPT CLEAR LOCKS "

Still unresolved? Try the following:

Turn on server side tracing, attempt to compile the problem job, turn off server side tracing, and gather the tracing information.

  1. Turn on server side by connecting to the server with the DataStage Administrator client.
  2. Highlight the project which has the problem job.
  3. Click on the Properties button.
  4. In the Properties window, click the Tracing tab
  5. Click on the Enabled check box
  6. Click the OK button
  7. With a new DataStage Designer connection, attempt to compile the job.
  8. With the DataStage Administrator client, go back into the project's properties.
  9. Select the Tracing tab
  10. Uncheck the Enabled check box
  11. For each DSRTRACE entry do the following:
    a) Highlight the entry and click View
    b) Highlight the contents of the display and click Copy
    c) Paste the copied information into Notepad
  12. Open a PMR with support and supply the Notepad information.

FTP Stage for VB file from Mainframe to Datastage

FTP Stage: 
———— 
Record Level>>Record Type=implicit 
Field defaults>>Delimiter =| <pipe> 
Type defaults>>General>>Character set=EBCDIC, Data format=binary 




Set Delimiter=none when there is no delimiter at the end of each record.


Specify all the input fields with their lengths; in my case I had to use SQL Type = Varchar with the field lengths. This works as long as the layout conforms to the LRECL setting.
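The character-set conversion this configuration asks for can be demonstrated in plain Python (illustrative only, not the FTP stage). The sample bytes below are "HELLO" in EBCDIC code page 037, which Python's standard codecs can decode:

```python
# A binary mainframe record arrives as EBCDIC bytes; with
# Character set = EBCDIC the stage converts them, much like
# decoding with Python's built-in "cp037" codec.
record = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])  # 'HELLO' in EBCDIC cp037
print(record.decode("cp037"))  # HELLO
```

Note that the exact code page in use depends on your mainframe; cp037 is just the common US/Canada EBCDIC variant.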

Host name invalid





81011 means that the host name is not valid or the server is not responding. 
Do an exact search on 81011 and you will find many discussions around it.

  • Verify that DataStage is up and running. You can either ping from the command line or go to the Control Panel and check whether the services are running.
  • Try specifying the IP address rather than the host name.
  • Check for any firewalls.
  • Try pinging your localhost from a Command window:
  • >ping localhost
  • If the server and client are on the same machine, try 127.0.0.1 or 'localhost'.
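The checks above can be automated with a small Python sketch (illustrative only, not a DataStage tool). The port 31538 used here is an assumption, the common default for the dsrpc service; verify the actual port in your server's /etc/services entry.

```python
import socket

def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Try the hostname first, then the loopback address -- if only the
# loopback works, name resolution (not the server) is the problem.
for host in ("localhost", "127.0.0.1"):
    print(host, is_reachable(host, 31538))  # 31538: assumed dsrpc port
```

If both fail, the engine itself is likely down or firewalled; if only the hostname fails, focus on DNS or the hosts file.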

Failed job – **** Parallel startup failed ****

Problem(Abstract)

A parallel DataStage job with configuration file setup to run multiple nodes on a single server fails with error:
Message: main_program: **** Parallel startup failed ****

Resolving the problem

The full text for this “parallel startup failed” error provides some additional information about possible causes:
    This is usually due to a configuration error, such as not having the Orchestrate install directory properly mounted on all nodes, rsh permissions not correctly set (via /etc/hosts.equiv or .rhosts), or running from a directory that is not mounted on all nodes. Look for error messages in the preceding output.

For the situation where a site is attempting to run multiple nodes on multiple server machines, the above statement is correct. More information on setting up ssh/rsh and parallel processing can be found in the following topics: 
Configuring remote and secure shells
Configuring a parallel processing environment

However, in the case where all nodes are running on a single server machine, the “Parallel startup failed” message is usually an indication that the fastname defined in the configuration file does not match the name output by the server’s “hostname” command. 

In a typical node configuration file, the server name where each node runs is indicated by the fastname, for example in /opt/IBM/InformationServer/Server/Configurations/default.apt:

{
  node "node1"
  {
    fastname "server1"
    pools ""
    resource disk "/opt/resource/node1/Datasets" {pools ""}
    resource scratchdisk "/opt/resource/node1/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "server1"
    pools ""
    resource disk "/opt/resource/node2/Datasets" {pools ""}
    resource scratchdisk "/opt/resource/node2/Scratch" {pools ""}
  }
}

Login to the DataStage server machine and at the operating system command prompt, enter command:
hostname

If the hostname output EXACTLY matches the fastname defined for local nodes, then the job will run correctly on that server. However, if the “hostname” command outputs the hostname in a different format (such as with domain name appended) then the names defined for fastname will be considered remote nodes and a failed attempt will be made to access the node via rsh/ssh.

Using the above example, if the hostname output was server1.mydomain.com then prior to the “Parallel startup failed” error in job log you will likely see the following error:

    Message: main_program: Accept timed out retries = 4
    server1: Connection refused

The above problem will occur even if your /etc/hosts file maps server1 and server1.mydomain.com to the same address since it is not the inability to resolve either address that causes this issue, but rather that the fastname in node configuration file does not exactly match the system hostname (or value of APT_PM_CONDUCTOR_NODE if defined).

You have several options to deal with this situation:

  • Change the fastname for the nodes in the configuration file to exactly match the output of the hostname command.
  • Set APT_PM_CONDUCTOR_NODE to the same value as fastname. This would need to be defined either in every project or in every job.
  • Do NOT change the hostname of the server to match the fastname. Information Server / DataStage stores some information based on the current hostname; if you change the hostname after installation, you will need to contact the support team for additional instructions to allow DataStage to work correctly with the new hostname.
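The fastname-versus-hostname check described above can be sketched in Python (an illustration, not a supported utility; the inline config text stands in for your actual APT configuration file):

```python
import re
import socket

def fastnames(config_text):
    """Return the set of fastname values found in an APT config file."""
    return set(re.findall(r'fastname\s+"([^"]+)"', config_text))

# Stand-in for the contents of default.apt -- read your real file instead.
config = '''
{
  node "node1" { fastname "server1" pools "" }
  node "node2" { fastname "server1" pools "" }
}
'''

local = socket.gethostname()
for name in sorted(fastnames(config)):
    status = "local" if name == local else "treated as REMOTE (rsh/ssh)"
    print(f"{name}: {status}")
```

Any fastname that does not exactly match the hostname output is treated as a remote node, which is precisely how the "Parallel startup failed" situation arises on a single server.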

Datastage Hostname change

Changing the hostname of a running system is not easy.


If you need to change the server name where the complete IIS 8 suite of products (DataStage, QualityStage, Information Analyzer, etc.) is installed,


a change of the hostname in /etc/hosts alone will not work. You must modify the Information Services Framework configuration, the XMETA repository, and the WebSphere Application Server configuration to recognize the new host. A wrong modification to any one of these configurations can easily cost more than a week of debugging.


The better choice, and one that surely works, is to reinstall the product.





How to find the Alphabetic Records in the column

To find the alphabetic records in a column, we have the Alpha function.


Using the Alpha function, we can find how many purely alphabetic records we have in the required column.


Take the job design as below:


Seq.File—– Tx—– Filter—— D.s 




If we have a table as below 

alpha_table


e_id,e_name 
11,james 
22,leana121 
33,lacky11 
44,tom 
55,venu 
66,emy211@ 
77,tiny@ 

Read and Load the data in the Sequential file 


In the Transformer stage, drag and drop the records to the output file, and create one new column named new_names.


In its derivation, write alpha(e_name) and click OK.


In the Filter stage, in the Where clause, write new_names=1.


And don't forget to drag and drop the records into the output.


That's it; click OK.


Compile and Run the Job to get required Output. 
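The Transformer-plus-Filter logic above can be mimicked in plain Python (a sketch, not DataStage; `alpha` here is a stand-in for the BASIC Alpha() function, returning 1 for purely alphabetic strings and 0 otherwise):

```python
# Sample rows from alpha_table: (e_id, e_name).
rows = [
    (11, "james"), (22, "leana121"), (33, "lacky11"),
    (44, "tom"), (55, "venu"), (66, "emy211@"), (77, "tiny@"),
]

def alpha(value):
    """1 if value contains only alphabetic characters, else 0."""
    return 1 if value.isalpha() else 0

# Transformer derives new_names = alpha(e_name); Filter keeps new_names = 1.
kept = [(e_id, e_name) for e_id, e_name in rows if alpha(e_name) == 1]
print(kept)
```

On the sample data, only james, tom, and venu survive the filter, matching what the job produces.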

ETL disadvantages

  • Data from different sources can be inconsistent, poorly documented, and dirty. 
  • Standardization
    • Name and address standardization is not very easy
    • Specialized software for standardization – specific to geography
  • Perfecting the data is costly

ETL vs ELT – What and Why?

Extract

– The process by which data is extracted from the data source

Transform

– The transformation of the source data into a format relevant to the solution

Load

– The loading of data into the warehouse

ETL strengths:
  • Only relevant data is stored
  • Potentially reduces development time and processing overhead

ELT strengths:
  • Easier project management, as the project can be broken down into smaller chunks
  • Utilizes existing database engines, reducing hardware costs
  • Future-proof, as all data from the sources is available

ETL weaknesses:
  • Leaving out irrelevant data may require more changes (costs) at later times
  • Additional hardware and skills investment to run 3rd-party ETL software

ELT weaknesses:
  • Not a matured concept, so there is a learning curve to deal with
  • Limited availability of 3rd-party tools; database-provided engines or ETL tools have to be reutilized

Most current warehouse systems use the ETL methodology. Some emphasis has been placed on switching to ELT to save development costs, time, and risk. Many organizations have recognized the benefits that can be gained from ELT; in spite of having already deployed warehouse solutions that use ETL, they have embarked upon re-engineering those solutions with ELT through the use of effective tools.

Whilst data warehouse development traditionalists will undoubtedly pour scorn on the ELT approach, the clear and obvious benefits cannot afford to be ignored.

 

Sort Stage Basics

Like in many data-related processes, sort plays an important part in ETL transformations using DataStage. In fact, sorted data is a prerequisite for a number of stages such as Remove Duplicates, Join, etc. There are two ways in which you can sort your data in DataStage. One way is to do inline sorts in any stage that you want. For example, if you want your data sorted before using it in your Transformer, you simply have to go to the Input -> Partitioning tab and check the 'Perform sort' option. Even though sort is achievable this way, in some cases people specifically go for the Sort stage due to the flexibility, power, and variety of options it offers.
You can find the Sort stage in the Processing tab of the palette section. In order to perform the sort you must provide the keys on which you want the sort performed. If you want to perform the sort sequentially in a parallel job, then this must be specified by setting the Sort stage to run in sequential mode rather than parallel mode. This option can be set in the Advanced tab of the Sort stage. The choice of running the sort in sequential or parallel mode is entirely up to you, based on your business requirement.
Where the sort is carried out is entirely dependent on how you have configured your APT_CONFIG_FILE. If you have set up a separate pool for the sort operation, then sort will only use the scratch disks specified in the sort pool. If such a sort pool is not specified, then the scratch disks in the default pool are used for sorting. As a last option it will use the temporary disk space specified by the TMPDIR environment variable, and if that is not set, then the /tmp folder is used.
Since most of you will know how to configure the Sort stage for basic use, I've decided to explain certain options in the Sort stage that might be useful when you design your job.
Sort Key mode
This option is present under each sort key. It is important mainly because it allows us to use the Sort stage efficiently, so that we have an optimally running sort operation. By default the value is set to 'Sort'. The other two options are 'Don't Sort (Previously Sorted)' and 'Don't Sort (Previously Grouped)'. If, out of the set of columns we are sorting on, we already know that the data is sorted on a particular column, then we can set that key's sort key mode to 'Don't Sort (Previously Sorted)', which indicates that the input records have already been sorted on that column prior to this stage. This ensures that we won't be doing any unnecessary sorts, which thereby increases your job's performance.
Supposing you have data as below
A                             B                             C
1                              127                         AAA
1                              124                         BBB
1                              156                         CCC
2                              09                           DDD
2                              540                         EEE
Now, if we specify the sorting keys as A and B, and if you are only interested in seeing the data within each group (specified by column A in this case) suitably sorted, then it makes sense to set the sort key mode for key 'A' to 'Don't Sort (Previously Grouped)' to indicate that the input records are already grouped by this particular column, but not sorted. This makes the Sort stage sort only on column B.
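The saving this buys can be sketched in plain Python (illustrative only, not the tsort operator's implementation): because the data is already grouped on A, each contiguous group only needs sorting on B.

```python
from itertools import groupby

# Data already grouped (but not sorted) on column A, as in the table above.
rows = [
    (1, 127, "AAA"), (1, 124, "BBB"), (1, 156, "CCC"),
    (2, 9, "DDD"), (2, 540, "EEE"),
]

result = []
for _, group in groupby(rows, key=lambda r: r[0]):   # groups are contiguous
    result.extend(sorted(group, key=lambda r: r[1]))  # sort each group on B only

print(result)
```

Sorting many small groups independently is cheaper than one full sort over both keys, which is exactly why the 'Don't Sort (Previously Grouped)' mode helps performance.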
Sort Utility
There are two different types of sorts which you can use in Datastage. They are mentioned below
–          DataStage – This is the default sort used by DataStage and uses the tsort operator. It does not require any other application to work.
–          UNIX – As the name indicates, this carries out the sort operation using the UNIX sort command. For this, the psort operator is used by DataStage.
Cluster Key change column and create key change column
There will be requirements in your design in which you will want to know the first record in a group (specified by the sort keys) in your dataset. This can be easily achieved by setting the key change option to true. Even though both of the above options do the same thing, they differ in that they have to be used based on the sort key mode specified: if you have used a 'Don't Sort' sort key mode for the keys during sorting, then you will have to use the 'Cluster Key Change Column' option; if you have used the default sort key mode, then you need only use the 'Create Key Change Column' option. Setting these options ensures that an extra column is added to your output records, and this column contains a value of 1 if the particular record is the first record in its group (where a group is indicated by a change in the sort key value). All following records in that group will have a value of 0.
eg. If the data shown above was sorted on the key A and the key change option was set then the output data would look like this
A                             B                             C               ClusterKeyChange
1                              127                         AAA          1
1                              124                         BBB          0
1                              156                         CCC          0
2                              09                           DDD         1
2                              540                         EEE         0
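The flag shown in that last column can be sketched in a few lines of Python (an illustration of the logic, not the stage's code; input is assumed already sorted on key A):

```python
rows = [
    (1, 127, "AAA"), (1, 124, "BBB"), (1, 156, "CCC"),
    (2, 9, "DDD"), (2, 540, "EEE"),
]

flagged = []
prev_key = object()  # sentinel that matches no real key value
for a, b, c in rows:
    key_change = 1 if a != prev_key else 0  # 1 on the first record of each group
    flagged.append((a, b, c, key_change))
    prev_key = a

for row in flagged:
    print(row)
```

Downstream stages can then filter on the flag column (e.g. keep only rows where it is 1) to pick out the first record of each group.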
Restrict memory usage
By default the value is 20 MB. This restricts the sort stage to carrying out its operation within a certain amount of virtual memory on the node on which it is being processed. You can increase this to increase your sort rate. However, you should remember that the size should never be more than the amount of physical memory actually available on the processing node.