parallel | Datastage

May 10, 2014

Difference between Hash and Modulus partitioning

Hash and Modulus techniques are Key based partition techniques.

Hash and Modulus techniques are used for different purpose.

If Key column data type is textual then we use has partition technique for the job.

If Key column data type is numeric, we use modulus partition technique.

If one key column numeric and another text then also we use has partition technique.

if both the key columns are numeric data type then we use modulus partition technique.

September 24, 2013

Slowly Changing Dimension/SCD in datastage

To understand what is Slowly Changing Dimension, we first understand these:

Most frequently, we insert into Fact tables . Very infrequently we update the facts that were loaded incorrectly. It is even less likely to delete rows from the fact table; the only time this might happen is if you wanted to archive the old data that is no longer relevant for the current portion of the data warehouse.

Dimension tables, on the other hand are likely to see frequent updates. A classic example of this is a consumer dimension. People can change names, phone numbers, addresses, etc regularly.

Dimensions that are changing over time are referred to as slowly changing dimensions (SCD).

Slowly Changing Dimension

Type 1: override the existing value

Customer_key	Customer_name	Customer_city	Customer_state
123	Ms. Brown	Houston	Texas
123	Ms. Brown	Dallas	Texas

Slowly Changing Dimension – SCD Type 1

Type 2: create a new record with effective dates

Customer_key	Customer_name	Customer_city	Customer_state	Effective_date	Obsolete_date
123	Ms.Brown	Houston	Texas	1/1/2000	1/1/2005
234	Ms. Brown	Dallas	Texas	1/1/2005	NULL

Slowly Changing Dimension – SCD Type 3

Type 3: create a new and old columns, no new rows

Customer_key	Customer_name	Customer_city	Customer_state	Change_date	New_city
123	Ms. Brown	Houston	Texas	1/1/2005	Dallas

Slowly Changing Dimension – SCD Type 3

September 29, 2012

Change Capture Stage

This is not to be confused with the companion product IIS Websphere Change Data Capture

Change Capture Stage or Change Data Capture stage or CDC is used to capture the changes between the after_Data and before_data.

Change Capture Stage

Take Example Data as below

Change_after_data
e_id,e_name,e_add
11,kim,syd
22,jim,canb
33,pim,syd
44,lim,canb
55,pom,perth

Change_before_data

e_id,e_name,e_add
11,kim,syd
22,jim,mel
33,pim,perth
44,lim,canb
55,pom,adeliade
66,shila,bris

Take Job Design as below

Read and load the data in two files as below

In Change Capture Stage Select Key column as e_id
and value column as e_add. Because e_add is the column we are going to update.

June 26, 2012

Failed job – Parallel startup failed

Problem(Abstract)

A parallel DataStage job with configuration file setup to run multiple nodes on a single server fails with error:
Message: main_program: **** Parallel startup failed ****

Resolving the problem

The full text for this “parallel startup failed” error provides some additional information about possible causes:

This is usually due to a configuration error, such as not having the Orchestrate install directory properly mounted on all nodes, rsh permissions not correctly set (via /etc/hosts.equiv or .rhosts), or running from a directory that is not mounted on all nodes. Look for error messages in the preceding output.

For the situation where a site is attempting to run multiple nodes on multiple server machines, the above statement is correct. More information on setting up ssh/rsh and parallel processing can be found in the following topics:
Configuring remote and secure shells
Configuring a parallel processing environment

However, in the case where all nodes are running on a single server machine, the “Parallel startup failed” message is usually an indication that the fastname defined in the configuration file does not match the name output by the server’s “hostname” command.

In a typical node configuration file, the server name where each node runs is indicated by the fastname, i.e., /opt/IBM/InformationServer/Server/Configurations/default.apt:

{ node "node1" { fastname "server1" pools "" resource disk "/opt/resource/node1/Datasets" {pools ""}
resource scratchdisk "/opt/resource/node1/Scratch" {pools ""} } node "node2" { fastname "server1"
pools "" resource disk "/opt/resource/node2/Datasets" {pools ""} resource scratchdisk "/opt/resource/node2/Scratch" {pools ""} } }

If the hostname output EXACTLY matches the fastname defined for local nodes, then the job will run correctly on that server. However, if the “hostname” command outputs the hostname in a different format (such as with domain name appended) then the names defined for fastname will be considered remote nodes and a failed attempt will be made to access the node via rsh/ssh.

Using the above example, if the hostname output was server1.mydomain.com then prior to the “Parallel startup failed” error in job log you will likely see the following error:

The above problem will occur even if your /etc/hosts file maps server1 and server1.mydomain.com to the same address since it is not the inability to resolve either address that causes this issue, but rather that the fastname in node configuration file does not exactly match the system hostname (or value of APT_PM_CONDUCTOR_NODE if defined).

You have several options to deal with this situation:

change fastname for nodes in configuration file to exactly match the output of hostname command.
set APT_PM_CONDUCTOR_NODE to the same value as fastname. This would need to be defined either in every project or every job.
You should NOT change the hostname of server to match fastname. Information Server / DataStage stores some information based on the current hostname. If you change the hostname after installation of Information Server / DataStage, then you will need to contact support team for additional instructions to allow DataStage to work correctly with the new hostname.

Source: http://www-01.ibm.com/support/docview.wss?uid=swg21434065

June 21, 2012

Sort Stage Basics

Like in many data related processes, sort plays an important part in etl transformations using Datastage. In fact sorted data is a pre-requisite for a number of stages like remove-duplicate, join, etc.. There are two ways in which you can sort your data in Datastage. One way is to do inline sorts in any stage that you want. For example if you want your data sorted before using it in your transformer you simply have to go to the input -> partitioning tab and check the ‘PERFORM SORT’ option. Even though sort is achievable in this way in some cases people specifically go for the sort stage due to the flexibility, power and variety of options it offers.

You can find the sort stage in the processing tab of the palettes section. In order to perform the sort you must provide the keys on which you want the sort performed. If you want to perform the sort sequentially in a parallel job then this must be specified by setting the sort stage to run in sequential mode rather than the parallel mode. This option can be set in the advanced tab of the sort stage as shown below. The choice of running the sort in sequential or parallel mode is entirely up to you based on your business requirement.

The choice on where to carry out the sort is entirely dependant on how you have configured your APT_CONFIG_FILE. If you have set a separate pool to carry out the sort operation then sort will only use the scratch disk specified in the sort pool. If such a sort pool is not specified then the scratch disks in the default pool is used for sorting. As a last option it will use the temporary disk spaces specified by the TMPDIR env variable and if thats not there then the /tmp folder is used.

Since most of you will be knowing how to configure the sort stage for a basic use I’ve decided to explain certain options in the Sort stage that might be useful when you design your job.

Sort Key mode

This option will be present under each sort key. This option is important mainly because it allows us to use the sort stage in an efficient manner such that we have an optimally running sort operation. By default the value is set to ‘Sort’. The other two options are ‘Don’t Sort (Previously Sorted)’ and ‘Don’t Sort (Previously Grouped)’. If out of the set of columns we are sorting on if we already know that the data is sorted on a particular column then we can set the sort key mode option to Don’t Sort (Previously Sorted) which will indicate the input records have already been sorted prior to this stage using that particular column. This ensures that we won’t be doing any unnecessary sorts, which thereby increases your job performance.

Supposing you have data as below

A B C

1 127 AAA

1 124 BBB

1 156 CCC

2 09 DDD

2 540 EEE

Now if we specify the sorting keys as A and B and if you are only interested in seeing data in a group(specified by column A in this case) suitably sorted then it makes sense to set the sort key mode for key ‘A’ to Set to Don’t Sort (Previously Grouped) to indicate that input records are already grouped by this particular column, but not sorted. This will make the sort stage only sort on column B.

Sort Utility

There are two different types of sorts which you can use in Datastage. They are mentioned below

– Datastage – This is the default sort used by Datastage and uses the tsort operator. It does not require any other application to work

– UNIX – As the name indicates this carries out the sort operation using the Unix sort command. For this the psort operator will be used by Datastage.

Cluster Key change column and create key change column

There will be requirements in you design in which you will want to know the first record in a group(specified by the sort keys) from the data in your dataset. This can be easily achieved by set the key change option to true. Even though both of the above options do the same thing they are different in the sense that they will have to be used based on the sort key mode specified. If you have used the sort key mode as Don’t sort for the keys during sorting then you will have to use the ‘Cluster Key Change column’ option. If you have used the default sort key mode then you need only use the ‘create key change column’ option. Setting these options will ensure that an extra column is added to your output records and this column will contain a value of 1 if the particular record in that group (where group is indicated by the change in sort key value) is the first record. All following records in that group will have a value of 0.

eg. If the data shown above was sorted on the key A and the key change option was set then the output data would look like this

A B C ClusterKeyChange

1 127 AAA 1

1 124 BBB 0

1 156 CCC 0

2 09 DDD 1

2 540 EEE 0

Restrict memory usage

By default the value is 20MB. This restricts the sort stage to carry outs its operation to a certain amount of virtual memory on the node on which it is being processed. You can increase this to increase your sort rate. However you should remember that the size should never be more than the amount of actual physical memory that is actually available on the processing node.

June 17, 2012

Datastage – Problem when running large parallel jobs

If you have large parallel jobs (8.1) running in Windows environment then it is worth checking the servers event log. You could find an error related to Nutcracker.exe. An administrator with access right on the server can see the events logs.

Nutcracker error could come up while running very heavy jobs where total cpu utilization is upto or above 90% and server starts using virtual memory. This is just an example of the impact of running large parallel jobs in Windows server. Hence to make the system somewhat full proof following changes can be done in Windows registry. The idea is to tune the TCP/IP buffer, increase the heap size and set the virtual memory to the recommended amount.

To tune TCP/IP settings go to registry editor and check if MaxUserPort and TcpNumConnections keys are present in the TCP/IP buffer and connection registry. Generally they will not be there. TcpNumConnections settings will impact on number of Osh processes.

To change the settings
Go to registry editor
Go to HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Services > Tcpip > Parameters.
Now on the right pane modify the values:
Value Name: MaxUserPort
Value Type: DWORD
Value data: 65534
More information on MaxUserPort click here.

Value Name: TcpNumConnections
Value Type: DWORD
Value data: 65534
More information on TcpNumConnections click here.

To change the heap size
Go to registry editor and change the non interactive desktop heap size.
Go to HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Control > Session Manager > SubSystems.
Click on windows on the right pane.
Now edit the ShareSection parameter in the value box
Change the last value from 512 to 1024.

Now what is Desktop heap and why we are changing it?

In windows anything you create as a desktop object has got some amount of memory attached to it which is known as single desktop heap. The desktop heap is responsible for storing user interface objects, for example menus (Start for example) and windows (any new Window). When we run any application which needs to create these kind of interface object it needs space (in general terms, however it does have a process by which it allocates the area.) which goes from Desktop heap. In normal scenarios no one bother about how much Desktop heap has been configured in the system. The interesting thing here is if the application does not require any kind of interface objects it will not allocate the area from heap. The value which we are changing here is related to non interactive desktop which is related to services. So we are allowing more heap size in our environment so that while running services it should not exhaust. If you need more information on Windows heap please click here.

Now the last bit is changing the virtual memory configuration.

Virtual memory can be configured by right clicking on “My Computer” and going to Advance tab’s performance section. Now change the virtual memory to the recommended size. For example if you have 8GB of RAM then recommended size would be 12 GB.

However all above setting can overcome issues related to large jobs but patches recommended for Nutcracker issue should also be installed.

June 17, 2012

Configuring Datastage 8.5 in Windows

For more information – http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=/com.ibm.swg.im.iis.productization.iisinfsv.install.doc/topics/wsisinst_config_pe_win.html

If the parallel engine is installed on a computer that runs Microsoft Windows Server, Standard or Enterprise edition, you can configure the parallel engine for Microsoft Windows Server, 32-bit or 64 bit editions, by editing the Windows Registry and by changing system settings that control available memory. The Windows Registry includes keys that control desktop heap size, swap area size, the number of available TCP connections, and memory management.

1. Before Starting Configuration

Log in to Windows Server as the Administrator.
Start the Registry Editor, In Window Server 2008, open the DOS prompt as Administrator (right click on the dos prompt and select “Run as administrator”. From the DOS prompt, enter regedit

C:Windowssystem32>regedit

Back up the Windows Registry. If you plan to make several changes (for example, to configure the parallel engine), create a single backup before you make all the changes. To backup the registry, From the Registry Editor, select File -> Export and save the registry to a file.

2. Configuring the Windows Registry: Enabling auto-tuning for paged pool size

Enable auto-tuning for the paged pool size. Enabling auto-tuning ensures that sufficient memory is available for paged pools.

The Windows Server kernel allocates memory in pools. These pools are known as the paged pool and the non-paged pool. Performance degradation and server instability might result if the memory for these pools is exhausted. To avoid this situation, you can enable auto-tuning at server startup by editing the PagedPoolSize registry value in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerMemory Management registry subkey.

2.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerMemory Management

(2.) in the right pane of the Registry Editor, right-click PagedPoolSize, and click Modify.
(3.) in the Base area, select Hexadecimal.
(4.) In the Value data field, specify 0 (zero). If 0 is already specified, auto-tuning is enabled, and you do not need to change this value. If you must specify a value other than 0, the value data for SystemPages must be 0.
(5.) Click OK.

3. Configuring the Windows Registry: Enabling auto-tuning for system page size

Enable auto-tuning for the system page size. Enabling auto-tuning ensures that sufficient memory is available for page tables and page table entries.

To support virtual to physical address translation, the Windows operating system uses page tables and page table entries (PTEs). These tables reside in kernel memory. If not enough memory is allocated for the page tables, Windows might fail to create processes, threads, and I/O buffers. Because the parallel engine creates many processes to run a job, jobs will fail at startup or during run time if Windows does not have enough resources to create processes.

You can enable auto-tuning at server startup by editing the SystemPages registry value in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerMemory Management registry subkey.

3.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerMemory Management

(2.) In the right pane of the Registry Editor, right-click SystemPages, and click Modify.
(3.) In the Base area, select Hexadecimal.
(4.) In the Value data field, specify 0 (zero). Typically, specifying 0 for this value provides sufficient memory. If the available PTEs are exhausted, however, you can specify the maximum value, 0XFFFFFFFF. If you must specify a value other than 0, the value data for PagedPoolSize must be 0.
(5.) Click OK.

4 Configuring the Windows registry: Setting the threshold for de-committing memory

Set the threshold for de-committing memory. On computers that have 1 GB or more of RAM, setting the threshold to the suggested value improves memory management.

When Windows frees memory at a specific address, the memory can remain committed, or it can be de-committed and marked as reserved. You can improve memory management on computers that have 1 GB or more of RAM by editing the HeapDeCommitFreeBlockThreshold registry value in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession Manager registry subkey. This registry value specifies the number of contiguous bytes of memory (around the address of the freed memory) above which the block of memory is de-committed.

4.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession Manager

(2.) Specify the value data for HeapDeCommitFreeBlockThreshold registry value:
(3.) If the registry value does not exist, click Edit > New > DWORD Value, and type the name of the registry value.
(4.) In the right pane of the Registry Editor, right-click the registry value, and click Modify.
(5.) In the Base area, select Hexadecimal.
(6.) In the Value data field, specify 0x000400.
(7.) Click OK.

5 Configuring the Windows registry: Increasing the noninteractive desktop heap size

Increase the noninteractive desktop heap size to ensure that a sufficient number of processes can be created and run concurrently.

DataStage® processes are created in a specific desktop heap. Each process consumes a small amount of memory from the desktop heap. If you expect to run many InfoSphere DataStage jobs concurrently, increase the size of the noninteractive desktop heap to allow more processes to be created and run concurrently.

You increase the size of the noninteractive desktop heap by editing the SharedSection parameter string in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerSubSystems registry subkey. The SharedSection parameter string is located in the Windows registry value and uses the following format to specify heap size:

SharedSection=xxxx,yyyy,zzzz

Where:

· xxxx specifies the maximum size of the system-wide heap (in kilobytes)
· yyyy specifies the size of each desktop heap
· zzzz specifies the size of the desktop heap that is associated with a noninteractive Windows instance

5.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerSubSystems

(2.) In the right pane of the Registry Editor, right-click the Windows registry value, and click Modify.

(3.) In the Value data field, locate the SharedSection parameter string, and change the last SharedSection parameter from 512 to 2048. For example, specify the following values for this parameter:

SharedSection=1024,3072,2048

If you cannot change the size of the desktop heap to 2048, try setting the value to 1024. For example, specify the following values for this parameter:

SharedSection=1024,3072,1024
(4.) Click OK.

6. Configuring the Windows registry: Specifying TCP/IP settings

Specify settings for TCP/IP in the Windows Registry to ensure that the required number of parallel processes can run concurrently.

You specify TCP/IP settings by editing the following registry values in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParameters registry subkey:

MaxUserPort

Specifies the maximum port number for ephemeral TCP ports.

TcpNumConnections

Specifies the maximum number of concurrent connections that TCP can open. This value significantly affects the number of concurrent osh.exe processes that are allowed. If the value for TcpNumConnections is too low, Windows cannot assign TCP ports to stages in parallel jobs, and the parallel jobs cannot run.

These keys are not added to the registry by default.

6.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParameters

(2.) Specify the following registry values: MaxUserPort and TcpNumConnections.

a. If the registry value does not exist, click Edit > New > DWORD Value, and type the name of the registry value.
b. In the right pane of the Registry Editor, right-click the registry value, and click Modify.
c. In the Base area, select Hexadecimal.
d. In the Value data field, specify the following values:
    Registry value name                         Value data                          Valid range
     MaxUserPort                                      65534                                5000 – 65534 (decimal)
    TcpNumConnections                        65534                                 0 – 0xfffffe (65534 decimal)
e. Click OK.

7 Configuring Windows: Changing swap area size

Change the size of the swap area to improve performance.

7.1 Procedure

(1.) Click Start > Control Panel > System.
(2.) In the System Properties window, click the Advanced tab.
(3.) In the Performance area, click Settings.
(4.) In the Performance Options window, select the Advanced tab.
(5.) In the Virtual memory area, click Change.
(6.) In the Virtual Memory window, specify a value for Maximum size. Set this value to be one to one- and-a-half times larger than the value for physical memory.
(7.) Click Set.
(8.) Click OK three times, and then close the Control Panel window.

8. Configuring Windows: Enable Physical Address Extension (PAE)

Enable /PAE switch. Edit the boot.ini file to remove the /3GB switch and, if necessary, add the /PAE switch.

By default, Windows reserves 2 GB of memory for the kernel and 2 GB of memory for user mode programs. When the /3GB switch is specified in the boot.ini file, 3 GB of memory is reserved for user mode programs, and the amount of kernel memory is reduced by 1 GB. This setting results in an insufficient amount of memory for the kernel. To resolve this problem, remove the /3GB switch, if present from boot.ini file(for Windows Server 2000 or 2003).

In addition, on servers that have more than 4 GB of physical memory installed, specify the /PAE switch in the boot.ini file. For Windows 2008 server, use the BCDEDIT.exe tool (see section 11.8.1 below). The /PAE switch specifies physical address extensions that allow a 32-bit processor to access memory that is above the 4 GB limit.

8.1 Procedure

(1.) Click Start > Control Panel > System.
(2.) In the System Properties window, click the Advanced tab.
(3.) In the Startup and Recovery area, click Settings.
(4.) Click Edit to edit the boot.ini file.
(5.) Remove the /3GB switch, if present.
(6.) On servers that have more than 4 GB of physical memory installed, add the /PAE switch to the boot.ini file.
(7.) Save and close the boot.ini file.
(8.) Click OK twice, and then close the Control Panel window.

Note: There is no boot.ini in Windows Server 2008. To enable PAE in Windows 2008 server, use the Boot Configuration Data Editor, BCDEDIT.EXE, which is located in c:windowssystem32 and follow the following steps:

(1.) Open Command Prompt with elevated privileges (Run as Administrator)
(2.) Execute the following command

BCDEDIT /SET PAE ForceEnabl

(3.) To verify, run BCDEDIT without any switch:

BCDEDIT

You should see “PAE ForceEnabl” at the end of the screen.

9 Restart Windows Server

After you make changes to the Windows Registry, restart Windows to apply your changes.