Datastage String Functions

Function Usage
AlNum Cab be used to check if the given string has alphanumeric characters
Alpha TRUE if string is completely alphabetic
CompactWhiteSpace all consective whitespace will be reduced to single space
Compare Compares two strings for sort
ComparNoCase Compare two strings irrespective of Case in-sensitiveness
ComparNum Compare the first n characters of the two strings
CompareNumNoCase Compare first n characters of the two strings irrespective of case in-sensitiveness
Convert Replace character in a string with the given character.
Count Count number of times a given substring occurs in a string
Dcount Returns count of delimited fields in a string
DownCase Change all uppercase letters in a string to lowercase
DQuote Enclose a string in double quotation marks
Field Return 1 or more delimited substrings
Index Find starting character position of substring
Left Finds leftmost n characters of string
Len Length of the string or total number of characters in a string
Num Return 1 if string can be converted to a number
PadString Return the string padded with the optional pad character and optional
length
Right Finds Rightmost n characters of string
Soundex Returns a string which identifies a set of words that are
phonetically similar
Space Return a string of N space characters
Squote Covers a string into single quotation marks
Str Repeat a string
StripWhiteSpace Return the string after removing all whitespace
Trim Remove all leading and trailing spaces and tabs. Also reduce the internal occurrences of spaces and tabs into one.
TrimB Remove all trailing spaces and tabs
TrimF Remove all leading spaces and tabs
Trim Returns a string with leading and trailing whitespace removed
Upcase Change all lowercase letters in a string to uppercase

Datastage – Problem when running large parallel jobs

If you have large parallel jobs (8.1) running in Windows environment then it is worth checking the servers event log. You could find an error related to Nutcracker.exe. An administrator with access right on the server can see the events logs. 

Nutcracker error could come up while running very heavy jobs where total cpu utilization is upto or above 90% and server starts using virtual memory. This is just an example of the impact of running large parallel jobs in Windows server. Hence to make the system somewhat full proof following changes can be done in Windows registry. The idea is to tune the TCP/IP buffer, increase the heap size and set the virtual memory to the recommended amount.

To tune TCP/IP settings go to registry editor and check if MaxUserPort and TcpNumConnections keys are present in the TCP/IP buffer and connection registry. Generally they will not be there. TcpNumConnections settings will impact on number of Osh processes.

To change the settings 
Go to registry editor 
Go to HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Services > Tcpip > Parameters.
Now on the right pane modify the values: 
Value Name: MaxUserPort
Value Type: DWORD
Value data: 65534
More information on MaxUserPort click here.

Value Name: TcpNumConnections 
Value Type: DWORD
Value data: 65534
More information on TcpNumConnections click here.

To change the heap size 
Go to registry editor and change the non interactive desktop heap size. 
Go to HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Control > Session Manager > SubSystems.
Click on windows on the right pane. 
Now edit the ShareSection parameter in the value box 
Change the last value from 512 to 1024.

Now what is Desktop heap and why we are changing it? 

In windows anything you create as a desktop object has got some amount of memory attached to it which is known as single desktop heap. The desktop heap is responsible for storing user interface objects, for example menus (Start for example) and windows (any new Window). When we run any application which needs to create these kind of interface object it needs space (in general terms, however it does have a process by which it allocates the area.) which goes from Desktop heap. In normal scenarios no one bother about how much Desktop heap has been configured in the system. The interesting thing here is if the application does not require any kind of interface objects it will not allocate the area from heap. The value which we are changing here is related to non interactive desktop which is related to services. So we are allowing more heap size in our environment so that while running services it should not exhaust. If you need more information on Windows heap please click here.

Now the last bit is changing the virtual memory configuration.

Virtual memory can be configured by right clicking on “My Computer” and going to Advance tab’s performance section. Now change the virtual memory to the recommended size. For example if you have 8GB of RAM then recommended size would be 12 GB.

However all above setting can overcome issues related to large jobs but patches recommended for Nutcracker issue should also be installed.

Configuring Datastage 8.5 in Windows

For more information – http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=/com.ibm.swg.im.iis.productization.iisinfsv.install.doc/topics/wsisinst_config_pe_win.html

If the parallel engine is installed on a computer that runs Microsoft Windows Server, Standard or Enterprise edition, you can configure the parallel engine for Microsoft Windows Server, 32-bit or 64 bit editions, by editing the Windows Registry and by changing system settings that control available memory. The Windows Registry includes keys that control desktop heap size, swap area size, the number of available TCP connections, and memory management.

1. Before Starting Configuration

  • Log in to Windows Server as the Administrator.
  • Start the Registry Editor, In Window Server 2008, open the DOS prompt as Administrator (right click on the dos prompt and select “Run as administrator”. From the DOS prompt, enter regedit
  •               
        C:Windowssystem32>regedit

  • Back up the Windows Registry. If you plan to make several changes (for example, to configure the parallel engine), create a single backup before you make all the changes.  To backup the registry, From the Registry Editor, select File -> Export and save the registry to a file.
2. Configuring the Windows Registry: Enabling auto-tuning for paged pool size

Enable auto-tuning for the paged pool size. Enabling auto-tuning ensures that sufficient memory is available for paged pools.

The Windows Server kernel allocates memory in pools. These pools are known as the paged pool and the non-paged pool. Performance degradation and server instability might result if the memory for these pools is exhausted. To avoid this situation, you can enable auto-tuning at server startup by editing the PagedPoolSize registry value in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerMemory Management registry subkey.

2.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerMemory Management

(2.) in the right pane of the Registry Editor, right-click PagedPoolSize, and click Modify. 
(3.) in the Base area, select Hexadecimal.
(4.) In the Value data field, specify 0 (zero). If 0 is already specified, auto-tuning is enabled, and you do not need to change this value. If you must specify a value other than 0, the value data for SystemPages must be 0. 
(5.) Click OK. 

3. Configuring the Windows Registry: Enabling auto-tuning for system page size

Enable auto-tuning for the system page size. Enabling auto-tuning ensures that sufficient memory is available for page tables and page table entries.

To support virtual to physical address translation, the Windows operating system uses page tables and page table entries (PTEs). These tables reside in kernel memory. If not enough memory is allocated for the page tables, Windows might fail to create processes, threads, and I/O buffers. Because the parallel engine creates many processes to run a job, jobs will fail at startup or during run time if Windows does not have enough resources to create processes. 

You can enable auto-tuning at server startup by editing the SystemPages registry value in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerMemory Management registry subkey.

3.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

    HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerMemory Management

(2.) In the right pane of the Registry Editor, right-click SystemPages, and click Modify. 
(3.) In the Base area, select Hexadecimal.
(4.) In the Value data field, specify 0 (zero). Typically, specifying 0 for this value provides sufficient memory. If the available PTEs are exhausted, however, you can specify the maximum value, 0XFFFFFFFF. If you must specify a value other than 0, the value data for PagedPoolSize must be 0. 
 (5.) Click OK. 

4 Configuring the Windows registry: Setting the threshold for de-committing memory

Set the threshold for de-committing memory. On computers that have 1 GB or more of RAM, setting the threshold to the suggested value improves memory management.

When Windows frees memory at a specific address, the memory can remain committed, or it can be de-committed and marked as reserved. You can improve memory management on computers that have 1 GB or more of RAM by editing the HeapDeCommitFreeBlockThreshold registry value in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession Manager registry subkey. This registry value specifies the number of contiguous bytes of memory (around the address of the freed memory) above which the block of memory is de-committed. 

4.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

    HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession Manager

(2.) Specify the value data for HeapDeCommitFreeBlockThreshold registry value: 
(3.) If the registry value does not exist, click Edit > New > DWORD Value, and type the name of the registry value.
(4.) In the right pane of the Registry Editor, right-click the registry value, and click Modify.
(5.) In the Base area, select Hexadecimal.
(6.) In the Value data field, specify 0x000400. 
(7.) Click OK.

5 Configuring the Windows registry: Increasing the noninteractive desktop heap size

Increase the noninteractive desktop heap size to ensure that a sufficient number of processes can be created and run concurrently.

DataStage® processes are created in a specific desktop heap. Each process consumes a small amount of memory from the desktop heap. If you expect to run many InfoSphere DataStage jobs concurrently, increase the size of the noninteractive desktop heap to allow more processes to be created and run concurrently.

You increase the size of the noninteractive desktop heap by editing the SharedSection parameter string in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerSubSystems registry subkey. The SharedSection parameter string is located in the Windows registry value and uses the following format to specify heap size:

SharedSection=xxxx,yyyy,zzzz

Where:

  • · xxxx specifies the maximum size of the system-wide heap (in kilobytes)
  • · yyyy specifies the size of each desktop heap
  • · zzzz specifies the size of the desktop heap that is associated with a noninteractive Windows instance

5.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlSession ManagerSubSystems

(2.) In the right pane of the Registry Editor, right-click the Windows registry value, and click Modify.

(3.) In the Value data field, locate the SharedSection parameter string, and change the last SharedSection parameter from 512 to 2048. For example, specify the following values for this parameter:

   SharedSection=1024,3072,2048

 If you cannot change the size of the desktop heap to 2048, try setting the value to 1024. For example, specify the following values for this parameter:

SharedSection=1024,3072,1024
 (4.) Click OK. 

6. Configuring the Windows registry: Specifying TCP/IP settings

Specify settings for TCP/IP in the Windows Registry to ensure that the required number of parallel processes can run concurrently. 

You specify TCP/IP settings by editing the following registry values in the HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParameters registry subkey: 

MaxUserPort 

Specifies the maximum port number for ephemeral TCP ports. 

TcpNumConnections 

Specifies the maximum number of concurrent connections that TCP can open. This value significantly affects the number of concurrent osh.exe processes that are allowed. If the value for TcpNumConnections is too low, Windows cannot assign TCP ports to stages in parallel jobs, and the parallel jobs cannot run. 

These keys are not added to the registry by default.

6.1 Procedure

(1.) In the Registry Editor, select the following registry subkey:

HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParameters

(2.) Specify the following registry values: MaxUserPort and TcpNumConnections. 

a. If the registry value does not exist, click Edit > New > DWORD Value, and type the name of the registry value.
b. In the right pane of the Registry Editor, right-click the registry value, and click Modify.
c. In the Base area, select Hexadecimal.
d. In the Value data field, specify the following values: 
    Registry value name                         Value data                          Valid range
     MaxUserPort                                      65534                                 5000 – 65534 (decimal)
    TcpNumConnections                        65534                                 0 – 0xfffffe (65534 decimal)
e. Click OK.

7 Configuring Windows: Changing swap area size

Change the size of the swap area to improve performance.

7.1 Procedure

(1.) Click Start > Control Panel > System.
(2.) In the System Properties window, click the Advanced tab.
(3.) In the Performance area, click Settings. 
(4.) In the Performance Options window, select the Advanced tab. 
(5.) In the Virtual memory area, click Change. 
(6.) In the Virtual Memory window, specify a value for Maximum size. Set this value to be one to one- and-a-half times larger than the value for physical memory. 
(7.) Click Set. 
(8.) Click OK three times, and then close the Control Panel window. 

 8. Configuring Windows: Enable Physical Address Extension (PAE)

Enable /PAE switch. Edit the boot.ini file to remove the /3GB switch and, if necessary, add the /PAE switch. 

By default, Windows reserves 2 GB of memory for the kernel and 2 GB of memory for user mode programs. When the /3GB switch is specified in the boot.ini file, 3 GB of memory is reserved for user mode programs, and the amount of kernel memory is reduced by 1 GB. This setting results in an insufficient amount of memory for the kernel. To resolve this problem, remove the /3GB switch, if present from boot.ini file(for Windows Server 2000 or 2003). 

In addition, on servers that have more than 4 GB of physical memory installed, specify the /PAE switch in the boot.ini file. For Windows 2008 server, use the BCDEDIT.exe tool (see section 11.8.1 below). The /PAE switch specifies physical address extensions that allow a 32-bit processor to access memory that is above the 4 GB limit. 

8.1 Procedure

(1.) Click Start > Control Panel > System.
(2.) In the System Properties window, click the Advanced tab. 
(3.) In the Startup and Recovery area, click Settings. 
(4.) Click Edit to edit the boot.ini file. 
(5.) Remove the /3GB switch, if present. 
(6.) On servers that have more than 4 GB of physical memory installed, add the /PAE switch to the boot.ini file. 
(7.) Save and close the boot.ini file. 
(8.) Click OK twice, and then close the Control Panel window. 

Note: There is no boot.ini in Windows Server 2008. To enable PAE in Windows 2008 server, use the Boot Configuration Data Editor, BCDEDIT.EXE, which is located in c:windowssystem32 and follow the following steps: 

(1.) Open Command Prompt with elevated privileges (Run as Administrator) 
(2.) Execute the following command 

      BCDEDIT /SET PAE ForceEnabl

(3.) To verify, run BCDEDIT without any switch:

        BCDEDIT

You should see “PAE ForceEnabl” at the end of the screen.


9 Restart Windows Server

After you make changes to the Windows Registry, restart Windows to apply your changes. 

Aggregator Stage and Filter Stage with example

If we have a data as below 

table_data

Sno name
10 Jeff
10 Mike
10 Sam
20 Tom
30 Amy
20 Ron
40 John



And we need to get the same multiple times records into the one target. 
And single records not repeated with respected to Sno need to come to one target.

Take Job design as 

 

Read and load the data in sequential file. 

In Aggregator stage select group =dno 

Aggregator type = count rows 

Count output column =dno_cpunt( user defined ) 

In output Drag and Drop the columns required.Than click ok 

In Filter Stage 

—– At first where clause dno_count>1 
—–Output link =0 
—–At second where clause dno_count<=1 —–output link=0 Drag and drop the outputs to the two targets. Give Target file names and Compile and Run the JOb. You will get the required data to the Targets. 

Cloud computing in Datastage and Qualitystage

·        
Is InfoSphere Information Server cloud enabled?
Yes, InfoSphere Information Server’s DataStage and QualityStage offerings are enabled for deployment as a cloud-deployable data integration solution. Due to the versatility of the InfoSphere DataStage and QualityStage platform support, connectivity (including SalesForce.com) and standards used, these offerings can be deployed in a similar manner to how they are deployed on-premise within an enterprise today.
·         What are the benefits of running InfoSphere Information Server in the cloud?
IBM InfoSphere Information Server in the cloud opens up multiple new solutions to traditional and cloud-enabled information challenges. Coupled with it’s the ability to leverage a pay-as-you-go pricing model with the rapid deployment paradigm and massive scalability InfoSphere Information Server, the following off-premise solutions can be quickly delivered:
o    Enable systems integrators to provide data consolidation services to support complex application rationalization and migration projects lasting 3 to 12 months.
o    Flexible development capacity for existing clients using InfoSphere Information Server.
o    On-going data preparation for SaaS applications and business intelligence solutions.
This reduces the upfront time and expense involved with setting up hardware infrastructure and software licenses for projects lasting 3-12 months. With its collaborative, model-driven design environment coupled with massive scalability of a parallel processing architecture, it is ideally suited to rapid deployment and ensuring maximum throughput of trusted data per hour.
·         What kind of cloud environments can InfoSphere be deployed in, or planned?
InfoSphere Information Server can be deployed in both private and public cloud scenarios. This announcement highlights the availability of the first InfoSphere offering on a public cloud provider, specifically, Amazon Elastic Compute Cloud (Amazon EC2), a hosting service provided by Amazon Web Services. Clients building their own private clouds or using other cloud providers can leverage their existing InfoSphere Information Server licenses provided they adhere to the license terms and prepared their own machine images.
·         What is Amazon Web Services (AWS)?
AWS delivers a set of integrated services that form a computing platform “in the cloud”. Learn more about Amazon Web Services and the IBM offerings on AWS.
·         What deployment models are available for InfoSphere Information Server on Amazon EC2?
You can deploy InfoSphere Information Server’s DataStage and QualityStage on Amazon EC2 one of two ways:
o    Create your own InfoSphere Information Server-based Amazon Machine Images (AMI)s by using licenses that you already own.
o    Use the pre-built InfoSphere Information Server AMIs containing production-ready InfoSphere DataStage and InfoSphere QualityStage generated by IBM. There are hourly usage charges for the IBM generated AMIs including InfoSphere DataStage and QualityStage software licensing costs.
·         What does InfoSphere Information Server uniquely deliver on Amazon EC2?
On Amazon EC2, the pre-built InfoSphere Information Server AMI delivers an integrated ETL and Data Quality development environment that enables developers to cleanse, transform and move data with same tool using the same metadata. InfoSphere Information Server has a dynamic parallel execution engine that provides a design, deploy anywhere capability that dynamically and seamlessly scales up-or-down based upon hardware configuration thereby simplifying deployment and administration. Lastly, InfoSphere QualityStage is a probabilistic matching engine that ensures higher quality trusted data when linking any data domain across multiple, complex data sources.
·         How does InfoSphere Information Server on Amazon save time, or money, or worry?
This offering is primarily aimed at systems integrator partners, who have an enterprise application consolidation and migration service lines (e.g., SAP) and have skilled resources trained on InfoSphere Information Server. These partners are actively helping clients to use InfoSphere Information Server to consolidate, cleanse, and load the right trusted data into their target enterprise application of choice. The new “pay as you go” pricing offered by InfoSphere Information Server on EC2 is ideal for short-term projects lasting 3-12 months. Ordinarily a client would need to purchase a perpetual data integration license and make a hardware investment in order to support a short-term application consolidation or migration project.
Secondarily, this offering can help enterprise IT departments who have an existing investment in InfoSphere Information Server skills and have the need for additional test and development capacity. In addition, it helps them to achieve demonstrable ROI before committing to the capital investment in additional hardware and data integration software licenses to support short-term enterprise projects, or cloud integration scenarios.
·         Can I create my own InfoSphere Information Server-based AMIs for use on Amazon EC2?
Yes, in addition to being able to use the InfoSphere DataStage and QualityStage AMI generated by IBM, you can create your own InfoSphere Information Server-based AMIs for EC2. If you create your own InfoSphere Information Server-based AMIs, you are responsible for ensuring you own licenses for the software (such as InfoSphere Information Server and the operating system) running on the AMI.
·         Does IBM provide any InfoSphere AMIs for use on Amazon EC2?
Yes, IBM has partnered with AWS to make available a pre-bundled InfoSphere Information Server AMI containing InfoSphere DataStage and InfoSphere QualityStage for production use. This AMI consists of InfoSphere DataStage and QualityStage version 8.1 pre-installed on Novell SUSE Linux Enterprise Server version 10 (SLES 10 SP2) for the server and Windows 2003 Server for the InfoSphere Developer client licenses.
·         How are the production-ready InfoSphere Information Server AMI priced?
The InfoSphere Information Server AMI available from AWS has an hourly pay-as-you-use pricing that depends on the instance size you run the InfoSphere AMIs on. You can also incur charges for additional AWS services you use with InfoSphere. See the Amazon page for more information.
·         How will customers get information into and out of the cloud environment?
Together IBM InfoSphere Information Server and Amazon EC2 provide a broad range of solutions for getting information into and out of the cloud.
o    Load or save information via files or file sets. Network file transfers can be used for transferring data to Amazon from their enterprise for small and medium-sized data sets (gigabytes). For larger data sets, the Amazon import/export services should be used.
o    Access databases such as Oracle, MySQL, DB2 and leverage flat files and XML documents.
o    Leverage Web services or other service protocols to access or push information in and out of the cloud environment.
·         What support is available for the InfoSphere Information Server production-ready AMI?
At this point you can seek community assistance on the AWS forum. Documentation provided with the AMI image provides useful information on how to maximize the use of the InfoSphere software in the EC2 environment. IBM offers a full curriculum of InfoSphere educational courses with multiple delivery options.
·         Where can I access the production-ready InfoSphere Information Server AMI and related information?
Links to IBM Documents on Datastage Cloud computing on Amazon
Getting Started

Cloud computing in Datastage and Qualitystage – Amazon web services

Please note: Amazon has done away with Datastage on the cloud. To install Datastage on the cloud, the software and the license needs to be provided to Amazon and installation will be completed. (Thanks to the comments from ganther). So, all the details may not still be valid.

IBM InfoSphere Information Server provides an integrated ETL and Data Quality development environment that delivers a trusted, scalable and “Pay as you go” data integration solution that helps organizations derive more value from the complex, heterogeneous

 

This 64 bit kernel version 2.6.16 based Linux OS AMI contains a pre-installed InfoSphere Information Server version 8.1 product. The installed components are InfoSphere DataStage and InfoSphere QualityStage. It also contains an instance of the DB2 9.5 database that will host the metadata repository. The services tier is hosted by IBM WebSphere Application Server version 6.0.2.17.
There is a separate Windows based image with InfoSphere Information Server clients such as the web based Administration Console, the DataStage and QualityStage Designer, DataStage and QualityStage Administrator, and the DataStage and QualityStage Director.

Prerequisites and Resources:

Instance Type
The InfoSphere Information Server AMI on 64 bit kernel version 2.6.16 based Linux OS, can be used with large, Extra Large or High-CPU Extra Large instance type based on project requirement.
Software Included
  • 2.6.16 kernel level based Linux OS
  • InfoSphere DataStage 8.1 and InfoSphere QualityStage 8.1 (64 bit)
Resources and Documentation
  • Link to the Information Server Client AMI : (AWS to add link to the Client Catalog Entry)
  • Frequently Asked Questions for Information Server AMIs.
  • Get Started with Information Server AMIs.
  • Cloud Computing and Information Server
  • Information Server on DeveloperWorks
  • For a full listing of IBM products featured on AWS, please see the IBM developerWorks Cloud Computing Resource Center
  • Please visit http://www-01.ibm.com/software/data/integration/ for extensive information about the IBM Information Server suite.
  • If you are interested in purchasing a new IBM InfoSphere Information Server license, please visit the IBM Software Online Catalog.
  • InfoSphere DataStage product number 5724-Q36
  • InfoSphere QualityStage product number 5724-Q36
  • DB2 9.5 product number
  • The IBM License Information (LI) is presented in the AMI for your acceptance in English only. If you would prefer to view the LI in another language, please search for the program license agreement here using the program numbers above. You’ll then be asked to select your language.



Amazon Web Services    http://aws.amazon.com/

https://aws.amazon.com/amis/ibm-infosphere-datastage-qualitystage-production-ami

 

Amazon Web Services

Cost – Pay as you go, no long term commitment, no upfront costs
Elasticity – Deploy instantly and scale up or down as and when needed
Security managed by Amazon

AMI – Amazon Machine Image

Datastage Architecture

Datastage needs to be installed on a server. Most servers supported by IIS are based on Windows/Unix/Linux/AIX operating systems. Datastage Clients would be installed on the users workstations/laptops. Users would include Developers, testers, business analyst, operations team member (like production support).

Client Components are explained here. http://datastageetlexpert.blogspot.com/2012/06/client-components-in-datastage-75.html

What is BI/Business Intelligence?

Business intelligence (BI) is a broad category of application programs and technologies for gathering,
storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI
applications include the activities of decision support, query and reporting, online analytical processing
(OLAP), statistical analysis, forecasting, and data mining.
Examples : BusinessObjects http://www.businessobjects.com/