Code checklist for Datastage



Are naming conventions implemented for all objects (stages, links and datasets)?
Are all required parameter sets attached to the job?
Are all required jobs compiled with the latest parameter set? (Applies to projects where existing jobs are being changed.)
Are all source columns' data types set to Varchar with Nullable set to Yes?
Are all parameters passed correctly as required to the shared container?
Is annotation/description included within the job?
Are the dataset path, holding area path, etc. hardcoded anywhere?
Does the job produce WARNINGS when run? Check whether they can be avoided.
Is the Short Job Description / Full Job Description provided?
Is the job version number provided?
Are DataSet names in lower case with the .ds extension?
Is the pFileName parameter missing, with the file name hardcoded instead (wherever applicable)?
Ensure that null handling properties are set for all nullable fields. Do not set the null field value to a value that may legitimately appear in the source.
Ensure that all character fields are trimmed before any processing. Extra spaces in the data can lead to errors such as lookup mismatches, which are hard to detect.
If the partition type is to be changed in the next immediate stage, set the 'Preserve partitioning' option to 'Clear' in the current stage.
Make sure that appropriate partitioning and sorting are used in the stages wherever possible; this improves performance. Make sure that you understand the partitioning being used; otherwise leave it set to Auto.
Check that the parameter values are assigned to the jobs correctly through the sequencer.
Check whether datasets are used instead of sequential files for intermediate storage between jobs; this improves performance in a set of linked jobs.
Verify that intermediate files used by downstream jobs have UNIX read permission for all users.
Stage variables should be listed in the correct order. For example, a stage variable used in a calculation should appear higher in the list than the variable that performs the calculation.
Do not validate directly against a null field in a transformer. Use appropriate data types for the stage variables, and use IsNull(), IsNotNull() or Seq() for such validations.
When reading columns through a database stage, use upper case for column names and table names in SQL queries as a good practice.
When reading from or querying the database, use 'for fetch only' instead of 'with ur' to avoid table locks (see the example after this checklist).

For every Job Activity stage in a sequencer, ensure that 'Reset if required, then run' is selected where relevant.
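As a hedged illustration of the two query guidelines above (the schema, table and column names here are invented for the example), a read-only extract issued from a DB2 source stage might look like this, with identifiers in upper case and FOR FETCH ONLY declaring the result set read-only:

SELECT CUST_ID,
       CUST_NAME,
       CUST_STATUS
FROM   DWH.CUSTOMER
FOR FETCH ONLY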

Environment checklist for Datastage



Check that the proper user privileges are present in the database to run the DataStage jobs.

All MQTs (materialized query tables) related to a table should be present in the development environment. Privileges to run SET INTEGRITY on the MQTs and their base tables should also be present (see the SQL sketch after this checklist).
TSM (Tivoli Storage Manager) must be available so that the DB2 (database) connector can work when 'non-recoverable' is set to 'false'; otherwise it is not required.
There should be no DB2 (database) table locks. Check this before running the DataStage jobs.
Data partitions on the DB2 (database) tables should extend to the current date (or month, if partitioning is on fiscal_month). Check whether the table is partitioned, wherever applicable.
Control-M access should be in place so that the jobs can be run.
Data refresh tasks (if required) should be completed from Production: reference data, bootstrap data.

Substitute DB2 with the relevant database.
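A minimal SQL sketch of the DB2 pre-checks above, assuming a hypothetical schema DWH, table SALES_FACT and MQT SALES_MQT (all names are illustrative, and the exact catalog views available depend on the DB2 version):

-- Confirm the run user holds the required table privileges
SELECT GRANTEE, SELECTAUTH, INSERTAUTH, UPDATEAUTH, DELETEAUTH
FROM   SYSCAT.TABAUTH
WHERE  TABSCHEMA = 'DWH' AND TABNAME = 'SALES_FACT';

-- Bring a table out of set integrity pending state and refresh its MQT
SET INTEGRITY FOR DWH.SALES_FACT IMMEDIATE CHECKED;
REFRESH TABLE DWH.SALES_MQT;

-- Look for existing locks on the target table before starting the jobs
SELECT TABSCHEMA, TABNAME, LOCK_MODE
FROM   SYSIBMADM.LOCKS_HELD
WHERE  TABSCHEMA = 'DWH' AND TABNAME = 'SALES_FACT';

-- Verify that the range partitions extend to the current date
SELECT DATAPARTITIONNAME, LOWVALUE, HIGHVALUE
FROM   SYSCAT.DATAPARTITIONS
WHERE  TABSCHEMA = 'DWH' AND TABNAME = 'SALES_FACT'
ORDER BY SEQNO;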

Read Excel files in Datastage


V7.X

Convert the Excel files to CSV (save as .csv),

then read that file using the Sequential File stage.

To convert Excel to CSV, just save the file as a CSV and specify comma as the delimiter.

V8.0.1 and onwards

There is a way to read Excel files directly. Create a DSN and point it to the Excel file that needs to be read; this is similar to the ODBC connection that would be set up for a database. Once this is done, you can use the ODBC stage in DataStage to connect to this DSN. Each sheet in the workbook is treated as a table, and you can write a normal SQL query to pull data out of it.
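For example, assuming the workbook has a sheet named Sheet1 (the sheet name is an assumption for illustration), the user-defined SQL in the ODBC stage can be as simple as the query below; with the Microsoft Excel ODBC driver a sheet is referenced as its name followed by a dollar sign, enclosed in square brackets:

SELECT *
FROM   [Sheet1$]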

Datastage Certification Dumps for V8.5: 000-421

InfoSphere Information Server DataStage V8.5

Exam 000-421

129 questions

Test information:

  • Number of questions: 65
  • Time allowed in minutes: 90
  • Required passing score: 65%

 (IBM link – http://www-03.ibm.com/certify/tests/ovr421.shtml)
To register for the exam, please click on the IBM link and follow the IBM at Prometric link at the right of the webpage.


Download here:
https://docs.google.com/file/d/0B4bweMe6o5pwSWR1VVBLNDM2MUk/edit?usp=sharing 

Stages in Datastage

There are different types of Stages in Datastage. 


1) Database Stages
2) File Stages
3) Development Stages
4) Processing Stages
5) Real Time Stages


The most commonly used Database stages are:

a) Oracle Enterprise
b) ODBC Enterprise
c) ODBC Connector
d) DB2 Enterprise
e) Teradata Enterprise
f) Informix
g) Dynamic RDBMS
h) UniVerse
i) MS SQL Server Load


The most commonly used Processing stages are:

a) Aggregator Stage
b) Command Stage
c) Pivot Stage
d) Transformer Stage
e) Sort Stage
f) Merge Stage
g) Link Partitioner Stage
h) Join Stage
i) Lookup Stage
j) Copy Stage
k) Funnel Stage
l) Surrogate Key Stage
m) SCD (Slowly Changing Dimension) Stage
n) Filter Stage

The file stages are as follows

a) Sequential File Stage
b) Data Set Stage
c) File Set Stage
d) Lookup File Set Stage
e) Complex Flat File Stage
f) External Source Stage


The Development Stages are

a) Row Generator
b) Column Generator
c) Head Stage
d) Tail Stage
e) Sample Stage
f) Peek Stage

Datastage Scheduler options

For DataStage implementations under UNIX/Linux, Autosys is the most popular scheduler. The reasons Autosys is preferred are:

1. Autosys gives you various options, such as JOB_ON_ICE and JOB_ON_HOLD.
2. Scheduling is simple. If you have a job that you want to run every hour, then through DataStage you would have to schedule it 24 times, creating 24 processes (each with a distinct PID), whereas in Autosys this takes far less effort.
3. If you want to run a job on the first Monday of every month, you just have to set up a calendar in Autosys; in DataStage there is no straightforward way to do this.
4. If you want to run a job on the first business day of every month (what counts as a business day may vary by client), you again just have to set up a calendar in Autosys; in DataStage this is much harder.
In short, Autosys offers a wide range of scheduling options with less effort. Reusability and maintainability are also factors.

Other schedulers are –  

SeeBeyond, ControlM, at, cron

DataStage does not have a scheduler of its own; its scheduling option leverages the underlying operating system. For UNIX that means cron, and a check of the crontab entries for the scheduling user will show what has been scheduled. DataStage uses cron for recurring schedules and at for one-off schedules. On Windows, it uses Windows Scheduled Tasks.

From the operating system command line, logged in as the scheduling user, a “crontab -l” will list the scheduled jobs.

You can schedule a job to run in a number of ways:
  • Once today at a specified time
  • Once tomorrow at a specified time
  • On a specific day and at a particular time
  • Daily at a particular time
  • On the next occurrence of a particular date and time
Short procedure


From Director:

1. Select job
2. Add to Schedule (right-click or clock icon)
3. Select the ‘Every’ radio button
4. Make sure Monday through Friday are highlighted
5. Set time to 9:00 AM
6. Click OK
7. Set parameters
8. Click Schedule 

Procedure
  1. Select the job or job invocation you want to schedule in the Job Status or Job Schedule view.
    Note: You cannot schedule a job with a status of Not compiled or a web service-enabled job.
  2. Do one of the following to display the Add to schedule dialog box:
    • Choose Job > Add to Schedule .
    • Choose Add To Schedule from the appropriate shortcut menu.
    • Click the Schedule button on the toolbar.
  3. Choose when to run the job by clicking the appropriate option button:
      Today runs the job today at the specified time (in the future).
      Tomorrow runs the job tomorrow at the specified time.
      Every runs the job on the chosen day or date at the specified time in this month and repeats the run at the same date and time in the following months.
      Next runs the job on the next occurrence of the day or date at the specified time.
      Daily runs the job every day at the specified time.
  4. If you selected Every or Next in step 3, choose the day to run the job by doing one of the following:
    • Choose an appropriate day or days from the Day list.
    • Choose a date from the calendar.
      Note: If you choose an invalid date, for example, 31 September, the behavior of the scheduler depends upon the operating system of the computer that hosts the engine tier, and you might not receive a warning of the invalid date. Refer to your documentation for the engine tier host for further information.
  5. Choose the time to run the job. There are two time formats:
    • 12-hour clock. Click either AM or PM.
    • 24-hour clock. Click 24H Clock.
      Click the arrow buttons to increase or decrease the hours and minutes, or enter the values directly.
  6. Click OK. The Add to schedule dialog box closes and the Job Run Options dialog box appears.
  7. Fill in the job parameter fields and check warning and row limits, as appropriate.
  8. Click Schedule. The job is scheduled to run and is added to the Job Schedule view.

Datastage History

In the beginning DataStage was available as Ardent DataStage; its name was then changed to Ascential DataStage.


In 2005 IBM acquired the Ascential company and added the DataStage product to its WebSphere family. The October 2006 release of DataStage integrated it into the new IBM Information Server platform, but it was still called WebSphere DataStage.


IBM WebSphere DataStage is one of the products from IBM's WebSphere Information Integration suite and Information Server. The tool mostly uses graphical notation to present data integration solutions. DataStage is delivered in various versions, such as Server Edition and Enterprise Edition. In 2008 the suite was renamed InfoSphere Information Server and the product was renamed InfoSphere DataStage.

For more detailed early history, see below:

  • DataStage was conceived at VMark, a spin-off from Prime Computer that developed two notable products: the UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in 1996.
  • Lee Scheffler presented the DataStage product overview to the board of VMark in 1996 and it was approved for development. The product was in alpha testing in October, beta testing in November and became generally available in 1997.
  • VMark acquired UniData in 1997 and renamed itself Ardent Software. In 1999 Ardent Software was acquired by Informix, the database software vendor.
  • In 2001 IBM acquired Informix and took just the database business, leaving the data integration tools to be spun off as an independent software company called Ascential Software.

People refer to various DataStage options using the following edition names:

 

Enterprise Edition: the name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
Server Edition: the name of the original version of DataStage, representing server jobs. Early DataStage versions contained only server jobs. DataStage 5 added sequence jobs and DataStage 6 added parallel jobs via Enterprise Edition.
MVS Edition: mainframe jobs, developed on a Windows or UNIX/Linux platform and transferred to the mainframe as compiled mainframe jobs.
DataStage for PeopleSoft: a Server Edition with prebuilt PeopleSoft EPM jobs, under an OEM arrangement with PeopleSoft and Oracle Corporation.
DataStage TX: for processing complex transactions and messages, formerly known as Mercator.
DataStage SOA: the Real Time Integration pack, which can turn server or parallel jobs into SOA services.

Datastage TX overview

IBM WebSphere DataStage TX delivers the ability to easily and seamlessly automate high-volume, complex transactions without the need for additional coding, resulting in a quick return on investment.

IBM WebSphere DataStage TX 6.7 delivers rapid ROI through a highly scalable, open architecture. 

IBM WebSphere DataStage TX allows you to:
  • Quickly and seamlessly integrate many of the most popular applications, databases, and messaging systems
  • Respond quickly and with flexibility to rapidly evolving, strategic business requirements
  • Leverage the value of your enterprise systems and applications
  • Meet your project-based or full integration solution needs

IBM WebSphere DataStage Features & Benefits
IBM WebSphere DataStage provides these unique capabilities:

The Most Powerful ETL Solution
Supports the collection, integration and transformation of high volumes of data, with data structures ranging from simple to highly complex. DataStage manages data arriving in real time as well as data received daily, weekly or monthly.

The Most Scalable Platform
Enables companies to solve large-scale business problems through high-performance processing of massive data volumes. By leveraging the parallel processing capabilities of multiprocessor hardware platforms, DataStage Enterprise Edition can scale to satisfy the demands of ever-growing data volumes, stringent real-time requirements and ever-shrinking batch windows.

The Most Comprehensive Source and Target Support
Supports a virtually unlimited number of heterogeneous data sources and targets in a single job, including: text files; complex data structures in XML; ERP systems such as SAP and PeopleSoft; almost any database (including partitioned databases); web services; and SAS. 

Real-time Data Integration Support
Operates in real time, capturing messages from Message Oriented Middleware (MOM) queues using JMS or WebSphere MQ adapters to seamlessly combine data into conforming operational and historical analysis perspectives. IBM WebSphere RTI Services is service-oriented architecture (SOA) enabling middleware that brokers, enterprise-wide, the benefits of the Ascential Enterprise Integration Suite across a continuum of time constraints, application suites, interface protocols and integration technologies.

Advanced Maintenance and Development
Gives developers maximum speed, flexibility and effectiveness in building, deploying, updating and managing their data integration infrastructure. Full data integration reduces the development and maintenance cycle for data integration projects by simplifying administration and maximizing development resources. 

Complete Connectivity Between Any Data Source and Any Application
Ensures that the most relevant, complete and accurate data is integrated and used by the most popular enterprise application software brands, including SAP, Siebel, Oracle, PeopleSoft and J.D. Edwards.

How to perform a Lookup in Datastage?

Data in DataStage can be looked up from a hashed file or from a database (ODBC/Oracle) source. Lookups are always managed by the Transformer stage.

A Hashed File is a reference table based on key fields which provides fast access for lookups. They are very useful as a temporary or non-volatile program storage area. An advantage of using hashed files is that they can be filled up with remote data locally for better performance.
To increase performance, hashed files can be preloaded into memory for fast reads and support write-caching for fast writes.

There are also situations where loading a hashed file and using it for lookups is much more time consuming than accessing a database table directly. This usually happens when there is a need to access more complex data than a simple key-value mapping, for example when the data comes from multiple tables or must be grouped or processed in a database-specific way. In that case it is worth considering using an ODBC or Oracle stage.
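As a hedged illustration of the kind of reference query that favours a direct database lookup over a hashed file (the table and column names are hypothetical), the lookup could issue an aggregate built across two tables at query time:

SELECT c.COUNTRY_CODE,
       c.COUNTRY_NAME,
       COUNT(o.ORDER_ID) AS ORDER_CNT
FROM   DWH.COUNTRY c
LEFT JOIN DWH.ORDERS o
       ON o.COUNTRY_CODE = c.COUNTRY_CODE
GROUP BY c.COUNTRY_CODE, c.COUNTRY_NAME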
Please refer to the examples below to find out how lookups are used in DataStage.
In the transformer depicted below there is a lookup into a country dictionary hash file. If a country is matched, it is written to the right-hand column; if not, a "not found" string is generated.

Design of a datastage transformer with lookup

In the job depicted below there is a sequential file lookup, linked together with a hash file which stores the temporary data.
Sequential file lookup