September 15, 2013 Difference between Primary Key and Partition Key A primary key combines the unique and not-null constraints. It can be a collection of columns, in which case it is called a composite primary key. A partition key is just a part (a subset) of the primary key. There are several partitioning methods, such as hash, DB2, and random; when hash partitioning is used, we specify the partition key, and its value determines which partition each row is assigned to.
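A minimal Python sketch of the idea (not DataStage code): hash partitioning assigns each row to a partition by hashing only the partition-key columns, so rows with the same partition-key value land in the same partition. The column names below are illustrative assumptions.

```python
# Sketch: distribute rows across partitions by hashing the partition key,
# here assumed to be a subset of the composite primary key.

def hash_partition(rows, partition_key, num_partitions):
    """Assign each row (a dict) to a partition based on its partition-key columns."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[col] for col in partition_key)       # build the partition-key tuple
        partitions[hash(key) % num_partitions].append(row)   # same key -> same partition
    return partitions

# Example: composite primary key (customer_id, order_id); partition key is customer_id only.
rows = [
    {"customer_id": 101, "order_id": 1, "amount": 250.0},
    {"customer_id": 101, "order_id": 2, "amount": 75.5},
    {"customer_id": 205, "order_id": 1, "amount": 40.0},
]
parts = hash_partition(rows, partition_key=["customer_id"], num_partitions=4)
```

Because only customer_id is hashed, both orders for customer 101 end up in the same partition, which is the behaviour the partition key is chosen to guarantee.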
September 15, 2013 Dimensional Modeling Dimensional Modeling is a logical design technique that seeks to present data in a standard framework that is intuitive and allows for high-performance access.
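As an illustration (not from the original note), the typical result of dimensional modeling is a star schema: a fact table of numeric measurements joined to descriptive dimension tables. The hypothetical tables below sketch that structure in Python.

```python
# Hypothetical star schema: one fact table keyed to two dimension tables.
date_dim = {
    20130915: {"date": "2013-09-15", "month": "September", "year": 2013},
}
product_dim = {
    1: {"name": "Widget", "category": "Hardware"},
}
sales_fact = [  # each fact row carries foreign keys plus numeric measures
    {"date_key": 20130915, "product_key": 1, "quantity": 3, "revenue": 29.97},
]

# "Intuitive, high-performance access": aggregate a measure by a dimension attribute.
revenue_by_category = {}
for fact in sales_fact:
    category = product_dim[fact["product_key"]]["category"]
    revenue_by_category[category] = revenue_by_category.get(category, 0) + fact["revenue"]
print(revenue_by_category)  # {'Hardware': 29.97}
```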
September 13, 2013 Transformations related to Data Quality The transformations are grouped below by component category.

Data Source Components:
· CSV Source – Connects to files whose data is organized in a delimited format
· Database Source – Connects directly to a database to provide source data for a plan
· Fixed Width Source – Specifies a fixed-width file as the data source for a plan
· Real Time Source – Accepts input in real-time mode
· SAP Source – Connects to an SAP system to provide source data for a plan
· CSV Match Source – Compares the records in a single source file against one another to identify duplicates in the file
· CSV Dual Match Source – Matches data from two discrete files
· DB Match Source – Connects to the Data Quality repository to choose tables and columns for use in a matching plan
· Group Source – Defines the input data for a plan by reading the set of group files created by a Group Sink in another plan
· Dual Group Source – Performs matching operations on grouped data from two discrete, original data sources

Frequency Components:
· Count – Tabulates the quantities of discrete data values in a selected column
· MinAvgMax – Provides the minimum, maximum, and average data values for selected columns
· Range Counter – Tabulates the frequency and distribution of numerical data values in user-selected fields
· Missing Values – Searches for specific values in an input field and determines the frequency of those values within that field

Analysis Components:
· Character Labeller – Provides a character-by-character profile of the data values in a data field
· Token Labeller – Analyzes the format of the data values within a field and categorizes each value according to a list of standard or user-defined tokens

Transformation Components:
· Search Replace – Removes or replaces user-defined values in a group
· Word Manager – Applies one or more reference sources (data dictionaries) to an input dataset
· Merge – Combines the data values from multiple input fields to form a single output field
· To Upper – Alters the case of a dataset
· Rule Based Analyzer – Defines and applies one or more business rules to selected input data
· Scripting – Provides greater flexibility than the Rule Based Analyzer for building customized rules and processes into a data quality plan using TCL (Tool Command Language)

Parsing Components:
· Splitter – Parses the data values in a text field into discrete new fields by comparing the source data to one or more reference datasets
· Token Parser – Parses free-text fields that each contain multiple tokens and writes each token to a discrete field
· Profile Standardizer – Parses an input field from a Token Labeller into a number of output fields based on a data structure that you define
· Context Parser – Parses free-text fields containing multiple tokens into multiple single-token fields based on the value and the relative position of the tokens

Key Field Generator Components:
· Soundex – Recognizes phonetic matches between alphabetic strings; it analyzes the phonetic components of a word and assigns a value to the string based on the phonetic characteristics of its initial characters
· Nysiis – Converts the values of an input field into their phonetic equivalents and reconstitutes the spelling of each string based on its phonetic characteristics

Matching Components:
· Edit Distance – Derives a match score for two data values by calculating the minimum "cost" of transforming one string into another through the insertion, deletion, and replacement of characters (a sketch of this calculation follows the list)
· Jaro Distance – Calculates the general similarity between two data values; the Jaro Distance algorithm reduces the match score for a pair of values if they do not share a common prefix
· Hamming Distance – Derives a match score for a pair of data strings by calculating the number of positions in which their characters differ
· Bigram – Matches data values based on the occurrence of consecutive character pairs in both data strings of a matching pair
· Mixed Field Matcher – Compares pairs of data values at a time to identify matches in a dataset where data values of the same or related types have been entered across several fields
· Weight Based Analyzer – Accepts as input the results from any or all matching components in a plan and calculates a single, overall match score for the plan's matching operations

Validation Components:
· Address Validator – Validates a postal address by comparing data to a database of postally correct addresses prepared in database form by a third-party vendor
· International AV – Validates international postal address data prepared in database form by a third-party vendor
· North America AV – Validates US and Canadian postal addresses

Data Sink Components:
· CSV Sink – Defines a delimited (for example, comma-separated) file as the output format
· Fixed Width Sink – Generates plan output in a fixed-width file format
· Report Sink – Generates a report file, in any one of several formats, that displays the plan output data
· CSV Merge Sink – Merges columns from two sources into a single sink file
· CSV Match Sink – Creates a delimited output file containing data generated by a matching plan
· Match Key Sink – Appends match plan output data directly to the source database
· Group Sink – Creates groups, a series of files in a Data Quality-proprietary format that organizes the plan data according to user-specified key data fields
· Database Sink – Writes the plan output to a database
· Database Report Sink – Generates report data for a plan and inserts this data into the Data Quality repository
· SAP Sink – Writes plan output to an SAP database
· Realtime Sink – Enables the development of plans that process output data in real time, for example, to deliver data to another application
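To make the edit-distance idea concrete, here is a minimal Python sketch. It computes the classic Levenshtein distance (insertions, deletions, replacements); the normalization into a 0–1 match score is an assumption for illustration, not Informatica Data Quality's documented formula.

```python
# Sketch of an edit-distance match score; the 0-1 normalization is an assumption.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement
        prev = curr
    return prev[-1]

def match_score(a: str, b: str) -> float:
    """Scale the distance so identical strings score 1.0 and disjoint strings near 0.0."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

print(match_score("JOHNSON", "JONSON"))  # one deletion away, roughly 0.86
```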
September 13, 2013 Capacity Planning for Datastage Plan the use of disk, volume group, and file system resources to create an optimal operating environment for IBM® Information Server. As part of capacity planning for your installation, size your file systems and databases to accommodate your anticipated needs, and after the installation continue to monitor these file systems to ensure that sufficient space remains available. Attention: if a file system runs out of space, IBM Information Server stops functioning correctly. Capacity planning should also cover the RAID and SAN configurations (see the next entry) and the capacity requirements of any plug-ins that are installed separately.
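As a minimal illustration of the ongoing monitoring step (not part of IBM's documentation), the Python sketch below checks free space on a few assumed file-system paths and flags any that fall below a threshold; the paths and the 10% limit are placeholders.

```python
# Hypothetical free-space check; the paths and the 10% threshold are assumptions.
import shutil

WATCHED_PATHS = ["/opt/IBM/InformationServer", "/data/ds/resource", "/scratch/ds"]
MIN_FREE_FRACTION = 0.10

for path in WATCHED_PATHS:
    usage = shutil.disk_usage(path)          # total, used, free in bytes
    free_fraction = usage.free / usage.total
    if free_fraction < MIN_FREE_FRACTION:
        print(f"WARNING: {path} has only {free_fraction:.1%} free space left")
    else:
        print(f"OK: {path} has {free_fraction:.1%} free")
```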
September 13, 2013 RAID and SAN configurations Job performance (for IBM WebSphere® DataStage®, IBM WebSphere QualityStage™, and IBM WebSphere Information Analyzer) depends on all components being optimized. For RAID (redundant array of independent disks) and SAN (storage area network) configurations, maximum performance is a combination of maximum bandwidth (controllers and disk) and minimized contention. The following guidelines can assist in the configuration of RAID or SAN technologies:
· Ensure that your database is optimally tuned.
· Ensure that the computers where you install the metadata repository tier and the services tier are on a high-speed network.
· Minimize contention between temporary (scratch, buffer, and sort) and data file systems. For temporary storage, do not use a network file system (NFS); consider using local disk instead.
· Minimize contention between the disks and controllers that are associated with IBM Information Server file systems and other applications or servers. Consider isolating multiple mount points to separate high-speed disk interconnects and controllers.
· Consider the trade-off between the granularity of file systems and the underlying configuration in comparison to the available, unused storage.
· Do not create overly complex device configurations. These can be difficult to administer and might not offer a noticeable performance improvement.
· If possible, review your configuration with hardware, storage, operating system, and application (IBM Information Server) specialists.
· Remember that previous experience in designing I/O systems for nonparallel engines might lead to non-optimal configurations for IBM Information Server.
The optimal disk configuration balances cost, complexity, ease of administration, and ultimate performance.
September 13, 2013 Template of RAID and SAN configuration Installation of a single tier on a computer requires 4 GB of memory.
Disk space on the server:
· XXX MB to install DataStage Server Edition
· XXX MB to install DataStage Enterprise Edition
· XXX MB for RTI plug-ins and the JVM
· XXX MB of temporary storage during installation to unpack components
· 3 MB per project to install the DataStage server
· XX MB of free space in /var
· Sufficient storage for any data held by the jobs, such as lookup data or temporary data created and used in the various stages of the ETL life cycle, and for DataStage tables or files
· Additional space for temporary data storage while a DataStage job is running
Disk space on the client:
· XX MB in the Windows 2000 or Windows XP system32 directory, as appropriate
· XXX MB in the DataStage install directory for the DataStage clients and documentation (typical installation)
· Additional space for temporary data storage while a DataStage job is running
September 13, 2013 Datastage Server Job Code Walkthrough Checklist The checkpoints are grouped below by category.

General:
· Check that naming conventions are followed for job names
· Check that annotations are used where appropriate to describe the job and its stages
· Check that the flow of stages is from left to right
· Check that the extract, lookup, transformation, and load processes are separated for restartability
· Check that stage names are meaningful and follow the standards defined for the project
· Check that link names are meaningful and follow the standards defined for the project

Sequential File Stage:
· Check that the input and output filenames are parameterized
· Check that the stage refers to the correct input and output filenames
· Check that the properties set in the Format tab match the sequential file format, for example the delimiter and quote character
· Check that the column metadata matches the layout of the file
· Use "View Data" to see whether the file can be accessed successfully (ignore any error if the file is a new output file)

Database Stages (ODBC, Oracle, DB2/UDB):
· Check that the database source name, user ID, and password are parameterized
· Check that the table name is specified and the generated SQL matches the desired operation
· Check that user-defined SQL is used only in unavoidable circumstances and that the select list names columns instead of using "SELECT *"
· Check that the necessary filters are applied on the table using a WHERE clause in the stage output
· Check that the columns specified for the ORDER BY clause are correct in the stage output
· Check that the required update action is specified for the table in the stage input
· Check that the column metadata matches the table DDL
· Use "View Data" to see whether the table can be accessed successfully
· Check that the job uses native connectivity to databases through plug-in database stages instead of ODBC

Transformer Stage:
· Check that all the required derivations are performed for columns in output links
· Check that validations are performed for numeric and date fields even if not mentioned in the design (a sketch of this pattern follows the checklist)
· Check that a null check is performed for all the columns used in expressions
· Check that an appropriate error message is generated for invalid values in numeric or date field validations
· Check that the data types of columns and constants used in expressions match, to avoid needless type conversions
· If multiple column derivations use the same part of an expression, check that the shared expression is coded once and referenced through a stage variable
· Check that stage variable names are meaningful
· If an expression includes calculated constant values, check that it is declared as a stage variable with an initial value specified in the variables list
· Check that the constraint conditions are valid and correct
· Check that a reject link is used to capture records rejected during processing
· If the subsequent stage is a database stage, check that the output links do not output NULL values for NOT NULL columns in the table
· If a reference input link is used, check that the key expression on the reference link is correct
· If there is more than one output link, check that the output links are in the desired execution order
· Check whether any before-stage or after-stage subroutines are required and used with correct input values
· Check whether it is possible to optimize by combining derivations from multiple transformers into a single transformer

Hashed File Stage:
· Check that the input and output filenames are parameterized
· Check that the stage refers to the correct input and output filenames
· If the Hashed File stage is used to write to and read from the same file, check that the same filename is specified in the input and output tabs
· Check that the column metadata matches the layout of the file
· Use "View Data" to see whether the file can be accessed successfully (ignore any error if the file is a new output file)

Sort Stage:
· Check that the input and output column metadata are the same
· Check that the sort specifications are valid and correct per the requirements
· Check whether any before-stage or after-stage subroutines are required and used with correct input values

Aggregator Stage:
· Check that the column derivations in the Output tab correspond to the required aggregations
· Check whether any before-stage or after-stage subroutines are required and used with correct input values

Job Properties:
· Check that the short and full job descriptions are entered per standards
· Check whether any before-job or after-job subroutines are required and used with correct input values
· Check that parameters are used in the jobs instead of hard-coded values
· Check that parameter naming conventions are followed
· Check that correct data types are assigned to parameters
· Check that default values for parameters are set correctly, especially for batch jobs
· Check that parameters involving passwords are encrypted
· Check that appropriate help text is provided for parameters (useful if the job is called from a job sequence)
· In an NLS environment, some date transforms do not work for most formats; check that the time/date locale for the job is set to OFF.
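To illustrate the validation-and-reject pattern behind the Transformer checkpoints (null checks, numeric/date validation, an error message, and a reject link), here is a minimal Python sketch. It is a language-neutral illustration with made-up field names, not DataStage Transformer code.

```python
# Hypothetical illustration of the Transformer pattern: validate each record's
# numeric and date fields, route good rows to the output and bad rows to a reject
# list with an explanatory error message. Field names are assumptions.
from datetime import datetime

def validate(record):
    """Return (is_valid, error_message) for one input record."""
    if record.get("amount") is None:                      # null check before use
        return False, "amount is NULL"
    try:
        float(record["amount"])                           # numeric validation
    except (TypeError, ValueError):
        return False, f"invalid numeric value: {record['amount']!r}"
    try:
        datetime.strptime(record.get("order_date", ""), "%Y-%m-%d")  # date validation
    except ValueError:
        return False, f"invalid date value: {record.get('order_date')!r}"
    return True, ""

output_link, reject_link = [], []
for rec in [{"amount": "12.50", "order_date": "2013-09-13"},
            {"amount": None, "order_date": "2013-09-13"}]:
    ok, err = validate(rec)
    (output_link if ok else reject_link).append(rec if ok else {**rec, "error": err})
```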
September 13, 2013 Containers Code Walkthrough Checklist
Local Container:
· Check that local container names are meaningful and follow the standards defined for the project
· Check that the link names between the input or output stage and the stage in the container match the link names in the server job
Shared Container:
· Check that shared container names are meaningful and follow the standards defined for the project
· Check that the link mapping (to the container link) is validated in the shared container properties
September 13, 2013 Job Sequence – Code Walkthrough Checklist
Job Sequence:
· Check that naming conventions are followed for job sequence names
· Check that activity names are meaningful and follow the standards defined for the project
· Check that the flow of activities is from left to right
· Check that trigger names are meaningful and follow the standards defined for the project
· If an Exception Handler activity is used, check that the "Automatically handle activities that fail" option is set in the job sequence properties
· If a Wait-For-File activity is used to trigger the job, check that the trigger file is deleted after the trigger or on completion of the job (a sketch of this pattern follows the checklist)
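As an illustration of the Wait-For-File checkpoint (a hypothetical sketch, not the DataStage activity itself): wait for a trigger file, run the dependent work, then remove the trigger file so the sequence cannot re-fire on stale input. The path, timeout, and job call below are all assumptions.

```python
# Hypothetical sketch of the wait-for-file / delete-after-trigger pattern.
import os, time

TRIGGER_FILE = "/data/landing/orders.trg"   # assumed trigger file path
TIMEOUT_SECONDS = 600

def run_load_job():
    """Placeholder for invoking the triggered job (for example, via a scheduler or CLI)."""
    print("running load job...")

def wait_for_file(path, timeout):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(5)                        # poll every few seconds
    return False

if wait_for_file(TRIGGER_FILE, TIMEOUT_SECONDS):
    run_load_job()
    os.remove(TRIGGER_FILE)                  # delete the trigger so it cannot re-fire
else:
    raise TimeoutError(f"Trigger file {TRIGGER_FILE} did not arrive in time")
```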
September 13, 2013 Routines: Code Walkthrough Checklist
Routines:
· Check that short and long descriptions are entered and that they describe the functioning of the routine (including input arguments, return values, conditions for usage, and risks, if any)
· For the BASIC code within the routines, check that the variables used follow the naming standards
· Check that the BASIC code is properly indented