Difference between Primary Key and Partition Key

A Primary Key is a combination of unique and not null constraints. It can also be a collection of columns, called a composite primary key. A Partition Key is just a part of the Primary Key. There are several partition methods, such as Hash, DB2, and Random; when using Hash partitioning we specify the Partition Key.
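
As a rough illustration (the table and column names below are hypothetical, and real engines such as DataStage use their own internal hash functions), the following Python sketch shows the relationship: the composite primary key uniquely identifies a row, while only the partition key, a subset of those columns, decides which partition the row is routed to.

```python
# Illustrative sketch only: table/column names are hypothetical, and real
# engines (DataStage, databases) use their own hash functions.
import zlib

NUM_PARTITIONS = 4

# Composite primary key: (customer_id, order_id) uniquely identifies each row.
rows = [
    {"customer_id": 101, "order_id": 1, "amount": 250.00},
    {"customer_id": 101, "order_id": 2, "amount": 75.50},
    {"customer_id": 202, "order_id": 1, "amount": 130.00},
]

def partition_of(row, partition_key=("customer_id",)):
    """Hash only the partition key, which is a subset of the primary key."""
    key_bytes = "|".join(str(row[col]) for col in partition_key).encode()
    return zlib.crc32(key_bytes) % NUM_PARTITIONS

for row in rows:
    print(row, "-> partition", partition_of(row))
```

Rows that share the same partition key value always land in the same partition, even though their full primary keys differ.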

Dimensional Modeling

Dimensional Modeling is a logical design technique that seeks to present data in a standard framework that is intuitive and allows for high-performance access.
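
A minimal sketch of the idea, using hypothetical table and column names: numeric measurements sit in a fact table, descriptive attributes sit in dimension tables, and fact rows reference the dimensions by key, so a business question becomes a simple lookup-and-aggregate.

```python
# Hypothetical star schema: one fact table referencing two dimension tables.
dim_product = {   # product_key -> descriptive attributes
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Electronics"},
}
dim_date = {      # date_key -> calendar attributes
    20240101: {"year": 2024, "month": 1, "quarter": "Q1"},
}
fact_sales = [    # one row per sale: dimension keys plus numeric measures
    {"date_key": 20240101, "product_key": 1, "qty": 3, "revenue": 29.97},
    {"date_key": 20240101, "product_key": 2, "qty": 1, "revenue": 49.99},
]

# "Revenue by product category" becomes a join of the fact table to one dimension.
revenue_by_category = {}
for sale in fact_sales:
    category = dim_product[sale["product_key"]]["category"]
    revenue_by_category[category] = revenue_by_category.get(category, 0.0) + sale["revenue"]

print(revenue_by_category)   # {'Hardware': 29.97, 'Electronics': 49.99}
```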

Transformations related to Data Quality

Component Category
Transformation Name
Description
Data Source Components
CSV source
Connects to files whose data is organized in a delimited format
Database Source
Connects directly to a database to provide source data for a plan
Fixed Width Source
Specifies a fixed-width file as the data source for a plan
Real Time Source
Allows a plan to accept input in real-time mode
SAP Source
Connects to a SAP system to provide source data for a plan
CSV Match Source
Compares the records in a single source file against one another to identify duplicates in the file
CSV Dual Match Source
Allows data from two discrete files to be matched
DB Match Source
Connects to Data Quality repository to choose tables and columns for use in a matching plan.
Group Source
Defines the input data for a plan by reading the set of group files created by a Group Sink in another plan
Dual Group Source
Allows matching operations to be performed on grouped data from two discrete, original data sources
Frequency Components
Count
Tabulates the quantities of discrete data values in a selected column
MinAvgMax
Provides the minimum, maximum, and average data values for selected columns
Range Counter
Tabulates the frequency and distribution of numerical data values in user-selected fields
Missing Values
Searches for specific values in an input field and determines the frequency of the values within that field
Analysis Components
Character Labeller
Provides a character-by-character profile of the data values in a data field
Token Labeller
Analyzes the format of the data values within a field and categorizes each value according to a list of standard or user-defined tokens.
Transformation Components
Search Replace
Removes or replaces user-defined values in a group
Word Manager
Applies one or more reference sources (data dictionaries) to an input dataset
Merge
Combines the data values from multiple input fields to form a single output field
To Upper
Alters the case of a dataset
Rule Based Analyzer
Allows one or more business rules to be defined and applied to selected input data
Scripting
Provides greater flexibility than the Rule Based Analyzer to build customized rules and processes into a data quality plan using TCL (Tool Command Language)
Parsing Components
Splitter
Parses the data values in a text field into discrete new fields by comparing the source data to one or more reference datasets
Token Parser
Parses free-text fields that each contain multiple tokens, writing each token to a discrete field.
Profile Standardizer
Parses an input field from a token labeller to a number of output fields based on a data structure that you define.
Context Parser
Parses free-text fields containing multiple tokens into multiple single-token fields based on the value and the relative position of the tokens
Key Field Generator
Soundex
Recognizes phonetic matches between alphabetic strings – it analyzes the phonetic components of a word and assigns a value to the string based on the phonetic characteristics of the initial characters in the string.
Nysiis
Converts the values of an input field into their phonetic equivalent and reconstitutes the spelling of the string based on its phonetic characteristics.
Matching Components
Edit Distance
Derives a match score for two data values by calculating the minimum “cost” of transforming one string into another by the insertion, deletion, and replacement of characters.
Jaro Distance
Calculates the general similarity between two data values; however, the Jaro Distance algorithm reduces the match score for the pair of values if they do not share a common prefix.
Hamming Distance
Derives a match score for a pair of data strings by calculating the number of positions in which characters differ between them
Bigram
Matches data values on the basis of the occurrence of consecutive character pairs in both data strings in a matching pair (a generic sketch of these match-scoring techniques follows this table)
Mixed Field Matcher
Compares pairs of data values at a time to identify matches in a dataset wherein data values of the same type or related types have been entered across several fields
Weight Based Analyzer
Accepts as input the results from any or all matching components in a plan and calculates a single, overall match score for the plan’s matching operations.
Validation Components
Address Validator
Validates a postal address by comparing the data to a database of postally-correct addresses prepared by a third-party vendor.
International AV
Validates international postal address data against reference address data prepared in database form by a third-party vendor
North America AV
Validates US and Canada postal addresses
Data Sink Components
CSV Sink
Defines a delimited (for example, comma-separated) file as the output format
Fixed Width Sink
Generates plan output in a fixed-width file format
Report Sink
Generates a report file, in any one of several formats, that displays the plan output data
CSV Merge Sink
Merges columns from two sources to a single sink file.
CSV Match Sink
Creates a delimited output file containing data generated by a matching plan
Match Key Sink
Appends match plan output data directly to the source database
Group Sink
Creates groups – a series of files in a Data Quality-proprietary format that organizes the plan data according to user-specified key data fields.
Database Sink
Allows the plan output to be written to a database
Database Report Sink
Generates report data for a plan and inserts this data to the Data Quality repository.
SAP Sink
Allows plan output to be written to a SAP database
Realtime Sink
Enables the development of plans that will process output data in real time, for example, to deliver data to another application.
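
The key-field and matching components above follow a common pattern: compare field values to produce a similarity score between 0 and 1, then combine the per-field scores into a single overall match score. The Python sketch below illustrates that general technique only; it is not Data Quality's internal implementation, and the field names and weights are hypothetical.

```python
# Generic match-scoring sketch; not Data Quality's actual algorithms.

def edit_distance_score(a: str, b: str) -> float:
    """1.0 for identical strings; lower as more insert/delete/replace
    operations are needed to turn one string into the other."""
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # replacement
    longest = max(m, n) or 1
    return 1.0 - dist[m][n] / longest

def bigram_score(a: str, b: str) -> float:
    """Share of consecutive-character pairs (bigrams) common to both values."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def weighted_match_score(scores: dict, weights: dict) -> float:
    """Combine per-field scores into one overall score (the Weight Based Analyzer idea)."""
    total_weight = sum(weights.values())
    return sum(scores[field] * w for field, w in weights.items()) / total_weight

# Example: comparing two candidate duplicate records field by field.
scores = {
    "name": edit_distance_score("JOHNSON", "JONSON"),
    "city": bigram_score("SPRINGFIELD", "SPRINGFEILD"),
}
print(weighted_match_score(scores, weights={"name": 0.7, "city": 0.3}))
```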

Capacity Planning for Datastage

You plan the use of disk, volume group, and file system resources to create an optimal operating environment for IBM® Information Server.
As part of capacity planning for your installation, you must size your file systems and databases to accommodate your anticipated needs. After the installation, you must continue to monitor these file systems to ensure that sufficient space is available.
  •          RAID configuration
  •          SAN configuration

Attention: If the file system runs out of space, IBM Information Server stops functioning correctly. Capacity planning should also include the capacity requirements for any plug-ins that are installed separately.

RAID and SAN configurations

Performance of a job (for IBM WebSphere® DataStage®, IBM WebSphere QualityStage™, and IBM WebSphere Information Analyzer) depends on all components being optimized. For RAID (Redundant Array of Independent Disks) and SAN (storage area network) configurations, maximum performance is a combination of maximum bandwidth (controllers and disk) and minimized contention.
The following guidelines can assist in the configuration of RAID or SAN technologies:
  • Ensure that your database is optimally tuned.
  • Ensure that computers where you install the metadata repository tier and the services tier are on a high-speed network.
  • Minimize contention between temporary (scratch, buffer, and sort) and data file systems. For temporary storage, do not use a network file system (NFS). Consider using the local disk for temporary storage.
  • Minimize contention between the disks and controllers that are associated with IBM Information Server file systems and other applications or servers.
  • Consider isolating multiple mount points to separate high-speed disk interconnects and controllers.
  • Consider the trade-off between the granularity of file systems and the underlying configuration in comparison to the available, unused storage.
  • Do not create overly complex device configurations. These configurations can be difficult to administer and might not offer noticeable performance improvement.
  • If possible, check your configuration with expertise from hardware, storage, operating system, and application (IBM Information Server) personnel.
  • Remember that previous experience in designing I/O systems for nonparallel engines might lead to non-optimal configurations for IBM Information Server.
The optimal disk configuration balances cost, complexity, ease of administration, and ultimate performance.

Template of RAID and SAN configuration

Memory: Installation of a single tier on a computer requires 4 GB of memory.
Disk space:
On the server:
·         XXX MB to install the DataStage Server Edition
·         XXX MB to install the DataStage Enterprise Edition
·         XXX MB for RTI plug-ins and the JVM
·         XXX MB of temporary storage during installation to unpack components
·         3 MB per project to install the DataStage server
·         XX MB of free space in /var
·         Sufficient storage space for any data that is to be held in DataStage tables or files, such as lookup data or temporary data used in the various stages of the ETL life cycle
·         Additional space to allow for temporary data storage while a DataStage job is running
On the client:
·         XX MB in the Windows 2000 or Windows XP system32 directory, as appropriate
·         XXX MB in the DataStage install directory for DataStage clients and documentation (typical installation)
·         Additional space to allow for temporary data storage while a DataStage job is running

Datastage Server Job Code Walkthrough Checklist

Category
Checkpoint
General
Check if naming conventions are followed for job names
General
Check if annotations with appropriate descriptions are used where needed
General
Check if the flow of stages is from left to right
General
Check if the extract, lookup, transformation and load process are separate for restart purpose
General
Check if stage names are meaningful and as per the standards defined for the project
General
Check if link names are meaningful and used as per the standards defined for the project
Sequential File Stage
Check if the input and output filenames are parameterized
Sequential File Stage
Check if the stage refers to the correct input and output filenames
Sequential File Stage
Check if the properties set in the format tab matches the sequential file format, for example delimiter and quote character
Sequential File Stage
Check if the column metadata matches with the layout of the file
Sequential File Stage
Use “View Data” to see if the file can be accessed successfully and ignore any error if the file is a new output file
Database Stages like ODBC, Oracle, DB2/UDB
Check if the Database source name, User ID, Password are parameterized
Database Stages like ODBC, Oracle, DB2/UDB
Check if the table name is specified and the generated SQL matches with the desired operation
Database Stages like ODBC, Oracle, DB2/UDB
Check that user-defined SQL is used only in unavoidable circumstances and that the select list includes column names instead of “SELECT *”
Database Stages like ODBC, Oracle, DB2/UDB
Check if necessary filters are applied on the table using WHERE clause in stage output
Database Stages like ODBC, Oracle, DB2/UDB
Check if the columns specified for ORDER BY clause are correct in stage output
Database Stages like ODBC, Oracle, DB2/UDB
Check if the required Update action is specified for the table in stage input
Database Stages like ODBC, Oracle, DB2/UDB
Check if the column metadata matches with the table DDL
Database Stages like ODBC, Oracle, DB2/UDB
Use “View Data” to see if the table can be accessed successfully
Database Stages like ODBC, Oracle, DB2/UDB
Check if the job utilizes native connectivity to databases using plug-in database stages instead of ODBC
Transformer Stage
Check if all the required derivations are performed for columns in output links
Transformer Stage
Check if validations are performed for numeric and date fields even if not mentioned in design
Transformer Stage
Check if null check is performed for all the columns used in expressions
Transformer Stage
Check if an appropriate error message is generated for invalid values in numeric or date field validations
Transformer Stage
Check if the data types of columns and constants used in expressions match to avoid needless type conversions
Transformer Stage
If multiple column derivations use the same partial expression, check if that expression is coded once and referenced using stage variables
Transformer Stage
Check if the stage variable names are meaningful
Transformer Stage
If an expression includes calculated constant values, check if it is declared as a stage variable with an initial value specified in the variables list
Transformer Stage
Check if the constraint conditions are valid and correct
Transformer Stage
Check if reject link is used to capture rejected records during processing
Transformer Stage
If the subsequent stage is a database stage, check that the columns in the output links do not output NULL values for NOT NULL columns in the table
Transformer Stage
If reference input link is used, check if the key expression on the reference link is correct
Transformer Stage
If there is more than one output link, check that the output links are in the desired execution order
Transformer Stage
Check if any before or after-stage subroutines are required and used with correct input values
Transformer Stage
Check if it is possible to optimize and combine the derivations from multiple transformers into a single transformer
Hashed File Stage
Check if the input and output filenames are parameterized
Hashed File Stage
Check if the stage refers to the correct input and output filenames
Hashed File Stage
If the Hashed File stage is used to write and read from the same file, check if the same filename is specified in input and output tabs
Hashed File Stage
Check if the column metadata matches with the layout of the file
Hashed File Stage
Use “View Data” to see if the file can be accessed successfully and ignore any error if the file is a new output file
Sort Stage
Check if the input and output columns metadata are the same
Sort Stage
Check if the Sort specifications are valid and correct per requirements
Sort Stage
Check if any before or after-stage subroutines are required and used with correct input values
Aggregator Stage
Check if the column derivation in Output tab corresponds to the required aggregations
Aggregator Stage
Check if any before or after-stage subroutines are required and used with correct input values
Job properties
Check if the short & full Job descriptions are entered as per standards
Job properties
Check if any before-job or after-job subroutines are required and used with correct input values
Job properties
Check if parameters are used in the jobs instead of hard-coding
Job properties
Check if parameter naming conventions are followed
Job properties
Check if correct data types are assigned for parameters
Job properties
Check if default values for parameters are set correctly especially for batch jobs
Job properties
Check if parameters involving passwords are encrypted
Job properties
Check if appropriate help text for parameters is provided (this is useful if the job is called in a job sequence)
Job properties
In an NLS environment, some date transforms will not work for most formats. Check if time/date locale for the job is set to OFF.

Containers Code Walkthrough Checklist

Category
Checkpoint
Local Container
Check if local container names are meaningful and used as per the standards defined for the project
Local Container
Check if the link names between the input or output stage and the stage in the container match the link names in the server job
Shared Container
Check if shared container names are meaningful and used as per the standards defined for the project
Shared Container
Check if the link mapping (to container link) is validated in shared container properties

Job Sequence – Code Walkthrough Checklist

Category
Checkpoint
Job Sequence
Check if naming conventions are followed for job sequence names
Job Sequence
Check if activity names are meaningful and as per the standards defined for the project
Job Sequence
Check if the flow of activities is from left to right
Job Sequence
Check if trigger names are meaningful and as per the standards defined for the project
Job Sequence
If Exception Handler activity is used, check if “Automatically handle activities that fail” option is set in job sequencer properties
Job Sequence
If Wait-For-File activity is used to trigger the job, check if the file is deleted after the trigger or completion of the job

Routines: Code Walkthrough Checklist

Category
Checkpoint
Routines
Check if short and long descriptions are entered and they describe the functioning of the routine (including input argument, return values, conditions for usage & risks if any)
Routines
For the BASIC code within the routines, check if the variables used follow naming standards
Routines
Check if the BASIC code is properly indented