September 15, 2013 Difference between Primary Key and Partition Key A primary key combines the unique and not-null constraints. It can be a collection of columns, in which case it is called a composite primary key. A partition key is just a part (a subset) of the primary key. There are several partitioning methods, such as hash, DB2, and random; when hash partitioning is used, we specify the partition key, and its value determines which partition each row is assigned to.
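A minimal Python sketch of the idea (not DataStage code): hash partitioning assigns each row to a partition by hashing only the partition-key columns, so rows with the same partition-key value land in the same partition. The column names below are illustrative assumptions.

```python
# Sketch: distribute rows across partitions by hashing the partition key,
# here assumed to be a subset of the composite primary key.

def hash_partition(rows, partition_key, num_partitions):
    """Assign each row (a dict) to a partition based on its partition-key columns."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[col] for col in partition_key)       # build the partition-key tuple
        partitions[hash(key) % num_partitions].append(row)   # same key -> same partition
    return partitions

# Example: composite primary key (customer_id, order_id); partition key is customer_id only.
rows = [
    {"customer_id": 101, "order_id": 1, "amount": 250.0},
    {"customer_id": 101, "order_id": 2, "amount": 75.5},
    {"customer_id": 205, "order_id": 1, "amount": 40.0},
]
parts = hash_partition(rows, partition_key=["customer_id"], num_partitions=4)
```

Because only customer_id is hashed, both orders for customer 101 end up in the same partition, which is the behaviour the partition key is chosen to guarantee.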
September 15, 2013 Dimensional Modeling Dimensional Modeling is a logical design technique that seeks to present data in a standard framework that is intuitive and allows for high-performance access.
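As an illustration (not from the original note), the typical result of dimensional modeling is a star schema: a fact table of numeric measurements joined to descriptive dimension tables. The hypothetical tables below sketch that structure in Python.

```python
# Hypothetical star schema: one fact table keyed to two dimension tables.
date_dim = {
    20130915: {"date": "2013-09-15", "month": "September", "year": 2013},
}
product_dim = {
    1: {"name": "Widget", "category": "Hardware"},
}
sales_fact = [  # each fact row carries foreign keys plus numeric measures
    {"date_key": 20130915, "product_key": 1, "quantity": 3, "revenue": 29.97},
]

# "Intuitive, high-performance access": aggregate a measure by a dimension attribute.
revenue_by_category = {}
for fact in sales_fact:
    category = product_dim[fact["product_key"]]["category"]
    revenue_by_category[category] = revenue_by_category.get(category, 0) + fact["revenue"]
print(revenue_by_category)  # {'Hardware': 29.97}
```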
September 13, 2013 Transformations related to Data Quality The transformations are grouped below by component category.

Data Source Components:
· CSV Source – Connects to files whose data is organized in a delimited format
· Database Source – Connects directly to a database to provide source data for a plan
· Fixed Width Source – Specifies a fixed-width file as the data source for a plan
· Real Time Source – Accepts input in real-time mode
· SAP Source – Connects to an SAP system to provide source data for a plan
· CSV Match Source – Compares the records in a single source file against one another to identify duplicates in the file
· CSV Dual Match Source – Matches data from two discrete files
· DB Match Source – Connects to the Data Quality repository to choose tables and columns for use in a matching plan
· Group Source – Defines the input data for a plan by reading the set of group files created by a Group Sink in another plan
· Dual Group Source – Performs matching operations on grouped data from two discrete, original data sources

Frequency Components:
· Count – Tabulates the quantities of discrete data values in a selected column
· MinAvgMax – Provides the minimum, maximum, and average data values for selected columns
· Range Counter – Tabulates the frequency and distribution of numerical data values in user-selected fields
· Missing Values – Searches for specific values in an input field and determines the frequency of those values within that field

Analysis Components:
· Character Labeller – Provides a character-by-character profile of the data values in a data field
· Token Labeller – Analyzes the format of the data values within a field and categorizes each value according to a list of standard or user-defined tokens

Transformation Components:
· Search Replace – Removes or replaces user-defined values in a group
· Word Manager – Applies one or more reference sources (data dictionaries) to an input dataset
· Merge – Combines the data values from multiple input fields to form a single output field
· To Upper – Alters the case of a dataset
· Rule Based Analyzer – Defines and applies one or more business rules to selected input data
· Scripting – Provides greater flexibility than the Rule Based Analyzer for building customized rules and processes into a data quality plan using TCL (Tool Command Language)

Parsing Components:
· Splitter – Parses the data values in a text field into discrete new fields by comparing the source data to one or more reference datasets
· Token Parser – Parses free-text fields that each contain multiple tokens and writes each token to a discrete field
· Profile Standardizer – Parses an input field from a Token Labeller into a number of output fields based on a data structure that you define
· Context Parser – Parses free-text fields containing multiple tokens into multiple single-token fields based on the value and the relative position of the tokens

Key Field Generator Components:
· Soundex – Recognizes phonetic matches between alphabetic strings; it analyzes the phonetic components of a word and assigns a value to the string based on the phonetic characteristics of its initial characters
· Nysiis – Converts the values of an input field into their phonetic equivalents and reconstitutes the spelling of each string based on its phonetic characteristics

Matching Components:
· Edit Distance – Derives a match score for two data values by calculating the minimum "cost" of transforming one string into another through the insertion, deletion, and replacement of characters (a sketch of this calculation follows the list)
· Jaro Distance – Calculates the general similarity between two data values; the Jaro Distance algorithm reduces the match score for a pair of values if they do not share a common prefix
· Hamming Distance – Derives a match score for a pair of data strings by calculating the number of positions in which their characters differ
· Bigram – Matches data values based on the occurrence of consecutive character pairs in both data strings of a matching pair
· Mixed Field Matcher – Compares pairs of data values at a time to identify matches in a dataset where data values of the same or related types have been entered across several fields
· Weight Based Analyzer – Accepts as input the results from any or all matching components in a plan and calculates a single, overall match score for the plan's matching operations

Validation Components:
· Address Validator – Validates a postal address by comparing data to a database of postally correct addresses prepared in database form by a third-party vendor
· International AV – Validates international postal address data prepared in database form by a third-party vendor
· North America AV – Validates US and Canadian postal addresses

Data Sink Components:
· CSV Sink – Defines a delimited (for example, comma-separated) file as the output format
· Fixed Width Sink – Generates plan output in a fixed-width file format
· Report Sink – Generates a report file, in any one of several formats, that displays the plan output data
· CSV Merge Sink – Merges columns from two sources into a single sink file
· CSV Match Sink – Creates a delimited output file containing data generated by a matching plan
· Match Key Sink – Appends match plan output data directly to the source database
· Group Sink – Creates groups, a series of files in a Data Quality-proprietary format that organizes the plan data according to user-specified key data fields
· Database Sink – Writes the plan output to a database
· Database Report Sink – Generates report data for a plan and inserts this data into the Data Quality repository
· SAP Sink – Writes plan output to an SAP database
· Realtime Sink – Enables the development of plans that process output data in real time, for example, to deliver data to another application
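To make the edit-distance idea concrete, here is a minimal Python sketch. It computes the classic Levenshtein distance (insertions, deletions, replacements); the normalization into a 0–1 match score is an assumption for illustration, not Informatica Data Quality's documented formula.

```python
# Sketch of an edit-distance match score; the 0-1 normalization is an assumption.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement
        prev = curr
    return prev[-1]

def match_score(a: str, b: str) -> float:
    """Scale the distance so identical strings score 1.0 and disjoint strings near 0.0."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

print(match_score("JOHNSON", "JONSON"))  # one deletion away, roughly 0.86
```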
September 13, 2013 Capacity Planning for Datastage Plan the use of disk, volume group, and file system resources to create an optimal operating environment for IBM® Information Server. As part of capacity planning for your installation, size your file systems and databases to accommodate your anticipated needs, and after the installation continue to monitor these file systems to ensure that sufficient space remains available. Attention: if a file system runs out of space, IBM Information Server stops functioning correctly. Capacity planning should also cover the RAID and SAN configurations (see the next entry) and the capacity requirements of any plug-ins that are installed separately.
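As a minimal illustration of the ongoing monitoring step (not part of IBM's documentation), the Python sketch below checks free space on a few assumed file-system paths and flags any that fall below a threshold; the paths and the 10% limit are placeholders.

```python
# Hypothetical free-space check; the paths and the 10% threshold are assumptions.
import shutil

WATCHED_PATHS = ["/opt/IBM/InformationServer", "/data/ds/resource", "/scratch/ds"]
MIN_FREE_FRACTION = 0.10

for path in WATCHED_PATHS:
    usage = shutil.disk_usage(path)          # total, used, free in bytes
    free_fraction = usage.free / usage.total
    if free_fraction < MIN_FREE_FRACTION:
        print(f"WARNING: {path} has only {free_fraction:.1%} free space left")
    else:
        print(f"OK: {path} has {free_fraction:.1%} free")
```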
September 13, 2013 RAID and SAN configurations Job performance (for IBM WebSphere® DataStage®, IBM WebSphere QualityStage™, and IBM WebSphere Information Analyzer) depends on all components being optimized. For RAID (redundant array of independent disks) and SAN (storage area network) configurations, maximum performance is a combination of maximum bandwidth (controllers and disk) and minimized contention. The following guidelines can assist in the configuration of RAID or SAN technologies:
· Ensure that your database is optimally tuned.
· Ensure that the computers where you install the metadata repository tier and the services tier are on a high-speed network.
· Minimize contention between temporary (scratch, buffer, and sort) and data file systems. For temporary storage, do not use a network file system (NFS); consider using local disk instead.
· Minimize contention between the disks and controllers that are associated with IBM Information Server file systems and other applications or servers. Consider isolating multiple mount points to separate high-speed disk interconnects and controllers.
· Consider the trade-off between the granularity of file systems and the underlying configuration in comparison to the available, unused storage.
· Do not create overly complex device configurations. These can be difficult to administer and might not offer a noticeable performance improvement.
· If possible, review your configuration with hardware, storage, operating system, and application (IBM Information Server) specialists.
· Remember that previous experience in designing I/O systems for nonparallel engines might lead to non-optimal configurations for IBM Information Server.
The optimal disk configuration balances cost, complexity, ease of administration, and ultimate performance.
September 13, 2013 Template of RAID and SAN configuration Installation of a single tier on a computer requires 4 GB of memory.
Disk space on the server:
· XXX MB to install DataStage Server Edition
· XXX MB to install DataStage Enterprise Edition
· XXX MB for RTI plug-ins and the JVM
· XXX MB of temporary storage during installation to unpack components
· 3 MB per project to install the DataStage server
· XX MB of free space in /var
· Sufficient storage for any data held by the jobs, such as lookup data or temporary data created and used in the various stages of the ETL life cycle, and for DataStage tables or files
· Additional space for temporary data storage while a DataStage job is running
Disk space on the client:
· XX MB in the Windows 2000 or Windows XP system32 directory, as appropriate
· XXX MB in the DataStage install directory for the DataStage clients and documentation (typical installation)
· Additional space for temporary data storage while a DataStage job is running
September 13, 2013 Datastage Server Job Code Walkthrough Checklist The checkpoints are grouped below by category.

General:
· Check that naming conventions are followed for job names
· Check that annotations are used where appropriate to describe the job and its stages
· Check that the flow of stages is from left to right
· Check that the extract, lookup, transformation, and load processes are separated for restartability
· Check that stage names are meaningful and follow the standards defined for the project
· Check that link names are meaningful and follow the standards defined for the project

Sequential File Stage:
· Check that the input and output filenames are parameterized
· Check that the stage refers to the correct input and output filenames
· Check that the properties set in the Format tab match the sequential file format, for example the delimiter and quote character
· Check that the column metadata matches the layout of the file
· Use "View Data" to see whether the file can be accessed successfully (ignore any error if the file is a new output file)

Database Stages (ODBC, Oracle, DB2/UDB):
· Check that the database source name, user ID, and password are parameterized
· Check that the table name is specified and the generated SQL matches the desired operation
· Check that user-defined SQL is used only in unavoidable circumstances and that the select list names columns instead of using "SELECT *"
· Check that the necessary filters are applied on the table using a WHERE clause in the stage output
· Check that the columns specified for the ORDER BY clause are correct in the stage output
· Check that the required update action is specified for the table in the stage input
· Check that the column metadata matches the table DDL
· Use "View Data" to see whether the table can be accessed successfully
· Check that the job uses native connectivity to databases through plug-in database stages instead of ODBC

Transformer Stage:
· Check that all the required derivations are performed for columns in output links
· Check that validations are performed for numeric and date fields even if not mentioned in the design (a sketch of this pattern follows the checklist)
· Check that a null check is performed for all the columns used in expressions
· Check that an appropriate error message is generated for invalid values in numeric or date field validations
· Check that the data types of columns and constants used in expressions match, to avoid needless type conversions
· If multiple column derivations use the same part of an expression, check that the shared expression is coded once and referenced through a stage variable
· Check that stage variable names are meaningful
· If an expression includes calculated constant values, check that it is declared as a stage variable with an initial value specified in the variables list
· Check that the constraint conditions are valid and correct
· Check that a reject link is used to capture records rejected during processing
· If the subsequent stage is a database stage, check that the output links do not output NULL values for NOT NULL columns in the table
· If a reference input link is used, check that the key expression on the reference link is correct
· If there is more than one output link, check that the output links are in the desired execution order
· Check whether any before-stage or after-stage subroutines are required and used with correct input values
· Check whether it is possible to optimize by combining derivations from multiple transformers into a single transformer

Hashed File Stage:
· Check that the input and output filenames are parameterized
· Check that the stage refers to the correct input and output filenames
· If the Hashed File stage is used to write to and read from the same file, check that the same filename is specified in the input and output tabs
· Check that the column metadata matches the layout of the file
· Use "View Data" to see whether the file can be accessed successfully (ignore any error if the file is a new output file)

Sort Stage:
· Check that the input and output column metadata are the same
· Check that the sort specifications are valid and correct per the requirements
· Check whether any before-stage or after-stage subroutines are required and used with correct input values

Aggregator Stage:
· Check that the column derivations in the Output tab correspond to the required aggregations
· Check whether any before-stage or after-stage subroutines are required and used with correct input values

Job Properties:
· Check that the short and full job descriptions are entered per standards
· Check whether any before-job or after-job subroutines are required and used with correct input values
· Check that parameters are used in the jobs instead of hard-coded values
· Check that parameter naming conventions are followed
· Check that correct data types are assigned to parameters
· Check that default values for parameters are set correctly, especially for batch jobs
· Check that parameters involving passwords are encrypted
· Check that appropriate help text is provided for parameters (useful if the job is called from a job sequence)
· In an NLS environment, some date transforms do not work for most formats; check that the time/date locale for the job is set to OFF.
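To illustrate the validation-and-reject pattern behind the Transformer checkpoints (null checks, numeric/date validation, an error message, and a reject link), here is a minimal Python sketch. It is a language-neutral illustration with made-up field names, not DataStage Transformer code.

```python
# Hypothetical illustration of the Transformer pattern: validate each record's
# numeric and date fields, route good rows to the output and bad rows to a reject
# list with an explanatory error message. Field names are assumptions.
from datetime import datetime

def validate(record):
    """Return (is_valid, error_message) for one input record."""
    if record.get("amount") is None:                      # null check before use
        return False, "amount is NULL"
    try:
        float(record["amount"])                           # numeric validation
    except (TypeError, ValueError):
        return False, f"invalid numeric value: {record['amount']!r}"
    try:
        datetime.strptime(record.get("order_date", ""), "%Y-%m-%d")  # date validation
    except ValueError:
        return False, f"invalid date value: {record.get('order_date')!r}"
    return True, ""

output_link, reject_link = [], []
for rec in [{"amount": "12.50", "order_date": "2013-09-13"},
            {"amount": None, "order_date": "2013-09-13"}]:
    ok, err = validate(rec)
    (output_link if ok else reject_link).append(rec if ok else {**rec, "error": err})
```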
September 13, 2013 Containers Code Walkthrough Checklist
Local Container:
· Check that local container names are meaningful and follow the standards defined for the project
· Check that the link names between the input or output stage and the stage in the container match the link names in the server job
Shared Container:
· Check that shared container names are meaningful and follow the standards defined for the project
· Check that the link mapping (to the container link) is validated in the shared container properties
September 13, 2013 Job Sequence – Code Walkthrough Checklist
Job Sequence:
· Check that naming conventions are followed for job sequence names
· Check that activity names are meaningful and follow the standards defined for the project
· Check that the flow of activities is from left to right
· Check that trigger names are meaningful and follow the standards defined for the project
· If an Exception Handler activity is used, check that the "Automatically handle activities that fail" option is set in the job sequence properties
· If a Wait-For-File activity is used to trigger the job, check that the trigger file is deleted after the trigger or on completion of the job (a sketch of this pattern follows the checklist)
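As an illustration of the Wait-For-File checkpoint (a hypothetical sketch, not the DataStage activity itself): wait for a trigger file, run the dependent work, then remove the trigger file so the sequence cannot re-fire on stale input. The path, timeout, and job call below are all assumptions.

```python
# Hypothetical sketch of the wait-for-file / delete-after-trigger pattern.
import os, time

TRIGGER_FILE = "/data/landing/orders.trg"   # assumed trigger file path
TIMEOUT_SECONDS = 600

def run_load_job():
    """Placeholder for invoking the triggered job (for example, via a scheduler or CLI)."""
    print("running load job...")

def wait_for_file(path, timeout):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(5)                        # poll every few seconds
    return False

if wait_for_file(TRIGGER_FILE, TIMEOUT_SECONDS):
    run_load_job()
    os.remove(TRIGGER_FILE)                  # delete the trigger so it cannot re-fire
else:
    raise TimeoutError(f"Trigger file {TRIGGER_FILE} did not arrive in time")
```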
September 13, 2013 Routines: Code Walkthrough Checklist
Routines:
· Check that short and long descriptions are entered and that they describe the functioning of the routine (including input arguments, return values, conditions for usage, and risks, if any)
· For the BASIC code within the routines, check that the variables used follow the naming standards
· Check that the BASIC code is properly indented