September 13, 2013 \ Ananth TM

Transformations related to Data Quality

| Component Category | Transformation Name | Description |
| --- | --- | --- |
| Data Source Components | CSV Source | Connects to files whose data is organized in a delimited format |
| | Database Source | Connects directly to a database to provide source data for a plan |
| | Fixed Width Source | Specifies a fixed-width file as the data source for a plan |
| | Real Time Source | Accepts input in real-time mode |
| | SAP Source | Connects to an SAP system to provide source data for a plan |
| | CSV Match Source | Compares the records in a single source file against one another to identify duplicates in the file |
| | CSV Dual Match Source | Matches data from two discrete files |
| | DB Match Source | Connects to the Data Quality repository to choose tables and columns for use in a matching plan |
| | Group Source | Defines the input data for a plan by reading the set of group files created by a Group Sink in another plan |
| | Dual Group Source | Performs matching operations on grouped data from two discrete, original data sources |
| Frequency Components | Count | Tabulates the quantities of discrete data values in a selected column |
| | MinAvgMax | Provides the minimum, maximum, and average data values for selected columns |
| | Range Counter | Tabulates the frequency and distribution of numerical data values in user-selected fields |
| | Missing Values | Searches for specific values in an input field and determines the frequency of the values within that field |
| Analysis Components | Character Labeller | Provides a character-by-character profile of the data values in a data field |
| | Token Labeller | Analyzes the format of the data values within a field and categorizes each value according to a list of standard or user-defined tokens |
| Transformation Components | Search Replace | Removes or replaces user-defined values in a group |
| | Word Manager | Applies one or more reference sources (data dictionaries) to an input dataset |
| | Merge | Combines the data values from multiple input fields to form a single output field |
| | To Upper | Alters the case of a dataset |
| | Rule Based Analyzer | Defines and applies one or more business rules to selected input data |
| | Scripting | Provides greater flexibility than the Rule Based Analyzer to build customized rules and processes into a data quality plan using TCL (Tool Command Language) |
| Parsing Components | Splitter | Parses the data values in a text field into discrete new fields by comparing the source data to one or more reference datasets |
| | Token Parser | Parses free-text fields that each contain multiple tokens and writes each token to a discrete field |
| | Profile Standardizer | Parses an input field from a Token Labeller to a number of output fields based on a data structure that you define |
| | Context Parser | Parses free-text fields containing multiple tokens into multiple single-token fields based on the value and the relative position of the tokens |
| Key Field Generator | Soundex | Recognizes phonetic matches between alphabetic strings: it analyzes the phonetic components of a word and assigns a value to the string based on the phonetic characteristics of the initial characters in the string |
| | Nysiis | Converts the values of an input field into their phonetic equivalent and reconstitutes the spelling of the string based on its phonetic characteristics |
| Matching Components | Edit Distance | Derives a match score for two data values by calculating the minimum “cost” of transforming one string into another by the insertion, deletion, and replacement of characters |
| | Jaro Distance | Calculates the general similarity between two data values; however, the Jaro Distance algorithm reduces the match score for the pair of values if they do not share a common prefix |
| | Hamming Distance | Derives a match score for a pair of data strings by calculating the number of positions in which characters differ between them |
| | Bigram | Matches the data values on the basis of the occurrence of consecutive characters in both data strings in a matching pair |
| | Mixed Field Matcher | Compares pairs of data values to identify matches in a dataset where data values of the same type or related types have been entered across several fields |
| | Weight Based Analyzer | Accepts as input the results from any or all matching components in a plan and calculates a single, overall match score for the plan’s matching operations |
| Validation Components | Address Validator | Validates a postal address by comparing data to a database of postally correct addresses prepared in database form by a third-party vendor |
| | International AV | Validates international postal address data against addresses prepared in database form by a third-party vendor |
| | North America AV | Validates US and Canada postal addresses |
| Data Sink Components | CSV Sink | Defines a delimited (for example, comma-separated) file as the output format |
| | Fixed Width Sink | Generates plan output in a fixed-width file format |
| | Report Sink | Generates a report file, in any one of several formats, that displays the plan output data |
| | CSV Merge Sink | Merges columns from two sources to a single sink file |
| | CSV Match Sink | Creates a delimited output file containing data generated by a matching plan |
| | Match Key Sink | Appends match plan output data directly to the source database |
| | Group Sink | Creates groups: a series of files in a Data Quality-proprietary format that organizes the plan data according to user-specified key data fields |
| | Database Sink | Writes the plan output to a database |
| | Database Report Sink | Generates report data for a plan and inserts this data into the Data Quality repository |
| | SAP Sink | Writes plan output to an SAP database |
| | Realtime Sink | Enables the development of plans that process output data in real time, for example, to deliver data to another application |
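
To make the matching components more concrete, here is a minimal Python sketch of the ideas behind Edit Distance, Hamming Distance, Bigram, and the Weight Based Analyzer. It is not the product's implementation. The function names, the handling of unequal-length strings in the Hamming count, the use of the Dice coefficient for bigrams, and the weighted average are all assumptions chosen for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements
    needed to turn a into b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb
        prev = curr
    return prev[-1]


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which the characters differ.
    Assumption: extra trailing characters each count as a mismatch."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def bigram_similarity(a: str, b: str) -> float:
    """Share of consecutive character pairs common to both strings
    (Dice coefficient, an assumed scoring choice)."""
    pairs_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    pairs_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    shared = sum((pairs_a & pairs_b).values())
    return 2 * shared / total


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine several match scores into one overall score
    using a weighted average (hypothetical weighting scheme)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))      # 3
    print(hamming_distance("karolin", "kathrin"))  # 3
    print(bigram_similarity("night", "nacht"))     # 0.25
```

The distance functions return raw counts, so a plan-style match score between 0 and 1 would still need a normalization step, for example dividing by the length of the longer string and subtracting from 1, before being passed to something like `weighted_score`.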