Transformations related to Data Quality

The transformations below are grouped by component category; each entry gives the transformation name followed by a short description.
Data Source Components

CSV Source: Connects to files whose data is organized in a delimited format.
Database Source: Connects directly to a database to provide source data for a plan.
Fixed Width Source: Specifies a fixed-width file as the data source for a plan (the delimited and fixed-width input formats are sketched after this list).
Real Time Source: Accepts input in real-time mode.
SAP Source: Connects to an SAP system to provide source data for a plan.
CSV Match Source: Compares the records in a single source file against one another to identify duplicates in the file.
CSV Dual Match Source: Matches data from two discrete files.
DB Match Source: Connects to the Data Quality repository to select tables and columns for use in a matching plan.
Group Source: Defines the input data for a plan by reading the set of group files created by a Group Sink in another plan.
Dual Group Source: Performs matching operations on grouped data from two discrete, original data sources.
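To make the difference between the delimited and fixed-width source formats concrete, here is a minimal Python sketch that reads both kinds of input. The sample records and column boundaries are invented for illustration; this only shows the file formats these components consume, not the Data Quality product's own API.

```python
import csv, io

# Hypothetical delimited (CSV) source data: one record per line, comma-separated
delimited = io.StringIO("101,John Smith,Dublin\n102,Mary Jones,Cork\n")
for customer_id, name, city in csv.reader(delimited):
    print(customer_id, name, city)

# Hypothetical fixed-width source data: each field occupies a fixed character range
fixed = "101   John Smith          Dublin\n102   Mary Jones          Cork\n"
widths = [(0, 6), (6, 26), (26, 40)]        # id, name, city column boundaries
for line in fixed.splitlines():
    customer_id, name, city = (line[a:b].strip() for a, b in widths)
    print(customer_id, name, city)
```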
Frequency Components

Count: Tabulates the quantities of discrete data values in a selected column.
MinAvgMax: Provides the minimum, maximum, and average data values for selected columns.
Range Counter: Tabulates the frequency and distribution of numerical data values in user-selected fields.
Missing Values: Searches for specific values in an input field and determines the frequency of those values within that field (all four measures are sketched in code after this list).
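As a rough, product-independent illustration of what these frequency components compute, the sketch below derives value counts, min/avg/max statistics, range buckets, and a missing-value count for a single column. The sample column and the range boundaries are made up for the example.

```python
from collections import Counter

ages = [34, 41, 34, None, 29, 57, None, 41, 34]      # hypothetical input column
populated = [a for a in ages if a is not None]

# Count: frequency of each discrete value
value_counts = Counter(populated)

# MinAvgMax: minimum, average, and maximum of the populated values
stats = (min(populated), sum(populated) / len(populated), max(populated))

# Range Counter: distribution of values across user-defined numeric ranges
buckets = Counter("20-39" if a < 40 else "40-59" for a in populated)

# Missing Values: how often a specific value (here an empty/None entry) occurs
missing = sum(1 for a in ages if a is None)

print(value_counts, stats, buckets, missing)
```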
Analysis Components

Character Labeller: Provides a character-by-character profile of the data values in a data field.
Token Labeller: Analyzes the format of the data values within a field and categorizes each value according to a list of standard or user-defined tokens (both labellers are imitated in the sketch after this list).
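The two labellers are easiest to understand through the profiling patterns they produce. The sketch below is a simplified imitation: it maps each character to a class code (character labelling) and each whitespace-separated token to a coarse category (token labelling). The label codes used here are invented; the actual components work from configurable standard or user-defined label sets.

```python
def character_label(value: str) -> str:
    """Character-by-character profile, e.g. 'AB-123' -> 'cc-nnn'."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("c")
        elif ch.isdigit():
            out.append("n")
        else:
            out.append(ch)            # punctuation and spaces kept as-is
    return "".join(out)

def token_label(value: str) -> str:
    """Token-by-token profile, e.g. '12 Main St' -> 'NUMBER WORD WORD'."""
    labels = []
    for tok in value.split():
        if tok.isdigit():
            labels.append("NUMBER")
        elif tok.isalpha():
            labels.append("WORD")
        else:
            labels.append("MIXED")
    return " ".join(labels)

print(character_label("AB-123"))      # cc-nnn
print(token_label("12 Main St"))      # NUMBER WORD WORD
```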
Transformation Components

Search Replace: Removes or replaces user-defined values in a group of data.
Word Manager: Applies one or more reference sources (data dictionaries) to an input dataset.
Merge: Combines the data values from multiple input fields to form a single output field.
To Upper: Alters the case of a dataset.
Rule Based Analyzer: Defines and applies one or more business rules to selected input data.
Scripting: Provides greater flexibility than the Rule Based Analyzer for building customized rules and processes into a data quality plan using TCL (Tool Command Language). A plain-Python imitation of the simpler cleansing operations follows this list.
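The cleansing components above largely amount to dictionary lookups, substitutions, and simple string operations. The sketch below imitates a Word Manager-style reference dictionary, a Search Replace substitution, a Merge of two fields, and a To Upper case change; the dictionary contents and field values are invented for illustration and do not reflect the product's configuration.

```python
# Hypothetical reference dictionary (Word Manager-style standardization)
street_dict = {"st": "Street", "rd": "Road", "ave": "Avenue"}

def standardize(value: str) -> str:
    return " ".join(street_dict.get(w.lower(), w) for w in value.split())

# Search Replace-style removal of user-defined noise values
noise = {"n/a": "", "unknown": ""}

def search_replace(value: str) -> str:
    return noise.get(value.lower(), value)

# Merge two input fields into one output field, then apply To Upper
address_line = standardize("12 main st")
city = search_replace("Dublin")
merged = f"{address_line}, {city}".upper()
print(merged)                          # 12 MAIN STREET, DUBLIN
```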
Parsing Components

Splitter: Parses the data values in a text field into discrete new fields by comparing the source data to one or more reference datasets.
Token Parser: Parses free-text fields that each contain multiple tokens, writing each token to a discrete field.
Profile Standardizer: Parses an input field from a Token Labeller into a number of output fields based on a data structure that you define.
Context Parser: Parses free-text fields containing multiple tokens into multiple single-token fields based on the value and the relative position of the tokens (a toy parsing example follows this list).
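Parsing components move data out of free-text fields into discrete, named output fields. The toy example below shows the general idea for a hypothetical address-line field, assigning tokens to output fields by value and position; the real components drive this from reference data, token labels, and user-defined structures rather than hard-coded rules.

```python
def parse_address(line: str) -> dict:
    """Toy context/token parsing: split one free-text field into output fields."""
    tokens = line.split()
    house_number = tokens[0] if tokens and tokens[0].isdigit() else ""
    street_type = (tokens[-1] if tokens and
                   tokens[-1].lower() in {"street", "road", "avenue"} else "")
    middle = tokens[1 if house_number else 0:
                    len(tokens) - 1 if street_type else len(tokens)]
    return {"house_number": house_number,
            "street_name": " ".join(middle),
            "street_type": street_type}

print(parse_address("12 Main Street"))
# {'house_number': '12', 'street_name': 'Main', 'street_type': 'Street'}
```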
Key Field Generator

Soundex: Recognizes phonetic matches between alphabetic strings; it analyzes the phonetic components of a word and assigns a value to the string based on the phonetic characteristics of its initial characters (a sketch of the classic Soundex encoding follows this list).
NYSIIS: Converts the values of an input field into their phonetic equivalents and reconstitutes the spelling of each string based on its phonetic characteristics.
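Soundex is a well-documented public algorithm, so it is easy to show what a phonetic key looks like. The sketch below follows the standard letter-to-digit mapping (keep the first letter, encode remaining consonants as digits, drop vowels, collapse adjacent duplicates, pad to four characters), with the special h/w adjacency rule omitted for brevity; the product's own encoder and its NYSIIS counterpart may differ in detail.

```python
def soundex(word: str) -> str:
    """Simplified American Soundex: 'Robert' and 'Rupert' both encode to 'R163'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    first = word[0].upper()
    digits = [codes.get(ch, "") for ch in word]   # vowels, h, w, y map to ""
    key, prev = "", digits[0]
    for d in digits[1:]:
        if d and d != prev:                       # skip adjacent duplicate codes
            key += d
        prev = d
    return (first + key + "000")[:4]              # pad/truncate to four characters

print(soundex("Robert"), soundex("Rupert"))       # R163 R163
```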
Matching Components

Edit Distance: Derives a match score for two data values by calculating the minimum "cost" of transforming one string into another through the insertion, deletion, and replacement of characters.
Jaro Distance: Calculates the general similarity between two data values; however, the Jaro Distance algorithm reduces the match score for the pair of values if they do not share a common prefix.
Hamming Distance: Derives a match score for a pair of data strings by calculating the number of positions in which their characters differ.
Bigram: Matches data values on the basis of the occurrence of consecutive character pairs in both data strings of a matching pair.
Mixed Field Matcher: Compares data values a pair at a time to identify matches in a dataset where values of the same type or related types have been entered across several fields.
Weight Based Analyzer: Accepts as input the results from any or all matching components in a plan and calculates a single, overall match score for the plan's matching operations (several of these measures are sketched in code after this list).
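The matching components are variations on well-known string-similarity measures, so a product-independent sketch is straightforward. Below, an edit-distance score, a Hamming score, and a bigram (Dice coefficient) score are combined into a single weighted score, roughly in the spirit of the Weight Based Analyzer; the weights and the 0-to-1 normalization are illustrative choices, not the product's exact formulas.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def hamming_score(a: str, b: str) -> float:
    """Share of positions with identical characters (shorter string padded)."""
    n = max(len(a), len(b))
    return sum(x == y for x, y in zip(a.ljust(n), b.ljust(n))) / n if n else 1.0

def bigram_score(a: str, b: str) -> float:
    """Dice coefficient over consecutive character pairs."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 1.0

def weighted_score(a: str, b: str) -> float:
    """Illustrative weighted combination of the individual match scores."""
    edit = 1 - edit_distance(a, b) / max(len(a), len(b), 1)
    return 0.5 * edit + 0.2 * hamming_score(a, b) + 0.3 * bigram_score(a, b)

print(round(weighted_score("Jon Smith", "John Smith"), 2))
```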
Validation Components

Address Validator: Validates a postal address by comparing the data to a database of postally correct addresses prepared in database form by a third-party vendor.
International AV: Validates international postal address data against address databases prepared by a third-party vendor.
North America AV: Validates US and Canadian postal addresses.
Data Sink Components

CSV Sink: Defines a delimited (for example, comma-separated) file as the output format.
Fixed Width Sink: Generates plan output in a fixed-width file format (both output formats are sketched after this list).
Report Sink: Generates a report file, in any one of several formats, that displays the plan output data.
CSV Merge Sink: Merges columns from two sources into a single sink file.
CSV Match Sink: Creates a delimited output file containing data generated by a matching plan.
Match Key Sink: Appends match plan output data directly to the source database.
Group Sink: Creates groups, a series of files in a proprietary Data Quality format that organize the plan data according to user-specified key data fields.
Database Sink: Writes the plan output to a database.
Database Report Sink: Generates report data for a plan and inserts this data into the Data Quality repository.
SAP Sink: Writes plan output to an SAP database.
Realtime Sink: Enables the development of plans that process output data in real time, for example, to deliver data to another application.
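Mirroring the source components, most sinks differ mainly in the output format they write. The sketch below writes the same records as a delimited file and as a fixed-width file; the file names, records, and column widths are hypothetical and illustrate only the output formats, not the product's sink configuration.

```python
import csv

records = [("101", "John Smith", "Dublin"), ("102", "Mary Jones", "Cork")]

# CSV Sink-style output: delimited text, one record per line
with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerows(records)

# Fixed Width Sink-style output: each field padded to a fixed column width
widths = (6, 20, 15)
with open("output.dat", "w") as f:
    for rec in records:
        f.write("".join(value.ljust(w) for value, w in zip(rec, widths)) + "\n")
```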
