Information Analyzer – REGEX

source-data  [NOT] MATCHES_REGEX pattern-string

You can build your own regex as needed. Some examples are listed below, followed by a quick sketch for testing them:

Postal Codes Example
  • You are searching for all US postal codes in a column (5 digits with an optional 4-digit extension):
    '\b[0-9]{5}(?:-[0-9]{4})?\b'
Phone numbers Example
  • You are searching for phone numbers in the standard North American format. You want to account for variations in the format such as 123-456-7890, (123)-456-7890, 123 456 7890, 123.456.7890, and so on:
    '\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}'
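
A minimal Python sketch for trying these patterns against sample values (the sample data here is made up purely for illustration; the patterns are the ones above):

  import re

  # Patterns from the examples above
  zip_pattern = re.compile(r'\b[0-9]{5}(?:-[0-9]{4})?\b')
  phone_pattern = re.compile(r'\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}')

  # Hypothetical sample values, just to exercise the patterns
  samples = ['10027', '10027-1234', '123-456-7890', '(123)-456-7890',
             '123 456 7890', '123.456.7890', 'not a match']

  for value in samples:
      print(f'{value!r}: zip={bool(zip_pattern.search(value))}, '
            f'phone={bool(phone_pattern.search(value))}')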

If you are on Information Analyzer 8.5 or 8.7, you can import a set of pre-built rules by following this developerWorks article: http://www.ibm.com/developerworks/data/library/techarticle/dm-1112isanalyzerrules/

The above rules are automatically installed with version 9.1 and beyond.

There are many examples of the use of regex functions, including the data rule definition ‘SsnMatchesRegex’.

 

More information

https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.ia.quality.doc/topics/dq_matches_regex_check.html

Information Analyzer

Information Analyzer (IA) gives you frequency distributions of values, formats, and so on. IA provides data sampling (random, sequential, and every Nth record). At a fraction of the cost of profiling every row, a lot of useful information can be generated:
– Is the column unique?
– Are there nulls?
– What are the high frequency values?
– What are the high frequency patterns?
– Are there many outliers?

Note: Sampling from the IA perspective may still incur a full table scan on the source system. All rows may still be read and sent to IA, where IA then analyzes only every 100th record, or randomly ignores all but 10% of the records, depending on which sampling options you selected.
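
A minimal Python sketch of the two sampling behaviours described above (every Nth record versus a random percentage); this only illustrates the idea and is not IA's actual implementation:

  import random

  def nth_sample(rows, n=100):
      # Keep only every Nth record (1 in n)
      return [row for i, row in enumerate(rows) if i % n == 0]

  def random_sample(rows, pct=10, seed=42):
      # Randomly keep roughly pct% of the records
      rng = random.Random(seed)
      return [row for row in rows if rng.random() < pct / 100.0]

  rows = list(range(1_000_000))        # stand-in for the rows streamed to IA
  print(len(nth_sample(rows)))         # 10,000 rows
  print(len(random_sample(rows)))      # roughly 100,000 rows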

“Frequency by Frequency” report under the “column frequency” option

Column Level Summary
Cardinality Count : 1,034,710
Null Count : 0
Actual Row Count : 17,937,920
Frequency Cut Off : 0
Total Rows Covered : 12,633,758
%Rows Covered : 70.4305
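
Judging by the figures above, %Rows Covered appears to be Total Rows Covered divided by Actual Row Count (an assumption, but the numbers line up):

  total_rows_covered = 12_633_758
  actual_row_count = 17_937_920
  print(round(100 * total_rows_covered / actual_row_count, 4))   # 70.4305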

Check the Display options: the default may be the top 1,000 distinct values with counts.

 

If you are using the Where clause in the Analysis Settings (e.g., to filter out suspended accounts), it will only apply to Column Analysis, not to Data Rules.

For Data Rules, create a Virtual Table (which appears under the Column Analysis menu). The Virtual Table allows you to build the relevant where clause, and it can be used in both Column Analysis and in any of the data rules.

scheduling the Column Analysis (CA) Job from Information Analyzer (IA)

  • Go into the ‘Web Console for Information Server’ ( https://server:port/ibm/iis/console ).
  • Create a scheduling view (Schedule Monitoring > Views of Schedules > New Scheduling View) and select at least an ‘Origin’ of Information Analyzer.
  • You should then be able to see the scheduled Information Analyzer processes.

COBOL files in CFF stage

It could be a complex COBOL copybook structure with a good mix of OCCURS, REDEFINES, level-88 items, COMP, and COMP-3 fields.

Note: COMP-3 is packed decimal; COMP is binary.

Or it could be a simple one such as below:

********************************
* COBOL DECLARATION FOR ORDERS *
********************************

01  HEADER.
    05  RECTYPE    PIC X(1).
    05  NAME       PIC X(20).
    05  ORDDATE    PIC X(10).

01  DETAIL.
    05  ORDERNUM   PIC S9(5).
    05  RECTYPE    PIC X(1).
    05  PRODUCTID  PIC S9(5).
    05  QTY        PIC S9(5).

But the main issue is that the source file would be in native EBCDIC binary format and not ASCII. The source file is generated on the mainframe and uses EBCDIC encoding. Each row has a different set of data columns and varies depending on the available data: mostly a header, one or more types of detail records, and a trailer record.

what stage?

Either way, use a CFF stage: import the copybook layout and view the source content through the CFF (server) stage. Using a Sequential File stage could throw errors such as import errors, buffer overruns, etc. The Sequential File stage does do the EBCDIC to ASCII conversion, but not well if the file has odd fields like packed decimal and such.

endianness

The endianness (byte order) setting needs to be checked for binary (COMP) fields. The same bytes can decode to different values depending on the byte order:

849 = x0351 Big Endian
20739 = x5103 Little Endian

more details on Endianness https://en.wikipedia.org/wiki/Endianness
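
A minimal Python sketch of the example above, decoding the same two bytes (x'0351') with each byte order:

  raw = bytes([0x03, 0x51])

  print(int.from_bytes(raw, byteorder='big'))     # 849
  print(int.from_bytes(raw, byteorder='little'))  # 20739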

Conversion to ASCII string

COMP-3 is a decimal storage format. When a decimal is converted to a string, it will contain a decimal point. If the scale of the decimal is 0, that decimal point will still be present in the string as the last byte.

If the scale is 0, you can likely just convert the decimal to an integer column before writing to your output.

Check with the mainframe team whether they are using CCSID 037, CCSID 500, or another code page. Sometimes simple COBOL files can be read as a binary data type and converted by using the ASCII function in the Transformer stage.

Make sure “Unicode” is used in the column properties.

Make sure IBM037 or IBM500 is used as the NLS map (the default is probably UTF-8).
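
Outside DataStage, the same decoding can be sketched in Python. This is only an illustration: it assumes CCSID 037 (Python codec 'cp037') and a hypothetical packed-decimal (COMP-3) field, which is not how the simple copybook above declares its numeric fields:

  import codecs

  def unpack_comp3(data, scale=0):
      # Decode an IBM packed-decimal (COMP-3) field: two digits per byte,
      # with the final nibble holding the sign (0xD means negative).
      digits = ''
      for byte in data[:-1]:
          digits += f'{byte >> 4}{byte & 0x0F}'
      digits += str(data[-1] >> 4)
      sign = -1 if (data[-1] & 0x0F) == 0x0D else 1
      value = sign * int(digits)
      return value / (10 ** scale) if scale else value

  # Hypothetical 3-byte COMP-3 field holding +12345
  print(unpack_comp3(bytes([0x12, 0x34, 0x5C])))            # 12345

  # EBCDIC character field decoded with code page 037
  print(codecs.decode(bytes([0xC1, 0xC2, 0xC3]), 'cp037'))  # ABC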

More COBOL notes

CCSID 500 — This is an EBCDIC code page, used mainly on z/OS. It is known as the ‘International’ codepage. The related CCSID with Euro support is 1148.

CCSID 037 — This is another popular EBCDIC code page, used on OS/400 and VSE/ESA. It is the ‘US English’ codepage. The related CCSID with Euro support is 1140.

Dynamic Relational Stage (DRS) versus ODBC or Native

DRS is a stage that tries to make switching from one database to another seamless. It uses the native connectivity for the chosen target database.

The DRS stage should be faster than the ODBC stage because it uses native database connectivity. You will need to install and configure the required database clients on your DataStage server for it to work.

The Dynamic Relational Stage was leveraged for PeopleSoft so that a single job could run on any of the supported databases. It supports ODBC connections too; read more about that in the plug-in documentation. The ODBC stage uses the ODBC (Open Database Connectivity) driver for a particular database.

Native versus DRS versus ODBC

Performance-wise there is not much of a difference, since DRS uses native drivers. The DRS stage has both ODBC and native API connectivity (like connecting to Oracle or another DBMS), whereas the ODBC stage only uses Open Database Connectivity.

Dynamic Relational Stage

The Dynamic RDBMS stage has existed since the early 7.x releases.

DRS is replaced by the DRS Connector stage after version 8.5. There is an optional automatic conversion when you upgrade in place; otherwise there is a conversion tool you can use, or, of course, you can make the change manually.

Additional reading

https://www.ibm.com/support/knowledgecenter/SSZJPZ_8.7.0/com.ibm.swg.im.iis.conn.drs.doc/topics/DRS013.html

Data Lake implementation

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

- James Dixon, the founder and CTO of Pentaho

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”

- Unknown source

The term data lake is regularly associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and business analytics and data mining tools are then applied to the data where it lives on Hadoop’s cluster nodes of commodity computers.

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.

Data Lake and Data warehouse

The commonality between the two is that they are both data storage repositories.

             Data warehouse                     Data Lakes
DATA         Structured, processed              Structured/semi-structured/unstructured/raw
PROCESSING   Schema on write                    Schema on read
STORAGE      Expensive for large volumes        Designed for low cost storage
AGILITY      Less agile, fixed configuration    Highly agile, configure and reconfigure as needed
SECURITY     Mature                             Maturing
USERS        Business professionals             Data scientists

Gaps in Data Lake Concept


Data lakes carry significant risks. The most critical is the inability to determine data quality, or to trace the lineage of findings by other analysts or users who have previously found value in the same data in the lake. By definition, a data lake accepts any data, without oversight or administration. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into an information swamp. What’s more, without metadata, every subsequent use of the data means analysts start from scratch.

Another risk is security and access control. Data can be placed into the data lake with no oversight of its contents, and many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of core data lake technologies are still embryonic, and these issues will not be addressed if they are left to non-IT personnel.

Finally, performance aspects should not be neglected. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized, purpose-built infrastructure.

Surrogate Key Stage

A Surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from the application data. In a star schema database, surrogate keys are used to join a fact table to a dimension table.

The surrogate key is internally generated by the system but is nevertheless visible to the user of the application.

Surrogate Key Generator stage

The Surrogate Key Generator stage is a processing stage that generates surrogate key columns and maintains the key source.

The Surrogate Key Generator stage can have a single input link, a single output link, both an input link and an output link, or no links. Job design depends on the purpose of the stage.

You can use a Surrogate Key Generator stage to perform the following tasks:

  • Create or delete the key source before other jobs run
  • Update a state file with a range of key values
  • Generate surrogate key columns and pass them to the next stage in the job
  • View the contents of the state file

Generated keys are unsigned 64-bit integers. The key source can be a state file or a database sequence. If you are using a database sequence, the sequence must be created by the Surrogate Key stage. You cannot use a sequence previously created outside of DataStage.

You can use the Surrogate Key Generator stage to update a state file, but not a database sequence. Sequences must be modified with database tools.
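
The state file mechanics can be pictured with a small Python sketch. This is only an illustration of a persisted key source handing out blocks of keys; it is not how the Surrogate Key Generator stage is actually implemented, and the file name is made up:

  import os

  STATE_FILE = 'surrogate_key.state'   # hypothetical state file

  def next_keys(count):
      # Read the current high-water mark, hand out `count` new keys,
      # and persist the new high-water mark back to the state file.
      last = 0
      if os.path.exists(STATE_FILE):
          with open(STATE_FILE) as f:
              last = int(f.read().strip() or 0)
      keys = list(range(last + 1, last + count + 1))
      with open(STATE_FILE, 'w') as f:
          f.write(str(keys[-1]))
      return keys

  print(next_keys(5))   # [1, 2, 3, 4, 5] on the first run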

What is a Dimension?

A measure is normally an additive numerical value that represents a business metric. You are not limited to a single measure; you can have multiple measures within a fact table. For example, if your fact table is used to track purchases internationally, you might have measures for each type of currency. If you are building a fact table for the retail industry, you might also have measures such as cost, list price, and average sale price.

Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects. For example, in the retail scenario, we discussed that purchases, returns and calls are facts. On the other hand, customers, employees, items and stores are dimensions and should be contained in dimension tables.

 

Dimension tables contain details about each instance of an object. For example, the items dimension table would contain a record for each item sold in the store. It might include information such as the cost of the item, the supplier, color, sizes, and similar data.

Dimension tables are typically small, ranging from a few to several thousand rows. Occasionally dimensions can grow fairly large, however. For example, a large credit card company could have a customer dimension with millions of rows. Dimension table structure is typically very lean; for example, a customer dimension could look like the following:

Customer_key
Customer_full_name
Customer_city
Customer_state
Customer_country

Fact tables and dimension tables are related to each other. Again returning to our retail model, the fact table for a customer transaction would likely contain a foreign key reference to the item dimension table, where the entry corresponds to a primary key in that table for a record describing the item purchased.
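
A minimal Python sketch of this fact-to-dimension relationship (the tables and values are made up for illustration):

  # Hypothetical item dimension keyed by a surrogate key
  item_dim = {
      1: {'item_name': 'Widget', 'supplier': 'Acme', 'list_price': 9.99},
      2: {'item_name': 'Gadget', 'supplier': 'Globex', 'list_price': 4.50},
  }

  # Hypothetical purchase fact rows carrying a foreign key to the item dimension
  purchase_facts = [
      {'item_key': 1, 'quantity': 2},
      {'item_key': 2, 'quantity': 1},
      {'item_key': 1, 'quantity': 1},
  ]

  # Resolve the foreign key and sum a measure per dimension attribute
  qty_by_item = {}
  for fact in purchase_facts:
      name = item_dim[fact['item_key']]['item_name']
      qty_by_item[name] = qty_by_item.get(name, 0) + fact['quantity']

  print(qty_by_item)   # {'Widget': 3, 'Gadget': 1}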

Most data warehouses will have one or more time dimensions. Since the warehouse will be used for finding and examining trends, data analysts will need to know when each fact occurred. The most common time dimension is calendar time. However, your business might also need a fiscal time dimension if your fiscal year does not start on January 1st like the calendar year does.

Most data warehouses will also contain product or service dimensions since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension.

Change data Capture

IBM WebSphere DataStage Change Data Capture

This is not to be confused with the Change Data Capture (CDC) stage.

The following CDC companion products are available to work with IBM Information Server; they need to be installed separately:

  • IBM WebSphere® DataStage® Changed Data Capture for Microsoft® SQL Server
  • IBM WebSphere DataStage Changed Data Capture for Oracle
  • IBM WebSphere DataStage Changed Data Capture for DB2® for z/OS®
  • IBM WebSphere DataStage Changed Data Capture for IMS™

The appropriate product, based on which database is used, is installed to capture changes from the source data and pass them on to the target.

Change data Capture

It can be used in two modes: PUSH and PULL.

  • PUSH – changes are published as they happen
  • PULL – changes are captured at regular intervals, say once a day or every 5 minutes

CDC uses the native services of the database architecture, adheres to the database vendor’s documented formats and APIs, and minimizes the invasive impact on any operational systems.

Please refer to the IBM product documentation for details.

What is a Fact Table?

A fact table is a table that joins dimension tables with measures. For example, let’s say you wanted to know the time worked by employees, by location, by project, and by task. If you had dimensions for employee, location, project, and task, you would create a composite primary key from those foreign keys and add an additional column for the time-worked measure (more on measures in a little bit).

Keep in mind that, unlike dimension tables, fact tables are huge. Fact tables are usually built to contain a hundred thousand records on the low side up to billions of records on the high side. Therefore, these tables must be normalized to be efficient.

A fact table is normalized when it consists of the following two things:

  1. A composite primary key using the foreign keys to the dimension tables.
  2. Measure(s) used for numerical analysis.
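
A minimal Python sketch of the time-tracking example above, using a composite key built from the dimension foreign keys plus a single measure (all key values are made up for illustration):

  # (employee_key, location_key, project_key, task_key) -> hours worked
  time_worked_facts = {
      (101, 7, 3, 42): 6.5,
      (101, 7, 3, 43): 1.5,
      (205, 9, 3, 42): 8.0,
  }

  # Total hours booked against project_key 3, across all other dimensions
  project_hours = sum(hours
                      for (emp, loc, proj, task), hours in time_worked_facts.items()
                      if proj == 3)
  print(project_hours)   # 16.0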

 

Fact tables contain the data corresponding to a particular business process. Each row represents a single event associated with that process and contains the measurement data associated with that event. For example, a retail organization might have fact tables related to customer purchases, customer service telephone calls and product returns. The customer purchases table would likely contain information about the amount of the purchase, any discounts applied and the sales tax paid.

The information contained within a fact table is typically numeric data and it is often data that can be easily manipulated, particularly by summing together many thousands of rows. For example, the retailer described above may wish to pull a profit report for a particular store, product line or customer segment. The retailer can do this by retrieving information from the fact table that relates to those transactions meeting the specific criteria and then adding those rows together.

Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records manufacturing of various baked goods.

Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table “grain”. For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, assembly-line warehouse analysts might be very interested in the number of defective goods manufactured each hour. Similarly, a marketing data warehouse might be concerned with the activity of a consumer group at a specific income level rather than the purchases made by each individual.

Fact Table Grain

When designing a fact table, developers must pay careful attention to the grain of the table — the level of detail contained within the table. The developer designing the purchase fact table described above would need to decide, for example, whether the grain of the table is a customer transaction or an individual item purchase. In the case of an individual item purchase grain, each customer transaction would generate multiple fact table entries, corresponding to each item purchased. The choice of grain is a fundamental decision made during the design process that can have significant impact on the business intelligence effort down the road.
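
A tiny Python sketch of the two grain choices described above: one customer transaction with three purchased items produces one fact row at transaction grain but three fact rows at item grain (the transaction data is made up for illustration):

  # One hypothetical customer transaction containing three purchased items
  transaction = {
      'transaction_id': 9001,
      'items': [
          {'item_key': 1, 'amount': 10.00},
          {'item_key': 2, 'amount': 5.00},
          {'item_key': 3, 'amount': 2.50},
      ],
  }

  # Transaction grain: one fact row per transaction, measures rolled up
  transaction_grain = [{'transaction_id': transaction['transaction_id'],
                        'amount': sum(i['amount'] for i in transaction['items'])}]

  # Item grain: one fact row per item purchased
  item_grain = [{'transaction_id': transaction['transaction_id'], **item}
                for item in transaction['items']]

  print(len(transaction_grain), len(item_grain))   # 1 3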