Information Analyzer – REGEX

source-data  [NOT] MATCHES_REGEX pattern-string

You can build your own REGEX based on need. Some examples are listed below:

Postal Codes Example
  • You are searching for all US postal codes in a column (five digits, with an optional four-digit extension):
    '\b[0-9]{5}(?:-[0-9]{4})?\b'
Phone numbers Example
  • You are searching for phone numbers in the standard North American format. You want to account for variations in the format such as 123-456-7890, (123)-456-7890, 123 456 7890, 123.456.7890, and so on:
    '\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}'
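
A quick way to sanity-check patterns like these before putting them into a rule is outside IA, for example with Python's re module (the sample values below are illustrative):

```python
import re

# The two patterns shown above, unchanged.
zip_pattern = re.compile(r'\b[0-9]{5}(?:-[0-9]{4})?\b')
phone_pattern = re.compile(r'\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}')

# Postal codes: 5 digits, with an optional -NNNN extension.
assert zip_pattern.fullmatch('12345')
assert zip_pattern.fullmatch('12345-6789')
assert not zip_pattern.fullmatch('1234')

# Phone numbers: all the listed format variations should match.
for value in ('123-456-7890', '(123)-456-7890', '123 456 7890', '123.456.7890'):
    assert phone_pattern.fullmatch(value)
```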

If you are on version 8.5 or 8.7, you can import a set of pre-built rules by following this developerWorks article: http://www.ibm.com/developerworks/data/library/techarticle/dm-1112isanalyzerrules/

The above rules are automatically installed with version 9.1 and beyond.

There are many examples of the use of RegEx functions, including the data rule definition ‘SsnMatchesRegex’.


More information

https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.ia.quality.doc/topics/dq_matches_regex_check.html

Information Analyzer

Information Analyzer (IA) gives you frequency distributions of values, formats, and so on. IA provides data sampling (random, sequential, and Nth). At a fraction of the cost of a full analysis, a lot of useful information can be generated:
– Is the column unique?
– Are there nulls?
– What are the high frequency values?
– What are the high frequency patterns?
– Are there many outliers?

Note: Sampling from the IA perspective may still incur a full table scan on the source system. All rows may still be read and sent to IA, which then analyzes only every 100th record, or randomly ignores all but 10% of the records, depending on which sampling options you selected.

“Frequency by Frequency” report under the “column frequency” option

Column Level Summary
Cardinality Count : 1,034,710
Null Count : 0
Actual Row Count : 17,937,920
Frequency Cut Off : 0
Total Rows Covered : 12,633,758
%Rows Covered : 70.4305
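
The %Rows Covered figure is simply Total Rows Covered as a percentage of Actual Row Count, which you can verify from the numbers in the summary above:

```python
# Values taken from the Column Level Summary above.
total_rows_covered = 12_633_758
actual_row_count = 17_937_920

pct_rows_covered = 100 * total_rows_covered / actual_row_count
print(round(pct_rows_covered, 4))  # → 70.4305, matching the report
```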

Check the Display options: the default may show only the first 1,000 distinct values with their counts.


If you are using the Where clause in the Analysis Settings (e.g., to exclude suspended accounts), note that it applies only to Column Analysis, not to Data Rules.

For Data Rules, create a Virtual Table (which appears under the Column Analysis menu). The Virtual Table allows you to build a relevant where clause that can be used both in Column Analysis and in any of the data rules.

Scheduling the Column Analysis (CA) Job from Information Analyzer (IA)

  • Go into the ‘Web Console for Information Server’ ( https://server:port/ibm/iis/console ).
  • Create a scheduling view (Schedule Monitoring > Views of Schedules > New Scheduling View) and select at least an ‘Origin’ of Information Analyzer.
  • You should then be able to see the scheduled Information Analyzer processes.

COBOL files in CFF stage

It could be a complex COBOL copybook structure with a good mix of OCCURS, REDEFINES, level-88 items, COMP, and COMP-3 fields.

Note: packed decimal fields are declared COMP-3; binary fields are declared COMP.

Or it could be a simple one such as below:

********************************
* COBOL DECLARATION FOR ORDERS *
********************************

01 HEADER.
   05 RECTYPE   PIC X(1).
   05 NAME      PIC X(20).
   05 ORDDATE   PIC X(10).

01 DETAIL.
   05 ORDERNUM  PIC S9(5).
   05 RECTYPE   PIC X(1).
   05 PRODUCTID PIC S9(5).
   05 QTY       PIC S9(5).
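
The copybook above describes fixed-width records. As a rough sketch (not DataStage code, and assuming the data has already been converted to ASCII display format), the HEADER layout maps to simple string slicing; the sample record below is made up for illustration:

```python
# Field widths taken from the HEADER layout above:
# RECTYPE X(1), NAME X(20), ORDDATE X(10) → 31 characters total.
HEADER_LAYOUT = [('RECTYPE', 1), ('NAME', 20), ('ORDDATE', 10)]

def parse_header(line: str) -> dict:
    """Slice a fixed-width header record into named fields."""
    fields, pos = {}, 0
    for name, width in HEADER_LAYOUT:
        fields[name] = line[pos:pos + width].rstrip()
        pos += width
    return fields

# Hypothetical header record: 'H', a 20-char name, a 10-char date.
rec = parse_header('HACME CORPORATION    2015-06-30')
assert rec == {'RECTYPE': 'H', 'NAME': 'ACME CORPORATION', 'ORDDATE': '2015-06-30'}
```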

But the main issue is that the source file is in a native EBCDIC binary format, not ASCII. The source file is generated on a mainframe and uses EBCDIC encoding. Each row has a different set of data columns, varying with the available data: typically a header, one or more types of detail records, and a trailer record.

What stage?

Either way, use a CFF stage: import the copybook layout and view the source content through the CFF (server) stage. Using a Sequential File stage could throw errors such as import errors, buffer overruns, etc. The Sequential File stage does do EBCDIC-to-ASCII conversion, but not well if the file has odd fields like packed decimals.

Endianness

The endianness (byte order) setting needs to be checked for Binary/COMP fields. The same file definition could yield different values depending on the byte order:

  849 = x0351 read big-endian
20739 = x5103 read little-endian (same two bytes, reversed)
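
Python's struct module makes the effect easy to see; the two bytes 0x03 0x51 from the example decode differently depending on byte order:

```python
import struct

raw = b'\x03\x51'  # the two bytes from the example above

big = struct.unpack('>h', raw)[0]     # big-endian signed 16-bit
little = struct.unpack('<h', raw)[0]  # little-endian signed 16-bit

print(big, little)  # → 849 20739
```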

more details on Endianness https://en.wikipedia.org/wiki/Endianness

Conversion to ASCII string

COMP-3 is a decimal storage format. When a decimal is converted to a string, it will contain a decimal point. If the scale of the decimal is 0, that decimal point will still be present in the string as the last byte.

If the scale is 0, you can likely just convert the decimal to an integer column before writing to your output.
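
As a rough sketch of how packed decimal works (not DataStage code; nibble layout and sign conventions per the standard COMP-3 format), each byte holds two decimal digits, except that the final nibble carries the sign:

```python
def unpack_comp3(data: bytes, scale: int = 0):
    """Decode an IBM packed-decimal (COMP-3) field.

    Every nibble holds one digit except the last, which holds the
    sign (0xD = negative; 0xC or 0xF = positive/unsigned).
    """
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign_nibble = nibbles.pop()
    value = int(''.join(str(d) for d in nibbles))
    if sign_nibble == 0xD:
        value = -value
    return value / 10 ** scale if scale else value

# 0x12 0x34 0x5C is +12345 packed into three bytes.
assert unpack_comp3(b'\x12\x34\x5C') == 12345
# Same digits with a D sign nibble and a scale of 2 → -123.45.
assert unpack_comp3(b'\x12\x34\x5D', scale=2) == -123.45
```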

Check with the mainframe team whether they are using CCSID 037, CCSID 500, or another code page. Sometimes simple COBOL files can be read as a binary data type and converted using the ASCII function in the Transformer stage.

Make sure “Unicode” is used in the column properties, and that IBM037 or IBM500 is set in the NLS map (the default is probably UTF-8).

More COBOL notes

CCSID 500 — This is an EBCDIC code page, used mainly on z/OS. It is known as the ‘International’ codepage. The related CCSID with Euro support is 1148.

CCSID 037 — This is another popular EBCDIC code page, used on OS/400 and VSE/ESA. It is the ‘US English’ codepage. The related CCSID with Euro support is 1140.
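
Both code pages are available as standard Python codecs (cp037 and cp500), which is handy for spot-checking a few bytes outside DataStage:

```python
# 'HELLO' round-tripped through EBCDIC code page 037.
ebcdic = 'HELLO'.encode('cp037')
print(ebcdic.hex())            # → c8c5d3d3d6 (EBCDIC bytes, not ASCII)
print(ebcdic.decode('cp037'))  # → HELLO
```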

Dynamic Relational Stage (DRS) versus ODBC or Native

DRS is a stage that tries to make switching from one database to another seamless. It uses the native connectivity for the chosen target.

The DRS stage should be faster than the ODBC stage because it uses native database connectivity. You will need to install and configure the required database clients on your DataStage server for it to work.

The Dynamic Relational Stage was leveraged by PeopleSoft so that a single job could run on any of the supported databases. It supports ODBC connections too; read more in the plug-in documentation. The ODBC stage uses an ODBC (Open Database Connectivity) driver for a particular database.

Native versus DRS versus ODBC

Performance-wise there is not much of a difference, since DRS uses native drivers. The DRS stage has both ODBC and native API connectivity (e.g., connecting to Oracle or another DBMS), whereas the ODBC stage only uses Open Database Connectivity.

Dynamic Relational Stage

The Dynamic RDBMS stage has existed since the early 7.x versions.

DRS was replaced by the DRS Connector stage after version 8.5. There is an optional automatic conversion when you upgrade in place; otherwise there is a conversion tool you can use, or you can make the change manually.

Additional reading

https://www.ibm.com/support/knowledgecenter/SSZJPZ_8.7.0/com.ibm.swg.im.iis.conn.drs.doc/topics/DRS013.html

Data Lake implementation

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

- James Dixon, founder and CTO of Pentaho

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”

- Unknown source

The term data lake is regularly associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it lives, on Hadoop’s cluster nodes of commodity computers.

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.

Data Lake and Data Warehouse

The commonality between the two is that they are both data storage repositories.

Data Warehouse vs. Data Lake

             | Data Warehouse                   | Data Lake
DATA         | Structured, processed            | Structured, semi-structured, unstructured, raw
PROCESSING   | Schema on write                  | Schema on read
STORAGE      | Expensive for large volumes      | Designed for low-cost storage
AGILITY      | Less agile, fixed configuration  | Highly agile, configure and reconfigure as needed
SECURITY     | Mature                           | Maturing
USERS        | Business professionals           | Data scientists

Gaps in Data Lake Concept


Data lakes carry significant risks. The most critical is the inability to determine data quality, or to trace the lineage of findings by other analysts or users who have previously found value in the same data in the lake. By definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into an information swamp. What’s more, without metadata, every subsequent use of the data means analysts start from scratch.

Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents, and many data lakes hold data whose privacy and regulatory requirements likely represent risk exposure. The security capabilities of core data lake technologies are still embryonic. These issues will not be addressed if left to non-IT personnel.

Finally, performance should not be neglected. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized, purpose-built infrastructure.