Oracle column data type NCLOB

Officially, NCLOB is not supported in older versions. Is there a way to make it work in a bare-bones SRC -> TRANSFORMER -> TGT kind of job, or does it take something more complex? Please let me know in the comments. In V11.5, define NCLOB as LONGNVARCHAR.

https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.conn.oracon.usage.doc/topics/data_map_to_ds_oracc.html

What is NCLOB?

NCLOB (National Character Large Object) is an Oracle data type that can hold up to 4 GB of character data. It is similar to a CLOB, but the characters are stored in the database's national character set (NLS), which may be multibyte.

Other details that may be useful

1. If you are using Java applications to modify these columns, you have to open streaming connections to write the data for these columns; the streaming connections can be ASCII (character) or binary. A minimal JDBC sketch follows this list.
2. From SQL*Plus it is not possible to write large values directly, because the maximum value that can be written is 4000 characters.
3. From PL/SQL, use the DBMS_LOB package (for example, DBMS_LOB.WRITEAPPEND) to insert the data.
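
As an illustration of the streaming approach from Java, here is a minimal JDBC sketch. It assumes a JDBC 4.0+ Oracle driver; the DOCS table, BODY column, and connection details are hypothetical.

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Sketch: write an NCLOB column through a character stream instead of a
    // 4000-char-limited literal. Table DOCS(ID NUMBER, BODY NCLOB) is hypothetical.
    public class NclobWrite {
        public static void main(String[] args) throws Exception {
            String text = "large national-character payload ...";
            try (Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//host:1521/service", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO DOCS (ID, BODY) VALUES (?, ?)")) {
                ps.setInt(1, 1);
                // Stream the value rather than binding a fixed string literal.
                ps.setNCharacterStream(2, new StringReader(text), text.length());
                ps.executeUpdate();
            }
        }
    }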

 

DataStage Director – Job running and viewing issues

The Operator role has very few options by default; check how it is set up in your environment.

You can cross-check by extracting the job log via the command line (for example, with dsjob -logsum).

  • Sanity-test your environment:
    • RowGen -> Peek, then compile and run
    • RowGen -> Transformer -> Peek, then compile and run

List the projects on the local engine:

$DSHOME/bin/dsjob -lprojects

Or, naming the domain, engine, and credentials explicitly:

$DSHOME/bin/dsjob -domain host:port -server host -user u124443 -password some43@$ -lprojects
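
If you need to capture a job's log from a script, a wrapper like the following Java sketch can shell out to the dsjob CLI; the install path, project name, and job name below are hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Sketch: dump a job's log summary by invoking "dsjob -logsum project job".
    public class JobLogDump {
        public static void main(String[] args) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                    "/opt/IBM/InformationServer/Server/DSEngine/bin/dsjob",
                    "-logsum", "MyProject", "MyJob");
            pb.redirectErrorStream(true); // merge stderr into stdout
            Process p = pb.start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    System.out.println(line);
                }
            }
            p.waitFor();
        }
    }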

 

 

‘mapping file error’ – Lookup Stage

Join versus Lookup

The Lookup stage memory-maps its lookup file. This means that you MUST have enough system memory available to hold the entire contents of the reference data, AND you must have enough disk space on the resource disk (defined in the APT_CONFIG_FILE) to shadow the in-memory file.

‘Each physical process can address only up to 4 GB of memory because it is a 32-bit application. The Windows version of the InfoSphere DataStage Parallel Engine is available only with 32-bit pointers.’

It is not that the system lacks memory, but that loading the whole map into a single process's address space hits this limit. In this case the issue was worked around by using a Join stage instead of a Lookup.
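
As an illustration of the underlying mechanism (an analogy, not DataStage's code), memory-mapping a file in Java hits the same kind of address-space ceiling: a single mapped buffer cannot exceed 2 GB because it is indexed with a 32-bit int. The file path below is hypothetical.

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustration: map a reference file into memory the way an in-memory
    // lookup would. One MappedByteBuffer is capped at Integer.MAX_VALUE bytes.
    public class MapLookupFile {
        public static void main(String[] args) throws Exception {
            Path file = Path.of("/data/reference.ds"); // hypothetical path
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                long size = ch.size();
                if (size > Integer.MAX_VALUE) {
                    throw new IllegalStateException(
                        "Reference data too large to map in one piece: " + size);
                }
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
                System.out.println("Mapped " + buf.capacity() + " bytes");
            }
        }
    }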

Another option is to change the job to re-partition both the reference input and the primary input to the lookup so that they match on the keys. Because the lookup then runs 4-way parallel and the data has been explicitly partitioned, the lookup disables memory sharing, and the per-process memory requirement on the reference input drops because the data is distributed. This can enable the job to complete.

 

Fatal Error: APT_Communicator::shmemInitBuffers: createMapFile (/tmp) failed: Not enough space on node.

This message can also be caused by a system-wide limit on the number of mmap'ed shared memory segments. Often the issue occurs when APT_DEFAULT_TRANSPORT_BLOCK_SIZE is set too high, so check that environment variable. Its default is 131072 (128 KB); the maximum should be 1048576 (1 MB). See page 79 of the Parallel Job Advanced Developer Guide for additional information about this environment variable.

Resolving the problem
To resolve this issue, ensure that there is adequate space in the assigned TMPDIR (temporary directory), or verify the setting of the environment variable APT_DEFAULT_TRANSPORT_BLOCK_SIZE.

Transformer Stage Looping

Looping from V8.5 and up

Aggregation operations make use of a cache that stores input rows.

Two functions, SaveInputRecord() and GetSavedInputRecord(), are used to add input rows to the cache and retrieve them.

  • SaveInputRecord() is called when a stage variable is evaluated; it adds the current input row to the cache and returns the count of rows in the cache (starting at 1 when the first row is added).
  • GetSavedInputRecord() is called when a loop variable is evaluated; it retrieves the next saved row and makes it the current input row.

In the Transformer stage properties, the pieces of a loop are defined as follows.

Stage variables

Stage variable derivations can:

  • use functions such as SaveInputRecord()
  • use functions that take link columns, such as LastRowInGroup(inlink.Col1)
  • use IF THEN ELSE logic, such as IF IsBreak THEN 0 ELSE SummingPrice + inlink.Price
Loop condition

Enter an expression as the loop condition, for example:

@ITERATION <= NumRows

The loop continues to iterate for the count held in the NumRows stage variable.
Loop variables

Define the loop variable, for instance:

SavedRowIndex: GetSavedInputRecord()

Output link metadata and derivations

Define the output link columns and their derivations, for example:

  • Col1 – inlink.Col1
  • Price – inlink.Price
  • Percentage – (inlink.Price * 100)/TotalPrice

Here TotalPrice is a stage variable that accumulates the group total before the loop replays the saved rows.

 

RUNTIME ERRORS

The number of calls to SaveInputRecord() and GetSavedInputRecord() must match for each loop. You can call SaveInputRecord() multiple times to add rows to the cache, but once you call GetSavedInputRecord(), you must call it enough times to empty the cache before you can call SaveInputRecord() again. Breaking this rule generates runtime errors in the following circumstances (see the sketch after this list):
  • If your Transformer stage calls GetSavedInputRecord before SaveInputRecord, then a fatal error similar to the following example is reported in the job log:
    APT_CombinedOperatorController,0: Fatal Error: get_record() called on 
    record 1 but only 0 records saved by save_record()
  • If your Transformer stage calls GetSavedInputRecord more times than SaveInputRecord is called, then a fatal error similar to the following example is reported in the job log:
    APT_CombinedOperatorController,0: Fatal Error: get_record() called on 
    record 3 but only 2 records saved by save_record()
  • If your Transformer stage calls SaveInputRecord but does not call GetSavedInputRecord, then a fatal error similar to the following example is reported in the job log:
    APT_CombinedOperatorController,0: Fatal Error: save_record() called on 
    record 3, but only 0 records retrieved by get_record()
  • If your Transformer stage does not call GetSavedInputRecord as many times as SaveInputRecord, then a fatal error similar to the following example is reported in the job log:
    APT_CombinedOperatorController,0: Fatal Error: save_record() called on 
    record 3, but only 2 records retrieved by get_record()
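
To make the pairing rule concrete, here is a small Java analogy of the row cache (not the engine's implementation): save() stands in for SaveInputRecord(), get() for GetSavedInputRecord(), and an unbalanced get() fails with the same shape of message as the job log.

    import java.util.ArrayDeque;

    // Analogy: a row cache that enforces balanced save/get calls per loop.
    public class RowCache {
        private final ArrayDeque<String> cache = new ArrayDeque<>();
        private int saved = 0;
        private int retrieved = 0;

        // SaveInputRecord(): add a row, return the count of rows in the cache.
        int save(String row) {
            cache.addLast(row);
            return ++saved;
        }

        // GetSavedInputRecord(): retrieve the next saved row, in order.
        String get() {
            retrieved++;
            if (cache.isEmpty()) {
                throw new IllegalStateException("get_record() called on record "
                    + retrieved + " but only " + saved
                    + " records saved by save_record()");
            }
            return cache.removeFirst();
        }
    }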

ETL CLOB data type

What is CLOB?

A Character Large OBject (or CLOB) is a collection of character data in a database management system, usually stored in a separate location that is referenced in the table itself.

A CLOB (character large object) value can be up to 2,147,483,647 characters long. A CLOB is used to store Unicode character-based data, such as large documents in any character set.

The length is given in number of characters, unless one of the suffixes K, M, or G is used, denoting multiples of 1024, 1024*1024, and 1024*1024*1024 respectively. Length is always specified in (Unicode) characters for a CLOB.

From version 8.5, the Oracle connector supports reading and writing the Oracle LOB data types BFILE, BLOB, CLOB, NCLOB, LONG RAW, RAW, and XMLType.

Example

Suppose a column in a table contains HTML or XML data stored as a CLOB, and the length can be large, say 1048576.

Use the LongVarChar(n) data type within DataStage.

In the ETL job, replace CLOB with a LongVarChar of the same size (1048576), since CLOB is not defined in DataStage. The value of “n” depends on what is actually in the CLOB field.

For example, if the DBA or analyst confirms the field will never hold more than 1500 characters, use VarChar(2000) instead of LongVarChar.
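
For completeness, here is a minimal JDBC sketch of reading such a CLOB column as a character stream; the DOCS table, BODY column, and connection details are hypothetical.

    import java.io.BufferedReader;
    import java.io.Reader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch: stream a CLOB value instead of reading it as a fixed-length string.
    public class ClobRead {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//host:1521/service", "user", "password");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT BODY FROM DOCS WHERE ID = 1")) {
                if (rs.next()) {
                    try (Reader r = rs.getCharacterStream(1);
                         BufferedReader br = new BufferedReader(r)) {
                        br.lines().forEach(System.out::println);
                    }
                }
            }
        }
    }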

Images stored as CLOBs

BLOBs are generally used to store images, though CLOBs have sometimes been used as well. Sometimes only the path is stored in the table while the images physically live on disk.

In cases where the images can be encoded and stored as text along with the encoding metadata, a BLOB or CLOB is used to store the images.

What is Unicode and what is it used for?

Don't trust what View Data shows you in regard to “un-American” (non-ASCII) characters; it won't necessarily render them correctly. Check the final output (file or table) instead, in a viewer that can properly display the Euro symbol, accented characters, and so on.

Use Unicode to handle Latin source data and Chinese characters.

Apply StringToUstring(columnvalue, 'UTF-8') and set the column data type to NVarChar.

NLS-settings:
– Server Maps=UTF-8
– Parallel Maps=UTF-8

Code page IBM 1388 (EBCDIC Simplified Chinese) may also come into play with mainframe sources.

Error and resolution

APT_CombinedOperatorController(1),0: Invalid character(s) ([xE4]) found converting string (code point(s): [xEF][xBC][x8A][xE3][x80][x80][xE3][x80][x80][xE3][x80][x80][xE3][x80][x80]…) from codepage UTF-8 to Unicode, substituting.

Probable resolution:

The job (or project) was set up to default to the UTF-8 character set, but the stage used to read the source is actually receiving ISO-8859-1 characters. The usual fix is to set the NLS map on the source stage to match the data's real encoding.
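
You can reproduce this class of failure outside DataStage. In the following Java sketch (the sample string is made up), ISO-8859-1 bytes such as 0xE4 ('ä') are not valid UTF-8, so a UTF-8 decoder set to substitute emits U+FFFD replacement characters, which is exactly the "substituting" behavior the job log reports.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Illustration: decode ISO-8859-1 bytes with a UTF-8 decoder.
    public class BadDecode {
        public static void main(String[] args) throws Exception {
            byte[] latin1 = "Münz-Wäsche".getBytes(StandardCharsets.ISO_8859_1);
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            // 0xFC and 0xE4 are malformed in UTF-8, so they come out as U+FFFD.
            System.out.println(dec.decode(ByteBuffer.wrap(latin1)));
        }
    }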

Useful links

http://unicode-table.com/en/#0000

 

Information Analyzer – REGEX

source-data [NOT] MATCHES_REGEX pattern-string

You can build your own regular expressions as needed. Some examples are listed below, with a small test harness after the list:

Postal Codes Example
  • You are searching for all US postal codes in a column (ZIP+4: nine digits, with the last four optional):
    '\b[0-9]{5}(?:-[0-9]{4})?\b'
Phone numbers Example
  • You are searching for phone numbers in the standard North American format. You want to account for variations in the format, such as 123-456-7890, (123)-456-7890, 123 456 7890, 123.456.7890, and so on:
    '\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}'
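
To sanity-check patterns like these outside Information Analyzer, a small Java harness works; the sample values are made up, and find() is used on the assumption that the check looks for the pattern anywhere in the value.

    import java.util.List;
    import java.util.regex.Pattern;

    // Sketch: exercise the two example patterns against made-up values.
    public class RegexCheck {
        public static void main(String[] args) {
            Pattern zip = Pattern.compile("\\b[0-9]{5}(?:-[0-9]{4})?\\b");
            Pattern phone = Pattern.compile(
                    "\\(?[0-9]{3}\\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}");
            for (String s : List.of("30301", "30301-1234", "3030")) {
                System.out.println(s + " -> " + zip.matcher(s).find());
            }
            for (String s : List.of("123-456-7890", "(123)-456-7890", "123.456.7890")) {
                System.out.println(s + " -> " + phone.matcher(s).find());
            }
        }
    }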

If you are on 8.5 or 8.7, you can import a set of pre-built rules by following this developerWorks article: http://www.ibm.com/developerworks/data/library/techarticle/dm-1112isanalyzerrules/

The above rules are automatically installed with version 9.1 and beyond.

There are many examples of the use of regex functions, including the data rule definition ‘SsnMatchesRegex’.

 

More information

https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.ia.quality.doc/topics/dq_matches_regex_check.html

Information Analyzer

Information Analyzer (IA) gives you frequency distributions of values, formats, and so on. IA provides data sampling (random, sequential, and nth). At a fraction of the cost of full profiling, a lot of useful information can be generated:
– Is the column unique?
– Are there nulls?
– What are the high frequency values?
– What are the high frequency patterns?
– Are there many outliers?

Note: sampling from the IA perspective may still incur a full table scan on the source system. All rows may still be read and sent to IA, which then analyzes only every 100th record, or randomly ignores all but (say) 10% of the records, depending on the sampling options you selected.

“Frequency by Frequency” report under the “column frequency” option

Column Level Summary
Cardinality Count : 1,034,710
Null Count : 0
Actual Row Count : 17,937,920
Frequency Cut Off : 0
Total Rows Covered : 12,633,758
%Rows Covered : 70.4305

Check the display options: the default may be to show the 1000 most frequent distinct values with their counts. (In the summary above, %Rows Covered = 12,633,758 / 17,937,920 ≈ 70.4305%.)

 

If you are using the Where clause in the Analysis Settings (e.g. to exclude suspended accounts), it applies only to Column Analysis, not to Data Rules.

For Data Rules, create a Virtual Table (which appears under the Column Analysis menu). The Virtual Table lets you build the relevant where clause, and it can be used both in Column Analysis and in any of the data rules.

Scheduling the Column Analysis (CA) job from Information Analyzer (IA)

  • Go into the ‘Web Console for Information Server’ ( https://server:port/ibm/iis/console ).
  • Create a scheduling view (Schedule Monitoring > Views of Schedules > New Scheduling View) and select at least an ‘Origin’ of Information Analyzer.
  • You should then be able to see the scheduled Information Analyzer processes.