Datastage | Datastage

June 5, 2012

What is ETL

Extract -> Transform -> Load

A process of gathering, converting and storing data, often from many locations. The data is often converted from one format to another in the process.

Examples: IBM Datastage, Informatica

June 5, 2012

A data warehouse is a collection of data gathered and organized so that it can easily by analyzed, extracted, synthesized, and otherwise be used for the purposes of further understanding the data. It may be contrasted with data that is gathered to meet immediate business objectives such as order and payment transactions, although this data would also usually become part of a data warehouse.

June 5, 2012

What can Datastage do?

Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects – such as, data warehouses, data marts, and system migrations
Import, export, create, and managed metadata for use within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and execution environments

June 4, 2012

Complex Flat File Stage

CFF stage now processes Multi Format Flat (MFF) file
Constraints can be specified on the output links to filter data and/or define when a record should be sent down the link

May 27, 2012

IBM Big Data

May 27, 2012

Infosphere Information Server V 8.7

What is IIS Suite?
IBM’s Infosphere Information Server is a data integration platform. It includes products for Data Warehousing (Infosphere Warehouse 10), Information Integration (IBM Datastage), Master Data Management (MDM V10) and Big Data Analytics.

Which is the latest version?
V8.7

What is new in V8.7?

Supports Big Data and Hadoop (directly access big data file on a distributed file system)
Deep & Tight Metadata integration
Comprehensive information governance
parallel debugging capabilities for partitioned data
smart management of metadata and metadata imports
New operational intelligence capabilities
Netezza connectivity as a part of this release
A new configuration option for the database connectivity layer of Information Server creates an audit log for any database operation directly into InfoSphere Guardium
A new InfoSphere Change Data Capture stage, as well as expanded metadata capture with Data Replication/Change Data Delivery for end-to-end data lineage
A new Teradata Connector TMSM and dual load support to help Teradata users better manage disaster recovery
Advanced capabilities to identify, cleanse, and manage metadata for SAP projects help SAP project managers meet their go-live dates

New features in IBM Datastage V 8.7 – http://dsxchange.net/uploads/DSXChange_DataStage_8_7.pdf

May 26, 2012

Hadoop! – What is Hadoop?

Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-avaiability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-availabile service on top of a cluster of computers, each of which may be prone to failures.

http://hadoop.apache.org/ for more details

The project includes these subprojects:

Hadoop Common	The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS™)	A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce	A software framework for distributed processing of large data sets on compute clusters.
Avro™	A data serialization system.
Cassandra™	A scalable multi-master database with no single points of failure.
Chukwa™	A data collection system for managing large distributed systems.
HBase™	A scalable, distributed database that supports structured data storage for large tables.
Hive™	A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™	A Scalable machine learning and data mining library.
Pig™	A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™	A high-performance coordination service for distributed applications.

Scalable–

New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.

Cost effective–

Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.

Flexible–

Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.

Fault tolerant–

When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

IBM and Hadoop

Eighty percent of the world’s data is unstructured, and most businesses don’t even attempt to use this data to their advantage. Imagine if you could afford to keep all the data generated by your business? Imagine if you had a way to analyze that data?IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. With built-in analytics, extensive integration capabilities and the reliability, security and support that you require, IBM can help put your big data to work for you.Other Hadoop-related projects at Apache include:Hadoop changes the economics and the dynamics of large scale computing. Its impact can be boiled down to four salient characteristics.

May 25, 2012

XML Stages

Yes, Datastage does have the XML stages. From V8.5 onwards (Introduced sometime around late 2010). More coming on XML support

The good news is that version 8.5 of IBM InfoSphere DataStage has greatly enhanced XML capabilities. I first saw the new XML features in DataStage 8.5 in demo at the Information On Demand conference last October. And I was very impressed because the XML support goes far beyond the half-hearted XML handling that many tools offer.

For example, it’s easy to import and work with XML Schemas in DataStage 8.5. Many industry standard XML Schemas that are used in the financial sector, health care, insurance, government, retail, etc. are quite complex and consist of dozens or even hundreds of XSD files that comprise a single XML Schema. Examples includeFpML, FIXML, HL7, IRS1120, OAGIS, and many others.

You might receive such a schema as a ZIP file that contains multiple folders with XSD files. DataStage 8.5 can simply read the entire ZIP file, which saves you the tedious job of importing all the XSD files separately or dealing with their import and include relationships.

Once the XML Schema is imported, DataStage understands the structure of your XML document and allows you define the transformations that you need.

The XML transformation capabilities certainly include some of the intuitive things. For example, you can:

compose new XML documents from relational tables or other sources
shred XML documents to a set of relational row sets (or tables)
extract selected pieces from each XML document and leave other parts of the XML unparsed and “as-is”
extend your XML processing by applying XSLT stylesheets to the incoming XML data

And there is also a powerful set of transformation steps that allow you to implement any custom XML transformation that you may need.

A big bonus is the ability to process even very large XML files very efficiently. When batches of XML documents are moved between systems, they are sometimes concatenated into a single large XML document that can be GBs in size. Such mega-documents typically contain many independent business objects and often need to be split by the consumer. DataStage 8.5 handles documents of 20GB or larger very efficiently and can even parse and process a single large document with multiple threads in parallel! This is very cool and a big win for performance

May 24, 2012

ETL Tools – A comparison

ETL tool	Company
IIS Suite – Datastage	IBM (Ascential)
Powercenter Informatica	Informatica
Ab Initio	Ab Initio
Cognos Decision Stream	IBM(Cognos)
Oracle Warehouse Builder/Oracle Data Integrator	Oracle
SAS Data Integration studio	SAS
Micosoft SQL Server Integration Srvs	Micosoft
Reveleus
Talend Open Studio	Talend
Genio
Acta
Data services	SAP BusinessObjects
Compuware
Sybase
Elixir Repertoire	Elixir
Data Migrator	Information Builders
Pentaho Data Integration	Pentaho
Adeptia Integration Suite	Adeptia

Datastage and Informatica are the leading ETL tools. Each has its own advantages, these are compared here.

May 24, 2012

Datastage Certification

Datastage V 8.5 – Test 000-421 (IBM link – http://www-03.ibm.com/certify/tests/ovr421.shtml)

Get free Dumps cheap – http://ds.iexpertify.com/2013/09/datastage-certification-dumps-for-v8-5-000-421.html
(Note for copyright: iexpertify.com does not host datastage certification dumps)

Test information:

Number of questions: 65
Time allowed in minutes: 90
Required passing score: 65%

To register, click on the IBM link and follow the IBM at Prometric link at the right of the webpage.

If you are taking the IIS Datastage certification, these are the topics you will be tested on:

Section 1 – Configuration (6%)

Describe how to properly configure DataStage v8.5
Identify tasks required to create and configure a project to be used for v8.5 jobs
Given a configuration file, identify its components and its overall intended purpose

Section 2 – Metadata (6%)

Demonstrate knowledge of Orchestrate schema
Identify the method of importing, sharing, and managing metadata
Demonstrate knowledge of runtime column propagation

Section 3 – Persistent Storage (10.5%)

Explain the process of importing/exporting data to/from framework (e.g., sequential file, external source/target)
Describe proper use of a sequential file
Describe proper usage of FileSets and DataSets
Describe use of FTP stage for remote data
Describe use of restructure stages (e.g., column import/export)
Identify importing/exporting of XML data

Section 4 – Parallel Architecture (9%)

Demonstrate proper use of data partitioning and collecting
Demonstrate knowledge of parallel execution

Section 5 – Datatbases (9%)

Demonstrate proper selection of database stages and database specific stage properties
Identify source database options
Demonstrate knowledge of target database options

Section 6 – Data Transformation (12%)

Demonstrate knowledge of default type conversions, output mappings, and associated warnings
Demonstrate proper selections of Transformer stage vs. other stages
Describe Transformer stage capabilities (including: stage variables, link variables, DataStage macros, constraints, system variables, link ordering, @PART NUM, functions
Demonstrate the use of Transformer stage variables (e.g., to identify key grouping boundaries on incoming data)
Identify process to add functionality not provided by existing DataStage stages. (e.g., wrapper, BuildOps, user def functions/routines)
Demonstrate proper use of SCD stage
Demonstrate job design knowledge of using RCP (modify, filter, dynamic transformer)
Demonstrate knowledge of Transformer Stage input and output loop processing (e.g., LastRecord(), LastRowInGroup(), SaveRecord(), etc.)

Section 7 – Job Components (12%)

Demonstrate knowledge of Join, Lookup and Merge stages
Demonstrate knowledge of SORT stage
Demonstrate understanding of Aggregator stage
Describe proper usage of change capture/change apply
Demonstrate knowledge of Real-time components

Section 8 – Job Design (9%)

Demonstrate knowledge of shared containers
Describe how to minimize Sorts and repartitions
Demonstrate knowledge of creating restart points and methodologies
Demonstrate proper use of standards
Explain the process necessary to run multiple copies of the source (job multi-instance)

Section 9 – Monitor and Troubleshoot (7%)

Demonstrate knowledge of parallel job score
Identify and define environment variables that control DataStage v8.5 with regard to added functionality and reporting
Given a process list, identify conductor, section leader, and player process
Identify areas that may improve performance (e.g., buffer size, repartitioning, config files, operator combination, etc.)
Demonstrate knowledge of runtime metadata analysis and performance monitoring

Section 10 – Job Management and Deployment (10.5%)

Demonstrate knowledge of advanced find
Demonstrate knowledge and the purpose of impact analysis
Demonstrate knowledge and purpose of job compare
Articulate the change control process
Source Code Control Integration

Section 11 – Job Control and Runtime Management (6%)

Demonstrate knowledge of message handlers
Identify the use of dsjob command line utility
Demonstrate ability to use job sequencers (e.g., exception hunting, re-startable, dependencies, passing return value from routing, parameter passing and job status)