What is ETL

Extract -> Transform -> Load

A process of gathering, converting and storing data, often from many locations. The data is often converted from one format to another in the process.

Examples: IBM Datastage, Informatica

Data Warehousing

A data warehouse is a collection of data gathered and organized so that it can easily by analyzed, extracted, synthesized, and otherwise be used for the purposes of further understanding the data. It may be contrasted with data that is gathered to meet immediate business objectives such as order and payment transactions, although this data would also usually become part of a data warehouse.

What can Datastage do?

  • Design jobs for Extraction, Transformation, and Loading (ETL)
  • Ideal tool for data integration projects – such as, data warehouses, data marts, and system migrations
  • Import, export, create, and managed metadata for use within jobs
  • Schedule, run, and monitor jobs all within DataStage
  • Administer your DataStage development and execution environments

Complex Flat File Stage

  • CFF stage now processes Multi Format Flat (MFF) file
  • Constraints can be specified on the output links to filter data and/or define when a record should be sent down the link

Infosphere Information Server V 8.7

What is IIS Suite?
IBM’s Infosphere Information Server is a data integration platform. It includes products for Data Warehousing (Infosphere Warehouse 10), Information Integration (IBM Datastage), Master Data Management (MDM V10) and Big Data Analytics.

Which is the latest version?

What is new in V8.7?

  • Supports Big Data and Hadoop (directly access big data file on a distributed file system)
  • Deep & Tight Metadata integration
  • Comprehensive information governance
  • parallel debugging capabilities for partitioned data
  • smart management of metadata and metadata imports
  • New operational intelligence capabilities
  • Netezza connectivity as a part of this release
  • A new configuration option for the database connectivity layer of Information Server creates an audit log for any database operation directly into InfoSphere Guardium
  • A new InfoSphere Change Data Capture stage, as well as expanded metadata capture with Data Replication/Change Data Delivery for end-to-end data lineage
  • A new Teradata Connector TMSM and dual load support to help Teradata users better manage disaster recovery
  • Advanced capabilities to identify, cleanse, and manage metadata for SAP projects help SAP project managers meet their go-live dates

New features in IBM Datastage V 8.7 – http://dsxchange.net/uploads/DSXChange_DataStage_8_7.pdf

Hadoop! – What is Hadoop?

Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-avaiability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-availabile service on top of a cluster of computers, each of which may be prone to failures.

http://hadoop.apache.org/ for more details

The project includes these subprojects: 
Hadoop Common  The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS™)  A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce  A software framework for distributed processing of large data sets on compute clusters.
Avro™  A data serialization system.
Cassandra™  A scalable multi-master database with no single points of failure.
Chukwa™  A data collection system for managing large distributed systems.
HBase™  A scalable, distributed database that supports structured data storage for large tables.
Hive™  A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™  A Scalable machine learning and data mining library.
Pig™  A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™  A high-performance coordination service for distributed applications.

 New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
Cost effective–
 Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
 Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
Fault tolerant–
 When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

IBM and Hadoop
Eighty percent of the world’s data is unstructured, and most businesses don’t even attempt to use this data to their advantage. Imagine if you could afford to keep all the data generated by your business? Imagine if you had a way to analyze that data?IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. With built-in analytics, extensive integration capabilities and the reliability, security and support that you require, IBM can help put your big data to work for you.Other Hadoop-related projects at Apache include:Hadoop changes the economics and the dynamics of large scale computing. Its impact can be boiled down to four salient characteristics.

XML Stages

Yes, Datastage does have the XML stages. From V8.5 onwards (Introduced sometime around late 2010). More coming on XML support

The good news is that version 8.5 of IBM InfoSphere DataStage has greatly enhanced XML capabilities. I first saw the new XML features in DataStage 8.5 in demo at the Information On Demand conference last October. And I was very impressed because the XML support goes far beyond the half-hearted XML handling that many tools offer.

For example, it’s easy to import and work with XML Schemas in DataStage 8.5. Many industry standard XML Schemas that are used in the financial sector, health care, insurance, government, retail, etc. are quite complex and consist of dozens or even hundreds of XSD files that comprise a single XML Schema. Examples includeFpMLFIXMLHL7IRS1120OAGIS, and many others.
You might receive such a schema as a ZIP file that contains multiple folders with XSD files. DataStage 8.5 can simply read the entire ZIP file, which saves you the tedious job of importing all the XSD files separately or dealing with their import and include relationships.
Once the XML Schema is imported, DataStage understands the structure of your XML document and allows you define the transformations that you need.
The XML transformation capabilities certainly include some of the intuitive things. For example, you can:
  • compose new XML documents from relational tables or other sources
  • shred XML documents to a set of relational row sets (or tables)
  • extract selected pieces from each XML document and leave other parts of the XML unparsed and “as-is”
  • extend your XML processing by applying XSLT stylesheets to the incoming XML data
And there is also a powerful set of transformation steps that allow you to implement any custom XML transformation that you may need.
A big bonus is the ability to process even very large XML files very efficiently. When batches of XML documents are moved between systems, they are sometimes concatenated into a single large XML document that can be GBs in size. Such mega-documents typically contain many independent business objects and often need to be split by the consumer. DataStage 8.5 handles documents of 20GB or larger very efficiently and can even parse and process a single large document with multiple threads in parallel! This is very cool and a big win for performance

ETL Tools – A comparison

ETL tool Company
IIS Suite – Datastage IBM (Ascential)
Powercenter Informatica Informatica
Ab Initio  Ab Initio
Cognos Decision Stream IBM(Cognos)
Oracle Warehouse Builder/Oracle Data Integrator  Oracle
SAS Data Integration studio SAS
Micosoft SQL Server Integration Srvs Micosoft
Talend Open Studio Talend
Data services SAP BusinessObjects
Elixir Repertoire Elixir
Data Migrator Information Builders
Pentaho Data Integration Pentaho
Adeptia Integration Suite Adeptia

Datastage and Informatica are the leading ETL tools. Each has its own advantages, these are compared here.

Datastage Certification

Datastage V 8.5  – Test 000-421 (IBM link – http://www-03.ibm.com/certify/tests/ovr421.shtml)

Get free Dumps cheap  – http://ds.iexpertify.com/2013/09/datastage-certification-dumps-for-v8-5-000-421.html
(Note for copyright: iexpertify.com does not host datastage certification dumps)

Test information:
  • Number of questions: 65
  • Time allowed in minutes: 90
  • Required passing score: 65%

To register, click on the IBM link and follow the IBM at Prometric link at the right of the webpage.

If you are taking the IIS Datastage certification, these are the topics you will be tested on:

Section 1 – Configuration (6%)
  1. Describe how to properly configure DataStage v8.5
  2. Identify tasks required to create and configure a project to be used for v8.5 jobs
  3. Given a configuration file, identify its components and its overall intended purpose

Section 2 – Metadata (6%)
  1. Demonstrate knowledge of Orchestrate schema
  2. Identify the method of importing, sharing, and managing metadata
  3. Demonstrate knowledge of runtime column propagation

Section 3 – Persistent Storage (10.5%)
  1. Explain the process of importing/exporting data to/from framework (e.g., sequential file, external source/target)
  2. Describe proper use of a sequential file
  3. Describe proper usage of FileSets and DataSets
  4. Describe use of FTP stage for remote data
  5. Describe use of restructure stages (e.g., column import/export)
  6. Identify importing/exporting of XML data

Section 4 – Parallel Architecture (9%)
  1. Demonstrate proper use of data partitioning and collecting
  2. Demonstrate knowledge of parallel execution

Section 5 – Datatbases (9%)
  1. Demonstrate proper selection of database stages and database specific stage properties
  2. Identify source database options
  3. Demonstrate knowledge of target database options

Section 6 – Data Transformation (12%)
  1. Demonstrate knowledge of default type conversions, output mappings, and associated warnings
  2. Demonstrate proper selections of Transformer stage vs. other stages
  3. Describe Transformer stage capabilities (including: stage variables, link variables, DataStage macros, constraints, system variables, link ordering, @PART NUM, functions
  4. Demonstrate the use of Transformer stage variables (e.g., to identify key grouping boundaries on incoming data)
  5. Identify process to add functionality not provided by existing DataStage stages. (e.g., wrapper, BuildOps, user def functions/routines)
  6. Demonstrate proper use of SCD stage
  7. Demonstrate job design knowledge of using RCP (modify, filter, dynamic transformer)
  8. Demonstrate knowledge of Transformer Stage input and output loop processing (e.g., LastRecord(), LastRowInGroup(), SaveRecord(), etc.)

Section 7 – Job Components (12%)
  1. Demonstrate knowledge of Join, Lookup and Merge stages
  2. Demonstrate knowledge of SORT stage
  3. Demonstrate understanding of Aggregator stage
  4. Describe proper usage of change capture/change apply
  5. Demonstrate knowledge of Real-time components

Section 8 – Job Design (9%)
  1. Demonstrate knowledge of shared containers
  2. Describe how to minimize Sorts and repartitions
  3. Demonstrate knowledge of creating restart points and methodologies
  4. Demonstrate proper use of standards
  5. Explain the process necessary to run multiple copies of the source (job multi-instance)

Section 9 – Monitor and Troubleshoot (7%)
  1. Demonstrate knowledge of parallel job score
  2. Identify and define environment variables that control DataStage v8.5 with regard to added functionality and reporting
  3. Given a process list, identify conductor, section leader, and player process
  4. Identify areas that may improve performance (e.g., buffer size, repartitioning, config files, operator combination, etc.)
  5. Demonstrate knowledge of runtime metadata analysis and performance monitoring

Section 10 – Job Management and Deployment (10.5%)
  1. Demonstrate knowledge of advanced find
  2. Demonstrate knowledge and the purpose of impact analysis
  3. Demonstrate knowledge and purpose of job compare
  4. Articulate the change control process
  5. Source Code Control Integration

Section 11 – Job Control and Runtime Management (6%)
  1. Demonstrate knowledge of message handlers
  2. Identify the use of dsjob command line utility
  3. Demonstrate ability to use job sequencers (e.g., exception hunting, re-startable, dependencies, passing return value from routing, parameter passing and job status)