September 28, 2014

What is the Merge stage?

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links (according to the DataStage documentation).

The Merge stage combines a master data set with one or more update data sets based on the key columns. The output record contains all the columns from the master record plus any additional columns from each update record that are required. A master record and an update record are merged only if both have the same key column values.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.

As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.

Unlike the Join and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific to that update link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows or output them on the output data link.
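To picture the merge behaviour outside DataStage, here is a small hypothetical Python sketch (not DataStage code; the column names are invented): master rows pick up extra columns from a matching update row, unmatched masters are kept or dropped, and unmatched update rows go to that link's reject output.

```python
# Conceptual sketch of Merge-stage semantics (not DataStage code; names invented).
master = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Bob"},
]
updates = [                                   # one update data set (one update link)
    {"cust_id": 1, "city": "Houston"},
    {"cust_id": 3, "city": "Dallas"},         # no master with cust_id 3 -> rejected
]

update_by_key = {row["cust_id"]: row for row in updates}   # assumes duplicates removed

merged = []
for m in master:
    upd = update_by_key.pop(m["cust_id"], None)
    if upd is not None:
        merged.append({**m, **upd})           # master columns plus update columns
    else:
        merged.append(dict(m))                # unmatched master kept (could also be dropped)

rejected = list(update_by_key.values())       # update rows with no master -> reject link
print(merged)
print(rejected)
```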
September 27, 2014

When to use the Join stage vs the Lookup stage?

The major consideration is the volume of the data. One of the most common mistakes developers make is not doing a volumetric analysis before deciding between the Join and Lookup stages.

There are two data sets being combined. One is the primary or driving data set, sometimes called the left side of the join. The other is the reference data set, or the right side of the join.

If the reference data sets take up a large amount of memory relative to the physical RAM of the machine DataStage is running on, a Lookup stage might crash because the reference data sets might not fit in RAM along with everything else that has to be in RAM. In that case, a Join is preferred. Conversely, if the reference data sets are not big enough to cause trouble, use a Lookup. Large reference data sets cause slow Lookup performance because each lookup operation can, and typically will, cause a page fault and an I/O operation.
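A rough way to see why the memory profiles differ, as a hypothetical Python sketch (not DataStage internals; field names are made up): a lookup builds the entire reference set as an in-memory hash table, while a join over pre-sorted inputs only needs the current row from each side.

```python
# Conceptual sketch only: Lookup is memory-bound, a sorted Join is not.
def lookup(driving, reference, key):
    """Lookup: the whole reference data set is held in memory as a hash table."""
    ref_index = {row[key]: row for row in reference}      # must fit in RAM
    for row in driving:
        match = ref_index.get(row[key])
        if match is not None:
            yield {**row, **match}

def merge_join(left_sorted, right_sorted, key):
    """Join: both inputs arrive sorted on the key, so only the current row
    from each side needs to be in memory at any one time."""
    right_iter = iter(right_sorted)
    r = next(right_iter, None)
    for l in left_sorted:
        while r is not None and r[key] < l[key]:
            r = next(right_iter, None)
        if r is not None and r[key] == l[key]:
            yield {**l, **r}
```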
April 29, 2014

Why a Data Warehouse?

Why do we need a data warehouse? Heterogeneities are everywhere. Enterprises tend to rely on different databases, files, external sources, and even the internet and social media for their information. This means:

- Different interfaces
- Different data representations
- Duplicate and inconsistent information

Data management in large enterprises:

- Vertical fragmentation of informational systems (vertical stove pipes)
- Result of application (user)-driven development of operational systems

The goal of building a warehouse is to have a system that:

- Collects and combines information
- Provides an integrated view and a uniform user interface
- Supports sharing
September 24, 2013

Slowly Changing Dimension/SCD in DataStage

To understand what a Slowly Changing Dimension is, let us first understand this: most frequently, we insert into fact tables. Very infrequently we update facts that were loaded incorrectly. It is even less likely that we delete rows from a fact table; about the only time this happens is when you archive old data that is no longer relevant for the current portion of the data warehouse.

Dimension tables, on the other hand, are likely to see frequent updates. A classic example of this is a customer dimension: people change names, phone numbers, addresses, etc. regularly. Dimensions that change over time are referred to as slowly changing dimensions (SCD).

Type 1: overwrite the existing value

Customer_key  Customer_name  Customer_city  Customer_state
123           Ms. Brown      Houston        Texas
123           Ms. Brown      Dallas         Texas

Slowly Changing Dimension – SCD Type 1

Type 2: create a new record with effective dates

Customer_key  Customer_name  Customer_city  Customer_state  Effective_date  Obsolete_date
123           Ms. Brown      Houston        Texas           1/1/2000        1/1/2005
234           Ms. Brown      Dallas         Texas           1/1/2005        NULL

Slowly Changing Dimension – SCD Type 2

Type 3: add new and old columns, no new rows

Customer_key  Customer_name  Customer_city  Customer_state  Change_date  New_city
123           Ms. Brown      Houston        Texas           1/1/2005     Dallas

Slowly Changing Dimension – SCD Type 3
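A minimal sketch of how an ETL job might apply Type 1 and Type 2 changes, in illustrative Python rather than the DataStage SCD stage; the column names follow the example tables above and the helper functions are hypothetical.

```python
from datetime import date

# Illustrative only; column names follow the example tables above.
dim = [{"customer_key": 123, "name": "Ms. Brown", "city": "Houston",
        "state": "Texas", "effective_date": date(2000, 1, 1), "obsolete_date": None}]

def apply_type1(dim, customer_name, new_city):
    """Type 1: overwrite the attribute in place; no history is kept."""
    for row in dim:
        if row["name"] == customer_name and row["obsolete_date"] is None:
            row["city"] = new_city

def apply_type2(dim, customer_name, new_city, change_date, new_key):
    """Type 2: expire the current row and insert a new row with a new
    surrogate key and effective dates, preserving history."""
    for row in dim:
        if row["name"] == customer_name and row["obsolete_date"] is None:
            row["obsolete_date"] = change_date
            dim.append(dict(row, customer_key=new_key, city=new_city,
                            effective_date=change_date, obsolete_date=None))
            break

apply_type2(dim, "Ms. Brown", "Dallas", date(2005, 1, 1), new_key=234)
for row in dim:
    print(row)
```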
September 11, 2013

What is a Data Warehouse?

A data warehouse is a central integrated database containing data from all the operational sources and archive systems in an organization. It contains a copy of transaction data specifically structured for query and analysis. This database can be accessed by all users, ensuring that each group in an organization is accessing valuable, stable data.

A data warehouse is a structured repository of historic data. It is developed in an evolutionary process by integrating data from non-integrated legacy systems. It is usually:

- Subject oriented
- Integrated
- Time variant
- Non-volatile

Next: understand why we need a data warehouse.
January 31, 2013

DataStage History

In the beginning DataStage was available as Ardent DataStage; its name was later changed to Ascential DataStage. In 2005 IBM acquired Ascential and added the DataStage product to its WebSphere family. The October 2006 release of DataStage integrated it into the new IBM Information Server platform, though it was still called WebSphere DataStage. IBM WebSphere DataStage is one of the products from IBM's WebSphere Information Integration suite and Information Server. The tool uses mostly graphical notation to present data integration solutions. DataStage is delivered in various versions, such as Server and Enterprise Edition. In 2008 the suite was renamed InfoSphere Information Server and the product was renamed InfoSphere DataStage.

For more detailed early history, see below:

DataStage was conceived at VMark, a spin-off from Prime Computer that developed two notable products: the UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in 1996. Lee Scheffler presented the DataStage product overview to the board of VMark in 1996 and it was approved for development. The product was in alpha testing in October, beta testing in November, and was generally available in 1997. VMark acquired UniData in 1997 and renamed itself Ardent Software. In 1999 Ardent Software was acquired by Informix, the database software vendor. In 2001 IBM acquired Informix and took just the database business, leaving the data integration tools to be spun off as an independent software company called Ascential Software.

Various edition names, as people refer to various DataStage options using these names:

- Enterprise Edition – the name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
- Server Edition – the name of the original version of DataStage, representing Server jobs. Early DataStage versions only contained Server jobs; DataStage 5 added Sequence jobs and DataStage 6 added Parallel jobs via Enterprise Edition.
- MVS Edition – mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
- DataStage for PeopleSoft – a Server Edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopleSoft and Oracle Corporation.
- DataStage TX – for processing complex transactions and messages; formerly known as Mercator.
- DataStage SOA – the Real Time Integration pack, which can turn server or parallel jobs into SOA services.
January 5, 2013

What are Stage Variables, Derivations and Constants?

Stage Variable – an intermediate processing variable that retains its value during a read and does not pass the value into a target column.
Derivation – an expression that specifies the value to be passed on to the target column.
Constant – a condition that is either true or false and that controls the flow of data along a link.
December 17, 2012

What is the IIS Suite?

IIS = IBM InfoSphere Information Server. IIS from IBM could mean:

- InfoSphere Information Server Workgroup Edition (IIS WE) = data integration or ETL (DataStage, QualityStage, Information Analyzer, and Metadata Workbench)
- InfoSphere Information Server for Data Warehousing (IIS DW) = data warehousing for the Smart Analytics System

IIS consists of:

- DataStage: IBM's main ETL and data integration tool
- QualityStage: IBM's main data quality tool (needs to be licensed separately)
- FastTrack: write the mapping spec used to generate DataStage and QualityStage jobs and reports
- Business Glossary: to edit business meaning/the data dictionary
- Blueprint Director: links everything together: metadata, ETL, the data model
- Information Analyzer: to understand the content, structure and quality of the data
- Information Services Director: to deploy DataStage/QualityStage jobs as web services or EJBs
- Metadata Workbench: creates data lineage between databases, files and BI models
- Metadata Server: stores operational metadata, such as how many rows were written

Links to IBM documentation:

- IIS Info Centre, in particular the Learning section, where they provide an intro, and the Roadmap section, where they explain integration with Cognos and Cloud, as well as reporting and metadata.
- IIS Product Documentation, where we can find not only release 8.5 documentation but also all previous releases. The documentation is very exhaustive, covering DataStage, QualityStage, Blueprint Director, Business Glossary, Metadata Workbench, and Information Analyzer.
December 16, 2012

OLAP – What is OLAP?

OLAP stands for On-Line Analytical Processing. OLAP is a technology used to organize large business databases and support business intelligence. OLAP databases are divided into one or more cubes, and each cube is organized and designed by a cube administrator to fit the way that you retrieve and analyze data.

OLAP databases and cubes

Online Analytical Processing (OLAP) databases facilitate business-intelligence queries. OLAP is a database technology that has been optimized for querying and reporting rather than for processing transactions. The source data for OLAP is Online Transactional Processing (OLTP) databases, commonly stored in data warehouses. OLAP data is derived from this historical data and aggregated into structures that permit sophisticated analysis. OLAP data is also organized hierarchically and stored in cubes instead of tables. It is a sophisticated technology that uses multidimensional structures to provide rapid access to data for analysis.

OLAP – volumes and complexity

OLAP is characterized by a relatively low volume of transactions, but the queries are often very complex and response times are longer than in OLTP. An OLAP database holds aggregated, historical data stored in multi-dimensional schemas. The first attempt to provide a definition of OLAP was by Dr. Codd, who proposed 12 rules for OLAP. It is not to be confused with OLTP (Online Transaction Processing).

For people on the business side, the key feature out of the above list is "Multidimensional": in other words, the ability to analyze metrics across different dimensions such as time, geography, gender, product, etc. For example, sales for the company are up. Which region is most responsible for this increase? Which store in this region is most responsible for the increase? Which particular product category or categories contributed the most to the increase? Answering these types of questions in order means that you are performing an OLAP analysis.

Depending on the underlying technology used, OLAP can be broadly divided into two different camps: MOLAP and ROLAP.
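To make the drill-down idea concrete, here is a small hypothetical Python sketch: a fact table of sales rows with region, store and product_category dimensions, rolled up at progressively finer grain. The data, field names and rollup helper are invented for illustration, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts; field names and values are invented for illustration.
sales = [
    {"region": "West", "store": "W01", "product_category": "Toys",    "amount": 120.0},
    {"region": "West", "store": "W02", "product_category": "Games",   "amount": 300.0},
    {"region": "East", "store": "E01", "product_category": "Toys",    "amount":  80.0},
    {"region": "East", "store": "E01", "product_category": "Apparel", "amount":  50.0},
]

def rollup(facts, dims):
    """Aggregate the amount measure over the given list of dimensions."""
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)

# Drill down: region -> store within region -> product category within store.
print(rollup(sales, ["region"]))
print(rollup(sales, ["region", "store"]))
print(rollup(sales, ["region", "store", "product_category"]))
```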
June 21, 2012

Sort Stage Basics

As in many data-related processes, sorting plays an important part in ETL transformations using DataStage. In fact, sorted data is a prerequisite for a number of stages such as Remove Duplicates, Join, etc. There are two ways in which you can sort your data in DataStage. One way is to do inline sorts in any stage that you want. For example, if you want your data sorted before using it in your Transformer, you simply go to the Input -> Partitioning tab and check the 'Perform sort' option. Even though a sort is achievable this way, in some cases people specifically go for the Sort stage because of the flexibility, power and variety of options it offers. You can find the Sort stage in the Processing section of the palette.

In order to perform the sort you must provide the keys on which you want the sort performed. If you want to perform the sort sequentially in a parallel job, this must be specified by setting the Sort stage to run in sequential mode rather than parallel mode. This option can be set in the Advanced tab of the Sort stage. The choice of running the sort in sequential or parallel mode is entirely up to you, based on your business requirement.

Where the sort is carried out is entirely dependent on how you have configured your APT_CONFIG_FILE. If you have set up a separate pool for the sort operation, the sort will only use the scratch disks specified in that sort pool. If such a sort pool is not specified, the scratch disks in the default pool are used for sorting. As a last resort, it will use the temporary disk space specified by the TMPDIR environment variable, and if that is not set, the /tmp folder is used.

Since most of you will know how to configure the Sort stage for basic use, I've decided to explain certain options in the Sort stage that might be useful when you design your job.

Sort Key Mode

This option is present under each sort key. It is important mainly because it allows us to use the Sort stage efficiently, so that we have an optimally running sort operation. By default the value is set to 'Sort'. The other two options are 'Don't Sort (Previously Sorted)' and 'Don't Sort (Previously Grouped)'. If, out of the set of columns we are sorting on, we already know that the data is sorted on a particular column, then we can set the sort key mode for that column to 'Don't Sort (Previously Sorted)', which indicates that the input records have already been sorted on that column prior to this stage. This ensures that we won't be doing any unnecessary sorts, which thereby increases your job's performance.

Suppose you have data as below:

A  B    C
1  127  AAA
1  124  BBB
1  156  CCC
2  09   DDD
2  540  EEE

Now, if we specify the sorting keys as A and B, and you are only interested in seeing the data within a group (specified by column A in this case) suitably sorted, then it makes sense to set the sort key mode for key A to 'Don't Sort (Previously Grouped)' to indicate that the input records are already grouped by this particular column, but not sorted. This makes the Sort stage sort only on column B; a small sketch of this appears after the Sort Utility section below.

Sort Utility

There are two different types of sort which you can use in DataStage:

- DataStage – the default sort used by DataStage, which uses the tsort operator. It does not require any other application to work.
- UNIX – as the name indicates, this carries out the sort operation using the UNIX sort command. For this, the psort operator is used by DataStage.
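Here is a rough way to picture the 'Don't Sort (Previously Grouped)' behaviour outside DataStage: instead of re-sorting the whole data set on (A, B), only each existing group of A is sorted on B. This is a hypothetical Python sketch, not the tsort operator.

```python
from itertools import groupby

# Data already grouped (but not sorted) on column A, as in the example above.
rows = [
    {"A": 1, "B": 127, "C": "AAA"},
    {"A": 1, "B": 124, "C": "BBB"},
    {"A": 1, "B": 156, "C": "CCC"},
    {"A": 2, "B": 9,   "C": "DDD"},
    {"A": 2, "B": 540, "C": "EEE"},
]

# "Don't Sort (Previously Grouped)" on A: only sort each A-group on B,
# instead of re-sorting the whole data set on (A, B).
result = []
for _, group in groupby(rows, key=lambda r: r["A"]):
    result.extend(sorted(group, key=lambda r: r["B"]))

for r in result:
    print(r["A"], r["B"], r["C"])
```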
Cluster Key Change Column and Create Key Change Column

There will be requirements in your design where you want to know the first record in a group (specified by the sort keys) in your data set. This can easily be achieved by setting the key change option to true. Even though both of the above options do the same thing, they differ in that the one to use depends on the sort key mode specified. If you have used a 'Don't Sort' key mode for the keys during sorting, then you have to use the 'Cluster Key Change Column' option. If you have used the default sort key mode, you need only use the 'Create Key Change Column' option. Setting these options ensures that an extra column is added to your output records; this column contains a value of 1 if the record is the first record in its group (where a group is indicated by a change in the sort key value). All following records in that group have a value of 0.

For example, if the data shown above were sorted on key A and the key change option were set, the output data would look like this (a small sketch of the behaviour follows at the end of this post):

A  B    C    ClusterKeyChange
1  127  AAA  1
1  124  BBB  0
1  156  CCC  0
2  09   DDD  1
2  540  EEE  0

Restrict Memory Usage

By default the value is 20 MB. This restricts the Sort stage to carrying out its operation within a certain amount of virtual memory on the node on which it is being processed. You can increase this value to increase your sort rate. However, you should remember that the size should never be more than the amount of physical memory actually available on the processing node.
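The key change column described above can be pictured with a short hypothetical Python sketch (not DataStage code): walk the sorted records and emit 1 whenever the group key changes, 0 otherwise.

```python
# Hypothetical sketch of the key change column: 1 for the first record of each
# group (group defined by column A), 0 for every following record in the group.
rows = [
    {"A": 1, "B": 127, "C": "AAA"},
    {"A": 1, "B": 124, "C": "BBB"},
    {"A": 1, "B": 156, "C": "CCC"},
    {"A": 2, "B": 9,   "C": "DDD"},
    {"A": 2, "B": 540, "C": "EEE"},
]

previous_key = object()          # sentinel so the very first row gets a 1
for row in rows:
    row["keyChange"] = 1 if row["A"] != previous_key else 0
    previous_key = row["A"]
    print(row["A"], row["B"], row["C"], row["keyChange"])
```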