January 26, 2013 \ Ananth TM

Datastage 7.5 certification

Per the official IBM website (http://www-03.ibm.com/certify/tests/obj415.shtml), below are the areas to prepare in order to pass the test: Test 415 – IBM WebSphere IIS DataStage Enterprise Edition v7.5.

Section 1 – Installation and Configuration (5%)
- Describe how to properly install and configure DataStage EE
  - Describe users and groups
  - Describe the environment (e.g., dsenv, ODBC)
  - Set up database connectivity
  - Describe OS configuration/kernel settings
  - Describe USS configuration
  - Identify required components for the server (e.g., the C++ compiler)
- Identify the tasks required to create and configure a project to be used for EE jobs
  - Project location
  - Assign DataStage EE roles
  - Environment defaults
- Given a configuration file, identify its components and its overall intended purpose (a sample file follows this section)
- List the steps necessary to start/stop DataStage EE properly (see the commands below)
  - netstat -a | grep ds
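A note on the configuration file objective: it pays to have actually read one before the exam. Here is a minimal two-node configuration file of the kind the test asks you to interpret; the host name and disk paths are invented for illustration.

    {
        node "node1"
        {
            fastname "etlhost"
            pools ""
            resource disk "/data/ds/node1" {pools ""}
            resource scratchdisk "/scratch/ds/node1" {pools ""}
        }
        node "node2"
        {
            fastname "etlhost"
            pools ""
            resource disk "/data/ds/node2" {pools ""}
            resource scratchdisk "/scratch/ds/node2" {pools ""}
        }
    }

Each node block contributes one degree of parallelism: fastname names the physical host, pools assigns the node to named node pools, and the disk and scratchdisk resources tell the framework where to write DataSet segments and temporary sort work.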
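For the start/stop objective, the usual sequence on a UNIX 7.5 install looks roughly like this; the install path is illustrative, so treat it as a sketch rather than the vendor-documented procedure.

    cd /opt/Ascential/DataStage/DSEngine    # $DSHOME (path varies by install)
    . ./dsenv                               # source the engine environment first
    ./bin/uv -admin -stop                   # stop the engine once clients are off
    netstat -a | grep ds                    # confirm the dsrpc listener is gone
    ./bin/uv -admin -start                  # bring the engine back up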
Section 2 – Metadata (5%)
- Demonstrate knowledge of the Orchestrate schema (a sample schema follows Section 3)
  - Distinguish internal data types (Orchestrate schema) vs external data types
  - Describe how to set extended properties for a table definition
  - Import metadata using plug-ins vs orchdbutil
  - Explain nullability mapping rules (e.g., source vs target)
  - NLS data types
- Identify the methods of importing metadata
  - Flat sources (e.g., sequential file, Orchestrate schema, ODBC, etc.)
  - COBOL copybooks
  - XML
- Given a scenario, demonstrate knowledge of runtime column propagation (RCP)
  - Usage
  - Impact on stage mapping and target databases

Section 3 – Persistent Storage (10%)
- Given a scenario, explain the process of importing/exporting data to/from the framework (e.g., sequential file, external source/target)
  - Explain the use of the various file stages (e.g., Sequential File, CFF, FileSet, DataSet) and where each is appropriate
  - If USS, define the native file format (e.g., EBCDIC, VSAM)
- Given a scenario, describe proper use of a sequential file
  - Read in parallel (e.g., reader per node, multiple files)
  - Handle various formats (e.g., fixed vs variable length, delimited vs non-delimited, etc.)
  - Describe how to import and export nullable data
  - Explain how to identify and capture rejected records (e.g., log counts, using a reject link, options for rejection)
- Given a scenario, describe proper usage of CFF (the native stage, not the plug-in)
  - Explain how to import data from a file that has multiple varying record types (e.g., COBOL copybook, EBCDIC to ASCII)
- Describe proper usage of FileSets and DataSets (an orchadmin sketch follows this section)
  - Explain the differences and similarities of FileSets and DataSets (e.g., header and data file segments; internal format (DS) vs external format (FS))
  - Determine which tools can be used to manage FileSets and DataSets (GUI and CLI)
- Describe use of the FTP stage for remote data (e.g., how to run in parallel, plug-in vs enterprise)
- Restructure stages (e.g., Column Import/Column Export)
- Identify importing/exporting of XML data
  - XML stage options and usage
  - XPath and XSLT

Section 4 – Parallel Architecture (10%)
- Given a scenario, demonstrate proper use of data partitioning and collecting
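Picking up Section 2's Orchestrate schema objective: a small schema file makes the internal vs external type distinction concrete. The column names below are invented.

    record (
        cust_id: int32;
        name: string[max=30];
        balance: nullable decimal[8,2];
        created: date;
    )

int32, string, decimal, and date are internal (framework) types; the external types are whatever the source or target declares, such as INTEGER or VARCHAR(30) in a database. The nullable keyword is what the nullability mapping rules act on when a non-nullable source column feeds a nullable target or vice versa.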
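The CLI referred to in Section 3 for managing DataSets is orchadmin (dsrecords also exists for record counts). A typical session might look like the following; the descriptor path is invented, and subcommand spellings are worth confirming with orchadmin help on your release.

    . $DSHOME/dsenv                             # environment first, so the config file is found
    orchadmin describe /data/out/customers.ds   # show the schema and segment layout
    orchadmin dump /data/out/customers.ds       # print the records
    orchadmin rm /data/out/customers.ds         # remove the descriptor AND the data segments

The last line is why a plain rm on a .ds descriptor is a mistake: the data file segments it names would be left orphaned on the resource disks.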
  - Partitioning goals (e.g., data distribution, meeting data-grouping requirements)
  - Explain the differences and usage of round robin, entire, same, and hash partitioning
  - Explain the differences between the auto, ordered, and sort merge collectors
  - Identify the partitioning type and parallel/sequential execution by analyzing a DataStage EE screenshot
  - Explain the differences between partitioning keys and sorting (stage) keys through a scenario where these do not match
  - Demonstrate proper use of partitioning methods based on given business requirements and DataStage EE technical requirements
  - Explain the differences between a Funnel stage and a collector
  - Describe the impact of partitioning and re-partitioning in an MPP/cluster environment
- Given a scenario, demonstrate knowledge of parallel execution
  - Given a job design and configuration file, estimate the number of processes generated at runtime
  - Explain the purpose and use of resource and node pools
  - Given a source DataSet, describe the degree of parallelism using auto and same partitioning

Section 5 – Databases (15%)
- Given a scenario, demonstrate proper selection of database stages and database-specific stage properties
  - DB2
    - Based on DB2 connectivity requirements, select the appropriate stage (e.g., DB2 API, DB2 Enterprise)
    - List the environment variables needed to successfully run a DB2 job
  - Teradata
    - Based on functional requirements, select the appropriate Teradata stage (e.g., TD Enterprise, TD MultiLoad (target only), TD API)
    - Requested sessions and total sessions
  - DRS, ODBC, Informix, Sybase
  - Oracle
    - Remote server
    - Partitioned tables
    - Given a scenario using Oracle as a load target, demonstrate proper use of index mode and $APT_ORACLE_LOAD_OPTIONS (a sketch follows Section 6)
- Identify source database options
  - Identify which stages/options read in parallel
  - Explain the degree of parallelism for Oracle Enterprise and DB2 Enterprise
  - Identify the use of "sparse" vs "normal" lookup
    - When to use each (e.g., a 1:100 input-to-reference ratio)
    - How to construct a sparse lookup and its SQL statements (an example follows Section 6)
- Given a scenario, demonstrate knowledge of target database options
  - Articulate the benefits and limitations of the parallel load method (e.g., constraints, speed, exclusive locks, etc.)
  - Explain the differences between the upsert and load methods on target database stages
  - Describe how to control restart of a DB target (separate the "L" from ETL): use OPEN, CLOSE, create a temp table, load it, SELECT FROM ... INSERT INTO, DROP
  - Identify the impact of RCP when the target has fewer columns than the job flow
  - Separate update and insert records before a target upsert
- Given a scenario, describe how to design EE ETL jobs that extract data from DB2 (or any DBMS), combine it with data from another source, and load it to another DBMS target
- Demonstrate knowledge of working with NLS database sources and targets

Section 6 – Data Transformation (15%)
- Given a scenario, demonstrate knowledge of default type conversions, output mappings, and associated warnings
  - Demonstrate appropriate use of Modify for NULL handling, string trimming, and non-default conversions (a sketch follows this section)
- Given a scenario, demonstrate proper selection of the Transformer stage vs other stages
  - Copy stage/output mapping
  - No BASIC Transformer
  - Filter vs Transformer constraint (interpreted vs compiled)
- Given a scenario, describe Transformer stage capabilities (including stage variables, link variables, DataStage macros, constraints, system variables, link ordering, @PARTITIONNUM, functions)
  - Explain NULL handling within the parallel Transformer (e.g., the reject rule, NULL functions)
  - Demonstrate the use of Transformer stage variables (e.g., to identify key-grouping boundaries on incoming data)
  - Use only a Transformer to generate a sequence of numbers in parallel (the expression follows this section)
- Identify processes to add functionality not provided by existing DataStage stages (e.g., wrappers, buildops, user-defined functions/routines)
- Given a scenario, demonstrate how to avoid a closed-loop scenario when updating a reference table
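Back to Section 5's sparse vs normal lookup item. A normal lookup reads the reference table into memory once; a sparse lookup fires one SQL statement per input row, which is why the rough 1:100 input-to-reference guideline matters. In the sparse case the user-defined SQL binds input columns by name through the ORCHESTRATE prefix; the table and column names here are invented.

    SELECT acct_id, acct_status, open_date
    FROM accounts
    WHERE acct_id = ORCHESTRATE.acct_id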
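Also from Section 5: $APT_ORACLE_LOAD_OPTIONS passes sqlldr-style options through the Oracle Enterprise stage in load mode. A typical setting, with illustrative values:

    export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=TRUE, PARALLEL=TRUE)'

Remember the interaction with index mode: a direct parallel load cannot maintain indexes in flight, so you either drop and rebuild indexes around the load or choose an index mode and load options that are compatible.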
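For Section 6's Modify item: the stage takes one specification per line in its Specification property. Two hedged examples for NULL handling and string trimming follow; the column names are invented, and the exact function spellings should be checked against the Parallel Job Developer's Guide.

    status = handle_null(status, 'U')
    name = string_trim[' ', end](name)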
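And the classic answer to Section 6's "sequence of numbers in parallel with only a Transformer": an output derivation built from the system variables listed above.

    @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1

Each partition emits a disjoint arithmetic progression (partition 0 gives 1, 1+N, 1+2N, ...; partition 1 gives 2, 2+N, ...), so the numbers are unique across partitions with no coordination between player processes.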
Section 7 – Combining and Sorting Data (10%)
- Demonstrate knowledge of the Join, Lookup, and Merge stages
  - Explain the differences between the Lookup, Join, and Merge stages and demonstrate when to use each
  - Explain link requirements (e.g., input/output/reject)
  - Explain sorting requirements
  - Explain full outer join usage
  - Demonstrate understanding of link ordering with the Join and Merge stages
  - Lookup with "continue" (also an outer join) and non-nullable input columns
  - Explain what happens with duplicate input column names on Join and Merge
- Given a scenario, demonstrate knowledge of the Sort stage
  - Describe the differences between parallel sort, sort, and sequential sort
  - Describe the differences between the unique option of the Sort stage and the Remove Duplicates stage
  - Demonstrate understanding of Sort key column properties (e.g., don't sort, previously sorted)
  - Demonstrate understanding of framework-inserted sorts: examine the score; set $APT_SORT_INSERTION_CHECK_ONLY
  - Distinguish between SQL ORDER BY and parallel sort
  - Explain NLS collation sequences
- Given a scenario, demonstrate understanding of the Aggregator stage
  - Memory impact of the sort vs hash methods
  - Demonstrate use of Aggregator properties (count vs calculation; explicitly create output columns and map them)
  - Take the output of the Aggregator and join it with the detail records
- Describe the proper usage of Change Capture/Change Apply
  - Separate delta rows (e.g., inserts vs updates)

Section 8 – Automation and Production Deployment (10%)
- Given a scenario, articulate the change control process
  - "Dev to QA to prod"
  - Explain the differences between compile and force compile
  - Use of protected projects
  - Import/export, including compiled Transformers
  - Managing custom components (e.g., buildops, wrappers)
  - USS deployment topics (e.g., remote deployment, how to deploy)
- Identify the use of the dsjob command line utility (a sketch closes this post)
  - Explain how a DataStage job may be run using a third-party scheduler (e.g., AutoSys, Control-M, cron, etc.)
  - Explain how performance run statistics from a job run can be captured
  - Capture log records to a text file (XML format)
- Explain how to use parameters to simplify the creation and maintenance of job designs
- Explain the processes necessary to run multiple copies of the same job (multi-instance)
- Given a scenario, demonstrate the ability to use job sequencers (e.g., exception handling, restartability, dependencies, passing a return value from a routine, parameter passing, and job status)
  - Create different designs using the Job Sequencer

Section 9 – Monitoring and Troubleshooting (10%)
- Given a scenario, demonstrate knowledge of the parallel job score
  - Show how it is captured and interpret its results
  - Identify and interpret its contents: number of processes, number of nodes, number of DataSets, combined operators, parallel/sequential execution, inserted buffer and sort operators
- Given a scenario, identify and define the environment variables that control EE with regard to added functionality and reporting (example exports close this post)
  - Articulate the ability to distinguish runtime metadata from design metadata
  - APT_PM_SHOW_PIDS, etc.
- Given a process list, identify the conductor, section leader, and player processes
- Given a scenario, identify areas that may improve performance (e.g., buffer size, repartitioning, configuration files, operator combination, etc.)

Section 10 – Job Design (10%)
- Demonstrate knowledge of shared containers
  - Reuse using RCP
  - Use usage analysis and multi-job compile to recompile jobs that use a modified shared container
- Given a scenario, describe how to minimize sorts and repartitions
- Demonstrate knowledge of creating restart points across jobs using intermediate DataSets
- Given a scenario, demonstrate proper use of standards
  - Naming conventions for stages and links
  - Differences between annotation and description annotation
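To close, the two sketches promised above. First, dsjob from Section 8, which is what third-party schedulers such as AutoSys, Control-M, or cron actually invoke; the project and job names are invented.

    # run a job with a parameter; -jobstatus makes the exit code reflect the job status
    dsjob -run -param LOAD_DATE=2013-01-26 -jobstatus devproj DailyLoad

    # run a second invocation of a multi-instance job
    dsjob -run -jobstatus devproj DailyLoad.INV2

    # capture run statistics and the log after the run
    dsjob -report devproj DailyLoad DETAIL
    dsjob -logsum devproj DailyLoad > DailyLoad.log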
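Second, the reporting variables from Section 9. These are normally set as project defaults in Administrator or as job parameters, but as shell exports they look like this:

    export APT_DUMP_SCORE=1                  # write the parallel job score to the log
    export APT_PM_SHOW_PIDS=1                # show player process IDs at startup
    export APT_RECORD_COUNTS=1               # log per-operator record counts
    export APT_SORT_INSERTION_CHECK_ONLY=1   # inserted sorts verify order instead of sorting

The score dump is the quickest way to see the process counts, combined operators, and inserted buffer and sort operators that Section 9 asks about, and to compare runtime metadata against what you designed.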