June 21, 2012 Sort Stage Basics Like in many data related processes, sort plays an important part in etl transformations using Datastage. In fact sorted data is a pre-requisite for a number of stages like remove-duplicate, join, etc.. There are two ways in which you can sort your data in Datastage. One way is to do inline sorts in any stage that you want. For example if you want your data sorted before using it in your transformer you simply have to go to the input -> partitioning tab and check the ‘PERFORM SORT’ option. Even though sort is achievable in this way in some cases people specifically go for the sort stage due to the flexibility, power and variety of options it offers. You can find the sort stage in the processing tab of the palettes section. In order to perform the sort you must provide the keys on which you want the sort performed. If you want to perform the sort sequentially in a parallel job then this must be specified by setting the sort stage to run in sequential mode rather than the parallel mode. This option can be set in the advanced tab of the sort stage as shown below. The choice of running the sort in sequential or parallel mode is entirely up to you based on your business requirement. The choice on where to carry out the sort is entirely dependant on how you have configured your APT_CONFIG_FILE. If you have set a separate pool to carry out the sort operation then sort will only use the scratch disk specified in the sort pool. If such a sort pool is not specified then the scratch disks in the default pool is used for sorting. As a last option it will use the temporary disk spaces specified by the TMPDIR env variable and if thats not there then the /tmp folder is used. Since most of you will be knowing how to configure the sort stage for a basic use I’ve decided to explain certain options in the Sort stage that might be useful when you design your job. Sort Key mode This option will be present under each sort key. This option is important mainly because it allows us to use the sort stage in an efficient manner such that we have an optimally running sort operation. By default the value is set to ‘Sort’. The other two options are ‘Don’t Sort (Previously Sorted)’ and ‘Don’t Sort (Previously Grouped)’. If out of the set of columns we are sorting on if we already know that the data is sorted on a particular column then we can set the sort key mode option to Don’t Sort (Previously Sorted) which will indicate the input records have already been sorted prior to this stage using that particular column. This ensures that we won’t be doing any unnecessary sorts, which thereby increases your job performance. Supposing you have data as below A B C 1 127 AAA 1 124 BBB 1 156 CCC 2 09 DDD 2 540 EEE Now if we specify the sorting keys as A and B and if you are only interested in seeing data in a group(specified by column A in this case) suitably sorted then it makes sense to set the sort key mode for key ‘A’ to Set to Don’t Sort (Previously Grouped) to indicate that input records are already grouped by this particular column, but not sorted. This will make the sort stage only sort on column B. Sort Utility There are two different types of sorts which you can use in Datastage. They are mentioned below – Datastage – This is the default sort used by Datastage and uses the tsort operator. It does not require any other application to work – UNIX – As the name indicates this carries out the sort operation using the Unix sort command. For this the psort operator will be used by Datastage. Cluster Key change column and create key change column There will be requirements in you design in which you will want to know the first record in a group(specified by the sort keys) from the data in your dataset. This can be easily achieved by set the key change option to true. Even though both of the above options do the same thing they are different in the sense that they will have to be used based on the sort key mode specified. If you have used the sort key mode as Don’t sort for the keys during sorting then you will have to use the ‘Cluster Key Change column’ option. If you have used the default sort key mode then you need only use the ‘create key change column’ option. Setting these options will ensure that an extra column is added to your output records and this column will contain a value of 1 if the particular record in that group (where group is indicated by the change in sort key value) is the first record. All following records in that group will have a value of 0. eg. If the data shown above was sorted on the key A and the key change option was set then the output data would look like this A B C ClusterKeyChange 1 127 AAA 1 1 124 BBB 0 1 156 CCC 0 2 09 DDD 1 2 540 EEE 0 Restrict memory usage By default the value is 20MB. This restricts the sort stage to carry outs its operation to a certain amount of virtual memory on the node on which it is being processed. You can increase this to increase your sort rate. However you should remember that the size should never be more than the amount of actual physical memory that is actually available on the processing node.