Failed job – **** Parallel startup failed ****


A parallel DataStage job whose configuration file is set up to run multiple nodes on a single server fails with the error:
Message: main_program: **** Parallel startup failed ****

Resolving the problem

The full text for this “parallel startup failed” error provides some additional information about possible causes:
    This is usually due to a configuration error, such as not having the Orchestrate install directory properly mounted on all nodes, rsh permissions not correctly set (via /etc/hosts.equiv or .rhosts), or running from a directory that is not mounted on all nodes. Look for error messages in the preceding output.

For the situation where a site is attempting to run multiple nodes on multiple server machines, the above statement is correct. More information on setting up ssh/rsh and parallel processing can be found in the following topics: 
Configuring remote and secure shells
Configuring a parallel processing environment

However, in the case where all nodes are running on a single server machine, the “Parallel startup failed” message is usually an indication that the fastname defined in the configuration file does not match the name output by the server’s “hostname” command. 

In a typical node configuration file, the server name where each node runs is indicated by the fastname. For example, in /opt/IBM/InformationServer/Server/Configurations/default.apt:

  {
    node "node1"
    {
      fastname "server1"
      pools ""
      resource disk "/opt/resource/node1/Datasets" {pools ""}
      resource scratchdisk "/opt/resource/node1/Scratch" {pools ""}
    }
    node "node2"
    {
      fastname "server1"
      pools ""
      resource disk "/opt/resource/node2/Datasets" {pools ""}
      resource scratchdisk "/opt/resource/node2/Scratch" {pools ""}
    }
  }

Log in to the DataStage server machine and, at the operating system command prompt, enter the command:

    hostname

If the hostname output EXACTLY matches the fastname defined for local nodes, then the job will run correctly on that server. However, if the “hostname” command outputs the hostname in a different format (such as with the domain name appended), then the nodes defined with that fastname will be treated as remote nodes, and DataStage will make an attempt to access them via rsh/ssh, which fails.
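The comparison described above can be scripted. The following is a minimal sketch, not an IBM-supplied tool: the function names list_fastnames and check_fastnames are hypothetical, and it assumes each fastname entry sits on its own line in the form fastname "name", as in the example configuration file.

```shell
#!/bin/sh
# Sketch: compare every fastname in a parallel configuration file with a
# given host name. Function names are illustrative, not part of DataStage.

# Print the unique fastname values found in the configuration file "$1".
list_fastnames() {
    sed -n 's/^[[:space:]]*fastname[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Compare each fastname in file "$1" with the host name "$2".
# Returns 0 if every fastname matches exactly, 1 otherwise.
check_fastnames() {
    config_file=$1
    host=$2
    status=0
    for name in $(list_fastnames "$config_file"); do
        if [ "$name" = "$host" ]; then
            echo "OK: fastname \"$name\" matches hostname"
        else
            echo "MISMATCH: fastname \"$name\" differs from hostname \"$host\""
            status=1
        fi
    done
    return $status
}
```

After sourcing the functions, a check against the running server could look like: check_fastnames /opt/IBM/InformationServer/Server/Configurations/default.apt "$(hostname)". The comparison is an exact string match on purpose, because a name that merely resolves to the same address is not sufficient.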

Using the above example, if the hostname output was server1 with a domain name appended rather than plain server1, then just before the “Parallel startup failed” error in the job log you will likely see the following error:

    Message: main_program: Accept timed out retries = 4
    server1: Connection refused

The above problem will occur even if your /etc/hosts file maps server1 and the domain-qualified hostname to the same address. The cause is not an inability to resolve either name, but rather that the fastname in the node configuration file does not exactly match the system hostname (or the value of APT_PM_CONDUCTOR_NODE, if defined).

You have two options to deal with this situation:

  • Change the fastname for the nodes in the configuration file to exactly match the output of the hostname command.
  • Set APT_PM_CONDUCTOR_NODE to the same value as fastname. This would need to be defined either in every project or in every job.

You should NOT change the hostname of the server to match fastname. Information Server / DataStage stores some information based on the current hostname. If you change the hostname after installing Information Server / DataStage, you will need to contact the support team for additional instructions to allow DataStage to work correctly with the new hostname.
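The option of changing fastname in the configuration file can be sketched as a small shell function. This is an illustration under assumptions, not a supported procedure: rewrite_fastname is a hypothetical name, the sketch assumes one fastname "name" entry per line, and it is only appropriate when every node with the old name really does run on the local server. It writes the modified configuration to standard output and leaves the original file untouched.

```shell
#!/bin/sh
# Sketch: print a copy of configuration file "$1" in which every fastname
# equal to "$2" is replaced by "$3". The function name is illustrative.
rewrite_fastname() {
    config_file=$1
    old_name=$2
    new_name=$3
    # Exact string comparison on the quoted value, so a dot in a host name
    # is never treated as a regular-expression wildcard.
    awk -v old="\"$old_name\"" -v new="\"$new_name\"" '
        $1 == "fastname" && $2 == old { sub(/"[^"]*"/, new) }
        { print }
    ' "$config_file"
}
```

For example, rewrite_fastname default.apt server1 "$(hostname)" > default.fixed.apt would produce a new file (default.fixed.apt is a placeholder name) that can be reviewed before it replaces the original.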
