Fatal error: Task disappeared from cluster's queue
If you always get Task disappeared from cluster's queue fatal error, you need to check that pidRegex, pidRegexCheckTaskRunning and pidColumnCheckTaskRunning are properly configured in your bds.config
Sometimes clusters fail in ways that the cluster management system cannot detect, let alone report.
It can happen that tasks disappear without any trace from the cluster (this is not as rare as you may think, particularly when executing thousands of tasks per pipeline).
For this reason, bds performs active monitoring, to ensure that tasks are still alive. If any task "mysteriously disappears", bds reports the problem and considers the task as failed.
Incorrect bds.config: Sometimes, the mechanism that bds uses to check tasks is not properly configured in your bds.config.
In this case, bds is unable to find the tasks even though they are running on the cluster, which leads to the fatal error Task disappeared from cluster's queue.
taskID: This is the unique identifier that bds uses internally to identify tasks.
PID: This is the processID / jobID that the cluster management system uses to identify the processes/jobs running in the cluster.
Here is how bds executes tasks on clusters:
- When you run a 'task' bds creates a shell script file.
- then bds runs the proper command to schedule the job (e.g. qsub).
- bds parses the output of the previous command (e.g. qsub) in order to find the job's PID. Parsing is done by trying to find any match of 'pidRegex' (only the first match on the first line is used). If pidRegex is not configured, bds uses the whole line as PID. This is a reasonable default because many cluster systems only output the PID when running the scheduling command.
- This PID is stored and used to check that tasks are running.
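The PID extraction step above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not bds's actual code: the function name `extract_pid` is hypothetical, and two details are assumptions that the text does not state (the first capture group is used when the pattern has one, and the whole line is used when the pattern does not match).

```python
import re
from typing import Optional


def extract_pid(submit_output: str, pid_regex: Optional[str] = None) -> str:
    """Return the PID taken from the first line of the submit command's output."""
    lines = submit_output.splitlines()
    first_line = lines[0].strip() if lines else ""
    if pid_regex is None:
        # No pattern configured: the whole first line is used as the PID.
        return first_line
    match = re.search(pid_regex, first_line)
    if match is None:
        # Assumption: fall back to the whole line when the pattern does not match.
        return first_line
    # Assumption: prefer capture group 1 if the pattern defines one.
    return match.group(1) if match.re.groups > 0 else match.group(0)


line = "Job <18943025> is submitted"
print(repr(extract_pid(line)))                  # whole line becomes the PID
print(repr(extract_pid(line, r"Job <(\d+)>")))  # only the numeric job ID
```

The second call shows why a pattern matters for schedulers that wrap the job ID in a sentence: without it, the stored PID is the full message rather than the number.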
bds checks every few minutes if all tasks are still running in the cluster. If any task mysteriously disappears from the cluster, a fatal error is thrown.
- In order to check if tasks are alive, bds runs a 'stat' command.
- Then bds parses the output of the 'stat' command. It splits the command's output into lines and parses each line using the regular expression defined in 'pidRegexCheckTaskRunning'. If nothing matches, it tries to split the line into columns and use the column defined in 'pidColumnCheckTaskRunning' as PID.
- The PIDs obtained when the tasks were scheduled are matched against all the PIDs found in this 'stat' command.
- If any PID is not found, then a "missing counter" is incremented.
- When the "missing counter" reaches its maximum value (default is 3), the task is declared as missing and a fatal error is issued.
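The checking loop described above can be sketched as follows. This is a rough model under stated assumptions, not the real implementation: the names `parse_stat_pids` and `update_missing` are hypothetical, columns are counted from 0 and split on whitespace, and a task is declared missing once its counter is greater than or equal to the maximum. The text does not say whether the counter resets when a task reappears, so the sketch leaves it untouched.

```python
import re
from typing import Dict, List, Optional, Set


def parse_stat_pids(stat_output: str, pid_regex: Optional[str] = None,
                    pid_column: int = 0) -> Set[str]:
    """Collect one PID per line of 'stat' output: regex first, then column."""
    pids: Set[str] = set()
    for line in stat_output.splitlines():
        line = line.strip()
        if not line:
            continue
        match = re.search(pid_regex, line) if pid_regex else None
        if match:
            pids.add(match.group(1) if match.re.groups > 0 else match.group(0))
            continue
        # Nothing matched: split into whitespace-separated columns instead.
        columns = line.split()
        if pid_column < len(columns):
            pids.add(columns[pid_column])
    return pids


def update_missing(task_pids: Dict[str, str], running: Set[str],
                   missing: Dict[str, int], max_missing: int = 3) -> List[str]:
    """Increment the counter of every absent task; return tasks declared missing."""
    declared_missing = []
    for task_id, pid in task_pids.items():
        if pid in running:
            continue  # task is visible in the queue
        missing[task_id] = missing.get(task_id, 0) + 1
        if missing[task_id] >= max_missing:  # assumption: ">=" threshold
            declared_missing.append(task_id)
    return declared_missing


stat = "18943025 usr123 RUN queuename login3 5*n128 *.id_3.sh"
running = parse_stat_pids(stat)
counters: Dict[str, int] = {}
# The stored PID is a whole sentence, so it never equals the ID seen by 'stat'.
tasks = {"id_3": "Job <18943025> is submitted"}
for poll in range(3):
    print(poll + 1, update_missing(tasks, running, counters))
```

Because the comparison is an exact string match, a task can be alive on the cluster and still be counted as missing on every poll if the two PIDs were parsed differently.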
Here is an example of debugging a "Task disappeared" error:
First, you need to run bds in debug mode and redirect the output to a log file:
bds -d myscript.bds 2>&1 | tee myscript.log
Then you analyze the log file. In this example, we see the following events:
- A task is run and bds creates a shell script file
Task: Saving file '/home/user/myscript.bds.20180503_145643_806/task.myscript.myfunction.line_59.id_3.sh'
- bds runs the proper command to schedule the job. Note that in this case "generic cluster" is configured:
ExecutionerClusterGeneric 'ClusterGeneric[19]': Running task myscript.bds.20180503_145643_806/task.myscript.myfunction.line_59.id_3
ExecutionerClusterGeneric 'ClusterGeneric[19]': Custom script command line arguments:
0 /home/user/BigDataScript/config/clusterGeneric_LSF/run_4.sh
1 bbc
2 86400
3 5
4 4194304
5 queuename
....
CmdLocal: Executing /home/user/BigDataScript/config/clusterGeneric_LSF/run_4.sh bbc 86400 5 4194304...
- Then bds parses the output of the command in order to find the PID. In this case pidRegex was not configured, so bds used the whole line:
CmdCluster 'myscript.bds.20180503_145643_806/task.myscript.myfunction.line_59.id_3': Reading PID line 'Job <18943025> is submitted'
ExecutionerClusterGeneric 'ClusterGeneric[19]': No PID regex configured in (missing pidRegex entry in config file?). Using whole line
- Found Problem: bds is telling you that since you didn't configure pidRegex, the PID will be 'Job <18943025> is submitted', which is incorrect, since you only want 18943025 as PID.
- Once the task is running on the cluster, bds checks if the task is still there every few minutes. So, bds runs a 'stat' command and parses the output:
Executing command. Arguments: [/home/user/BigDataScript/config/clusterGeneric_LSF/stat.pl]
ExecutionerClusterGeneric 'ClusterGeneric[19]': CheckTasksRunningCluster:Parsing line: 18943025 usr123 RUN queuename login3 5*n128 *.id_3.sh'
ExecutionerClusterGeneric 'ClusterGeneric[19]': CheckTasksRunningCluster: Adding ID (column number 0): '18943025'
- Found Problem: So in your case, the PID found by your 'stat' command was 18943025, but the PID parsed when the task was scheduled was different (the whole line: 'Job <18943025> is submitted'). So the "missing counter" for the task is incremented:
WARNING: Task PID 'Job <18943025> is submitted to queue <queuename>.' not found for task 'myscript.bds.20180503_145643_806/task.myscript.myfunction.line_59.id_3'. Incrementing 'missing counter': 1 (max. allowed 3)
- Checking tasks is done every couple of minutes. When the "missing counter" reaches the maximum, the task is declared missing and bds issues a fatal error:
WARNING: Task PID 'Job <18943025> is submitted to queue <queuename>.' not found for task 'myscript.bds.20180503_145643_806/task.myscript.myfunction.line_59.id_3'. Incrementing 'missing counter': 4 (max. allowed 3)
Fatal error: myscript.bds, line 62, pos 9. Task/s failed.
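The telltale lines in the walkthrough above can also be pulled out of a debug log automatically. The following is a small helper sketch: the function `scan_log` and the labels are hypothetical, and the patterns are copied from the log excerpts shown here, so other bds versions may word these messages differently.

```python
import re
from typing import List, Tuple

# Message fragments taken from the log excerpts in this walkthrough.
PATTERNS = {
    "no_pid_regex": re.compile(r"No PID regex configured"),
    "pid_line": re.compile(r"Reading PID line '(.*)'"),
    "stat_id": re.compile(r"Adding ID \(column number (\d+)\): '(.*)'"),
    "missing": re.compile(r"Task PID '(.*?)' not found for task"),
}


def scan_log(lines: List[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, label, text) for every line matching a pattern."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label, line.strip()))
    return findings


sample_log = """\
ExecutionerClusterGeneric 'ClusterGeneric[19]': No PID regex configured in (missing pidRegex entry in config file?). Using whole line
ExecutionerClusterGeneric 'ClusterGeneric[19]': CheckTasksRunningCluster: Adding ID (column number 0): '18943025'
Fatal error: myscript.bds, line 62, pos 9. Task/s failed.
""".splitlines()

for number, label, text in scan_log(sample_log):
    print(f"{number}: [{label}] {text}")
```

To use it on a real run, read the lines of the file produced by the `tee` command (myscript.log in this example) and pass them to `scan_log`.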
Solution: Make sure that pidRegex, pidRegexCheckTaskRunning and pidColumnCheckTaskRunning are properly configured in your bds.config.
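Before editing bds.config, it can help to confirm that candidate patterns produce the same PID from both the scheduling output and the 'stat' output. The sketch below uses the two lines from the example log. The patterns and the helper `first_group` are illustrative assumptions, not values shipped with bds, and the escaping needed inside bds.config may differ from Python's.

```python
import re
from typing import Optional


def first_group(pattern: str, text: str) -> Optional[str]:
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Lines taken from the example log in this page.
submit_line = "Job <18943025> is submitted"
stat_line = "18943025 usr123 RUN queuename login3 5*n128 *.id_3.sh"

# Candidate patterns (assumptions to adapt to your own scheduler's output).
submit_pid = first_group(r"Job <(\d+)>", submit_line)
stat_pid = first_group(r"^(\S+)", stat_line)

print(submit_pid, stat_pid, submit_pid == stat_pid)
```

If the two values differ, bds would compare unequal strings on every check and the "missing counter" would keep growing even though the job is running.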