Currently, the Watchdog seems to compute the "time left" from the CPU work, which is the product of
the CPU time that we get from the underlying batch system, which is accurate in most cases (I guess),
and the CPU power, which might not be accurate in some cases.
Then, based on this "time left" value, the Watchdog seems to apply fairly complex logic to decide whether a job should be killed or not.
- First it performs a check every `checkingTime` until `timeLeft < grossTimeLeftLimit` (`grossTimeLeftLimit` being 18,000), see here.
- When this happens, `timeLeft` is then computed every `pollingTime`, and the variable `littleTimeLeftCount`, initialized to 15, is decremented every `pollingTime` (it can apparently become negative), see here.
- When `timeLeft < fineTimeLimitLeft` (`fineTimeLimitLeft` being `150 * pollingTime` by default) and `littleTimeLeftCount == 0` (keeping in mind that it can also be negative), the job is killed.
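
To make the steps above concrete, here is a minimal sketch of how I read them. It is not the actual Watchdog code: the class name, the `step()` method and the order of "decrement, then test" are assumptions of mine; only the constants (18,000, 15, `150 * pollingTime`) come from the description above.

```python
from dataclasses import dataclass

GROSS_TIME_LEFT_LIMIT = 18_000  # value quoted in the description above


@dataclass
class CurrentLogicSketch:
    """Hypothetical model of the described behaviour, not the real Watchdog."""

    polling_time: float
    little_time_left: bool = False    # True once timeLeft < grossTimeLeftLimit
    little_time_left_count: int = 15  # initial value quoted above

    @property
    def fine_time_left_limit(self) -> float:
        # Default quoted above: 150 * pollingTime
        return 150 * self.polling_time

    def step(self, time_left: float) -> bool:
        """Run one check; return True if the job would be killed."""
        if not self.little_time_left:
            # Coarse phase: one check every checkingTime.
            if time_left < GROSS_TIME_LEFT_LIMIT:
                self.little_time_left = True
            return False
        # Fine phase: one check every pollingTime.
        # Assumption: the counter is decremented before the test.
        self.little_time_left_count -= 1
        return (
            time_left < self.fine_time_left_limit
            and self.little_time_left_count == 0
        )
```

Under this reading, if `timeLeft` is still above `fineTimeLimitLeft` on the check where the counter reaches 0, the counter goes negative and the kill condition can never become true afterwards.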
I would like to simplify this logic as follows:
- We add a `TimeLeft.getCPUTimeLeft()` method to get the CPU time left in seconds, and `TimeLeft.getTimeLeft()` in this case becomes `getCPUWorkLeft()`. In the Watchdog we use this new method to get the time left in seconds: I guess it would be more accurate.
- Once `timeLeft < 4000s` (or maybe `checkingTime * 1.5`), we do a regular check every `pollingTime`.
- Once `timeLeft < 600s` (10 minutes), we kill the job.
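
As a rough sketch of the proposal above (the function, enum and parameter names are hypothetical and only here for illustration; the 4000 s and 600 s thresholds are the ones suggested in the list), the decision reduces to two comparisons on the CPU time left in seconds:

```python
from enum import Enum

POLL_THRESHOLD_S = 4000  # proposed above; checkingTime * 1.5 is the alternative
KILL_THRESHOLD_S = 600   # proposed above (10 minutes)


class Action(Enum):
    CHECK_EVERY_CHECKING_TIME = "check every checkingTime"
    CHECK_EVERY_POLLING_TIME = "check every pollingTime"
    KILL = "kill the job"


def next_action(
    cpu_time_left_s: float,
    poll_threshold_s: float = POLL_THRESHOLD_S,
    kill_threshold_s: float = KILL_THRESHOLD_S,
) -> Action:
    """Decide what the Watchdog would do for a given CPU time left (seconds)."""
    if cpu_time_left_s < kill_threshold_s:
        return Action.KILL
    if cpu_time_left_s < poll_threshold_s:
        return Action.CHECK_EVERY_POLLING_TIME
    return Action.CHECK_EVERY_CHECKING_TIME
```

For the `checkingTime * 1.5` variant, the caller would pass `poll_threshold_s=1.5 * checking_time`. There is no counter in this version, so there is no state that can go negative.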
There are probably many historical reasons that I do not understand, or use cases that I do not know about, that would explain this complex logic.
Let me know if you have further details or comments about what I propose.