Open
Description
When the hq server has a very large memory footprint (e.g., in excess of 100 GB), running hq job forget all to release memory for old finished jobs succeeds in forgetting the jobs and frees much of the memory. However, I have often noticed that some (or many) of the workers lose their connection to the server during the forget operation. If the workers are set to "finish-running", they remain active on the Slurm node but are dropped from the server's worker list. It is hard to reproduce this without submitting many jobs with hundreds of thousands of tasks each via JDF, but it has happened to me multiple times. What logs should I provide to help diagnose the issue the next time this happens?