HQ Workers disconnect on hq job forget

When the HQ server has a very large memory footprint (e.g., more than 100 GB), running `hq job forget all` to release the memory held by old finished jobs does forget the jobs and free much of the memory. However, I have often noticed that some (or many) of the workers lose their connection to the server during the forget operation. Workers set to "finish-running" remain active on the Slurm node but are dropped from the server's worker list. This is hard to reproduce without submitting many jobs with hundreds of thousands of tasks each via JDF, but it has happened to me multiple times. Which logs should I provide to help diagnose the issue the next time it happens?
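In case it helps with the next occurrence, here is a rough sketch of how the worker list could be recorded right before and right after the forget, so the dropped workers can be identified from the difference. It is only an illustration: `capture_forget`, `run`, and the output file names are hypothetical, and it assumes `hq worker list` and `hq job forget all` are run from an environment that can already reach the server.

```python
import subprocess
import time
from pathlib import Path


def run(cmd):
    """Run a command and return its stdout and stderr as one string."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr


def capture_forget(out_dir, runner=run):
    """Save the worker list before and after `hq job forget all`.

    Hypothetical helper: the file names are arbitrary, and `runner`
    can be swapped out so the function can be tried without a server.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    steps = [
        ("workers_before.txt", ["hq", "worker", "list"]),
        ("forget.txt", ["hq", "job", "forget", "all"]),
        ("workers_after.txt", ["hq", "worker", "list"]),
    ]
    for filename, cmd in steps:
        # Timestamp each step so it can be matched against server/worker logs.
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        output = runner(cmd)
        (out_dir / filename).write_text(f"# {stamp} {' '.join(cmd)}\n{output}")
    return [out_dir / name for name, _ in steps]
```

Calling `capture_forget("forget-debug")` would leave three timestamped text files in that directory; comparing `workers_before.txt` with `workers_after.txt` should show which workers were dropped during the forget.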

HQ Workers disconnect on hq job forget #1064
