HQ Workers disconnect on hq job forget #1064

@kyleabbott

When the hq server has a very large memory footprint (e.g., in excess of 100 GB), running hq job forget all to release the memory held by old finished jobs does forget the jobs and free much of the memory. However, I have often noticed that some (or many) of the workers lose their connection to the server during the forget operation. If the workers are set to "finish-running", they remain active on the Slurm node but are dropped from the server's worker list. This is hard to reproduce without submitting many jobs with hundreds of thousands of tasks each via JDF, but it has happened to me multiple times. What logs should I provide to help diagnose the issue the next time it happens?
