
Conversation

@crystalzhaizhai (Contributor)

This makes the XID error mitigation public in the cluster.

```
apiVersion: v1
```
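(The manifest above is truncated. For illustration only, a complete ConfigMap carrying the XID configuration might look like the sketch below; the `healthCriticalXids` key and the `kube-system` namespace are placeholders, not the device plugin's confirmed schema — only the `/etc/nvidia/gpu_config.json` file name is mentioned in this thread.)

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-config
  namespace: kube-system
data:
  # "healthCriticalXids" is a placeholder key, not the plugin's
  # confirmed schema; only the gpu_config.json file name is known here.
  gpu_config.json: |
    {
      "healthCriticalXids": [79]
    }
```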
Contributor

Can we add these as actual files in a directory, with a README.md file explaining how to deploy them? Then the user could just clone the repo and deploy from their command line.

@grac3gao (Member) left a comment

This is a short-term mitigation which can't be used in the long run. The device plugin change introduced in this doc will be reverted by the add-on manager during an upgrade, and if the GPU nodes are used together with autoscaling, new nodes may not contain this mitigation.

Ideally, we would like a more stable and simpler solution for this problem (e.g. the user only needs to configure the ConfigMap to customize the XIDs they want).

It would be better to add more explanation in this doc about this situation, mentioning that it is a short-term mitigation.

@thomas-riccardi

> Ideally, we would like a more stable and simpler solution for this problem (e.g. the user only needs to configure the ConfigMap to customize the XIDs they want).

We would like that, yes: just edit/create a configmap.

For now, we can modify the /etc/nvidia/gpu_config.json file on the host, but we have no guarantee it's done before the gpu-device-plugin loads it, and we have no easy way to reload it if it's modified afterwards. (Alternatively, could the gpu-device-plugin watch the file?)
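As a stop-gap, something like the following DaemonSet could write the `gpu-config` ConfigMap sketched earlier onto each node's /etc/nvidia/. This is only a sketch (names and image are illustrative), and it has exactly the ordering race described above: nothing guarantees it runs before the gpu-device-plugin reads the file.

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-config-writer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gpu-config-writer
  template:
    metadata:
      labels:
        app: gpu-config-writer
    spec:
      # A real deployment should be restricted to GPU nodes
      # (nodeSelector/affinity) and tolerate their taints.
      containers:
      - name: writer
        image: busybox
        # Copy the config onto the host, then idle so the pod stays Running.
        command: ["sh", "-c", "cp /config/gpu_config.json /host-etc-nvidia/ && while true; do sleep 3600; done"]
        volumeMounts:
        - name: config
          mountPath: /config
        - name: host-etc-nvidia
          mountPath: /host-etc-nvidia
      volumes:
      - name: config
        configMap:
          name: gpu-config
      - name: host-etc-nvidia
        hostPath:
          path: /etc/nvidia
          type: DirectoryOrCreate
```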

Also, XID 79 should IMO be there by default, since it's not a user error: https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4

Context: on GKE we get `NVRM: Xid (PCI:0000:00:04): 79, pid=0, GPU has fallen off the bus.`, and the node stays stuck with an unusable GPU; our workload using the GPU at best reaches a CRITICAL/FATAL state and crash-loops, or at worst gets silently stuck.
The end goal is that this would somehow mark the node as broken, so the GKE auto-repair feature kicks in.
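One direction for that end goal (just a sketch, not something we've validated): a node-problem-detector system-log-monitor rule that turns the Xid 79 kernel message into a permanent node condition. All names below are illustrative, and whether GKE auto-repair actually reacts to a custom condition is an open question.

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: npd-gpu-xid-monitor
  namespace: kube-system
data:
  # A kernel-log monitor config for node-problem-detector that sets a
  # permanent "GpuXidError" node condition when Xid 79 is seen in kmsg.
  gpu-xid-monitor.json: |
    {
      "plugin": "kmsg",
      "logPath": "/dev/kmsg",
      "lookback": "5m",
      "bufferSize": 10,
      "source": "gpu-xid-monitor",
      "conditions": [
        {
          "type": "GpuXidError",
          "reason": "NoGpuXidError",
          "message": "GPU has not reported a critical Xid error"
        }
      ],
      "rules": [
        {
          "type": "permanent",
          "condition": "GpuXidError",
          "reason": "GpuFallenOffBus",
          "pattern": "NVRM: Xid .*: 79, .*"
        }
      ]
    }
```

node-problem-detector would load this via its `--config.system-log-monitor` flag; turning the resulting condition into something auto-repair acts on is the remaining gap.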
