How to re-create
A Job that requests nvidia.com/gpu and causes GKE Node-Auto-Provisioning to spin up a new node will fail to be scheduled on that node.
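For concreteness, a minimal sketch of such a Job, assuming a T4 accelerator and a placeholder CUDA image (the names, image tag, and selector are illustrative, not the original workload):

```shell
# Minimal sketch of a GPU-requesting Job; names, image tag, and the
# T4 node selector are illustrative assumptions.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
```

With Node-Auto-Provisioning enabled, applying a Job like this is enough to trigger creation of a new GPU node.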
Why is this bad
- Using GPU nodes with Node-Auto-Provisioning in GKE is broken (at least for T4s, not sure which other GPU types are affected)
- It feels strange that breakage of such a core "elasticity behavior" is unacknowledged; hoping this issue gets attention and results in at least an ETA for a fix
Details on error
- The provisioned node has an nvidia-device-plugin pod
- This pod has an nvidia-driver-installer init container
- This init container is stuck on startup; its log is shown below
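The stuck init container can be inspected with something along these lines (the kube-system namespace is an assumption and the pod name is a placeholder):

```shell
# Find the device-plugin pod that landed on the newly provisioned node
# (namespace and grep pattern are assumptions; adjust as needed).
kubectl get pods -n kube-system -o wide | grep nvidia

# Tail the stuck init container's log (pod name is a placeholder).
kubectl logs -n kube-system <nvidia-device-plugin-pod> -c nvidia-driver-installer -f
```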
```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   720  100   720    0     0   113k      0 --:--:-- --:--:-- --:--:--  117k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
```
As a result, the kubelet never registers the nvidia.com/gpu resource, which means that the job (which triggered the node in the first place!) can't get its pods scheduled on it.
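One way to confirm this symptom is to check the node's allocatable resources, where nvidia.com/gpu never appears (the node name below is a placeholder):

```shell
# Inspect the new node's allocatable resources; nvidia.com/gpu is absent
# while the driver installer is stuck (node name is a placeholder).
kubectl describe node <new-gpu-node> | grep -A 10 "Allocatable"

# Or query the resource directly; this prints nothing on the broken node.
kubectl get node <new-gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```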
Prior context:
This is based on the following issue, which is no longer fixed (but which I cannot reopen):