Description
How to reproduce
- Create a GKE cluster in Standard mode
- Enable Node Auto-Provisioning with L4 GPU capacity
- Try to create a pod with an `nvidia.com/gpu` resource request
- The pod will be stuck in the PodInitializing state
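
A minimal pod manifest that should reproduce the issue, assuming the standard `nvidia.com/gpu` resource name and the GKE L4 accelerator label; the pod name, image, and command are illustrative, not taken from the original report:

```yaml
# Minimal reproducer: a pod requesting one L4 GPU via the
# nvidia.com/gpu resource. Name, image, and command are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```

With Node Auto-Provisioning enabled, applying this manifest should trigger provisioning of an L4 node, but the pod then stays in PodInitializing as described above.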
Analysis
When GPU nodes are created via Node Auto-Provisioning in GKE, an `nvidia-gpu-device-plugin` pod is started on each GPU node. Only after this pod completes its initialization does the node report allocatable `nvidia.com/gpu` resources. However, this pod is stuck in its init containers, so no pod with a GPU request can be scheduled onto the node.
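
On an affected cluster, the missing allocatable GPU capacity can be confirmed with standard `kubectl` commands; the node name below is a placeholder:

```shell
# List GPU nodes and check whether they expose allocatable nvidia.com/gpu.
# On an affected node this prints nothing for the GPU resource (or "0"),
# because the device plugin never finished initializing.
kubectl get nodes -l cloud.google.com/gke-accelerator -o name

# <node-name> is a placeholder for one of the nodes listed above.
kubectl get node <node-name> \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

These commands require access to an affected cluster, so they are shown here as a diagnostic sketch rather than a verified transcript.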
```
$ kubectl -n kube-system get pod
kube-system   nvidia-gpu-device-plugin-small-cos-4fpl4   0/2   Init:0/2   0   16m

$ kubectl -n kube-system logs nvidia-gpu-device-plugin-small-cos-hzvds -c nvidia-driver-installer
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   598  100   598    0     0   273k      0 --:--:-- --:--:-- --:--:--  291k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
```