
nvidia-device-plugin failed to run on GPU nodes created by Node Auto-Provisioning  #407

@hongchaodeng

Description

How to reproduce

  • Create a GKE cluster in Standard mode
  • Enable Node Auto-Provisioning with L4 GPU capacity
  • Create a pod with an nvidia.com/gpu resource request (see the sketch after this list)
  • The pod gets stuck in the PodInitializing state
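
For reference, a minimal manifest for step 3 could look like the sketch below. The pod name, image, and node selector are illustrative assumptions, not taken from the issue; the nvidia.com/gpu limit is the part that triggers GPU scheduling and Node Auto-Provisioning.

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                   # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4    # assumed selector for L4 GPU nodes
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04     # illustrative image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                          # the GPU request
EOF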

Analysis

When Node Auto-Provisioning creates GPU nodes in GKE, it runs the nvidia-gpu-device-plugin job on those nodes. Only after this job finishes successfully does the node report allocatable nvidia.com/gpu resources. However, the job is stuck, so no pod with a GPU request can be scheduled onto the node.
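
A quick way to confirm this is to list the allocatable GPU count per node with a standard custom-columns query (node names will differ per cluster); for the affected node the GPU column shows <none>, because the device plugin never registers the resource:

$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"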

$ kubectl -n kube-system get pod

kube-system       nvidia-gpu-device-plugin-small-cos-4fpl4                         0/2     Init:0/2   0          16m

$ kubectl -n kube-system logs nvidia-gpu-device-plugin-small-cos-hzvds -c nvidia-driver-installer

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   598  100   598    0     0   273k      0 --:--:-- --:--:-- --:--:--  291k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
