
nvidia-device-plugin failed to run on GPU nodes created by Node Auto-Provisioning  #407

@hongchaodeng

Description

How to reproduce

  • Create a GKE cluster in Standard mode
  • Enable Node Auto-Provisioning with L4 GPU capacity
  • Create a pod with an nvidia.com/gpu resource request (see the sketch after this list)
  • The pod gets stuck in the PodInitializing state
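
For reference, a minimal manifest for step 3 could look like the sketch below. The pod name, image, and node selector are illustrative assumptions, not taken from the issue; the nvidia.com/gpu limit is the part that triggers GPU scheduling and Node Auto-Provisioning.

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                   # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4    # assumed selector for L4 GPU nodes
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04     # illustrative image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                          # the GPU request
EOF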

Analysis

When Node Auto-Provisioning creates GPU nodes in GKE, it runs the nvidia-gpu-device-plugin job on those nodes. Only after this job finishes successfully does the node report allocatable nvidia.com/gpu resources. However, the job is stuck, so no pod with a GPU request can be scheduled onto the node.
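
A quick way to confirm this is to list the allocatable GPU count per node with a standard custom-columns query (node names will differ per cluster); for the affected node the GPU column shows <none>, because the device plugin never registers the resource:

$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"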

$ kubectl -n kube-system get pod

kube-system       nvidia-gpu-device-plugin-small-cos-4fpl4                         0/2     Init:0/2   0          16m

$ kubectl -n kube-system logs nvidia-gpu-device-plugin-small-cos-hzvds -c nvidia-driver-installer

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   598  100   598    0     0   273k      0 --:--:-- --:--:-- --:--:--  291k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
