
Node Auto-Provisioning failing for certain GPU nodes (T4) #402

@agam

Description


How to re-create

A job that requests nvidia.com/gpu, if it results in a new node being spun up by GKE Node Auto-Provisioning, will fail to be scheduled on that node. A minimal example is sketched below.
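For concreteness, something like the following is enough to trigger auto-provisioning of a T4 node (the job name, image, and command are illustrative placeholders rather than my exact workload; the parts that matter are the nodeSelector and the nvidia.com/gpu limit):

# Sketch of a GPU job that triggers Node Auto-Provisioning for a T4 node.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test            # placeholder name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        # standard GKE label that tells auto-provisioning which accelerator to attach
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: cuda-check
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any CUDA-capable image works
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF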

Why is this bad

  • Using GPU nodes with Node Auto-Provisioning in GKE is broken (at least for T4s; I'm not sure which other GPU types are affected)
  • It feels strange that such a core "elasticity behavior" is broken without acknowledgement -- hoping this issue gets attention and results in at least an ETA for a fix

Details on error

  • The provisioned node gets an nvidia-device-plugin pod
  • That pod has an nvidia-driver-installer init container
  • The init container is stuck on startup, logging the following:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 100   720  100   720    0     0   113k      0 --:--:-- --:--:-- --:--:--  117k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
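That output can be pulled from the stuck init container with something like this (the pod name is a placeholder, and I'm assuming the DaemonSet lives in kube-system as usual on GKE):

# Find the device-plugin pod that landed on the newly provisioned node,
# then read the nvidia-driver-installer init container's logs.
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin
kubectl logs -n kube-system <nvidia-device-plugin-pod> -c nvidia-driver-installer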

As a result, the kubelet never registers the nvidia.com/gpu resource, which means that the job (which triggered the node creation in the first place!) can't get its pods scheduled on it.
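This is easy to confirm from the node object itself -- the Allocatable section never gains an nvidia.com/gpu entry, and the job's pods stay Pending (node name is a placeholder):

# The auto-provisioned node never advertises nvidia.com/gpu in Allocatable,
# so the GPU pods that triggered it remain Pending.
kubectl describe node <provisioned-node-name> | grep -A 8 'Allocatable:'
kubectl get pods --field-selector=status.phase=Pending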

Prior context:

This follows up on the issue below, which appears to have regressed (and which I cannot reopen):

#356
