CUDA unknown error when checking torch.cuda.is_available #248

@csaroff

Description

I'm running torchserve in GKE, and I've installed the nvidia-driver-installer according to the torchserve GPU installation instructions for GKE.

Unfortunately, after a recent reboot of a Kubernetes GPU node, my torchserve models failed to start. At startup, they check whether a GPU is available, which results in the following error:

>>> torch.cuda.is_available()
/home/venv/lib/python3.8/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
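One way to keep a model server from crashing on this failure mode is to treat the CUDA-initialization warning as a signal to fall back to CPU. Below is a minimal sketch; `safe_device` and `cuda_probe` are hypothetical names (not part of torch or torchserve), and the caller would pass `torch.cuda.is_available` as the probe:

```python
import warnings


def safe_device(cuda_probe):
    """Return "cuda" if the probe reports a usable GPU, else "cpu".

    cuda_probe is a zero-argument callable such as torch.cuda.is_available.
    Any UserWarning mentioning CUDA (e.g. "CUDA unknown error") downgrades
    the choice to CPU even if the probe itself returns True.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        available = cuda_probe()
    cuda_warned = any("CUDA" in str(w.message) for w in caught)
    return "cuda" if available and not cuda_warned else "cpu"
```

In a torchserve handler this would look like `device = safe_device(torch.cuda.is_available)`; degrading to CPU keeps the pod serving (slowly) instead of crash-looping while the node is repaired.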

For context, I'm running pytorch 1.11.0+cu102:

>>> torch.__version__
'1.11.0+cu102'

I forgot to copy-paste this part, but when I checked the CUDA version via cat /usr/local/cuda/version.txt, it was 10.2. I don't recall the patch version.
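A startup check could compare the installed toolkit version against the wheel's build (cu102 here) before attempting CUDA work. A small sketch, assuming version.txt contains a line like "CUDA Version 10.2.89" (the parsing helper name is mine, not a torch API):

```python
import re


def parse_cuda_version(text):
    """Extract (major, minor, patch) from a /usr/local/cuda/version.txt line.

    Returns None if no "CUDA Version X.Y[.Z]" string is found; a missing
    patch component defaults to 0.
    """
    m = re.search(r"CUDA Version (\d+)\.(\d+)(?:\.(\d+))?", text)
    if not m:
        return None
    major, minor, patch = m.groups()
    return (int(major), int(minor), int(patch) if patch else 0)
```

With this, a handler could log a clear mismatch error (e.g. toolkit 10.2 vs. a cu113 wheel) instead of the opaque "CUDA unknown error". Note that newer CUDA installs drop version.txt in favor of version.json, so this is only a sketch for the layout described above.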

I was able to resolve the issue by scaling my GPU node pool down to zero and then rescaling it back up. Luckily, this issue impacted a node in our staging cluster, but it could just as easily have been a production node, so it would be great to understand what went wrong here.
