Description
I'm running torchserve in GKE and I've installed the nvidia-driver-installer according to the torchserve gpu installation instructions for GKE.
Unfortunately, after a recent reboot of a Kubernetes GPU node, my torchserve models failed to start. At startup they check whether a GPU is available, which now results in the following error:
```
>>> torch.cuda.is_available()
/home/venv/lib/python3.8/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
```
For context, I'm running PyTorch 1.11.0+cu102:
```
>>> torch.__version__
'1.11.0+cu102'
```
I forgot to copy-paste this part, but when I checked the CUDA version via `cat /usr/local/cuda/version.txt` it was 10.2. I don't recall the patch version.
I was able to resolve the issue by scaling my GPU node pool down to zero and then scaling it back up. Luckily, this only impacted a node in our staging cluster, but it could just as easily have been a production node, so it would be great to understand what went wrong here.
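In the meantime, one mitigation I'm considering (just a sketch of an idea, not something from the torchserve docs) is a small GPU health check that could be wired into a Kubernetes exec readiness/startup probe, so a node whose driver comes back in a bad state after a reboot never receives traffic; the probe wiring itself is hypothetical and not shown here:

```python
#!/usr/bin/env python3
"""Exit non-zero if CUDA cannot be initialized on this node."""
import sys

import torch


def gpu_healthy() -> bool:
    # is_available() returns False (with a UserWarning) under the
    # "CUDA unknown error" condition described above.
    if not torch.cuda.is_available():
        return False
    try:
        # Exercise the device end to end with a tiny allocation and a sync.
        torch.zeros(1, device="cuda")
        torch.cuda.synchronize()
    except RuntimeError:
        return False
    return True


if __name__ == "__main__":
    sys.exit(0 if gpu_healthy() else 1)
```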