Description
I'm running torchserve in GKE and I've installed the nvidia-driver-installer according to the torchserve gpu installation instructions for GKE.
Unfortunately, after a recent reboot of a Kubernetes GPU node, my torchserve models failed to start. At startup they check whether a GPU is available, which now results in the following error:
```
>>> torch.cuda.is_available()
/home/venv/lib/python3.8/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
```
For context, I'm running PyTorch 1.11.0+cu102:
```
>>> torch.__version__
'1.11.0+cu102'
```
I forgot to copy-paste this part, but when I checked the CUDA version via `cat /usr/local/cuda/version.txt` it was 10.2. I don't recall the patch version.
I was able to resolve the issue by scaling my GPU node pool down to zero and then scaling it back up. Luckily, this only impacted a node in our staging cluster, but it could just as easily have been a production node, so it would be great to understand what went wrong here.
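In the meantime, one mitigation I'm considering (just a sketch of an idea, not something from the torchserve docs) is a small GPU health check that could be wired into a Kubernetes exec readiness/startup probe, so a node whose driver comes back in a bad state after a reboot never receives traffic; the probe wiring itself is hypothetical and not shown here:

```python
#!/usr/bin/env python3
"""Exit non-zero if CUDA cannot be initialized on this node."""
import sys

import torch


def gpu_healthy() -> bool:
    # is_available() returns False (with a UserWarning) under the
    # "CUDA unknown error" condition described above.
    if not torch.cuda.is_available():
        return False
    try:
        # Exercise the device end to end with a tiny allocation and a sync.
        torch.zeros(1, device="cuda")
        torch.cuda.synchronize()
    except RuntimeError:
        return False
    return True


if __name__ == "__main__":
    sys.exit(0 if gpu_healthy() else 1)
```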