Description
The GPUOperatorNodeDeploymentFailed alert fires a false positive when the device plugin is intentionally disabled in the ClusterPolicy (spec.devicePlugin.enabled: false). This is a valid configuration — for example, when GPU allocation is managed externally or via MIG partitioning through a third-party operator.
Steps to Reproduce
- Deploy the GPU Operator with the following ClusterPolicy configuration:

```yaml
spec:
  devicePlugin:
    enabled: false
  nodeStatusExporter:
    enabled: true
  mig:
    strategy: mixed
  migManager:
    enabled: true
```
- Wait 30+ minutes
- Observe the GPUOperatorNodeDeploymentFailed alert firing
Root Cause
The nvidia-node-status-exporter monitors the metric gpu_operator_node_device_plugin_devices_total. When the device plugin is disabled, no device plugin pods run, so this metric reports 0. The alert rule:
```promql
gpu_operator_node_device_plugin_devices_total == 0
```
fires after 30 minutes, even though the device plugin is intentionally disabled and GPUs are fully functional.
The exporter does not check the ClusterPolicy to determine whether the device plugin is supposed to be running.
Expected Behavior
The GPUOperatorNodeDeploymentFailed alert should not fire when devicePlugin.enabled: false is set in the ClusterPolicy. Possible fixes:
- The node-status-exporter checks the ClusterPolicy and skips device plugin validation when it is disabled
- The alert rule includes a condition that excludes nodes/clusters where the device plugin is intentionally disabled
- The node-status-exporter does not emit the gpu_operator_node_device_plugin_devices_total metric when the device plugin is disabled
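The second option above could be sketched as a modified alert expression. Note that gpu_operator_clusterpolicy_device_plugin_enabled is a hypothetical metric invented here for illustration; the operator would need to export something equivalent reflecting the ClusterPolicy spec:

```yaml
# Hypothetical PrometheusRule fragment. The metric
# gpu_operator_clusterpolicy_device_plugin_enabled does not exist today;
# it is assumed to report 1 when spec.devicePlugin.enabled is true.
- alert: GPUOperatorNodeDeploymentFailed
  expr: |
    gpu_operator_node_device_plugin_devices_total == 0
    and on () gpu_operator_clusterpolicy_device_plugin_enabled == 1
  for: 30m
```

With a gating metric like this, the comparison against 0 only matters on clusters where the device plugin is actually expected to run.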
Workaround
Manually silence the GPUOperatorNodeDeploymentFailed alert in the cluster monitoring configuration.
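Until a fix lands, one way to create such a silence is with the upstream Alertmanager amtool CLI (the Alertmanager URL below is a placeholder; adjust the duration and comment to your environment):

```shell
# Create a long-lived silence for the false-positive alert.
# Replace the URL with your cluster's Alertmanager endpoint.
amtool silence add alertname=GPUOperatorNodeDeploymentFailed \
  --alertmanager.url=https://alertmanager.example.com \
  --comment="Device plugin intentionally disabled in ClusterPolicy" \
  --duration=720h
```

The silence must be recreated when it expires, so this is a stopgap rather than a fix.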