Description
The GPUOperatorNodeDeploymentFailed alert fires a false positive when the device plugin is intentionally disabled in the ClusterPolicy (spec.devicePlugin.enabled: false). This is a valid configuration — for example, when GPU allocation is managed externally or via MIG partitioning through a third-party operator.
Steps to Reproduce
- Deploy the GPU Operator with the following ClusterPolicy configuration:

```yaml
spec:
  devicePlugin:
    enabled: false
  nodeStatusExporter:
    enabled: true
  mig:
    strategy: mixed
  migManager:
    enabled: true
```
- Wait 30+ minutes
- Observe the GPUOperatorNodeDeploymentFailed alert firing
Root Cause
The nvidia-node-status-exporter monitors the metric gpu_operator_node_device_plugin_devices_total. When the device plugin is disabled, no device plugin pods run, so this metric reports 0. The alert rule:
```promql
gpu_operator_node_device_plugin_devices_total == 0
```
fires after 30 minutes, even though the device plugin is intentionally disabled and GPUs are fully functional.
The exporter does not check the ClusterPolicy to determine whether the device plugin is supposed to be running.
Expected Behavior
The GPUOperatorNodeDeploymentFailed alert should not fire when devicePlugin.enabled: false is set in the ClusterPolicy. Possible fixes:
- The node-status-exporter checks the ClusterPolicy and skips device plugin validation when it is disabled
- The alert rule includes a condition that excludes nodes/clusters where the device plugin is intentionally disabled
- The node-status-exporter does not emit the gpu_operator_node_device_plugin_devices_total metric when the device plugin is disabled
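The second option above could be sketched as a modified alert expression. Note that gpu_operator_clusterpolicy_device_plugin_enabled is a hypothetical metric invented here for illustration; the operator would need to export something equivalent reflecting the ClusterPolicy spec:

```yaml
# Hypothetical PrometheusRule fragment. The metric
# gpu_operator_clusterpolicy_device_plugin_enabled does not exist today;
# it is assumed to report 1 when spec.devicePlugin.enabled is true.
- alert: GPUOperatorNodeDeploymentFailed
  expr: |
    gpu_operator_node_device_plugin_devices_total == 0
    and on () gpu_operator_clusterpolicy_device_plugin_enabled == 1
  for: 30m
```

With a gating metric like this, the comparison against 0 only matters on clusters where the device plugin is actually expected to run.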
Workaround
Manually silence the GPUOperatorNodeDeploymentFailed alert in the cluster monitoring configuration.
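Until a fix lands, one way to create such a silence is with the upstream Alertmanager amtool CLI (the Alertmanager URL below is a placeholder; adjust the duration and comment to your environment):

```shell
# Create a long-lived silence for the false-positive alert.
# Replace the URL with your cluster's Alertmanager endpoint.
amtool silence add alertname=GPUOperatorNodeDeploymentFailed \
  --alertmanager.url=https://alertmanager.example.com \
  --comment="Device plugin intentionally disabled in ClusterPolicy" \
  --duration=720h
```

The silence must be recreated when it expires, so this is a stopgap rather than a fix.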