backport rbac permissions for dcgm-exporter pod #2472

Closed
rahulait wants to merge 1 commit into release-26.3 from fix-dcgm-exporter-rbac
@rahulait
Contributor

@rahulait rahulait commented May 21, 2026

This is required so that newer versions of dcgm-exporter have the correct RBAC permissions when they need them.

Description

This PR is a partial backport of #2406.

We added support for installing the RBAC resources only when the appropriate flags are set. This PR brings that change to the release-26.3 branch so that newer dcgm-exporter pods can run without permission issues.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Brought up a cluster using a gpu-operator controller image built from this PR. It came up fine and showed no errors.

Without fix:

time=2026-05-21T21:16:14.185Z level=INFO msg="Initializing Pod Informer" nodeName=x11-0553
time=2026-05-21T21:16:14.186Z level=INFO msg="HTTP server started - ready to serve metrics"
time=2026-05-21T21:16:14.186Z level=INFO msg="Starting webserver"
time=2026-05-21T21:16:14.186Z level=INFO msg="Watching for changes in file" file=/etc/dcgm-exporter/dcp-metrics-included.csv debounce=200ms
time=2026-05-21T21:16:14.186Z level=INFO msg="Listening on" address=[::]:9400
time=2026-05-21T21:16:14.186Z level=INFO msg="TLS is disabled." http2=false address=[::]:9400
E0521 21:16:14.198212       1 reflector.go:227] "Failed to watch" err="failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:gpu-operator:nvidia-dcgm-exporter\" cannot list resource \"pods\" in API group \"\" at the cluster scope" logger="UnhandledError" reflector="k8s.io/client-go@v0.36.0/tools/cache/reflector.go:343" type="*v1.Pod"
E0521 21:16:15.644340       1 reflector.go:227] "Failed to watch" err="failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:gpu-operator:nvidia-dcgm-exporter\" cannot list resource \"pods\" in API group \"\" at the cluster scope" logger="UnhandledError" reflector="k8s.io/client-go@v0.36.0/tools/cache/reflector.go:343" type="*v1.Pod"
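
The "Failed to watch" lines above name the missing permission: the service account, the verb, the resource, the API group, and the scope. As a minimal sketch, assuming the API server's forbidden message keeps the format shown in this log, those fields can be pulled out of a log line to see which RBAC rule is missing. The helper `parse_forbidden` and its field names are hypothetical, written for illustration only; they are not part of dcgm-exporter or gpu-operator.

```python
import re

# Matches the "forbidden" message as it appears in the log above, for both
# cluster-scoped and namespaced requests. (Assumed format, taken from the log.)
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<api_group>[^"]*)" '
    r'(?:at the cluster scope|in the namespace "(?P<namespace>[^"]+)")'
)

SA_PREFIX = "system:serviceaccount:"


def parse_forbidden(line):
    """Return the denied request described by a log line, or None if no match.

    Hypothetical helper: 'namespace' is None for cluster-scoped requests, and
    'service_account' is a (namespace, name) tuple when the user is a
    service account, otherwise None.
    """
    # The logger escapes inner quotes as \" ; undo that before matching.
    text = line.replace('\\"', '"')
    match = FORBIDDEN_RE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["service_account"] = None
    user = info["user"]
    if user.startswith(SA_PREFIX):
        sa_namespace, _, sa_name = user[len(SA_PREFIX):].partition(":")
        info["service_account"] = (sa_namespace, sa_name)
    return info


# Error text copied from the "Without fix" log.
sample = (
    r'err="failed to list *v1.Pod: pods is forbidden: User '
    r'\"system:serviceaccount:gpu-operator:nvidia-dcgm-exporter\" cannot list '
    r'resource \"pods\" in API group \"\" at the cluster scope"'
)
denied = parse_forbidden(sample)
print(denied["service_account"], denied["verb"], denied["resource"])
```

For the line above, the parsed request is a cluster-scoped `list` on `pods` in the core API group (the empty string) by the `nvidia-dcgm-exporter` service account in the `gpu-operator` namespace.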

With fix:

time=2026-05-21T21:18:18.356Z level=INFO msg="Initializing system entities of type 'CPU Core'"
time=2026-05-21T21:18:18.356Z level=INFO msg="Not collecting CPU Core metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2026-05-21T21:18:18.437Z level=INFO msg="Registry built successfully" collector_count=2
time=2026-05-21T21:18:18.437Z level=INFO msg="Kubernetes metrics collection enabled!"
time=2026-05-21T21:18:18.437Z level=WARN msg="Failed to get in-cluster config, pod labels will not be available" error="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"
time=2026-05-21T21:18:18.438Z level=INFO msg="HTTP server started - ready to serve metrics"
time=2026-05-21T21:18:18.438Z level=INFO msg="Starting webserver"
time=2026-05-21T21:18:18.438Z level=INFO msg="Watching for changes in file" file=/etc/dcgm-exporter/dcp-metrics-included.csv debounce=200ms
time=2026-05-21T21:18:18.438Z level=INFO msg="Listening on" address=[::]:9400
time=2026-05-21T21:18:18.438Z level=INFO msg="TLS is disabled." http2=false address=[::]:9400

This is required so that newer version of dcgm-exporter can have correct rbac permissions if needed

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
@rahulait rahulait force-pushed the fix-dcgm-exporter-rbac branch from 714ac61 to 0fa652f Compare May 21, 2026 20:48
@coveralls

Coverage Status

coverage: 28.201% (+0.2%) from 28.023% — fix-dcgm-exporter-rbac into release-26.3

@tariq1890
Contributor

Can we explore doing a complete backport instead? I'd prefer to keep the branches easily comparable by cherry-picking exact commits.

@rahulait
Contributor Author

We can close this in favor of #2475.

@tariq1890
Contributor

Thanks @rahulait. I'll close this, as #2475 has been merged.

@tariq1890 tariq1890 closed this May 22, 2026