Skip to content

Add weekly forward compatibility testing#1884

Open
ArangoGutierrez wants to merge 11 commits intoNVIDIA:mainfrom
ArangoGutierrez:integration-tests
Open

Add weekly forward compatibility testing#1884
ArangoGutierrez wants to merge 11 commits intoNVIDIA:mainfrom
ArangoGutierrez:integration-tests

Conversation

@ArangoGutierrez
Copy link
Copy Markdown
Collaborator

@ArangoGutierrez ArangoGutierrez commented Nov 10, 2025

Implement automated forward compatibility tests that validate GPU Operator
against the latest published images from NVIDIA component repositories.

Changes

Workflow Refactoring:

  • Refactor monolithic ci.yaml into reusable workflow modules (variables.yaml, golang-checks.yaml, config-checks.yaml, image-builds.yaml, e2e-tests.yaml, release.yaml)

Forward Compatibility Testing:

  • Add forward-compatibility.yaml workflow (weekly + manual trigger)
  • Create get-latest-images.sh to fetch latest commit-based images with retry/backoff
  • Create generate-values-overrides.sh to produce Helm values override file
  • Use artifact-based values override approach (per @cdesiniotis review)

E2E Test Enhancements:

  • Extend e2e-tests.yaml with optional use_values_override input
  • Create env-to-values.sh to convert env vars to Helm values file
  • Update install-operator.sh with proper YAML merging via yq
  • Handle vGPU options correctly with values file approach
  • Pass VALUES_FILE to remote test instances (per @rajathagasthya review)

Notifications:

  • Add Slack notifications for test failures (JSON payload, configurable mentions)
  • Guard notification step against missing secrets

Components Tested

  • ghcr.io/nvidia/container-toolkit
  • ghcr.io/nvidia/k8s-device-plugin
  • ghcr.io/nvidia/k8s-mig-manager

We could add other operands later, but I wanted to start with the core ones.

Schedule: Every Monday at 2 AM UTC

Review Feedback Addressed

  • Use values override file instead of individual input vars (@cdesiniotis)
  • Move operator_image/operator_version to variables workflow (@cdesiniotis)
  • Remove unnecessary workflow inputs from forward-compatibility (@cdesiniotis)
  • Remove year from copyright headers (@rajathagasthya)
  • Pass VALUES_FILE to remote test instance (@rajathagasthya)
  • Fix Slack JSON payload format
  • Use yq for proper YAML merging
  • Handle vGPU with values file approach
  • Add retry with exponential backoff for image verification
  • Remove push trigger from e2e-tests (secrets not inherited)
  • Restore golangci-lint-action to v9
  • Write SSH key before SCP step
  • Restore Holodeck to v0.2.18
  • Remove unused operator_version input from release workflow
  • Guard Slack notification against missing secrets

@github-advanced-security
Copy link
Copy Markdown

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@ArangoGutierrez ArangoGutierrez force-pushed the integration-tests branch 2 times, most recently from d55dd82 to 9dab00e Compare November 10, 2025 17:54
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements automated weekly forward compatibility testing for the GPU Operator against the latest published component images from NVIDIA repositories. The changes refactor the existing monolithic CI workflow into reusable workflow modules and add forward compatibility testing infrastructure.

Key changes:

  • Refactored CI workflows into separate, reusable workflow files (variables, golang-checks, config-checks, image-builds, e2e-tests, release)
  • Added forward-compatibility.yaml workflow with scheduled weekly runs and manual trigger support
  • Created get-latest-images.sh script to fetch latest commit-based images from component repositories
  • Extended install-operator.sh to support device-plugin and mig-manager image overrides
  • Added Slack notifications for test failures

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
.github/workflows/forward-compatibility.yaml New workflow for weekly forward compatibility testing with Slack notifications
.github/workflows/ci.yaml Refactored to use reusable workflows instead of monolithic job definitions
.github/workflows/variables.yaml New reusable workflow for shared CI variables
.github/workflows/e2e-tests.yaml New reusable e2e test workflow with optional component image inputs
.github/workflows/image-builds.yaml New reusable workflow for GPU operator image builds
.github/workflows/release.yaml New reusable workflow for release processes
.github/workflows/golang-checks.yaml New reusable workflow for Go checks and tests
.github/workflows/config-checks.yaml New reusable workflow for configuration validation
.github/workflows/code-scanning.yaml New reusable workflow for CodeQL scanning
.github/scripts/get-latest-images.sh Script to fetch latest component images from NVIDIA repositories
tests/scripts/install-operator.sh Extended to support device-plugin and mig-manager image overrides

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ArangoGutierrez ArangoGutierrez force-pushed the integration-tests branch 3 times, most recently from 84a0c31 to 54d6fc9 Compare November 11, 2025 13:27
@ArangoGutierrez
Copy link
Copy Markdown
Collaborator Author

PING @rahulait @rajathagasthya @tariq1890

env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Use workflow_dispatch inputs if provided, otherwise fetch latest
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: when would someone set these variables as inputs? If the scope of this workflow is to always test the top-of-tree images for all operands, then maybe we can remove the input variables and move the entire run: block into a shell script? Thoughts?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is to allow manual triggering of the workflow in case we want to test a very specific image without having to wait for the "weekly" run. But I'll set it in a way that even during manual triggers, we will simply fetch all the latests images from operands.

Comment on lines +101 to +103
toolkit_image: ${{ needs.fetch-latest-images.outputs.toolkit_image }}
device_plugin_image: ${{ needs.fetch-latest-images.outputs.device_plugin_image }}
mig_manager_image: ${{ needs.fetch-latest-images.outputs.mig_manager_image }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the current tests/scripts/install-operator.sh has separate input variables for the various operand images, but would it simplify our CI definition if we passed a values override file instead? Just an idea -- the previous step fetch-latest-images can craft the values override file that points to all the latest operand images. This file is passed as input to the run-e2e-tests step which launches our e2e test script with this overrides file. Thoughts? The reason I am suggesting this is because I see a lot of boilerplate in forward-compatibility.yaml and e2e-tests.yaml for declaring all of these input variables.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a stab at this comment, please let me know what you think

Comment on lines +22 to +33
toolkit_image:
description: 'Override container-toolkit image'
required: false
type: string
device_plugin_image:
description: 'Override device-plugin image'
required: false
type: string
mig_manager_image:
description: 'Override mig-manager image'
required: false
type: string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these variables need to be inputs? Aren't we always resolving the images in this workflow itself?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines +54 to +56
with:
operator_image: ${{ needs.variables.outputs.operator_image_base }}
operator_version: ${{ needs.variables.outputs.operator_version }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- could we potentially move this to the variables workflow itself? As in, make operator_image and operator_version as outputs of the variables worflow?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, done!

Comment on lines +63 to +66
with:
commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }}
operator_version: ${{ needs.variables.outputs.operator_version }}
operator_image_base: ${{ needs.variables.outputs.operator_image_base }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above -- is this really needed? Can we move these variables as outputs of the variables workflow itself?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, done!

Comment on lines +80 to +85
with:
operator_image: ${{ inputs.operator_image }}
operator_version: ${{ inputs.operator_version }}
toolkit_image: ${{ inputs.toolkit_image }}
device_plugin_image: ${{ inputs.device_plugin_image }}
mig_manager_image: ${{ inputs.mig_manager_image }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for passing these variables as input to the variables workflow?

@ArangoGutierrez
Copy link
Copy Markdown
Collaborator Author

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +88 to +100
channel: ${{ secrets.SLACK_CHANNEL_ID }}
text: |
:x: *Forward Compatibility Test Failed for GPU Operator*

*Workflow:* ${{ github.workflow }}
*Repository:* ${{ github.repository }}
*Trigger:* ${{ github.event_name }}

*Tested Components:*
Download `values-overrides` artifact to see tested component versions

*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>
<@S095E7BNGJU>
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Slack notification payload format is incorrect. The action expects a JSON payload, but this is using YAML format with a nested structure. The correct format should use JSON with the payload structured as: {"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": "message"}. The current format will cause the Slack notification to fail.

Suggested change
channel: ${{ secrets.SLACK_CHANNEL_ID }}
text: |
:x: *Forward Compatibility Test Failed for GPU Operator*
*Workflow:* ${{ github.workflow }}
*Repository:* ${{ github.repository }}
*Trigger:* ${{ github.event_name }}
*Tested Components:*
Download `values-overrides` artifact to see tested component versions
*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>
<@S095E7BNGJU>
{"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": ":x: *Forward Compatibility Test Failed for GPU Operator*\n\n*Workflow:* ${{ github.workflow }}\n*Repository:* ${{ github.repository }}\n*Trigger:* ${{ github.event_name }}\n\n*Tested Components:*\nDownload `values-overrides` artifact to see tested component versions\n\n*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>\n<@S095E7BNGJU>"}

Copilot uses AI. Check for mistakes.
Comment on lines +55 to +58
# Clear individual options since we're using values file
TOOLKIT_CONTAINER_OPTIONS=""
DEVICE_PLUGIN_OPTIONS=""
MIG_MANAGER_OPTIONS=""
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When USE_VALUES_FILE is true, OPERATOR_OPTIONS is cleared/never set, but it's still being appended to in the vGPU section (lines 81-92). The vGPU driver options added to OPERATOR_OPTIONS won't be applied when the values file approach is used (line 101-104), causing vGPU configurations to be silently ignored. The vGPU options should be added to the values file instead when USE_VALUES_FILE is true.

Suggested change
# Clear individual options since we're using values file
TOOLKIT_CONTAINER_OPTIONS=""
DEVICE_PLUGIN_OPTIONS=""
MIG_MANAGER_OPTIONS=""
# Ensure individual options are defined; avoid clobbering any existing values
: ${TOOLKIT_CONTAINER_OPTIONS:=""}
: ${DEVICE_PLUGIN_OPTIONS:=""}
: ${MIG_MANAGER_OPTIONS:=""}

Copilot uses AI. Check for mistakes.
Comment on lines +59 to +63
variables:
uses: ./.github/workflows/variables.yaml
with:
operator_image: ${{ inputs.operator_image }}
operator_version: ${{ inputs.operator_version }}
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the workflow is triggered by a push event (lines 18-22), the inputs object is undefined, which will cause the variables workflow call to receive empty/undefined values. This will break the e2e tests when run from push events. Push events should either not call the variables workflow with inputs, or the workflow should handle the undefined inputs case properly.

Copilot uses AI. Check for mistakes.
notify-failure:
runs-on: ubuntu-latest
needs: [fetch-latest-images, run-e2e-tests]
if: ${{ failure() }}
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The notify-failure job will run only when ANY of the previous jobs fail (failure() checks all needed jobs). However, if fetch-latest-images fails, the run-e2e-tests job will be skipped (not failed). The current condition if: ${{ failure() }} will trigger even if only fetch-latest-images fails, which may not be the intended behavior. Consider using if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }} for more precise control.

Suggested change
if: ${{ failure() }}
if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }}

Copilot uses AI. Check for mistakes.
with:
go-version: ${{ env.GOLANG_VERSION }}
- name: Lint
uses: golangci/golangci-lint-action@v8
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The golangci-lint-action version has been downgraded from v9 (in the old CI workflow) to v8 (in the new golang-checks workflow). This may introduce compatibility issues or missing features. Unless there's a specific reason for the downgrade, it should be kept at v9 for consistency.

Suggested change
uses: golangci/golangci-lint-action@v8
uses: golangci/golangci-lint-action@v9

Copilot uses AI. Check for mistakes.
Download `values-overrides` artifact to see tested component versions

*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>
<@S095E7BNGJU>
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Slack notification contains a hardcoded user ID <@S095E7BNGJU>. This appears to be a specific user mention that will only notify one person. Consider making this configurable via a secret or input parameter, or documenting who this user is and why they're hardcoded. If this is meant to notify a team or channel, use a more appropriate mechanism.

Suggested change
<@S095E7BNGJU>
${{ secrets.SLACK_MENTION }}

Copilot uses AI. Check for mistakes.
Comment on lines +77 to +86
# Determine if we should push images
PUSH_ON_BUILD="false"
if [[ "${{ github.actor }}" != "dependabot[bot]" ]]; then
if [[ "${{ github.event_name }}" == "pull_request" && "${{ github.event.pull_request.head.repo.full_name }}" == "${{ github.repository }}" ]]; then
PUSH_ON_BUILD="true"
elif [[ "${{ github.event_name }}" == "push" ]]; then
PUSH_ON_BUILD="true"
elif [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
PUSH_ON_BUILD="true"
fi
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the variables workflow is called from the ci.yaml push event (line 30-31 in ci.yaml), the github.event_name will be "push" inside the variables workflow, not "workflow_call". Line 84 checks for workflow_dispatch, but that won't match when called as a reusable workflow. The PUSH_ON_BUILD logic may not work correctly. Consider checking github.event_name == 'workflow_call' or relying solely on the calling workflow's event type.

Copilot uses AI. Check for mistakes.
Comment on lines +47 to +49
# Create a combined values file
COMBINED_VALUES=$(mktemp)
cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply concatenating YAML files with cat does not properly merge them. If both files define the same keys (e.g., both define toolkit:), the resulting YAML will have duplicate keys, which may cause unpredictable behavior or errors when parsed by Helm. Consider using a proper YAML merge tool like yq to merge the values files, or ensure that the files don't have overlapping keys.

Suggested change
# Create a combined values file
COMBINED_VALUES=$(mktemp)
cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"
# Create a combined values file by merging YAML documents
COMBINED_VALUES=$(mktemp)
if command -v yq >/dev/null 2>&1; then
yq ea '. as $item ireduce ({}; . * $item )' "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"
else
echo "Error: yq is required to merge YAML values files but was not found in PATH." >&2
exit 1
fi

Copilot uses AI. Check for mistakes.
Comment on lines +46 to +59
TOOLKIT=$(.github/scripts/get-latest-images.sh toolkit)

echo "::notice::Fetching latest device-plugin image..."
DEVICE_PLUGIN=$(.github/scripts/get-latest-images.sh device-plugin)

echo "::notice::Fetching latest mig-manager image..."
MIG_MANAGER=$(.github/scripts/get-latest-images.sh mig-manager)

# Generate values override file
.github/scripts/generate-values-overrides.sh \
values-overrides.yaml \
"${TOOLKIT}" \
"${DEVICE_PLUGIN}" \
"${MIG_MANAGER}"
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new shell scripts (get-latest-images.sh, generate-values-overrides.sh) are being executed directly in the workflow without a chmod command. These files must be committed with executable permissions set (chmod +x), otherwise the workflow will fail with a "Permission denied" error. Ensure these files have the executable bit set in git before merging.

Copilot uses AI. Check for mistakes.
Comment on lines +18 to +22
push:
branches:
- "pull-request/[0-9]+"
- main
- release-*
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When e2e-tests.yaml is triggered via push event (lines 18-22), secrets are not inherited automatically and the workflow will fail when trying to access AWS_ACCESS_KEY_ID and other required secrets. The workflow expects these secrets to be passed explicitly (as done in ci.yaml line 57 with secrets: inherit), but when triggered directly via push, no secrets context exists. Either remove the push trigger or make secrets available for push events.

Suggested change
push:
branches:
- "pull-request/[0-9]+"
- main
- release-*

Copilot uses AI. Check for mistakes.
@ArangoGutierrez
Copy link
Copy Markdown
Collaborator Author

@rajathagasthya @cdesiniotis PTAL

@ArangoGutierrez ArangoGutierrez force-pushed the integration-tests branch 2 times, most recently from 134ffdd to c2a6a6d Compare February 13, 2026 12:27
@ArangoGutierrez
Copy link
Copy Markdown
Collaborator Author

@rajathagasthya I did some last minute fixes before merging, but looks like some E2E are broken - Not related to this PR.

ArangoGutierrez added a commit to ArangoGutierrez/gpu-operator that referenced this pull request Feb 13, 2026
- Remove push: triggers from reusable workflows (config-checks,
  golang-checks, image-builds, release) to prevent double-execution
  when called from ci.yaml
- Eliminate ~80 lines of duplicated variable calculation logic in
  image-builds.yaml and release.yaml
- Make yq a hard requirement for YAML merging instead of falling back
  to cat concatenation which produces invalid YAML
- Replace echo -e with printf '%b' in env-to-values.sh for portability
- Add use_values_override input to e2e-tests workflow_dispatch
- Fix shebang consistency in get-latest-images.sh (#!/usr/bin/env bash)
- Document 8-char SHA truncation assumption in get-latest-images.sh
- Remove non-deterministic date from generated values file headers
- Remove trailing whitespace in code-scanning.yaml

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez force-pushed the integration-tests branch 2 times, most recently from e428763 to 484588d Compare February 13, 2026 20:22
ArangoGutierrez added a commit to ArangoGutierrez/gpu-operator that referenced this pull request Mar 11, 2026
- Remove push: triggers from reusable workflows (config-checks,
  golang-checks, image-builds, release) to prevent double-execution
  when called from ci.yaml
- Eliminate ~80 lines of duplicated variable calculation logic in
  image-builds.yaml and release.yaml
- Make yq a hard requirement for YAML merging instead of falling back
  to cat concatenation which produces invalid YAML
- Replace echo -e with printf '%b' in env-to-values.sh for portability
- Add use_values_override input to e2e-tests workflow_dispatch
- Fix shebang consistency in get-latest-images.sh (#!/usr/bin/env bash)
- Document 8-char SHA truncation assumption in get-latest-images.sh
- Remove non-deterministic date from generated values file headers
- Remove trailing whitespace in code-scanning.yaml

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
rajathagasthya pushed a commit to rajathagasthya/gpu-operator that referenced this pull request Mar 24, 2026
- Remove push: triggers from reusable workflows (config-checks,
  golang-checks, image-builds, release) to prevent double-execution
  when called from ci.yaml
- Eliminate ~80 lines of duplicated variable calculation logic in
  image-builds.yaml and release.yaml
- Make yq a hard requirement for YAML merging instead of falling back
  to cat concatenation which produces invalid YAML
- Replace echo -e with printf '%b' in env-to-values.sh for portability
- Add use_values_override input to e2e-tests workflow_dispatch
- Fix shebang consistency in get-latest-images.sh (#!/usr/bin/env bash)
- Document 8-char SHA truncation assumption in get-latest-images.sh
- Remove non-deterministic date from generated values file headers
- Remove trailing whitespace in code-scanning.yaml

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Break the monolithic ci.yaml into focused, reusable workflow files:
code-scanning, config-checks, golang-checks, image-builds, and
release. Centralize shared variables in variables.yaml. Standardize
copyright headers across all workflow files.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add e2e-tests.yaml reusable workflow for end-to-end GPU operator
testing. Introduce env-to-values.sh to convert environment variables
to Helm values overrides. Update install-operator.sh to use yq-based
YAML merging for component image configuration.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add forward-compatibility.yaml workflow that runs weekly to test the
GPU operator against latest upstream component images (toolkit,
device-plugin, mig-manager). Includes get-latest-images.sh with
retry/backoff for image verification and generate-values-overrides.sh
for Helm values generation.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Restore Holodeck to v0.2.18 matching the original ci.yaml
- Remove unused operator_version input from release workflow and caller
- Skip Slack notification when SLACK_BOT_TOKEN secret is not configured

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Remove push: triggers from reusable workflows (config-checks,
  golang-checks, image-builds, release) to prevent double-execution
  when called from ci.yaml
- Eliminate ~80 lines of duplicated variable calculation logic in
  image-builds.yaml and release.yaml
- Make yq a hard requirement for YAML merging instead of falling back
  to cat concatenation which produces invalid YAML
- Replace echo -e with printf '%b' in env-to-values.sh for portability
- Add use_values_override input to e2e-tests workflow_dispatch
- Fix shebang consistency in get-latest-images.sh (#!/usr/bin/env bash)
- Document 8-char SHA truncation assumption in get-latest-images.sh
- Remove non-deterministic date from generated values file headers
- Remove trailing whitespace in code-scanning.yaml

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Pin mikefarah/yq action to v4 version tag
- Restore SHA-pinned regctl action (v0.11.1)
- Add workflow-level permissions: {} for least privilege
- Move expression interpolations from run: to env: blocks

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Restore bundle CSV update and commit-SHA-tagged bundle image
- Derive OPERATOR_IMAGE_BASE from GITHUB_REPOSITORY_OWNER
- Decouple coverage from blocking image builds
- Remove duplicate pull_request trigger from code-scanning

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Add timeout-minutes to all workflow jobs

- Add set -euo pipefail and quote variables in install-operator.sh

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The addition of `set -euo pipefail` made `${SKIP_INSTALL}` fail when
unset. Use `${SKIP_INSTALL:-}` for safe default expansion.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The `set -euo pipefail` in install-operator.sh propagates to sourced
.definitions.sh where `${DEBUG}` is used without a default value.
Use `${DEBUG:-}` for safe expansion under nounset.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@rajathagasthya rajathagasthya force-pushed the integration-tests branch 2 times, most recently from b22d666 to d6fece6 Compare March 24, 2026 19:26
@rajathagasthya
Copy link
Copy Markdown
Contributor

Forward compatibility was tested in #2248 with a pull request trigger. Successful run: https://github.com/NVIDIA/gpu-operator/actions/runs/23505641426

Several issues prevented the forward-compatibility workflow from
running successfully:

- Add permissions (contents:read, id-token:write) to run-e2e-tests
  job, required by the reusable e2e-tests workflow.

- Fix mig-manager GitHub repo name: the image is published from
  NVIDIA/mig-parted, not NVIDIA/k8s-mig-manager (which doesn't
  exist).

- Replace yq-based YAML merging with Helm's native multi-file
  support (-f file1 -f file2). The remote test machines don't have
  yq installed.

- Pass OPERATOR_OPTIONS through in the values-file install path so
  test-case-specific flags (e.g. driver.nvidiaDriverCRD.enabled)
  are not silently dropped.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
@rahulait
Copy link
Copy Markdown
Contributor

Overall, LGTM

with:
commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }}
operator_image_base: ${{ needs.variables.outputs.operator_image_base }}
secrets: inherit
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of passing all the repo/org secrets to the called workflow, we should restrict it to only needed ones.

with:
operator_image: ${{ needs.variables.outputs.operator_image }}
operator_version: ${{ needs.variables.outputs.operator_version }}
secrets: inherit
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, lets use least-privilege and pass only secrets which are required

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants