Add weekly forward compatibility testing#1884
Add weekly forward compatibility testing#1884ArangoGutierrez wants to merge 11 commits intoNVIDIA:mainfrom
Conversation
|
This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation. |
d55dd82 to
9dab00e
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR implements automated weekly forward compatibility testing for the GPU Operator against the latest published component images from NVIDIA repositories. The changes refactor the existing monolithic CI workflow into reusable workflow modules and add forward compatibility testing infrastructure.
Key changes:
- Refactored CI workflows into separate, reusable workflow files (variables, golang-checks, config-checks, image-builds, e2e-tests, release)
- Added forward-compatibility.yaml workflow with scheduled weekly runs and manual trigger support
- Created get-latest-images.sh script to fetch latest commit-based images from component repositories
- Extended install-operator.sh to support device-plugin and mig-manager image overrides
- Added Slack notifications for test failures
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/forward-compatibility.yaml | New workflow for weekly forward compatibility testing with Slack notifications |
| .github/workflows/ci.yaml | Refactored to use reusable workflows instead of monolithic job definitions |
| .github/workflows/variables.yaml | New reusable workflow for shared CI variables |
| .github/workflows/e2e-tests.yaml | New reusable e2e test workflow with optional component image inputs |
| .github/workflows/image-builds.yaml | New reusable workflow for GPU operator image builds |
| .github/workflows/release.yaml | New reusable workflow for release processes |
| .github/workflows/golang-checks.yaml | New reusable workflow for Go checks and tests |
| .github/workflows/config-checks.yaml | New reusable workflow for configuration validation |
| .github/workflows/code-scanning.yaml | New reusable workflow for CodeQL scanning |
| .github/scripts/get-latest-images.sh | Script to fetch latest component images from NVIDIA repositories |
| tests/scripts/install-operator.sh | Extended to support device-plugin and mig-manager image overrides |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
84a0c31 to
54d6fc9
Compare
| env: | ||
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| run: | | ||
| # Use workflow_dispatch inputs if provided, otherwise fetch latest |
There was a problem hiding this comment.
Question: when would someone set these variables as inputs? If the scope of this workflow is to always test the top-of-tree images for all operands, then maybe we can remove the input variables and move the entire run: block into a shell script? Thoughts?
There was a problem hiding this comment.
My intention is to allow manual triggering of the workflow in case we want to test a very specific image without having to wait for the "weekly" run. But I'll set it in a way that even during manual triggers, we will simply fetch all the latests images from operands.
| toolkit_image: ${{ needs.fetch-latest-images.outputs.toolkit_image }} | ||
| device_plugin_image: ${{ needs.fetch-latest-images.outputs.device_plugin_image }} | ||
| mig_manager_image: ${{ needs.fetch-latest-images.outputs.mig_manager_image }} |
There was a problem hiding this comment.
I understand the current tests/scripts/install-operator.sh has separate input variables for the various operand images, but would it simplify our CI definition if we passed a values override file instead? Just an idea -- the previous step fetch-latest-images can craft the values override file that points to all the latest operand images. This file is passed as input to the run-e2e-tests step which launches our e2e test script with this overrides file. Thoughts? The reason I am suggesting this is because I see a lot of boilerplate in forward-compatibility.yaml and e2e-tests.yaml for declaring all of these input variables.
There was a problem hiding this comment.
I took a stab at this comment, please let me know what you think
| toolkit_image: | ||
| description: 'Override container-toolkit image' | ||
| required: false | ||
| type: string | ||
| device_plugin_image: | ||
| description: 'Override device-plugin image' | ||
| required: false | ||
| type: string | ||
| mig_manager_image: | ||
| description: 'Override mig-manager image' | ||
| required: false | ||
| type: string |
There was a problem hiding this comment.
Why do these variables need to be inputs? Aren't we always resolving the images in this workflow itself?
| with: | ||
| operator_image: ${{ needs.variables.outputs.operator_image_base }} | ||
| operator_version: ${{ needs.variables.outputs.operator_version }} |
There was a problem hiding this comment.
Question -- could we potentially move this to the variables workflow itself? As in, make operator_image and operator_version as outputs of the variables worflow?
There was a problem hiding this comment.
good catch, done!
| with: | ||
| commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }} | ||
| operator_version: ${{ needs.variables.outputs.operator_version }} | ||
| operator_image_base: ${{ needs.variables.outputs.operator_image_base }} |
There was a problem hiding this comment.
Same comment as above -- is this really needed? Can we move these variables as outputs of the variables workflow itself?
There was a problem hiding this comment.
same, done!
.github/workflows/e2e-tests.yaml
Outdated
| with: | ||
| operator_image: ${{ inputs.operator_image }} | ||
| operator_version: ${{ inputs.operator_version }} | ||
| toolkit_image: ${{ inputs.toolkit_image }} | ||
| device_plugin_image: ${{ inputs.device_plugin_image }} | ||
| mig_manager_image: ${{ inputs.mig_manager_image }} |
There was a problem hiding this comment.
What is the rationale for passing these variables as input to the variables workflow?
54d6fc9 to
203941a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| channel: ${{ secrets.SLACK_CHANNEL_ID }} | ||
| text: | | ||
| :x: *Forward Compatibility Test Failed for GPU Operator* | ||
|
|
||
| *Workflow:* ${{ github.workflow }} | ||
| *Repository:* ${{ github.repository }} | ||
| *Trigger:* ${{ github.event_name }} | ||
|
|
||
| *Tested Components:* | ||
| Download `values-overrides` artifact to see tested component versions | ||
|
|
||
| *Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run> | ||
| <@S095E7BNGJU> |
There was a problem hiding this comment.
The Slack notification payload format is incorrect. The action expects a JSON payload, but this is using YAML format with a nested structure. The correct format should use JSON with the payload structured as: {"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": "message"}. The current format will cause the Slack notification to fail.
| channel: ${{ secrets.SLACK_CHANNEL_ID }} | |
| text: | | |
| :x: *Forward Compatibility Test Failed for GPU Operator* | |
| *Workflow:* ${{ github.workflow }} | |
| *Repository:* ${{ github.repository }} | |
| *Trigger:* ${{ github.event_name }} | |
| *Tested Components:* | |
| Download `values-overrides` artifact to see tested component versions | |
| *Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run> | |
| <@S095E7BNGJU> | |
| {"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": ":x: *Forward Compatibility Test Failed for GPU Operator*\n\n*Workflow:* ${{ github.workflow }}\n*Repository:* ${{ github.repository }}\n*Trigger:* ${{ github.event_name }}\n\n*Tested Components:*\nDownload `values-overrides` artifact to see tested component versions\n\n*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>\n<@S095E7BNGJU>"} |
| # Clear individual options since we're using values file | ||
| TOOLKIT_CONTAINER_OPTIONS="" | ||
| DEVICE_PLUGIN_OPTIONS="" | ||
| MIG_MANAGER_OPTIONS="" |
There was a problem hiding this comment.
When USE_VALUES_FILE is true, OPERATOR_OPTIONS is cleared/never set, but it's still being appended to in the vGPU section (lines 81-92). The vGPU driver options added to OPERATOR_OPTIONS won't be applied when the values file approach is used (line 101-104), causing vGPU configurations to be silently ignored. The vGPU options should be added to the values file instead when USE_VALUES_FILE is true.
| # Clear individual options since we're using values file | |
| TOOLKIT_CONTAINER_OPTIONS="" | |
| DEVICE_PLUGIN_OPTIONS="" | |
| MIG_MANAGER_OPTIONS="" | |
| # Ensure individual options are defined; avoid clobbering any existing values | |
| : ${TOOLKIT_CONTAINER_OPTIONS:=""} | |
| : ${DEVICE_PLUGIN_OPTIONS:=""} | |
| : ${MIG_MANAGER_OPTIONS:=""} |
| variables: | ||
| uses: ./.github/workflows/variables.yaml | ||
| with: | ||
| operator_image: ${{ inputs.operator_image }} | ||
| operator_version: ${{ inputs.operator_version }} |
There was a problem hiding this comment.
When the workflow is triggered by a push event (lines 18-22), the inputs object is undefined, which will cause the variables workflow call to receive empty/undefined values. This will break the e2e tests when run from push events. Push events should either not call the variables workflow with inputs, or the workflow should handle the undefined inputs case properly.
| notify-failure: | ||
| runs-on: ubuntu-latest | ||
| needs: [fetch-latest-images, run-e2e-tests] | ||
| if: ${{ failure() }} |
There was a problem hiding this comment.
The notify-failure job will run only when ANY of the previous jobs fail (failure() checks all needed jobs). However, if fetch-latest-images fails, the run-e2e-tests job will be skipped (not failed). The current condition if: ${{ failure() }} will trigger even if only fetch-latest-images fails, which may not be the intended behavior. Consider using if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }} for more precise control.
| if: ${{ failure() }} | |
| if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }} |
.github/workflows/golang-checks.yaml
Outdated
| with: | ||
| go-version: ${{ env.GOLANG_VERSION }} | ||
| - name: Lint | ||
| uses: golangci/golangci-lint-action@v8 |
There was a problem hiding this comment.
The golangci-lint-action version has been downgraded from v9 (in the old CI workflow) to v8 (in the new golang-checks workflow). This may introduce compatibility issues or missing features. Unless there's a specific reason for the downgrade, it should be kept at v9 for consistency.
| uses: golangci/golangci-lint-action@v8 | |
| uses: golangci/golangci-lint-action@v9 |
| Download `values-overrides` artifact to see tested component versions | ||
|
|
||
| *Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run> | ||
| <@S095E7BNGJU> |
There was a problem hiding this comment.
The Slack notification contains a hardcoded user ID <@S095E7BNGJU>. This appears to be a specific user mention that will only notify one person. Consider making this configurable via a secret or input parameter, or documenting who this user is and why they're hardcoded. If this is meant to notify a team or channel, use a more appropriate mechanism.
| <@S095E7BNGJU> | |
| ${{ secrets.SLACK_MENTION }} |
| # Determine if we should push images | ||
| PUSH_ON_BUILD="false" | ||
| if [[ "${{ github.actor }}" != "dependabot[bot]" ]]; then | ||
| if [[ "${{ github.event_name }}" == "pull_request" && "${{ github.event.pull_request.head.repo.full_name }}" == "${{ github.repository }}" ]]; then | ||
| PUSH_ON_BUILD="true" | ||
| elif [[ "${{ github.event_name }}" == "push" ]]; then | ||
| PUSH_ON_BUILD="true" | ||
| elif [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then | ||
| PUSH_ON_BUILD="true" | ||
| fi |
There was a problem hiding this comment.
When the variables workflow is called from the ci.yaml push event (line 30-31 in ci.yaml), the github.event_name will be "push" inside the variables workflow, not "workflow_call". Line 84 checks for workflow_dispatch, but that won't match when called as a reusable workflow. The PUSH_ON_BUILD logic may not work correctly. Consider checking github.event_name == 'workflow_call' or relying solely on the calling workflow's event type.
tests/scripts/install-operator.sh
Outdated
| # Create a combined values file | ||
| COMBINED_VALUES=$(mktemp) | ||
| cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}" |
There was a problem hiding this comment.
Simply concatenating YAML files with cat does not properly merge them. If both files define the same keys (e.g., both define toolkit:), the resulting YAML will have duplicate keys, which may cause unpredictable behavior or errors when parsed by Helm. Consider using a proper YAML merge tool like yq to merge the values files, or ensure that the files don't have overlapping keys.
| # Create a combined values file | |
| COMBINED_VALUES=$(mktemp) | |
| cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}" | |
| # Create a combined values file by merging YAML documents | |
| COMBINED_VALUES=$(mktemp) | |
| if command -v yq >/dev/null 2>&1; then | |
| yq ea '. as $item ireduce ({}; . * $item )' "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}" | |
| else | |
| echo "Error: yq is required to merge YAML values files but was not found in PATH." >&2 | |
| exit 1 | |
| fi |
| TOOLKIT=$(.github/scripts/get-latest-images.sh toolkit) | ||
|
|
||
| echo "::notice::Fetching latest device-plugin image..." | ||
| DEVICE_PLUGIN=$(.github/scripts/get-latest-images.sh device-plugin) | ||
|
|
||
| echo "::notice::Fetching latest mig-manager image..." | ||
| MIG_MANAGER=$(.github/scripts/get-latest-images.sh mig-manager) | ||
|
|
||
| # Generate values override file | ||
| .github/scripts/generate-values-overrides.sh \ | ||
| values-overrides.yaml \ | ||
| "${TOOLKIT}" \ | ||
| "${DEVICE_PLUGIN}" \ | ||
| "${MIG_MANAGER}" |
There was a problem hiding this comment.
The new shell scripts (get-latest-images.sh, generate-values-overrides.sh) are being executed directly in the workflow without a chmod command. These files must be committed with executable permissions set (chmod +x), otherwise the workflow will fail with a "Permission denied" error. Ensure these files have the executable bit set in git before merging.
.github/workflows/e2e-tests.yaml
Outdated
| push: | ||
| branches: | ||
| - "pull-request/[0-9]+" | ||
| - main | ||
| - release-* |
There was a problem hiding this comment.
When e2e-tests.yaml is triggered via push event (lines 18-22), secrets are not inherited automatically and the workflow will fail when trying to access AWS_ACCESS_KEY_ID and other required secrets. The workflow expects these secrets to be passed explicitly (as done in ci.yaml line 57 with secrets: inherit), but when triggered directly via push, no secrets context exists. Either remove the push trigger or make secrets available for push events.
| push: | |
| branches: | |
| - "pull-request/[0-9]+" | |
| - main | |
| - release-* |
203941a to
764a66e
Compare
134ffdd to
c2a6a6d
Compare
|
@rajathagasthya I did some last minute fixes before merging, but looks like some E2E are broken - Not related to this PR. |
- Remove push: triggers from reusable workflows (config-checks, golang-checks, image-builds, release) to prevent double-execution when called from ci.yaml - Eliminate ~80 lines of duplicated variable calculation logic in image-builds.yaml and release.yaml - Make yq a hard requirement for YAML merging instead of falling back to cat concatenation which produces invalid YAML - Replace echo -e with printf '%b' in env-to-values.sh for portability - Add use_values_override input to e2e-tests workflow_dispatch - Fix shebang consistency in get-latest-images.sh (#!/usr/bin/env bash) - Document 8-char SHA truncation assumption in get-latest-images.sh - Remove non-deterministic date from generated values file headers - Remove trailing whitespace in code-scanning.yaml Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
e428763 to
484588d
Compare
484588d to
3697a1f
Compare
- Remove push: triggers from reusable workflows (config-checks, golang-checks, image-builds, release) to prevent double-execution when called from ci.yaml - Eliminate ~80 lines of duplicated variable calculation logic in image-builds.yaml and release.yaml - Make yq a hard requirement for YAML merging instead of falling back to cat concatenation which produces invalid YAML - Replace echo -e with printf '%b' in env-to-values.sh for portability - Add use_values_override input to e2e-tests workflow_dispatch - Fix shebang consistency in get-latest-images.sh (#!/usr/bin/env bash) - Document 8-char SHA truncation assumption in get-latest-images.sh - Remove non-deterministic date from generated values file headers - Remove trailing whitespace in code-scanning.yaml Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Remove push: triggers from reusable workflows (config-checks, golang-checks, image-builds, release) to prevent double-execution when called from ci.yaml - Eliminate ~80 lines of duplicated variable calculation logic in image-builds.yaml and release.yaml - Make yq a hard requirement for YAML merging instead of falling back to cat concatenation which produces invalid YAML - Replace echo -e with printf '%b' in env-to-values.sh for portability - Add use_values_override input to e2e-tests workflow_dispatch - Fix shebang consistency in get-latest-images.sh (#!/usr/bin/env bash) - Document 8-char SHA truncation assumption in get-latest-images.sh - Remove non-deterministic date from generated values file headers - Remove trailing whitespace in code-scanning.yaml Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Break the monolithic ci.yaml into focused, reusable workflow files: code-scanning, config-checks, golang-checks, image-builds, and release. Centralize shared variables in variables.yaml. Standardize copyright headers across all workflow files. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add e2e-tests.yaml reusable workflow for end-to-end GPU operator testing. Introduce env-to-values.sh to convert environment variables to Helm values overrides. Update install-operator.sh to use yq-based YAML merging for component image configuration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add forward-compatibility.yaml workflow that runs weekly to test the GPU operator against latest upstream component images (toolkit, device-plugin, mig-manager). Includes get-latest-images.sh with retry/backoff for image verification and generate-values-overrides.sh for Helm values generation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Restore Holodeck to v0.2.18 matching the original ci.yaml - Remove unused operator_version input from release workflow and caller - Skip Slack notification when SLACK_BOT_TOKEN secret is not configured Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Remove push: triggers from reusable workflows (config-checks, golang-checks, image-builds, release) to prevent double-execution when called from ci.yaml - Eliminate ~80 lines of duplicated variable calculation logic in image-builds.yaml and release.yaml - Make yq a hard requirement for YAML merging instead of falling back to cat concatenation which produces invalid YAML - Replace echo -e with printf '%b' in env-to-values.sh for portability - Add use_values_override input to e2e-tests workflow_dispatch - Fix shebang consistency in get-latest-images.sh (#!/usr/bin/env bash) - Document 8-char SHA truncation assumption in get-latest-images.sh - Remove non-deterministic date from generated values file headers - Remove trailing whitespace in code-scanning.yaml Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Pin mikefarah/yq action to v4 version tag
- Restore SHA-pinned regctl action (v0.11.1)
- Add workflow-level permissions: {} for least privilege
- Move expression interpolations from run: to env: blocks
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Restore bundle CSV update and commit-SHA-tagged bundle image - Derive OPERATOR_IMAGE_BASE from GITHUB_REPOSITORY_OWNER - Decouple coverage from blocking image builds - Remove duplicate pull_request trigger from code-scanning Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Add timeout-minutes to all workflow jobs - Add set -euo pipefail and quote variables in install-operator.sh Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The addition of `set -euo pipefail` made `${SKIP_INSTALL}` fail when
unset. Use `${SKIP_INSTALL:-}` for safe default expansion.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The `set -euo pipefail` in install-operator.sh propagates to sourced
.definitions.sh where `${DEBUG}` is used without a default value.
Use `${DEBUG:-}` for safe expansion under nounset.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
b22d666 to
d6fece6
Compare
|
Forward compatibility was tested in #2248 with a pull request trigger. Successful run: https://github.com/NVIDIA/gpu-operator/actions/runs/23505641426 |
Several issues prevented the forward-compatibility workflow from running successfully: - Add permissions (contents:read, id-token:write) to run-e2e-tests job, required by the reusable e2e-tests workflow. - Fix mig-manager GitHub repo name: the image is published from NVIDIA/mig-parted, not NVIDIA/k8s-mig-manager (which doesn't exist). - Replace yq-based YAML merging with Helm's native multi-file support (-f file1 -f file2). The remote test machines don't have yq installed. - Pass OPERATOR_OPTIONS through in the values-file install path so test-case-specific flags (e.g. driver.nvidiaDriverCRD.enabled) are not silently dropped. Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
d6fece6 to
1afd483
Compare
|
Overall, LGTM |
| with: | ||
| commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }} | ||
| operator_image_base: ${{ needs.variables.outputs.operator_image_base }} | ||
| secrets: inherit |
There was a problem hiding this comment.
Instead of passing all the repo/org secrets to the called workflow, we should restrict it to only needed ones.
| with: | ||
| operator_image: ${{ needs.variables.outputs.operator_image }} | ||
| operator_version: ${{ needs.variables.outputs.operator_version }} | ||
| secrets: inherit |
There was a problem hiding this comment.
Same here, lets use least-privilege and pass only secrets which are required
Implement automated forward compatibility tests that validate GPU Operator
against the latest published images from NVIDIA component repositories.
Changes
Workflow Refactoring:
ci.yamlinto reusable workflow modules (variables.yaml,golang-checks.yaml,config-checks.yaml,image-builds.yaml,e2e-tests.yaml,release.yaml)Forward Compatibility Testing:
forward-compatibility.yamlworkflow (weekly + manual trigger)get-latest-images.shto fetch latest commit-based images with retry/backoffgenerate-values-overrides.shto produce Helm values override fileE2E Test Enhancements:
e2e-tests.yamlwith optionaluse_values_overrideinputenv-to-values.shto convert env vars to Helm values fileinstall-operator.shwith proper YAML merging viayqVALUES_FILEto remote test instances (per @rajathagasthya review)Notifications:
Components Tested
ghcr.io/nvidia/container-toolkitghcr.io/nvidia/k8s-device-pluginghcr.io/nvidia/k8s-mig-managerWe could add other operands later, but I wanted to start with the core ones.
Schedule: Every Monday at 2 AM UTC
Review Feedback Addressed
operator_image/operator_versionto variables workflow (@cdesiniotis)VALUES_FILEto remote test instance (@rajathagasthya)yqfor proper YAML mergingoperator_versioninput from release workflow