TCL-4378: Add Kind-based operator lifecycle E2E test #14

Open
jplimack-ai wants to merge 26 commits into jplimack/tcl-4373-update-sriov-network-operator-fork-deps-to-k8s-v0342 from jplimack/tcl-4378-add-kind-e2e-test

@jplimack-ai
Collaborator

Adds a pure Go E2E test suite that validates the SR-IOV operator's full deployment pipeline in a real Kind cluster, with no SR-IOV hardware or KVM needed. With no devices present, the config daemon reaches SyncStatus: "Succeeded" via shouldSkipReconciliation(), so the entire operator lifecycle can be tested on standard CI runners.

What's new:

  • test/kind/ test suite: Ginkgo tests behind //go:build kind that create a Kind cluster, build and load Docker images, deploy cert-manager, install the Helm chart, and assert the operator is healthy
  • Test cases: operator deployment available, config daemon running, CRDs registered, SriovNetworkNodeState synced, SriovOperatorConfig/default exists, webhook config present
  • make test-e2e-kind-virtual: new Makefile target to run the suite
  • CI job: kind-e2e on ubuntu-24.04-4core, runs after build/test/golangci with no special runner requirements

Uses Kind SDK, Docker SDK, and Helm SDK as Go libraries (pinned to versions compatible with the project's k8s v0.28.x deps). The only os/exec call is for kubectl apply of the cert-manager manifest.

Ticket: TCL-4378

@jplimack-ai jplimack-ai self-assigned this Mar 5, 2026
@github-actions

github-actions Bot commented Mar 5, 2026

Thanks for your PR.
To run the vendor CIs, maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for the NVIDIA vendor.

To skip the vendor CIs, maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for the NVIDIA vendor.

Best regards.

Pure Go E2E test that validates operator deployment and reconciliation
in a real Kind cluster without SR-IOV hardware. Uses Kind SDK, Docker
SDK, and Helm SDK as Go libraries. Adds CI job on ubuntu-24.04-4core.

TCL-4378
@jplimack-ai jplimack-ai changed the base branch from master to jplimack/tcl-4373-update-sriov-network-operator-fork-deps-to-k8s-v0342 March 5, 2026 06:43
Comment thread go.mod
Enable the `modernize` golangci-lint checker and auto-fix all 12 issues:
- interface{} -> any
- for loops -> range over int
- manual contains loops -> slices.Contains
- []byte(fmt.Sprintf...) -> fmt.Appendf

Also fix 3 staticcheck SA5011 nil-deref warnings in conformance tests.
@jplimack-ai jplimack-ai force-pushed the jplimack/tcl-4378-add-kind-e2e-test branch from ca935fc to 9df8902 Compare March 5, 2026 06:45
Jake Plimack added 6 commits March 5, 2026 07:42
Add validation-style tests for network-resources-injector and
operator-webhook DaemonSets. Also fix setup-go to use go-version-file
instead of hardcoded version.

Add a new `virtual-k8s-conformance` CI job that runs the existing
kcli-based virtual cluster conformance tests on GitHub-hosted
ubuntu-24.04-4core runners with KVM, removing the dependency on
self-hosted [sriov] runners. Gated behind ENABLE_VIRTUAL_E2E repo var.

Also includes remaining modernize linter fixes.

The IsKernelArgsSet mock expectations in the daemon plugin_test.go
BeforeEach block need .AnyTimes() since not all test cases exercise
the generic plugin (e.g., VirtualOpenstack only loads the virtual
plugin). Also quote $USER in workflow to fix shellcheck SC2086.

- udev_test.go: add gomock.Any() for context arg in LoadUdevRules
  "Failed to trigger rules" test
- generic_plugin_test.go: fix RunCommand mock to match 5 args
  (ctx + command + 3 variadic) instead of 6

The BeforeEach .AnyTimes() expectation for SetRDMASubsystem("")
consumes all matching calls, making the exact-once expectation
in the "should not configure RDMA kernel args" test always fail
as "missing call".

t.Setenv cannot be used in parallel tests (panics on Go 1.22+).
Remove t.Parallel() from TestStaticValidateSriovNetworkNodePolicyWithInvalidVendorDevMode.
@coveralls

coveralls commented Mar 5, 2026

Pull Request Test Coverage Report for Build 22747019920

Details

  • 47 of 83 (56.63%) changed or added relevant lines in 30 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.03%) to 47.361%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| controllers/sriovoperatorconfig_controller.go | 1 | 2 | 50.0% |
| pkg/host/internal/lib/netlink/netlink.go | 0 | 1 | 0.0% |
| pkg/host/internal/udev/udev.go | 3 | 4 | 75.0% |
| pkg/platforms/openshift/openshift.go | 0 | 1 | 0.0% |
| pkg/render/funcs.go | 1 | 2 | 50.0% |
| pkg/systemd/systemd.go | 0 | 1 | 0.0% |
| pkg/webhook/webhook.go | 0 | 1 | 0.0% |
| pkg/host/internal/service/service.go | 0 | 2 | 0.0% |
| pkg/daemon/daemon.go | 3 | 6 | 50.0% |
| pkg/vendors/mellanox/mellanox.go | 0 | 3 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| controllers/drain_controller_helper.go | 1 | 63.59% |
| controllers/drain_controller.go | 3 | 78.57% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 22705733528: | -0.03% |
| Covered Lines: | 7269 |
| Relevant Lines: | 15348 |

💛 - Coveralls

Jake Plimack added 5 commits March 5, 2026 08:49
These tests share global state (interfaceSelected, snclient) and
cannot safely run in parallel. The modernize linter incorrectly
added t.Parallel() to them.

Tests in validate_test.go share global state (interfaceSelected,
snclient) and cannot safely run in parallel.

ubuntu-24.04-4core runners are not available in this org.
Switch kind-e2e and virtual-k8s-conformance to ubuntu-22.04-4core.

Kind v0.31.0 defaults to k8s 1.33 which uses kubeadm v1beta4,
but the operator targets k8s 1.28. Pin the node image to match
and increase WaitForReady to 5 minutes for slower CI runners.

Add DisplayUsage/DisplaySalutation options and log full error
before asserting to diagnose kubeadm init failures in CI.
@jplimack-ai jplimack-ai marked this pull request as ready for review March 5, 2026 22:46
Use v1.31.4 node image (default for Kind v0.31.0) instead of
v1.28.15 which fails kubeadm init on ubuntu-22.04 runners.
Add verbose logging for Kind cluster creation failures.
@jplimack-ai jplimack-ai force-pushed the jplimack/tcl-4378-add-kind-e2e-test branch from dd934b4 to dc8344d Compare March 5, 2026 22:49
@jplimack-ai jplimack-ai closed this Mar 5, 2026
@jplimack-ai jplimack-ai reopened this Mar 5, 2026
Jake Plimack added 4 commits March 5, 2026 16:19
Remove explicit node image pin (let Kind use its default) and add
docker container log dump when kubeadm init fails to get the actual
error output instead of Gomega-truncated byte arrays.

Kubernetes v1.35 (default for Kind v0.31.0) rejects
node-role.kubernetes.io/worker as a --node-labels kubelet flag
because it's not in the allowed kubernetes.io label prefix set.
Apply the label post-creation via kubectl instead.
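The post-creation labeling step described in this commit could look roughly like the sketch below. It shells out to `kubectl label node`, which is a real kubectl subcommand. The helper names (`labelNodeArgs`, `labelNode`), the kubeconfig path, and the node name are assumptions for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// workerRoleLabel is the label named in the commit message.
const workerRoleLabel = "node-role.kubernetes.io/worker"

// labelNodeArgs builds the kubectl arguments that set the worker role
// label on an existing node. The trailing "=" gives the label an empty
// value, and --overwrite keeps the call idempotent across reruns.
func labelNodeArgs(kubeconfig, node string) []string {
	return []string{
		"--kubeconfig", kubeconfig,
		"label", "node", node,
		workerRoleLabel + "=",
		"--overwrite",
	}
}

// labelNode runs kubectl and, on failure, folds its combined output
// into the returned error so the cause shows up in test logs.
func labelNode(kubeconfig, node string) error {
	out, err := exec.Command("kubectl", labelNodeArgs(kubeconfig, node)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("kubectl label node %s: %w: %s", node, err, strings.TrimSpace(string(out)))
	}
	return nil
}

func main() {
	// Hypothetical kubeconfig path and node name; the command is
	// printed here rather than executed.
	args := labelNodeArgs("/tmp/kind-kubeconfig", "kind-control-plane")
	fmt.Println("kubectl", strings.Join(args, " "))
}
```

Because the label is applied through the API server after the node has registered, the kubelet's `--node-labels` restriction does not come into play.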

The build scripts (hack/build-go.sh) use git rev-parse inside
the Docker build. Excluding .git from the tar context caused
'git rev-parse --show-cdup' to fail with exit code 128.
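To make the fix concrete, here is a simplified sketch of assembling a tar build context with the standard library while keeping `.git` in the archive. It does not call the Docker SDK. The names `buildContext` and `demo`, the skip set, and the sample file layout are all hypothetical, and symlinks are ignored for brevity, which a real build context would need to handle.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// buildContext tars every regular file under root. Top-level directories
// listed in skip are left out. The caller must not put ".git" in skip if
// the build runs git commands.
func buildContext(root string, skip map[string]bool) (*bytes.Buffer, []string, error) {
	buf := new(bytes.Buffer)
	tw := tar.NewWriter(buf)
	var names []string

	walk := func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		if d.IsDir() {
			if skip[rel] {
				return filepath.SkipDir
			}
			return nil
		}
		if !d.Type().IsRegular() {
			return nil // simplification: symlinks and special files are ignored
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel) // tar paths use forward slashes
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		if _, err := io.Copy(tw, f); err != nil {
			return err
		}
		names = append(names, hdr.Name)
		return nil
	}
	if err := filepath.WalkDir(root, walk); err != nil {
		return nil, nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, nil, err
	}
	return buf, names, nil
}

// demo builds a context from a throwaway directory and returns the
// archived file names. Only the hypothetical "bin" directory is skipped.
func demo() ([]string, error) {
	root, err := os.MkdirTemp("", "ctx")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	files := map[string]string{
		".git/HEAD":   "ref: refs/heads/main\n",
		"Dockerfile":  "FROM scratch\n",
		"bin/manager": "binary",
	}
	for name, body := range files {
		p := filepath.Join(root, filepath.FromSlash(name))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(p, []byte(body), 0o644); err != nil {
			return nil, err
		}
	}
	_, names, err := buildContext(root, map[string]bool{"bin": true})
	return names, err
}

func main() {
	names, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
	fmt.Println(names)
}
```

The trade-off is a larger context: shipping `.git` costs upload time, but it lets `git rev-parse` succeed inside the build.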
Jake Plimack added 7 commits March 5, 2026 17:44
- Dump pod status and daemon logs when the SriovNetworkNodeState wait times out
- Set disableDrain=true for single-node Kind cluster

The config daemon's init containers require external CNI images
(sriov-cni, infiniband-cni) that aren't loaded into Kind, causing
them to hang in Init state. Relax BeforeSuite to only wait for
SriovNetworkNodeState objects to exist (created by the operator
controller) rather than waiting for SyncStatus=Succeeded which
requires the daemon pod to fully start.

- Config daemon: check DaemonSet is scheduled (not ready), since
  init containers need external images not loaded into Kind
- Webhook: use Eventually to wait for MutatingWebhookConfiguration
  since it depends on cert-manager issuing certificates

The MutatingWebhookConfiguration depends on cert-manager issuing
certificates, which is unreliable in Kind. Check that the
operator-webhook DaemonSet is created instead.

DaemonSet pods may not reach Ready state in Kind due to init
containers pulling external images. Check DesiredNumberScheduled > 0
instead of DesiredNumberScheduled == NumberReady. Remove duplicate
operator-webhook test.
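The difference between the two DaemonSet checks in the last commit can be sketched as follows. The local `daemonSetStatus` struct is a stand-in that mirrors the two `appsv1.DaemonSetStatus` fields named in the message, so the example runs without Kubernetes dependencies. The helper names are assumptions, not the test suite's actual functions.

```go
package main

import "fmt"

// daemonSetStatus mirrors the two appsv1.DaemonSetStatus fields used by
// the check. It is a local stand-in, not the Kubernetes type.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	NumberReady            int32
}

// isScheduled is the relaxed check: the controller wants at least one
// pod on some node, whether or not that pod has become Ready.
func isScheduled(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0
}

// isFullyReady is the original strict check, which cannot pass while
// init containers are stuck pulling images that are absent from Kind.
func isFullyReady(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled == s.NumberReady
}

func main() {
	// One pod scheduled, stuck in Init: relaxed check passes, strict fails.
	stuck := daemonSetStatus{DesiredNumberScheduled: 1, NumberReady: 0}
	fmt.Println(isScheduled(stuck), isFullyReady(stuck))
}
```

Note that the strict comparison is also satisfied when both fields are zero, so the relaxed form additionally guards against a DaemonSet that matched no nodes.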