NO-JIRA: feat(e2e): add OIDC IAM fix and NodePool validation to backup/restore #7837
Conversation
Pipeline controller notification: this repository is configured in LGTM mode.

Important: Review skipped. Auto reviews are limited based on label configuration — only excluded labels are configured. Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration (Review profile: CHILL, Plan: Pro).

Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Adds a full HyperShift OADP backup/restore E2E suite: Makefile/CI targets and env vars, exported name helpers, OADP CLI wrappers and builders, Velero probes/helpers, a prober manager, cleanup/disaster utilities, extensive e2e tests and docs, and a minor .codespellignore addition.
Sequence Diagram(s)

sequenceDiagram
participant TestRunner as "E2E Test Runner"
participant HypershiftCLI as "hypershift CLI"
participant Kubernetes as "Kubernetes API"
participant Velero as "Velero (in-cluster)"
TestRunner->>HypershiftCLI: RunOADPBackup(args)
HypershiftCLI->>Kubernetes: create Backup / Schedule
Kubernetes-->>HypershiftCLI: acknowledge
TestRunner->>Velero: WaitForScheduleCompletion / WaitForBackupCompletion
Velero->>Kubernetes: list/watch Backup/Restore resources
Kubernetes-->>Velero: resource states (Pending -> Completed)
Velero-->>TestRunner: completion status
TestRunner->>HypershiftCLI: RunOADPRestore(from-backup)
HypershiftCLI->>Kubernetes: create Restore
Kubernetes-->>HypershiftCLI: acknowledge
TestRunner->>Velero: WaitForRestoreCompletion
Velero-->>TestRunner: restore completion status
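To make the diagram concrete, here is a rough Go sketch of how a test could drive that backup-then-restore flow. The helper names mirror the walkthrough above, but the signatures, option fields, and wiring shown here are assumptions for illustration, not the suite's actual API; the concrete helpers are injected as parameters so the sketch stays self-contained.

```go
package backuprestoresketch

import (
	"context"
	"fmt"

	"github.com/go-logr/logr"
)

// Assumed option shapes, loosely mirroring OADPBackupOptions / OADPRestoreOptions.
type backupOptions struct{ HCName, HCNamespace string }
type restoreOptions struct{ HCName, HCNamespace, FromBackup string }

// runBackupRestoreCycle strings the steps from the sequence diagram together:
// create a Backup, wait for Velero to finish it, then restore and wait again.
// The real RunOADPBackup / WaitForBackupCompletion / RunOADPRestore /
// WaitForRestoreCompletion helpers would be passed in for the func parameters.
func runBackupRestoreCycle(
	ctx context.Context,
	logger logr.Logger,
	artifactDir, backupName string,
	runBackup func(context.Context, logr.Logger, string, *backupOptions) error,
	waitForBackup func(context.Context, string) error,
	runRestore func(context.Context, logr.Logger, string, *restoreOptions) error,
	waitForRestore func(context.Context, string) error,
) error {
	// 1. Ask the hypershift CLI to create the Velero Backup.
	opts := &backupOptions{HCName: "example-hc", HCNamespace: "clusters"} // hypothetical names
	if err := runBackup(ctx, logger, artifactDir, opts); err != nil {
		return fmt.Errorf("backup failed: %w", err)
	}

	// 2. Poll Velero until the Backup reaches a final phase.
	if err := waitForBackup(ctx, backupName); err != nil {
		return fmt.Errorf("backup did not complete: %w", err)
	}

	// 3. Restore from that backup, then wait for the Restore to complete.
	rOpts := &restoreOptions{HCName: "example-hc", HCNamespace: "clusters", FromBackup: backupName}
	if err := runRestore(ctx, logger, artifactDir, rOpts); err != nil {
		return fmt.Errorf("restore failed: %w", err)
	}
	return waitForRestore(ctx, backupName)
}
```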
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
📝 Coding Plan
return strings.Join(conditions, "\n\t\t")
}

g.Expect(found).To(gomega.BeTrue(),
This is an example error message that is produced by this code:
[FAILED]
incorrect condition: wanted ReachedIgnitionEndpoint=True, got ReachedIgnitionEndpoint=False: ignitionNotReached
*v1beta1.NodePool clusters/mgencur-hc1-us-east-1a conditions at RV 152188:
AutoscalingEnabled=False: AsExpected
UpdateManagementEnabled=True: AsExpected
ValidReleaseImage=True: AsExpected(Using release image: quay.io/openshift-release-dev/ocp-release-nightly@sha256:ec0824c60a1460dcaa2b4f2c261594115b0dc31a4b6b2a286bb8f1ae4cba32ed)
ValidArchPlatform=True: AsExpected
ReconciliationActive=True: ReconciliationActive(Reconciliation active on resource)
SupportedVersionSkew=True: AsExpected(Release image version is valid)
ValidMachineConfig=True: AsExpected
UpdatingConfig=False: AsExpected
UpdatingVersion=False: AsExpected
ValidGeneratedPayload=True: AsExpected(Payload generated successfully)
ReachedIgnitionEndpoint=False: ignitionNotReached
AllMachinesReady=True: AsExpected(All is well)
AllNodesHealthy=True: AsExpected(All is well)
ValidPlatformConfig=True: AsExpected(All is well)
ValidPlatformImage=True: AsExpected(Bootstrap AMI is "ami-04018496b0a1da2d2")
AWSSecurityGroupAvailable=True: AsExpected(NodePool has a default security group)
ValidTuningConfig=True: AsExpected
UpdatingPlatformMachineTemplate=False: AsExpected
Ready=True: AsExpected
AutorepairEnabled=False: AsExpected
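For readers parsing the output above: each line is rendered as Type=Status: Reason(Message) and joined with the strings.Join(conditions, "\n\t\t") call quoted earlier. A minimal, illustrative formatter could look like the sketch below; it is not the repository's actual helper, and it uses metav1.Condition rather than whatever condition type the suite really uses.

```go
package conditionsketch

import (
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// formatConditions renders conditions as "Type=Status: Reason(Message)",
// matching the failure output shown above, and joins them with "\n\t\t".
func formatConditions(conds []metav1.Condition) string {
	lines := make([]string, 0, len(conds))
	for _, c := range conds {
		line := fmt.Sprintf("%s=%s: %s", c.Type, c.Status, c.Reason)
		if c.Message != "" {
			line += fmt.Sprintf("(%s)", c.Message)
		}
		lines = append(lines, line)
	}
	return strings.Join(lines, "\n\t\t")
}
```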
excludeWorkloads []string = []string{
	"router", "karpenter", "karpenter-operator", "aws-node-termination-handler",
}
machinesReadyCondition []util.Condition = []util.Condition{
We can also check all desired conditions, not just MachinesReady, but we need clarification in this slack thread
Pushed a commit to check all conditions.
Actionable comments posted: 7
🧹 Nitpick comments (4)
test/e2e/v2/backuprestore/cleanup.go (1)
142-145: Log partial discovery failures for diagnosability.

At Line 143, `IsGroupDiscoveryFailedError` is tolerated silently. If a missed API group contains finalizers, cleanup can stall with weak diagnostics. Consider logging the partial discovery error before continuing.

♻️ Suggested tweak
- apiResourceLists, err := discoveryClient.ServerPreferredNamespacedResources() - if err != nil && !discovery.IsGroupDiscoveryFailedError(err) { - return fmt.Errorf("failed to get API resources: %w", err) - } + apiResourceLists, err := discoveryClient.ServerPreferredNamespacedResources() + if err != nil { + if discovery.IsGroupDiscoveryFailedError(err) { + logger.Info("partial API discovery failure while collecting namespaced resources", "error", err) + } else { + return fmt.Errorf("failed to get API resources: %w", err) + } + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/cleanup.go` around lines 142 - 145, The code currently swallows partial discovery failures from discoveryClient.ServerPreferredNamespacedResources() when discovery.IsGroupDiscoveryFailedError(err) is true; change this to log the partial discovery error (including err and context like which groups failed) before continuing so cleanup diagnostics are available (refer to discoveryClient.ServerPreferredNamespacedResources, discovery.IsGroupDiscoveryFailedError, and variables apiResourceLists/err to locate the check and insert a processLogger/klog/t.Logf call that records the error details).test/e2e/v2/backuprestore/cleanup_test.go (1)
43-47: Rename the test case to match expected behavior.

The case name says wildcard "does not match" while expected is true. This can confuse future triage.

🧹 Suggested rename
- name: "wildcard verb does not match specific verb", + name: "wildcard verb matches specific verb",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/cleanup_test.go` around lines 43 - 47, Update the test case name string to reflect the expected behavior: change the entry with name "wildcard verb does not match specific verb" (in the test table used by the cleanup_test.go test) to something like "wildcard verb matches specific verb" because the test's verbs {"*"} and verb "get" expect true; ensure only the name field is updated so the test logic (verbs/verb/expected) remains unchanged.test/e2e/v2/backuprestore/velero.go (2)
48-86: `EnsureVeleroPodRunning` is a one-shot check; consider polling + clarify namespace/label assumptions.

Right now this can fail during rollout/restart windows even though Velero would become ready moments later. Also, the doc says "specified namespace" but the implementation is fixed to `DefaultOADPNamespace`, and the label selector (deploy, component) may be cluster-version/OADP-version dependent.

Proposed polling version (keeps the helpful "pod states" error)
-// EnsureVeleroPodRunning checks if at least one Velero pod is running and ready in the specified namespace. +// EnsureVeleroPodRunning checks if at least one Velero pod is running and ready in DefaultOADPNamespace. // This function tolerates multiple Velero pods (e.g., during rollouts or restarts) as long as at least one is healthy. func EnsureVeleroPodRunning(testCtx *internal.TestContext) error { client := testCtx.MgmtClient - podList := &corev1.PodList{} labels := map[string]string{ "deploy": "velero", "component": "velero", } - if err := client.List(testCtx.Context, podList, crclient.InNamespace(DefaultOADPNamespace), crclient.MatchingLabels(labels)); err != nil { - return fmt.Errorf("failed to list Velero pods: %w", err) - } - - if len(podList.Items) == 0 { - return fmt.Errorf("no Velero pod found in namespace %s", DefaultOADPNamespace) - } - - // Check if at least one pod is running and ready - // Multiple pods can exist during rollouts or restarts, which is acceptable - for _, pod := range podList.Items { - if pod.Status.Phase == corev1.PodRunning && isPodReady(&pod) { - return nil - } - } - - // No running and ready pod found - collect pod states for error message - var podStates []string - for _, pod := range podList.Items { + const ( + veleroPodPollInterval = 10 * time.Second + veleroPodTimeout = 5 * time.Minute + ) + + var lastPods []corev1.Pod + err := wait.PollUntilContextTimeout(testCtx.Context, veleroPodPollInterval, veleroPodTimeout, true, func(ctx context.Context) (bool, error) { + podList := &corev1.PodList{} + if err := client.List(ctx, podList, crclient.InNamespace(DefaultOADPNamespace), crclient.MatchingLabels(labels)); err != nil { + return false, err + } + lastPods = podList.Items + for _, pod := range podList.Items { + if pod.Status.Phase == corev1.PodRunning && isPodReady(&pod) { + return true, nil + } + } + return false, nil + }) + if err == nil { + return nil + } + + if len(lastPods) == 0 { + return fmt.Errorf("no Velero pod found in namespace %s within %v: %w", DefaultOADPNamespace, veleroPodTimeout, err) + } + + var podStates []string + for _, pod := range lastPods { readyStatus := "not ready" if isPodReady(&pod) { readyStatus = "ready" } podStates = append(podStates, fmt.Sprintf("%s: phase=%s, %s", pod.Name, pod.Status.Phase, readyStatus)) } - return fmt.Errorf("no running and ready Velero pod found in namespace %s (found %d pod(s): %v)", - DefaultOADPNamespace, len(podList.Items), podStates) + return fmt.Errorf("no running and ready Velero pod found in namespace %s within %v (found %d pod(s): %v): %w", + DefaultOADPNamespace, veleroPodTimeout, len(lastPods), podStates, err) }For the label selector, it’d be worth verifying against a real cluster with something like:
oc -n openshift-adp get pods --show-labels | rg velero
oc -n openshift-adp get pods -l deploy=velero,component=velero

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/velero.go` around lines 48 - 86, EnsureVeleroPodRunning currently does a one-shot check and hardcodes DefaultOADPNamespace and specific labels; change it to poll until success or a configurable timeout to tolerate rollouts and make namespace/labels configurable: update the function signature EnsureVeleroPodRunning(testCtx *internal.TestContext, namespace string, labels map[string]string, timeout time.Duration) (or accept nil/empty to fall back to DefaultOADPNamespace and the current labels), then inside use a loop (or context-aware polling) that repeatedly calls client.List on podList and checks for any pod where pod.Status.Phase == corev1.PodRunning && isPodReady(&pod) until timeout/context done, retaining the existing podStates aggregation for the final error; reference the existing symbols EnsureVeleroPodRunning, DefaultOADPNamespace, labels map, testCtx.MgmtClient, isPodReady, and podList when making these changes.
396-428: Minor: `DeleteOADPSchedule` duplicates the generic Velero getter logic.

You already have `getVeleroResource`; using it here would reduce boilerplate and keep error-handling consistent.

Possible simplification
func DeleteOADPSchedule(testCtx *internal.TestContext, scheduleName string) error { - - // Now delete the Schedule - schedule := &unstructured.Unstructured{} - schedule.SetGroupVersionKind(schema.GroupVersionKind{ - Group: "velero.io", - Version: "v1", - Kind: "Schedule", - }) - - err := testCtx.MgmtClient.Get(testCtx.Context, crclient.ObjectKey{ - Namespace: DefaultOADPNamespace, - Name: scheduleName, - }, schedule) + schedule, err := getVeleroResource(testCtx.Context, testCtx.MgmtClient, DefaultOADPNamespace, scheduleName, "Schedule") if err != nil { if apierrors.IsNotFound(err) { // Already deleted return nil } return fmt.Errorf("failed to get Schedule %s/%s: %w", DefaultOADPNamespace, scheduleName, err) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/velero.go` around lines 396 - 428, DeleteOADPSchedule duplicates the Velero resource fetching logic; replace the manual unstructured creation and Get call with a call to the existing helper getVeleroResource (use scheduleName and the GroupVersionKind for Schedule) to retrieve the resource, handle apierrors.IsNotFound returned by getVeleroResource the same way (return nil), then proceed to delete the returned unstructured with testCtx.MgmtClient.Delete; keep the same error wrapping messages on Get and Delete but remove the duplicated Get logic in DeleteOADPSchedule.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/v2/backuprestore/cli.go`:
- Around line 130-236: Each exported runner (RunOADPBackup, RunOADPRestore,
RunOADPSchedule) currently dereferences their options pointer without a nil
check and repeats artifactDir validation/creation and framework.Options setup;
add a nil guard at the start of each function (e.g., if backupOpts == nil {
return fmt.Errorf("backup options required") }) and mirror for
restoreOpts/scheduleOpts, then consolidate the artifactDir logic into a small
helper (e.g., validateAndPrepareArtifactDir(artifactDir) used by all three) so
you validate non-empty artifactDir, call os.MkdirAll, and return a
*framework.Options with ArtifactDir set to reuse across RunOADPBackup,
RunOADPRestore, and RunOADPSchedule.
- Around line 398-434: RunFixDrOidcIam currently allows both HostedCluster mode
and manual mode together and allows both credential modes simultaneously; update
RunFixDrOidcIam (and buildFixDrOidcIamArgs) to enforce the same
mutual-exclusivity and fullness checks as the CLI: require either both HCName
and HCNamespace (hosted-cluster mode) OR both InfraID and Region (manual mode),
but not both; require either AWSCredsFile OR (STSCredsFile and RoleARN), but not
both modes; perform these validations before calling buildFixDrOidcIamArgs and
return descriptive errors matching CLI validation (refer to RunFixDrOidcIam and
buildFixDrOidcIamArgs symbols), then set logPath using mode-specific identifiers
(hosted: HCNamespace+HCName, manual: Region+InfraID) to avoid empty names; also
ensure buildFixDrOidcIamArgs does not append flags from the opposite mode when a
mode is chosen.
In `@test/e2e/v2/backuprestore/velero.go`:
- Around line 271-303: The function isVeleroResourceInFinalState currently fails
the poll when getResourceFunc returns an error; change the error handling in
isVeleroResourceInFinalState so that if the error is a Kubernetes NotFound (use
apierrors.IsNotFound(err) from k8s.io/apimachinery/pkg/api/errors) you return
(false, nil) to treat it as transient and let wait.PollUntilContextTimeout
continue polling, otherwise return the wrapped error as before; keep the rest
(including unstructured.NestedString checks) unchanged.
In `@test/e2e/v2/internal/env_vars.go`:
- Around line 156-161: The default for AWS_GUEST_INFRA_CREDENTIALS_FILE should
not be derived from os.Getenv("HOME") because that can be unset and yield a
relative path; replace that logic by calling os.UserHomeDir() and, if it returns
a non-empty home path, set the default to filepath.Join(userHome, ".aws",
"credentials"), otherwise pass an empty string (or no default) to
RegisterEnvVarWithDefault so we don’t produce a relative path; update the code
around the RegisterEnvVarWithDefault call for AWS_GUEST_INFRA_CREDENTIALS_FILE
accordingly.
In `@test/e2e/v2/internal/workload_registry.go`:
- Around line 571-579: The wait failure currently returns only lastErr which can
hide the real polling error from wait.PollUntilContextTimeout (e.g., context
deadline exceeded) and lastErr may be nil; change the error handling so you
preserve the poll error (the err returned by wait.PollUntilContextTimeout) and
append the validation error from lastErr for context (referencing
wait.PollUntilContextTimeout and validateControlPlaneWorkloadsByType and the
lastErr variable) — e.g., return the poll error as the primary wrapped error and
include lastErr in the message when present so callers see both the wait failure
and any validation failure.
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 81-84: Guard against a nil testCtx returned by
internal.GetTestContext() before calling ValidateControlPlaneNamespace(): after
assigning testCtx := internal.GetTestContext(), check if testCtx == nil and call
AbortSuite with a clear message (e.g., "test context is nil") to fail fast, and
only then call testCtx.ValidateControlPlaneNamespace() and handle its error by
AbortSuite(err.Error()).
- Around line 249-253: The hard assertions inside getNodePool and the
surrounding Eventually block prevent retries; refactor getNodePool to return
(nodePool, error) instead of calling Expect/Fail internally, remove
Expect(nodePoolList.Items).NotTo(BeEmpty()) and
Expect(err).NotTo(HaveOccurred()) from getNodePool and return an appropriate
error when nothing is found or on failure, then update the Eventually callback
to use the Gomega passed in (func(g Gomega)) to assert on the returned values
(e.g., g.Expect(err).NotTo(HaveOccurred(); g.Expect(nodePool).NotTo(BeNil()))
and call internal.ValidateConditions(g, nodePool, machinesReadyCondition) so
transient failures are retried by Eventually with polling.
---
Nitpick comments:
In `@test/e2e/v2/backuprestore/cleanup_test.go`:
- Around line 43-47: Update the test case name string to reflect the expected
behavior: change the entry with name "wildcard verb does not match specific
verb" (in the test table used by the cleanup_test.go test) to something like
"wildcard verb matches specific verb" because the test's verbs {"*"} and verb
"get" expect true; ensure only the name field is updated so the test logic
(verbs/verb/expected) remains unchanged.
In `@test/e2e/v2/backuprestore/cleanup.go`:
- Around line 142-145: The code currently swallows partial discovery failures
from discoveryClient.ServerPreferredNamespacedResources() when
discovery.IsGroupDiscoveryFailedError(err) is true; change this to log the
partial discovery error (including err and context like which groups failed)
before continuing so cleanup diagnostics are available (refer to
discoveryClient.ServerPreferredNamespacedResources,
discovery.IsGroupDiscoveryFailedError, and variables apiResourceLists/err to
locate the check and insert a processLogger/klog/t.Logf call that records the
error details).
In `@test/e2e/v2/backuprestore/velero.go`:
- Around line 48-86: EnsureVeleroPodRunning currently does a one-shot check and
hardcodes DefaultOADPNamespace and specific labels; change it to poll until
success or a configurable timeout to tolerate rollouts and make namespace/labels
configurable: update the function signature EnsureVeleroPodRunning(testCtx
*internal.TestContext, namespace string, labels map[string]string, timeout
time.Duration) (or accept nil/empty to fall back to DefaultOADPNamespace and the
current labels), then inside use a loop (or context-aware polling) that
repeatedly calls client.List on podList and checks for any pod where
pod.Status.Phase == corev1.PodRunning && isPodReady(&pod) until timeout/context
done, retaining the existing podStates aggregation for the final error;
reference the existing symbols EnsureVeleroPodRunning, DefaultOADPNamespace,
labels map, testCtx.MgmtClient, isPodReady, and podList when making these
changes.
- Around line 396-428: DeleteOADPSchedule duplicates the Velero resource
fetching logic; replace the manual unstructured creation and Get call with a
call to the existing helper getVeleroResource (use scheduleName and the
GroupVersionKind for Schedule) to retrieve the resource, handle
apierrors.IsNotFound returned by getVeleroResource the same way (return nil),
then proceed to delete the returned unstructured with testCtx.MgmtClient.Delete;
keep the same error wrapping messages on Get and Delete but remove the
duplicated Get logic in DeleteOADPSchedule.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (19)
- .codespellignore
- Makefile
- cmd/oadp/backup.go
- cmd/oadp/restore.go
- cmd/oadp/schedule.go
- test/e2e/v2/backuprestore/README.md
- test/e2e/v2/backuprestore/cleanup.go
- test/e2e/v2/backuprestore/cleanup_test.go
- test/e2e/v2/backuprestore/cli.go
- test/e2e/v2/backuprestore/cli_test.go
- test/e2e/v2/backuprestore/prober.go
- test/e2e/v2/backuprestore/velero.go
- test/e2e/v2/internal/env_vars.go
- test/e2e/v2/internal/test_context.go
- test/e2e/v2/internal/validation.go
- test/e2e/v2/internal/workload_registry.go
- test/e2e/v2/tests/backup_restore_test.go
- test/e2e/v2/tests/control_plane_workloads_test.go
- test/e2e/v2/tests/suite_test.go
| // RunOADPBackup executes the "hypershift create oadp-backup" command | ||
| func RunOADPBackup(ctx context.Context, logger logr.Logger, artifactDir string, backupOpts *OADPBackupOptions) error { | ||
| if backupOpts.HCName == "" || backupOpts.HCNamespace == "" { | ||
| return fmt.Errorf("hc-name and hc-namespace are required") | ||
| } | ||
|
|
||
| hypershiftCLI, err := getHypershiftCLIPath() | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
||
| if artifactDir == "" { | ||
| return fmt.Errorf("artifact directory is required") | ||
| } | ||
|
|
||
| if err := os.MkdirAll(artifactDir, 0755); err != nil { | ||
| return fmt.Errorf("failed to create artifact directory: %w", err) | ||
| } | ||
|
|
||
| args := buildBackupArgs(backupOpts) | ||
| cmd := exec.CommandContext(ctx, hypershiftCLI, args...) | ||
|
|
||
| logPath := fmt.Sprintf("oadp-backup-%s-%s.log", backupOpts.HCNamespace, backupOpts.HCName) | ||
|
|
||
| // Create minimal framework options for RunCommand | ||
| opts := &framework.Options{ | ||
| ArtifactDir: artifactDir, | ||
| } | ||
|
|
||
| return framework.RunCommand(logger, opts, logPath, cmd) | ||
| } | ||
|
|
||
| // RunOADPRestore executes the "hypershift create oadp-restore" command | ||
| func RunOADPRestore(ctx context.Context, logger logr.Logger, artifactDir string, restoreOpts *OADPRestoreOptions) error { | ||
| if restoreOpts.HCName == "" || restoreOpts.HCNamespace == "" { | ||
| return fmt.Errorf("hc-name and hc-namespace are required") | ||
| } | ||
|
|
||
| if restoreOpts.FromBackup == "" && restoreOpts.FromSchedule == "" { | ||
| return fmt.Errorf("either from-backup or from-schedule is required") | ||
| } | ||
|
|
||
| if restoreOpts.FromBackup != "" && restoreOpts.FromSchedule != "" { | ||
| return fmt.Errorf("from-backup and from-schedule are mutually exclusive") | ||
| } | ||
|
|
||
| hypershiftCLI, err := getHypershiftCLIPath() | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
||
| if artifactDir == "" { | ||
| return fmt.Errorf("artifact directory is required") | ||
| } | ||
|
|
||
| if err := os.MkdirAll(artifactDir, 0755); err != nil { | ||
| return fmt.Errorf("failed to create artifact directory: %w", err) | ||
| } | ||
|
|
||
| args := buildRestoreArgs(restoreOpts) | ||
| cmd := exec.CommandContext(ctx, hypershiftCLI, args...) | ||
|
|
||
| logPath := fmt.Sprintf("oadp-restore-%s-%s.log", restoreOpts.HCNamespace, restoreOpts.HCName) | ||
|
|
||
| // Create minimal framework options for RunCommand | ||
| opts := &framework.Options{ | ||
| ArtifactDir: artifactDir, | ||
| } | ||
|
|
||
| return framework.RunCommand(logger, opts, logPath, cmd) | ||
| } | ||
|
|
||
| // RunOADPSchedule executes the "hypershift create oadp-schedule" command | ||
| func RunOADPSchedule(ctx context.Context, logger logr.Logger, artifactDir string, scheduleOpts *OADPScheduleOptions) error { | ||
| if scheduleOpts.HCName == "" || scheduleOpts.HCNamespace == "" { | ||
| return fmt.Errorf("hc-name and hc-namespace are required") | ||
| } | ||
|
|
||
| if scheduleOpts.Schedule == "" { | ||
| return fmt.Errorf("schedule is required") | ||
| } | ||
|
|
||
| hypershiftCLI, err := getHypershiftCLIPath() | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
||
| if artifactDir == "" { | ||
| return fmt.Errorf("artifact directory is required") | ||
| } | ||
|
|
||
| if err := os.MkdirAll(artifactDir, 0755); err != nil { | ||
| return fmt.Errorf("failed to create artifact directory: %w", err) | ||
| } | ||
|
|
||
| args := buildScheduleArgs(scheduleOpts) | ||
| cmd := exec.CommandContext(ctx, hypershiftCLI, args...) | ||
|
|
||
| logPath := fmt.Sprintf("oadp-schedule-%s-%s.log", scheduleOpts.HCNamespace, scheduleOpts.HCName) | ||
|
|
||
| // Create minimal framework options for RunCommand | ||
| opts := &framework.Options{ | ||
| ArtifactDir: artifactDir, | ||
| } | ||
|
|
||
| return framework.RunCommand(logger, opts, logPath, cmd) | ||
| } |
Guard against nil *Options to avoid panics + consider DRYing the artifactDir setup.
All three exported runners dereference backupOpts / restoreOpts / scheduleOpts without a nil check. Easy footgun if these helpers get reused outside current call sites. Also, the artifactDir validation/MkdirAll/framework.Options boilerplate is repeated.
Minimal nil-guard example (apply similarly to all runners)
func RunOADPBackup(ctx context.Context, logger logr.Logger, artifactDir string, backupOpts *OADPBackupOptions) error {
+ if backupOpts == nil {
+ return fmt.Errorf("backup options are required")
+ }
if backupOpts.HCName == "" || backupOpts.HCNamespace == "" {
return fmt.Errorf("hc-name and hc-namespace are required")
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/v2/backuprestore/cli.go` around lines 130 - 236, Each exported
runner (RunOADPBackup, RunOADPRestore, RunOADPSchedule) currently dereferences
their options pointer without a nil check and repeats artifactDir
validation/creation and framework.Options setup; add a nil guard at the start of
each function (e.g., if backupOpts == nil { return fmt.Errorf("backup options
required") }) and mirror for restoreOpts/scheduleOpts, then consolidate the
artifactDir logic into a small helper (e.g.,
validateAndPrepareArtifactDir(artifactDir) used by all three) so you validate
non-empty artifactDir, call os.MkdirAll, and return a *framework.Options with
ArtifactDir set to reuse across RunOADPBackup, RunOADPRestore, and
RunOADPSchedule.
b53820d to 0391508 — Compare
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e/v2/internal/workload_registry.go (1)
3-4: 🛠️ Refactor suggestion | 🟠 Major

Move new helper logic out of this generated file.

This file is marked generated ("Do not edit manually"). Functions added at Line 488–Line 594 can be lost on regeneration, causing silent breakage. Please keep generated data here and move runtime helpers (`ShouldSkipWorkloadForPlatform`, readiness validators/waiters) into a separate non-generated file (for example, `test/e2e/v2/internal/workload_validation.go`).

Also applies to: 488-594
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/internal/workload_registry.go` around lines 3 - 4, The generated file contains runtime helper logic that will be lost on regeneration; move the non-generated helpers (e.g., ShouldSkipWorkloadForPlatform and the readiness validator/waiter functions added near the end of the file) out of the generated file into a new non-generated source file in the same package (e.g., workload_validation.go), delete them from the generated file, and update any imports/visibility so callers still compile; ensure the helpers remain package-scoped (not generated) and run `go build`/tests to verify there are no missing references.
♻️ Duplicate comments (2)
test/e2e/v2/backuprestore/velero.go (1)
279-291: ⚠️ Potential issue | 🟠 Major

Handle `IsNotFound` at the `Get` call, not at `NestedString`.

`wait.PollUntilContextTimeout` will stop on the error returned at Line 282. A transient NotFound from `getResourceFunc` should return `(false, nil)` so polling continues.

🐛 Proposed fix
return func(ctx context.Context) (bool, error) { resource, err := getResourceFunc(ctx, client, namespace, name) if err != nil { + if apierrors.IsNotFound(err) { + return false, nil + } return false, fmt.Errorf("failed to get %s %s: %w", kind, name, err) } phase, found, err := unstructured.NestedString(resource.Object, "status", "phase") if err != nil { - if apierrors.IsNotFound(err) { - return false, nil - } return false, fmt.Errorf("failed to get %s phase: %w", kind, err) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/velero.go` around lines 279 - 291, The code must treat a NotFound returned by getResourceFunc as a transient condition so polling continues: in the closure returned by the function that calls getResourceFunc(ctx, client, namespace, name) (the block around getResourceFunc and unstructured.NestedString), change the error handling immediately after the getResourceFunc call to return (false, nil) when apierrors.IsNotFound(err); only treat other errors as fatal. Do not check apierrors.IsNotFound on the error returned from unstructured.NestedString; leave that error handling to return a wrapped error for unexpected failures. This ensures wait.PollUntilContextTimeout keeps polling on transient GET 404s.test/e2e/v2/backuprestore/cli.go (1)
131-134: ⚠️ Potential issue | 🟠 Major

Add nil guards for `*Options` in all exported `Run*` wrappers.

Each function dereferences its options pointer before checking it. Passing `nil` will panic.

🐛 Proposed fix
func RunOADPBackup(ctx context.Context, logger logr.Logger, artifactDir string, backupOpts *OADPBackupOptions) error { + if backupOpts == nil { + return fmt.Errorf("backup options are required") + } if backupOpts.HCName == "" || backupOpts.HCNamespace == "" { return fmt.Errorf("hc-name and hc-namespace are required") } @@ func RunOADPRestore(ctx context.Context, logger logr.Logger, artifactDir string, restoreOpts *OADPRestoreOptions) error { + if restoreOpts == nil { + return fmt.Errorf("restore options are required") + } if restoreOpts.HCName == "" || restoreOpts.HCNamespace == "" { @@ func RunOADPSchedule(ctx context.Context, logger logr.Logger, artifactDir string, scheduleOpts *OADPScheduleOptions) error { + if scheduleOpts == nil { + return fmt.Errorf("schedule options are required") + } if scheduleOpts.HCName == "" || scheduleOpts.HCNamespace == "" { @@ func RunFixDrOidcIam(ctx context.Context, logger logr.Logger, artifactDir string, fixOpts *FixDrOidcIamOptions) error { + if fixOpts == nil { + return fmt.Errorf("fix dr-oidc-iam options are required") + } // Validate mode...Also applies to: 164-166, 204-206, 399-401
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/cli.go` around lines 131 - 134, Add a nil-check for the options pointer before any dereference in each exported Run* wrapper: e.g., in RunOADPBackup validate that backupOpts != nil and return a descriptive error (such as "backup options cannot be nil") before accessing backupOpts.HCName/HCNamespace; do the same for the other exported Run* functions that take pointer options (for example RunOADPRestore and any other Run* wrappers using *Options types) so none will panic when passed nil.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 106-109: getNodePool currently returns only the first matching
NodePool which hides failures; change the test to retrieve all NodePools (either
by using or adding a helper like getNodePools that returns []NodePool) and then
iterate over the slice, asserting err == nil and each nodePool != nil, and call
internal.ValidateConditions(NewGomegaWithT(GinkgoT()), nodePool,
expectedConditions) for each element; apply the same change wherever getNodePool
is used for validation (the other validation sites) so every NodePool for the
HostedCluster is checked.
---
Outside diff comments:
In `@test/e2e/v2/internal/workload_registry.go`:
- Around line 3-4: The generated file contains runtime helper logic that will be
lost on regeneration; move the non-generated helpers (e.g.,
ShouldSkipWorkloadForPlatform and the readiness validator/waiter functions added
near the end of the file) out of the generated file into a new non-generated
source file in the same package (e.g., workload_validation.go), delete them from
the generated file, and update any imports/visibility so callers still compile;
ensure the helpers remain package-scoped (not generated) and run `go
build`/tests to verify there are no missing references.
---
Duplicate comments:
In `@test/e2e/v2/backuprestore/cli.go`:
- Around line 131-134: Add a nil-check for the options pointer before any
dereference in each exported Run* wrapper: e.g., in RunOADPBackup validate that
backupOpts != nil and return a descriptive error (such as "backup options cannot
be nil") before accessing backupOpts.HCName/HCNamespace; do the same for the
other exported Run* functions that take pointer options (for example
RunOADPRestore and any other Run* wrappers using *Options types) so none will
panic when passed nil.
In `@test/e2e/v2/backuprestore/velero.go`:
- Around line 279-291: The code must treat a NotFound returned by
getResourceFunc as a transient condition so polling continues: in the closure
returned by the function that calls getResourceFunc(ctx, client, namespace,
name) (the block around getResourceFunc and unstructured.NestedString), change
the error handling immediately after the getResourceFunc call to return (false,
nil) when apierrors.IsNotFound(err); only treat other errors as fatal. Do not
check apierrors.IsNotFound on the error returned from unstructured.NestedString;
leave that error handling to return a wrapped error for unexpected failures.
This ensures wait.PollUntilContextTimeout keeps polling on transient GET 404s.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (7)
- test/e2e/v2/backuprestore/cli.go
- test/e2e/v2/backuprestore/cli_test.go
- test/e2e/v2/backuprestore/velero.go
- test/e2e/v2/internal/env_vars.go
- test/e2e/v2/internal/validation.go
- test/e2e/v2/internal/workload_registry.go
- test/e2e/v2/tests/backup_restore_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/v2/internal/env_vars.go
♻️ Duplicate comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)
280-295: ⚠️ Potential issue | 🟠 Major

Validate all NodePools for the HostedCluster, not only the first match.
Line 292 returns the first matching NodePool, so failures in additional NodePools are never validated (impacts the checks around Line 100 and Line 254), which can yield false-positive test passes.
🐛 Proposed fix
-func getNodePool(testCtx *internal.TestContext) (*hyperv1.NodePool, error) { +func getNodePools(testCtx *internal.TestContext) ([]*hyperv1.NodePool, error) { nodePoolList := &hyperv1.NodePoolList{} err := testCtx.MgmtClient.List(testCtx.Context, nodePoolList, crclient.InNamespace(testCtx.ClusterNamespace)) if err != nil { return nil, err } if len(nodePoolList.Items) == 0 { return nil, fmt.Errorf("no NodePools found in namespace %s", testCtx.ClusterNamespace) } + var out []*hyperv1.NodePool for i := range nodePoolList.Items { if nodePoolList.Items[i].Spec.ClusterName == testCtx.ClusterName { - return &nodePoolList.Items[i], nil + out = append(out, &nodePoolList.Items[i]) } } - return nil, fmt.Errorf("no NodePool found for cluster %s", testCtx.ClusterName) + if len(out) == 0 { + return nil, fmt.Errorf("no NodePool found for cluster %s", testCtx.ClusterName) + } + return out, nil }- nodePool, err := getNodePool(testCtx) + nodePools, err := getNodePools(testCtx) Expect(err).NotTo(HaveOccurred()) - Expect(nodePool).NotTo(BeNil()) - var expectedConditions []util.Condition - for conditionType, conditionStatus := range conditions.ExpectedNodePoolConditions(nodePool) { - expectedConditions = append(expectedConditions, util.Condition{ - Type: conditionType, - Status: metav1.ConditionStatus(conditionStatus), - }) - } - internal.ValidateConditions(NewWithT(GinkgoT()), nodePool, expectedConditions) + for _, nodePool := range nodePools { + var expectedConditions []util.Condition + for conditionType, conditionStatus := range conditions.ExpectedNodePoolConditions(nodePool) { + expectedConditions = append(expectedConditions, util.Condition{ + Type: conditionType, + Status: metav1.ConditionStatus(conditionStatus), + }) + } + internal.ValidateConditions(NewWithT(GinkgoT()), nodePool, expectedConditions) + }- Eventually(func(g Gomega) { - nodePool, err := getNodePool(testCtx) + Eventually(func(g Gomega) { + nodePools, err := getNodePools(testCtx) g.Expect(err).NotTo(HaveOccurred()) - g.Expect(nodePool).NotTo(BeNil()) - conditions := conditions.ExpectedNodePoolConditions(nodePool) - // See https://issues.redhat.com/browse/OCPBUGS-77621 for why we delete the conditions - delete(conditions, hyperv1.NodePoolReachedIgnitionEndpoint) - delete(conditions, hyperv1.NodePoolAutorepairEnabledConditionType) - var expectedConditions []util.Condition - for conditionType, conditionStatus := range conditions { - expectedConditions = append(expectedConditions, util.Condition{ - Type: conditionType, - Status: metav1.ConditionStatus(conditionStatus), - }) + g.Expect(nodePools).NotTo(BeEmpty()) + for _, nodePool := range nodePools { + expectedByType := conditions.ExpectedNodePoolConditions(nodePool) + // See https://issues.redhat.com/browse/OCPBUGS-77621 for why we delete the conditions + delete(expectedByType, hyperv1.NodePoolReachedIgnitionEndpoint) + delete(expectedByType, hyperv1.NodePoolAutorepairEnabledConditionType) + var expectedConditions []util.Condition + for conditionType, conditionStatus := range expectedByType { + expectedConditions = append(expectedConditions, util.Condition{ + Type: conditionType, + Status: metav1.ConditionStatus(conditionStatus), + }) + } + internal.ValidateConditions(g, nodePool, expectedConditions) } - internal.ValidateConditions(g, nodePool, expectedConditions) }).WithPolling(backuprestore.PollInterval).WithTimeout(backuprestore.OIDCTimeout).Should(Succeed())🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/tests/backup_restore_test.go` around lines 280 - 295, The getNodePool helper currently returns only the first NodePool matching the HostedCluster, causing other NodePools to be ignored; change getNodePool to return all matching NodePools (e.g., rename to getNodePools and return []*hyperv1.NodePool) and update its callers in backup_restore_test.go to iterate over and validate each returned NodePool (so checks that currently assume a single NodePool run for every NodePool). Ensure the helper still errors if none are found and update any type/signature usages accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 280-295: The getNodePool helper currently returns only the first
NodePool matching the HostedCluster, causing other NodePools to be ignored;
change getNodePool to return all matching NodePools (e.g., rename to
getNodePools and return []*hyperv1.NodePool) and update its callers in
backup_restore_test.go to iterate over and validate each returned NodePool (so
checks that currently assume a single NodePool run for every NodePool). Ensure
the helper still errors if none are found and update any type/signature usages
accordingly.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
test/e2e/v2/tests/backup_restore_test.go
ae4081e to 153e561 — Compare
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/v2/tests/backup_restore_test.go (1)
231-238: Fail fast when AWS credentials file env var is unset.

Line 231 reads `AWS_GUEST_INFRA_CREDENTIALS_FILE` but does not validate it before invoking `RunFixDrOidcIam`. Add an explicit non-empty assertion for clearer failure diagnostics.

Proposed diff
By("Fixing OIDC identity provider after restore") awsCredsFile := internal.GetEnvVarValue("AWS_GUEST_INFRA_CREDENTIALS_FILE") + Expect(awsCredsFile).NotTo(BeEmpty(), "AWS_GUEST_INFRA_CREDENTIALS_FILE must be set") fixOpts := &backuprestore.FixDrOidcIamOptions{ HCName: testCtx.ClusterName, HCNamespace: testCtx.ClusterNamespace, AWSCredsFile: awsCredsFile, Timeout: backuprestore.OIDCTimeout, }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/tests/backup_restore_test.go` around lines 231 - 238, The test currently reads AWS_GUEST_INFRA_CREDENTIALS_FILE into awsCredsFile and proceeds to call backuprestore.RunFixDrOidcIam without validating it; add a guard that asserts awsCredsFile is non-empty (e.g., using the test framework's fail/expect helper) and fail fast with a clear message if the env var is unset before constructing backuprestore.FixDrOidcIamOptions and calling RunFixDrOidcIam; reference the awsCredsFile variable, the AWS_GUEST_INFRA_CREDENTIALS_FILE env var, and the RunFixDrOidcIam/backuprestore.FixDrOidcIamOptions usage so the check is colocated immediately prior to creating fixOpts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/v2/backuprestore/cleanup.go`:
- Around line 61-74: The HostedCluster namespace deletion is not being awaited:
currently deleteHostedClusterNamespace(testCtx, false) returns immediately which
can leave the namespace in Terminating and race restores; change the flow to
wait for the HostedCluster namespace to be fully deleted before proceeding
(either call deleteHostedClusterNamespace(testCtx, true) or add an explicit
wait-after-delete step) so that when removeNamespaceObjectFinalizers and
deleteControlPlaneNamespace run the HostedCluster namespace is gone; update
references in this block (deleteHostedClusterNamespace,
removeNamespaceObjectFinalizers, deleteControlPlaneNamespace) accordingly.
---
Nitpick comments:
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 231-238: The test currently reads AWS_GUEST_INFRA_CREDENTIALS_FILE
into awsCredsFile and proceeds to call backuprestore.RunFixDrOidcIam without
validating it; add a guard that asserts awsCredsFile is non-empty (e.g., using
the test framework's fail/expect helper) and fail fast with a clear message if
the env var is unset before constructing backuprestore.FixDrOidcIamOptions and
calling RunFixDrOidcIam; reference the awsCredsFile variable, the
AWS_GUEST_INFRA_CREDENTIALS_FILE env var, and the
RunFixDrOidcIam/backuprestore.FixDrOidcIamOptions usage so the check is
colocated immediately prior to creating fixOpts.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (2)
- test/e2e/v2/backuprestore/cleanup.go
- test/e2e/v2/tests/backup_restore_test.go
153e561 to f2f6dbc — Compare
Actionable comments posted: 3
♻️ Duplicate comments (2)
test/e2e/v2/backuprestore/velero.go (1)
280-290: ⚠️ Potential issue | 🟠 Major

Handle `NotFound` from resource fetch as transient in polling.

`NotFound` is currently checked on `NestedString` errors, but the transient case happens on `getResourceFunc(...)`. This still causes premature poll failure.

🔧 Proposed fix
return func(ctx context.Context) (bool, error) { resource, err := getResourceFunc(ctx, client, namespace, name) if err != nil { + if apierrors.IsNotFound(err) { + return false, nil + } return false, fmt.Errorf("failed to get %s %s: %w", kind, name, err) } phase, found, err := unstructured.NestedString(resource.Object, "status", "phase") if err != nil { - if apierrors.IsNotFound(err) { - return false, nil - } return false, fmt.Errorf("failed to get %s phase: %w", kind, err) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/velero.go` around lines 280 - 290, The getResourceFunc(...) error should treat apierrors.IsNotFound as a transient polling condition; after calling getResourceFunc(ctx, client, namespace, name) check if err != nil and if apierrors.IsNotFound(err) then return false, nil instead of failing the poll, otherwise return the formatted error as now. Keep the existing handling of unstructured.NestedString(resource.Object, "status", "phase") unchanged for its own errors.test/e2e/v2/backuprestore/cli.go (1)
131-134: ⚠️ Potential issue | 🟠 Major

Add nil guards for options in exported runners to prevent panics.

Each runner dereferences `*Options` immediately; passing `nil` causes a panic instead of a controlled error path.

🔧 Proposed fix
func RunOADPBackup(ctx context.Context, logger logr.Logger, artifactDir string, backupOpts *OADPBackupOptions) error { + if backupOpts == nil { + return fmt.Errorf("backup options are required") + } if backupOpts.HCName == "" || backupOpts.HCNamespace == "" { return fmt.Errorf("hc-name and hc-namespace are required") } @@ func RunOADPRestore(ctx context.Context, logger logr.Logger, artifactDir string, restoreOpts *OADPRestoreOptions) error { + if restoreOpts == nil { + return fmt.Errorf("restore options are required") + } if restoreOpts.HCName == "" || restoreOpts.HCNamespace == "" { return fmt.Errorf("hc-name and hc-namespace are required") } @@ func RunOADPSchedule(ctx context.Context, logger logr.Logger, artifactDir string, scheduleOpts *OADPScheduleOptions) error { + if scheduleOpts == nil { + return fmt.Errorf("schedule options are required") + } if scheduleOpts.HCName == "" || scheduleOpts.HCNamespace == "" { return fmt.Errorf("hc-name and hc-namespace are required") } @@ func RunFixDrOidcIam(ctx context.Context, logger logr.Logger, artifactDir string, fixOpts *FixDrOidcIamOptions) error { + if fixOpts == nil { + return fmt.Errorf("fix dr-oidc-iam options are required") + } // Validate mode: either hosted-cluster mode or manual mode, but not both if fixOpts.HCName != "" {Also applies to: 163-166, 203-206, 399-403
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/v2/backuprestore/cli.go` around lines 131 - 134, The exported runner functions (e.g., RunOADPBackup) currently dereference their options pointer immediately and will panic if passed nil; add a nil-check at the start of each exported runner (RunOADPBackup and the other runners referenced) that returns a clear error like "options are required" when the options pointer is nil, and only proceed to dereference fields (e.g., backupOpts.HCName / HCNamespace) after the nil-guard to prevent panics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/fix/dr_oidc_iam.go`:
- Around line 617-618: The code currently formats restartTime from
o.RestartDelay which can be negative; add a check in the validate() function to
reject negative durations by returning an error when o.RestartDelay < 0
(reference the o.RestartDelay field and the validate() method for the options
struct used to produce restartTime), e.g., validate that o.RestartDelay is
non-negative and surface a clear error message so callers cannot schedule
restarts in the past.
In `@test/e2e/v2/internal/validation.go`:
- Around line 56-60: The second assertion calls
expectedCondition.Matches(actualCondition) even when the condition was not found
(found == false), causing comparisons against a zero-value actualCondition;
guard the second assertion by checking the found flag (the same variable used in
the first Expect) and only perform the Matches/G.Expect check when found is
true, referencing expectedCondition, actualCondition, Matches and the existing
g.Expect call so the noisy/misleading diagnostics are avoided.
In `@test/e2e/v2/tests/backup_restore_test.go`:
- Around line 87-90: The one-shot call to
backuprestore.EnsureVeleroPodRunning(testCtx) should be made resilient by
retrying before failing: replace the direct call with a retrying preflight
(either add a new helper like backuprestore.EnsureVeleroPodRunningWithRetry(ctx,
attempts, interval) or wrap EnsureVeleroPodRunning in a simple retry loop) that
re-invokes EnsureVeleroPodRunning using the same testCtx with a small interval
(e.g., a few seconds) for a configurable number of attempts, and only call
Fail(fmt.Sprintf("Velero is not running: %v", err)) after retries are exhausted
so transient pod restarts do not flake the suite.
---
Duplicate comments:
In `@test/e2e/v2/backuprestore/cli.go`:
- Around line 131-134: The exported runner functions (e.g., RunOADPBackup)
currently dereference their options pointer immediately and will panic if passed
nil; add a nil-check at the start of each exported runner (RunOADPBackup and the
other runners referenced) that returns a clear error like "options are required"
when the options pointer is nil, and only proceed to dereference fields (e.g.,
backupOpts.HCName / HCNamespace) after the nil-guard to prevent panics.
In `@test/e2e/v2/backuprestore/velero.go`:
- Around line 280-290: The getResourceFunc(...) error should treat
apierrors.IsNotFound as a transient polling condition; after calling
getResourceFunc(ctx, client, namespace, name) check if err != nil and if
apierrors.IsNotFound(err) then return false, nil instead of failing the poll,
otherwise return the formatted error as now. Keep the existing handling of
unstructured.NestedString(resource.Object, "status", "phase") unchanged for its
own errors.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (9)
- cmd/fix/dr_oidc_iam.go
- test/e2e/v2/backuprestore/cleanup.go
- test/e2e/v2/backuprestore/cli.go
- test/e2e/v2/backuprestore/cli_test.go
- test/e2e/v2/backuprestore/velero.go
- test/e2e/v2/internal/env_vars.go
- test/e2e/v2/internal/validation.go
- test/e2e/v2/internal/workload_registry.go
- test/e2e/v2/tests/backup_restore_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/v2/internal/env_vars.go
restartTime := time.Now().UTC().Add(o.RestartDelay).Format(time.RFC3339)
fmt.Printf("  - Scheduled rolling restart at %s UTC (in %s)\n", restartTime, o.RestartDelay)
Validate --restart-delay to prevent scheduling restarts in the past.
On Line 617, a negative duration will create a past timestamp and effectively trigger immediate restart behavior. Please reject negative values in validate().
Suggested fix
func (o *DrOidcIamOptions) validate() error {
+ if o.RestartDelay < 0 {
+ return fmt.Errorf("--restart-delay must be greater than or equal to 0")
+ }
+
if err := o.validateCredentials(); err != nil {
return err
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/fix/dr_oidc_iam.go` around lines 617 - 618, The code currently formats
restartTime from o.RestartDelay which can be negative; add a check in the
validate() function to reject negative durations by returning an error when
o.RestartDelay < 0 (reference the o.RestartDelay field and the validate() method
for the options struct used to produce restartTime), e.g., validate that
o.RestartDelay is non-negative and surface a clear error message so callers
cannot schedule restarts in the past.
g.Expect(found).To(gomega.BeTrue(),
	fmt.Sprintf("condition %s not found\n\t%s", expectedCondition.Type, printConditions()))
g.Expect(expectedCondition.Matches(actualCondition)).To(gomega.BeTrue(),
	fmt.Sprintf("incorrect condition: wanted %s, got %s\n\t%s",
		expectedCondition.String(), actualCondition.String(), printConditions()))
Guard the second assertion when the condition type is missing.
When the condition is not found, the next Matches assertion compares against a zero-value condition and can emit noisy/misleading diagnostics.
🔧 Proposed fix
g.Expect(found).To(gomega.BeTrue(),
fmt.Sprintf("condition %s not found\n\t%s", expectedCondition.Type, printConditions()))
+ if !found {
+ continue
+ }
g.Expect(expectedCondition.Matches(actualCondition)).To(gomega.BeTrue(),
fmt.Sprintf("incorrect condition: wanted %s, got %s\n\t%s",
expectedCondition.String(), actualCondition.String(), printConditions()))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/v2/internal/validation.go` around lines 56 - 60, The second
assertion calls expectedCondition.Matches(actualCondition) even when the
condition was not found (found == false), causing comparisons against a
zero-value actualCondition; guard the second assertion by checking the found
flag (the same variable used in the first Expect) and only perform the
Matches/G.Expect check when found is true, referencing expectedCondition,
actualCondition, Matches and the existing g.Expect call so the noisy/misleading
diagnostics are avoided.
err := backuprestore.EnsureVeleroPodRunning(testCtx)
if err != nil {
	Fail(fmt.Sprintf("Velero is not running: %v", err))
}
Use a retrying preflight for Velero readiness to reduce test flakiness.
The one-shot readiness check can fail the suite on transient pod restarts even though the system would recover moments later.
🔧 Proposed fix
- err := backuprestore.EnsureVeleroPodRunning(testCtx)
- if err != nil {
- Fail(fmt.Sprintf("Velero is not running: %v", err))
- }
+ Eventually(func() error {
+ return backuprestore.EnsureVeleroPodRunning(testCtx)
+ }).WithPolling(backuprestore.PollInterval).
+ WithTimeout(2 * time.Minute).
+ Should(Succeed(), "Velero should be running before backup/restore tests")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/v2/tests/backup_restore_test.go` around lines 87 - 90, The one-shot
call to backuprestore.EnsureVeleroPodRunning(testCtx) should be made resilient
by retrying before failing: replace the direct call with a retrying preflight
(either add a new helper like backuprestore.EnsureVeleroPodRunningWithRetry(ctx,
attempts, interval) or wrap EnsureVeleroPodRunning in a simple retry loop) that
re-invokes EnsureVeleroPodRunning using the same testCtx with a small interval
(e.g., a few seconds) for a configurable number of attempts, and only call
Fail(fmt.Sprintf("Velero is not running: %v", err)) after retries are exhausted
so transient pod restarts do not flake the suite.
jparrill left a comment:

Hey Martin, I've sent some comments to the dependent PR and I lose track of what I've reviewed or not. For now I just put a comment in the current PR.

/retest-required
f2f6dbc to 46a0510 — Compare
Add RunFixDrOidcIam CLI wrapper, ValidateConditions helper, and integrate both into the backup/restore e2e test with unit tests. OCPBUGS-60185
…annotation

Replace Update with MergeFrom+Patch in updateHostedClusterRestartAnnotation to avoid "object has been modified" conflict errors when other controllers concurrently modify the HostedCluster resource.

Related: OCPBUGS-60185
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
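For context, the MergeFrom+Patch pattern referenced in that commit message typically looks like the controller-runtime sketch below. This is an illustration only, assuming a client.Client and a HostedCluster; the annotation key, function shape, and import path are placeholders, not the exact code from this PR.

```go
package patchsketch

import (
	"context"

	// Import path may differ by hypershift version; shown here for illustration.
	hyperv1 "github.com/openshift/hypershift/api/hypershift/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setRestartAnnotation shows the MergeFrom+Patch idiom: capture the original
// object, mutate the live copy, then send only the diff as a merge patch.
// Unlike a full Update, a plain merge patch does not carry resourceVersion
// for optimistic locking, so concurrent writers to other fields do not
// produce "the object has been modified" conflicts.
func setRestartAnnotation(ctx context.Context, c client.Client, hc *hyperv1.HostedCluster, restartTime string) error {
	original := hc.DeepCopy()

	if hc.Annotations == nil {
		hc.Annotations = map[string]string{}
	}
	// Illustrative annotation key; the real key used by the CLI may differ.
	hc.Annotations["hypershift.openshift.io/restart-date"] = restartTime

	return c.Patch(ctx, hc, client.MergeFrom(original))
}
```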
46a0510 to c6f30d4 — Compare
/approve

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jparrill, mgencur. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.

/lgtm

Scheduling tests matching the

/verified by e2e

@jparrill: This PR has been marked as verified. In response to this: /verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

/retest
2 similar comments

/retest

/retest

/test e2e-azure-self-managed

@mgencur: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

/jira refresh

2 similar comments

/jira refresh

/jira refresh

@mgencur: This pull request explicitly references no jira issue. In response to this: /jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
What this PR does / why we need it:
Add RunFixDrOidcIam CLI wrapper, ValidateConditions helper, and
integrate both into the backup/restore e2e test with unit tests.
Which issue(s) this PR fixes:
Related to https://issues.redhat.com/browse/CNTRLPLANE-2031
Implements an E2E test for #7210
Special notes for your reviewer:
Checklist:
Note
Medium Risk
Expands the backup/restore E2E flow to run a DR OIDC/IAM remediation command and to assert NodePool condition stability across restore, which can introduce new test flake/failure modes and requires AWS credentials. Also changes the `fix dr-oidc-iam` CLI to patch (not update) HostedCluster annotations, slightly altering how it writes cluster state.

Overview

Adds a new backup/restore E2E post-restore flow that runs `hypershift fix dr-oidc-iam` (using a new `RunFixDrOidcIam` wrapper) and validates that `NodePool` conditions observed pre-backup are restored to the same state after recovery.

Introduces shared helpers to support this: `internal.ValidateConditions`, a defaulted `AWS_GUEST_INFRA_CREDENTIALS_FILE` env var, improved control-plane workload wait error reporting, and a cleanup tweak to also wait for hosted cluster namespace deletion. Minor fixes include switching `dr-oidc-iam` HostedCluster updates to `Patch`, updating backup/restore docs for ADP 1.5+ plugin configuration, and correcting a path typo in `Dockerfile.e2e`.

Written by Cursor Bugbot for commit c6f30d4. This will update automatically on new commits. Configure here.
Summary by CodeRabbit
New Features
Documentation
Tests