WIP 🐛 Ensures Helm-based extensions keep running when their catalog is deleted or unavailable #2442
Conversation
These tests verify what should happen when catalogs are deleted
while Helm extensions are running.
4 Test Scenarios Added:
Scenario 1: Extension continues running after catalog deletion
Given: Helm extension is installed and running
When: Catalog is deleted
Then: Extension keeps running, resources stay available
Scenario 2: Resources are restored after catalog deletion
Given: Helm extension is installed
When: Catalog is deleted AND someone deletes a resource
Then: Resource is automatically restored
Scenario 3: Config changes work without catalog
Given: Helm extension is installed
When: Catalog is deleted AND user changes preflight config
Then: Config change is accepted
Scenario 4: Version upgrade blocked without catalog
Given: Helm extension v1.0.0 is installed
When: Catalog is deleted AND user requests upgrade to v1.0.1
Then: Upgrade is blocked (Retrying status)
And: Extension stays on v1.0.0
Test Infrastructure Added:
- CatalogIsDeleted() step function
- Proper timeout and cleanup handling
Note: These tests will FAIL until the fix is applied.
This commit captures the desired behavior first.
Pull request overview
This PR implements catalog deletion resilience for Helm-based ClusterExtensions, allowing installed extensions to continue functioning even when their source catalog is deleted. The implementation adds logic to detect catalog deletion and fall back to the currently installed bundle content, while still blocking version upgrades when catalogs are unavailable.
Changes:
- Added catalog deletion detection logic in the ResolveBundle reconcile step
- Implemented fallback to installed bundle when catalogs are deleted and no version change is requested
- Added nil contentFS handling in Helm applier to reconcile existing releases without catalog access (see the sketch after this list)
- Added E2E tests to verify workload resilience during catalog deletion
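
A minimal sketch of that nil contentFS fallback in the Helm applier. Apart from `reconcileExistingRelease()`, the type names, signature, and helper methods below are illustrative assumptions rather than the PR's actual code:

```go
package applier

import (
	"context"
	"io/fs"
)

// Stand-in types for illustration only; the real operator-controller types differ.
type ClusterExtension struct{ Name string }

type Helm struct{}

// Apply falls back to the already-installed release when contentFS is nil,
// i.e. when the catalog (and therefore the bundle content) is unavailable
// and no version change was requested.
func (h *Helm) Apply(ctx context.Context, contentFS fs.FS, ext *ClusterExtension) error {
	if contentFS == nil {
		// No catalog content: keep the existing release reconciled instead of failing.
		return h.reconcileExistingRelease(ctx, ext)
	}
	// Normal path: render and apply the chart from the unpacked bundle content.
	return h.applyFromContent(ctx, contentFS, ext)
}

func (h *Helm) reconcileExistingRelease(ctx context.Context, ext *ClusterExtension) error {
	// In the PR this re-applies the stored release via Server-Side Apply and
	// re-establishes watches; elided here.
	return nil
}

func (h *Helm) applyFromContent(ctx context.Context, contentFS fs.FS, ext *ClusterExtension) error {
	// Chart rendering and apply elided.
	return nil
}
```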
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/operator-controller/controllers/clusterextension_reconcile_steps.go | Core logic for handling resolution errors, checking catalog existence, and skipping bundle unpacking when using installed content |
| internal/operator-controller/applier/helm.go | Added reconcileExistingRelease() to maintain Helm releases without catalog access |
| cmd/operator-controller/main.go | Updated ResolveBundle calls to pass client for catalog existence checks |
| internal/operator-controller/controllers/suite_test.go | Updated test setup to pass client to ResolveBundle |
| internal/operator-controller/controllers/clusterextension_controller_test.go | Added comprehensive unit tests for fallback scenarios |
| internal/operator-controller/controllers/clusterextension_controller.go | Added nolint comment for ensureAllConditionsWithReason |
| test/e2e/features/catalog-deletion-resilience.feature | E2E test scenarios for catalog deletion resilience |
| test/e2e/steps/steps.go | CatalogIsDeleted step implementation for E2E tests |
    // imageFS will remain nil; applier implementations MUST handle nil imageFS by using
    // existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
    // for nil contentFS handling.
Copilot AI · Jan 12, 2026
The comment states that Boxcutter.apply() handles nil contentFS, but the current implementation of Boxcutter.Apply does not check for nil contentFS. When contentFS is nil, Boxcutter.apply() will call RevisionGenerator.GenerateRevision(ctx, contentFS, ...) which will fail when it tries to read from the nil filesystem.
Only Helm.Apply has been updated to handle nil contentFS by checking if contentFS is nil and calling reconcileExistingRelease. The Boxcutter applier needs similar handling or this code path will fail for boxcutter-based installations during catalog deletion scenarios.
Suggested change:

    - // imageFS will remain nil; applier implementations MUST handle nil imageFS by using
    - // existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
    - // for nil contentFS handling.
    + // imageFS will remain nil; appliers that support a nil imageFS must fall back to
    + // existing installed content (see Helm.reconcileExistingRelease() for an example).
    + // Other applier implementations must not assume imageFS is non-nil in this path.
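
For illustration only, a hedged sketch of the kind of guard the reviewer is asking for in the Boxcutter applier; the type, signature, and error message are assumptions, and the real fix might instead implement a full installed-revision fallback like the Helm applier does:

```go
package applier

import (
	"context"
	"fmt"
	"io/fs"
)

// Stand-in type for illustration only; not the repository's definition.
type Boxcutter struct{}

// Apply guards against a nil contentFS so a nil filesystem never reaches
// revision generation. A fuller fix could fall back to the installed
// revision, mirroring Helm.reconcileExistingRelease().
func (b *Boxcutter) Apply(ctx context.Context, contentFS fs.FS, extName string) error {
	if contentFS == nil {
		return fmt.Errorf("bundle content for %s is unavailable; cannot generate a new revision", extName)
	}
	// Normal revision-generation path elided.
	return nil
}
```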
    if state.revisionStates != nil &&
        state.revisionStates.Installed != nil &&
        state.resolvedRevisionMetadata.Name == state.revisionStates.Installed.Name &&
        state.resolvedRevisionMetadata.Version == state.revisionStates.Installed.Version {
        l.Info("skipping unpack - using installed bundle content")
        // imageFS will remain nil; applier implementations MUST handle nil imageFS by using
        // existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
        // for nil contentFS handling.
        return nil, nil
    }
Copilot AI · Jan 12, 2026
Potential nil pointer dereference. The code accesses state.resolvedRevisionMetadata.Name and state.resolvedRevisionMetadata.Version without first checking if state.resolvedRevisionMetadata is non-nil. While this may be safe in the fallback path (line 200 sets it), it's risky if UnpackBundle is called in other contexts or if the logic changes. Consider adding a nil check: if state.resolvedRevisionMetadata != nil && state.revisionStates != nil && ...
Suggested change:

    - if state.revisionStates != nil &&
    -     state.revisionStates.Installed != nil &&
    -     state.resolvedRevisionMetadata.Name == state.revisionStates.Installed.Name &&
    -     state.resolvedRevisionMetadata.Version == state.revisionStates.Installed.Version {
    -     l.Info("skipping unpack - using installed bundle content")
    -     // imageFS will remain nil; applier implementations MUST handle nil imageFS by using
    -     // existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
    -     // for nil contentFS handling.
    -     return nil, nil
    - }
    + if state.revisionStates != nil && state.revisionStates.Installed != nil {
    +     if state.resolvedRevisionMetadata != nil &&
    +         state.resolvedRevisionMetadata.Name == state.revisionStates.Installed.Name &&
    +         state.resolvedRevisionMetadata.Version == state.revisionStates.Installed.Version {
    +         l.Info("skipping unpack - using installed bundle content")
    +         // imageFS will remain nil; applier implementations MUST handle nil imageFS by using
    +         // existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
    +         // for nil contentFS handling.
    +         return nil, nil
    +     }
    +     if state.resolvedRevisionMetadata == nil {
    +         // We don't have resolved metadata, but we do have an installed revision.
    +         // Rely on the existing installed content and skip unpacking.
    +         l.Info("skipping unpack - using installed bundle content (no resolved metadata)")
    +         return nil, nil
    +     }
    + }
    + if state.resolvedRevisionMetadata == nil {
    +     return nil, fmt.Errorf("resolved revision metadata is nil; cannot unpack bundle")
    + }
    deadline := time.Now().Add(timeout)
    for {
        select {
        case <-ctx.Done():
            // Return the context error if we timed out or were cancelled while waiting.
            return ctx.Err()
        case <-ticker.C:
            // Check if we've exceeded the timeout
            if time.Now().After(deadline) {
                return fmt.Errorf("timeout waiting for catalog %s to be deleted", catalogFullName)
            }

            _, err := k8sClient("get", "clustercatalog", catalogFullName)
            if err != nil {
                // Catalog is deleted - verification succeeded
                return nil
            }
        }
    }
Copilot AI · Jan 12, 2026
There's a potential race condition or timing issue in the deletion check logic. The ticker fires first at time T, checks if the catalog exists, then the next check happens at T+tick. However, the timeout deadline check happens after the ticker fires, meaning if the timeout expires between ticks, we won't detect it until the next tick fires. This could result in waiting up to one additional tick interval beyond the intended timeout.
Consider checking the deadline before waiting on the ticker, or restructuring the loop to check the timeout immediately after each catalog existence check.
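
A hedged sketch of one possible restructuring along those lines: bound the wait with a context deadline so timeout and cancellation are observed even between ticks, and check for deletion immediately instead of waiting for the first tick. The `k8sClient` stub stands in for the e2e suite's helper and is not its real signature:

```go
package steps

import (
	"context"
	"fmt"
	"time"
)

// k8sClient is a stand-in for the e2e suite's kubectl helper; the real helper's
// signature and behaviour may differ.
func k8sClient(args ...string) (string, error) {
	return "", fmt.Errorf("not found")
}

// waitForCatalogDeleted polls until the catalog can no longer be fetched,
// honouring the timeout even while blocked between ticks.
func waitForCatalogDeleted(ctx context.Context, catalogFullName string, timeout, interval time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		// Check immediately, then after every tick, so a fast deletion is seen
		// without waiting a full interval.
		if _, err := k8sClient("get", "clustercatalog", catalogFullName); err != nil {
			// The get failing is treated as "catalog is gone".
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timeout waiting for catalog %s to be deleted: %w", catalogFullName, ctx.Err())
		case <-ticker.C:
		}
	}
}
```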
Force-pushed from 3e5a789 to 6aa486e.
This fix resolves the issues discovered by the e2e tests, ensuring Helm extensions keep running when catalogs are deleted.

Issues Fixed:

Issue 1: Extensions fail when catalog is deleted
- Problem: Resolution failure causes complete failure
- Fix: Fall back to the installed bundle when the catalog is unavailable

Issue 2: Resources break when catalog is deleted
- Problem: No reconciliation without catalog content
- Fix: Helm reconcileExistingRelease() maintains resources

Issue 3: Config changes rejected without catalog
- Problem: Any spec change triggers a resolution failure
- Fix: Allow non-version changes to proceed with the installed bundle

Issue 4: Version upgrades fail silently
- Problem: Can't distinguish catalog deletion from other errors
- Fix: Check if catalogs exist, block upgrades explicitly

How the Fix Works:

When bundle resolution fails, the controller now:
1. Checks if a bundle is already installed
2. Checks if the spec is requesting a version change
3. Checks if catalogs still exist in the cluster

Decision Logic (sketched below):
- Installed + no version change + catalog deleted → use installed bundle
- Installed + version change + catalog deleted → block and retry
- Installed + no version change + catalog exists → retry (transient error)
- No installed bundle → fail (can't proceed)

For Helm Runtime:
- reconcileExistingRelease() uses Helm's built-in reconciliation
- Maintains resources via Server-Side Apply
- Sets up watchers to monitor and restore resources
- Returns success if reconcile succeeds (even if watch setup fails)

The controller watches ClusterCatalog resources, so reconciliation automatically resumes when catalogs return.

Unit Tests Added:
- TestResolutionFallbackToInstalledBundle (3 scenarios)
- TestCheckCatalogsExist (5 scenarios)

All e2e tests now pass.
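
A hedged sketch of the decision logic above; the function, type, and parameter names (decideOnResolutionFailure, resolutionFallback, catalogsExist) are illustrative assumptions, not the PR's identifiers:

```go
package controllers

import (
	"context"
	"fmt"
)

// resolutionFallback enumerates the outcomes described in the decision table above.
type resolutionFallback int

const (
	useInstalledBundle resolutionFallback = iota // keep running on the installed bundle
	retryResolution                              // transient error or blocked upgrade: retry later
	failResolution                               // nothing installed: surface the resolution error
)

// decideOnResolutionFailure maps the three checks (installed bundle, requested
// version change, catalog existence) onto one of the outcomes above.
func decideOnResolutionFailure(ctx context.Context, installed, versionChangeRequested bool,
	catalogsExist func(context.Context) (bool, error)) (resolutionFallback, error) {
	if !installed {
		// Nothing installed yet, so there is nothing to fall back to.
		return failResolution, nil
	}
	exist, err := catalogsExist(ctx)
	if err != nil {
		return retryResolution, fmt.Errorf("checking catalogs: %w", err)
	}
	switch {
	case !exist && !versionChangeRequested:
		// Catalog deleted, same version requested: keep the installed bundle running.
		return useInstalledBundle, nil
	case !exist && versionChangeRequested:
		// Catalog deleted but an upgrade requested: block and keep retrying.
		return retryResolution, nil
	default:
		// Catalogs still exist, so the failure is likely transient: retry resolution.
		return retryResolution, nil
	}
}
```

Per the file table above, the real logic lives in the ResolveBundle reconcile step and uses the client wired in through cmd/operator-controller/main.go.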
Force-pushed from 6aa486e to ad6d701.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
    if state.resolvedRevisionMetadata != nil &&
        state.revisionStates != nil &&
        state.revisionStates.Installed != nil &&
        state.resolvedRevisionMetadata.Name == state.revisionStates.Installed.Name &&
Copilot AI · Jan 12, 2026
There's a potential nil pointer dereference here. The code checks if state.revisionStates and state.revisionStates.Installed are not nil, but does not check if state.resolvedRevisionMetadata is nil before accessing its Name and Version fields on lines 257-258. Add a nil check for state.resolvedRevisionMetadata at the beginning of the condition to prevent a panic.
Codecov Report

    @@            Coverage Diff             @@
    ##             main    #2442      +/-   ##
    ==========================================
    + Coverage   73.00%   73.31%    +0.31%
    ==========================================
      Files         100      100
      Lines        7641     7728       +87
    ==========================================
    + Hits         5578     5666       +88
    + Misses       1625     1620        -5
    - Partials      438      442        +4
Description
Implements catalog deletion resilience for the Helm runtime per the three assumptions below.
Assumptions Addressed
When a catalog is removed:
- Workload remains untouched; no reconciliation until resolution succeeds
- Managed objects don't break
- Full reconciliation when resolution succeeds
How It Works
When resolution fails, the controller checks whether a bundle is already installed, whether the spec requests a version change, and whether catalogs still exist, then either falls back to the installed bundle, blocks the upgrade, or retries (see the decision logic above).
For Helm: reconcileExistingRelease() uses Helm's built-in reconciliation.
Tests
E2E: 4 scenarios (listed at the top of this PR).