feat: graceful drain for workflow engine deletion #21993

Draft
wentzeld wants to merge 2 commits into develop from feat/graceful-drain-workflow-deletion

@wentzeld
Contributor

Summary

  • Adds non-blocking drain semantics to workflow engine v2 so delete/pause events don't block the reconciliation loop
  • Two-phase deletion: Drain() (fast, stops new executions) → retry until ActiveExecutions() == 0 → Close() (now fast)
  • Fixes pre-existing herr/err variable shadowing bugs that hid errors from status change event emission
  • Fixes pre-existing ErrAlreadyStopped permanent retry loop when Close succeeds but artifact deletion fails
  • Detects draining engines in registration handler to allow replacement on re-activation

Test plan

  • Drain() prevents new executions from starting
  • Drain() sets health condition so Healthy() returns error
  • ActiveExecutions counter increments/decrements correctly
  • workflowDeletedEvent with active executions returns ErrDrainInProgress
  • workflowDeletedEvent with zero active executions completes successfully
  • workflowRegisteredEvent detects draining engine and replaces it
  • ErrAlreadyStopped ignored on retry, deletion completes
  • Shadowing fix: EmitWorkflowStatusChangedEventV2 receives errors
  • Existing tests pass without modification

  Prevent new executions from starting when a delete/pause event is pending,
  and defer deletion until all in-flight executions complete. This avoids
  blocking the reconciliation loop (1 of 12 semaphore slots) while waiting
  for long-running executions to finish.

  Changes:
  - Add Drain(), IsDraining(), ActiveExecutions() to workflow engine v2
  - Add DrainableService interface for type-safe drain detection
  - Two-phase delete: drain (non-blocking) then close (fast once drained)
  - Fix herr/err variable shadowing in WorkflowDeleted/Paused handlers
  - Ignore ErrAlreadyStopped on Close() retry to prevent permanent loops
  - Detect draining engines in registration handler to allow replacement
  - Add drainingWorkflows metric gauge for operator visibility
@github-actions
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in the changeset text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions
Contributor

github-actions bot commented Apr 13, 2026

✅ No conflicts with other open PRs targeting develop

  - revive: simplify early-return in workflowRegisteredEvent drain check
  - perfsprint: use errors.New instead of fmt.Errorf without format args
  - unused: remove unwired recordDrainingWorkflow method (metric gauge kept)
@trunk-io

trunk-io bot commented Apr 13, 2026

Failed Test Failure Summary Logs
Test_CCIPTokenTransfer_EVM2Sui_ManagedTokenPool_NoRateLimit Logs ↗︎

