-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Python: fix(core): prevent WorkflowExecutor from re-sending answered requests after checkpoint restore #3293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is needed and what the root cause actually is.
When a request is fulfilled, it's taken off record immediately and won't be available in the next checkpoint. My understanding is that a workflow emits a request, the executor captures it and send it out as an event or a message, followed by a checkpoint (A), which will have all the pending requests. When a respond comes back, the request will be taken off record, and processing will begin, followed by the next checkpoint (B).
If the workflow is resumed from checkpoint A, the request will be reemitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix is needed because
on_checkpoint_restore()re-adds pending requests to the sub-workflow's event queue. When a response arrives, we remove the request from WorkflowExecutor's tracking, but it's still in the sub-workflow's event stream.So when the sub-workflow continues and makes another
request_info()call,result.get_request_info_events()returns both the old (answered) request and the new one, causing duplicate SubWorkflowRequestMessages and incorrectexpected_response_count._responded_request_idsfilters these out. The proper fix would be sub-workflow-level checkpoint tracking (Issue #1614), but this works until then.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be in the sub workflow's event queue because the event has been emitted. When a checkpoint is created, the event queue should be empty. This is guaranteed by the runner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that the event queue is empty at checkpoint time. The issue is
on_checkpoint_restore()(lines 519-527) which explicitly re-adds pending requests to the sub-workflow's event queue:When
_handle_response()later callssend_responses()on the sub-workflow,run_until_convergence()drains these pre-loop events (lines 88-92 in _runner.py) and they end up in theWorkflowRunResult.The rehydration is intentional (marked as "temporary solution" with TODO #1614), so we need
_responded_request_idsto filter them out.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see an issue with the flow you're describing here. If the checkpoint contains the pending requests, they should be re-emitted when the checkpoint is loaded.