Python: fix(core): prevent WorkflowExecutor from re-sending answered requests after checkpoint restore#3293

Closed
moonbox3 wants to merge 1 commit into microsoft:main from moonbox3:3255-fix
Conversation

@moonbox3
Contributor

Motivation and Context

When a sub-workflow managed by WorkflowExecutor was resumed from a checkpoint and responded to a pending request, any subsequent request_info() calls would cause the already-answered request to be re-sent to the parent workflow alongside the new request. This resulted in duplicate requests and incorrect expected_response_count, causing workflows to hang or throw errors.

  • Fix bug where WorkflowExecutor re-sent already-answered RequestInfoEvents after checkpoint restore
  • Add _responded_request_ids set to track which requests have been answered
  • Filter out duplicate requests in _process_workflow_result() before sending to parent workflow
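
A minimal sketch of the filtering idea in the bullets above, assuming a simplified event type. `RequestEvent`, `RespondedTracker`, and the `req-1`/`req-2` IDs are hypothetical stand-ins for illustration; this is not the code in `_workflow_executor.py`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RequestEvent:
    """Hypothetical stand-in for a RequestInfoEvent; only the ID matters here."""

    request_id: str


@dataclass
class RespondedTracker:
    """Toy model of the tracking described above, not the real WorkflowExecutor."""

    responded_request_ids: set[str] = field(default_factory=set)

    def mark_responded(self, request_id: str) -> None:
        # Record that a response has already been delivered for this request.
        self.responded_request_ids.add(request_id)

    def filter_new(self, events: list[RequestEvent]) -> list[RequestEvent]:
        # Keep only requests that have not been answered yet.
        return [e for e in events if e.request_id not in self.responded_request_ids]


tracker = RespondedTracker()
tracker.mark_responded("req-1")

# After a restore, the sub-workflow result may contain the old and the new request.
events = [RequestEvent("req-1"), RequestEvent("req-2")]
forwarded = tracker.filter_new(events)

# Only the forwarded requests should count toward the expected responses.
expected_response_count = len(forwarded)
```

In this toy model only `req-2` would be forwarded to the parent, so the expected response count matches the number of requests that are actually outstanding.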

Description

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

@moonbox3 moonbox3 self-assigned this Jan 20, 2026
Copilot AI review requested due to automatic review settings January 20, 2026 00:22
@moonbox3 moonbox3 added the workflows Related to Workflows in agent-framework label Jan 20, 2026
@markwallace-microsoft
Member

Python Test Coverage

Python Test Coverage Report

File | Stmts | Miss | Cover | Missing
packages/core/agent_framework/_workflows
   _workflow_executor.py | 193 | 35 | 81% | 29, 95, 447, 472, 474, 482–483, 488, 490, 495, 497, 507, 509, 558, 575–576, 578–579, 603–608, 611, 614, 622, 627, 638, 648, 652, 658, 662, 676, 680
TOTAL | 17496 | 2663 | 84% |

Python Unit Test Overview

Tests Skipped Failures Errors Time
3160 213 💤 0 ❌ 0 🔥 1m 2s ⏱️

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a bug where WorkflowExecutor would re-send already-answered RequestInfoEvents after checkpoint restore, causing duplicate requests and incorrect response counts that led to hanging workflows.

Changes:

  • Added _responded_request_ids set to track which requests have been answered
  • Updated checkpoint save/restore to persist this tracking state
  • Added filtering logic in _process_workflow_result() to skip duplicate requests
  • Added comprehensive regression tests for checkpoint restore scenarios
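
One way the "persist this tracking state" item could look, as a hedged sketch: the set is written to the checkpoint as a sorted list (sets are not JSON-serializable) and rebuilt on restore. The function names and the `responded_request_ids` key are assumptions for illustration, not the executor's actual checkpoint schema.

```python
import json


def save_state(responded_request_ids: set[str]) -> dict:
    # Store the IDs as a sorted list so the payload is JSON-serializable and stable.
    return {"responded_request_ids": sorted(responded_request_ids)}


def restore_state(state: dict) -> set[str]:
    # Older checkpoints may lack the key, so default to an empty set.
    return set(state.get("responded_request_ids", []))


original = {"req-2", "req-1"}
payload = json.dumps(save_state(original))  # what would be written to the checkpoint
restored = restore_state(json.loads(payload))
legacy = restore_state({})  # checkpoint created before the field existed
```

Defaulting to an empty set keeps checkpoints written before the change loadable.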

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
python/packages/core/agent_framework/_workflows/_workflow_executor.py | Implements the fix by tracking responded request IDs and filtering duplicates after checkpoint restore
python/packages/core/tests/workflow/test_sub_workflow.py | Adds comprehensive regression tests that verify the fix and full workflow completion after checkpoint restore

@moonbox3 moonbox3 moved this to In Review in Agent Framework Jan 20, 2026
@moonbox3 moonbox3 enabled auto-merge January 20, 2026 00:37
Contributor

I am not sure if this is needed and what the root cause actually is.

When a request is fulfilled, it's taken off record immediately and won't be available in the next checkpoint. My understanding is that a workflow emits a request, and the executor captures it and sends it out as an event or a message, followed by a checkpoint (A), which will have all the pending requests. When a response comes back, the request will be taken off record, and processing will begin, followed by the next checkpoint (B).

If the workflow is resumed from checkpoint A, the request will be re-emitted.
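
The checkpoint A / checkpoint B flow described in this comment can be modeled roughly as follows. `ToyWorkflow` and its methods are hypothetical and only mirror the reviewer's description; they are not the framework's runner or checkpoint API.

```python
import copy


class ToyWorkflow:
    """Toy model: pending requests are the only state that gets checkpointed."""

    def __init__(self) -> None:
        self.pending: dict[str, str] = {}

    def request_info(self, request_id: str, payload: str) -> None:
        # Emitting a request puts it on record as pending.
        self.pending[request_id] = payload

    def respond(self, request_id: str) -> None:
        # A fulfilled request is taken off record immediately.
        self.pending.pop(request_id)

    def checkpoint(self) -> dict[str, str]:
        # Snapshot the pending requests at this point in time.
        return copy.deepcopy(self.pending)

    @staticmethod
    def reemitted_on_restore(checkpoint: dict[str, str]) -> list[str]:
        # Restoring re-emits whatever was pending when the checkpoint was taken.
        return sorted(checkpoint)


wf = ToyWorkflow()
wf.request_info("req-1", "approve?")
checkpoint_a = wf.checkpoint()  # taken while req-1 is still pending
wf.respond("req-1")
checkpoint_b = wf.checkpoint()  # taken after the response was processed
```

Under this model, resuming from `checkpoint_a` re-emits `req-1`, while resuming from `checkpoint_b` re-emits nothing.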

Contributor Author

The fix is needed because on_checkpoint_restore() re-adds pending requests to the sub-workflow's event queue. When a response arrives, we remove the request from WorkflowExecutor's tracking, but it's still in the sub-workflow's event stream.

So when the sub-workflow continues and makes another request_info() call, result.get_request_info_events() returns both the old (answered) request and the new one, causing duplicate SubWorkflowRequestMessages and incorrect expected_response_count.

_responded_request_ids filters these out. The proper fix would be sub-workflow-level checkpoint tracking (Issue #1614), but this works until then.

Contributor

It shouldn't be in the sub workflow's event queue because the event has been emitted. When a checkpoint is created, the event queue should be empty. This is guaranteed by the runner.

Contributor Author

You're right that the event queue is empty at checkpoint time. The issue is on_checkpoint_restore() (lines 519-527) which explicitly re-adds pending requests to the sub-workflow's event queue:

await asyncio.gather(*[
    self.workflow._runner_context.add_request_info_event(event)
    for event in request_info_events
])

When _handle_response() later calls send_responses() on the sub-workflow, run_until_convergence() drains these pre-loop events (lines 88-92 in _runner.py) and they end up in the WorkflowRunResult.

The rehydration is intentional (marked as "temporary solution" with TODO #1614), so we need _responded_request_ids to filter them out.
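
To make the sequence in this comment concrete, here is a toy reproduction under stated assumptions: restore re-queues the pending request, and the next run drains the queue before the new request is emitted. `ToySubWorkflow`, `restore`, and `send_response` are hypothetical names and do not reflect the real runner internals.

```python
from collections import deque


class ToySubWorkflow:
    """Toy model of a sub-workflow whose event queue is rehydrated on restore."""

    def __init__(self) -> None:
        self.event_queue: deque[str] = deque()

    def restore(self, pending_request_ids: list[str]) -> None:
        # Rehydration: pending requests are put back on the event queue.
        self.event_queue.extend(pending_request_ids)

    def send_response(self, answered_id: str, next_request_id: str) -> list[str]:
        # The response for answered_id is consumed here, but the queue still holds
        # the rehydrated event, which is drained before the run loop starts.
        drained = list(self.event_queue)
        self.event_queue.clear()
        # The workflow then continues and emits its next request.
        return drained + [next_request_id]


sub = ToySubWorkflow()
sub.restore(["req-1"])
result_events = sub.send_response("req-1", next_request_id="req-2")
```

In this model the run result carries both `req-1` (already answered) and `req-2`, which is the duplication the comment attributes to the rehydrated events.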

Contributor

I don't see an issue with the flow you're describing here. If the checkpoint contains the pending requests, they should be re-emitted when the checkpoint is loaded.

@moonbox3
Contributor Author

moonbox3 commented Feb 5, 2026

We will handle this in a different manner.

@moonbox3 moonbox3 closed this Feb 5, 2026
auto-merge was automatically disabled February 5, 2026 02:47

Pull request was closed

@github-project-automation github-project-automation bot moved this from In Review to Done in Agent Framework Feb 5, 2026
Labels

python workflows Related to Workflows in agent-framework

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Python: [Bug]: WorkflowExecutor re-sends already-answered RequestInfoEvents after checkpoint restore

5 participants