Python: fix(core): prevent WorkflowExecutor from re-sending answered requests after checkpoint restore by moonbox3 · Pull Request #3293 · microsoft/agent-framework

moonbox3 · 2026-01-20T00:22:42Z

Motivation and Context

When a sub-workflow managed by WorkflowExecutor was resumed from a checkpoint and responded to a pending request, any subsequent request_info() calls would cause the already-answered request to be re-sent to the parent workflow alongside the new request. This resulted in duplicate requests and incorrect expected_response_count, causing workflows to hang or throw errors.

Fix bug where WorkflowExecutor re-sent already-answered RequestInfoEvents after checkpoint restore
Add _responded_request_ids set to track which requests have been answered
Filter out duplicate requests in _process_workflow_result() before sending to parent workflow
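
The filtering idea in the list above can be sketched with toy stand-ins. This is only an illustration, not the code from the PR: ToyRequest, ToyExecutor, handle_response and process_result are hypothetical names, and only _responded_request_ids is taken from the description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical stand-in for a RequestInfoEvent; only the ID matters here."""
    request_id: str


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for WorkflowExecutor's request bookkeeping."""
    pending: dict[str, ToyRequest] = field(default_factory=dict)
    _responded_request_ids: set[str] = field(default_factory=set)

    def handle_response(self, request_id: str) -> None:
        # Remember that this request was answered so it is never forwarded again.
        self.pending.pop(request_id, None)
        self._responded_request_ids.add(request_id)

    def process_result(self, events: list[ToyRequest]) -> list[ToyRequest]:
        # Forward only requests that have not been answered already.
        fresh = [e for e in events if e.request_id not in self._responded_request_ids]
        for e in fresh:
            self.pending[e.request_id] = e
        return fresh


executor = ToyExecutor(pending={"req-1": ToyRequest("req-1")})
executor.handle_response("req-1")
# The sub-workflow result contains the old (answered) request and a new one.
forwarded = executor.process_result([ToyRequest("req-1"), ToyRequest("req-2")])
print([e.request_id for e in forwarded])  # ['req-2']
```

In this toy model the number of forwarded requests, and therefore the expected response count, is 1 rather than 2.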

Description

Fixes Python: [Bug]: WorkflowExecutor re-sends already-answered RequestInfoEvents after checkpoint restore #3255

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows the Contribution Guidelines
All unit tests pass, and I have added new tests where possible
Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

markwallace-microsoft · 2026-01-20T00:24:43Z

Python Test Coverage Report

File	Stmts	Miss	Cover	Missing
packages/core/agent_framework/_workflows
_workflow_executor.py	193	35	81%	29, 95, 447, 472, 474, 482–483, 488, 490, 495, 497, 507, 509, 558, 575–576, 578–579, 603–608, 611, 614, 622, 627, 638, 648, 652, 658, 662, 676, 680
TOTAL	17496	2663	84%

Python Unit Test Overview

Tests	Skipped	Failures	Errors	Time
3160	213 💤	0 ❌	0 🔥	1m 2s ⏱️

Copilot

Pull request overview

This PR fixes a bug where WorkflowExecutor would re-send already-answered RequestInfoEvents after checkpoint restore, causing duplicate requests and incorrect response counts that led to hanging workflows.

Changes:

Added _responded_request_ids set to track which requests have been answered
Updated checkpoint save/restore to persist this tracking state
Added filtering logic in _process_workflow_result() to skip duplicate requests
Added comprehensive regression tests for checkpoint restore scenarios
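
For the checkpoint save/restore change, a rough sketch of what persisting the tracking set could look like follows. The function names and the state keys are assumptions made for illustration; the actual checkpoint format in _workflow_executor.py is not shown in this PR page.

```python
import json


def save_state(pending_ids: set[str], responded_ids: set[str]) -> str:
    # Sets are not JSON-serializable, so store them as sorted lists.
    return json.dumps({
        "pending_request_ids": sorted(pending_ids),
        "responded_request_ids": sorted(responded_ids),
    })


def restore_state(blob: str) -> tuple[set[str], set[str]]:
    state = json.loads(blob)
    # Default to an empty list so checkpoints written before the field existed still load.
    return (
        set(state.get("pending_request_ids", [])),
        set(state.get("responded_request_ids", [])),
    )


blob = save_state({"req-2"}, {"req-1"})
pending_ids, responded_ids = restore_state(blob)
legacy_pending, legacy_responded = restore_state('{"pending_request_ids": ["req-9"]}')
```

The default in restore_state is a design choice in this sketch: older checkpoints simply restore with an empty set of responded IDs.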

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
python/packages/core/agent_framework/_workflows/_workflow_executor.py	Implements the fix by tracking responded request IDs and filtering duplicates after checkpoint restore
python/packages/core/tests/workflow/test_sub_workflow.py	Adds comprehensive regression tests that verify the fix and full workflow completion after checkpoint restore

TaoChenOSU · 2026-01-21T00:24:43Z

python/packages/core/agent_framework/_workflows/_workflow_executor.py

I am not sure if this is needed and what the root cause actually is.

When a request is fulfilled, it's taken off record immediately and won't be available in the next checkpoint. My understanding is that a workflow emits a request, the executor captures it and sends it out as an event or a message, followed by a checkpoint (A), which will have all the pending requests. When a response comes back, the request will be taken off record, and processing will begin, followed by the next checkpoint (B).

If the workflow is resumed from checkpoint A, the request will be re-emitted.
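
The checkpoint A / checkpoint B model described in this comment can be written out as a tiny sketch. The set-of-IDs bookkeeping is a hypothetical simplification, not how the runner stores pending requests.

```python
# Hypothetical bookkeeping: a set of pending request IDs, snapshotted at each checkpoint.
pending: set[str] = set()

pending.add("req-1")               # the workflow emits a request
checkpoint_a = frozenset(pending)  # checkpoint A captures the pending request

pending.discard("req-1")           # a response arrives; the request is taken off record
checkpoint_b = frozenset(pending)  # checkpoint B no longer contains it

# Resuming from A would re-emit req-1; resuming from B would re-emit nothing.
print(sorted(checkpoint_a), sorted(checkpoint_b))  # ['req-1'] []
```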

moonbox3 · 2026-01-21

The fix is needed because on_checkpoint_restore() re-adds pending requests to the sub-workflow's event queue. When a response arrives, we remove the request from WorkflowExecutor's tracking, but it's still in the sub-workflow's event stream.

So when the sub-workflow continues and makes another request_info() call, result.get_request_info_events() returns both the old (answered) request and the new one, causing duplicate SubWorkflowRequestMessages and incorrect expected_response_count.

_responded_request_ids filters these out. The proper fix would be sub-workflow-level checkpoint tracking (Issue #1614), but this works until then.

TaoChenOSU · 2026-01-21

It shouldn't be in the sub-workflow's event queue because the event has been emitted. When a checkpoint is created, the event queue should be empty. This is guaranteed by the runner.

moonbox3 · 2026-01-21

You're right that the event queue is empty at checkpoint time. The issue is on_checkpoint_restore() (lines 519-527), which explicitly re-adds pending requests to the sub-workflow's event queue:

    await asyncio.gather(*[
        self.workflow._runner_context.add_request_info_event(event)
        for event in request_info_events
    ])

When _handle_response() later calls send_responses() on the sub-workflow, run_until_convergence() drains these pre-loop events (lines 88-92 in _runner.py) and they end up in the WorkflowRunResult.

The rehydration is intentional (marked as "temporary solution" with TODO #1614), so we need _responded_request_ids to filter them out.
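
The sequence described here (rehydrate on restore, then drain on resume) can be imitated with a toy model. ToySubWorkflow and its methods are hypothetical and much simpler than the real runner; the sketch only shows why an answered request can reappear next to a new one.

```python
class ToySubWorkflow:
    """Hypothetical model of a sub-workflow's event queue; not the real runner."""

    def __init__(self) -> None:
        self.event_queue: list[str] = []

    def restore(self, pending_request_ids: list[str]) -> None:
        # Rehydration: pending requests are put back on the queue after restore.
        self.event_queue.extend(pending_request_ids)

    def send_responses(self, answered_id: str, next_request_id: str) -> list[str]:
        # Resuming drains whatever is already queued, then the workflow asks a new question.
        # answered_id is deliberately unused: nothing removes it from the queue.
        drained = list(self.event_queue)
        self.event_queue.clear()
        return drained + [next_request_id]


sub = ToySubWorkflow()
sub.restore(["req-1"])
events = sub.send_responses(answered_id="req-1", next_request_id="req-2")
print(events)  # ['req-1', 'req-2']
```

In this model the result carries both the answered request and the new one, which is the duplicate that a responded-ID filter would drop.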

TaoChenOSU · 2026-01-21

I don't see an issue with the flow you're describing here. If the checkpoint contains the pending requests, they should be re-emitted when the checkpoint is loaded.

moonbox3 · 2026-02-05T02:47:51Z

We will handle this in a different manner.

Commit b5de099: fix(core): prevent WorkflowExecutor from re-sending answered requests after checkpoint restore

moonbox3 self-assigned this Jan 20, 2026

moonbox3 added the python label Jan 20, 2026

Copilot AI review requested due to automatic review settings January 20, 2026 00:22

moonbox3 added the workflows Related to Workflows in agent-framework label Jan 20, 2026

moonbox3 added this to Agent Framework Jan 20, 2026

Copilot started reviewing on behalf of moonbox3 January 20, 2026 00:23

Copilot AI reviewed Jan 20, 2026

moonbox3 moved this to In Review in Agent Framework Jan 20, 2026

moonbox3 enabled auto-merge January 20, 2026 00:37

TaoChenOSU reviewed Jan 21, 2026

dmytrostruk approved these changes Jan 21, 2026

moonbox3 closed this Feb 5, 2026

auto-merge was automatically disabled February 5, 2026 02:47
Pull request was closed

github-project-automation bot moved this from In Review to Done in Agent Framework Feb 5, 2026

TaoChenOSU mentioned this pull request Feb 5, 2026 in Python: Prevent WorkflowExecutor from re-sending answered requests after checkpoint restore #3689 (Merged)

moonbox3 wants to merge 1 commit into microsoft:main from moonbox3:3255-fix

5 participants
