
Python: Workflows background run#4274

Open
TaoChenOSU wants to merge 6 commits into microsoft:main from TaoChenOSU:taochen/python-workflows-background-run

Conversation

@TaoChenOSU
Contributor

@TaoChenOSU TaoChenOSU commented Feb 25, 2026

Motivation and Context

Currently, we have only a very limited way of handling workflow runs and responding to events: users have to wait until a workflow converges before they can process events, such as requests.

Description

  1. Create a run mode where clients can start a workflow run in the background and poll events as they desire.
  2. Allow clients to respond to requests while the workflow is still running in the background.
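
For illustration, the shape of this run mode could be sketched with plain asyncio primitives. This is a hypothetical minimal stand-in: `BackgroundRunHandle`, `run_in_background`, `poll_events`, and `respond` are illustrative names, not necessarily the PR's exact API.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class RequestEvent:
    request_id: str
    prompt: str


class BackgroundRunHandle:
    """Toy stand-in: a background task plus an event queue."""

    def __init__(self) -> None:
        self.events: asyncio.Queue = asyncio.Queue()
        self.responses: dict[str, str] = {}
        self.task: asyncio.Task | None = None

    def poll_events(self) -> list:
        # Drain whatever the workflow has emitted so far, without blocking.
        drained = []
        while not self.events.empty():
            drained.append(self.events.get_nowait())
        return drained

    def respond(self, request_id: str, answer: str) -> None:
        # Answer a request while the workflow keeps running in the background.
        self.responses[request_id] = answer


async def _workflow(handle: BackgroundRunHandle) -> None:
    # Emit a request, wait for the client's answer, then emit the output.
    await handle.events.put(RequestEvent("req-1", "approve?"))
    while "req-1" not in handle.responses:
        await asyncio.sleep(0.01)
    await handle.events.put(f"done: {handle.responses['req-1']}")


def run_in_background() -> BackgroundRunHandle:
    handle = BackgroundRunHandle()
    handle.task = asyncio.create_task(_workflow(handle))
    return handle


async def main() -> list:
    handle = run_in_background()
    seen: list = []
    while not handle.task.done() or not handle.events.empty():
        for event in handle.poll_events():
            seen.append(event)
            if isinstance(event, RequestEvent):
                handle.respond(event.request_id, "yes")
        await asyncio.sleep(0.01)
    return seen
```

Running `asyncio.run(main())` collects the request event followed by the final output, with the answer supplied while the workflow is still running.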

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

@TaoChenOSU TaoChenOSU self-assigned this Feb 25, 2026
@TaoChenOSU TaoChenOSU added the python and workflows (Related to Workflows in agent-framework) labels Feb 25, 2026
@github-actions github-actions bot changed the title Workflows background run Python: Workflows background run Feb 25, 2026
@markwallace-microsoft markwallace-microsoft added the documentation (Improvements or additions to documentation) label Feb 25, 2026
@TaoChenOSU TaoChenOSU marked this pull request as ready for review February 25, 2026 21:38
Copilot AI review requested due to automatic review settings February 25, 2026 21:38
@TaoChenOSU TaoChenOSU moved this to In Progress in Agent Framework Feb 25, 2026
Contributor

Copilot AI left a comment


Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

finally:
await self._run_cleanup(checkpoint_storage)

async def _resume() -> asyncio.Task[None]: # noqa: RUF029

Copilot AI Feb 25, 2026


The # noqa: RUF029 comment is unnecessary here. RUF029 warns about unused async functions, but _resume is clearly used on line 634 where it's passed to BackgroundRunHandle. This suppression should be removed.

Suggested change
async def _resume() -> asyncio.Task[None]: # noqa: RUF029
async def _resume() -> asyncio.Task[None]:

from ._events import WorkflowEvent
from ._runner_context import RunnerContext

logger = logging.getLogger(__name__)

Copilot AI Feb 25, 2026


The logger variable is imported but never used in this module. Consider removing the unused import or adding appropriate debug/error logging where it might be useful (e.g., in the respond method when resuming after idle).

Contributor

@moonbox3 moonbox3 left a comment


Some initial thoughts.

logger = logging.getLogger(__name__)


class BackgroundRunHandle:
Contributor


It feels like this class is essentially a wrapper around asyncio primitives: create_task and an asyncio.Queue. Why do we need to wrap these well-known constructs?

Member


This also adds another concept that people have to learn; I do not think it is needed.

Contributor Author


I don't agree. We are doing a lot more than just wrapping a workflow in a background task: we are handling events, requests, and errors for the users. And the concept is well known; here is an equivalent concept from OpenAI: https://developers.openai.com/api/docs/guides/background

Contributor


We already support background mode for agents as a keyword argument.

return response_stream
return response_stream.get_final_response()

def run_in_background(
Contributor


Do we genuinely need a polling-based consumption pattern? Are we getting feedback that this is missing today?

We're now pushing more concerns onto the caller. Every consumer has to:

  1. Write the poll loop
  2. Pick a sleep interval (and risk getting it "wrong": too slow adds latency, too fast wastes cycles)
  3. Route events by type manually
  4. Track which request IDs map to which responses
  5. Remember to drain after idle
  6. Handle the resume-after-idle edge case
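
Spelled out, that caller-side burden might look roughly like this. The sketch is hypothetical: it runs against a toy scripted handle, and the real handle API may differ.

```python
import asyncio
import itertools


class MockHandle:
    """Toy stand-in for the proposed handle; real API may differ."""

    def __init__(self) -> None:
        # Scripted poll batches: first a request, then nothing, then the output.
        self._batches = itertools.chain(
            [[("request", "req-1")], [], [("output", "result")]],
            itertools.repeat([]),
        )
        self.is_idle = False

    def poll(self) -> list:
        batch = next(self._batches)
        if batch and batch[0][0] == "output":
            self.is_idle = True  # workflow converged after emitting output
        return batch

    def respond(self, request_id: str, value: str) -> None:
        pass  # would feed the answer back into the running workflow


async def consume(handle: MockHandle) -> list:
    outputs: list[str] = []
    pending: dict[str, bool] = {}            # 4. track request ids yourself
    while not handle.is_idle:                # 1. hand-written poll loop
        for kind, payload in handle.poll():  # 3. route events by type
            if kind == "request":
                pending[payload] = True
                handle.respond(payload, "answer")
            elif kind == "output":
                outputs.append(payload)
        await asyncio.sleep(0.05)            # 2. guess a sleep interval
    for kind, payload in handle.poll():      # 5. drain after going idle
        if kind == "output":
            outputs.append(payload)
    return outputs
```

Here `asyncio.run(consume(MockHandle()))` returns `["result"]`; concern 6 (resuming after idle) would add yet another branch to this loop.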

Member


I very much agree; this is not needed and leads to un-Pythonic code.

Contributor Author


Why is this un-pythonic?


# Single poll loop: process all events and respond to requests inline.
# The workflow continues running in the background while we process events.
outputs: list[str] = []
while not handle.is_idle:
Member


I see almost no value in this, because it makes people have to understand this whole idle/poll machinery. Having users write asyncio.sleep themselves also does not seem right. A much simpler pattern for this scenario is something like a callback for when a response is required; that is a well-known concept in Python and doesn't require as much complexity as this. And when I compare these samples to existing samples that just use streaming, streaming is already enough: it lets you do other work in the meantime, and if you really need to, you can run it in a separate thread and get the same behavior with Python primitives instead of another new thing.
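
The callback shape being suggested here might look like this minimal sketch; `run_workflow` and the callback signature are illustrative, not an existing API.

```python
import asyncio


async def run_workflow(on_request) -> str:
    """Toy workflow that asks its callback whenever it needs input,
    instead of emitting an event for a poll loop to pick up."""
    answer = await on_request("approve deployment?")  # blocks this step only
    return f"finished with: {answer}"


async def main() -> str:
    async def approve(prompt: str) -> str:
        # In a real app this might consult a human or another service.
        return "yes"

    return await run_workflow(approve)
```

`asyncio.run(main())` returns `"finished with: yes"`; the trade-off debated in this thread is that the workflow is blocked inside the callback until it returns.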

Contributor Author


Why is polling not Pythonic? It is a well-known concept in programming, and OpenAI has a similar primitive: https://developers.openai.com/api/docs/guides/background.

Using callbacks is a simpler approach, but it is very limited and doesn't scale well. We had callbacks back in SK and they didn't work well. For example, it is difficult to integrate callbacks with a UI, because the UI loop ends up driven by the callback and users can only handle one callback at a time. And if you don't want pre-determined actions, you have to emit something else from the callback and wait for user input.

With a background run and events, we keep the original event system we already have in Workflows, which reduces the number of concepts people have to learn, given that they are already familiar with polling, which again is a well-known concept.

Member


Polling, in most other cases, means you actually exit whatever you're doing and then check in every so often to see if it is done, not building this handle.is_idle loop; usually a while True loop suffices. This is also the pattern we have now with background responses for certain APIs (OpenAI Responses and A2A, I think), so this handle is the new concept that people have to learn. If we want a background run, it should follow that pattern: you start it, you get an id of some sort back (a continuation token in our case), and then you call run again to see the status. That is also the pattern OpenAI has at the link you showed, not some other object that needs to be managed. Another concern: can you actually serialize this handle, stop the app, then restart and start polling again? Probably not, and we would probably send people to checkpoints if that is a requirement.
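
That start/continuation-token/poll pattern could be sketched like this. The names are hypothetical, and the in-memory store stands in for the durable storage a real service would need.

```python
import threading
import time
import uuid

_runs: dict[str, dict] = {}  # in a real service this would be durable storage


def start_run(payload: str) -> str:
    """Kick off work and return only an id (the continuation token)."""
    run_id = str(uuid.uuid4())
    _runs[run_id] = {"status": "in_progress", "output": None}

    def work() -> None:
        time.sleep(0.05)  # stand-in for the actual workflow
        _runs[run_id] = {"status": "completed", "output": payload.upper()}

    threading.Thread(target=work, daemon=True).start()
    return run_id


def get_run(run_id: str) -> dict:
    """'Call run again to see the status': poll by id, no live object."""
    return _runs[run_id]


def wait_for(run_id: str) -> str:
    # With durable storage, a caller could exit, restart, and resume
    # polling with nothing but the id.
    while get_run(run_id)["status"] in {"queued", "in_progress"}:
        time.sleep(0.01)
    return get_run(run_id)["output"]
```

For example, `wait_for(start_run("hello"))` returns `"HELLO"` once the background thread finishes.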

Callbacks are a somewhat different approach, but I think the guidance we should give is: if you can respond directly, use a callback; if you cannot respond quickly, do not use background, because you won't know when you'll be getting a response back, and then even a hot path might end up idling for long stretches, which would indicate that the process you are building should be split in two anyway.

Contributor Author


You don't have to loop on handle.is_idle; that's just the sample. You can go do something else in your script and come back later to poll all the events produced during that period. You can also do a while True loop until an output event is produced. The OpenAI response is also a live handle that you poll, e.g. while resp.status in {"queued", "in_progress"}.

Given that we are an SDK, returning an ID isn't a good pattern, because we can give users direct access to the background task. If users are building a service around this, they can map generated IDs to handles and expose an API that lets clients poll using IDs.

Another concern: can you actually serialize this handle, stop the app, then restart and start polling again? Probably not, and we would probably send people to checkpoints if that is a requirement.

We already have checkpoints.

if you can respond directly, use a callback; if you cannot respond quickly, do not use background, because you won't know when you'll be getting a response back, and then even a hot path might end up idling for long stretches, which would indicate that the process you are building should be split in two anyway.

This is the hypothetical scenario we are trying to address. If we give this guidance, there is no need for callbacks either; the current event system will suffice.


@TaoChenOSU TaoChenOSU moved this from In Progress to In Review in Agent Framework Feb 26, 2026

Labels

documentation — Improvements or additions to documentation
python
workflows — Related to Workflows in agent-framework

Projects

Status: In Review


5 participants