Skip to content

Python: Stop emitting duplicate reasoning content from OpenAI response.reasoning_text.done and response.reasoning_summary_text.done events#5162

Open
moonbox3 wants to merge 10 commits intomicrosoft:mainfrom
moonbox3:agent/fix-5157-2
Open

Python: Stop emitting duplicate reasoning content from OpenAI response.reasoning_text.done and response.reasoning_summary_text.done events#5162
moonbox3 wants to merge 10 commits intomicrosoft:mainfrom
moonbox3:agent/fix-5157-2

Conversation

@moonbox3
Copy link
Copy Markdown
Contributor

@moonbox3 moonbox3 commented Apr 8, 2026

Motivation and Context

The OpenAI Responses API sends both .delta (incremental chunks) and .done (full accumulated text) events for reasoning content. The chat client was converting both into identical Content objects with type text_reasoning, causing downstream consumers like ag-ui to duplicate the reasoning text and emit extra events.

Fixes #5157

Description

The root cause was in _chat_client.py: the response.reasoning_text.done and response.reasoning_summary_text.done handlers created Content.from_text_reasoning objects identical to the delta events, so the full accumulated text was appended on top of already-streamed deltas. The fix removes the Content emission from both .done handlers (keeping only metadata extraction), since the delta events already stream the complete text incrementally. Additionally, a guard was added to _parse_structured_response_value to handle empty text, and regression tests were added in the ag-ui and core test suites to verify that reasoning deltas accumulate correctly without duplication.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Note: PR autogenerated by moonbox3's agent

…crosoft#5157)

The OpenAI Responses API sends both reasoning_text.delta (incremental
chunks) and reasoning_text.done (full accumulated text) events. The
chat client was emitting Content for both, causing ag-ui to append the
full done text onto already-accumulated delta text, producing
duplicated reasoning output.

Stop emitting Content for reasoning_text.done and
reasoning_summary_text.done events, matching how output_text.done is
already handled (not emitted). The deltas contain all the content;
the done event is redundant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 8, 2026 10:43
@moonbox3 moonbox3 self-assigned this Apr 8, 2026
@moonbox3 moonbox3 added the python label Apr 8, 2026
@moonbox3
Copy link
Copy Markdown
Contributor Author

moonbox3 commented Apr 8, 2026

Python Test Coverage

Python Test Coverage Report •
FileStmtsMissCoverMissing
packages/ag-ui/agent_framework_ag_ui
   _agent_run.py4875189%150–157, 196–197, 204, 301, 313, 317, 319–320, 336, 363–364, 400–404, 518–520, 532–534, 632, 640, 753, 755–756, 809, 826–827, 834, 902, 925, 933, 935, 938, 944, 997, 1000, 1010–1011, 1018, 1064
   _run_common.py2861096%312–313, 319–324, 625–626
   _workflow_run.py5092894%181, 217–220, 248, 253, 281, 291, 302, 307, 310, 323, 363–364, 385, 434, 454, 462, 465, 470, 485, 558, 600–601, 672, 715, 734
packages/azure-cosmos/agent_framework_azure_cosmos
   _checkpoint_storage.py135199%396
packages/openai/agent_framework_openai
   _chat_client.py86712885%522–525, 529–530, 536–537, 547–548, 555, 570–576, 597, 605, 628, 746, 845, 904, 906, 908, 910, 976, 990, 1070, 1080, 1085, 1128, 1205, 1235, 1292, 1386, 1391, 1395–1397, 1401–1402, 1472, 1501, 1507, 1517, 1523, 1528, 1534, 1539–1540, 1601, 1623–1624, 1639–1640, 1658–1659, 1700–1703, 1865, 1903–1904, 1920, 1922, 2001–2009, 2039, 2146, 2181, 2196, 2216–2226, 2239, 2250–2254, 2268, 2282–2293, 2302, 2334–2337, 2345–2346, 2348–2350, 2364–2366, 2376–2377, 2383, 2398
TOTAL27202318788% 

Python Unit Test Overview

Tests Skipped Failures Errors Time
5428 20 💤 0 ❌ 0 🔥 1m 26s ⏱️

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes duplicated reasoning text in the Python OpenAI streaming client by ensuring response.reasoning_text.done and response.reasoning_summary_text.done events no longer emit text_reasoning content (since the corresponding .delta events already stream the full text incrementally). This prevents downstream consumers (notably ag-ui) from double-appending reasoning content.

Changes:

  • Stop emitting Content.from_text_reasoning(...) for response.reasoning_text.done and response.reasoning_summary_text.done events; keep metadata extraction only.
  • Update OpenAI chat client tests to assert .done events produce no content (regression coverage for duplication).
  • Add ag-ui regression tests to verify reasoning delta accumulation behavior and per-delta event emission.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
python/packages/openai/agent_framework_openai/_chat_client.py Avoid duplicate text_reasoning content by not emitting reasoning content for .done events (metadata-only).
python/packages/openai/tests/openai/test_openai_chat_client.py Update tests to validate that reasoning .done events do not produce contents.
python/packages/ag-ui/tests/ag_ui/test_run.py Add regression tests ensuring reasoning deltas accumulate without duplication and emit one content event per delta.

Copy link
Copy Markdown
Contributor Author

@moonbox3 moonbox3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 88%

✓ Correctness

This PR correctly removes the content emission from response.reasoning_text.done and response.reasoning_summary_text.done event handlers in the OpenAI chat client. The .done events carry the full accumulated text that was already streamed via .delta events, so emitting content from both would cause duplication in downstream consumers like the ag-ui reasoning accumulation (_emit_text_reasoning), which appends each delta to flow.accumulated_reasoning. The tests are updated consistently, and new tests in ag-ui verify that delta accumulation works correctly without duplication. The metadata extraction from .done events is preserved, which is correct since metadata should still be captured.

✓ Security Reliability

This PR fixes a content duplication bug where both response.reasoning_text.delta and response.reasoning_text.done (and their summary counterparts) emitted Content.from_text_reasoning, causing downstream AG-UI reasoning accumulation to duplicate the full reasoning text. The fix correctly removes the content emission from .done events while preserving metadata extraction. The .done events in the OpenAI Responses API carry the complete accumulated text (which was already streamed via deltas), so emitting it again was redundant. The new tests in test_run.py properly validate accumulation behavior, and the updated tests in test_openai_chat_client.py correctly assert zero contents from done events. No security or reliability issues found.

✓ Test Coverage

The PR correctly removes duplicate content emission from reasoning done events in the OpenAI chat client and updates corresponding tests. The done-event tests are properly updated to assert empty contents while still verifying metadata propagation. Two new ag-ui tests validate accumulation behavior and per-delta event emission. However, there's no integration-level test that sends a realistic sequence (multiple deltas followed by a done event) through _parse_chunk_from_openai to verify that the combined output has no duplicated reasoning content—the existing tests verify each event type in isolation. The new test_reasoning_deltas_accumulate_correctly test is also somewhat redundant with the pre-existing test_reasoning_accumulates_incremental_deltas test which does nearly the same thing with 3 deltas.

✓ Design Approach

The fix correctly addresses a real duplication bug: the response.reasoning_text.done and response.reasoning_summary_text.done events were emitting the full accumulated text as a new Content.from_text_reasoning item, which the ag-ui _emit_text_reasoning accumulator then appended on top of the already-accumulated delta content, producing doubled output. Suppressing content emission from done events is pragmatic and correct for the common streaming case. However, the fix introduces a fragile assumption: if a done event arrives without any preceding delta events (e.g., if the stream resumes from a checkpoint or if the OpenAI API ever batches very short reasoning into only a done event), the reasoning content is silently dropped with no fallback path. A more robust design would be to track whether deltas were received per item_id and use the done event as a fallback source of truth. Additionally, the new ag-ui tests added in test_run.py verify delta accumulation behavior that was already tested by the surrounding test class, rather than directly testing the fix itself (which is in _chat_client.py). The fix also creates an asymetry with response.code_interpreter_call_code.done, which still emits content from the done event, potentially warranting a similar treatment.

Suggestions

  • Consider adding an integration-level test in test_openai_chat_client.py that sends a sequence of response.reasoning_text.delta events followed by a response.reasoning_text.done event through _parse_chunk_from_openai and asserts the total collected contents contain only the delta-produced items with no duplication from the done event. This would validate the bug-fix scenario end-to-end rather than testing each event type in isolation.
  • The new test_reasoning_deltas_accumulate_correctly test in test_run.py (line 1505) is nearly identical to the existing test_reasoning_accumulates_incremental_deltas (line ~1463). Consider removing the redundant test or differentiating it by testing an edge case (e.g., empty-string deltas or a single delta) to more directly demonstrate the deduplication fix.
  • The fix assumes done events are always preceded by delta events. A more robust approach would be to track whether any deltas were received for a given item_id and emit content from the done event only when no deltas were observed, avoiding silent content loss if the streaming model changes or the API is used in a non-standard way.
  • The response.code_interpreter_call_code.done case still emits content from the done event, creating an asymmetry. If the ag-ui layer ever adds accumulation support for code interpreter content similar to reasoning, this will cause the same duplication bug. A comment noting the intentional asymmetry would help future maintainers.

Automated review by moonbox3's agents

Copilot and others added 2 commits April 8, 2026 10:51
…bserved (microsoft#5157)

Address PR review feedback:
- Track item_ids that received reasoning deltas via seen_reasoning_delta_item_ids set
- Emit content from done events only when no deltas were received for the
  item_id, preventing silent content loss on stream resumption
- Add comment documenting code_interpreter done event asymmetry
- Replace redundant ag-ui test with deduplication-focused test
- Add integration test for delta+done sequence in OpenAI chat client tests
- Add fallback path tests for done events without preceding deltas

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…esponse.reasoning_text.delta" and "response.reasoning_text.done" both get exposed as "text_reasoning"
Copy link
Copy Markdown
Contributor Author

@moonbox3 moonbox3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 97% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach


Automated review by moonbox3's agents

Copilot and others added 7 commits April 9, 2026 02:39
…oft#5157)

_emit_text_reasoning now follows the same streaming pattern as _emit_text:
- Emits ReasoningStartEvent/ReasoningMessageStartEvent only on the first
  delta for a given message_id
- Emits only ReasoningMessageContentEvent for subsequent deltas
- Defers ReasoningMessageEndEvent/ReasoningEndEvent until
  _close_reasoning_block is called (on content type switch or end-of-run)

This produces the correct protocol pattern:
  ReasoningStartEvent
    ReasoningMessageStartEvent
    ReasoningMessageContentEvent(delta1)
    ReasoningMessageContentEvent(delta2)
    ReasoningMessageEndEvent
  ReasoningEndEvent

Instead of wrapping every delta in a full Start→End sequence.

Backward compatibility is preserved: calling _emit_text_reasoning without
a flow argument still produces the full sequence per call.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move inline import of TextMessageContentEvent to the top-level import
block and ensure alphabetical ordering to satisfy ruff I001 rule.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…flowEvent

The 'event' variable was already typed as WorkflowEvent[Any] from the
async for loop at line 590. Reusing it in the _close_reasoning_block
loop (which returns list[BaseEvent]) caused an incompatible assignment
error. Renamed to 'reasoning_evt' to avoid the conflict.

Fixes microsoft#5162

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Apr 9, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python: [Bug]: "type": "response.reasoning_text.delta" and "response.reasoning_text.done" both get exposed as "text_reasoning"

5 participants