[Backport 12.9.x] Fix race condition in `test_cudart_cudaMemcpy3DPeerAsync` #1554

github-actions · 2026-01-30T22:46:34Z

Description

Backport of #1550 to 12.9.x.

The test was experiencing flaky failures, particularly on Windows when per-thread default stream (PTDS) mode is enabled via `CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1`.

Root cause: The test performs a synchronous `cudaMemcpy` from pageable host memory to device memory, followed immediately by `cudaMemcpy3DPeerAsync` in an explicit stream. When using pageable host memory, `cudaMemcpy` may return before the DMA transfer to device memory has completed. The subsequent async copy operation in a different stream can then read from device memory before the host-to-device transfer finishes, resulting in incorrect (zero) data being copied.

Why it appeared flaky:

- In legacy default stream mode, blocking streams created with `cudaStreamCreate()` implicitly synchronize with the default stream, masking the race condition.
- In PTDS mode, the per-thread default stream does not synchronize with other streams, exposing the race and causing intermittent failures.

Fix: Add `cudaStreamSynchronize(0)` after `cudaMemcpy` to explicitly ensure the DMA transfer to device memory completes before launching the async copy operation. This establishes proper ordering between the default stream and the explicit stream, making the test reliable under both legacy and PTDS stream semantics.

This fix aligns with documented CUDA behavior: synchronous `cudaMemcpy` from pageable host memory does not guarantee completion before return, and explicit synchronization is required when coordinating with work in other streams.

(cherry picked from commit c94557c)
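The ordering that the fix establishes can be sketched as a small helper. This is an illustrative sketch, not the code from the PR: the helper name `h2d_copy_with_sync` and its parameters are hypothetical, and the `cudart` argument is assumed to be the cuda-python runtime bindings module (for example `cuda.bindings.runtime`), whose calls return a tuple beginning with the error code.

```python
def h2d_copy_with_sync(cudart, dst_dev_ptr, src_host, nbytes):
    """Copy host -> device, then synchronize stream 0 before returning.

    Hypothetical helper for illustration only. `cudart` is assumed to be the
    cuda-python runtime bindings module; it is passed in as an argument so
    this sketch imports nothing on its own.
    """
    # Synchronous copy from (possibly pageable) host memory to the device.
    (err,) = cudart.cudaMemcpy(
        dst_dev_ptr,
        src_host,
        nbytes,
        cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,
    )
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaMemcpy failed: {err}")

    # The step added by the fix: synchronize stream handle 0 so the transfer
    # has finished before any work is enqueued on another stream.
    (err,) = cudart.cudaStreamSynchronize(0)
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaStreamSynchronize failed: {err}")
```

In a test shaped like the one described, such a helper would be called just before `cudaMemcpy3DPeerAsync` is issued on the explicit stream, so the async copy only reads device memory that has already been populated.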

copy-pr-bot · 2026-01-30T22:46:37Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rwgk · 2026-01-30T23:22:16Z

/ok to test

copy-pr-bot · 2026-01-30T23:22:19Z

/ok to test

@rwgk, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

rwgk · 2026-01-30T23:22:43Z

/ok to test 3c84a05

github-actions bot mentioned this pull request Jan 30, 2026

Fix race condition in test_cudart_cudaMemcpy3DPeerAsync #1550

Merged

rwgk approved these changes Jan 31, 2026

rwgk merged commit d08f202 into 12.9.x Jan 31, 2026
46 checks passed

rwgk deleted the backport-1550-to-12.9.x branch January 31, 2026 01:41
