
Conversation

@rwgk (Collaborator) commented Jan 30, 2026

Description

test_cudart_cudaMemcpy3DPeerAsync was experiencing flaky failures, particularly on Windows when per-thread default stream (PTDS) mode is enabled via CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1.

Root Cause

The test performs a synchronous cudaMemcpy from pageable host memory to device memory, followed immediately by cudaMemcpy3DPeerAsync in an explicit stream. When using pageable host memory, cudaMemcpy may return before the DMA transfer to device memory has completed. The subsequent async copy operation in a different stream can then read from device memory before the host-to-device transfer finishes, resulting in incorrect (zero) data being copied.
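The failure mode can be modeled without a GPU. Below is a toy sketch in plain Python threading — not real CUDA; the names memcpy_pageable_h2d, peer_copy_async, and dma_may_land are invented for illustration. The "pageable memcpy" returns while its DMA is still in flight, and a read from another "stream" observes stale zeros:

```python
import threading

# Toy model of the race; illustrative only, no real CUDA involved.
device_buf = [0] * 4               # "device memory", zero-initialized
host_buf = [1, 2, 3, 4]            # "pageable host memory" source

dma_may_land = threading.Event()   # models the still-in-flight DMA engine

def memcpy_pageable_h2d():
    """Like cudaMemcpy from pageable memory: returns before the DMA lands."""
    def dma():
        dma_may_land.wait()        # the transfer completes asynchronously
        device_buf[:] = host_buf
    t = threading.Thread(target=dma)
    t.start()
    return t                       # a handle exists only in this toy model

def peer_copy_async():
    """Like cudaMemcpy3DPeerAsync in another stream: reads device memory now."""
    return list(device_buf)

dma = memcpy_pageable_h2d()
racy = peer_copy_async()           # reads before the H2D transfer finished
dma_may_land.set()
dma.join()                         # models the missing synchronization
fixed = peer_copy_async()          # reads after the transfer completed
print(racy, fixed)                 # [0, 0, 0, 0] [1, 2, 3, 4]
```

The interleaving is forced with an event, so the toy reproduces deterministically what on real hardware only happens intermittently.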

This race condition is masked in legacy default stream mode due to implicit synchronization between blocking streams and the default stream, but is exposed in PTDS mode where streams do not implicitly synchronize.

Test Order Dependency

The flakiness was exposed through test order randomization. The cuda_bindings test suite uses pytest-randomly (see #1268), which automatically randomizes test execution order. The test suite shares a module-scoped CUDA context fixture across all tests in test_cudart.py.

When certain tests (notably test_cudaGraphConditionalHandleCreate_v2, test_cudart_cudaGetTextureObjectTextureDesc, test_cudart_cudaMemcpy2DToArray_DtoD, and test_cudart_cudaGraphAddMemcpyNode1D) execute before test_cudart_cudaMemcpy3DPeerAsync, they leave the CUDA execution context in a state that makes the race condition more likely to manifest. Removing any one of these tests from the execution sequence before test_cudart_cudaMemcpy3DPeerAsync causes the test to pass, demonstrating the order dependency.

Solution

Add cudaStreamSynchronize(0) after cudaMemcpy to explicitly ensure the DMA transfer to device memory completes before launching the async copy operation. This establishes proper ordering between the default stream and the explicit stream, making the test reliable under both legacy and PTDS stream semantics.
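In the Python bindings the fixed ordering looks roughly like this — a sketch, not the exact test code; it assumes the cuda.bindings.runtime API and a CUDA-capable GPU, the names d_ptr, h_ptr, nbytes, p, and stream are placeholders, and error checking is elided:

```
from cuda.bindings import runtime as cudart

# Synchronous copy from pageable host memory; may return with the DMA in flight.
(err,) = cudart.cudaMemcpy(d_ptr, h_ptr, nbytes,
                           cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)

# The fix: drain the default stream so the H2D transfer has truly landed
# before any other stream touches d_ptr.
(err,) = cudart.cudaStreamSynchronize(0)

# Now it is safe to launch the async 3D peer copy in its own stream.
(err,) = cudart.cudaMemcpy3DPeerAsync(p, stream)
```

Passing 0 synchronizes whichever default stream is in effect, so the same line is correct under both legacy and PTDS semantics.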

Comprehensive flakiness testing (win-64 WDDM)

  • With fix: 1000 trials - ✓ All tests passed with no flakes, errors, or crashes
  • Without fix: 1000 trials - 58 failures of test_cudart_cudaMemcpy3DPeerAsync

Each trial ran these commands (the second line runs the suite with PTDS enabled, then clears the variable again):

cd cuda_bindings\
python -m pytest -ra -s -vv "tests\test_cudart.py"
set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1 && python -m pytest -ra -s -vv "tests\test_cudart.py" & set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=
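A 1000-trial run like the one described above could be scripted along these lines — a Windows cmd sketch typed at an interactive prompt (use %%i in a batch file); the loop bound and log file names are assumptions, not the author's actual harness:

```
cd cuda_bindings\
for /L %i in (1,1,1000) do (
    python -m pytest -ra -s -vv "tests\test_cudart.py" >> trials_legacy.log 2>&1
    set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1
    python -m pytest -ra -s -vv "tests\test_cudart.py" >> trials_ptds.log 2>&1
    set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=
)
```

Each child pytest process picks up the environment as it exists at launch, so setting and clearing the variable between the two invocations toggles PTDS per run.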

Internal-only xref with full logs: NVIDIA/cuda-python-private#239

@copy-pr-bot (Contributor, bot) commented Jan 30, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


The test was experiencing flaky failures, particularly on Windows when
per-thread default stream (PTDS) mode is enabled via
CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1.

Root cause:
The test performs a synchronous cudaMemcpy from pageable host memory to
device memory, followed immediately by cudaMemcpy3DPeerAsync in an
explicit stream. When using pageable host memory, cudaMemcpy may return
before the DMA transfer to device memory has completed. The subsequent
async copy operation in a different stream can then read from device
memory before the host-to-device transfer finishes, resulting in
incorrect (zero) data being copied.

Why it appeared flaky:
- In legacy default stream mode, blocking streams created with
  cudaStreamCreate() implicitly synchronize with the default stream,
  masking the race condition.
- In PTDS mode, the per-thread default stream does not synchronize with
  other streams, exposing the race and causing intermittent failures.

Fix:
Add cudaStreamSynchronize(0) after cudaMemcpy to explicitly ensure the
DMA transfer to device memory completes before launching the async copy
operation. This establishes proper ordering between the default stream
and the explicit stream, making the test reliable under both legacy
and PTDS stream semantics.

This fix aligns with documented CUDA behavior: synchronous cudaMemcpy
from pageable host memory does not guarantee completion before return,
and explicit synchronization is required when coordinating with work in
other streams.
@rwgk force-pushed the cuda_bindings_test_cudart_fix_flakiness branch from d6322b4 to 691a97f on January 30, 2026 19:42
@rwgk (Collaborator, Author) commented Jan 30, 2026

/ok to test

@kkraus14 added the to-be-backported label (Trigger the bot to raise a backport PR upon merge) Jan 30, 2026
@kkraus14 enabled auto-merge (squash) January 30, 2026 19:44

@kkraus14 merged commit c94557c into NVIDIA:main Jan 30, 2026
88 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 30, 2026
(cherry picked from commit c94557c)
@github-actions

Successfully created backport PR for 12.9.x:

@rwgk deleted the cuda_bindings_test_cudart_fix_flakiness branch January 30, 2026 22:47

rwgk added a commit that referenced this pull request Jan 31, 2026
(cherry picked from commit c94557c)

Co-authored-by: Ralf W. Grosse-Kunstleve <rwgkio@gmail.com>
