Fix race condition in test_cudart_cudaMemcpy3DPeerAsync
#1550
Merged
kkraus14 merged 1 commit into NVIDIA:main from rwgk:cuda_bindings_test_cudart_fix_flakiness on Jan 30, 2026
+4 −0
Conversation
Contributor
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.
The test was experiencing flaky failures, particularly on Windows when per-thread default stream (PTDS) mode is enabled via CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1.

Root cause: The test performs a synchronous cudaMemcpy from pageable host memory to device memory, followed immediately by cudaMemcpy3DPeerAsync in an explicit stream. When using pageable host memory, cudaMemcpy may return before the DMA transfer to device memory has completed. The subsequent async copy operation in a different stream can then read from device memory before the host-to-device transfer finishes, resulting in incorrect (zero) data being copied.

Why it appeared flaky:
- In legacy default stream mode, blocking streams created with cudaStreamCreate() implicitly synchronize with the default stream, masking the race condition.
- In PTDS mode, the per-thread default stream does not synchronize with other streams, exposing the race and causing intermittent failures.

Fix: Add cudaStreamSynchronize(0) after cudaMemcpy to explicitly ensure the DMA transfer to device memory completes before launching the async copy operation. This establishes proper ordering between the default stream and the explicit stream, making the test reliable under both legacy and PTDS stream semantics.

This fix aligns with documented CUDA behavior: synchronous cudaMemcpy from pageable host memory does not guarantee completion before return, and explicit synchronization is required when coordinating with work in other streams.
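A minimal CUDA C sketch of the ordering the fix establishes (the actual change is in Python test code; here a plain cudaMemcpyAsync stands in for cudaMemcpy3DPeerAsync, and buffer names and sizes are illustrative):

```cuda
// Sketch only: error checking elided; requires a CUDA-capable device.
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const size_t n = 1 << 20;
    char *h_src = (char *)malloc(n);   /* pageable host memory */
    memset(h_src, 1, n);

    char *d_src = NULL, *d_dst = NULL;
    cudaMalloc((void **)&d_src, n);
    cudaMalloc((void **)&d_dst, n);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Synchronous copy from pageable memory: the call may return before
     * the DMA into d_src has completed. */
    cudaMemcpy(d_src, h_src, n, cudaMemcpyHostToDevice);

    /* The fix: drain the default stream so the transfer is finished
     * before any other stream reads d_src. Under PTDS, `stream` is not
     * implicitly ordered after the default stream, so this is required. */
    cudaStreamSynchronize(0);

    /* Stand-in for the test's cudaMemcpy3DPeerAsync in an explicit stream. */
    cudaMemcpyAsync(d_dst, d_src, n, cudaMemcpyDeviceToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_src);
    cudaFree(d_dst);
    free(h_src);
    return 0;
}
```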
Force-pushed from d6322b4 to 691a97f
Collaborator Author
/ok to test
kkraus14 approved these changes on Jan 30, 2026
github-actions bot pushed a commit that referenced this pull request on Jan 30, 2026
(cherry picked from commit c94557c)
Successfully created backport PR for
rwgk added a commit that referenced this pull request on Jan 31, 2026
(cherry picked from commit c94557c)
Co-authored-by: Ralf W. Grosse-Kunstleve <rwgkio@gmail.com>
Description
test_cudart_cudaMemcpy3DPeerAsync was experiencing flaky failures, particularly on Windows when per-thread default stream (PTDS) mode is enabled via CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1.

Root Cause
The test performs a synchronous cudaMemcpy from pageable host memory to device memory, followed immediately by cudaMemcpy3DPeerAsync in an explicit stream. When using pageable host memory, cudaMemcpy may return before the DMA transfer to device memory has completed. The subsequent async copy operation in a different stream can then read from device memory before the host-to-device transfer finishes, resulting in incorrect (zero) data being copied.

This race condition is masked in legacy default stream mode due to implicit synchronization between blocking streams and the default stream, but is exposed in PTDS mode, where streams do not implicitly synchronize.
Test Order Dependency
The flakiness was exposed through test order randomization. The cuda_bindings test suite uses pytest-randomly (see #1268), which automatically randomizes test execution order. The test suite shares a module-scoped CUDA context fixture across all tests in test_cudart.py.

When certain tests (notably test_cudaGraphConditionalHandleCreate_v2, test_cudart_cudaGetTextureObjectTextureDesc, test_cudart_cudaMemcpy2DToArray_DtoD, and test_cudart_cudaGraphAddMemcpyNode1D) execute before test_cudart_cudaMemcpy3DPeerAsync, they leave the CUDA execution context in a state that makes the race condition more likely to manifest. Removing any one of these tests from the execution sequence before test_cudart_cudaMemcpy3DPeerAsync causes the test to pass, demonstrating the order dependency.

Solution
Add cudaStreamSynchronize(0) after cudaMemcpy to explicitly ensure the DMA transfer to device memory completes before launching the async copy operation. This establishes proper ordering between the default stream and the explicit stream, making the test reliable under both legacy and PTDS stream semantics.

Comprehensive flakiness testing (win-64 WDDM)
test_cudart_cudaMemcpy3DPeerAsync
Each trial was running these commands:
Internal-only xref with full logs: NVIDIA/cuda-python-private#239