
Conversation

@rwgk (Collaborator) commented Jan 30, 2026

Description

test_cudart_cudaMemcpy3DPeerAsync was experiencing flaky failures, particularly on Windows when per-thread default stream (PTDS) mode is enabled via CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1.

Root Cause

The test performs a synchronous cudaMemcpy from pageable host memory to device memory, followed immediately by cudaMemcpy3DPeerAsync in an explicit stream. When using pageable host memory, cudaMemcpy may return before the DMA transfer to device memory has completed. The subsequent async copy operation in a different stream can then read from device memory before the host-to-device transfer finishes, resulting in incorrect (zero) data being copied.
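The failure mode can be modeled without a GPU. Below is a toy sketch in plain Python threading — not real CUDA; the names memcpy_pageable_h2d, peer_copy_async, and dma_may_land are invented for illustration. The "pageable memcpy" returns while its DMA is still in flight, and a read from another "stream" observes stale zeros:

```python
import threading

# Toy model of the race; illustrative only, no real CUDA involved.
device_buf = [0] * 4               # "device memory", zero-initialized
host_buf = [1, 2, 3, 4]            # "pageable host memory" source

dma_may_land = threading.Event()   # models the still-in-flight DMA engine

def memcpy_pageable_h2d():
    """Like cudaMemcpy from pageable memory: returns before the DMA lands."""
    def dma():
        dma_may_land.wait()        # the transfer completes asynchronously
        device_buf[:] = host_buf
    t = threading.Thread(target=dma)
    t.start()
    return t                       # a handle exists only in this toy model

def peer_copy_async():
    """Like cudaMemcpy3DPeerAsync in another stream: reads device memory now."""
    return list(device_buf)

dma = memcpy_pageable_h2d()
racy = peer_copy_async()           # reads before the H2D transfer finished
dma_may_land.set()
dma.join()                         # models the missing synchronization
fixed = peer_copy_async()          # reads after the transfer completed
print(racy, fixed)                 # [0, 0, 0, 0] [1, 2, 3, 4]
```

The interleaving is forced with an event, so the toy reproduces deterministically what on real hardware only happens intermittently.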

This race condition is masked in legacy default stream mode due to implicit synchronization between blocking streams and the default stream, but is exposed in PTDS mode where streams do not implicitly synchronize.

Test Order Dependency

The flakiness was exposed through test order randomization. The cuda_bindings test suite uses pytest-randomly (see #1268), which automatically randomizes test execution order. The test suite shares a module-scoped CUDA context fixture across all tests in test_cudart.py.

When certain tests (notably test_cudaGraphConditionalHandleCreate_v2, test_cudart_cudaGetTextureObjectTextureDesc, test_cudart_cudaMemcpy2DToArray_DtoD, and test_cudart_cudaGraphAddMemcpyNode1D) execute before test_cudart_cudaMemcpy3DPeerAsync, they leave the CUDA execution context in a state that makes the race condition more likely to manifest. Removing any one of these tests from the execution sequence before test_cudart_cudaMemcpy3DPeerAsync causes the test to pass, demonstrating the order dependency.

Solution

Add cudaStreamSynchronize(0) after cudaMemcpy to explicitly ensure the DMA transfer to device memory completes before launching the async copy operation. This establishes proper ordering between the default stream and the explicit stream, making the test reliable under both legacy and PTDS stream semantics.
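In the Python bindings the fixed ordering looks roughly like this — a sketch, not the exact test code; it assumes the cuda.bindings.runtime API and a CUDA-capable GPU, the names d_ptr, h_ptr, nbytes, p, and stream are placeholders, and error checking is elided:

```
from cuda.bindings import runtime as cudart

# Synchronous copy from pageable host memory; may return with the DMA in flight.
(err,) = cudart.cudaMemcpy(d_ptr, h_ptr, nbytes,
                           cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)

# The fix: drain the default stream so the H2D transfer has truly landed
# before any other stream touches d_ptr.
(err,) = cudart.cudaStreamSynchronize(0)

# Now it is safe to launch the async 3D peer copy in its own stream.
(err,) = cudart.cudaMemcpy3DPeerAsync(p, stream)
```

Passing 0 synchronizes whichever default stream is in effect, so the same line is correct under both legacy and PTDS semantics.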

Comprehensive flakiness testing (win-64 WDDM)

  • With fix: 1000 trials - ✓ All tests passed with no flakes, errors, or crashes
  • Without fix: 1000 trials - 58 failures of test_cudart_cudaMemcpy3DPeerAsync

Each trial ran these commands (the second line runs the suite with PTDS enabled, then clears the variable again):

cd cuda_bindings\
python -m pytest -ra -s -vv "tests\test_cudart.py"
set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1 && python -m pytest -ra -s -vv "tests\test_cudart.py" & set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=
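A 1000-trial run like the one described above could be scripted along these lines — a Windows cmd sketch typed at an interactive prompt (use %%i in a batch file); the loop bound and log file names are assumptions, not the author's actual harness:

```
cd cuda_bindings\
for /L %i in (1,1,1000) do (
    python -m pytest -ra -s -vv "tests\test_cudart.py" >> trials_legacy.log 2>&1
    set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1
    python -m pytest -ra -s -vv "tests\test_cudart.py" >> trials_ptds.log 2>&1
    set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=
)
```

Each child pytest process picks up the environment as it exists at launch, so setting and clearing the variable between the two invocations toggles PTDS per run.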

Internal-only xref with full logs: NVIDIA/cuda-python-private#239

@copy-pr-bot (Contributor, bot) commented Jan 30, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


The test was experiencing flaky failures, particularly on Windows when
per-thread default stream (PTDS) mode is enabled via
CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1.

Root cause:
The test performs a synchronous cudaMemcpy from pageable host memory to
device memory, followed immediately by cudaMemcpy3DPeerAsync in an
explicit stream. When using pageable host memory, cudaMemcpy may return
before the DMA transfer to device memory has completed. The subsequent
async copy operation in a different stream can then read from device
memory before the host-to-device transfer finishes, resulting in
incorrect (zero) data being copied.

Why it appeared flaky:
- In legacy default stream mode, blocking streams created with
  cudaStreamCreate() implicitly synchronize with the default stream,
  masking the race condition.
- In PTDS mode, the per-thread default stream does not synchronize with
  other streams, exposing the race and causing intermittent failures.

Fix:
Add cudaStreamSynchronize(0) after cudaMemcpy to explicitly ensure the
DMA transfer to device memory completes before launching the async copy
operation. This establishes proper ordering between the default stream
and the explicit stream, making the test reliable under both legacy
and PTDS stream semantics.

This fix aligns with documented CUDA behavior: synchronous cudaMemcpy
from pageable host memory does not guarantee completion before return,
and explicit synchronization is required when coordinating with work in
other streams.
@rwgk force-pushed the cuda_bindings_test_cudart_fix_flakiness branch from d6322b4 to 691a97f on January 30, 2026 19:42
@rwgk (Collaborator, Author) commented Jan 30, 2026

/ok to test

@kkraus14 added the to-be-backported label (Trigger the bot to raise a backport PR upon merge) Jan 30, 2026
@kkraus14 enabled auto-merge (squash) January 30, 2026 19:44

@kkraus14 merged commit c94557c into NVIDIA:main Jan 30, 2026
88 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 30, 2026
(cherry picked from commit c94557c)
@github-actions

Successfully created backport PR for 12.9.x:

@rwgk deleted the cuda_bindings_test_cudart_fix_flakiness branch January 30, 2026 22:47

rwgk added a commit that referenced this pull request Jan 31, 2026
(cherry picked from commit c94557c)

Co-authored-by: Ralf W. Grosse-Kunstleve <rwgkio@gmail.com>
