From 3c84a0570995100b77e2b4c1c6ee79f78233aacf Mon Sep 17 00:00:00 2001
From: "Ralf W. Grosse-Kunstleve"
Date: Sat, 31 Jan 2026 05:46:19 +0700
Subject: [PATCH] Fix race condition in test_cudart_cudaMemcpy3DPeerAsync (#1550)

The test was experiencing flaky failures, particularly on Windows when
per-thread default stream (PTDS) mode is enabled via
CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1.

Root cause:
The test performs a synchronous cudaMemcpy from pageable host memory to
device memory, followed immediately by cudaMemcpy3DPeerAsync in an
explicit stream. When using pageable host memory, cudaMemcpy may return
before the DMA transfer to device memory has completed. The subsequent
async copy operation in a different stream can then read from device
memory before the host-to-device transfer finishes, resulting in
incorrect (zero) data being copied.

Why it appeared flaky:
- In legacy default stream mode, blocking streams created with
  cudaStreamCreate() implicitly synchronize with the default stream,
  masking the race condition.
- In PTDS mode, the per-thread default stream does not synchronize with
  other streams, exposing the race and causing intermittent failures.

Fix:
Add cudaStreamSynchronize(0) after cudaMemcpy to explicitly ensure the
DMA transfer to device memory completes before launching the async copy
operation. This establishes proper ordering between the default stream
and the explicit stream, making the test reliable under both legacy and
PTDS stream semantics.

This fix aligns with documented CUDA behavior: synchronous cudaMemcpy
from pageable host memory does not guarantee completion before return,
and explicit synchronization is required when coordinating with work in
other streams.

(cherry picked from commit c94557c082c93bb7dfed7ce5f509cf8cf5a9deec)
---
 cuda_bindings/tests/test_cudart.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/cuda_bindings/tests/test_cudart.py b/cuda_bindings/tests/test_cudart.py
index 70803c0777..990bc9412d 100644
--- a/cuda_bindings/tests/test_cudart.py
+++ b/cuda_bindings/tests/test_cudart.py
@@ -1146,6 +1146,10 @@ def test_cudart_cudaMemcpy3DPeerAsync():
     (err,) = cudart.cudaMemcpy(dptr, h1, size, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)
     assertSuccess(err)
 
+    # ensure the DMA to device memory has completed
+    (err,) = cudart.cudaStreamSynchronize(0)
+    assertSuccess(err)
+
     # D to arr
     (err,) = cudart.cudaMemcpy3DPeerAsync(params, stream)
     assertSuccess(err)