@github-actions
Description

Backport of #1550 to 12.9.x.

The test failed intermittently, particularly on Windows when
per-thread default stream (PTDS) mode was enabled via
CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1.

Root cause:
The test performs a synchronous cudaMemcpy from pageable host memory to
device memory, followed immediately by cudaMemcpy3DPeerAsync in an
explicit stream. When using pageable host memory, cudaMemcpy may return
before the DMA transfer to device memory has completed. The subsequent
async copy operation in a different stream can then read from device
memory before the host-to-device transfer finishes, resulting in
incorrect (zero) data being copied.

Why it appeared flaky:
- In legacy default stream mode, blocking streams created with
  cudaStreamCreate() implicitly synchronize with the default stream,
  masking the race condition.
- In PTDS mode, the per-thread default stream does not synchronize with
  other streams, exposing the race and causing intermittent failures.
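
The two cases above can be summarized as a worst-case toy model. This is only a sketch of the reasoning in this description: it makes no CUDA calls, and the function name and parameters are hypothetical, introduced purely for illustration.

```python
def upload_ordered_before_peer_copy(ptds: bool, synced_default_stream: bool) -> bool:
    """Toy model: is the pageable upload guaranteed to finish before the
    async copy in the explicit stream starts?

    Worst case is assumed: when cudaMemcpy returns, the DMA from the
    staging buffer to device memory is still in flight.
    """
    dma_in_flight = True

    if synced_default_stream:
        # An explicit cudaStreamSynchronize(0) waits for the pending transfer.
        dma_in_flight = False

    if not ptds:
        # Legacy default stream: blocking streams implicitly synchronize
        # with it, so the explicit stream cannot run ahead of the upload.
        dma_in_flight = False

    return not dma_in_flight


if __name__ == "__main__":
    for ptds in (False, True):
        for synced in (False, True):
            ordered = upload_ordered_before_peer_copy(ptds, synced)
            print(f"ptds={ptds} synced={synced} ordered={ordered}")
```

In this model only the PTDS case without explicit synchronization leaves the ordering unguaranteed, which matches the intermittent failures described here.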

Fix:
Add cudaStreamSynchronize(0) after cudaMemcpy to explicitly ensure the
DMA transfer to device memory completes before launching the async copy
operation. This establishes proper ordering between the default stream
and the explicit stream, making the test reliable under both legacy
and PTDS stream semantics.
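
As a rough sketch of the call order described above (not the actual test code), the sequence might look like the following. The helper name, its parameters, and the `check` function are hypothetical. The sketch assumes `rt` is a module exposing the CUDA runtime calls named in this description, such as `cuda.bindings.runtime` from cuda-python, and that each call returns a tuple whose first element is the error code.

```python
def upload_then_peer_copy(rt, dev_ptr, host_buf, nbytes, peer_params, stream):
    """Upload from pageable host memory, then launch the async peer copy.

    Hypothetical helper: `rt` is assumed to provide cudaMemcpy,
    cudaStreamSynchronize, cudaMemcpy3DPeerAsync, and cudaMemcpyKind.
    """

    def check(result):
        # Assumption: element 0 is the status code and 0 means cudaSuccess.
        err = result[0]
        if int(err) != 0:
            raise RuntimeError(f"CUDA runtime call failed: {err!r}")

    # Synchronous copy from pageable host memory. It may return once the
    # data has been staged, before the DMA to device memory has finished.
    check(rt.cudaMemcpy(dev_ptr, host_buf, nbytes,
                        rt.cudaMemcpyKind.cudaMemcpyHostToDevice))

    # The fix: wait on the default stream (handle 0) so the upload has
    # completed before any other stream touches the device buffer.
    check(rt.cudaStreamSynchronize(0))

    # Only now launch the copy in the explicit stream.
    check(rt.cudaMemcpy3DPeerAsync(peer_params, stream))
```

Passing the runtime module in as `rt` keeps the sketch importable without a GPU. With cuda-python installed, a caller would presumably pass `cuda.bindings.runtime` together with a real device pointer, a populated `cudaMemcpy3DPeerParms`, and a stream from `cudaStreamCreate()`.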

This fix aligns with documented CUDA behavior: a synchronous cudaMemcpy
from pageable host memory may return once the data has been staged for
DMA, before the transfer to device memory has completed, so explicit
synchronization is required when coordinating with work in other
streams.

(cherry picked from commit c94557c)
@copy-pr-bot
Contributor

copy-pr-bot bot commented Jan 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rwgk
Collaborator

rwgk commented Jan 30, 2026

/ok to test

@copy-pr-bot
Contributor

copy-pr-bot bot commented Jan 30, 2026

/ok to test

@rwgk, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@rwgk
Collaborator

rwgk commented Jan 30, 2026

/ok to test 3c84a05

@rwgk rwgk merged commit d08f202 into 12.9.x Jan 31, 2026
46 checks passed
@rwgk rwgk deleted the backport-1550-to-12.9.x branch January 31, 2026 01:41