Summary
On platforms where Unified Memory has limited concurrency support (e.g., certain Windows platforms), destroying a stream without synchronizing can leave the system in a state where any subsequent host access to any Unified Memory causes a crash. This issue proposes adding configurable safeguards to stream destruction.
Problem Description
On affected platforms, the following sequence causes a crash:
- Allocate Unified Memory buffer `B`
- Launch any kernel on any stream (the kernel need not touch `B`)
- Access `B` from the host
- Crash (segfault / access violation)
Root cause: On platforms where the device attribute CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS is zero, the GPU holds exclusive access to all managed memory while any kernel is in flight. Host access to managed memory (even to allocations unrelated to the running kernel) is forbidden until the stream is synchronized.
Key insight: Destroying the stream does not restore safe host access. Only synchronizing the stream before destruction does.
Impact on testing: Tests that launch kernels without synchronizing effectively "arm" a crash. Subsequent tests that access Unified Memory on the host will crash, even though they did nothing wrong. This makes failures difficult to diagnose.
Proposed Solution
Add logic to stream destruction that detects unsynchronized streams and optionally warns or synchronizes before the stream is destroyed.
Detection mechanisms:
- cuStreamQuery(): check if the stream has in-flight work
- CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS: check if the platform is affected
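As a rough illustration of how these two checks could be interpreted, here is a minimal sketch. It does not call the driver. It assumes the caller has already obtained the raw CUresult from cuStreamQuery() and the integer attribute value from cuDeviceGetAttribute(). The helper names `stream_has_pending_work` and `platform_is_affected` are hypothetical and not part of any existing API.

```python
# Driver API result codes, as defined in cuda.h.
CUDA_SUCCESS = 0
CUDA_ERROR_NOT_READY = 600


def stream_has_pending_work(query_result: int) -> bool:
    """Interpret the CUresult returned by cuStreamQuery().

    CUDA_SUCCESS means all work on the stream has completed;
    CUDA_ERROR_NOT_READY means work is still in flight.
    """
    if query_result == CUDA_SUCCESS:
        return False
    if query_result == CUDA_ERROR_NOT_READY:
        return True
    # Assumption: any other code is a genuine error and is surfaced
    # rather than being treated as "synchronized".
    raise RuntimeError(f"cuStreamQuery failed with CUresult {query_result}")


def platform_is_affected(concurrent_managed_access: int) -> bool:
    """True when CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS is zero."""
    return concurrent_managed_access == 0
```

Keeping the interpretation separate from the driver calls makes the logic testable on machines without a GPU. A real implementation inside a destructor would probably need to avoid raising.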
Configuration: CUDA_PYTHON_STREAM_DESTROY_SYNC_MODE
| Value | Behavior |
|---|---|
| 0 | Do nothing (current behavior, default) |
| 1 | Warn unconditionally when destroying an unsynchronized stream |
| 2 | Warn only on affected platforms (concurrentManagedAccess == 0) |
| 3 | Implicitly synchronize on affected platforms |
| 4 | Warn + synchronize on affected platforms |
This gives users a spectrum from "purely diagnostic" to "safety-first" while preserving backward compatibility by default.
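The mode table could translate into decision logic along the following lines. This is a sketch under stated assumptions, not the proposed implementation. It assumes CUDA_PYTHON_STREAM_DESTROY_SYNC_MODE is read from the process environment, that invalid values fall back to the default of 0, and that modes 3 and 4 only act when the stream actually has pending work. The names `SyncMode`, `read_sync_mode`, and `plan_destroy` are hypothetical.

```python
import enum
import os

ENV_VAR = "CUDA_PYTHON_STREAM_DESTROY_SYNC_MODE"


class SyncMode(enum.IntEnum):
    NONE = 0                       # current behavior, default
    WARN_ALWAYS = 1                # warn on any platform
    WARN_IF_AFFECTED = 2           # warn only if concurrentManagedAccess == 0
    SYNC_IF_AFFECTED = 3           # silently synchronize on affected platforms
    WARN_AND_SYNC_IF_AFFECTED = 4  # warn and synchronize on affected platforms


def read_sync_mode(environ=None) -> SyncMode:
    """Parse the mode from the environment (assumed configuration source)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR, "0")
    try:
        return SyncMode(int(raw))
    except ValueError:
        # Assumption: non-numeric or out-of-range values mean "do nothing".
        return SyncMode.NONE


def plan_destroy(mode: SyncMode, has_pending_work: bool,
                 platform_affected: bool) -> tuple[bool, bool]:
    """Return (warn, synchronize) for a stream that is about to be destroyed."""
    if not has_pending_work:
        return (False, False)
    if mode == SyncMode.WARN_ALWAYS:
        return (True, False)
    if not platform_affected:
        return (False, False)
    warn = mode in (SyncMode.WARN_IF_AFFECTED,
                    SyncMode.WARN_AND_SYNC_IF_AFFECTED)
    sync = mode in (SyncMode.SYNC_IF_AFFECTED,
                    SyncMode.WARN_AND_SYNC_IF_AFFECTED)
    return (warn, sync)
```

For example, `plan_destroy(read_sync_mode({ENV_VAR: "4"}), True, True)` yields `(True, True)`, while the same call on a platform with concurrent managed access yields `(False, False)`.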