-
Notifications
You must be signed in to change notification settings - Fork 75
Fix 12.6 build with cudaMemcpyBatchAsync
#5799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
!test |
|
Review updated until commit c04a3b8 Description
|
| Relevant files | |||
|---|---|---|---|
| Bug fix |
|
PR Reviewer Guide
Here are some key observations to aid the review process:
| 🧪 No relevant tests |
| ⚡ Recommended focus areas for review |
Missing Test Coverage
|
Greptile OverviewGreptile SummaryThis PR attempts to fix CUDA 12.6 build failures by adding version guards around The IssueThe PR adds outer version checks ( CUDA Version Behavior:
Root CauseAccording to both the PR description and error messages, the Required FixChange both inner version checks (lines 297 and 454) from Code Locations
Confidence Score: 1/5
Important Files ChangedFile Analysis
Sequence DiagramsequenceDiagram
participant App as Application
participant API as cudaMemcpyBatchAsync API
participant Compiler as CUDA Compiler
Note over App,Compiler: PR Changes: CUDA Version Handling
alt CUDA < 12.8
App->>API: Check CUDA_VERSION < 12080
API-->>App: NVF_THROW error
Note right of API: ✅ Correct: API not available
end
alt CUDA 12.8 - 12.9 (BROKEN)
App->>Compiler: Check CUDA_VERSION >= 12080
Compiler->>Compiler: Passes outer check
App->>Compiler: Check CUDA_VERSION >= 13000
Compiler->>Compiler: Fails (version < 13000)
App->>API: Call with failIdx parameter
API-->>Compiler: ❌ Compilation Error
Note right of API: Parameter removed in 12.8!
end
alt CUDA >= 13.0
App->>Compiler: Check CUDA_VERSION >= 12080
Compiler->>Compiler: Passes outer check
App->>Compiler: Check CUDA_VERSION >= 13000
Compiler->>Compiler: Passes inner check
App->>API: Call without failIdx parameter
API-->>App: ✅ Success
Note right of API: Correct API signature
end
Note over App,Compiler: Expected behavior (after fix):<br/>Inner check should be >= 12080<br/>to match API change version
|
|
!test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No files reviewed, no comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, 2 comments
csrc/multidevice/cuda_p2p.cpp
Outdated
| launchMulticastKernel( | ||
| multicast_handle->bufferMulticastPtr(), src_ptr, count, stream); | ||
| } else if (protocol == MulticastProtocol::BatchMemcpy) { | ||
| #if CUDA_VERSION >= 12080 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version check has been updated to >= 12080 (CUDA 12.8), but there's an inconsistency in the codebase: csrc/driver_api.h line 48 still declares cuMemcpyBatchAsync (driver API) with version 13000 (CUDA 13.0).
While this PR uses cudaMemcpyBatchAsync (runtime API) and not the driver API, the inconsistent version numbers could cause confusion. Consider either:
- Updating
driver_api.hto use version12080forcuMemcpyBatchAsyncif the driver API also changed signature in CUDA 12.8 - Removing
cuMemcpyBatchAsyncfromdriver_api.hif the codebase only uses the runtime API - Adding a comment in
driver_api.hexplaining why different versions are used if intentional
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
csrc/multidevice/cuda_p2p.cpp
Outdated
| NVF_THROW( | ||
| "cudaMemcpyBatchAsync backend is not supported for CUDA version < " | ||
| "12.8"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, users on CUDA < 12.8 will encounter a runtime error if they use the BatchMemcpy protocol. This is the correct behavior, but there's a potential usability issue: getMulticastProtocol() in csrc/multidevice/ipc_utils.cpp returns MulticastProtocol::BatchMemcpy as the default protocol when no explicit protocol is set.
This means users on CUDA 12.6 who don't explicitly set a protocol will hit this error at runtime rather than getting a compile-time failure or automatic fallback. Consider adding:
- A runtime check in
getMulticastProtocol()that automatically falls back toMemcpyorMultimemif CUDA version < 12.8 - Or a compile-time check that changes the default protocol based on CUDA_VERSION
- Or better documentation warning users about CUDA version requirements for BatchMemcpy
Example for option 1:
MulticastProtocol getMulticastProtocol() {
if (isOptionEnabled(EnableOption::MulticastProtocol)) {
if (hasEnableOptionArgument(EnableOption::MulticastProtocol, "multimem")) {
return MulticastProtocol::Multimem;
}
if (hasEnableOptionArgument(EnableOption::MulticastProtocol, "memcpy")) {
return MulticastProtocol::Memcpy;
}
}
#if CUDA_VERSION >= 12080
return MulticastProtocol::BatchMemcpy;
#else
return MulticastProtocol::Memcpy; // Fallback for older CUDA
#endif
}|
Thanks for working on it. Build error looks real. Once the cu130 builds in CI passed, I'll check builds for cu126 and 128 internally. |
|
!test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, 2 comments
Additional Comments (2)
This creates a logical error:
The version check on line 297 should be
The |
Yes sorry about that. Now it is fixed. I think it should be working, as far I understand, but please confirm. Somehow I cannot find the official documentation about the API introduction (12.6?) and the API change (12.8?) |
xwang233
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks for the fix!
Fix 12.6 as described in #5734
@xwang233 is the fix ok ?