Skip to content

Conversation

@samnordmann
Copy link
Collaborator

Fix 12.6 as described in #5734
@xwang233 is the fix ok ?

@samnordmann samnordmann requested a review from xwang233 January 12, 2026 15:38
@samnordmann
Copy link
Collaborator Author

!test

@github-actions
Copy link

github-actions bot commented Jan 12, 2026

Review updated until commit c04a3b8

Description

  • Add CUDA version check for cudaMemcpyBatchAsync support (requires CUDA >= 12.8)

  • Throw clear exception when BatchMemcpy protocol used on unsupported CUDA versions

  • Wrap existing cudaMemcpyBatchAsync calls in version guard to prevent crashes

  • Apply fix to both postBroadcastWithCudaBackend and postAllgatherWithCudaBackend functions

Changes walkthrough

Relevant files
Bug fix
cuda_p2p.cpp
Add CUDA version guards for cudaMemcpyBatchAsync                 

csrc/multidevice/cuda_p2p.cpp

  • Added CUDA_VERSION < 12080 preprocessor checks in
    postBroadcastWithCudaBackend and postAllgatherWithCudaBackend
    functions
  • Throw NVF_THROW exception with descriptive message for unsupported
    CUDA versions
  • Wrapped existing cudaMemcpyBatchAsync calls in #else blocks to ensure
    they only execute on CUDA >= 12.8
  • Prevents crashes when BatchMemcpy protocol is used on older CUDA
    versions
  • +12/-0   

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review
    Missing Test Coverage

    This PR adds CUDA version compatibility checks for cudaMemcpyBatchAsync functionality but doesn't include any tests to verify the version checking logic works correctly. Tests should be added to ensure proper error handling for CUDA versions < 12.8 and correct functionality for supported versions.

    #if CUDA_VERSION < 12080
          NVF_THROW(
              "cudaMemcpyBatchAsync backend is not supported for CUDA version < "
              "12.8");
    Limited Documentation

    The PR description is very brief and doesn't provide context about why this fix is needed or what specific issues it resolves. More detailed documentation would help reviewers understand the rationale behind the version check and the expected behavior.

    // clang-format off

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Jan 12, 2026

    Greptile Overview

    Greptile Summary

    This PR attempts to fix CUDA 12.6 build failures by adding version guards around cudaMemcpyBatchAsync API usage. However, the fix contains a critical logic error.

    The Issue

    The PR adds outer version checks (#if CUDA_VERSION < 12080) that correctly throw errors for CUDA < 12.8, but it retains the incorrect inner version checks (#if CUDA_VERSION >= 13000) that were present in the base branch. This creates a problematic scenario:

    CUDA Version Behavior:

    • < 12.8: Throws error (✅ correct - API not available)
    • 12.8 - 12.9: Attempts to use old API signature with failIdx parameter (❌ will fail - parameter removed in 12.8)
    • ≥ 13.0: Uses new API signature without failIdx (✅ correct)

    Root Cause

    According to both the PR description and error messages, the cudaMemcpyBatchAsync API signature changed in CUDA 12.8, removing the failIdx output parameter. However, the inner version checks still use >= 13000, creating a gap where CUDA 12.8 and 12.9 will try to compile code using an API signature that doesn't exist in those versions.

    Required Fix

    Change both inner version checks (lines 297 and 454) from #if CUDA_VERSION >= 13000 to #if CUDA_VERSION >= 12080 to correctly align with when the API actually changed.

    Code Locations

    • postBroadcastWithCudaBackend: Line 297 needs correction
    • postAllgatherWithCudaBackend: Line 454 needs correction

    Confidence Score: 1/5

    • This PR should NOT be merged - it contains critical logic errors that will cause compilation failures on CUDA 12.8-12.9
    • The PR only partially fixes the issue with CUDA 12.6 builds. While it correctly adds guards to prevent CUDA < 12.8 from attempting to use the API, it fails to update the inner version checks that distinguish between the old and new API signatures. CUDA 12.8 and 12.9 will fail to compile because they'll attempt to use the old API signature with the failIdx parameter that was removed in CUDA 12.8. The fix is straightforward (change >= 13000 to >= 12080 in two locations), but as written, this PR will break builds for CUDA 12.8 and 12.9.
    • csrc/multidevice/cuda_p2p.cpp requires immediate attention - both version check locations (lines 297 and 454) must be corrected

    Important Files Changed

    File Analysis

    Filename Score Overview
    csrc/multidevice/cuda_p2p.cpp 1/5 Incorrect CUDA version checks cause compilation failures for CUDA 12.8-12.9

    Sequence Diagram

    sequenceDiagram
        participant App as Application
        participant API as cudaMemcpyBatchAsync API
        participant Compiler as CUDA Compiler
        
        Note over App,Compiler: PR Changes: CUDA Version Handling
        
        alt CUDA < 12.8
            App->>API: Check CUDA_VERSION < 12080
            API-->>App: NVF_THROW error
            Note right of API: ✅ Correct: API not available
        end
        
        alt CUDA 12.8 - 12.9 (BROKEN)
            App->>Compiler: Check CUDA_VERSION >= 12080
            Compiler->>Compiler: Passes outer check
            App->>Compiler: Check CUDA_VERSION >= 13000
            Compiler->>Compiler: Fails (version < 13000)
            App->>API: Call with failIdx parameter
            API-->>Compiler: ❌ Compilation Error
            Note right of API: Parameter removed in 12.8!
        end
        
        alt CUDA >= 13.0
            App->>Compiler: Check CUDA_VERSION >= 12080
            Compiler->>Compiler: Passes outer check
            App->>Compiler: Check CUDA_VERSION >= 13000
            Compiler->>Compiler: Passes inner check
            App->>API: Call without failIdx parameter
            API-->>App: ✅ Success
            Note right of API: Correct API signature
        end
        
        Note over App,Compiler: Expected behavior (after fix):<br/>Inner check should be >= 12080<br/>to match API change version
    
    Loading

    @samnordmann
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    No files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, 2 comments

    Edit Code Review Agent Settings | Greptile

    launchMulticastKernel(
    multicast_handle->bufferMulticastPtr(), src_ptr, count, stream);
    } else if (protocol == MulticastProtocol::BatchMemcpy) {
    #if CUDA_VERSION >= 12080
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    The version check has been updated to >= 12080 (CUDA 12.8), but there's an inconsistency in the codebase: csrc/driver_api.h line 48 still declares cuMemcpyBatchAsync (driver API) with version 13000 (CUDA 13.0).

    While this PR uses cudaMemcpyBatchAsync (runtime API) and not the driver API, the inconsistent version numbers could cause confusion. Consider either:

    1. Updating driver_api.h to use version 12080 for cuMemcpyBatchAsync if the driver API also changed signature in CUDA 12.8
    2. Removing cuMemcpyBatchAsync from driver_api.h if the codebase only uses the runtime API
    3. Adding a comment in driver_api.h explaining why different versions are used if intentional

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Comment on lines 303 to 305
    NVF_THROW(
    "cudaMemcpyBatchAsync backend is not supported for CUDA version < "
    "12.8");
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    With this change, users on CUDA < 12.8 will encounter a runtime error if they use the BatchMemcpy protocol. This is the correct behavior, but there's a potential usability issue: getMulticastProtocol() in csrc/multidevice/ipc_utils.cpp returns MulticastProtocol::BatchMemcpy as the default protocol when no explicit protocol is set.

    This means users on CUDA 12.6 who don't explicitly set a protocol will hit this error at runtime rather than getting a compile-time failure or automatic fallback. Consider adding:

    1. A runtime check in getMulticastProtocol() that automatically falls back to Memcpy or Multimem if CUDA version < 12.8
    2. Or a compile-time check that changes the default protocol based on CUDA_VERSION
    3. Or better documentation warning users about CUDA version requirements for BatchMemcpy

    Example for option 1:

    MulticastProtocol getMulticastProtocol() {
      if (isOptionEnabled(EnableOption::MulticastProtocol)) {
        if (hasEnableOptionArgument(EnableOption::MulticastProtocol, "multimem")) {
          return MulticastProtocol::Multimem;
        }
        if (hasEnableOptionArgument(EnableOption::MulticastProtocol, "memcpy")) {
          return MulticastProtocol::Memcpy;
        }
      }
    #if CUDA_VERSION >= 12080
      return MulticastProtocol::BatchMemcpy;
    #else
      return MulticastProtocol::Memcpy;  // Fallback for older CUDA
    #endif
    }

    @xwang233
    Copy link
    Collaborator

    Thanks for working on it. Build error looks real. Once the cu130 builds in CI passed, I'll check builds for cu126 and 128 internally.

    @samnordmann
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, 2 comments

    Edit Code Review Agent Settings | Greptile

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Jan 13, 2026

    Additional Comments (2)

    csrc/multidevice/cuda_p2p.cpp
    The inner version check is incorrect. According to the PR description and error message, the cudaMemcpyBatchAsync API signature changed in CUDA 12.8 (removing the failIdx parameter), but this code still uses CUDA_VERSION >= 13000 to distinguish between API signatures.

    This creates a logical error:

    • CUDA 12.8-12.9: Code will attempt to call the API with the failIdx parameter (lines 308-318), but this parameter was removed in CUDA 12.8, causing compilation failures.
    • CUDA 13.0+: Code correctly uses the new API without failIdx (lines 298-306).

    The version check on line 297 should be #if CUDA_VERSION >= 12080 to match when the API actually changed. The same issue exists in the postAllgatherWithCudaBackend function at line 454.

    #if CUDA_VERSION >= 12080
    

    csrc/multidevice/cuda_p2p.cpp
    Same issue as in postBroadcastWithCudaBackend: the inner version check should be >= 12080, not >= 13000.

    The cudaMemcpyBatchAsync API signature changed in CUDA 12.8, so CUDA versions 12.8 and 12.9 would fail to compile when trying to use the old API signature with the failIdx parameter (lines 465-475) that no longer exists.

    #if CUDA_VERSION >= 12080
    

    @samnordmann
    Copy link
    Collaborator Author

    Thanks for working on it. Build error looks real. Once the cu130 builds in CI passed, I'll check builds for cu126 and 128 internally.

    Yes sorry about that. Now it is fixed. I think it should be working, as far I understand, but please confirm. Somehow I cannot find the official documentation about the API introduction (12.6?) and the API change (12.8?)

    Copy link
    Collaborator

    @xwang233 xwang233 left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    lgtm, thanks for the fix!

    @xwang233 xwang233 merged commit 14026f0 into main Jan 14, 2026
    62 checks passed
    @xwang233 xwang233 deleted the fix_cuda128_BatchMemcpy branch January 14, 2026 19:43
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants