Fix 12.6 build with `cudaMemcpyBatchAsync` #5799

samnordmann · 2026-01-12T15:37:47Z

Fix 12.6 as described in #5734
@xwang233 is the fix ok ?

samnordmann · 2026-01-12T15:38:22Z

!test

github-actions · 2026-01-12T15:39:05Z

Review updated until commit c04a3b8

Description

Add CUDA version check for cudaMemcpyBatchAsync support (requires CUDA >= 12.8)
Throw clear exception when BatchMemcpy protocol used on unsupported CUDA versions
Wrap existing cudaMemcpyBatchAsync calls in version guard to prevent crashes
Apply fix to both postBroadcastWithCudaBackend and postAllgatherWithCudaBackend functions

Changes walkthrough

Relevant files

Bug fix

cuda_p2p.cpp `Add CUDA version guards for cudaMemcpyBatchAsync` csrc/multidevice/cuda_p2p.cpp Added CUDA_VERSION < 12080 preprocessor checks in postBroadcastWithCudaBackend and postAllgatherWithCudaBackend functions Throw NVF_THROW exception with descriptive message for unsupported CUDA versions Wrapped existing cudaMemcpyBatchAsync calls in #else blocks to ensure they only execute on CUDA >= 12.8 Prevents crashes when BatchMemcpy protocol is used on older CUDA versions	+12/-0

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 No relevant tests
⚡ Recommended focus areas for review
Missing Test Coverage This PR adds CUDA version compatibility checks for cudaMemcpyBatchAsync functionality but doesn't include any tests to verify the version checking logic works correctly. Tests should be added to ensure proper error handling for CUDA versions < 12.8 and correct functionality for supported versions. #if CUDA_VERSION < 12080 NVF_THROW( "cudaMemcpyBatchAsync backend is not supported for CUDA version < " "12.8"); Limited Documentation The PR description is very brief and doesn't provide context about why this fix is needed or what specific issues it resolves. More detailed documentation would help reviewers understand the rationale behind the version check and the expected behavior. // clang-format off

greptile-apps · 2026-01-12T15:40:11Z

Greptile Overview

Greptile Summary

This PR attempts to fix CUDA 12.6 build failures by adding version guards around cudaMemcpyBatchAsync API usage. However, the fix contains a critical logic error.

The Issue

The PR adds outer version checks (#if CUDA_VERSION < 12080) that correctly throw errors for CUDA < 12.8, but it retains the incorrect inner version checks (#if CUDA_VERSION >= 13000) that were present in the base branch. This creates a problematic scenario:

CUDA Version Behavior:

< 12.8: Throws error (✅ correct - API not available)
12.8 - 12.9: Attempts to use old API signature with failIdx parameter (❌ will fail - parameter removed in 12.8)
≥ 13.0: Uses new API signature without failIdx (✅ correct)

Root Cause

According to both the PR description and error messages, the cudaMemcpyBatchAsync API signature changed in CUDA 12.8, removing the failIdx output parameter. However, the inner version checks still use >= 13000, creating a gap where CUDA 12.8 and 12.9 will try to compile code using an API signature that doesn't exist in those versions.

Required Fix

Change both inner version checks (lines 297 and 454) from #if CUDA_VERSION >= 13000 to #if CUDA_VERSION >= 12080 to correctly align with when the API actually changed.

Code Locations

postBroadcastWithCudaBackend: Line 297 needs correction
postAllgatherWithCudaBackend: Line 454 needs correction

Confidence Score: 1/5

This PR should NOT be merged - it contains critical logic errors that will cause compilation failures on CUDA 12.8-12.9
The PR only partially fixes the issue with CUDA 12.6 builds. While it correctly adds guards to prevent CUDA < 12.8 from attempting to use the API, it fails to update the inner version checks that distinguish between the old and new API signatures. CUDA 12.8 and 12.9 will fail to compile because they'll attempt to use the old API signature with the failIdx parameter that was removed in CUDA 12.8. The fix is straightforward (change >= 13000 to >= 12080 in two locations), but as written, this PR will break builds for CUDA 12.8 and 12.9.
csrc/multidevice/cuda_p2p.cpp requires immediate attention - both version check locations (lines 297 and 454) must be corrected

Important Files Changed

File Analysis

Filename	Score	Overview
csrc/multidevice/cuda_p2p.cpp	1/5	Incorrect CUDA version checks cause compilation failures for CUDA 12.8-12.9

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant API as cudaMemcpyBatchAsync API
    participant Compiler as CUDA Compiler
    
    Note over App,Compiler: PR Changes: CUDA Version Handling
    
    alt CUDA < 12.8
        App->>API: Check CUDA_VERSION < 12080
        API-->>App: NVF_THROW error
        Note right of API: ✅ Correct: API not available
    end
    
    alt CUDA 12.8 - 12.9 (BROKEN)
        App->>Compiler: Check CUDA_VERSION >= 12080
        Compiler->>Compiler: Passes outer check
        App->>Compiler: Check CUDA_VERSION >= 13000
        Compiler->>Compiler: Fails (version < 13000)
        App->>API: Call with failIdx parameter
        API-->>Compiler: ❌ Compilation Error
        Note right of API: Parameter removed in 12.8!
    end
    
    alt CUDA >= 13.0
        App->>Compiler: Check CUDA_VERSION >= 12080
        Compiler->>Compiler: Passes outer check
        App->>Compiler: Check CUDA_VERSION >= 13000
        Compiler->>Compiler: Passes inner check
        App->>API: Call without failIdx parameter
        API-->>App: ✅ Success
        Note right of API: Correct API signature
    end
    
    Note over App,Compiler: Expected behavior (after fix):<br/>Inner check should be >= 12080<br/>to match API change version

samnordmann · 2026-01-12T15:40:14Z

!test

greptile-apps

_{No files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps

_{2 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-01-12T15:44:58Z

csrc/multidevice/cuda_p2p.cpp

      launchMulticastKernel(
          multicast_handle->bufferMulticastPtr(), src_ptr, count, stream);
    } else if (protocol == MulticastProtocol::BatchMemcpy) {
+#if CUDA_VERSION >= 12080


The version check has been updated to >= 12080 (CUDA 12.8), but there's an inconsistency in the codebase: csrc/driver_api.h line 48 still declares cuMemcpyBatchAsync (driver API) with version 13000 (CUDA 13.0).

While this PR uses cudaMemcpyBatchAsync (runtime API) and not the driver API, the inconsistent version numbers could cause confusion. Consider either:

Updating driver_api.h to use version 12080 for cuMemcpyBatchAsync if the driver API also changed signature in CUDA 12.8

Removing cuMemcpyBatchAsync from driver_api.h if the codebase only uses the runtime API

Adding a comment in driver_api.h explaining why different versions are used if intentional

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

greptile-apps · 2026-01-12T15:44:59Z

csrc/multidevice/cuda_p2p.cpp

+      NVF_THROW(
+          "cudaMemcpyBatchAsync backend is not supported for CUDA version < "
+          "12.8");


With this change, users on CUDA < 12.8 will encounter a runtime error if they use the BatchMemcpy protocol. This is the correct behavior, but there's a potential usability issue: getMulticastProtocol() in csrc/multidevice/ipc_utils.cpp returns MulticastProtocol::BatchMemcpy as the default protocol when no explicit protocol is set.

This means users on CUDA 12.6 who don't explicitly set a protocol will hit this error at runtime rather than getting a compile-time failure or automatic fallback. Consider adding:

A runtime check in getMulticastProtocol() that automatically falls back to Memcpy or Multimem if CUDA version < 12.8

Or a compile-time check that changes the default protocol based on CUDA_VERSION

Or better documentation warning users about CUDA version requirements for BatchMemcpy

Example for option 1:

MulticastProtocol getMulticastProtocol() { if (isOptionEnabled(EnableOption::MulticastProtocol)) { if (hasEnableOptionArgument(EnableOption::MulticastProtocol, "multimem")) { return MulticastProtocol::Multimem; } if (hasEnableOptionArgument(EnableOption::MulticastProtocol, "memcpy")) { return MulticastProtocol::Memcpy; } } #if CUDA_VERSION >= 12080 return MulticastProtocol::BatchMemcpy; #else return MulticastProtocol::Memcpy; // Fallback for older CUDA #endif }

xwang233 · 2026-01-13T00:29:16Z

Thanks for working on it. Build error looks real. Once the cu130 builds in CI passed, I'll check builds for cu126 and 128 internally.

samnordmann · 2026-01-13T08:49:37Z

!test

greptile-apps

_{2 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-01-13T08:57:52Z

Additional Comments (2)

csrc/multidevice/cuda_p2p.cpp
The inner version check is incorrect. According to the PR description and error message, the cudaMemcpyBatchAsync API signature changed in CUDA 12.8 (removing the failIdx parameter), but this code still uses CUDA_VERSION >= 13000 to distinguish between API signatures.

This creates a logical error:

CUDA 12.8-12.9: Code will attempt to call the API with the failIdx parameter (lines 308-318), but this parameter was removed in CUDA 12.8, causing compilation failures.
CUDA 13.0+: Code correctly uses the new API without failIdx (lines 298-306).

The version check on line 297 should be #if CUDA_VERSION >= 12080 to match when the API actually changed. The same issue exists in the postAllgatherWithCudaBackend function at line 454.

#if CUDA_VERSION >= 12080

csrc/multidevice/cuda_p2p.cpp
Same issue as in postBroadcastWithCudaBackend: the inner version check should be >= 12080, not >= 13000.

The cudaMemcpyBatchAsync API signature changed in CUDA 12.8, so CUDA versions 12.8 and 12.9 would fail to compile when trying to use the old API signature with the failIdx parameter (lines 465-475) that no longer exists.

#if CUDA_VERSION >= 12080

samnordmann · 2026-01-13T13:26:51Z

Thanks for working on it. Build error looks real. Once the cu130 builds in CI passed, I'll check builds for cu126 and 128 internally.

Yes sorry about that. Now it is fixed. I think it should be working, as far I understand, but please confirm. Somehow I cannot find the official documentation about the API introduction (12.6?) and the API change (12.8?)

xwang233

lgtm, thanks for the fix!

Throw instead of calling cudaMemcpyBatchAsyn for Cuda<12.8

f2e9b53

samnordmann requested a review from xwang233 January 12, 2026 15:38

lint

76ce6d4

greptile-apps bot reviewed Jan 12, 2026

View reviewed changes

account for API change

c04a3b8

greptile-apps bot reviewed Jan 13, 2026

View reviewed changes

xwang233 approved these changes Jan 14, 2026

View reviewed changes

xwang233 merged commit 14026f0 into main Jan 14, 2026
62 checks passed

xwang233 deleted the fix_cuda128_BatchMemcpy branch January 14, 2026 19:43

Fix 12.6 build with cudaMemcpyBatchAsync #5799

Fix 12.6 build with cudaMemcpyBatchAsync #5799

Uh oh!

Conversation

samnordmann commented Jan 12, 2026

Uh oh!

samnordmann commented Jan 12, 2026

Uh oh!

github-actions bot commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes walkthrough

PR Reviewer Guide

Uh oh!

greptile-apps bot commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Overview

Greptile Summary

The Issue

Root Cause

Required Fix

Code Locations

Confidence Score: 1/5

Important Files Changed

Sequence Diagram

Uh oh!

samnordmann commented Jan 12, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

xwang233 commented Jan 13, 2026

Uh oh!

samnordmann commented Jan 13, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Jan 13, 2026

Uh oh!

samnordmann commented Jan 13, 2026

Uh oh!

xwang233 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix 12.6 build with `cudaMemcpyBatchAsync` #5799

Fix 12.6 build with `cudaMemcpyBatchAsync` #5799

github-actions bot commented Jan 12, 2026 •

edited

Loading

greptile-apps bot commented Jan 12, 2026 •

edited

Loading