Fix native library loader silently failing when CUDA DLLs are missing #1555

Open
alinpahontu2912 wants to merge 5 commits into dotnet:main from alinpahontu2912:fix/cuda-dll-silent-load-failure

Conversation

@alinpahontu2912
Member

Change all 'ok = TryLoadNativeLibraryByName(...)' calls to 'ok &= ...' so that failures accumulate instead of being overwritten by subsequent successful loads. Initialize 'ok = true' before the loading chain.

Previously, each load call overwrote the result of the previous one, so if an early CUDA dependency (e.g. cudnn_adv64_9) failed to load but LibTorchSharp succeeded, 'ok' would be true. This caused:

  • nativeBackendCudaLoaded set to true despite missing dependencies
  • The fallback loading path was skipped
  • The diagnostic trace (StringBuilder) was discarded
  • Subsequent load attempts were skipped entirely
  • CUDA operations failed later with cryptic errors

Now any single load failure keeps 'ok' as false, ensuring the fallback path is attempted and the full diagnostic trace is preserved in error messages.
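
The difference between the two assignment forms can be shown with a minimal sketch. `TryLoad` below is a hypothetical stand-in for `TryLoadNativeLibraryByName` (its real signature is not shown in this PR), and the initial value of `ok` in the old code is assumed to be `false`:

```csharp
using System;
using System.Collections.Generic;

public static class LoadFlagSketch
{
    // Hypothetical stand-in for TryLoadNativeLibraryByName:
    // a "load" succeeds only if the name is in the available set.
    static bool TryLoad(string name, ISet<string> available) => available.Contains(name);

    // Old behavior: every call overwrites the flag, so only the last load counts.
    public static bool Overwrite(IEnumerable<string> names, ISet<string> available)
    {
        bool ok = false; // assumed initial value
        foreach (var name in names)
            ok = TryLoad(name, available);
        return ok;
    }

    // Behavior described above: start from true and AND each result in,
    // so a single failure keeps the flag false.
    public static bool Accumulate(IEnumerable<string> names, ISet<string> available)
    {
        bool ok = true;
        foreach (var name in names)
            ok &= TryLoad(name, available);
        return ok;
    }
}
```

With `cudnn_adv64_9` missing and `LibTorchSharp` present, `Overwrite` reports success while `Accumulate` reports failure.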

Fixes #1545

alinpahontu2912 and others added 2 commits March 20, 2026 13:05
Change all 'ok = TryLoadNativeLibraryByName(...)' calls to 'ok &= ...' so that
failures accumulate instead of being overwritten by subsequent successful loads.
Initialize 'ok = true' before the loading chain.

Previously, each load call overwrote the result of the previous one, so if an
early CUDA dependency (e.g. cudnn_adv64_9) failed to load but LibTorchSharp
succeeded, 'ok' would be true. This caused:
- nativeBackendCudaLoaded set to true despite missing dependencies
- The fallback loading path was skipped
- The diagnostic trace (StringBuilder) was discarded
- Subsequent load attempts were skipped entirely
- CUDA operations failed later with cryptic errors

Now any single load failure keeps 'ok' as false, ensuring the fallback path is
attempted and the full diagnostic trace is preserved in error messages.

Fixes dotnet#1545

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a correctness issue in TorchSharp’s native library loader where a failed CUDA dependency load could be overwritten by later successful loads, causing the loader to incorrectly treat the CUDA backend as initialized and skip the fallback/diagnostics path.

Changes:

  • Initialize ok to true before the native-load attempt chain.
  • Accumulate load results using ok &= TryLoadNativeLibraryByName(...) so any single failure preserves ok == false while still attempting all loads (capturing full diagnostics).
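
The "still attempting all loads" part relies on `&=` on `bool` being non-short-circuiting in C#: the right-hand side is always evaluated, unlike `ok = ok && ...`. A small sketch, where `TryLoad` and its scripted outcomes are hypothetical and only illustrate how the trace differs:

```csharp
using System.Text;

public static class TraceSketch
{
    // Hypothetical loader: records the attempt in the trace,
    // then returns the scripted outcome.
    static bool TryLoad(string name, bool succeeds, StringBuilder trace)
    {
        trace.AppendLine(name + ": " + (succeeds ? "loaded" : "FAILED"));
        return succeeds;
    }

    // `&=` always evaluates its right-hand side, so every load is attempted
    // and traced even after an earlier failure.
    public static (bool Ok, string Trace) LoadAll()
    {
        var trace = new StringBuilder();
        bool ok = true;
        ok &= TryLoad("cudnn_adv64_9", false, trace);
        ok &= TryLoad("LibTorchSharp", true, trace);
        return (ok, trace.ToString());
    }

    // `&&` short-circuits: after the first failure the second load is never
    // attempted, so it is missing from the trace.
    public static (bool Ok, string Trace) LoadShortCircuit()
    {
        var trace = new StringBuilder();
        bool ok = true;
        ok = ok && TryLoad("cudnn_adv64_9", false, trace);
        ok = ok && TryLoad("LibTorchSharp", true, trace);
        return (ok, trace.ToString());
    }
}
```

Both variants return `false`, but only `LoadAll` leaves a trace entry for `LibTorchSharp`.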

alinpahontu2912 and others added 3 commits March 20, 2026 16:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The previous fix changed simple assignments (ok =) to compound AND
(ok &=) for all native library loads. This broke CPU-only systems
because torch_cuda failing would permanently set ok=false, even when
LibTorchSharp loaded successfully, triggering a NotSupportedException.

Changes:
- CUDA preloads (cudnn, nvrtc, etc.): fire-and-forget, failures are
  non-fatal and only logged via trace
- torch_cuda / torch_cpu: fire-and-forget warmup loads for dependency
  resolution; their result doesn't gate success
- LibTorchSharp: the only critical load that determines ok
- TryInitializeDeviceType: catch NotSupportedException to prevent
  static constructor crashes
- Static constructor: fall back to CPU backend when CUDA is unavailable

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
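
The revised policy in the commit message above can be sketched roughly as follows. This is not the code from the PR: `LoadChain`, `TryInitialize`, the `tryLoad` delegate, and the trace wording are all hypothetical names chosen for illustration, and the list of optional libraries is supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BackendLoadSketch
{
    // tryLoad is a hypothetical stand-in for TryLoadNativeLibraryByName.
    // Optional libraries (CUDA preloads, torch_cuda / torch_cpu warmups) are
    // attempted for dependency resolution; failures are only traced.
    // LibTorchSharp is the single load that decides the result.
    public static bool LoadChain(
        Func<string, bool> tryLoad,
        IEnumerable<string> optionalLibraries,
        StringBuilder trace)
    {
        foreach (var name in optionalLibraries)
        {
            if (!tryLoad(name))
                trace.AppendLine("optional load failed (non-fatal): " + name);
        }

        bool ok = tryLoad("LibTorchSharp");
        if (!ok)
            trace.AppendLine("critical load failed: LibTorchSharp");
        return ok;
    }

    // Mirrors the idea of catching NotSupportedException during device
    // initialization so a caller (e.g. a static constructor) can fall back
    // to the CPU backend instead of crashing.
    public static bool TryInitialize(Action initialize, StringBuilder trace)
    {
        try
        {
            initialize();
            return true;
        }
        catch (NotSupportedException ex)
        {
            trace.AppendLine("initialization not supported: " + ex.Message);
            return false;
        }
    }
}
```

On a CPU-only machine, where `torch_cuda` and the cuDNN preloads fail but `LibTorchSharp` loads, `LoadChain` still returns `true` and the failures remain visible in the trace.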

Development

Successfully merging this pull request may close these issues.

Silent Native Library Failure in torch::loadNativeBackEnd

2 participants