refactor(cuda_core): defer device checks in LaunchConfig to launch time #2066

Open

KRRT7 wants to merge 5 commits into NVIDIA:main from KRRT7:fix/launch-config-defer-device-lookup-to-stream

Conversation


@KRRT7 KRRT7 commented May 12, 2026

Summary

  • LaunchConfig.__init__ previously called Device() to validate compute capability (cluster launches) and cooperative_launch support, but the stream — and therefore the correct device — is not known at construction time
  • Moves both checks into _launcher.pyx where stream.device is available: _check_cluster_launch (CC < 9.0 guard) and an updated _check_cooperative_launch (adds cooperative_launch support guard before the existing grid-size check)
  • LaunchConfig.__init__ is now a pure data class with zero driver calls; cluster and cooperative configs can be constructed without a CUDA context
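
A rough Python sketch of the two relocated checks described above (the real implementation is Cython in _launcher.pyx; the helper signatures and attribute names below are assumptions for illustration):

```python
# Plain-Python sketch of the relocated launch-time checks. The actual code
# is Cython in _launcher.pyx; attribute names are assumptions, and
# RuntimeError stands in for cuda.core's CUDAError.
from math import prod

CUDAError = RuntimeError  # stand-in for cuda.core's exception type

def _check_cluster_launch(config, stream):
    # Deferred from LaunchConfig.__init__: only at launch() time is the
    # stream, and therefore the target device, actually known.
    if config.cluster and stream.device.compute_capability < (9, 0):
        raise CUDAError("thread block clusters require compute capability 9.0+")

def _check_cooperative_launch(config, stream, kernel):
    if not config.cooperative_launch:
        return
    dev = stream.device
    # New guard: fail early if the device lacks cooperative-launch support.
    if not dev.properties.cooperative_launch:
        raise CUDAError("device does not support cooperative launches")
    # Pre-existing grid-size check: every block of a cooperative grid must
    # be co-resident on the device.
    max_blocks = kernel.occupancy.max_active_blocks_per_multiprocessor(
        prod(config.block), 0)  # block size, dynamic shared memory in bytes
    if prod(config.grid) > max_blocks * dev.properties.multiprocessor_count:
        raise ValueError("the grid exceeds the cooperative launch limit")
```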

Test plan

  • Removed try/except CUDAError skip guards from test_launch_config_cluster_grid_conversion — constructing LaunchConfig(cluster=...) no longer raises on sub-CC-9.0 devices, so these tests now run on all hardware
  • Validated on L4 (CC 8.9): 4 passed, 2 skipped (cluster launch tests still skip correctly at launch() time on CC < 9.0), full suite 3225 passed
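
A hypothetical shape of the updated tests (fixture names and assertions are invented; only the structure follows the PR):

```python
# Sketch only: the kernel/stream fixtures and exact assertions here are
# hypothetical, not the PR's actual test code.
import pytest
from cuda.core.experimental import Device, LaunchConfig, launch

def test_launch_config_cluster_grid_conversion():
    # No try/except CUDAError guard: construction makes no driver calls,
    # so this now runs on every device, including CC < 9.0.
    config = LaunchConfig(grid=(4, 1, 1), block=(32, 1, 1), cluster=(2, 1, 1))
    assert config.cluster == (2, 1, 1)

def test_cluster_launch(kernel, stream):  # hypothetical fixtures
    # The capability check now fires at launch() time, so the skip moves there.
    if Device().compute_capability < (9, 0):
        pytest.skip("thread block clusters require CC 9.0+")
    config = LaunchConfig(grid=(4, 1, 1), block=(32, 1, 1), cluster=(2, 1, 1))
    launch(stream, config, kernel)
```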

KRRT7 and others added 2 commits May 10, 2026 21:59
Avoid calling Device() twice (once for cluster validation, once for
cooperative check). Now called at most once, and zero times for the
common simple-launch path where neither cluster nor is_cooperative
is set.

Co-Authored-By: Claude <noreply@anthropic.com>
LaunchConfig.__init__ previously called Device() to validate compute
capability (for cluster launches) and cooperative_launch support, but at
construction time the stream — and therefore the correct device — is not
yet known.

Move both checks into _launcher.pyx where the stream is available:
- _check_cluster_launch: queries stream.device.compute_capability and
  raises if CC < 9.0 (thread block clusters require H100+)
- _check_cooperative_launch: now also guards cooperative_launch support
  via stream.device before the grid-size check

LaunchConfig.__init__ is now a pure data class with no driver calls.
Cluster and cooperative config objects can be constructed without a
CUDA context, and errors surface at launch() time with the correct
device in scope.

Remove the try/except CUDAError skip guards from cluster-related tests;
constructing LaunchConfig(cluster=...) no longer raises on sub-CC-9.0
devices, so those tests run on all hardware.
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core (Everything related to the cuda.core module) label May 12, 2026
mdboom and others added 3 commits May 12, 2026 08:32
* Fix tab completion

* Fix tests

* Always install the monkeypatch

* Update release note

* Apply suggestion from @leofang

Co-authored-by: Leo Fang <leof@nvidia.com>

* Fix test

* Fix tests hanging on Windows

---------

Co-authored-by: Leo Fang <leof@nvidia.com>
@mdboom
Contributor

mdboom commented May 12, 2026

Can you provide (or link to) more context about why this change is desirable?

@KRRT7
Author

KRRT7 commented May 12, 2026

There's a FIXME in the original code that motivated this — Device() at construction time doesn't know which device the kernel will actually run on. It just grabs whatever's current, which is wrong if you're targeting a non-default device or haven't called set_current() yet.

launch() already has the stream, and stream.device is already used right there in _check_cooperative_launch for the grid-size check. So this just moves the CC and cooperative-support checks to follow the same pattern — check against the device the kernel will actually run on.
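
A hypothetical two-device scenario that shows the problem:

```python
# Hypothetical example; the device indices and the kernel are illustrative.
from cuda.core.experimental import Device, LaunchConfig, launch

# Before this PR, the next line called Device(), i.e. whatever device
# happens to be current (often device 0), to validate the cluster CC,
# even though no launch target has been chosen yet.
config = LaunchConfig(grid=(16, 1, 1), block=(128, 1, 1), cluster=(2, 1, 1))

dev = Device(1)               # the device the kernel will actually run on
dev.set_current()
stream = dev.create_stream()

# With this PR, the CC >= 9.0 check runs inside launch(), against
# stream.device, i.e. the device the kernel really targets:
# launch(stream, config, kernel, *args)
```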

Member

@leofang leofang left a comment

Thanks for your contribution, @KRRT7! However, there is a compromise we need to make. launch is on the critical path so we would need to move checks to either beforehand or rely on the driver as the final arbitrator (#685). Let me think about this and get back to you later.

@KRRT7
Author

KRRT7 commented May 12, 2026

Makes sense. We could just drop the checks entirely and let the driver handle it (per #685).

@KRRT7
Author

KRRT7 commented May 12, 2026

A few more things on the hot path worth looking at:

  • _check_cooperative_launch makes an uncached driver round-trip on every cooperative launch: kernel.occupancy.max_active_blocks_per_multiprocessor calls cuOccupancyMaxActiveBlocksPerMultiprocessor with no caching. (stream.device and dev.properties.multiprocessor_count are cached after first use, via a TLS singleton lookup and a _get_cached_attribute dict lookup respectively, so they aren't driver round-trips on the hot path, but the attribute accesses and prod(config.block) still add Python-level overhead.)
  • _to_native_launch_config rebuilds the CUlaunchConfig struct on every launch, even when nothing has changed. check_or_create_options and Stream_accept also add per-call overhead when the user is simply passing a LaunchConfig and a Stream directly.

For reference: cuPy (raw.pyx#L118) calls getDevice() for a cache lookup then goes straight to the driver. Numba's per-launch work (dispatcher.py#L317) is just per-argument type marshalling — no config struct rebuilding.
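
As one hypothetical direction, the struct rebuild could be memoized on the (effectively immutable) LaunchConfig; a sketch only, not part of this PR, with every name other than _to_native_launch_config invented:

```python
# Hypothetical memoization sketch. _to_native_launch_config is the existing
# helper; the cache attribute and wrapper are invented for illustration.
_CACHE_ATTR = "_cached_native_config"

def _to_native_launch_config_cached(config, stream):
    native = getattr(config, _CACHE_ATTR, None)
    if native is None:
        native = _to_native_launch_config(config)  # existing uncached path
        object.__setattr__(config, _CACHE_ATTR, native)
    # CUlaunchConfig embeds the stream handle, so that one field must still
    # be refreshed per launch even when the rest of the struct is reused.
    native.hStream = stream.handle
    return native
```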
