refactor(cuda_core): defer device checks in LaunchConfig to launch time #2066

Open

KRRT7 wants to merge 5 commits into NVIDIA:main from KRRT7:fix/launch-config-defer-device-lookup-to-stream

Conversation


@KRRT7 KRRT7 commented May 12, 2026

Summary

  • LaunchConfig.__init__ previously called Device() to validate compute capability (cluster launches) and cooperative_launch support, but the stream — and therefore the correct device — is not known at construction time
  • Moves both checks into _launcher.pyx where stream.device is available: _check_cluster_launch (CC < 9.0 guard) and an updated _check_cooperative_launch (adds cooperative_launch support guard before the existing grid-size check)
  • LaunchConfig.__init__ is now a pure data class with zero driver calls; cluster and cooperative configs can be constructed without a CUDA context
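
A rough Python sketch of the two relocated checks described above (the real implementation is Cython in _launcher.pyx; the helper signatures and attribute names below are assumptions for illustration):

```python
# Plain-Python sketch of the relocated launch-time checks. The actual code
# is Cython in _launcher.pyx; attribute names are assumptions, and
# RuntimeError stands in for cuda.core's CUDAError.
from math import prod

CUDAError = RuntimeError  # stand-in for cuda.core's exception type

def _check_cluster_launch(config, stream):
    # Deferred from LaunchConfig.__init__: only at launch() time is the
    # stream, and therefore the target device, actually known.
    if config.cluster and stream.device.compute_capability < (9, 0):
        raise CUDAError("thread block clusters require compute capability 9.0+")

def _check_cooperative_launch(config, stream, kernel):
    if not config.cooperative_launch:
        return
    dev = stream.device
    # New guard: fail early if the device lacks cooperative-launch support.
    if not dev.properties.cooperative_launch:
        raise CUDAError("device does not support cooperative launches")
    # Pre-existing grid-size check: every block of a cooperative grid must
    # be co-resident on the device.
    max_blocks = kernel.occupancy.max_active_blocks_per_multiprocessor(
        prod(config.block), 0)  # block size, dynamic shared memory in bytes
    if prod(config.grid) > max_blocks * dev.properties.multiprocessor_count:
        raise ValueError("the grid exceeds the cooperative launch limit")
```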

Test plan

  • Removed try/except CUDAError skip guards from test_launch_config_cluster_grid_conversion — constructing LaunchConfig(cluster=...) no longer raises on sub-CC-9.0 devices, so these tests now run on all hardware
  • Validated on L4 (CC 8.9): 4 passed, 2 skipped (cluster launch tests still skip correctly at launch() time on CC < 9.0), full suite 3225 passed
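
A hypothetical shape of the updated tests (fixture names and assertions are invented; only the structure follows the PR):

```python
# Sketch only: the kernel/stream fixtures and exact assertions here are
# hypothetical, not the PR's actual test code.
import pytest
from cuda.core.experimental import Device, LaunchConfig, launch

def test_launch_config_cluster_grid_conversion():
    # No try/except CUDAError guard: construction makes no driver calls,
    # so this now runs on every device, including CC < 9.0.
    config = LaunchConfig(grid=(4, 1, 1), block=(32, 1, 1), cluster=(2, 1, 1))
    assert config.cluster == (2, 1, 1)

def test_cluster_launch(kernel, stream):  # hypothetical fixtures
    # The capability check now fires at launch() time, so the skip moves there.
    if Device().compute_capability < (9, 0):
        pytest.skip("thread block clusters require CC 9.0+")
    config = LaunchConfig(grid=(4, 1, 1), block=(32, 1, 1), cluster=(2, 1, 1))
    launch(stream, config, kernel)
```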

KRRT7 and others added 2 commits May 10, 2026 21:59
Avoid calling Device() twice (once for cluster validation, once for
cooperative check). Now called at most once, and zero times for the
common simple-launch path where neither cluster nor is_cooperative
is set.

Co-Authored-By: Claude <noreply@anthropic.com>
LaunchConfig.__init__ previously called Device() to validate compute
capability (for cluster launches) and cooperative_launch support, but at
construction time the stream — and therefore the correct device — is not
yet known.

Move both checks into _launcher.pyx where the stream is available:
- _check_cluster_launch: queries stream.device.compute_capability and
  raises if CC < 9.0 (thread block clusters require H100+)
- _check_cooperative_launch: now also guards cooperative_launch support
  via stream.device before the grid-size check

LaunchConfig.__init__ is now a pure data class with no driver calls.
Cluster and cooperative config objects can be constructed without a
CUDA context, and errors surface at launch() time with the correct
device in scope.

Remove the try/except CUDAError skip guards from cluster-related tests;
constructing LaunchConfig(cluster=...) no longer raises on sub-CC-9.0
devices, so those tests run on all hardware.
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core (Everything related to the cuda.core module) label May 12, 2026
mdboom and others added 3 commits May 12, 2026 08:32
* Fix tab completion

* Fix tests

* Always install the monkeypatch

* Update release note

* Apply suggestion from @leofang

Co-authored-by: Leo Fang <leof@nvidia.com>

* Fix test

* Fix tests hanging on Windows

---------

Co-authored-by: Leo Fang <leof@nvidia.com>
@mdboom
Contributor

mdboom commented May 12, 2026

Can you provide (or link to) more context about why this change is desirable?

@KRRT7
Author

KRRT7 commented May 12, 2026

There's a FIXME in the original code that motivated this — Device() at construction time doesn't know which device the kernel will actually run on. It just grabs whatever's current, which is wrong if you're targeting a non-default device or haven't called set_current() yet.

launch() already has the stream, and stream.device is already used right there in _check_cooperative_launch for the grid-size check. So this just moves the CC and cooperative-support checks to follow the same pattern — check against the device the kernel will actually run on.
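
A hypothetical two-device scenario that shows the problem:

```python
# Hypothetical example; the device indices and the kernel are illustrative.
from cuda.core.experimental import Device, LaunchConfig, launch

# Before this PR, the next line called Device(), i.e. whatever device
# happens to be current (often device 0), to validate the cluster CC,
# even though no launch target has been chosen yet.
config = LaunchConfig(grid=(16, 1, 1), block=(128, 1, 1), cluster=(2, 1, 1))

dev = Device(1)               # the device the kernel will actually run on
dev.set_current()
stream = dev.create_stream()

# With this PR, the CC >= 9.0 check runs inside launch(), against
# stream.device, i.e. the device the kernel really targets:
# launch(stream, config, kernel, *args)
```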

Member

@leofang leofang left a comment

Thanks for your contribution, @KRRT7! However, there is a compromise we need to make. launch is on the critical path so we would need to move checks to either beforehand or rely on the driver as the final arbitrator (#685). Let me think about this and get back to you later.

@KRRT7
Author

KRRT7 commented May 12, 2026

Makes sense. We could just drop the checks entirely and let the driver handle it (per #685).

@KRRT7
Author

KRRT7 commented May 12, 2026

A few more things on the hot path worth looking at:

  • _check_cooperative_launch makes an uncached driver round-trip on every cooperative launch: kernel.occupancy.max_active_blocks_per_multiprocessor calls cuOccupancyMaxActiveBlocksPerMultiprocessor with no caching. (stream.device and dev.properties.multiprocessor_count are cached after first use, via a TLS singleton lookup and a _get_cached_attribute dict lookup respectively, so they aren't driver round-trips on the hot path, but the attribute accesses and prod(config.block) still add Python-level overhead.)
  • _to_native_launch_config rebuilds the CUlaunchConfig struct on every launch, even when nothing has changed. check_or_create_options and Stream_accept also add per-call overhead when the user is simply passing a LaunchConfig and a Stream directly.

For reference: cuPy (raw.pyx#L118) calls getDevice() for a cache lookup then goes straight to the driver. Numba's per-launch work (dispatcher.py#L317) is just per-argument type marshalling — no config struct rebuilding.
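
As one hypothetical direction, the struct rebuild could be memoized on the (effectively immutable) LaunchConfig; a sketch only, not part of this PR, with every name other than _to_native_launch_config invented:

```python
# Hypothetical memoization sketch. _to_native_launch_config is the existing
# helper; the cache attribute and wrapper are invented for illustration.
_CACHE_ATTR = "_cached_native_config"

def _to_native_launch_config_cached(config, stream):
    native = getattr(config, _CACHE_ATTR, None)
    if native is None:
        native = _to_native_launch_config(config)  # existing uncached path
        object.__setattr__(config, _CACHE_ATTR, native)
    # CUlaunchConfig embeds the stream handle, so that one field must still
    # be refreshed per launch even when the rest of the struct is reused.
    native.hStream = stream.handle
    return native
```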
