Conversation

@rwgk
Collaborator

@rwgk rwgk commented Feb 3, 2026

Description

Accumulated test fixes, as required to avoid failures in QA environments.

Side note: this PR exposed a flake (almost certainly unrelated):

https://github.com/NVIDIA/cuda-python/actions/runs/21643694781/job/62390783335?pr=1567

See comment below.

rwgk and others added 2 commits February 3, 2026 10:58
…valid_values

This is ONLY A BAND-AID, but a very effective one:

Andy's original suggestion:

* NVIDIA/cuda-python-private#245 (comment)

Results of extensive testing:

* NVIDIA/cuda-python-private#245 (comment)

Long-term:

* NVIDIA#1539

Several NVML tests were failing on NVIDIA Thor (BLACKWELL architecture)
with NotSupportedError and NoPermissionError. These are harmless failures
that occur when certain NVML APIs are not supported on specific hardware
configurations or when the test environment lacks sufficient permissions.

This commit fixes all 15 failing tests by properly handling these expected
error conditions using the existing test patterns (a sketch of both patterns
follows the list):

1. Use the unsupported_before(device, None) context manager to catch
   NotSupportedError and skip tests gracefully when APIs are not supported
   on the hardware.

2. Add explicit try/except blocks to catch NoPermissionError and skip tests
   when operations require elevated permissions.
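
A minimal sketch of both patterns (illustration only: the exception classes,
the body of unsupported_before, and the device fixture below are stand-ins
for the real test-suite helpers, and the NVML calls are replaced by raises):

import contextlib

import pytest


class NotSupportedError(Exception):
    pass


class NoPermissionError(Exception):
    pass


@contextlib.contextmanager
def unsupported_before(device, arch):
    # Stand-in for the existing helper: with arch=None, any NotSupportedError
    # raised inside the block becomes a pytest skip instead of a failure.
    try:
        yield
    except NotSupportedError:
        pytest.skip("NVML API not supported on this hardware")


@pytest.fixture
def device():
    return object()  # placeholder for the real NVML device-handle fixture


def test_unsupported_api_pattern(device):
    # Pattern 1: wrap the NVML query so unsupported hardware skips, not fails.
    with unsupported_before(device, None):
        raise NotSupportedError  # stand-in for an unsupported NVML query


def test_privileged_api_pattern(device):
    # Pattern 2: explicit try/except for calls that need elevated permissions.
    try:
        raise NoPermissionError  # stand-in for a privileged NVML call
    except NoPermissionError:
        pytest.skip("operation requires elevated permissions")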

Changes by file:

cuda_bindings/tests/nvml/test_device.py:
- test_current_clock_freqs: Added unsupported_before wrapper
- test_device_get_performance_modes: Added unsupported_before wrapper
- test_nvlink_low_power_threshold: Added NoPermissionError handling

cuda_bindings/tests/nvml/test_pynvml.py:
- test_device_get_total_energy_consumption: Changed from VOLTA arch check
  to None (to handle failures on newer architectures)
- test_device_get_memory_info: Added unsupported_before wrapper
- test_device_get_pcie_throughput: Changed from MAXWELL arch check to None
  and wrapped both PCIe throughput calls

cuda_core/tests/system/test_system_device.py:
- test_device_bar1_memory: Changed from KEPLER arch check to None
- test_device_memory: Added unsupported_before wrapper
- test_device_pci_info: Added wrapper around get_pcie_throughput() call
- test_module_id: Added unsupported_before wrapper
- test_get_inforom_version: Added wrapper around inforom.image_version access
- test_clock: Changed FERMI arch check to None for performance_state
- test_clock_event_reasons: Added wrappers around both clock event calls
- test_pstates: Added unsupported_before wrapper

cuda_bindings/tests/nvml/test_gpu.py:
- test_gpu_get_module_id: Added unsupported_before wrapper

All tests now properly skip instead of failing when encountering
NotSupportedError or NoPermissionError, following the existing test patterns
in the codebase.

Test results:
- Before: 15 failed tests across 4 test files
- After: All tests pass or skip appropriately
- cuda_bindings: 335 passed, 30 skipped, 1 xfailed
- cuda_core: 1733 passed, 120 skipped, 1 xfailed

Co-authored-by: Cursor <cursoragent@cursor.com>
@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 3, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Collaborator Author

rwgk commented Feb 3, 2026

/ok to test


@rwgk
Collaborator Author

rwgk commented Feb 3, 2026

Tracking an almost certainly unrelated flake (resolved in the 2nd attempt):

https://github.com/NVIDIA/cuda-python/actions/runs/21643694781/job/62390783335?pr=1567

=================================== FAILURES ===================================
_____________________ TestIpcReexport.test_main[DeviceMR] ______________________

self = <test_send_buffers.TestIpcReexport object at 0x478595203d0>
ipc_device = <Device 0 (Tesla T4)>
ipc_memory_resource = <cuda.core._memory._device_memory_resource.DeviceMemoryResource object at 0x478594efae0>

    def test_main(self, ipc_device, ipc_memory_resource):
        # Set up the device.
        device = ipc_device
        device.set_current()
    
        # Allocate, fill a buffer.
        mr = ipc_memory_resource
        pgen = PatternGen(device, NBYTES)
        buffer = mr.allocate(NBYTES)
        pgen.fill_buffer(buffer, seed=0)
    
        # Set up communication.
        q_bc = mp.Queue()
        event_b, event_c = [mp.Event() for _ in range(2)]
    
        # Spawn B and C.
        proc_b = mp.Process(target=self.process_b_main, args=(buffer, q_bc, event_b))
        proc_c = mp.Process(target=self.process_c_main, args=(q_bc, event_c))
        proc_b.start()
        proc_c.start()
    
        # Wait for C to signal completion then clean up.
        event_c.wait(timeout=CHILD_TIMEOUT_SEC)
        event_b.set()  # b can finish now
        proc_b.join(timeout=CHILD_TIMEOUT_SEC)
        proc_c.join(timeout=CHILD_TIMEOUT_SEC)
        assert proc_b.exitcode == 0
>       assert proc_c.exitcode == 0
E       AssertionError: assert 1 == 0
E        +  where 1 = <Process name='Process-25' pid=5129 parent=4876 stopped exitcode=1>.exitcode

buffer     = <Buffer ptr=0x316000000 size=64>
device     = <Device 0 (Tesla T4)>
event_b    = <Event at 0x47858f9ab10 set>
event_c    = <Event at 0x4785952cf90 unset>
ipc_device = <Device 0 (Tesla T4)>
ipc_memory_resource = <cuda.core._memory._device_memory_resource.DeviceMemoryResource object at 0x478594efae0>
mr         = <cuda.core._memory._device_memory_resource.DeviceMemoryResource object at 0x478594efae0>
pgen       = <helpers.buffers.PatternGen object at 0x47859dd1e10>
proc_b     = <Process name='Process-24' pid=5128 parent=4876 stopped exitcode=0>
proc_c     = <Process name='Process-25' pid=5129 parent=4876 stopped exitcode=1>
q_bc       = <multiprocessing.queues.Queue object at 0x47859dd1910>
self       = <test_send_buffers.TestIpcReexport object at 0x478595203d0>

tests/memory_ipc/test_send_buffers.py:97: AssertionError
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/queues.py", line 262, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^
  File "cuda/core/_memory/_buffer.pyx", line 99, in cuda.core._memory._buffer.Buffer.__reduce__
    return Buffer.from_ipc_descriptor, (self.memory_resource, self.get_ipc_descriptor())
  File "cuda/core/_memory/_buffer.pyx", line 139, in cuda.core._memory._buffer.Buffer.get_ipc_descriptor
    self._ipc_data = IPCDataForBuffer(_ipc.Buffer_get_ipc_descriptor(self), False)
  File "cuda/core/_memory/_ipc.pyx", line 160, in cuda.core._memory._ipc.Buffer_get_ipc_descriptor
    if not self.memory_resource.is_ipc_enabled:
AttributeError: 'NoneType' object has no attribute 'is_ipc_enabled'
Process Process-25:
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/process.py", line 320, in _bootstrap
    self.run()
    ~~~~~~~~^^
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/cuda-python/cuda-python/cuda_core/tests/memory_ipc/test_send_buffers.py", line 121, in process_c_main
    buffer = q_bc.get(timeout=CHILD_TIMEOUT_SEC)
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/queues.py", line 112, in get
    raise Empty
_queue.Empty

@rwgk rwgk requested review from Andy-Jost and mdboom February 3, 2026 20:15
@mdboom
Contributor

mdboom commented Feb 3, 2026

@rwgk: Do you have a link to the failing test run? It seems like almost every single remaining NVML call is being skipped here, so something doesn't seem quite right... Or is it just that you are running on hardware we've never tested on before?

@Andy-Jost
Contributor

Andy-Jost commented Feb 3, 2026

Tracking an almost certainly unrelated flake (resolved in the 2nd attempt):

https://github.com/NVIDIA/cuda-python/actions/runs/21643694781/job/62390783335?pr=1567

File "/__w/cuda-python/cuda-python/cuda_core/tests/memory_ipc/test_send_buffers.py", line 121, in process_c_main
buffer = q_bc.get(timeout=CHILD_TIMEOUT_SEC)
File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/queues.py", line 112, in get
raise Empty
_queue.Empty

In every case I've run into it, _queue.Empty is the result of a timeout in IPC.

We might consider something like @pytest.mark.flaky(reruns=2) for any test with a timeout.
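
A minimal sketch of that idea, assuming the pytest-rerunfailures plugin (which
provides the flaky marker) is installed; the test body and timeout value below
are illustrative, not the actual IPC tests:

import multiprocessing as mp

import pytest

CHILD_TIMEOUT_SEC = 30  # illustrative timeout, not the suite's actual value


def _child_main(event):
    # The child signals completion; in the real tests this is where IPC happens.
    event.set()


@pytest.mark.flaky(reruns=2, reruns_delay=1)
def test_child_process_with_timeout():
    # A timeout-prone test shape: if the child never signals in time, the
    # assertion fails and pytest-rerunfailures reruns the test up to two times.
    event = mp.Event()
    proc = mp.Process(target=_child_main, args=(event,))
    proc.start()
    event.wait(timeout=CHILD_TIMEOUT_SEC)
    proc.join(timeout=CHILD_TIMEOUT_SEC)
    assert proc.exitcode == 0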

Contributor

@Andy-Jost Andy-Jost left a comment

I approve of the change in test_launcher.py.

Contributor

@mdboom mdboom left a comment

LGTM, given that this is new hardware for us.

@rwgk rwgk merged commit 6bdcda0 into NVIDIA:main Feb 3, 2026
167 of 169 checks passed
@rwgk rwgk deleted the test_fixes_as_needed_for_qa branch February 3, 2026 21:01
@github-actions

github-actions bot commented Feb 3, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.
