Conversation

@rwgk
Collaborator

@rwgk rwgk commented Feb 3, 2026

Description

Accumulated test fixes, as required to avoid failures in QA environments.

Side note: this PR exposed a flake (almost certainly unrelated):

https://github.com/NVIDIA/cuda-python/actions/runs/21643694781/job/62390783335?pr=1567

See comment below.

rwgk and others added 2 commits February 3, 2026 10:58
…valid_values

This is ONLY A BAND-AID, but a very effective one:

Andy's original suggestion:

* NVIDIA/cuda-python-private#245 (comment)

Results of extensive testing:

* NVIDIA/cuda-python-private#245 (comment)

Long-term:

* NVIDIA#1539

Several NVML tests were failing on NVIDIA Thor (BLACKWELL architecture)
with NotSupportedError and NoPermissionError. These are harmless failures
that occur when certain NVML APIs are not supported on specific hardware
configurations or when the test environment lacks sufficient permissions.

This commit fixes all 15 failing tests by properly handling these expected
error conditions using the existing test patterns (a sketch of both patterns
follows the list):

1. Use the unsupported_before(device, None) context manager to catch
   NotSupportedError and skip tests gracefully when APIs are not supported
   on the hardware.

2. Add explicit try/except blocks to catch NoPermissionError and skip tests
   when operations require elevated permissions.
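
A minimal sketch of both patterns (illustration only: the exception classes,
the body of unsupported_before, and the device fixture below are stand-ins
for the real test-suite helpers, and the NVML calls are replaced by raises):

import contextlib

import pytest


class NotSupportedError(Exception):
    pass


class NoPermissionError(Exception):
    pass


@contextlib.contextmanager
def unsupported_before(device, arch):
    # Stand-in for the existing helper: with arch=None, any NotSupportedError
    # raised inside the block becomes a pytest skip instead of a failure.
    try:
        yield
    except NotSupportedError:
        pytest.skip("NVML API not supported on this hardware")


@pytest.fixture
def device():
    return object()  # placeholder for the real NVML device-handle fixture


def test_unsupported_api_pattern(device):
    # Pattern 1: wrap the NVML query so unsupported hardware skips, not fails.
    with unsupported_before(device, None):
        raise NotSupportedError  # stand-in for an unsupported NVML query


def test_privileged_api_pattern(device):
    # Pattern 2: explicit try/except for calls that need elevated permissions.
    try:
        raise NoPermissionError  # stand-in for a privileged NVML call
    except NoPermissionError:
        pytest.skip("operation requires elevated permissions")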

Changes by file:

cuda_bindings/tests/nvml/test_device.py:
- test_current_clock_freqs: Added unsupported_before wrapper
- test_device_get_performance_modes: Added unsupported_before wrapper
- test_nvlink_low_power_threshold: Added NoPermissionError handling

cuda_bindings/tests/nvml/test_pynvml.py:
- test_device_get_total_energy_consumption: Changed from VOLTA arch check
  to None (to handle failures on newer architectures)
- test_device_get_memory_info: Added unsupported_before wrapper
- test_device_get_pcie_throughput: Changed from MAXWELL arch check to None
  and wrapped both PCIe throughput calls

cuda_core/tests/system/test_system_device.py:
- test_device_bar1_memory: Changed from KEPLER arch check to None
- test_device_memory: Added unsupported_before wrapper
- test_device_pci_info: Added wrapper around get_pcie_throughput() call
- test_module_id: Added unsupported_before wrapper
- test_get_inforom_version: Added wrapper around inforom.image_version access
- test_clock: Changed FERMI arch check to None for performance_state
- test_clock_event_reasons: Added wrappers around both clock event calls
- test_pstates: Added unsupported_before wrapper

cuda_bindings/tests/nvml/test_gpu.py:
- test_gpu_get_module_id: Added unsupported_before wrapper

All tests now properly skip instead of failing when encountering
NotSupportedError or NoPermissionError, following the existing test patterns
in the codebase.

Test results:
- Before: 15 failed tests across 4 test files
- After: All tests pass or skip appropriately
- cuda_bindings: 335 passed, 30 skipped, 1 xfailed
- cuda_core: 1733 passed, 120 skipped, 1 xfailed

Co-authored-by: Cursor <cursoragent@cursor.com>
@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 3, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Collaborator Author

rwgk commented Feb 3, 2026

/ok to test


@rwgk
Collaborator Author

rwgk commented Feb 3, 2026

Tracking an almost certainly unrelated flake (resolved in the 2nd attempt):

https://github.com/NVIDIA/cuda-python/actions/runs/21643694781/job/62390783335?pr=1567

=================================== FAILURES ===================================
_____________________ TestIpcReexport.test_main[DeviceMR] ______________________

self = <test_send_buffers.TestIpcReexport object at 0x478595203d0>
ipc_device = <Device 0 (Tesla T4)>
ipc_memory_resource = <cuda.core._memory._device_memory_resource.DeviceMemoryResource object at 0x478594efae0>

    def test_main(self, ipc_device, ipc_memory_resource):
        # Set up the device.
        device = ipc_device
        device.set_current()
    
        # Allocate, fill a buffer.
        mr = ipc_memory_resource
        pgen = PatternGen(device, NBYTES)
        buffer = mr.allocate(NBYTES)
        pgen.fill_buffer(buffer, seed=0)
    
        # Set up communication.
        q_bc = mp.Queue()
        event_b, event_c = [mp.Event() for _ in range(2)]
    
        # Spawn B and C.
        proc_b = mp.Process(target=self.process_b_main, args=(buffer, q_bc, event_b))
        proc_c = mp.Process(target=self.process_c_main, args=(q_bc, event_c))
        proc_b.start()
        proc_c.start()
    
        # Wait for C to signal completion then clean up.
        event_c.wait(timeout=CHILD_TIMEOUT_SEC)
        event_b.set()  # b can finish now
        proc_b.join(timeout=CHILD_TIMEOUT_SEC)
        proc_c.join(timeout=CHILD_TIMEOUT_SEC)
        assert proc_b.exitcode == 0
>       assert proc_c.exitcode == 0
E       AssertionError: assert 1 == 0
E        +  where 1 = <Process name='Process-25' pid=5129 parent=4876 stopped exitcode=1>.exitcode

buffer     = <Buffer ptr=0x316000000 size=64>
device     = <Device 0 (Tesla T4)>
event_b    = <Event at 0x47858f9ab10 set>
event_c    = <Event at 0x4785952cf90 unset>
ipc_device = <Device 0 (Tesla T4)>
ipc_memory_resource = <cuda.core._memory._device_memory_resource.DeviceMemoryResource object at 0x478594efae0>
mr         = <cuda.core._memory._device_memory_resource.DeviceMemoryResource object at 0x478594efae0>
pgen       = <helpers.buffers.PatternGen object at 0x47859dd1e10>
proc_b     = <Process name='Process-24' pid=5128 parent=4876 stopped exitcode=0>
proc_c     = <Process name='Process-25' pid=5129 parent=4876 stopped exitcode=1>
q_bc       = <multiprocessing.queues.Queue object at 0x47859dd1910>
self       = <test_send_buffers.TestIpcReexport object at 0x478595203d0>

tests/memory_ipc/test_send_buffers.py:97: AssertionError
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/queues.py", line 262, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^
  File "cuda/core/_memory/_buffer.pyx", line 99, in cuda.core._memory._buffer.Buffer.__reduce__
    return Buffer.from_ipc_descriptor, (self.memory_resource, self.get_ipc_descriptor())
  File "cuda/core/_memory/_buffer.pyx", line 139, in cuda.core._memory._buffer.Buffer.get_ipc_descriptor
    self._ipc_data = IPCDataForBuffer(_ipc.Buffer_get_ipc_descriptor(self), False)
  File "cuda/core/_memory/_ipc.pyx", line 160, in cuda.core._memory._ipc.Buffer_get_ipc_descriptor
    if not self.memory_resource.is_ipc_enabled:
AttributeError: 'NoneType' object has no attribute 'is_ipc_enabled'
Process Process-25:
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/process.py", line 320, in _bootstrap
    self.run()
    ~~~~~~~~^^
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/cuda-python/cuda-python/cuda_core/tests/memory_ipc/test_send_buffers.py", line 121, in process_c_main
    buffer = q_bc.get(timeout=CHILD_TIMEOUT_SEC)
  File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/queues.py", line 112, in get
    raise Empty
_queue.Empty

@rwgk rwgk requested review from Andy-Jost and mdboom February 3, 2026 20:15
@mdboom
Contributor

mdboom commented Feb 3, 2026

@rwgk: Do you have a link to the failing test run? It seems like almost every single remaining NVML call is being skipped here, so something doesn't seem quite right... Or is it just that you are running on hardware we've never tested on before?

@Andy-Jost
Contributor

Andy-Jost commented Feb 3, 2026

Tracking an almost certainly unrelated flake (resolved in the 2nd attempt):

https://github.com/NVIDIA/cuda-python/actions/runs/21643694781/job/62390783335?pr=1567

File "/__w/cuda-python/cuda-python/cuda_core/tests/memory_ipc/test_send_buffers.py", line 121, in process_c_main
buffer = q_bc.get(timeout=CHILD_TIMEOUT_SEC)
File "/opt/hostedtoolcache/Python/3.14.2/x64-freethreaded/lib/python3.14t/multiprocessing/queues.py", line 112, in get
raise Empty
_queue.Empty

In every case I've run into it, _queue.Empty is the result of a timeout in IPC.

We might consider something like @pytest.mark.flaky(reruns=2) for any test with a timeout.
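
A minimal sketch of that idea, assuming the pytest-rerunfailures plugin (which
provides the flaky marker) is installed; the test body and timeout value below
are illustrative, not the actual IPC tests:

import multiprocessing as mp

import pytest

CHILD_TIMEOUT_SEC = 30  # illustrative timeout, not the suite's actual value


def _child_main(event):
    # The child signals completion; in the real tests this is where IPC happens.
    event.set()


@pytest.mark.flaky(reruns=2, reruns_delay=1)
def test_child_process_with_timeout():
    # A timeout-prone test shape: if the child never signals in time, the
    # assertion fails and pytest-rerunfailures reruns the test up to two times.
    event = mp.Event()
    proc = mp.Process(target=_child_main, args=(event,))
    proc.start()
    event.wait(timeout=CHILD_TIMEOUT_SEC)
    proc.join(timeout=CHILD_TIMEOUT_SEC)
    assert proc.exitcode == 0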

Contributor

@Andy-Jost Andy-Jost left a comment

I approve of the change in test_launcher.py.

Contributor

@mdboom mdboom left a comment

LGTM, given that this is new hardware for us.

@rwgk rwgk merged commit 6bdcda0 into NVIDIA:main Feb 3, 2026
167 of 169 checks passed
@rwgk rwgk deleted the test_fixes_as_needed_for_qa branch February 3, 2026 21:01
@github-actions

github-actions bot commented Feb 3, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.
