-
Notifications
You must be signed in to change notification settings - Fork 243
Test fixes as required for QA #1567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…valid_values This is ONLY A BAND-AID, but a very effective one: Andy's original suggestion: * NVIDIA/cuda-python-private#245 (comment) Results of extensive testing: * NVIDIA/cuda-python-private#245 (comment) Long-term: * NVIDIA#1539
Several NVML tests were failing on NVIDIA Thor (BLACKWELL architecture) with NotSupportedError and NoPermissionError. These are harmless failures that occur when certain NVML APIs are not supported on specific hardware configurations or when the test environment lacks sufficient permissions. This commit fixes all 15 failing tests by properly handling these expected error conditions using the existing test patterns: 1. Use unsupported_before(device, None) context manager to catch NotSupportedError and skip tests gracefully when APIs are not supported on the hardware. 2. Add explicit try/except blocks to catch NoPermissionError and skip tests when operations require elevated permissions. Changes by file: cuda_bindings/tests/nvml/test_device.py: - test_current_clock_freqs: Added unsupported_before wrapper - test_device_get_performance_modes: Added unsupported_before wrapper - test_nvlink_low_power_threshold: Added NoPermissionError handling cuda_bindings/tests/nvml/test_pynvml.py: - test_device_get_total_energy_consumption: Changed from VOLTA arch check to None (to handle failures on newer architectures) - test_device_get_memory_info: Added unsupported_before wrapper - test_device_get_pcie_throughput: Changed from MAXWELL arch check to None and wrapped both PCIe throughput calls cuda_core/tests/system/test_system_device.py: - test_device_bar1_memory: Changed from KEPLER arch check to None - test_device_memory: Added unsupported_before wrapper - test_device_pci_info: Added wrapper around get_pcie_throughput() call - test_module_id: Added unsupported_before wrapper - test_get_inforom_version: Added wrapper around inforom.image_version access - test_clock: Changed FERMI arch check to None for performance_state - test_clock_event_reasons: Added wrappers around both clock event calls - test_pstates: Added unsupported_before wrapper cuda_bindings/tests/nvml/test_gpu.py: - test_gpu_get_module_id: Added unsupported_before wrapper All tests now properly skip instead of failing when encountering NotSupportedError or NoPermissionError, following the existing test patterns in the codebase. Test results: - Before: 15 failed tests across 4 test files - After: All tests pass or skip appropriately - cuda_bindings: 335 passed, 30 skipped, 1 xfailed - cuda_core: 1733 passed, 120 skipped, 1 xfailed Co-authored-by: Cursor <cursoragent@cursor.com>
|
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
This comment has been minimized.
This comment has been minimized.
|
Tracking an almost certainly unrelated flake (resolved in the 2nd attempt): https://github.com/NVIDIA/cuda-python/actions/runs/21643694781/job/62390783335?pr=1567 |
|
@rwgk: Do you have a link to the failing test run? It seems like almost every single remaining NVML call is being skipped here, so something doesn't seem quite right... Or it just that you are running on hardware we've never tested on before? |
In every case I've run into it, We might consider something like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve of the change in test_launcher.py.
mdboom
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, given that this is new hardware for us.
|
Description
Accumulated test fixes as required to avoid failures in QA environments:
Side note, this PR exposed a flake (almost certainly unrelated):
https://github.com/NVIDIA/cuda-python/actions/runs/21643694781/job/62390783335?pr=1567
See comment below.