ma-ts commented Jan 25, 2026

Fix for nvidia-smi hanging after approximately 66 days of uptime.

The function os_get_monotonic_time_ns() in kernel-open/nvidia/os-interface.c uses jiffies_to_timespec64(jiffies, &ts) to obtain
monotonic time. The jiffies counter is an unsigned 32-bit value that wraps at 2^32 ticks. At HZ=750 this occurs after 2^32 / 750 / 86400 = 66.3 days. When the wrap
occurs, jiffies_to_timespec64() returns a near-zero value, causing time to appear to jump backwards.
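
The wrap arithmetic can be checked with a small user-space sketch. It assumes, as the description does, a 32-bit tick counter; `days_until_wrap32` is a hypothetical helper for illustration, not a driver or kernel function.

```c
/*
 * Hypothetical helper: days until a 32-bit tick counter wraps when it
 * advances `hz` times per second. The 32-bit width is the assumption
 * made in the description above.
 */
static double days_until_wrap32(unsigned int hz)
{
    const double ticks = 4294967296.0;      /* 2^32 */
    const double seconds_per_day = 86400.0;

    return ticks / (double)hz / seconds_per_day;
}
```

At `hz = 750` this gives about 66.3 days; at `hz = 1000` it gives about 49.7 days.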

This breaks timeout comparisons throughout the driver. Code in thread_state.c, locks.c,
and gpu_timeout.c stores a start time and later checks if currentTime >= startTime + timeout. After the wrap, currentTime is
suddenly much smaller than startTime, so these comparisons behave incorrectly and all operations appear to have timed out immediately.
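
To make the comparison concrete, here is a minimal user-space sketch of the check as it is worded above. `deadline_reached` and its parameter names are hypothetical; the real call sites in the driver are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the check described above:
 * currentTime >= startTime + timeout, all values in nanoseconds.
 */
static bool deadline_reached(uint64_t current_ns, uint64_t start_ns,
                             uint64_t timeout_ns)
{
    return current_ns >= start_ns + timeout_ns;
}
```

With a start time of 5,726,000 s, a timeout of 1 s, and a current time of 1 ms (a clock that has jumped backwards), this returns false and keeps returning false until the clock climbs past `start_ns + timeout_ns` again. How each call site reacts to that result depends on driver code not shown in this thread.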

The fix replaces the jiffies-based implementation with ktime_get_raw_ts64(), which reads from hardware timers and provides a monotonic 64-bit nanosecond timestamp that won't
wrap for centuries. This matches the implementation already used by os_get_monotonic_time_ns_hr() in the same file.

CLAassistant commented Jan 25, 2026

CLA assistant check
All committers have signed the CLA.

ma-ts changed the title from "fix: move to kernel-based timing" to "fix: Fix jiffies wrap in os_get_monotonic_time_ns() causing hangs after 66 days" on Jan 25, 2026
@pjaroszynski

jiffies should be 64-bit, right?

@mtijanic
Collaborator

> jiffies should be 64-bit, right?

Yeah, on 64-bit systems (the only ones supported by this codebase), jiffies is a regular 64-bit variable, so this won't overflow. Also, as mentioned in #971 (comment), the nvidia-smi hang bug is not part of this codebase at all. (Also, who uses CONFIG_HZ=750?)

That said, I think moving from jiffies to ktime here might be worthwhile anyway. However, AFAICT ktime_get_raw_ts64() was added in 4.18, and we still support 4.4 and later, so it would need some extra conftests to decide which one to use. Given that, I don't think it's worth having diverging behavior based on the kernel version. Maybe wait until we drop those kernels and then move unconditionally.
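
The point about counter width can be checked from user space. The kernel declares jiffies as `unsigned long`, so its width follows the target's `unsigned long`. The helpers below are hypothetical and only illustrate the arithmetic.

```c
#include <limits.h>

/* Width in bits of unsigned long, the type jiffies is declared with. */
static unsigned int ulong_bits(void)
{
    return (unsigned int)(sizeof(unsigned long) * CHAR_BIT);
}

/*
 * Hypothetical helper: years until a counter of `bits` width wraps
 * when it advances `hz` times per second.
 */
static double years_until_wrap(unsigned int bits, unsigned int hz)
{
    double ticks = 1.0;

    for (unsigned int i = 0; i < bits; i++)
        ticks *= 2.0;   /* builds 2^bits; exact in double for bits <= 64 */

    return ticks / (double)hz / (86400.0 * 365.25);
}
```

On an LP64 Linux target `ulong_bits()` is 64, and `years_until_wrap(64, 750)` is on the order of 780 million years, compared with roughly 0.18 years (about 66 days) for a 32-bit counter.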

tycho commented Jan 26, 2026

Why not just use ktime_get_raw_ns to eliminate the need for the timespec64_to_ns conversion?
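
For context, the conversion step in question folds a seconds/nanoseconds pair into a single nanosecond count, which ktime_get_raw_ns() returns directly. The sketch below is a simplified user-space analogue: `struct ts64` and `ts64_to_ns` are hypothetical stand-ins for the kernel's `struct timespec64` and `timespec64_to_ns()`, and any overflow handling the kernel performs is left out.

```c
#include <stdint.h>

/* User-space stand-in for struct timespec64 (assumed: same two fields). */
struct ts64 {
    int64_t tv_sec;   /* whole seconds */
    long    tv_nsec;  /* nanoseconds within the second, 0..999999999 */
};

/* Simplified analogue of timespec64_to_ns(); no overflow clamping. */
static int64_t ts64_to_ns(const struct ts64 *ts)
{
    return ts->tv_sec * INT64_C(1000000000) + ts->tv_nsec;
}
```

For example, 5 s plus 250 ns converts to 5,000,000,250 ns.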
