fix: Fix jiffies wrap in os_get_monotonic_time_ns() causing hangs after 66 days #1014
+1
−7
Fix for `nvidia-smi` hanging after approximately 66 days of uptime.

The function `os_get_monotonic_time_ns()` in `kernel-open/nvidia/os-interface.c` uses `jiffies_to_timespec64(jiffies, &ts)` to obtain monotonic time. The `jiffies` counter is an unsigned 32-bit value that wraps at 2^32 ticks. At `HZ=750` this occurs after `2^32 / 750 / 86400 = 66.3 days`. When the wrap occurs, `jiffies_to_timespec64()` returns a near-zero value, causing time to appear to jump backwards.

This breaks timeout comparisons throughout the driver. Code in `thread_state.c`, `locks.c`, and `gpu_timeout.c` stores a start time and later checks if `currentTime >= startTime + timeout`. After the wrap, `currentTime` is suddenly much smaller than `startTime`, so these comparisons behave incorrectly and all operations appear to have timed out immediately.

The fix replaces the jiffies-based implementation with `ktime_get_raw_ts64()`, which reads from hardware timers and provides a monotonic 64-bit nanosecond timestamp that won't wrap for centuries. This matches the implementation already used by `os_get_monotonic_time_ns_hr()` in the same file.
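
For reviewers who want to check the arithmetic, here is a small userspace sketch of the failure mode. It is not driver code: `SIM_HZ` and every `sim_*` helper are hypothetical names made up for illustration, and the clock is modelled as a plain 32-bit tick counter converted to nanoseconds, which is an assumption based on the description above. It shows the wrap period at 750 ticks per second and how a timestamp taken just after the wrap compares against one taken just before it.

```c
#include <assert.h>
#include <stdint.h>

/* Tick rate quoted in the description; hypothetical constant for this sketch. */
#define SIM_HZ 750u

/* Model of a clock derived from a 32-bit tick counter (assumption, not driver code). */
static uint64_t sim_jiffies_to_ns(uint32_t ticks, uint32_t hz)
{
    assert(hz > 0);
    /* Max value is (2^32 - 1) * 1e9, which still fits in 64 bits. */
    return (uint64_t)ticks * 1000000000ull / hz;
}

/* Days until a 32-bit tick counter wraps at the given tick rate. */
static double sim_wrap_days(uint32_t hz)
{
    assert(hz > 0);
    return 4294967296.0 / (double)hz / 86400.0;
}

/* Deadline-style check, in the shape quoted above: now >= start + timeout. */
static int sim_deadline_expired(uint64_t now_ns, uint64_t start_ns,
                                uint64_t timeout_ns)
{
    return now_ns >= start_ns + timeout_ns;
}

/* Elapsed-style check: unsigned subtraction underflows if now < start. */
static int sim_elapsed_expired(uint64_t now_ns, uint64_t start_ns,
                               uint64_t timeout_ns)
{
    return (now_ns - start_ns) >= timeout_ns;
}
```

With a start time taken 10 ticks before the wrap and a current time taken 100 ticks later, the 32-bit counter reads 89, so the current time is far smaller than the start time. In this model the deadline-style check then does not fire until the counter climbs back past the old start value, while the elapsed-style check fires immediately because the unsigned subtraction underflows to a huge value. Which of the two behaviours a given call site shows depends on how its comparison is written.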