Feature Description
I propose adding a per-CPU watchdog that detects both softlockup and hardlockup conditions inside the kernel.
The watchdog would periodically monitor CPU activity by tracking two kinds of “heartbeats”:
- Soft heartbeat: updated when the CPU reaches a scheduling point
- Hard heartbeat: updated in the timer interrupt handler
If either heartbeat has not been updated within a configurable threshold, the watchdog will report a softlockup or hardlockup event.
Problem or Need
During kernel development, we encountered CPU stall issues where:
- A CPU continues to receive interrupts but never reaches a scheduling point (softlockup)
- A CPU completely stops receiving timer interrupts (hardlockup)
Without dedicated detection, these issues are extremely difficult to debug.
A watchdog mechanism would help developers and contributors identify deadlocks, long critical sections, misconfigured interrupts, or broken scheduling paths.
This feature would improve debuggability, stability, and observability across SMP systems.
Suggested Implementation
Below is a high-level proposal for how the watchdog could be implemented.
1. Per-CPU State
```rust
use core::sync::atomic::{AtomicU32, AtomicU64};

pub struct PerCpuState {
    // === Softlockup Detection ===
    /// Timestamp when the scheduler last indicated progress (nanoseconds).
    /// Updated by scheduler at scheduling points; checked by timer interrupt.
    soft_timestamp: AtomicU64,
    // === Hardlockup Detection ===
    /// Timer interrupt counter (incremented in timer interrupt).
    hrtimer_interrupts: AtomicU32,
    /// Saved hrtimer_interrupts value from last NMI/pseudo-NMI check.
    hrtimer_interrupts_saved: AtomicU32,
}
```
2. Heartbeat Injection Points
- Scheduler or scheduling tick → soft_timestamp
- Timer IRQ handler → hrtimer_interrupts
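The two injection points could look roughly like the sketch below. It uses a trimmed copy of PerCpuState with only the heartbeat fields. The method names `new`, `record_scheduling_point`, and `record_timer_irq` are hypothetical and not part of the proposal, and `Ordering::Relaxed` is an assumption that may need tightening on a real kernel.

```rust
use core::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Trimmed copy of the proposed per-CPU state (heartbeat fields only).
pub struct PerCpuState {
    soft_timestamp: AtomicU64,
    hrtimer_interrupts: AtomicU32,
}

impl PerCpuState {
    /// Hypothetical constructor: both heartbeats start at zero.
    pub const fn new() -> Self {
        Self {
            soft_timestamp: AtomicU64::new(0),
            hrtimer_interrupts: AtomicU32::new(0),
        }
    }

    /// Soft heartbeat: hypothetical hook called by the scheduler
    /// (or scheduling tick) with the current time in nanoseconds.
    pub fn record_scheduling_point(&self, now_ns: u64) {
        self.soft_timestamp.store(now_ns, Ordering::Relaxed);
    }

    /// Hard heartbeat: hypothetical hook called from the timer IRQ handler.
    /// `fetch_add` wraps on overflow, so the counter never panics.
    pub fn record_timer_irq(&self) {
        self.hrtimer_interrupts.fetch_add(1, Ordering::Relaxed);
    }
}
```

Both hooks are a single atomic operation, so they should be cheap enough to run on every scheduling point and every timer interrupt.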
3. Periodic Checker
Softlockup and hardlockup are detected through different mechanisms.
Softlockup Detection
Checked on every timer IRQ.
Detection condition:
now_ns - soft_timestamp > softlockup_threshold_ns
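As a minimal sketch, the condition above can be written as a standalone function. The name `is_softlockup` and its parameter list are assumptions made for illustration. The subtraction is saturating on the assumption that the scheduler may write a newer timestamp after `now_ns` was sampled.

```rust
use core::sync::atomic::{AtomicU64, Ordering};

/// Returns true when the soft heartbeat is older than the threshold.
/// Hypothetical helper, intended to be called from the timer IRQ handler.
pub fn is_softlockup(
    soft_timestamp: &AtomicU64,
    now_ns: u64,
    softlockup_threshold_ns: u64,
) -> bool {
    let last_ns = soft_timestamp.load(Ordering::Relaxed);
    // saturating_sub yields 0 instead of underflowing if last_ns > now_ns.
    now_ns.saturating_sub(last_ns) > softlockup_threshold_ns
}
```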
Hardlockup Detection
Hardlockup detection runs once every hardlockup_threshold, driven by an NMI, a pseudo-NMI, or an FIQ acting as a pseudo-NMI.
Detection condition:
hrtimer_interrupts == hrtimer_interrupts_saved
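A possible shape for this check is sketched below. It assumes the checker also refreshes hrtimer_interrupts_saved on every run, which the field's description implies but the proposal does not spell out. The function name `is_hardlockup` is hypothetical.

```rust
use core::sync::atomic::{AtomicU32, Ordering};

/// Returns true when no timer interrupt was counted since the last check.
/// Hypothetical helper, intended to run from the NMI / pseudo-NMI handler.
pub fn is_hardlockup(
    hrtimer_interrupts: &AtomicU32,
    hrtimer_interrupts_saved: &AtomicU32,
) -> bool {
    let current = hrtimer_interrupts.load(Ordering::Relaxed);
    // Store the current count for the next check and fetch the previous one.
    let saved = hrtimer_interrupts_saved.swap(current, Ordering::Relaxed);
    current == saved
}
```

Because both counters start equal, the first check would report a lockup if no timer interrupt has fired yet, so a real implementation would probably arm the checker only after the timer is running.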
4. Reporting Options
- Log warnings or errors
- Optionally trigger a kernel panic (behind a feature flag)
- Optional debugging info:
- Task queue snapshot
- CPU state
- Other diagnostic data
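One way to keep the reporting policy separate from detection is sketched below. The types `LockupKind` and `LockupAction` and the functions `describe` and `choose_action` are hypothetical names used only for illustration. The `panic_enabled` argument stands in for the proposed feature flag.

```rust
/// Which detector fired (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LockupKind {
    Soft,
    Hard,
}

/// What the watchdog should do about it (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LockupAction {
    Warn,
    Panic,
}

/// Short label for log messages.
pub fn describe(kind: LockupKind) -> &'static str {
    match kind {
        LockupKind::Soft => "softlockup",
        LockupKind::Hard => "hardlockup",
    }
}

/// Panics only when the opt-in flag is set; otherwise logs a warning.
pub fn choose_action(panic_enabled: bool) -> LockupAction {
    if panic_enabled {
        LockupAction::Panic
    } else {
        LockupAction::Warn
    }
}
```

In a kernel build, `panic_enabled` could be supplied by a `cfg!(feature = "...")` expression, with the feature name left to the maintainers.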