[feature request] Add feat: watchdog (Softlockup / Hardlockup Detection) #88

@WEIXIAOYVYI

Feature Description

I would like to propose adding a new feature: a per-CPU watchdog mechanism that detects both softlockup and hardlockup conditions inside the kernel.

The watchdog would periodically monitor CPU activity by tracking two kinds of “heartbeats”:

  • Soft heartbeat: updated when the CPU reaches a scheduling point
  • Hard heartbeat: updated in the timer interrupt handler

If a heartbeat has not been updated within its configurable threshold, the watchdog reports the corresponding event: a softlockup for a stale soft heartbeat, or a hardlockup for a stale hard heartbeat.


Problem or Need

During kernel development, we encountered CPU stall issues where:

  • A CPU continues to receive interrupts but never reaches a scheduling point (softlockup)
  • A CPU completely stops receiving timer interrupts (hardlockup)

Without dedicated detection, these issues are extremely difficult to debug.
A watchdog mechanism would help developers and contributors identify deadlocks, long critical sections, misconfigured interrupts, or broken scheduling paths.

This feature would improve debuggability, stability, and observability across SMP systems.


Suggested Implementation

Below is a high-level proposal for how the watchdog could be implemented.

1. Per-CPU State

use core::sync::atomic::{AtomicU32, AtomicU64};

pub struct PerCpuState {
    // === Softlockup Detection ===
    /// Timestamp when the scheduler last indicated progress (nanoseconds).
    /// Updated by the scheduler at scheduling points; checked in the timer interrupt.
    soft_timestamp: AtomicU64,

    // === Hardlockup Detection ===
    /// Timer interrupt counter (incremented in the timer interrupt).
    hrtimer_interrupts: AtomicU32,
    /// Value of hrtimer_interrupts saved at the last NMI/pseudo-NMI check.
    hrtimer_interrupts_saved: AtomicU32,
}

2. Heartbeat Injection Points

  • Scheduler or scheduling tick → soft_timestamp
  • Timer IRQ handler → hrtimer_interrupts
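The two injection points could be sketched as small methods on the per-CPU state. This is only an illustration: the method names `touch_soft` and `tick_hard` and the constructor are hypothetical, and `Ordering::Relaxed` is an assumption that each field is an independent per-CPU value with no ordering requirements against other memory.

```rust
use core::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Per-CPU watchdog state, mirroring the fields proposed in section 1.
pub struct PerCpuState {
    soft_timestamp: AtomicU64,
    hrtimer_interrupts: AtomicU32,
    hrtimer_interrupts_saved: AtomicU32,
}

impl PerCpuState {
    /// Hypothetical constructor; all heartbeats start at zero.
    pub const fn new() -> Self {
        Self {
            soft_timestamp: AtomicU64::new(0),
            hrtimer_interrupts: AtomicU32::new(0),
            hrtimer_interrupts_saved: AtomicU32::new(0),
        }
    }

    /// Soft heartbeat: hypothetical hook called at a scheduling point.
    /// `now_ns` is the current monotonic time in nanoseconds.
    pub fn touch_soft(&self, now_ns: u64) {
        self.soft_timestamp.store(now_ns, Ordering::Relaxed);
    }

    /// Hard heartbeat: hypothetical hook called from the timer IRQ handler.
    /// `fetch_add` wraps on overflow, which is harmless because the checker
    /// only compares the counter for equality with its saved value.
    pub fn tick_hard(&self) {
        self.hrtimer_interrupts.fetch_add(1, Ordering::Relaxed);
    }
}
```

The scheduler would call `touch_soft` with its own clock reading, and the timer IRQ handler would call `tick_hard` once per interrupt.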

3. Periodic Checker

Softlockup and hardlockup are detected through different mechanisms.

Softlockup Detection

The softlockup check runs on every timer IRQ.

Detection condition:

now_ns - soft_timestamp > softlockup_threshold_ns
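A minimal sketch of this check, assuming a free function named `check_softlockup` (hypothetical) that the timer IRQ handler calls with the current time. It uses a saturating subtraction so that a heartbeat written just after `now_ns` was sampled cannot underflow into a false positive.

```rust
use core::sync::atomic::{AtomicU64, Ordering};

/// Returns `Some(stalled_ns)` when the soft heartbeat is older than
/// `threshold_ns`, and `None` otherwise.
pub fn check_softlockup(
    soft_timestamp: &AtomicU64,
    now_ns: u64,
    threshold_ns: u64,
) -> Option<u64> {
    let last = soft_timestamp.load(Ordering::Relaxed);
    // Saturate at zero if the heartbeat is newer than our clock sample.
    let stalled_ns = now_ns.saturating_sub(last);
    if stalled_ns > threshold_ns {
        Some(stalled_ns)
    } else {
        None
    }
}
```

Returning the stall duration rather than a plain `bool` lets the report state how long the CPU has gone without reaching a scheduling point.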

Hardlockup Detection

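The hardlockup check runs once per hardlockup_threshold interval, from a context that ordinary interrupt masking cannot block (NMI, pseudo-NMI, or FIQ acting as a pseudo-NMI).

Detection condition (hrtimer_interrupts_saved is updated to the current count after each check):

hrtimer_interrupts == hrtimer_interrupts_saved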
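The comparison and the save step could be combined as below. This is a sketch under assumptions: the function name `check_hardlockup` is hypothetical, and it is meant to be called only from the NMI-like context of the CPU that owns the counters, so a relaxed `swap` is enough.

```rust
use core::sync::atomic::{AtomicU32, Ordering};

/// Returns `true` when no timer interrupt has been counted since the
/// previous check, i.e. the CPU appears hard-locked.
pub fn check_hardlockup(
    hrtimer_interrupts: &AtomicU32,
    hrtimer_interrupts_saved: &AtomicU32,
) -> bool {
    let current = hrtimer_interrupts.load(Ordering::Relaxed);
    // Record the current count for the next check and fetch the old one.
    let previous = hrtimer_interrupts_saved.swap(current, Ordering::Relaxed);
    current == previous
}
```

Note that if both counters start at zero, a check that runs before the first timer interrupt would report a lockup, so the first check should be scheduled at least one full interval after the timer is armed.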

4. Reporting Options

  • Log warnings or errors
  • Optionally trigger a kernel panic (behind a feature flag)
  • Optional debugging info:
    • Task queue snapshot
    • CPU state
    • Other diagnostic data
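One possible shape for the reporting side is sketched below. All names here (`LockupKind`, `LockupAction`, `LockupReport`, `configured_action`) and the Cargo feature name `watchdog-panic` are hypothetical; the message format is only an example, and the actual logging call is left to the kernel's own logger.

```rust
use core::fmt;

/// Which heartbeat went stale.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LockupKind {
    Soft,
    Hard,
}

/// What the watchdog does once a lockup is detected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LockupAction {
    Warn,
    Panic,
}

/// Minimal report; diagnostic snapshots could be added as further fields.
pub struct LockupReport {
    pub cpu: usize,
    pub kind: LockupKind,
}

impl fmt::Display for LockupReport {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let kind = match self.kind {
            LockupKind::Soft => "softlockup",
            LockupKind::Hard => "hardlockup",
        };
        write!(f, "watchdog: {} detected on CPU {}", kind, self.cpu)
    }
}

/// Panics only when the (hypothetical) `watchdog-panic` feature is enabled;
/// otherwise the event is just logged as a warning.
pub const fn configured_action() -> LockupAction {
    if cfg!(feature = "watchdog-panic") {
        LockupAction::Panic
    } else {
        LockupAction::Warn
    }
}
```

Keeping the action behind a compile-time feature means the default build only logs, while a debugging build can opt in to stopping at the first detected lockup.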
