feat: accept stale channel monitors for recovery by ben-kaufman · Pull Request #76 · synonymdev/ldk-node

ben-kaufman · 2026-03-18T13:09:50Z

Summary

- Add a `set_accept_stale_channel_monitors(bool)` builder API for one-time recovery when a channel monitor's update_id falls behind the ChannelManager (e.g. after a migration overwrote newer monitor data with a stale backup).
- When enabled, force-sync stale monitor update_ids during build instead of failing with DangerousValue.
- Defer chain sync on startup while sending probes to trigger commitment round-trips that heal the stale monitor state.
- Poll monitor update_ids (60s timeout), retrying probes every 10s for late-connecting peers.
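The poll-and-retry behavior in the last point can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `heal_loop`, `HealOutcome`, and the three closures are hypothetical names, and the clock is injected so the loop can be exercised without real sleeping. Later commits in this PR replace probes with 1-sat keysend payments, so the second closure is named neutrally.

```rust
use std::time::Duration;

/// Result of the hypothetical healing loop.
#[derive(Debug, PartialEq)]
pub enum HealOutcome {
    Healed,
    TimedOut,
}

/// Polls until no stale channels remain or `timeout` has elapsed.
///
/// * `stale_channels` returns the ids of channels whose monitor update_id
///   is still behind the manager's (hypothetical u64 ids).
/// * `send_healing_payment` is called for every stale channel on the first
///   pass and again every `retry_every`.
/// * `wait` stands in for sleeping between polls.
pub fn heal_loop<S, P, W>(
    mut stale_channels: S,
    mut send_healing_payment: P,
    mut wait: W,
    timeout: Duration,
    retry_every: Duration,
    poll_every: Duration,
) -> HealOutcome
where
    S: FnMut() -> Vec<u64>,
    P: FnMut(u64),
    W: FnMut(Duration),
{
    let mut elapsed = Duration::ZERO;
    // Start at `retry_every` so the first iteration sends immediately.
    let mut since_last_send = retry_every;
    loop {
        let stale = stale_channels();
        if stale.is_empty() {
            return HealOutcome::Healed;
        }
        if elapsed >= timeout {
            return HealOutcome::TimedOut;
        }
        if since_last_send >= retry_every {
            for id in &stale {
                send_healing_payment(*id);
            }
            since_last_send = Duration::ZERO;
        }
        wait(poll_every);
        elapsed += poll_every;
        since_last_send += poll_every;
    }
}
```

With a 60s timeout and a 10s retry interval, a peer that connects late still gets a fresh attempt on the next retry tick instead of missing the single initial send.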

The probe-triggered commitment round-trip provides the monitor with:

- LatestHolderCommitmentTXInfo: correct current commitment with real signatures
- CommitmentSecret: recovers all gap revocation secrets via the derivation tree

Context

Related to bitkit-android#847 and bitkit-ios#495.

Root cause: recent PRs in bitkit-ios and bitkit-android introduced orphaned channel recovery that re-fetched old RN monitors and passed them via setChannelDataMigration(). The app was pinned to a pre-rc.33 ldk-node, which blindly overwrote existing VSS monitors without comparing update_ids. These PRs shipped in a recent app update.

Depends on: ovitrif/rust-lightning@0.2.2-accept-stale-monitors-v2 (patched force_set_latest_update_id() that resets CounterpartyCommitmentSecrets + accept_stale_channel_monitors flag on ChannelManagerReadArgs).

Test plan

test in bitkit-android
test in bitkit-ios

On BuildError.ReadFailed (likely stale ChannelMonitor from migration overwrite), automatically retry once with accept_stale_channel_monitors enabled. The ldk-node recovery flag force-syncs the monitor's update_id and heals commitment state via a delayed chain sync + keysend round-trip. A persisted UserDefaults flag ensures this only triggers once: it is set on any successful build (affected or not), preventing future retries.

Depends on: synonymdev/ldk-node#76

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

On BuildException.ReadFailed (likely stale ChannelMonitor from migration overwrite), automatically retry once with accept_stale_channel_monitors enabled. The ldk-node recovery flag force-syncs the monitor's update_id and heals commitment state via a delayed chain sync + keysend round-trip. A persisted SharedPreferences flag ensures this only triggers once: it is set on any successful build (affected or not), preventing future retries.

Depends on: synonymdev/ldk-node#76

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
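The one-shot retry described in these app-side commit messages might look roughly like the sketch below. The apps themselves are written in Swift and Kotlin; this Rust version only illustrates the control flow. `build_with_recovery`, the local `BuildError` enum, and the `recovery_done` boolean (standing in for the persisted UserDefaults/SharedPreferences flag) are all hypothetical.

```rust
/// Local stand-in for the builder error; only the variant that matters
/// for the retry decision is distinguished here.
#[derive(Debug, PartialEq)]
pub enum BuildError {
    ReadFailed,
    Other,
}

/// Tries a normal build first. If it fails with `ReadFailed` and recovery
/// has not been used yet, retries exactly once with the
/// accept-stale-monitors flag enabled.
///
/// `recovery_done` models the persisted one-shot flag: it is set after any
/// successful build, whether or not the retry was needed.
/// `build` receives `true` when stale monitors should be accepted.
pub fn build_with_recovery<B>(recovery_done: &mut bool, mut build: B) -> Result<(), BuildError>
where
    B: FnMut(bool) -> Result<(), BuildError>,
{
    let result = match build(false) {
        Err(BuildError::ReadFailed) if !*recovery_done => build(true),
        other => other,
    };
    if result.is_ok() {
        // Any success burns the one-shot recovery for future launches.
        *recovery_done = true;
    }
    result
}
```

Setting the flag on every successful build means unaffected users never enter the recovery path later, at the cost that a later genuine desync would not be retried automatically.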

When a channel monitor's update_id falls behind the ChannelManager (e.g. after a migration overwrote newer data with stale backup data), LDK refuses to start with DangerousValue. This adds a recovery path:

- Builder: new `set_accept_stale_channel_monitors(bool)` flag
- Build: passes flag to ChannelManagerReadArgs, which force-syncs stale monitor update_ids instead of returning DangerousValue
- Startup: when flag is set, defers chain sync while sending probes on all channels to trigger commitment round-trips that heal the stale monitor state. Polls monitor update_ids with 60s timeout and retries probes every 10s for late-connecting peers.

The probe-triggered commitment round-trip provides:

- LatestHolderCommitmentTXInfo (correct current commitment state)
- CommitmentSecret (recovers all gap revocation secrets via the derivation tree)

Depends on: ben-kaufman/rust-lightning#fix/accept-stale-channel-monitors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Move inline imports (Path, RouteHop, NodeFeatures, ChannelFeatures) to top of lib.rs
- Release is_running write lock before healing block_on to prevent stop() deadlock during recovery
- Add stop_sender signal to healing loop so shutdown can cancel it
- Early return after defer_chain_sync path (is_running already set)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Bug 1 (CRITICAL): send_probe rejects single-hop paths ("No need probing a path with less than two hops"). Replaced with send_spontaneous_payment, which sends a 1-sat keysend and works for direct single-hop channels.

Bug 2: start() spawned chain sync after stop() completed during healing. Added is_running check after block_on returns.

Bug 3: accept_stale_channel_monitors flag was never cleared, causing every restart on the same Node instance to re-trigger the 60s healing delay. Changed to AtomicBool, cleared after healing completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
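The Bug 3 fix (a flag that is cleared once healing has run) can be illustrated with a small sketch. The `Node` struct and `start` signature here are simplified stand-ins, not the real ldk-node types; only `AtomicBool` and its `load`/`store` calls are real standard-library APIs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Simplified stand-in for the node; only the recovery flag is modeled.
pub struct Node {
    accept_stale_channel_monitors: AtomicBool,
}

impl Node {
    pub fn new(accept_stale: bool) -> Self {
        Node {
            accept_stale_channel_monitors: AtomicBool::new(accept_stale),
        }
    }

    /// Runs `heal` only while the flag is set, then clears the flag so a
    /// later `start()` on the same instance skips the healing delay.
    /// Returns whether healing ran.
    pub fn start<F: FnOnce()>(&self, heal: F) -> bool {
        if !self.accept_stale_channel_monitors.load(Ordering::Acquire) {
            return false;
        }
        heal();
        self.accept_stale_channel_monitors
            .store(false, Ordering::Release);
        true
    }
}
```

An `AtomicBool` lets `start(&self)` clear the flag through a shared reference, which a plain `bool` field would not allow.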

Set max_total_routing_fee_msat = 0 to prevent the router from picking a multi-hop route through a different channel, which would heal the wrong monitor. With zero max fee, only the direct (zero-fee) route to the counterparty is used.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When multiple channels exist with the same peer, the router may pick the same channel for both payments, leaving the other unhealed. Fix: send one payment per unique counterparty (not per channel) and dedup in the retry loop. Each retry may route through a different channel as outbound capacity shifts from previous attempts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. Remove peer dedup from healing payments: send one per unhealed channel (not per peer). The previous dedup prevented retries to the same peer even when some of their channels remained unhealed.
2. Fix TOCTOU race: subscribe to stop_sender while holding the is_running read lock before spawning chain sync. This prevents stop() from completing between the check and subscribe, which would orphan the task (missing the already-sent stop signal). Extracted spawn_chain_sync_task_with_receiver() so the normal path (spawn_chain_sync_task) still works unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
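The TOCTOU fix in point 2 relies on doing the "is it running?" check and the stop-signal subscription under the same read lock. A minimal sketch of that idea using only the standard library follows; the real code uses its own stop_sender channel, so the `Node` struct, `subscribe_if_running`, and the `mpsc`-based subscriber list here are assumptions for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Mutex, RwLock};

/// Simplified stand-in: a running flag plus a list of stop subscribers.
pub struct Node {
    is_running: RwLock<bool>,
    stop_subscribers: Mutex<Vec<Sender<()>>>,
}

impl Node {
    pub fn new(running: bool) -> Self {
        Node {
            is_running: RwLock::new(running),
            stop_subscribers: Mutex::new(Vec::new()),
        }
    }

    /// Returns a stop receiver only if the node is still running.
    /// The read guard is held across both the check and the subscription,
    /// so `stop()` (which needs the write lock) cannot run in between.
    pub fn subscribe_if_running(&self) -> Option<Receiver<()>> {
        let running = self.is_running.read().unwrap();
        if !*running {
            return None;
        }
        let (tx, rx) = channel();
        self.stop_subscribers.lock().unwrap().push(tx);
        Some(rx)
    }

    /// Marks the node as stopped and notifies every subscriber.
    pub fn stop(&self) {
        let mut running = self.is_running.write().unwrap();
        *running = false;
        for tx in self.stop_subscribers.lock().unwrap().drain(..) {
            // A subscriber that already went away is not an error here.
            let _ = tx.send(());
        }
    }
}
```

A task spawned with the returned receiver is guaranteed to observe the stop signal, because any `stop()` call must happen after the subscription was registered.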

BuildError::ReadFailed is returned by 19 different code paths (KVStore I/O errors, deserialization failures, task join errors). Using it to detect stale channel monitors in app-side retry logic causes false positives: a corrupt scorer or transient VSS timeout would burn the one-shot recovery flag, preventing actual stale monitor recovery later.

Add a dedicated DangerousValue variant that is returned only when ChannelManager::read fails with DecodeError::DangerousValue (stale monitors vs manager desync). This lets apps catch the specific case without interfering with other ReadFailed scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ovitrif · 2026-03-18T16:50:56Z

New commit: `BuildError::DangerousValue`

Added a dedicated DangerousValue variant to BuildError to distinguish stale channel monitors from the 18 other ReadFailed causes (KVStore I/O errors, deserialization failures, task join errors, etc).

Rationale

The app-side recovery catches BuildException.ReadFailed to trigger the one-shot accept_stale_channel_monitors retry. But ReadFailed is a catch-all — a corrupt scorer, transient VSS timeout, or disk error would also trigger it, burning the recovery flag on a false positive. If the user later hits the actual stale monitor bug, recovery won't fire.

With DangerousValue, the app catches only the stale monitor case:

// Android example

// Before (catches 19 different failure modes):
catch (e: BuildException.ReadFailed)

// After (catches only stale monitors):
catch (e: BuildException.DangerousValue)
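On the Rust side, the distinction could be sketched as a small error mapping. Both enums below are local stand-ins defined only for this example (the real `DecodeError` lives in rust-lightning and has more variants), and `map_manager_read_error` is a hypothetical helper name.

```rust
/// Local stand-in for rust-lightning's decode error; only a few
/// variants are modeled.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    DangerousValue,
    InvalidValue,
    ShortRead,
}

/// Local stand-in for the builder error with the new dedicated variant.
#[derive(Debug, PartialEq)]
pub enum BuildError {
    ReadFailed,
    DangerousValue,
}

/// Maps a ChannelManager read failure to a builder error, keeping the
/// stale-monitor case separate from every other read failure.
pub fn map_manager_read_error(err: DecodeError) -> BuildError {
    match err {
        DecodeError::DangerousValue => BuildError::DangerousValue,
        _ => BuildError::ReadFailed,
    }
}
```

Callers can then match on `BuildError::DangerousValue` alone, so a transient I/O failure surfaces as `ReadFailed` and does not consume the one-shot recovery.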

Changes

src/builder.rs: new BuildError::DangerousValue variant + DecodeError::DangerousValue match at ChannelManager read (line 2025)
bindings/ldk_node.udl: exposed to FFI
Bindings regenerated

Switch rust-lightning fork to ovitrif/rust-lightning#0.2.2-accept-stale-monitors-v2 which resets the commitment secrets store in force_set_latest_update_id, preventing SIGABRT from a ChannelMonitorUpdateStatus mode mismatch after provide_secret() fails validation on a stale secrets tree.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rce-close

The healing keysend creates an HTLC with cltv_expiry based on the ChannelManager's best block height. For users offline >24h (all affected users), the height is stale: the HTLC is already expired by the time chain sync catches up, causing LDK to force-close the channel (HTLCsTimedOut). This defeats the entire recovery.

Fix: call sync_lightning_wallet() before sending healing payments. This updates the ChannelManager's chain tip to the current height so the HTLC gets a valid CLTV expiry. If sync fails, skip the keysend entirely to avoid the stale-CLTV force-close; the monitor will heal naturally on the first real user payment after continuous chain sync starts.

The stale monitor processes blocks during this sync (accepted risk: Blocktank is trusted, and the monitor's force-synced update_id makes it "valid" from LDK's perspective; the only concern is counterparty force-close during the offline period, which Blocktank wouldn't do).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
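The arithmetic behind the stale-CLTV problem can be sketched as below. This is a deliberately simplified model: the function names are hypothetical, the 6-blocks-per-hour rate is an average, the CLTV delta in the usage is an illustrative number rather than LDK's actual value, and LDK's real timeout logic also applies safety buffers that are not modeled here.

```rust
/// Approximate number of blocks mined while the node was offline,
/// assuming an average of 6 blocks per hour.
pub fn blocks_elapsed(hours_offline: u32) -> u32 {
    hours_offline * 6
}

/// HTLC expiry as computed from the height the ChannelManager currently
/// believes is the chain tip.
pub fn htlc_cltv_expiry(best_known_height: u32, cltv_delta: u32) -> u32 {
    best_known_height + cltv_delta
}

/// Simplified expiry check: the HTLC counts as timed out once the real
/// chain height has reached its expiry.
pub fn is_expired_at(cltv_expiry: u32, actual_height: u32) -> bool {
    actual_height >= cltv_expiry
}
```

For example, with a stale tip of 900,000, 24 hours offline (about 144 blocks), and an illustrative delta of 42, the expiry is 900,042 while the real height is already about 900,144, so the HTLC is expired on arrival. Computing the expiry from the synced height instead gives 900,186, which is still in the future.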

… (rc.36)

When orphaned channel migration encounters an existing monitor that can't be deserialized (e.g. UnknownVersion), skip the migration instead of killing node startup with ReadFailed. The existing data is preserved.

Also adds HTLC timeout force-close fix to CHANGELOG (from rc.35).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude Bot reviewed Mar 18, 2026

Comment thread src/lib.rs Outdated

claude Bot reviewed Mar 18, 2026

Comment thread src/lib.rs

claude Bot reviewed Mar 18, 2026

Comment thread src/builder.rs

jvsena42 reviewed Mar 18, 2026

Comment thread src/builder.rs

ben-kaufman mentioned this pull request Mar 18, 2026

feat: one-time stale channel monitor recovery synonymdev/bitkit-ios#500

Closed

ben-kaufman mentioned this pull request Mar 18, 2026

feat: one-time stale channel monitor recovery synonymdev/bitkit-android#854

Closed

claude Bot reviewed Mar 18, 2026

Comment thread src/builder.rs Outdated

claude Bot reviewed Mar 18, 2026

Comment thread src/lib.rs Outdated

claude Bot reviewed Mar 18, 2026

Comment thread src/lib.rs Outdated

claude Bot reviewed Mar 18, 2026

Comment thread src/lib.rs Outdated

ben-kaufman and others added 10 commits March 18, 2026 23:36

refactor: extract build_probe_path helper, add rationale comments

6554c4e

- Extract duplicated Path/RouteHop construction into build_probe_path closure
- Address review feedback points with inline comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: move inline imports to top of file

cf30189

Move HashSet (lib.rs) and AtomicBool (builder.rs) to top-level imports per CLAUDE.md: "ALWAYS move imports to the top of the file"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: regenerate bindings after rebase on main

cb60b2e

bump version

cbe7fb4

ben-kaufman force-pushed the fix/accept-stale-channel-monitors branch from cd33be3 to cb60b2e Compare March 18, 2026 16:01

ovitrif self-requested a review March 18, 2026 16:37

jvsena42 mentioned this pull request Mar 18, 2026

feat: one-time stale channel monitor recovery (v2) synonymdev/bitkit-ios#501

Closed

chore: update changelog for rc.34 (stale monitor recovery + Dangerous…

3cb09d7

…Value)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ovitrif requested a review from coreyphillips March 18, 2026 18:30

ovitrif pushed a commit to synonymdev/bitkit-android that referenced this pull request Mar 18, 2026

feat: channel monitor recovery flow

ae73061

Depends on: synonymdev/ldk-node#76

ovitrif requested a review from jvsena42 March 18, 2026 21:52

ovitrif force-pushed the fix/accept-stale-channel-monitors branch from 46c6e32 to 153ecbe Compare March 18, 2026 22:20

ovitrif mentioned this pull request Mar 18, 2026

feat: stale channel monitors recovery synonymdev/bitkit-android#855

Closed

ben-kaufman mentioned this pull request Mar 18, 2026

feat: stale channel monitors recovery synonymdev/bitkit-ios#502

Merged

coreyphillips reviewed Mar 19, 2026

Comment thread src/lib.rs

coreyphillips requested changes Mar 19, 2026

Comment thread src/lib.rs

coreyphillips self-requested a review March 19, 2026 02:16

coreyphillips approved these changes Mar 19, 2026

ben-kaufman and others added 2 commits March 19, 2026 18:09

generate bindings

db5ce1b

ovitrif changed the title from "feat: accept stale channel monitors for recovery from monitor desync" to "feat: accept stale channel monitors for recovery" on Mar 19, 2026

ovitrif requested a review from pwltr March 19, 2026 17:40

ovitrif assigned ovitrif and ben-kaufman Mar 19, 2026

ovitrif and others added 2 commits March 19, 2026 21:01

chore: use synonymdev/rust-lightning fork

bc61106

generate bindings

ae38ead

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jvsena42 approved these changes Mar 20, 2026

ovitrif mentioned this pull request Mar 20, 2026

chore: version 2.1.2 synonymdev/bitkit-android#859

Merged

ovitrif merged commit afcebdf into main Mar 24, 2026
4 checks passed

ovitrif deleted the fix/accept-stale-channel-monitors branch March 24, 2026 21:05

ovitrif merged 18 commits into main from fix/accept-stale-channel-monitors
