Skip to content

feat: accept stale channel monitors for recovery#76

Open
ben-kaufman wants to merge 18 commits intomainfrom
fix/accept-stale-channel-monitors
Open

feat: accept stale channel monitors for recovery#76
ben-kaufman wants to merge 18 commits intomainfrom
fix/accept-stale-channel-monitors

Conversation

@ben-kaufman
Copy link

@ben-kaufman ben-kaufman commented Mar 18, 2026

Summary

  • Add set_accept_stale_channel_monitors(bool) builder API for one-time recovery when a channel monitor's update_id falls behind the ChannelManager (e.g. after migration overwrote newer monitor data with stale backup)
  • When enabled, force-syncs stale monitor update_ids during build instead of failing with DangerousValue
  • Defers chain sync on startup while sending probes to trigger commitment round-trips that heal stale monitor state
  • Polls monitor update_ids (60s timeout) with probe retries every 10s for late-connecting peers

The probe-triggered commitment round-trip provides the monitor with:

  • LatestHolderCommitmentTXInfo — correct current commitment with real signatures
  • CommitmentSecret — recovers all gap revocation secrets via the derivation tree

Context

Related to bitkit-android#847 and bitkit-ios#495.

Root cause: recent PRs in bitkit ios & android introduced orphaned channel recovery that re-fetched old RN monitors and passed them via setChannelDataMigration(). The app was pinned to ldk-node pre-rc.33 which blindly overwrote existing VSS monitors without comparing update_ids. These PRs got released in a recent app update.

Depends on: ovitrif/rust-lightning@0.2.2-accept-stale-monitors-v2 (patched force_set_latest_update_id() that resets CounterpartyCommitmentSecrets + accept_stale_channel_monitors flag on ChannelManagerReadArgs).

Test plan

  • test in bitkit-android
  • test in bitkit-ios

Related

🤖 Built with help from Claude Code

@claude

This comment has been minimized.

@claude

This comment has been minimized.

ben-kaufman added a commit to synonymdev/bitkit-ios that referenced this pull request Mar 18, 2026
On BuildError.ReadFailed (likely stale ChannelMonitor from migration
overwrite), automatically retry once with accept_stale_channel_monitors
enabled. The ldk-node recovery flag force-syncs the monitor's update_id
and heals commitment state via a delayed chain sync + keysend round-trip.

A persisted UserDefaults flag ensures this only triggers once — set on
any successful build (affected or not), preventing future retries.

Depends on: synonymdev/ldk-node#76

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ben-kaufman added a commit to synonymdev/bitkit-android that referenced this pull request Mar 18, 2026
On BuildException.ReadFailed (likely stale ChannelMonitor from migration
overwrite), automatically retry once with accept_stale_channel_monitors
enabled. The ldk-node recovery flag force-syncs the monitor's update_id
and heals commitment state via a delayed chain sync + keysend round-trip.

A persisted SharedPreferences flag ensures this only triggers once — set
on any successful build (affected or not), preventing future retries.

Depends on: synonymdev/ldk-node#76

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
claude[bot]

This comment was marked as resolved.

@claude

This comment has been minimized.

ben-kaufman and others added 10 commits March 18, 2026 23:36
When a channel monitor's update_id falls behind the ChannelManager
(e.g. after a migration overwrote newer data with stale backup data),
LDK refuses to start with DangerousValue. This adds a recovery path:

- Builder: new `set_accept_stale_channel_monitors(bool)` flag
- Build: passes flag to ChannelManagerReadArgs, which force-syncs
  stale monitor update_ids instead of returning DangerousValue
- Startup: when flag is set, defers chain sync while sending probes
  on all channels to trigger commitment round-trips that heal the
  stale monitor state. Polls monitor update_ids with 60s timeout
  and retries probes every 10s for late-connecting peers.

The probe-triggered commitment round-trip provides:
- LatestHolderCommitmentTXInfo (correct current commitment state)
- CommitmentSecret (recovers all gap revocation secrets via the
  derivation tree)

Depends on: ben-kaufman/rust-lightning#fix/accept-stale-channel-monitors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move inline imports (Path, RouteHop, NodeFeatures, ChannelFeatures)
  to top of lib.rs
- Release is_running write lock before healing block_on to prevent
  stop() deadlock during recovery
- Add stop_sender signal to healing loop so shutdown can cancel it
- Early return after defer_chain_sync path (is_running already set)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract duplicated Path/RouteHop construction into build_probe_path closure
- Address review feedback points with inline comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug 1 (CRITICAL): send_probe rejects single-hop paths ("No need probing
a path with less than two hops"). Replaced with send_spontaneous_payment
which sends a 1-sat keysend — works for direct single-hop channels.

Bug 2: start() spawned chain sync after stop() completed during healing.
Added is_running check after block_on returns.

Bug 3: accept_stale_channel_monitors flag was never cleared, causing
every restart on the same Node instance to re-trigger the 60s healing
delay. Changed to AtomicBool, cleared after healing completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set max_total_routing_fee_msat = 0 to prevent the router from picking
a multi-hop route through a different channel, which would heal the
wrong monitor. With zero max fee, only the direct (zero-fee) route to
the counterparty is used.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When multiple channels exist with the same peer, the router may pick
the same channel for both payments — leaving the other unhealed. Fix:
send one payment per unique counterparty (not per channel) and dedup
in the retry loop. Each retry may route through a different channel as
outbound capacity shifts from previous attempts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move HashSet (lib.rs) and AtomicBool (builder.rs) to top-level imports
per CLAUDE.md: "ALWAYS move imports to the top of the file"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Remove peer dedup from healing payments — send one per unhealed
   channel (not per peer). The previous dedup prevented retries to the
   same peer even when some of their channels remained unhealed.

2. Fix TOCTOU race: subscribe to stop_sender while holding the
   is_running read lock before spawning chain sync. This prevents
   stop() from completing between the check and subscribe, which would
   orphan the task (missing the already-sent stop signal).

   Extracted spawn_chain_sync_task_with_receiver() so the normal path
   (spawn_chain_sync_task) still works unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ben-kaufman ben-kaufman force-pushed the fix/accept-stale-channel-monitors branch from cd33be3 to cb60b2e Compare March 18, 2026 16:01
jvsena42 pushed a commit to synonymdev/bitkit-ios that referenced this pull request Mar 18, 2026
On BuildError.ReadFailed (likely stale ChannelMonitor from migration
overwrite), automatically retry once with accept_stale_channel_monitors
enabled. The ldk-node recovery flag force-syncs the monitor's update_id
and heals commitment state via a delayed chain sync + keysend round-trip.

A persisted UserDefaults flag ensures this only triggers once — set on
any successful build (affected or not), preventing future retries.

Depends on: synonymdev/ldk-node#76

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ovitrif ovitrif self-requested a review March 18, 2026 16:37
BuildError::ReadFailed is returned by 19 different code paths (KVStore
I/O errors, deserialization failures, task join errors). Using it to
detect stale channel monitors in app-side retry logic causes false
positives: a corrupt scorer or transient VSS timeout would burn the
one-shot recovery flag, preventing actual stale monitor recovery later.

Add a dedicated DangerousValue variant that is returned only when
ChannelManager::read fails with DecodeError::DangerousValue (stale
monitors vs manager desync). This lets apps catch the specific case
without interfering with other ReadFailed scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ovitrif
Copy link
Collaborator

ovitrif commented Mar 18, 2026

New commit: BuildError::DangerousValue

Added a dedicated DangerousValue variant to BuildError to distinguish stale channel monitors from the 18 other ReadFailed causes (KVStore I/O errors, deserialization failures, task join errors, etc).

Rationale

The app-side recovery catches BuildException.ReadFailed to trigger the one-shot accept_stale_channel_monitors retry. But ReadFailed is a catch-all — a corrupt scorer, transient VSS timeout, or disk error would also trigger it, burning the recovery flag on a false positive. If the user later hits the actual stale monitor bug, recovery won't fire.

With DangerousValue, the app catches only the stale monitor case:

// Android example

// Before (catches 19 different failure modes):
catch (e: BuildException.ReadFailed)

// After (catches only stale monitors):
catch (e: BuildException.DangerousValue)

Changes

  • src/builder.rs: new BuildError::DangerousValue variant + DecodeError::DangerousValue match at ChannelManager read (line 2025)
  • bindings/ldk_node.udl: exposed to FFI
  • Bindings regenerated

…Value)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ovitrif ovitrif requested a review from coreyphillips March 18, 2026 18:30
ovitrif pushed a commit to synonymdev/bitkit-android that referenced this pull request Mar 18, 2026
@ovitrif ovitrif requested a review from jvsena42 March 18, 2026 21:52
Switch rust-lightning fork to ovitrif/rust-lightning#0.2.2-accept-stale-monitors-v2
which resets the commitment secrets store in force_set_latest_update_id,
preventing SIGABRT from a ChannelMonitorUpdateStatus mode mismatch after
provide_secret() fails validation on a stale secrets tree.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ovitrif ovitrif force-pushed the fix/accept-stale-channel-monitors branch from 46c6e32 to 153ecbe Compare March 18, 2026 22:20
ben-kaufman added a commit to synonymdev/bitkit-ios that referenced this pull request Mar 18, 2026
On BuildError.ReadFailed (likely stale ChannelMonitor from migration
overwrite), automatically retry once with accept_stale_channel_monitors
enabled. The ldk-node recovery flag force-syncs the monitor's update_id
and heals commitment state via a delayed chain sync + keysend round-trip.

A persisted UserDefaults flag ensures this only triggers once — set on
any successful build (affected or not), preventing future retries.

Depends on: synonymdev/ldk-node#76

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ben-kaufman added a commit to synonymdev/bitkit-ios that referenced this pull request Mar 18, 2026
On BuildError.ReadFailed (likely stale ChannelMonitor from migration
overwrite), automatically retry once with accept_stale_channel_monitors
enabled. The ldk-node recovery flag force-syncs the monitor's update_id
and heals commitment state via a delayed chain sync + keysend round-trip.

A persisted UserDefaults flag ensures this only triggers once — set on
any successful build (affected or not), preventing future retries.

Depends on: synonymdev/ldk-node#76

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coreyphillips coreyphillips self-requested a review March 19, 2026 02:16
ben-kaufman and others added 2 commits March 19, 2026 18:09
…rce-close

The healing keysend creates an HTLC with cltv_expiry based on the
ChannelManager's best block height. For users offline >24h (all affected
users), the height is stale — the HTLC is already expired by the time
chain sync catches up, causing LDK to force-close the channel
(HTLCsTimedOut). This defeats the entire recovery.

Fix: call sync_lightning_wallet() before sending healing payments. This
updates the ChannelManager's chain tip to the current height so the HTLC
gets a valid CLTV expiry. If sync fails, skip the keysend entirely to
avoid the stale-CLTV force-close — the monitor will heal naturally on
the first real user payment after continuous chain sync starts.

The stale monitor processes blocks during this sync (accepted risk:
Blocktank is trusted, and the monitor's force-synced update_id makes it
"valid" from LDK's perspective — the only concern is counterparty
force-close during the offline period, which Blocktank wouldn't do).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Collaborator

@ovitrif ovitrif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latest changes lgtm

@ovitrif ovitrif changed the title feat: accept stale channel monitors for recovery from monitor desync feat: accept stale channel monitors for recovery Mar 19, 2026
… (rc.36)

When orphaned channel migration encounters an existing monitor that can't
be deserialized (e.g. UnknownVersion), skip the migration instead of
killing node startup with ReadFailed. The existing data is preserved.

Also adds HTLC timeout force-close fix to CHANGELOG (from rc.35).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ovitrif and others added 2 commits March 19, 2026 21:01
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants