Skip to content

0.2.2 accept stale monitors#1

Open
ovitrif wants to merge 2 commits into0.2.2from
0.2.2-accept-stale-monitors
Open

0.2.2 accept stale monitors#1
ovitrif wants to merge 2 commits into0.2.2from
0.2.2-accept-stale-monitors

Conversation

@ovitrif
Copy link
Collaborator

@ovitrif ovitrif commented Mar 19, 2026

Summary

Adds accept_stale_channel_monitors to ChannelManagerReadArgs, allowing nodes to recover from monitor desync (e.g. a migration or backup restore that overwrote a newer monitor with older data). When enabled, stale monitors have their update_id force-synced to the ChannelManager's expected value and their BOLT-3 commitment secrets tree reset, so the channel can self-heal on the next commitment round-trip.

Changes

1. accept_stale_channel_monitors field (channelmanager.rs)

New bool field on ChannelManagerReadArgs (default: false). When a monitor's update_id is behind the ChannelManager's, instead of returning DecodeError::DangerousValue, the deserialization path:

  • Logs a warning with both update_id values
  • Calls monitor.force_set_latest_update_id(target_id) to resync

2. force_set_latest_update_id with secrets tree reset (channelmonitor.rs)

New public method that:

  • Sets latest_update_id to the given value (skipping sequential validation)
  • Resets commitment_secrets to CounterpartyCommitmentSecrets::new() — clearing stale secrets that would fail cross-tree BOLT-3 derivation validation against new secrets from healing round-trips

3. Sentinel skip in provide_secret (chan_utils.rs)

Skip uninitialized sentinel slots (old_idx == 1 << 48) in the cross-check loop. Without this, ~50% of channels crash on the first post-reset provide_secret call — when the first secret index lands at position > 0, derive_secret checks against the all-zeros sentinel and always fails.

4. Test scaffolding (functional_test_utils.rs, reload_tests.rs)

Added accept_stale_channel_monitors: false to existing ChannelManagerReadArgs construction sites to compile with the new field.

Security analysis

Resetting the secrets tree

Concern Assessment
Loss of penalty capability for pre-stale revoked commitments The stale monitor lacks commitment transaction data for the gap period — it cannot construct penalty transactions even with the old secrets. The secrets alone provide no actionable security benefit.
Pre-stale commitments where monitor has HTLC data Monitor loses secrets for these. However, the alternative (keeping stale secrets) causes a crash making the node completely non-functional, which is strictly worse.
Post-recovery exposure After recovery, the channel should be closed as soon as practical to minimize the window where old secrets are unavailable.

Sentinel skip in provide_secret

Concern Assessment
Could a normal channel hit the sentinel skip? No. During sequential provision, sentinel slots at position P are always overwritten before position P+1 is first checked — the binary counting pattern guarantees this. The skip never triggers for non-stale channels.
Could an attacker craft index 1 << 48? No. Valid indices are 0 to (1<<48)-1. The commitment number determines the index via (1<<48) - 1 - C, and C starts at 0. Even at 1 commitment/second it takes ~9 million years to exhaust the space.
Are real cross-checks weakened? No. Cross-checks between real (non-sentinel) slots remain fully enforced. The skip only applies to slots with the impossible sentinel index 1 << 48.
When does the skip trigger? Only after an explicit CounterpartyCommitmentSecrets::new() reset in force_set_latest_update_id.

Test plan

  • cargo build --all-targets — compiles without errors
  • cargo test --lib -p lightning — unit tests pass (includes existing CounterpartyCommitmentSecrets tests)
  • cargo fmt --check — no formatting issues in changed files
  • Manual test in bitkit app via ldk-node: restore a stale monitor, send a healing payment, verify no crash

ben-kaufman and others added 2 commits March 18, 2026 21:01
Add a flag to ChannelManagerReadArgs that allows stale channel monitors
to be accepted on startup instead of returning DangerousValue. When a
stale monitor is detected, its update_id is force-synced to match the
ChannelManager. The monitor's commitment state remains stale until the
next real channel update (e.g. a commitment round-trip triggered by a
fee update or probe).

This enables recovery after monitor data was overwritten by a migration
or backup restore. The caller should trigger a commitment round-trip
after startup to heal the monitor state and recover revocation secrets
via the derivation tree.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a stale monitor is restored via accept_stale_channel_monitors and
force_set_latest_update_id, the CounterpartyCommitmentSecrets tree retains
secrets from before the update_id gap. When a healing payment triggers
commitment round-trips, provide_secret() fails because the stale secrets
occupy a different subtree in the BOLT-3 derivation tree than the new
secrets. This causes update_monitor() -> Err -> chainmonitor mode
mismatch -> Rust panic / SIGABRT.

Two coordinated changes fix this:

1. channelmonitor.rs: Reset commitment_secrets to new() in
   force_set_latest_update_id so new secrets build a fresh, consistent
   derivation tree instead of failing cross-tree validation.

2. chan_utils.rs: Skip uninitialized sentinel slots (old_idx == 1 << 48)
   in provide_secret's cross-check loop. Without this, ~50% of channels
   crash on the first post-reset provide_secret call (when the first
   index lands at position > 0 and derive_secret checks against the
   all-zeros sentinel, which always fails).

Security analysis — resetting the secrets tree:

- Losing old secrets means we cannot build penalty transactions for
  revoked commitments from before the stale point.
- However, the stale monitor lacks commitment transaction data for the
  gap period — it cannot construct penalty transactions even WITH the
  old secrets. The secrets alone provide no actionable security benefit.
- For pre-stale commitments, the monitor has HTLC data but loses
  secrets. The alternative (keeping stale secrets) causes a crash that
  makes the node completely non-functional, which is strictly worse.
- After recovery, the channel should be closed as soon as practical
  to minimize exposure during the window where old secrets are lost.

Security analysis — sentinel skip in provide_secret:

- 1 << 48 is outside the valid index range (0 to (1<<48)-1) and
  uniquely identifies uninitialized slots. No real secret can ever
  have this index.
- During normal sequential provision (new channels), sentinel slots at
  position P are always overwritten before position P+1 is first
  checked — the binary counting pattern guarantees this. The sentinel
  skip never triggers for non-stale channels.
- Cross-checks between real (non-sentinel) slots remain fully enforced.
- The only way to reach the sentinel skip is after an explicit reset
  via force_set_latest_update_id (our stale recovery path).
- Even at 1 commitment/second, it takes ~9 million years to exhaust
  the index space — it never wraps to 1 << 48.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants