Open
Conversation
Add a flag to ChannelManagerReadArgs that allows stale channel monitors to be accepted on startup instead of returning DangerousValue. When a stale monitor is detected, its update_id is force-synced to match the ChannelManager. The monitor's commitment state remains stale until the next real channel update (e.g. a commitment round-trip triggered by a fee update or probe). This enables recovery after monitor data was overwritten by a migration or backup restore. The caller should trigger a commitment round-trip after startup to heal the monitor state and recover revocation secrets via the derivation tree. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a stale monitor is restored via accept_stale_channel_monitors and force_set_latest_update_id, the CounterpartyCommitmentSecrets tree retains secrets from before the update_id gap. When a healing payment triggers commitment round-trips, provide_secret() fails because the stale secrets occupy a different subtree in the BOLT-3 derivation tree than the new secrets. This causes update_monitor() -> Err -> chainmonitor mode mismatch -> Rust panic / SIGABRT. Two coordinated changes fix this: 1. channelmonitor.rs: Reset commitment_secrets to new() in force_set_latest_update_id so new secrets build a fresh, consistent derivation tree instead of failing cross-tree validation. 2. chan_utils.rs: Skip uninitialized sentinel slots (old_idx == 1 << 48) in provide_secret's cross-check loop. Without this, ~50% of channels crash on the first post-reset provide_secret call (when the first index lands at position > 0 and derive_secret checks against the all-zeros sentinel, which always fails). Security analysis — resetting the secrets tree: - Losing old secrets means we cannot build penalty transactions for revoked commitments from before the stale point. - However, the stale monitor lacks commitment transaction data for the gap period — it cannot construct penalty transactions even WITH the old secrets. The secrets alone provide no actionable security benefit. - For pre-stale commitments, the monitor has HTLC data but loses secrets. The alternative (keeping stale secrets) causes a crash that makes the node completely non-functional, which is strictly worse. - After recovery, the channel should be closed as soon as practical to minimize exposure during the window where old secrets are lost. Security analysis — sentinel skip in provide_secret: - 1 << 48 is outside the valid index range (0 to (1<<48)-1) and uniquely identifies uninitialized slots. No real secret can ever have this index. - During normal sequential provision (new channels), sentinel slots at position P are always overwritten before position P+1 is first checked — the binary counting pattern guarantees this. The sentinel skip never triggers for non-stale channels. - Cross-checks between real (non-sentinel) slots remain fully enforced. - The only way to reach the sentinel skip is after an explicit reset via force_set_latest_update_id (our stale recovery path). - Even at 1 commitment/second, it takes ~9 million years to exhaust the index space — it never wraps to 1 << 48. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
accept_stale_channel_monitorstoChannelManagerReadArgs, allowing nodes to recover from monitor desync (e.g. a migration or backup restore that overwrote a newer monitor with older data). When enabled, stale monitors have theirupdate_idforce-synced to theChannelManager's expected value and their BOLT-3 commitment secrets tree reset, so the channel can self-heal on the next commitment round-trip.Changes
1.
accept_stale_channel_monitorsfield (channelmanager.rs)New
boolfield onChannelManagerReadArgs(default:false). When a monitor'supdate_idis behind theChannelManager's, instead of returningDecodeError::DangerousValue, the deserialization path:monitor.force_set_latest_update_id(target_id)to resync2.
force_set_latest_update_idwith secrets tree reset (channelmonitor.rs)New public method that:
latest_update_idto the given value (skipping sequential validation)commitment_secretstoCounterpartyCommitmentSecrets::new()— clearing stale secrets that would fail cross-tree BOLT-3 derivation validation against new secrets from healing round-trips3. Sentinel skip in
provide_secret(chan_utils.rs)Skip uninitialized sentinel slots (
old_idx == 1 << 48) in the cross-check loop. Without this, ~50% of channels crash on the first post-resetprovide_secretcall — when the first secret index lands at position > 0,derive_secretchecks against the all-zeros sentinel and always fails.4. Test scaffolding (functional_test_utils.rs, reload_tests.rs)
Added
accept_stale_channel_monitors: falseto existingChannelManagerReadArgsconstruction sites to compile with the new field.Security analysis
Resetting the secrets tree
Sentinel skip in
provide_secret1 << 48?0to(1<<48)-1. The commitment number determines the index via(1<<48) - 1 - C, and C starts at 0. Even at 1 commitment/second it takes ~9 million years to exhaust the space.1 << 48.CounterpartyCommitmentSecrets::new()reset inforce_set_latest_update_id.Test plan
cargo build --all-targets— compiles without errorscargo test --lib -p lightning— unit tests pass (includes existingCounterpartyCommitmentSecretstests)cargo fmt --check— no formatting issues in changed files