fix: prevent RN migration from overwriting local channel monitors #495

Closed

jvsena42 wants to merge 3 commits into master from fix/channel-monitor-stale-data
Conversation

@jvsena42 (Member) commented Mar 17, 2026

Description

This PR prevents old RN channel monitors from overwriting newer local ones during orphaned channel recovery, which caused LDK to refuse to start and display stale balances.

Root cause: PR #462 introduced orphaned channel monitor recovery that re-fetches all RN monitors from the remote backup. The ldk-node version at the time (c5698d0, post-rc.32) blindly wrote migration monitors to storage via setChannelDataMigration(), overwriting existing ones — including monitors that had advanced far beyond the old RN version. PR #480 fixed the channel manager overwrite but not the monitor overwrite.

Why it triggered for this user: isChannelRecoveryChecked is a new UserDefaults key introduced by PR #462. On first launch of the new version, it defaults to false, causing fetchOrphanedChannelMonitorsIfNeeded() to run for ALL existing RN-migrated users — not just those with genuinely orphaned channels. Every RN-migrated user with active channels who updated to the version with PR #462 is affected.

Fix: The migration-aware write protection already exists in ldk-node v0.7.0-rc.33 (commit e48948f). apply_channel_data_migration() now compares update_id before writing and skips if the existing monitor is newer or equal. This PR bumps ldk-node to include this fix. No Swift-side filtering is needed.
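The update_id comparison described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name — the real logic lives inside ldk-node's apply_channel_data_migration(), not in a free function:

```rust
// Hypothetical sketch of the migration-aware write check (the actual
// implementation is in ldk-node commit e48948f, v0.7.0-rc.33).
fn should_write_migrated_monitor(
    existing_update_id: Option<u64>,
    migrated_update_id: u64,
) -> bool {
    match existing_update_id {
        // No local monitor for this channel: the migrated copy is all we have.
        None => true,
        // A local monitor exists: only write if the migrated copy is strictly newer.
        Some(existing) => migrated_update_id > existing,
    }
}

fn main() {
    // A local monitor at update_id 120 must not be clobbered by an RN backup at 10.
    assert!(!should_write_migrated_monitor(Some(120), 10));
    // Equal update_ids: skip the redundant write.
    assert!(!should_write_migrated_monitor(Some(10), 10));
    // Genuinely orphaned channel (no local monitor): restore from backup.
    assert!(should_write_migrated_monitor(None, 10));
}
```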

Recovery for already-affected users: Requires a separate ldk-node change — set_accept_stale_channel_monitors(true) to allow the node to start with stale monitors and cooperative-close the affected channels (see PR comments for details).

Balance impact: When the node fails to start, @AppStorage balance values (totalBalanceSats, totalOnchainSats, totalLightningSats) are never updated — the user sees a plausible-looking but completely frozen balance from the last successful session.

Logs from the bug

bitkit_logs_2026-03-17_14-16-30.zip

Linked Issues/Tasks

Related PRs:

Screenshot / Video

N/A - backend logic change

@jvsena42 jvsena42 self-assigned this Mar 17, 2026

@jvsena42 (Member Author) commented Mar 17, 2026

Safest recovery path for already-affected users

The safest approach is to accept stale monitors and cooperative-close the affected channels, rather than wiping data or force-closing with revoked commitments.

Why this is the safest option

| Option | Funds at risk? | Other channels? | Mechanism |
| --- | --- | --- | --- |
| Wipe all channel data | Funds may go unclaimed (no monitor watching for force-close) | All lost | Counterparty eventually force-closes |
| Force-close with stale monitor | ALL funds at risk — broadcasting a revoked commitment (update_id 10 has been superseded 110 times, counterparty holds revocation keys) | Preserved | Penalty transaction |
| Accept stale + cooperative close | No | Preserved | Cooperative close uses agreed-upon final balances |

How it works

  1. Node starts with a stale monitor (warn instead of refusing)
  2. Immediately initiates cooperative close of the stale channel
  3. Both parties agree on correct final balances → closing tx → funds returned safely

The only risk window is between startup and cooperative close: if the counterparty broadcasts a revoked state from between updates 10–120, our stale monitor can't punish them. But the counterparty is Blocktank (our LSP) — no incentive to cheat.

Required changes

rust-lightning (ChannelManager deserialization):

The stale check and DangerousValue error live in channelmanager.rs:~17355. A flag is needed to warn-and-continue instead of refusing to start:

// Instead of returning Err(DecodeError::DangerousValue) when monitor is stale:
if args.accept_stale_channel_monitors {
    log_warn!(logger,
        "Accepting stale monitor for channel {} (monitor update_id {} vs manager update_id {}). \
         Cooperative close recommended.",
        channel_id, monitor_update_id, manager_update_id
    );
    stale_channels.push(channel_id);
    // Continue deserialization instead of returning error
} else {
    return Err(DecodeError::DangerousValue); // existing behavior
}

ldk-node (builder.rs):

Expose the flag via the builder API:

pub fn set_accept_stale_channel_monitors(&mut self, accept: bool) -> &mut Self {
    self.accept_stale_monitors = accept;
    self
}

After the node starts, auto-close stale channels:

for channel_id in stale_channels {
    // counterparty_node_id must be looked up from the stored channel details
    node.close_channel(&channel_id, counterparty_node_id)?;
}

Swift side (after ldk-node change):

On the first startup failure due to stale monitors, retry with the flag enabled:

// In LightningService.setup() or WalletViewModel.start()
// If node fails with stale monitor error, rebuild with:
builder.setAcceptStaleChannelMonitors(true)
// Node starts → stale channels are auto-closed cooperatively

Combine with migration-aware write (prevention)

For defense-in-depth, also add the migration-aware write from the previous comment to setChannelDataMigration() in builder.rs — skip writing monitors that already exist in the KVStore. This prevents the bug at the ldk-node level regardless of what the app passes.
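The "skip monitors that already exist" filter could look like the sketch below. The helper name and the (channel_id, bytes) representation are assumptions for illustration — the real check belongs inside setChannelDataMigration() in ldk-node's builder.rs:

```rust
use std::collections::HashSet;

// Hypothetical helper: drop migrated monitors whose channel already has a
// local monitor in the KVStore, keeping only genuinely orphaned ones.
fn filter_new_monitors(
    migrated: Vec<(String, Vec<u8>)>,
    existing_keys: &HashSet<String>,
) -> Vec<(String, Vec<u8>)> {
    migrated
        .into_iter()
        .filter(|(channel_id, _)| !existing_keys.contains(channel_id))
        .collect()
}

fn main() {
    let existing: HashSet<String> = ["chan_a".to_string()].into_iter().collect();
    let migrated = vec![
        ("chan_a".to_string(), vec![1]), // already present locally: skipped
        ("chan_b".to_string(), vec![2]), // genuinely orphaned: kept
    ];
    let to_write = filter_new_monitors(migrated, &existing);
    assert_eq!(to_write.len(), 1);
    assert_eq!(to_write[0].0, "chan_b");
}
```

A key-existence check alone is coarser than the update_id comparison (it also skips a migrated monitor that happens to be newer), which is why it is framed here as defense-in-depth rather than the primary fix.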

@jvsena42 jvsena42 marked this pull request as ready for review March 17, 2026 17:29
@jvsena42 jvsena42 marked this pull request as draft March 17, 2026 17:29
@jvsena42 (Member Author) commented

Impact on displayed balance

The balance values (totalBalanceSats, totalOnchainSats, totalLightningSats, etc.) are all @AppStorage (UserDefaults-backed) properties in WalletViewModel. They persist across app launches and are never reset to 0 unless explicitly overwritten.

When the node fails to start due to the stale monitor:

  1. updateBalanceState() fails because lightningService.refreshCache() / balanceManager.deriveBalanceState() require a running node
  2. The error is caught and none of the balance properties are updated
  3. The @AppStorage values retain whatever was last written from the previous successful session (before the update to build 182)

Result: The user sees a plausible-looking but completely frozen balance — the last known values from before the app update. It's not zero, so it doesn't immediately look broken. But:

  • No new payments are reflected (incoming or outgoing)
  • Both on-chain and Lightning balances are frozen (even though on-chain could theoretically refresh independently, deriveBalanceState() fails entirely when the node isn't running)
  • Send/receive operations fail because the node isn't started
  • The home screen, savings view, and spending view all display stale data

@jvsena42 (Member Author) commented Mar 18, 2026

Correction: ldk-node migration-aware write already exists

The migration-aware write protection already exists in ldk-node — but was added after the app's pinned version.

Timeline

| Commit | Date | Description |
| --- | --- | --- |
| c5698d0 | Mar 5 | App's pinned ldk-node version (Package.resolved) — does NOT have the protection |
| e48948f | Mar 9 | fix: prevent channel monitor migration from overwriting newer state |
| 2df9b0a | Mar 9 | fix: fail-closed on migration read/deserialize errors |
| cde6037 | Mar 9 | fix: skip redundant monitor write when update_id is equal |
| 4f91c8c | Mar 9 | fix: guard channel manager migration against overwriting newer state |

The current apply_channel_data_migration() at builder.rs:1342-1492 already:

  • Channel manager: Checks if one exists in the KVStore before writing (skips if yes)
  • Channel monitors: Deserializes existing monitors, compares update_id, skips if existing ≥ migrated
  • Fail-closed: Returns ReadFailed on any non-NotFound errors

What this means

  1. Bumping ldk-node to include e48948f (on fix/existing-channel-rewriting branch or main) fixes the root cause at the Rust level
  2. The Swift-side filterMigrationMonitors in this PR is belt-and-suspenders prevention — still valuable but the ldk-node fix is the primary protection
  3. Recovery for already-affected users still needs Option D (accept stale monitors + cooperative close) — the migration-aware write only prevents future overwrites, doesn't help users whose monitors are already stale in VSS

@jvsena42 (Member Author) commented

Update: ldk-node fix confirmed in v0.7.0-rc.33

The migration-aware write protection is included in ldk-node v0.7.0-rc.33 (latest tag). The app is currently pinned at c5698d0 (post-rc.32, pre-rc.33) which does NOT have the protection.

  • App version: c5698d0 (between v0.7.0-rc.32 and v0.7.0-rc.33) — no update_id comparison, blindly overwrites
  • Fix version: v0.7.0-rc.33 — includes e48948f through dc087fb (migration-aware write, fail-closed errors, tests)

Action: Bump ldk-node to v0.7.0-rc.33. The Swift-side filterMigrationMonitors code can be removed — ldk-node handles it at the Rust level.

Previous comments about "required ldk-node change" for migration-aware write are now moot — it's already done. The only remaining ldk-node work is Option D (accept stale monitors + cooperative close) for recovering already-affected users.

@jvsena42 jvsena42 force-pushed the fix/channel-monitor-stale-data branch from d7a9d96 to 8ab0069 on March 18, 2026 10:30
@jvsena42
Copy link
Member Author

@piotr-iohk replaced by #498 for a cleaner commit history
