Skip ChannelManager persistence after monitor update completion #4424

Open
joostjager wants to merge 2 commits into lightningdevkit:main from joostjager:skip-cm-persist-monitor-completed

Conversation

@joostjager (Contributor) commented Feb 18, 2026

Since we are pulling ChannelManager persistence into the critical path, we should avoid persisting unnecessarily. When process_pending_monitor_events processes only Completed events, resulting state changes may be recoverable on restart and do not require persistence.
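The decision described above can be modeled with a small, hypothetical sketch. The names mirror LDK's NotifyOption and MonitorEvent but the types and logic here are deliberately simplified and are not LDK's actual implementation: a batch containing only Completed events downgrades to SkipPersistHandleEvents, while any other event forces DoPersist.

```rust
// Hypothetical, simplified model of the persistence decision; not LDK code.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum NotifyOption {
	SkipPersistNoEvents,
	SkipPersistHandleEvents,
	DoPersist,
}

#[derive(Clone, Copy)]
enum MonitorEvent {
	// Monitor update completion; resulting state is assumed recoverable on restart.
	Completed,
	// Stand-in for events (HTLC resolution, force-close) that must be persisted.
	HtlcEvent,
}

fn process_pending_monitor_events(events: &[MonitorEvent]) -> NotifyOption {
	let mut result = NotifyOption::SkipPersistNoEvents;
	for ev in events {
		let this = match ev {
			MonitorEvent::Completed => NotifyOption::SkipPersistHandleEvents,
			MonitorEvent::HtlcEvent => NotifyOption::DoPersist,
		};
		// Take the "max": DoPersist always wins over the skip variants.
		result = result.max(this);
	}
	result
}
```

A Completed-only batch thus still wakes event handling (SkipPersistHandleEvents) without forcing a ChannelManager write, while mixing in any persist-requiring event upgrades the whole batch to DoPersist.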

ldk-reviews-bot commented Feb 18, 2026

👋 I see @TheBlueMatt was un-assigned.
If you'd like another reviewer assignment, please click here.

Comment thread lightning/src/ln/channelmanager.rs Outdated
if result == NotifyOption::SkipPersistNoEvents {
result = NotifyOption::SkipPersistHandleEvents;
}
self.channel_monitor_updated(
Collaborator

We'll need to carefully review everything this calls to make sure they set the store-required flag. Might be easier to just manually set the flag if there's anything to do in handle_post_monitor_update_chan_resume?

Contributor Author

By "anything to do in handle_post_monitor_update_chan_resume", do you mean specifically a new monitor update being created? Or is there more that isn't recoverable on restart?

Added a commit that sets needs_persist_flag in handle_new_monitor_update_locked_actions_handled_by_caller right after chain_monitor.update_channel to cover that case.

Have to say that certainty level instantly drops when working on this state machine 😬

Contributor Author

I have an alternative approach that is more conservative in stripping out persists: https://github.com/joostjager/rust-lightning/tree/skip-cm-persist-monitor-completed-alt

Rather than skipping persistence for all MonitorEvent::Completed events unconditionally, it inspects the result of monitor_updating_restored and only skips persistence when the effect is just queuing outbound messages (commitment_signed, revoke_and_ack). But it also needs to be carefully reviewed to make sure we don't miss anything.

Collaborator

Rather than trying to carefully thread the needs-persistence flags through, maybe we just explicitly set needs_persist_flag in channel_monitor_updated? But yea I think your second approach there makes way more sense.

Contributor Author

Ok, pushed that to this PR after splitting it into two commits for reviewability. Now it needs a careful eye to check that the persist condition isn't missing anything.

Collaborator

I'm confused, as of b0ec5eb I don't see any references to needs_persist_flag in the diff?

Contributor Author

With 'pushed that', I meant the second, more conservative, approach for determining whether persistence is needed.

I thought threading the value through at least keeps an overview of exactly where persistence is requested, and it also connects with process_pending_monitor_events needing to return a value for the persistence guard. It is a bit confusing that both mechanisms are used.

I can return SkipPersistHandleEvents from process_pending_monitor_events for Completed events, but I think the needs_persist flag cannot be set in channel_monitor_updated. It needs to be determined two levels deeper in try_resume_channel_post_monitor_update. Is your suggestion to thread back the bool from try_resume_channel_post_monitor_update to channel_monitor_updated and set the flag there? Or set the flag in try_resume_channel_post_monitor_update and consequently also in several other branches of the call tree?

Collaborator

Yea, my thinking was that we'd just self.needs_persist_flag.store() in try_resume_channel_post_monitor_update/wherever we decide we need to persist (after debug_assert'ing that we're holding the global_persistency_lock). Glancing at the other callsites this looks correct/redundant.

Contributor Author

Updated


codecov bot commented Feb 18, 2026

Codecov Report

❌ Patch coverage is 98.43750% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.98%. Comparing base (1af5ff4) to head (2308073).

Files with missing lines          | Patch %  | Lines
lightning/src/ln/channelmanager.rs | 98.43%  | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4424      +/-   ##
==========================================
- Coverage   86.99%   86.98%   -0.01%     
==========================================
  Files         163      163              
  Lines      109008   109026      +18     
  Branches   109008   109026      +18     
==========================================
+ Hits        94830    94839       +9     
- Misses      11692    11702      +10     
+ Partials     2486     2485       -1     
Flag Coverage Δ
fuzzing 38.26% <84.37%> (+0.05%) ⬆️
tests 86.09% <98.43%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


@joostjager joostjager self-assigned this Feb 19, 2026
@joostjager joostjager force-pushed the skip-cm-persist-monitor-completed branch 3 times, most recently from 48dce43 to b0ec5eb Compare February 25, 2026 13:30
@joostjager joostjager marked this pull request as ready for review February 26, 2026 09:22
@joostjager joostjager requested a review from TheBlueMatt March 2, 2026 10:34
@TheBlueMatt TheBlueMatt removed their request for review March 3, 2026 22:18
@TheBlueMatt TheBlueMatt self-requested a review March 31, 2026 00:32
@ldk-reviews-bot

🔔 1st Reminder

Hey @TheBlueMatt! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

@TheBlueMatt TheBlueMatt removed their request for review April 3, 2026 18:04
@TheBlueMatt TheBlueMatt added this to the 0.3 milestone Apr 3, 2026
@joostjager joostjager force-pushed the skip-cm-persist-monitor-completed branch from b0ec5eb to ebac69e Compare April 9, 2026 09:51
Comment thread lightning/src/ln/channelmanager.rs Outdated
Comment on lines +10347 to +10362
let unbroadcasted_batch_funding_txid =
	chan.context.unbroadcasted_batch_funding_txid(&chan.funding);

// Queuing outbound messages (commitment_update, raa) alone does
// not require ChannelManager persistence.
let needs_persist = has_state_changes
	|| !updates.finalized_claimed_htlcs.is_empty()
	|| !updates.failed_htlcs.is_empty()
	|| !update_actions.is_empty()
	|| unbroadcasted_batch_funding_txid.is_some()
	|| !htlc_forwards.is_empty()
	|| decode_update_add_htlcs.is_some();

if needs_persist {
	self.needs_persist_flag.store(true, Ordering::Release);
}
Collaborator

The needs_persist check omits committed_outbound_htlc_sources (pruning optimization, replayable) and tx_signatures (resendable message). This is likely intentional, but consider adding a comment explaining their exclusion so future maintainers of MonitorRestoreUpdates know the invariant.

More importantly: handle_channel_resumption (called above) invokes emit_channel_pending_event! and emit_initial_channel_ready_event!, which can push events to self.pending_events and mutate channel flags — neither of which is captured in this needs_persist check. This appears safe today because:

  • emit_channel_pending_event! requires is_funding_broadcastable() && !channel_pending_event_emitted — in practice this aligns with funding_broadcastable.is_some()
  • emit_initial_channel_ready_event! requires is_usable() && !initial_channel_ready_event_emitted — in practice this aligns with channel_ready.is_some()

Both conditions imply has_state_changes = true. However, this coupling is implicit. If those conditions ever diverge, events could land in pending_events without triggering persistence. In process_events_body the !pending_events.is_empty() check provides a safety net, but in get_and_clear_pending_msg_events there is no such fallback.

Consider adding a debug_assert or explicit comment documenting this invariant.

Comment on lines 13187 to +13193
let _ = self.handle_error(err, counterparty_node_id);
}

has_pending_monitor_events
if needs_persist {
NotifyOption::DoPersist
} else {
NotifyOption::SkipPersistHandleEvents
Collaborator

Minor: the function now uses two distinct mechanisms for signaling persistence need:

  1. Return value (NotifyOption::DoPersist): for HTLCEvent, HolderForceClosed*, CommitmentTxConfirmed
  2. Direct flag (needs_persist_flag.store): for state changes discovered inside channel_monitor_updated / try_resume_channel_post_monitor_update

The SkipPersistHandleEvents return here relies on the caller calling event_persist_notifier.notify() so the background processor wakes up and sees the directly-set flag. This works correctly today but is subtle — consider adding a comment at the return site explaining that persistence may have already been requested via needs_persist_flag and that the caller must still call notify() for SkipPersistHandleEvents.
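The dual signaling described in this comment can be sketched in simplified form. The names below (Manager, try_resume_channel, should_persist) are illustrative stand-ins, not LDK's actual API: a deep helper requests persistence by setting a shared atomic flag, and the persister must honor that flag even when the top-level return value says skip.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical sketch of the two persistence-signaling mechanisms; not LDK code.
struct Manager {
	needs_persist_flag: AtomicBool,
}

impl Manager {
	// Deep helper: requests persistence via the flag rather than a return value,
	// mirroring the direct needs_persist_flag.store in the diff above.
	fn try_resume_channel(&self, state_changed: bool) {
		if state_changed {
			self.needs_persist_flag.store(true, Ordering::Release);
		}
	}

	// Persister side: a skip-persist return value must not mask a directly-set
	// flag, so both signals are combined here. The flag is cleared on read.
	fn should_persist(&self, returned_do_persist: bool) -> bool {
		returned_do_persist || self.needs_persist_flag.swap(false, Ordering::AcqRel)
	}
}
```

This is why the notify() call matters: without a wakeup, a flag set by the deep helper would sit unread until some unrelated event woke the background processor.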

@ldk-claude-review-bot

ldk-claude-review-bot commented Apr 9, 2026

Review Summary

No new issues found beyond the prior review comments (which remain valid as design notes, not bugs).

Verification performed in this re-review

  • All 18 let _ = discard sites: Confirmed each is inside a PersistenceNotifierGuard::notify_on_drop context or equivalent always-persist path (e.g., process_background_events unconditionally returns DoPersist at line 8803, close_channel_internal uses notify_on_drop at line 4144, best_block_updated uses optionally_notify_skipping_background_events with DoPersist closure at line 16062)
  • process_events_body merge logic: All 9 combinations of (process_background_events, process_pending_monitor_events) correctly produce the "max" NotifyOption, with DoPersist always winning
  • get_and_clear_pending_msg_events refactor: Equivalent to prior behavior with finer-grained persistence control; maybe_generate_initial_closing_signed can only upgrade to DoPersist
  • has_state_changes completeness: Verified funding_broadcastable, channel_ready, announcement_sigs, and pending_update_adds correctly capture the ChannelManager mutations from handle_channel_resumption for the common paths. The unchecked funding_tx_signed field (splice path) is covered by the !pending_events.is_empty() safety net in process_events_body and by the SkipPersistHandleEvents + notify() path in get_and_clear_pending_msg_events
  • Splice path safety: Analyzed on_tx_signatures_exchange (channel.rs:9351) and confirmed that splice completions, even when has_state_changes is false, are fully replayable from ChannelMonitor state on restart. The SplicePending event emitted at channelmanager.rs:11185 is always caught by process_events_body's pending events check
  • needs_persist |= at line 13832: Uses bitwise OR — channel_monitor_updated is always called regardless of prior needs_persist value
  • handle_holding_cell_free_result interaction: Sets needs_persist_flag directly (line 13864), independent of channel_monitor_updated's return value. The background processor correctly picks this up even when process_pending_monitor_events returns SkipPersistHandleEvents
  • post_monitor_update_unlock tracking: Correctly tracks all state-changing operations including batch funding, update_actions, htlc_forwards, finalized_claimed_htlcs, failed_htlcs, and committed_outbound_htlc_sources

Prior inline comments status

All three prior inline comments remain valid as design/documentation notes but are not bugs:

  • channelmanager.rs:10362 — Implicit coupling between has_state_changes and event emission macros
  • channelmanager.rs:13849 — Dual persistence signaling mechanisms (NotifyOption return vs direct needs_persist_flag)
  • channelmanager.rs:10506 — Retracted (not a bug)

Comment thread lightning/src/ln/channelmanager.rs Outdated
@@ -11001,7 +10648,11 @@ This indicates a bug inside LDK. Please report this error at https://github.com/
let per_peer_state = self.per_peer_state.read().unwrap();
let mut peer_state_lock;
let peer_state_mutex_opt = per_peer_state.get(counterparty_node_id);
if peer_state_mutex_opt.is_none() { return }
if peer_state_mutex_opt.is_none() {
// Peer is gone; conservatively request persistence.
Collaborator

Why? We obviously didn't do anything?

Contributor Author

Hm annoying, this is a leftover from one of the previous versions of this change. Comment removed.

Comment thread lightning/src/ln/channelmanager.rs Outdated
Collaborator

Needs a doc comment update saying that this will set the needs-persist flag and trigger the background processor wake if required. Same with try_resume_channel_post_monitor_update and handle_post_monitor_update_chan_resume (which looks like it needs to be updated?).

Comment thread lightning/src/ln/channelmanager.rs Outdated
	|| decode_update_add_htlcs.is_some();

if needs_persist {
	self.needs_persist_flag.store(true, Ordering::Release);
Collaborator

Oh oops (a) this needs to happen after we do whatever handling we need in handle_post_monitor_update_chan_resume and (b) we're missing a notify call.

Contributor Author

Ok, will go back to a bit more passing back of values. It's not ideal to have those two mechanisms. I am not sure this is actually incorrect, though, because it still happens under the global lock and notifier pattern.

Contributor Author

I again threaded the bool all the way back so MonitorEvent::Completed still has one place that decides between SkipPersistHandleEvents and DoPersist. If we want the atomic flag setting to happen at the end anyway, we can just as well carry that decision back explicitly instead of splitting it between the top-level NotifyOption return and lower helper side effects.
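The "threading the bool back" shape described here can be sketched as follows. The function names are illustrative simplifications of the LDK call tree, not its real signatures: the completion handler reports whether it mutated state, and a single top-level site maps that to the persistence decision.

```rust
// Hypothetical sketch of threading the persistence decision back; not LDK code.
#[derive(Debug, PartialEq)]
enum NotifyOption {
	SkipPersistHandleEvents,
	DoPersist,
}

// Stand-in for the real completion handler, which would resume the channel
// and report whether ChannelManager state actually changed.
fn channel_monitor_updated(state_mutated: bool) -> bool {
	state_mutated
}

fn process_completed_events(completions: &[bool]) -> NotifyOption {
	let mut needs_persist = false;
	for &state_mutated in completions {
		// `|=` (not short-circuiting) so every completion is still processed
		// even after persistence has already been requested.
		needs_persist |= channel_monitor_updated(state_mutated);
	}
	// One place decides between the two outcomes for Completed events.
	if needs_persist {
		NotifyOption::DoPersist
	} else {
		NotifyOption::SkipPersistHandleEvents
	}
}
```

Carrying the decision back explicitly like this keeps the atomic-flag store (or the DoPersist return) at a single point, rather than splitting it between the top-level return value and side effects in lower helpers.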

@joostjager joostjager force-pushed the skip-cm-persist-monitor-completed branch from ebac69e to 887ee22 Compare April 15, 2026 11:49
Comment thread lightning/src/ln/channelmanager.rs Outdated
@joostjager joostjager force-pushed the skip-cm-persist-monitor-completed branch 3 times, most recently from d3cffca to 0d5f8e3 Compare April 15, 2026 12:01
joostjager and others added 2 commits April 15, 2026 14:42
Refactor process_pending_monitor_events to return a NotifyOption
instead of a bool, allowing callers to distinguish between
DoPersist, SkipPersistHandleEvents, and SkipPersistNoEvents.

Both call sites in process_events_body and
get_and_clear_pending_msg_events are updated accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When process_pending_monitor_events processes only Completed events
and the resulting work is limited to message-only monitor
completion handling, ChannelManager persistence can be skipped.

Completion handling now reports whether it actually mutated
ChannelManager state, and process_pending_monitor_events uses
that to decide between SkipPersistHandleEvents and DoPersist.

AI tools were used in preparing this commit.
@joostjager joostjager force-pushed the skip-cm-persist-monitor-completed branch from 0d5f8e3 to 2308073 Compare April 15, 2026 12:51