Skip ChannelManager persistence after monitor update completion #4424
joostjager wants to merge 2 commits into lightningdevkit:main
Conversation
👋 I see @TheBlueMatt was un-assigned.
```rust
if result == NotifyOption::SkipPersistNoEvents {
    result = NotifyOption::SkipPersistHandleEvents;
}
self.channel_monitor_updated(
```
We'll need to carefully review everything this calls to make sure they set the store-required flag. Might be easier to just manually set the flag if there's anything to do in handle_post_monitor_update_chan_resume?
By "anything to do in handle_post_monitor_update_chan_resume", do you mean specifically a new monitor update being created? Or is there more that isn't recoverable on restart?
Added a commit that sets needs_persist_flag in handle_new_monitor_update_locked_actions_handled_by_caller right after chain_monitor.update_channel to cover that case.
Have to say that certainty level instantly drops when working on this state machine 😬
I have an alternative approach that is more conservative in stripping out persists: https://github.com/joostjager/rust-lightning/tree/skip-cm-persist-monitor-completed-alt
Rather than skipping persistence for all MonitorEvent::Completed events unconditionally, it inspects the result of monitor_updating_restored and only skips persistence when the effect is just queuing outbound messages (commitment_signed, revoke_and_ack). This also needs careful review to make sure we don't miss anything.
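The shape of that decision can be sketched as follows. All names here are simplified stand-ins, not LDK's actual types: the real `MonitorRestoreUpdates` and `NotifyOption` carry more fields and variants, and only the skip condition's outline is taken from the description above.

```rust
// Hypothetical, pared-down mirror of the alternative branch's decision.
#[derive(Debug, PartialEq)]
enum NotifyOption {
    DoPersist,
    SkipPersistHandleEvents,
}

// Stand-in for the parts of monitor_updating_restored's result that the
// decision inspects.
#[allow(dead_code)]
struct RestoreEffects {
    queued_commitment_signed: bool,
    queued_revoke_and_ack: bool,
    failed_htlcs: usize,
    finalized_claimed_htlcs: usize,
}

fn persist_for_completed(effects: &RestoreEffects) -> NotifyOption {
    // Skip persistence only when the restore did nothing beyond queuing
    // outbound messages, which can be regenerated on reconnect.
    if effects.failed_htlcs == 0 && effects.finalized_claimed_htlcs == 0 {
        NotifyOption::SkipPersistHandleEvents
    } else {
        NotifyOption::DoPersist
    }
}
```

The point of the conservative variant is that anything beyond message queuing defaults to `DoPersist`, so a missed case fails safe.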
Rather than trying to carefully thread the needs-persistence flags through, maybe we just explicitly set needs_persist_flag in channel_monitor_updated? But yea I think your second approach there makes way more sense.
Ok, pushed that to this PR after splitting it into two commits for reviewability. Now it needs a careful eye to verify that the persist condition isn't missing anything.
I'm confused, as of b0ec5eb I don't see any references to needs_persist_flag in the diff?
With 'pushed that', I meant the second, more conservative, approach for determining whether persistence is needed.
I thought threading the value through at least keeps an overview of exactly where persistence is requested, and it also connects with process_pending_monitor_events needing to return a value for the persistence guard. It is a bit confusing that two mechanisms are both used.
I can return SkipPersistHandleEvents from process_pending_monitor_events for Completed events, but I think the needs_persist flag cannot be set in channel_monitor_updated. It needs to be determined two levels deeper in try_resume_channel_post_monitor_update. Is your suggestion to thread back the bool from try_resume_channel_post_monitor_update to channel_monitor_updated and set the flag there? Or set the flag in try_resume_channel_post_monitor_update and consequently also in several other branches of the call tree?
Yea, my thinking was that we'd just self.needs_persist_flag.store() in try_resume_channel_post_monitor_update/wherever we decide we need to persist (after debug_assert'ing that we're holding the global_persistency_lock). Glancing at the other callsites this looks correct/redundant.
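A minimal sketch of that suggestion, with hypothetical names: requiring the mutex guard as a parameter models the suggested debug_assert that the global persistence lock is held at the point where the flag is stored.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Mutex, MutexGuard};

// Hypothetical, pared-down manager; field names mirror the discussion but
// are not LDK's exact fields.
struct Manager {
    global_persistence_lock: Mutex<()>,
    needs_persist_flag: AtomicBool,
}

impl Manager {
    // Taking the guard by reference turns "debug_assert that we hold the
    // lock" into a compile-time obligation in this simplified model.
    fn try_resume_channel_post_monitor_update(&self, _held: &MutexGuard<'_, ()>) {
        // ...resumption work discovers that ChannelManager state changed...
        self.needs_persist_flag.store(true, Ordering::Release);
    }
}
```

In the real code a `debug_assert!` on the lock would serve the same purpose without changing signatures.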
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #4424      +/-   ##
==========================================
- Coverage   86.99%   86.98%   -0.01%
==========================================
  Files         163      163
  Lines      109008   109026      +18
  Branches   109008   109026      +18
==========================================
+ Hits        94830    94839       +9
- Misses      11692    11702      +10
+ Partials     2486     2485       -1
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
48dce43 to b0ec5eb
🔔 1st Reminder: Hey @TheBlueMatt! This PR has been waiting for your review.
b0ec5eb to ebac69e
```rust
let unbroadcasted_batch_funding_txid =
    chan.context.unbroadcasted_batch_funding_txid(&chan.funding);

// Queuing outbound messages (commitment_update, raa) alone does
// not require ChannelManager persistence.
let needs_persist = has_state_changes
    || !updates.finalized_claimed_htlcs.is_empty()
    || !updates.failed_htlcs.is_empty()
    || !update_actions.is_empty()
    || unbroadcasted_batch_funding_txid.is_some()
    || !htlc_forwards.is_empty()
    || decode_update_add_htlcs.is_some();

if needs_persist {
    self.needs_persist_flag.store(true, Ordering::Release);
}
```
The needs_persist check omits committed_outbound_htlc_sources (pruning optimization, replayable) and tx_signatures (resendable message). This is likely intentional, but consider adding a comment explaining their exclusion so future maintainers of MonitorRestoreUpdates know the invariant.
More importantly: handle_channel_resumption (called above) invokes emit_channel_pending_event! and emit_initial_channel_ready_event!, which can push events to self.pending_events and mutate channel flags — neither of which is captured in this needs_persist check. This appears safe today because:
- `emit_channel_pending_event!` requires `is_funding_broadcastable() && !channel_pending_event_emitted`: in practice this aligns with `funding_broadcastable.is_some()`
- `emit_initial_channel_ready_event!` requires `is_usable() && !initial_channel_ready_event_emitted`: in practice this aligns with `channel_ready.is_some()`
Both conditions imply has_state_changes = true. However, this coupling is implicit. If those conditions ever diverge, events could land in pending_events without triggering persistence. In process_events_body the !pending_events.is_empty() check provides a safety net, but in get_and_clear_pending_msg_events there is no such fallback.
Consider adding a debug_assert or explicit comment documenting this invariant.
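One way to encode the invariant is a small predicate like the following. The helper name and parameters are illustrative, not LDK's API; it only captures the implication the comment describes.

```rust
// Hypothetical helper capturing the implicit coupling described above:
// events pushed to pending_events during channel resumption must coincide
// with a needs_persist-triggering state change, or they could be lost when
// a caller has no !pending_events.is_empty() fallback.
fn event_persist_invariant_holds(events_emitted: usize, has_state_changes: bool) -> bool {
    events_emitted == 0 || has_state_changes
}
```

In the real code this could sit as a `debug_assert!` right after the resumption call that may emit events.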
```diff
     let _ = self.handle_error(err, counterparty_node_id);
 }

-has_pending_monitor_events
+if needs_persist {
+    NotifyOption::DoPersist
+} else {
+    NotifyOption::SkipPersistHandleEvents
+}
```
Minor: the function now uses two distinct mechanisms for signaling persistence need:
- Return value (`NotifyOption::DoPersist`): for `HTLCEvent`, `HolderForceClosed*`, `CommitmentTxConfirmed`
- Direct flag (`needs_persist_flag.store`): for state changes discovered inside `channel_monitor_updated` → `try_resume_channel_post_monitor_update`
The SkipPersistHandleEvents return here relies on the caller calling event_persist_notifier.notify() so the background processor wakes up and sees the directly-set flag. This works correctly today but is subtle — consider adding a comment at the return site explaining that persistence may have already been requested via needs_persist_flag and that the caller must still call notify() for SkipPersistHandleEvents.
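The interplay the comment relies on can be modeled in miniature (all names assumed; `notified` stands in for `event_persist_notifier.notify()`): the return value alone does not request persistence, but the notify it triggers lets the background processor observe the directly-set flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(PartialEq)]
#[allow(dead_code)]
enum NotifyOption {
    DoPersist,
    SkipPersistHandleEvents,
    SkipPersistNoEvents,
}

// Toy model of the two-mechanism coupling.
struct Model {
    needs_persist_flag: AtomicBool,
    notified: AtomicBool,
}

impl Model {
    fn after_monitor_events(&self, result: NotifyOption) {
        // Even for SkipPersistHandleEvents the caller must notify, so the
        // background processor wakes up and re-reads needs_persist_flag.
        if result != NotifyOption::SkipPersistNoEvents {
            self.notified.store(true, Ordering::Release);
        }
    }

    fn background_pass_persists(&self) -> bool {
        // The woken processor persists iff the flag was set somewhere
        // deeper in the call tree.
        self.notified.load(Ordering::Acquire)
            && self.needs_persist_flag.swap(false, Ordering::AcqRel)
    }
}
```

Dropping the notify on the `SkipPersistHandleEvents` path in this model would leave the flag set but never acted on, which is exactly the subtlety the comment asks to document.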
Review Summary

No new issues found beyond the prior review comments (which remain valid as design notes, not bugs). Verification performed in this re-review:
Prior inline comments status

All three prior inline comments remain valid as design/documentation notes but are not bugs:
```diff
@@ -11001,7 +10648,11 @@ This indicates a bug inside LDK. Please report this error at https://github.com/
 let per_peer_state = self.per_peer_state.read().unwrap();
 let mut peer_state_lock;
 let peer_state_mutex_opt = per_peer_state.get(counterparty_node_id);
-if peer_state_mutex_opt.is_none() { return }
+if peer_state_mutex_opt.is_none() {
+    // Peer is gone; conservatively request persistence.
```
Why? We obviously didn't do anything?
Hm, annoying: this is a leftover from one of the previous versions of this change. Comment removed.
needs a doc comment update that says that this will set the needs-persist flag and will trigger the bp wake if required. Same with try_resume_channel_post_monitor_update, handle_post_monitor_update_chan_resume (which looks like it needs to be updated?).
```rust
    || decode_update_add_htlcs.is_some();

if needs_persist {
    self.needs_persist_flag.store(true, Ordering::Release);
```
Oh oops (a) this needs to happen after we do whatever handling we need in handle_post_monitor_update_chan_resume and (b) we're missing a notify call.
Ok, will go back to a bit more passing back of values. It's not ideal to have those two mechanisms. I'm not sure this is actually incorrect, though, because it is still happening under the global lock and notifier pattern.
I again threaded the bool all the way back so MonitorEvent::Completed still has one place that decides between SkipPersistHandleEvents and DoPersist. If we want the atomic flag setting to happen at the end anyway, we can just as well carry that decision back explicitly instead of splitting it between the top-level NotifyOption return and lower helper side effects.
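The threaded-back shape reads roughly like this (an illustrative call chain with assumed signatures, not LDK's exact functions): each level reports whether it mutated persistence-relevant state, and one top-level site maps that to a `NotifyOption`.

```rust
#[derive(Debug, PartialEq)]
enum NotifyOption {
    DoPersist,
    SkipPersistHandleEvents,
}

// Lowest level: does the actual resumption work and reports whether
// ChannelManager state changed (modeled here as a passthrough).
fn try_resume_channel_post_monitor_update(state_changed: bool) -> bool {
    // ...queue messages, maybe fail/claim HTLCs...
    state_changed
}

fn channel_monitor_updated(state_changed: bool) -> bool {
    try_resume_channel_post_monitor_update(state_changed)
}

// Single decision point for MonitorEvent::Completed.
fn completed_event_notify(state_changed: bool) -> NotifyOption {
    if channel_monitor_updated(state_changed) {
        NotifyOption::DoPersist
    } else {
        NotifyOption::SkipPersistHandleEvents
    }
}
```

Keeping the decision in one place avoids splitting it between a top-level return value and side effects buried in lower helpers.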
ebac69e to 887ee22
d3cffca to 0d5f8e3
Refactor process_pending_monitor_events to return a NotifyOption instead of a bool, allowing callers to distinguish between DoPersist, SkipPersistHandleEvents, and SkipPersistNoEvents. Both call sites in process_events_body and get_and_clear_pending_msg_events are updated accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
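The refactor this commit message describes can be modeled as folding per-event decisions into the strongest `NotifyOption` (a sketch under assumed ordering semantics, not LDK's actual implementation):

```rust
// Assumed ordering: later variants are "stronger". PartialOrd on an enum
// derives by declaration order.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
enum NotifyOption {
    SkipPersistNoEvents,
    SkipPersistHandleEvents,
    DoPersist,
}

fn process_pending_monitor_events(per_event: &[NotifyOption]) -> NotifyOption {
    // Fold per-event results into the strongest requirement; with no
    // events at all there is nothing to handle and nothing to persist.
    per_event
        .iter()
        .copied()
        .fold(NotifyOption::SkipPersistNoEvents, |acc, e| if e > acc { e } else { acc })
}
```

Returning the tri-state instead of a bool is what lets both call sites distinguish "persist", "handle events only", and "nothing happened".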
When process_pending_monitor_events processes only Completed events and the resulting work is limited to message-only monitor completion handling, ChannelManager persistence can be skipped. Completion handling now reports whether it actually mutated ChannelManager state, and process_pending_monitor_events uses that to decide between SkipPersistHandleEvents and DoPersist. AI tools were used in preparing this commit.
0d5f8e3 to 2308073
Since we are pulling ChannelManager persistence into the critical path, we should avoid persisting unnecessarily. When process_pending_monitor_events processes only Completed events, resulting state changes may be recoverable on restart and do not require persistence.