Replace dual-sync-async persistence panic with Watch contract#4436
Replace dual-sync-async persistence panic with Watch contract#4436TheBlueMatt merged 3 commits intolightningdevkit:mainfrom
Conversation
|
👋 I see @TheBlueMatt was un-assigned. |
721f491 to
9aff3f2
Compare
9aff3f2 to
93a9d79
Compare
|
I didn't add a dedicated test for the InProgress -> complete -> Completed progression since it's already well-covered by existing tests that exercise this flow without disabling the contract check: The |
c20c819 to
77a6a30
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4436 +/- ##
==========================================
- Coverage 86.20% 86.18% -0.03%
==========================================
Files 161 160 -1
Lines 107332 107463 +131
Branches 107332 107463 +131
==========================================
+ Hits 92525 92613 +88
- Misses 12192 12224 +32
- Partials 2615 2626 +11
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
TheBlueMatt
left a comment
There was a problem hiding this comment.
There's more context in the PR description than the commit message, please update the commit message with the context (PR descriptions should almost always just be the commit messages themselves!). Also, IIRC the chanmon_consistency target is conservative to avoid violating the old API, we should update it to allow returning Completed more often to match the API changes here.
|
👋 The first review has been submitted! Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer. |
77a6a30 to
218e89a
Compare
|
Updated fuzzer |
| // bit-twiddling mutations to have similar effects. This is probably overkill, but no | ||
| // harm in doing so. | ||
| 0x00 => { | ||
| *mon_style[0].borrow_mut() = ChannelMonitorUpdateStatus::InProgress; |
There was a problem hiding this comment.
We can probably drop mon_style again, no? Given we added it in the being-basically-reverted commit?
There was a problem hiding this comment.
It's currently only used to remember what style to use after a node reload. But I could read from update_ret too and set that again on the new persister.
We could also reduce the number of command bytes by defining a command to toggle the mode.
218e89a to
7e2b509
Compare
|
Pushed two commits that remove The only functional change is that nodes now always start in sync mode, but the fuzzer can immediately toggle it to async. On reload, the style is carried over. |
TheBlueMatt
left a comment
There was a problem hiding this comment.
DId you run the fuzzer for a while? I'm not sure that we shouldn't be able to hit a failure like the below.
lightning/src/chain/mod.rs
Outdated
| /// [`Persist`]/[`Watch`] without first restarting. | ||
| /// A [`Watch`] implementation must not return this for a channel update if there are still | ||
| /// pending [`InProgress`] updates for that channel. That is, an update can only be considered | ||
| /// complete once all prior updates have also completed. |
There was a problem hiding this comment.
Hmm, thinking about this a bit more there's a weird race. Async updates are "completed" only after release_pending_monitor_events returns the completed update, but also only after it has been processed, which may be some arbitrary and indeterminate amount of time later.
For the ChainMonitor return case, this is fine (though we need to update ChainMonitor - we should probably be returning InProgress for any new update that comes after one for which we have to return a spurious InProgress. Do you already intend to do that in a followup? should we not do it here?). But for the general API its pretty weird - you can in theory return a Completed after returning InProgress as long as you want some indeterminate and arbitrary amount of time, so in practice you can't...
We could fix this by adding some mutual exclusion in channelmanager.rs where we wait before processing Completed monitor updates until after any pending MonitorEvents are processed - this should be basically zero cost as ~no one is going to be using mixed-async-sync persistence in practice so we'll rarely if ever have any contention.
On the flip side, we could say that implementations are allowed to flip from Completed to InProgress for a channel, but never back (without a restart). Its more complicated to document, but it captures what we need to allow the ChainMonitor behavior.
There was a problem hiding this comment.
implementations are allowed to flip from Completed to InProgress for a channel, but never back
This sounds like the easier way. There is no need to support mixing sync and async for Persist implementations. I do wonder now whether my initial version #4435 wasn't just sufficient. Moving the mixed mode check to the chainmonitor level and keeping everything else the same.
Fuzzer run was planned for tonight.
There was a problem hiding this comment.
The per-channel mode latch turned out to be not ideal because it required either a new lock on ChannelManager or threading the latch through deeply nested call chains via an already-held lock, all to accommodate ChainMonitor internally using InProgress creatively to signal a monitor update failure rather than async persistence.
Perhaps it is better to make this case explicit in the API? #4445
You were definitely right about the fuzzer. Woke up to lots of errors.
There was a problem hiding this comment.
Returning Err from Watch::update_channel doesn't work for a future remote ChainMonitor setup where the ChannelMonitor lives in a separate process. In that case, update_channel returns InProgress and you only find out asynchronously whether the monitor update succeeded. The API was previously changed away from returning Result for this reason (as explained by @TheBlueMatt)
Went with the option to just document that switching back from async to sync is impractical.
Also reverted the fuzzer changes since the fuzzer freely switching between sync and async modes per node is impractical for the same reason.
Changes pushed.
c37be93 to
4cf848c
Compare
TheBlueMatt
left a comment
There was a problem hiding this comment.
Don't we still need to update ChainMonitor? IIUC it can currently switch from Completed to InProgress if a monitor update application Errs, but then it'll immediately switch back for a later update to the same channel, which is still not allowed.
lightning/src/chain/mod.rs
Outdated
| /// [`InProgress`] back to [`Completed`] for a given channel is only valid after all prior | ||
| /// updates have finished. | ||
| /// | ||
| /// Note that an update is only considered complete from [`ChannelManager`]'s perspective once |
There was a problem hiding this comment.
IMO its not worth getting into this level of detail. Better to just say that you canot switch from InProgress to Completed without a restart, but you can go the other way.
There was a problem hiding this comment.
It's updated. I'd have preferred to keep the more extensive docs though. The interaction between persistence completion, MonitorEvent::Completed processing, and switching update status isn't completely straightforward, and I think spelling it out is helpful.
4cf848c to
0d1e978
Compare
|
Discussed offline that attention is needed for the case where |
|
Furthermore there also might be a problem with switching modes after a restart. If |
Ugh, yea, we'll have to explicitly allow this. Its fine, because those updates will get applied on startup before we generate any new ones, just awkward. |
ReviewThe deferral mechanism and locking strategy are well thought out overall. Two items worth attention:
🤖 Generated with Claude Code |
Automated ReviewTwo issues found, both in
See inline comments for details. |
lightning/src/chain/chainmonitor.rs
Outdated
| for update_id in deferred.drain(..) { | ||
| pending.retain(|id| *id != update_id); | ||
| let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending); | ||
| if !monitor_is_pending_updates { | ||
| let funding_txo = monitor_state.monitor.get_funding_txo(); | ||
| let channel_id = monitor_state.monitor.channel_id(); | ||
| pending_monitor_events.push(( | ||
| funding_txo, | ||
| channel_id, | ||
| vec![MonitorEvent::Completed { | ||
| funding_txo, | ||
| channel_id, | ||
| monitor_update_id: monitor_state.monitor.get_latest_update_id(), | ||
| }], | ||
| monitor_state.monitor.get_counterparty_node_id(), | ||
| )); | ||
| } |
There was a problem hiding this comment.
Potential duplicate MonitorEvent::Completed: if channel_monitor_updated (public API, or via get_and_clear_completed_updates) processes a deferred update_id before this drain runs, that update will already have been removed from pending and a Completed event will already have been emitted. Then this drain loop will call retain (no-op since the id is gone), see pending is empty, and emit a second Completed for the same monitor.
In practice this requires a buggy Persist implementation reporting the same update both synchronously (returning Completed) and asynchronously (via get_and_clear_completed_updates), or external misuse of force_channel_monitor_updated. The impact is low since ChannelManager::channel_monitor_updated handles duplicate Completed events idempotently. But a defensive check would be cheap:
| for update_id in deferred.drain(..) { | |
| pending.retain(|id| *id != update_id); | |
| let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending); | |
| if !monitor_is_pending_updates { | |
| let funding_txo = monitor_state.monitor.get_funding_txo(); | |
| let channel_id = monitor_state.monitor.channel_id(); | |
| pending_monitor_events.push(( | |
| funding_txo, | |
| channel_id, | |
| vec![MonitorEvent::Completed { | |
| funding_txo, | |
| channel_id, | |
| monitor_update_id: monitor_state.monitor.get_latest_update_id(), | |
| }], | |
| monitor_state.monitor.get_counterparty_node_id(), | |
| )); | |
| } | |
| for update_id in deferred.drain(..) { | |
| let prev_len = pending.len(); | |
| pending.retain(|id| *id != update_id); | |
| if prev_len == pending.len() { | |
| // Already resolved by channel_monitor_updated; skip to avoid | |
| // duplicate MonitorEvent::Completed. | |
| continue; | |
| } | |
| let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending); | |
| if !monitor_is_pending_updates { | |
| let funding_txo = monitor_state.monitor.get_funding_txo(); | |
| let channel_id = monitor_state.monitor.channel_id(); | |
| pending_monitor_events.push(( | |
| funding_txo, | |
| channel_id, | |
| vec![MonitorEvent::Completed { | |
| funding_txo, | |
| channel_id, | |
| monitor_update_id: monitor_state.monitor.get_latest_update_id(), | |
| }], | |
| monitor_state.monitor.get_counterparty_node_id(), | |
| )); | |
| } | |
| } |
|
|
f90575f to
488f28e
Compare
|
Rebased after deferred write merge #4351 Fixed Claude comment #4436 (comment) |
c47f677 to
c4585b2
Compare
Commit 0760f99 ("Disallow dual-sync-async persistence without restarting") added a panic in non-test builds when a Persist implementation returns both Completed and InProgress from the same ChannelManager instance. However, this check runs against the status that ChainMonitor returns to ChannelManager, not the raw Persist result. When ChannelMonitor::update_monitor fails (e.g. a counterparty commitment_signed arrives after a funding spend confirms), ChainMonitor persists the full monitor successfully but overrides the return value to InProgress. If the user's Persist impl only ever returns Completed, this override triggers a false mode-mismatch panic. This replaces the panic with a per-channel contract at the Watch trait level: a Watch implementation must not return Completed for a channel update while prior InProgress updates are still pending. Switching from Completed to InProgress is always allowed, but switching back is impractical because the Watch implementation cannot observe when ChannelManager has finished processing a MonitorEvent::Completed. The documentation on ChannelMonitorUpdateStatus is updated to describe these rules. The mode tracking and panic checks from 0760f99 are removed and replaced with a panic that validates the new contract directly on the in-flight update state. Legacy tests that switch the persister between modes mid-flight can opt out via Node::disable_monitor_completeness_assertion(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a regression test that reproduces the panic when a commitment_signed is processed after the counterparty commitment transaction has confirmed. The ChannelMonitor's no_further_updates_allowed() returns true, causing update_monitor to fail, which ChainMonitor overrides to InProgress. A subsequent preimage claim returning Completed then triggers the per-channel assertion that Completed must not follow InProgress. AI tools were used in preparing this commit.
c4585b2 to
533f37f
Compare
|
Rebased again |
lightning/src/chain/chainmonitor.rs
Outdated
| // completion actions resolve once the ChannelForceClosed update (generated during | ||
| // force-close processing) also gets deferred and resolved in the next event cycle. | ||
| let mut pending = monitor_state.pending_monitor_updates.lock().unwrap(); | ||
| for update_id in deferred.drain(..) { |
There was a problem hiding this comment.
Can't we remove this entire hunk if we simply push the MonitorEvent::Completed into self.pending_monitor_events immediately and then handle self.pending_monitor_events events after the per-monitor fetching above?
There was a problem hiding this comment.
I don't think that works, because when we defer, there may also be other InProgress events still?
There was a problem hiding this comment.
But doesn't that require a dual-sync-async Persist impl? ie because an earlier persist would have to have gone InProgress due to an async persister and then a new (failed) one would have had to gotten an immediate Completed?
There was a problem hiding this comment.
Hm true. I vaguely remember there was a reason, but maybe it was indeed only valid without the single mode assumption.
Code simplifies: https://github.com/lightningdevkit/rust-lightning/compare/533f37fb4859c8d093dc74e4d13a6f291f689697..f8a955c50d5d5ac28a1c5bb94ae820b17d315c03
When no_further_updates_allowed() is true and the persister returns Completed, ChainMonitor now overrides the return to InProgress and pushes a MonitorEvent::Completed directly into pending_monitor_events. In release_pending_monitor_events, these deferred completions are appended after per-monitor events, so ChannelManager sees the force-close MonitorEvents before the completion. This eliminates phantom InProgress entries that would never complete: previously, a rejected pre-close update (e.g. commitment_signed arriving after funding spend) returned InProgress with no completion path, blocking MonitorUpdateCompletionActions (PaymentClaimed, PaymentForwarded) indefinitely. A subsequent post-close update returning Completed would then violate the in-order completion invariant. AI tools were used in preparing this commit.
533f37f to
f8a955c
Compare
| // Drain pending_monitor_events (which includes deferred post-close | ||
| // completions) after per-monitor events so that force-close | ||
| // MonitorEvents are processed by ChannelManager first. | ||
| pending_monitor_events.extend(self.pending_monitor_events.lock().unwrap().split_off(0)); |
There was a problem hiding this comment.
The reordering here affects all async persist completions, not just the new deferred post-close ones. Previously, pending_monitor_events (which includes Completed events from channel_monitor_updated / async persist callbacks) was drained before per-monitor events. Now it's drained after.
This means if an async persist completion and a force-close MonitorEvent (e.g., CommitmentTxConfirmed) are both pending in the same release_pending_monitor_events call, the force-close is now processed first by ChannelManager. Previously, the persist completion would have been processed first, potentially unfreezing the channel and allowing queued messages to flow before the force-close.
This is likely the intended behavior for the deferred-completion case, but it's a broader behavioral change that also affects the normal async-persist path. Worth documenting in the commit message or a code comment that this ordering change is intentional and applies to all channel_monitor_updated completions.
There was a problem hiding this comment.
It is intended behavior indeed.
| #[cfg(test)] | ||
| pub(crate) skip_monitor_update_assertion: AtomicBool, |
There was a problem hiding this comment.
The skip_monitor_update_assertion field is #[cfg(test)] only, but the assertion itself (line 10389) is unconditional — it fires in all builds. The old monitor_update_type panic was #[cfg(not(any(test, feature = "_externalize_tests")))], meaning it was disabled for both test and _externalize_tests builds.
With this change, builds using _externalize_tests (external crate tests) now get a panic they didn't have before and can't disable, since skip_monitor_update_assertion doesn't exist in non-#[cfg(test)] builds. If any external tests mix InProgress and Completed on the same channel (as the internal tests being patched do), they'll hit this new panic.
Consider either gating the field/skip-path on #[cfg(any(test, feature = "_externalize_tests"))] to match the old behavior, or documenting this as an intentional tightening.
There was a problem hiding this comment.
I think this is fine, we don't want to encourage the mixing.
TheBlueMatt
left a comment
There was a problem hiding this comment.
This is now really straightforward. Gonna go ahead and just land.
Fix false panic by relaxing the
Watchcontract from a global "pick one mode" restriction to a per-channel rule: don't returnCompletedwhile priorInProgressupdates are still pending for that channel.