Replace dual-sync-async persistence panic with Watch contract by joostjager · Pull Request #4436 · lightningdevkit/rust-lightning

joostjager · 2026-02-24T10:13:27Z

Fix false panic by relaxing the Watch contract from a global "pick one mode" restriction to a per-channel rule: don't return Completed while prior InProgress updates are still pending for that channel.

ldk-reviews-bot · 2026-02-24T10:13:30Z

👋 I see @TheBlueMatt was un-assigned.
If you'd like another reviewer assignment, please click here.

joostjager · 2026-02-24T10:26:56Z

I didn't add a dedicated test for the InProgress -> complete -> Completed progression since it's already well-covered by existing tests that exercise this flow without disabling the contract check: test_monitor_update_fail_cs, test_monitor_update_fail_no_rebroadcast, test_monitor_update_fail_reestablish, monitor_failed_no_reestablish_response, test_monitor_update_fail_claim, test_monitor_update_on_pending_forwards, monitor_update_claim_fail_no_response, do_during_funding_monitor_fail, test_path_paused_mpp, test_temporary_error_during_shutdown, double_temp_error, test_manager_persisted_post_outbound_edge_holding_cell.

The ChainMonitor override path (where update_monitor fails and the return is overridden to InProgress) is also exercised without disabling the check by: test_monitor_and_persister_update_fail, test_concurrent_monitor_claim, test_update_err_monitor_lockdown.

codecov · 2026-02-24T13:07:42Z

Codecov Report

❌ Patch coverage is 97.22222% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.18%. Comparing base (43066b7) to head (f8a955c).
⚠️ Report is 15 commits behind head on main.

Files with missing lines	Patch %	Lines
lightning/src/ln/channelmanager.rs	90.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4436      +/-   ##
==========================================
- Coverage   86.20%   86.18%   -0.03%     
==========================================
  Files         161      160       -1     
  Lines      107332   107463     +131     
  Branches   107332   107463     +131     
==========================================
+ Hits        92525    92613      +88     
- Misses      12192    12224      +32     
- Partials     2615     2626      +11

Flag	Coverage Δ
tests	`86.18% <97.22%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

TheBlueMatt

There's more context in the PR description than the commit message, please update the commit message with the context (PR descriptions should almost always just be the commit messages themselves!). Also, IIRC the chanmon_consistency target is conservative to avoid violating the old API, we should update it to allow returning Completed more often to match the API changes here.

ldk-reviews-bot · 2026-02-24T14:07:55Z

👋 The first review has been submitted!

Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer.

joostjager · 2026-02-24T17:29:41Z

Updated fuzzer

TheBlueMatt · 2026-02-24T18:15:43Z

fuzz/src/chanmon_consistency.rs

 			// bit-twiddling mutations to have similar effects. This is probably overkill, but no
 			// harm in doing so.
 			0x00 => {
 				*mon_style[0].borrow_mut() = ChannelMonitorUpdateStatus::InProgress;


We can probably drop mon_style again, no? Given we added it in the being-basically-reverted commit?

It's currently only used to remember what style to use after a node reload. But I could read from update_ret too and set that again on the new persister.

We could also reduce the number of command bytes by defining a command to toggle the mode.

joostjager · 2026-02-24T19:25:45Z

Pushed two commits that remove mon_style and also save command bytes by toggling the style.

The only functional change is that nodes now always start in sync mode, but the fuzzer can immediately toggle it to async. On reload, the style is carried over.

TheBlueMatt

DId you run the fuzzer for a while? I'm not sure that we shouldn't be able to hit a failure like the below.

TheBlueMatt · 2026-02-24T20:21:27Z

lightning/src/chain/mod.rs

-	/// [`Persist`]/[`Watch`] without first restarting.
+	/// A [`Watch`] implementation must not return this for a channel update if there are still
+	/// pending [`InProgress`] updates for that channel. That is, an update can only be considered
+	/// complete once all prior updates have also completed.


Hmm, thinking about this a bit more there's a weird race. Async updates are "completed" only after release_pending_monitor_events returns the completed update, but also only after it has been processed, which may be some arbitrary and indeterminate amount of time later.

For the ChainMonitor return case, this is fine (though we need to update ChainMonitor - we should probably be returning InProgress for any new update that comes after one for which we have to return a spurious InProgress. Do you already intend to do that in a followup? should we not do it here?). But for the general API its pretty weird - you can in theory return a Completed after returning InProgress as long as you want some indeterminate and arbitrary amount of time, so in practice you can't...

We could fix this by adding some mutual exclusion in channelmanager.rs where we wait before processing Completed monitor updates until after any pending MonitorEvents are processed - this should be basically zero cost as ~no one is going to be using mixed-async-sync persistence in practice so we'll rarely if ever have any contention.

On the flip side, we could say that implementations are allowed to flip from Completed to InProgress for a channel, but never back (without a restart). Its more complicated to document, but it captures what we need to allow the ChainMonitor behavior.

implementations are allowed to flip from Completed to InProgress for a channel, but never back

This sounds like the easier way. There is no need to support mixing sync and async for Persist implementations. I do wonder now whether my initial version #4435 wasn't just sufficient. Moving the mixed mode check to the chainmonitor level and keeping everything else the same.

Fuzzer run was planned for tonight.

The per-channel mode latch turned out to be not ideal because it required either a new lock on ChannelManager or threading the latch through deeply nested call chains via an already-held lock, all to accommodate ChainMonitor internally using InProgress creatively to signal a monitor update failure rather than async persistence.

Perhaps it is better to make this case explicit in the API? #4445

You were definitely right about the fuzzer. Woke up to lots of errors.

Returning Err from Watch::update_channel doesn't work for a future remote ChainMonitor setup where the ChannelMonitor lives in a separate process. In that case, update_channel returns InProgress and you only find out asynchronously whether the monitor update succeeded. The API was previously changed away from returning Result for this reason (as explained by @TheBlueMatt)

Went with the option to just document that switching back from async to sync is impractical.

Also reverted the fuzzer changes since the fuzzer freely switching between sync and async modes per node is impractical for the same reason.

Changes pushed.

TheBlueMatt

Don't we still need to update ChainMonitor? IIUC it can currently switch from Completed to InProgress if a monitor update application Errs, but then it'll immediately switch back for a later update to the same channel, which is still not allowed.

TheBlueMatt · 2026-02-26T12:53:40Z

lightning/src/chain/mod.rs

+	/// [`InProgress`] back to [`Completed`] for a given channel is only valid after all prior
+	/// updates have finished.
 	///
+	/// Note that an update is only considered complete from [`ChannelManager`]'s perspective once


IMO its not worth getting into this level of detail. Better to just say that you canot switch from InProgress to Completed without a restart, but you can go the other way.

It's updated. I'd have preferred to keep the more extensive docs though. The interaction between persistence completion, MonitorEvent::Completed processing, and switching update status isn't completely straightforward, and I think spelling it out is helpful.

joostjager · 2026-02-26T20:42:25Z

Discussed offline that attention is needed for the case where ChainMonitor override a result to InProgress after which a further update is supposed to be applied to the monitor and expected to succeed (preimage as an example).

joostjager · 2026-02-26T20:43:17Z

Furthermore there also might be a problem with switching modes after a restart. If ChannelManager was saved with in_flight_monitor_updates, a Completed result after the restart may fail the mode check...

TheBlueMatt · 2026-03-03T22:27:04Z

Furthermore there also might be a problem with switching modes after a restart. If ChannelManager was saved with in_flight_monitor_updates, a Completed result after the restart may fail the mode check...

Ugh, yea, we'll have to explicitly allow this. Its fine, because those updates will get applied on startup before we generate any new ones, just awkward.

lightning/src/chain/chainmonitor.rs

lightning/src/ln/channelmanager.rs

ldk-claude-review-bot · 2026-03-16T09:19:22Z

Review

The deferral mechanism and locking strategy are well thought out overall. Two items worth attention:

Lock granularity in deferred drain loop (inline comment): pending_monitor_updates is locked/unlocked per iteration in release_pending_monitor_events. Holding it for the entire loop is both more efficient and eliminates a theoretical race with channel_monitor_updated.
Removed update_res.is_err() fallback (inline comment): The old InProgress return on any update_monitor error is replaced by the no_further_updates_allowed() deferral. This covers the main case but changes behavior for rare internal monitor update failures. Practical impact is low.

🤖 Generated with Claude Code

lightning/src/chain/chainmonitor.rs

lightning/src/ln/channelmanager.rs

ldk-claude-review-bot · 2026-03-16T09:24:27Z

Automated Review

Two issues found, both in chainmonitor.rs update_channel:

Dropped update_res.is_err() override (bug) — The old code returned InProgress when update_monitor() failed, freezing the channel. The replacement no_further_updates_allowed() check doesn't cover this case for non-post-close channels, potentially reporting Completed to ChannelManager when the monitor update actually failed.
Race between channel_monitor_updated and deferred drain loop — The pending_monitor_updates lock is acquired/released per drain iteration, while channel_monitor_updated can run concurrently from another thread. This could produce duplicate Completed events or premature completion signals.

See inline comments for details.

ldk-claude-review-bot · 2026-03-16T09:58:20Z

lightning/src/chain/chainmonitor.rs

+			for update_id in deferred.drain(..) {
+				pending.retain(|id| *id != update_id);
+				let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending);
+				if !monitor_is_pending_updates {
+					let funding_txo = monitor_state.monitor.get_funding_txo();
+					let channel_id = monitor_state.monitor.channel_id();
+					pending_monitor_events.push((
+						funding_txo,
+						channel_id,
+						vec![MonitorEvent::Completed {
+							funding_txo,
+							channel_id,
+							monitor_update_id: monitor_state.monitor.get_latest_update_id(),
+						}],
+						monitor_state.monitor.get_counterparty_node_id(),
+					));
+				}


Potential duplicate MonitorEvent::Completed: if channel_monitor_updated (public API, or via get_and_clear_completed_updates) processes a deferred update_id before this drain runs, that update will already have been removed from pending and a Completed event will already have been emitted. Then this drain loop will call retain (no-op since the id is gone), see pending is empty, and emit a second Completed for the same monitor.

In practice this requires a buggy Persist implementation reporting the same update both synchronously (returning Completed) and asynchronously (via get_and_clear_completed_updates), or external misuse of force_channel_monitor_updated. The impact is low since ChannelManager::channel_monitor_updated handles duplicate Completed events idempotently. But a defensive check would be cheap:

Suggested change

for update_id in deferred.drain(..) {

pending.retain(|id| *id != update_id);

let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending);

if !monitor_is_pending_updates {

let funding_txo = monitor_state.monitor.get_funding_txo();

let channel_id = monitor_state.monitor.channel_id();

pending_monitor_events.push((

funding_txo,

channel_id,

vec![MonitorEvent::Completed {

funding_txo,

channel_id,

monitor_update_id: monitor_state.monitor.get_latest_update_id(),

}],

monitor_state.monitor.get_counterparty_node_id(),

));

}

for update_id in deferred.drain(..) {

let prev_len = pending.len();

pending.retain(|id| *id != update_id);

if prev_len == pending.len() {

// Already resolved by channel_monitor_updated; skip to avoid

// duplicate MonitorEvent::Completed.

continue;

}

let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending);

if !monitor_is_pending_updates {

let funding_txo = monitor_state.monitor.get_funding_txo();

let channel_id = monitor_state.monitor.channel_id();

pending_monitor_events.push((

funding_txo,

channel_id,

vec![MonitorEvent::Completed {

funding_txo,

channel_id,

monitor_update_id: monitor_state.monitor.get_latest_update_id(),

}],

monitor_state.monitor.get_counterparty_node_id(),

));

}

}

Added in post-rebase push diff

lightning/src/chain/mod.rs

ldk-claude-review-bot · 2026-03-16T09:58:43Z

joostjager · 2026-03-19T07:31:10Z

Rebased after deferred write merge #4351

Fixed Claude comment #4436 (comment)

Commit 0760f99 ("Disallow dual-sync-async persistence without restarting") added a panic in non-test builds when a Persist implementation returns both Completed and InProgress from the same ChannelManager instance. However, this check runs against the status that ChainMonitor returns to ChannelManager, not the raw Persist result. When ChannelMonitor::update_monitor fails (e.g. a counterparty commitment_signed arrives after a funding spend confirms), ChainMonitor persists the full monitor successfully but overrides the return value to InProgress. If the user's Persist impl only ever returns Completed, this override triggers a false mode-mismatch panic. This replaces the panic with a per-channel contract at the Watch trait level: a Watch implementation must not return Completed for a channel update while prior InProgress updates are still pending. Switching from Completed to InProgress is always allowed, but switching back is impractical because the Watch implementation cannot observe when ChannelManager has finished processing a MonitorEvent::Completed. The documentation on ChannelMonitorUpdateStatus is updated to describe these rules. The mode tracking and panic checks from 0760f99 are removed and replaced with a panic that validates the new contract directly on the in-flight update state. Legacy tests that switch the persister between modes mid-flight can opt out via Node::disable_monitor_completeness_assertion(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a regression test that reproduces the panic when a commitment_signed is processed after the counterparty commitment transaction has confirmed. The ChannelMonitor's no_further_updates_allowed() returns true, causing update_monitor to fail, which ChainMonitor overrides to InProgress. A subsequent preimage claim returning Completed then triggers the per-channel assertion that Completed must not follow InProgress. AI tools were used in preparing this commit.

joostjager · 2026-03-19T13:52:49Z

Rebased again

TheBlueMatt · 2026-03-19T20:25:13Z

lightning/src/chain/chainmonitor.rs

+			// completion actions resolve once the ChannelForceClosed update (generated during
+			// force-close processing) also gets deferred and resolved in the next event cycle.
+			let mut pending = monitor_state.pending_monitor_updates.lock().unwrap();
+			for update_id in deferred.drain(..) {


Can't we remove this entire hunk if we simply push the MonitorEvent::Completed into self.pending_monitor_events immediately and then handle self.pending_monitor_events events after the per-monitor fetching above?

I don't think that works, because when we defer, there may also be other InProgress events still?

But doesn't that require a dual-sync-async Persist impl? ie because an earlier persist would have to have gone InProgress due to an async persister and then a new (failed) one would have had to gotten an immediate Completed?

Hm true. I vaguely remember there was a reason, but maybe it was indeed only valid without the single mode assumption.

Code simplifies: https://github.com/lightningdevkit/rust-lightning/compare/533f37fb4859c8d093dc74e4d13a6f291f689697..f8a955c50d5d5ac28a1c5bb94ae820b17d315c03

When no_further_updates_allowed() is true and the persister returns Completed, ChainMonitor now overrides the return to InProgress and pushes a MonitorEvent::Completed directly into pending_monitor_events. In release_pending_monitor_events, these deferred completions are appended after per-monitor events, so ChannelManager sees the force-close MonitorEvents before the completion. This eliminates phantom InProgress entries that would never complete: previously, a rejected pre-close update (e.g. commitment_signed arriving after funding spend) returned InProgress with no completion path, blocking MonitorUpdateCompletionActions (PaymentClaimed, PaymentForwarded) indefinitely. A subsequent post-close update returning Completed would then violate the in-order completion invariant. AI tools were used in preparing this commit.

ldk-claude-review-bot · 2026-03-20T12:34:22Z

lightning/src/chain/chainmonitor.rs

+		// Drain pending_monitor_events (which includes deferred post-close
+		// completions) after per-monitor events so that force-close
+		// MonitorEvents are processed by ChannelManager first.
+		pending_monitor_events.extend(self.pending_monitor_events.lock().unwrap().split_off(0));


The reordering here affects all async persist completions, not just the new deferred post-close ones. Previously, pending_monitor_events (which includes Completed events from channel_monitor_updated / async persist callbacks) was drained before per-monitor events. Now it's drained after.

This means if an async persist completion and a force-close MonitorEvent (e.g., CommitmentTxConfirmed) are both pending in the same release_pending_monitor_events call, the force-close is now processed first by ChannelManager. Previously, the persist completion would have been processed first, potentially unfreezing the channel and allowing queued messages to flow before the force-close.

This is likely the intended behavior for the deferred-completion case, but it's a broader behavioral change that also affects the normal async-persist path. Worth documenting in the commit message or a code comment that this ordering change is intentional and applies to all channel_monitor_updated completions.

It is intended behavior indeed.

ldk-claude-review-bot · 2026-03-20T12:34:32Z

lightning/src/ln/channelmanager.rs

+	#[cfg(test)]
+	pub(crate) skip_monitor_update_assertion: AtomicBool,


The skip_monitor_update_assertion field is #[cfg(test)] only, but the assertion itself (line 10389) is unconditional — it fires in all builds. The old monitor_update_type panic was #[cfg(not(any(test, feature = "_externalize_tests")))], meaning it was disabled for both test and _externalize_tests builds.

With this change, builds using _externalize_tests (external crate tests) now get a panic they didn't have before and can't disable, since skip_monitor_update_assertion doesn't exist in non-#[cfg(test)] builds. If any external tests mix InProgress and Completed on the same channel (as the internal tests being patched do), they'll hit this new panic.

Consider either gating the field/skip-path on #[cfg(any(test, feature = "_externalize_tests"))] to match the old behavior, or documenting this as an intentional tightening.

I think this is fine, we don't want to encourage the mixing.

TheBlueMatt

This is now really straightforward. Gonna go ahead and just land.

joostjager force-pushed the revert-mixed-mode-check branch from 721f491 to 9aff3f2 Compare February 24, 2026 10:19

joostjager mentioned this pull request Feb 24, 2026

Move ChannelMonitorUpdateStatus mode check from ChannelManager to ChainMonitor #4435

Closed

joostjager force-pushed the revert-mixed-mode-check branch from 9aff3f2 to 93a9d79 Compare February 24, 2026 10:26

joostjager force-pushed the revert-mixed-mode-check branch 2 times, most recently from c20c819 to 77a6a30 Compare February 24, 2026 11:07

joostjager marked this pull request as ready for review February 24, 2026 12:07

joostjager requested a review from TheBlueMatt February 24, 2026 12:07

TheBlueMatt reviewed Feb 24, 2026

View reviewed changes

joostjager force-pushed the revert-mixed-mode-check branch from 77a6a30 to 218e89a Compare February 24, 2026 15:11

joostjager requested a review from TheBlueMatt February 24, 2026 17:29

TheBlueMatt reviewed Feb 24, 2026

View reviewed changes

joostjager force-pushed the revert-mixed-mode-check branch from 218e89a to 7e2b509 Compare February 24, 2026 19:24

TheBlueMatt reviewed Feb 24, 2026

View reviewed changes

joostjager force-pushed the revert-mixed-mode-check branch 2 times, most recently from c37be93 to 4cf848c Compare February 26, 2026 09:10

joostjager added the weekly goal Someone wants to land this this week label Feb 26, 2026

joostjager requested a review from TheBlueMatt February 26, 2026 09:46

TheBlueMatt reviewed Feb 26, 2026

View reviewed changes

joostjager force-pushed the revert-mixed-mode-check branch from 4cf848c to 0d1e978 Compare February 26, 2026 13:06

joostjager added this to Weekly Goals Feb 26, 2026

joostjager self-assigned this Feb 26, 2026