-
Notifications
You must be signed in to change notification settings - Fork 432
Defer ChainMonitor updates and persistence to flush() #4345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Defer ChainMonitor updates and persistence to flush() #4345
Conversation
|
👋 Hi! I see this is a draft PR. |
|
Added a DeferredChainMonitor wrapper instead of modifying ChainMonitor directly. The wrapper intercepts watch_channel and update_channel calls, queues them, and returns InProgress. When flush is called, it processes the queued operations and persists them in the correct order after ChannelManager persistence. This approach keeps ChainMonitor unchanged so that existing tests which expect synchronous behavior continue to work without modification. Only the background processor and production code paths use the deferred wrapper while the test suite can keep using ChainMonitor directly. |
36a8b33 to
73c0a66
Compare
|
Initially attempted to implement this as a thin adapter/wrapper that would sit between the ChannelManager and an existing ChainMonitor, forwarding calls while deferring the Watch operations. However, when integrating with ldk-node, this approach quickly ran into Rust ownership and lifetime issues since it required keeping both the original ChainMonitor and the wrapper around simultaneously. The current implementation takes a simpler approach where DeferredChainMonitor owns its own ChainMonitor internally and implements Deref to it, making it a complete drop-in replacement that can be instantiated with the same parameters as ChainMonitor while exposing all the same traits and methods. |
5bd0ea3 to
0c005d0
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4345 +/- ##
==========================================
- Coverage 86.53% 86.08% -0.45%
==========================================
Files 158 157 -1
Lines 103190 102897 -293
Branches 103190 102897 -293
==========================================
- Hits 89300 88584 -716
- Misses 11469 11807 +338
- Partials 2421 2506 +85
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
0c005d0 to
bc1b327
Compare
Introduce a `DeferredChainMonitor` wrapper around `ChainMonitor` that queues `watch_channel` and `update_channel` operations, returning `InProgress` until `flush()` is called. This enables batched persistence of monitor updates after `ChannelManager` persistence, ensuring correct ordering where the `ChannelManager` state is never ahead of the monitor state on restart. Key changes: - `DeferredChainMonitor` queues monitor operations and returns `InProgress` - Calling `flush()` applies pending operations and persists monitors - All `ChainMonitor` traits (Listen, Confirm, EventsProvider, etc.) are passed through, allowing drop-in replacement - Background processor updated to capture pending count before `ChannelManager` persistence, then flush after persistence completes Includes comprehensive tests covering the full channel lifecycle with payment flows using `DeferredChainMonitor`. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The test assumed `read_dir` would return files in a specific order (`.tmp` files first), but the Rust documentation states: "The order in which this iterator returns entries is platform and filesystem dependent." When the real monitor file was returned before a `.tmp` file, the assertion `mons.next().is_none()` would fail. Fix by adding a helper function that filters out `.tmp` files entirely, making the tests robust to any file ordering. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
db134e7 to
d142660
Compare
Summary
Introduce
DeferredChainMonitor, a wrapper aroundChainMonitorthat queueswatch_channelandupdate_channeloperations, returningInProgressuntilflush()is called. This enables batched persistence of monitor updates afterChannelManagerpersistence, ensuring correct ordering where theChannelManagerstate is never ahead of the monitor state on restart.The Problem
There's a race condition that can cause channel force closures: if the node crashes after writing channel monitors but before writing the channel manager, the monitors will be ahead of the manager on restart. This can lead to state desync and force closures.
The Solution
By deferring monitor writes until after the channel manager is persisted (via
flush()), we ensure the manager is always at least as up-to-date as the monitors.Key changes:
DeferredChainMonitorqueues monitor operations and returnsInProgressflush()applies pending operations and persists monitorsChainMonitortraits (Listen,Confirm,EventsProvider, etc.) are passed through, allowing drop-in replacementChannelManagerpersistence, then flush after persistence completesAlternative Designs Considered
Several approaches were explored to solve the monitor/manager persistence ordering problem:
1. Queue at KVStore level (#4310)
Introduces a
QueuedKVStoreSyncwrapper that queues all writes in memory, committing them in a single batch at chokepoints where data leaves the system (get_and_clear_pending_msg_events,get_and_clear_pending_events). This approach aims for true atomic multi-key writes but requires KVStore backends that support transactions (e.g., SQLite) - filesystem backends cannot achieve full atomicity.Trade-offs: Most general solution but requires changes to persistence boundaries and cannot fully close the desync gap with filesystem storage.
2. Queue at Persister level (#4317)
Updates
MonitorUpdatingPersisterto queue persist operations in memory, with actual writes happening onflush(). Addsflush()to thePersisttrait andChainMonitor.Trade-offs: Only fixes the issue for
MonitorUpdatingPersister; customPersistimplementations remain vulnerable to the race condition.3. Queue internally in ChainMonitor (#4351)
Modifies
ChainMonitordirectly to queue operations internally, returningInProgressuntilflush()is called.Trade-offs: Requires an enormous amount of test changes since existing tests expect immediate persistence behavior.