Resolve outbound HTLC fails via `MonitorEvent`s by valentinewallace · Pull Request #4604 · lightningdevkit/rust-lightning

valentinewallace · 2026-05-07T16:46:22Z

As part of #4482, we're looking into changing our architecture -- currently, the Channel{Manager} is responsible for managing the resolution of off-chain HTLCs, and the ChannelMonitor is responsible for them once they're on-chain. See the issue description but there's complexity that results from this design.

Quoting the issue, "Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them - adding a new MonitorEvent resolution path through a new method (rather than via ChannelMonitorUpdates)."

Here we begin resolving outbound payment failures via MonitorEvents, if persistent_monitor_events is enabled (it will only be enabled randomly in tests for now).

Based on #4491, #4545, #4524

Helps when debugging to know which variants failed.

Currently, the resolution of HTLCs (and decisions on when HTLCs can be forwarded) is the responsibility of Channel objects (a part of ChannelManager) until the channel is closed, and then the ChannelMonitor thereafter. This leads to some complexity around race conditions for HTLCs right around channel closure. Additionally, there is lots of complexity reconstructing the state of all HTLCs in the ChannelManager deserialization/loading logic. Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them. This simplify things - on restart instead of examining the set of HTLCs in monitors we can simply replay all the pending MonitorEvents. As a first step towards this, here we persist a flag in the ChannelManager and ChannelMonitors indicating whether this new feature is enabled. It will be used in upcoming commits to maintain compatibility and create an upgrade/downgrade path between LDK versions.

Cleans up the next commit

This field will be deprecated in upcoming commits when we start persisting MonitorEvent ids alongside the MonitorEvents.

Currently, the resolution of HTLCs (and decisions on when HTLCs can be forwarded) is the responsibility of Channel objects (a part of ChannelManager) until the channel is closed, and then the ChannelMonitor thereafter. This leads to some complexity around race conditions for HTLCs right around channel closure. Additionally, there is lots of complexity reconstructing the state of all HTLCs in the ChannelManager deserialization/loading logic. Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them. This simplify things - on restart instead of examining the set of HTLCs in monitors we can simply replay all the pending MonitorEvents. Here we add an as-yet-unused API to chain::Watch to allow the ChannelManager to tell the a ChannelMonitor that a MonitorEvent has been irrevocably processed and can be deleted.

Currently, the resolution of HTLCs (and decisions on when HTLCs can be forwarded) is the responsibility of Channel objects (a part of ChannelManager) until the channel is closed, and then the ChannelMonitor thereafter. This leads to some complexity around race conditions for HTLCs right around channel closure. Additionally, there is lots of complexity reconstructing the state of all HTLCs in the ChannelManager deserialization/loading logic. Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them. This simplify things - on restart instead of examining the set of HTLCs in monitors we can simply replay all the pending MonitorEvents. To allow the ChannelManager to ack specific monitor events once they are resolved in upcoming commits, here we give each MonitorEvent a corresponding unique id. It's implemented in such a way that we can delete legacy monitor event code in the future when the new persistent monitor events flag is enabled by default.

Currently, the resolution of HTLCs (and decisions on when HTLCs can be forwarded) is the responsibility of Channel objects (a part of ChannelManager) until the channel is closed, and then the ChannelMonitor thereafter. This leads to some complexity around race conditions for HTLCs right around channel closure. Additionally, there is lots of complexity reconstructing the state of all HTLCs in the ChannelManager deserialization/loading logic. Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them. This will simplify things - on restart instead of examining the set of HTLCs in monitors we can simply replay all the pending MonitorEvents. Here for the purposes of merging initial support for persistent monitor events, we ack each immediately after it is received/handled by the ChannelManager, which is equivalent to the behavior we had prior to monitor events becoming persistent. In upcoming work, we'll want to have much more special handling for HTLCUpdate monitor events in particular -- e.g. for outbound payment claim events, we should only ACK the monitor event when the PaymentSent event is processed, until that point we want it to keep being provided back to us on startup. All the other monitor events are trivial to ACK, since they don't need to be re-processed on startup.

Currently, the resolution of HTLCs (and decisions on when HTLCs can be forwarded) is the responsibility of Channel objects (a part of ChannelManager) until the channel is closed, and then the ChannelMonitor thereafter. This leads to some complexity around race conditions for HTLCs right around channel closure. Additionally, there is lots of complexity reconstructing the state of all HTLCs in the ChannelManager deserialization/loading logic. Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them. This will simplify things - on restart instead of examining the set of HTLCs in monitors we can simply replay all the pending MonitorEvents. Here we complete work that was built on recent prior commits and actually start re-providing monitor events on startup if they went un-acked during runtime. This isn't actually supported in prod yet, so this new code will run randomly in tests, to ensure we still support the old paths.

And log them in check_added_monitors if it fails.

Will be used in upcoming commits when generating MonitorEvents.

Processing MonitorEvent::HTLCEvent causes the ChannelManager to call claim_funds_internal, but it will currently pass in None for the user_channel_id parameter. In upcoming commits when we begin generating monitor events for off-chain HTLC claims as well as onchain, we'll want to start using an accurate value instead.

Used in an upcoming commit to insert a pending payment if it's missing on startup and we need to re-claim.

Used in an upcoming commit to generate an EventCompletionAction::AckMonitorEvent.

In upcoming commits, we'll be generating monitor events for off-chain claims as well as on-chain. As a small prefactor, calculate the from_onchain value rather than hardcoding it to true.

If ChannelManager::persistent_monitor_events is enabled, we may want to avoid acking a monitor event until after an Event is processed by the user. In upcoming commits, we'll use this to ensure a MonitorEvent::HTLCEvent will keep being re-provided back to us until after an Event::PaymentSent is processed.

Currently, the resolution of HTLCs (and decisions on when HTLCs can be forwarded) is the responsibility of Channel objects (a part of ChannelManager) until the channel is closed, and then the ChannelMonitor thereafter. This leads to some complexity around race conditions for HTLCs right around channel closure. Additionally, there is lots of complexity reconstructing the state of all HTLCs in the ChannelManager deserialization/loading logic. Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them. This will simplify things - on restart instead of examining the set of HTLCs in monitors we can simply replay all the pending MonitorEvents. In recent work, we added support for keeping monitor events around until they are explicitly acked by the ChannelManager, but would always ack monitor events immediately, which preserved the previous behavior and didn't break any tests. Up until this point, we only generated HTLC monitor events when a payment was claimed/failed on-chain. In this commit, we start generating persistent monitor events whenever a payment is claimed *off*-chain, specifically when new latest holder commitment data is provided to the monitor. For the purpose of making incremental progress on this feature, these events will be a no-op and/or continue to be acked immediately except in the narrow case of an off-chain outbound payment claim. HTLC forward claim monitor events will be a no-op, and on-chain outbound payment claim events continue to be acked immediately. Off-chain outbound payment claims, however, now have monitor events generated for them that will not be acked by the ChannelManager until the PaymentSent event is processed by the user. This also allows us to stop blocking the RAA monitor update that removes the preimage, because the purpose of that behavior was to ensure the user got a PaymentSent event and the monitor event now serves that purpose instead.

This isn't a bug at the moment because a claim in this situation would already be filtered out due to its inclusion in htlcs_resolved_to_user. However, when we stop issuing ReleasePaymentComplete monitor updates for claims in upcoming commits, HTLC claims will no longer be in htlcs_resolved_to_user when get_onchain_failed_htlcs checks.

Currently, the resolution of HTLCs (and decisions on when HTLCs can be forwarded) is the responsibility of Channel objects (a part of ChannelManager) until the channel is closed, and then the ChannelMonitor thereafter. This leads to some complexity around race conditions for HTLCs right around channel closure. Additionally, there is lots of complexity reconstructing the state of all HTLCs in the ChannelManager deserialization/loading logic. Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them. This will simplify things - on restart instead of examining the set of HTLCs in monitors we can simply replay all the pending MonitorEvents. In recent work, we added support for keeping monitor events around until they are explicitly acked by the ChannelManager, but would always ack monitor events immediately, which preserved the previous behavior and didn't break any tests. Here we start acking monitor events for on-chain HTLC claims when the user processes the PaymentSent event, if the persistent_monitor_events feature is enabled. This allows us to stop issuing ReleasePaymentComplete monitor updates for onchain payment claims, because the purpose of that behavior is to ensure we won't keep repeatedly issuing PaymentSent events, and the monitor event and acking it on PaymentSent processing now serves that purpose instead.

Use new ForwardEventContents struct as the return value of the closure passed to claim_funds_from_htlc_forward_hop. Useful as upcoming commits will add a struct that wants to contain a forward event specifically, not just any Event.

claim_mpp_part will use this event id in future commits when generating monitor update completion actions to ack the event corresponding to the id.

Channel will store this event id in upcoming commits, to allow the monitor event corresponding to the id to be acked after the HTLC is removed via revoke_and_ack.

ldk-reviews-bot · 2026-05-07T16:46:25Z

👋 Hi! I see this is a draft PR.
I'll wait to assign reviewers until you mark it as ready for review.
Just convert it out of draft status when you're ready for review!

codecov · 2026-05-07T17:55:50Z

Codecov Report

❌ Patch coverage is 88.12729% with 97 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.10%. Comparing base (964a84f) to head (70665c7).
⚠️ Report is 41 commits behind head on main.

Files with missing lines	Patch %	Lines
lightning/src/chain/channelmonitor.rs	90.14%	25 Missing and 10 partials ⚠️
lightning/src/ln/channel.rs	75.47%	24 Missing and 2 partials ⚠️
lightning/src/ln/channelmanager.rs	90.51%	24 Missing and 2 partials ⚠️
lightning/src/ln/functional_test_utils.rs	64.00%	9 Missing ⚠️
lightning/src/chain/chainmonitor.rs	95.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4604      +/-   ##
==========================================
- Coverage   86.18%   86.10%   -0.09%     
==========================================
  Files         156      157       +1     
  Lines      108528   109297     +769     
  Branches   108528   109297     +769     
==========================================
+ Hits        93532    94105     +573     
- Misses      12386    12563     +177     
- Partials     2610     2629      +19

Flag	Coverage Δ
tests	`86.10% <88.12%> (-0.09%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

We don't actually need to write this field to disk because monitor events will be re-provided on startup. Will be used in upcoming commits to ack the monitor events corresponding to these ids after the htlcs are removed via revoke_and_ack.

Will be set in upcoming commits if an HTLC is on the inbound edge of a forwarded HTLC and we need to ack a monitor event on the outbound edge once this HTLC is irrevocably removed. The MonitorEventSource::channel_id identifies said outbound edge. Not persisted to disk because monitor events are re-provided on startup, so after restart the outbound edge's replayed event will be acked via the duplicate-claim path instead.

If we stored a monitor event id with a pending inbound HTLC and that HTLC is about to be fully removed from the channel via revoke_and_ack, we should ack the monitor event corresponding to that id when the monitor update associated with the RAA is complete. We do this so the monitor event will keep being re-provided to us on startup until the HTLC is removed, to ensure the HTLC gets resolved even if we lose the holding cell.

The first time we claim an HTLC in a channel, we may not have an associated monitor event id. Once we later process the monitor event for the claim, we need to update the channel's internal htlc state to include the monitor event id, so the event can be properly acked afer the htlc is removed.

We'll be adding another parameter to this closure soon. Also this makes it a little clearer what's going on at the callsites.

Indicates we should emit an Event::PaymentForwarded and possibly ack a monitor event via Watch::ack_monitor_event. This is generated when we've completed an inbound edge preimage update for an HTLC forward, at which point it's safe to generate the forward event. If the inbound edge is closed, a monitor event id may be included so we can tell the outbound edge to stop telling us about the claim once the forward event is processed by the user.

We need this level of precision when generating this event for off-chain claims. One test failed due to us previously overreporting a fee due to this lack of precision, fixed now.

Currently, the resolution of HTLCs (and decisions on when HTLCs can be forwarded) is the responsibility of Channel objects (a part of ChannelManager) until the channel is closed, and then the ChannelMonitor thereafter. This leads to some complexity around race conditions for HTLCs right around channel closure. Additionally, there is lots of complexity reconstructing the state of all HTLCs in the ChannelManager deserialization/loading logic. Instead, we want to do all resolution in ChannelMonitors (in response to ChannelMonitorUpdates) and pass them back to ChannelManager in the form of MonitorEvents (similar to how HTLCs are resolved after channels are closed). In order to have reliable resolution, we'll need to keep MonitorEvents around in the ChannelMonitor until the ChannelManager has finished processing them. This will simplify things - on restart instead of examining the set of HTLCs in monitors we can simply replay all the pending MonitorEvents. Building on the work of prior commits, here we stop immediately acking monitor events for HTLC claims for forwarded HTLCs, and instead ack the monitor event when the forward is fully resolved and we don't want the monitor event to be re-provided back to us on startup. If the inbound edge channel is on-chain, we'll ack the event when the inbound edge preimage monitor update completes and the user processes a PaymentForwarded event. If the inbound edge is open, we'll ack the event when the inbound HTLC is fully removed via revoke_and_ack (this ensures we'll remember to resolve the htlc off-chain even if we lose the holding cell).

We're moving towards having the ChannelMonitor be responsible for resolving HTLCs, via MonitorEvents, so we need to start tracking more payment information in monitor updates so the monitor later has access to it. Here we add tracking for the skimmed_fee field for claims.

We're moving towards having the ChannelMonitor be responsible for resolving HTLCs, via MonitorEvents, so we need to start surfacing more information in monitor events. Here we start persisting/surfacing a specific HTLC failure reason and the skimmed fee in MonitorEvent::HTLCEvents, which is useful when generating and handling monitor events for off-chain claims/fails. The skimmed fee for forwards is already put to use in this commit for generating correct PaymentForwarded events. The failure reason will be used in upcoming commits. We add the failure reason now to not have to change what we're serializing to disk later.

Useful to create accurate monitor events for HTLC failures in upcoming commits, even when the channel is still open.

Useful to create accurate monitor events for HTLC failures, even when the channel is still open.

Useful for the monitor to surface these failure reasons in MonitorEvent::HTLCEvents, which the manager will use to resolve these HTLCs.

These events are a no-op right now, which preserves the previous behavior. In upcoming commits, they will enable us to move onchain htlc failure resolution from Channels to ChannelMonitors as part of a larger refactor where monitors resolve all HTLCs.

Generalize fail_htlc's completion-action parameter from PaymentCompleteUpdate to EventCompletionAction so the upcoming persistent-monitor-events path can pass AckMonitorEvent.

We are in the process of moving the resolution of off-chain HTLCs from the Channel to the ChannelMonitor, via MonitorEvents. This will simplify how we reconstruct/reconcile the ChannelManager's pending HTLC set on startup and avoid needing to persist pending HTLCs explicitly in the manager, as we can just replay pending monitor events to reconstruct the set instead. In a recent commit, we started generating persistent monitor events whenever an htlc is failed off-chain, but those events were no-ops at the time. In this commit, we make those monitor events the sole driver of off-chain outbound payment failure. When `persistent_monitor_events` is enabled, we will skip the existing failure path for these HTLCs and instead fail them when the monitor event is processed. Similar to what we do for claims, the monitor event will not be acked and will continue to be re-provided on startup until the resulting PaymentFailed event is processed by the user. Forward and on-chain failure paths are unchanged and will be migrated separately as part of the broader refactor.

DRYs code in an upcoming commit.

We are in the process of moving the resolution of all HTLCs from the Channel to the ChannelMonitor, via persistent MonitorEvents. This will simplify how we reconstruct/reconcile the ChannelManager's pending HTLC set on startup and avoid needing to persist pending HTLCs explicitly in the manager, as we can just replay pending monitor events to reconstruct the set instead. In recent commits, we started generating persistent monitor events whenever an HTLC is failed off-chain. We also made those monitor events the sole driver of off-chain outbound payment failures, failing those HTLCs when the monitor event is processed. This replaced the previous behavior of failing when we initially receive update_fail_htlc from our peer. In this commit, we continue that work by making persistent monitor events the sole drivers of *on*chain outbound payment failures. When persistent_monitor_events is enabled, we will skip existing failure paths for on-chain outbound HTLCs, and instead (a) fail them when the monitor event is processed, and (b) similar to what we do for claims, the monitor event will not be acked and will continue to be re-provided on startup until the resulting PaymentFailed event is processed by the user. Forward HTLC failure paths are unchanged and will be migrated separately in upcoming work.

valentinewallace added 22 commits May 6, 2026 11:26

Split up some tests that have many variants

4ed8022

Helps when debugging to know which variants failed.

Remove unnecessary pending_monitor_events clone

f492b10

Add helper to push monitor events

4b12ecd

Cleans up the next commit

Rename pending_monitor_events to _legacy

be4d80c

This field will be deprecated in upcoming commits when we start persisting MonitorEvent ids alongside the MonitorEvents.

Track recent monitor updates in TestChainMonitor

b914b8b

And log them in check_added_monitors if it fails.

Persist user channel id in monitors

31ab44c

Will be used in upcoming commits when generating MonitorEvents.

Pass best block height to outbound_payments::claim_htlc

283a67c

Used in an upcoming commit to insert a pending payment if it's missing on startup and we need to re-claim.

Pass monitor event id to claim_funds_internal

2942cfd

Used in an upcoming commit to generate an EventCompletionAction::AckMonitorEvent.

Stop hardcoding from_onchain in monitor ev claim_funds

42646ed

In upcoming commits, we'll be generating monitor events for off-chain claims as well as on-chain. As a small prefactor, calculate the from_onchain value rather than hardcoding it to true.

claim_funds api: Event -> ForwardEventContents

30d5bfc

Use new ForwardEventContents struct as the return value of the closure passed to claim_funds_from_htlc_forward_hop. Useful as upcoming commits will add a struct that wants to contain a forward event specifically, not just any Event.

Thread monitor event id to claim_mpp_part

bf64600

claim_mpp_part will use this event id in future commits when generating monitor update completion actions to ack the event corresponding to the id.

Pass monitor event id when claiming in Channel

b98da5f

Channel will store this event id in upcoming commits, to allow the monitor event corresponding to the id to be acked after the HTLC is removed via revoke_and_ack.

valentinewallace added blocked on dependent pr blocked on next release Should Wait Until Next Release To Land labels May 7, 2026

valentinewallace added 4 commits May 7, 2026 17:15

Cache monitor event id in holding cell claims

1a956ef

We don't actually need to write this field to disk because monitor events will be re-provided on startup. Will be used in upcoming commits to ack the monitor events corresponding to these ids after the htlcs are removed via revoke_and_ack.

valentinewallace added 15 commits May 7, 2026 18:00

Abstract params to make claim mon completion action

9c84c30

We'll be adding another parameter to this closure soon. Also this makes it a little clearer what's going on at the callsites.

HTLCUpdate::htlc_value_satoshis -> msats

6c6ebc2

We need this level of precision when generating this event for off-chain claims. One test failed due to us previously overreporting a fee due to this lack of precision, fixed now.

Store HTLC failure reasons in monitor

65de0e7

Useful to create accurate monitor events for HTLC failures in upcoming commits, even when the channel is still open.

Persist outbound HTLC failures in monitor

7fc918a

Useful to create accurate monitor events for HTLC failures, even when the channel is still open.

Pass HTLC failures channel -> monitor

e8adfe3

Useful for the monitor to surface these failure reasons in MonitorEvent::HTLCEvents, which the manager will use to resolve these HTLCs.

fail_htlc: PaymentCompleteUpdate -> EventComplAction

0958b78

Generalize fail_htlc's completion-action parameter from PaymentCompleteUpdate to EventCompletionAction so the upcoming persistent-monitor-events path can pass AckMonitorEvent.

Helper for confirmed_funding_spend_past_anti_reorg

1c9c7ff

DRYs code in an upcoming commit.

valentinewallace force-pushed the 2026-05-mon-resolves-outb-fails branch from cf346fd to 70665c7 Compare May 7, 2026 22:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resolve outbound HTLC fails via `MonitorEvent`s#4604

Resolve outbound HTLC fails via `MonitorEvent`s#4604
valentinewallace wants to merge 41 commits intolightningdevkit:mainfrom
valentinewallace:2026-05-mon-resolves-outb-fails

valentinewallace commented May 7, 2026

Uh oh!

ldk-reviews-bot commented May 7, 2026

Uh oh!

codecov Bot commented May 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

valentinewallace commented May 7, 2026

Uh oh!

ldk-reviews-bot commented May 7, 2026

Uh oh!

codecov Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 7, 2026 •

edited

Loading