Add local signer backup recovery flow#715
Conversation
bf8917e to
c557661
Compare
cdecker
left a comment
There was a problem hiding this comment.
Hm, not quite sure this is the direction we should go. Calling the client API from the signer is not necessary as far as I can see. The idea was to just take a snapshot of the signer state, which contains all the relevant information to recover on its own, whereas this change is a sprawling change, injecting new client connections in a variety of places, and adding strong coupling.
The original issue had the following line:
Conclusion: VLS state contains all SCB data plus much more. Storing VLS state snapshots should be sufficient for disaster recovery.
| use std::io::Write; | ||
| use std::path::Path; |
There was a problem hiding this comment.
This would prevent us from compiling in no_std environments, of which we target wasm as well as embedded environments. This means we need to gate the use and functionality behind a #[cfg(...)] guard, so we can exclude these parts for no_std envs.
There was a problem hiding this comment.
Good point. Done
|
|
||
| mod approver; | ||
| mod auth; | ||
| mod backup; |
There was a problem hiding this comment.
We likely need to #[cfg(...)] guard to the mod, then we have a nice and clean separation.
| async fn process_request( | ||
| &self, | ||
| req: HsmRequest, | ||
| mut node_client: Option<&mut crate::node::ClnClient>, |
There was a problem hiding this comment.
I don't quite understand the logic behind pushing a backup side-effect into the processing itself, when we can do snapshot comparison in the caller.
| } | ||
| } | ||
|
|
||
| fn backup_peerlist_client(&self, channel: Channel) -> Result<Option<node::ClnClient>, Error> { |
There was a problem hiding this comment.
Not sure why we need a node::ClnClient here at all, we have all the necessary data in the signerstate already, so let's just extract from there.
There was a problem hiding this comment.
Following up on your feedback in the sibling MR:
Ah, now I see why retrieving extra information is necessary in the public PR. I think we can work around it though, since a funding must always be preceeded by a
connectcommand, whose IP address we can just stash away. Alternatively, a much cleaner solution would be to just add the IP, if known, to the VLS state itself. That would take a while to propagate through back to us, but it would mean the signer state is a true superset.
I went with adding peer data into the VLS state to make it a real superstate
2f9ce1d to
2237e89
Compare
cdecker
left a comment
There was a problem hiding this comment.
Very good, the functionality is all there, however the implementation and specifically where it hooks into the rest of the functionality is rather strange to me. From what I understand the backup is now interspersed with the signer state update, whereas we could just keep the signer state processing untouched, then pass the updated signer state into the backup for it to extract the changes (regenerate the backup), and then conditionally write to disk when there is a change to before. This would be more aligned with the phase-separation we have currently:
- Pre-flight checks in the form of the end-to-end verification on the requests
- Snapshot of the signer state
- Pass signer state and request to the VLS core for verification, state updates and response generation
- Diff between pre- and post-state
- Pass diff to backup so it can update itself
- Return
(response, post_state)togl-plugin
Can be a followup PR, but probably simpler to separate here, rather than to untangle once merged.
| let private_key = self | ||
| .tls | ||
| .private_key | ||
| .clone() | ||
| .ok_or_else(|| Error::Other(anyhow!("missing TLS private key for CLN auth")))?; |
There was a problem hiding this comment.
I wonder if this may ever happen actually 🤔
| @@ -0,0 +1,111 @@ | |||
| # Signer Backups | |||
|
|
|||
| Greenlight signers can keep a local copy of the VLS signer state to enable disaster recovery or migration to a self-hosted node. This backup is opt-in and disabled by default. When enabled, the backup file contains signer state entries for recoverable channels and known peers. | |||
There was a problem hiding this comment.
Migrating to a new self-hosted node requires the Greenlight node to be forcefully disabled, otherwise we run into the split brain issue, and LN will penalize. The backup really is only for disaster recovery, never for migration, as an unattended and/or uncoordinated migration will result in loss of funds!
Please reword this intro as a safety net, not to be used for migrations.
|
|
||
| ## Convert for Core Lightning | ||
|
|
||
| Convert the signer backup to Core Lightning recovery input: |
There was a problem hiding this comment.
PLease add a warning that this is only ever to be done if the service goes down, and the signer MUST NEVER connect to Greenlight's hosted node ever, otherwise loss of funds may be inevitable.
There was a problem hiding this comment.
Sort of, actually the backup we are building here is the SCB equivalent, not a real resumable backup. That'd involve storing the shachain secrets, and other related secrets in lockstep wtih the node, which this PR does not implement.
So technically, could be safe, but only because CLN will have to immediately close the channels in the backup to recover the funds, which makes concurrent operations less risky, but still quite risky should the GL node not have been immobilized.
| When VLS counterparty revocation secrets are present in the backup, the | ||
| converted CLN SCB entries include the shachain TLV. If that signer state is | ||
| absent, conversion still emits CLN recovery input without the shachain TLV. |
There was a problem hiding this comment.
Very good, I think there isn't a whole lot missing if we have the shachain to just be able to resume the channels in their state without closing them.
| @@ -0,0 +1,96 @@ | |||
| #[cfg(feature = "backup")] | |||
| mod enabled { | |||
There was a problem hiding this comment.
I'm not sure I follow here. I was hoping we'd just have the following in src/signer/mod.rs:
#[cfg(feature = "backup")]
mod backup;
That's all we really need. And then from the callsites we just prepend the callsites with a cfg check inline. Having dummy stubs around is terrible DX.
|
|
||
| network: Network, | ||
| state: Arc<Mutex<crate::persist::State>>, | ||
| backup: backup_runtime::Runtime, |
There was a problem hiding this comment.
#[cfg(feature = "backup")]
Here and the other call sites, and we avoid the weird roundabout way of disabling and enabling a Runtime.
Adds opt-in local VLS signer backups with CLI inspection/conversion tooling. There are two available backup strategies:
new-channels-only: default, low I/O, snapshots when a channel first becomes recoverable.periodic: snapshots new recoverable channels and then refreshes after configured recoverable-channel updates, with more disk writesBackups can be created through:
Backups can be converted to CLN
recoverchannelinput with:Tradeoffs