Add local signer backup recovery flow#715

Open
ihordiachenko wants to merge 12 commits into main from feature/state_backup

Collaborator

@ihordiachenko ihordiachenko commented May 1, 2026

Adds opt-in local VLS signer backups with CLI inspection/conversion tooling. There are two available backup strategies:

  • new-channels-only: default, low I/O, snapshots when a channel first becomes recoverable.
  • periodic: snapshots new recoverable channels, then refreshes after configured recoverable-channel updates, at the cost of more disk writes.
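
The difference between the two strategies can be pictured as a small trigger function. This is a minimal sketch, not the PR's implementation: `BackupStrategy`, `ChannelEvent`, `Trigger` and `every_n_updates` are hypothetical names, and reading "configured recoverable-channel updates" as a simple update counter is an assumption.

```rust
/// Hypothetical model of the two strategies from the PR description.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BackupStrategy {
    /// Snapshot only when a channel first becomes recoverable.
    NewChannelsOnly,
    /// Additionally refresh after `every_n_updates` updates to recoverable
    /// channels (assumed meaning of the configured refresh setting).
    Periodic { every_n_updates: u32 },
}

/// Hypothetical events the backup logic reacts to.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ChannelEvent {
    BecameRecoverable,
    Updated,
}

/// Tracks how many updates were seen since the last snapshot.
struct Trigger {
    strategy: BackupStrategy,
    updates_since_snapshot: u32,
}

impl Trigger {
    fn new(strategy: BackupStrategy) -> Self {
        Trigger { strategy, updates_since_snapshot: 0 }
    }

    /// Returns true when `event` should cause a snapshot to be written.
    fn should_snapshot(&mut self, event: ChannelEvent) -> bool {
        let fire = match (self.strategy, event) {
            // Both strategies snapshot newly recoverable channels.
            (_, ChannelEvent::BecameRecoverable) => true,
            // The default strategy ignores later updates (low I/O).
            (BackupStrategy::NewChannelsOnly, ChannelEvent::Updated) => false,
            // The periodic strategy counts updates and fires at the threshold.
            (BackupStrategy::Periodic { every_n_updates }, ChannelEvent::Updated) => {
                self.updates_since_snapshot += 1;
                self.updates_since_snapshot >= every_n_updates
            }
        };
        if fire {
            self.updates_since_snapshot = 0;
        }
        fire
    }
}
```

With this model, the extra disk writes of the periodic strategy come only from the counter branch; everything else is shared between the two.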

Backups are created with glcli signer run --backup-path and can be inspected with inspect-backup.

Backups can be converted to CLN recoverchannel input with:

glcli signer convert-backup --format cln --path <backup file>

Tradeoffs

  • Backups are best-effort during signer operation: write failures are logged and do not interrupt signing. The backup file is created only after a snapshot trigger, not immediately at startup.
  • Peer addresses are stored from Greenlight’s peerlist alongside VLS state to close the main recovery-data gap.
  • Only v1 channels are supported for now.
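
The best-effort behavior from the first bullet could look roughly like the following. This is a sketch under stated assumptions: `write_snapshot` and `backup_best_effort` are hypothetical helpers, logging is reduced to `eprintln!`, and nothing here is taken from the PR's code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: persist a serialized snapshot to `path`.
fn write_snapshot(path: &Path, bytes: &[u8]) -> io::Result<()> {
    fs::write(path, bytes)
}

/// Best-effort wrapper: a failed write is logged and swallowed, so the
/// caller (the signing path) never sees an error from the backup.
/// Returns whether the write succeeded, which is only informational.
fn backup_best_effort(path: &Path, bytes: &[u8]) -> bool {
    match write_snapshot(path, bytes) {
        Ok(()) => true,
        Err(e) => {
            eprintln!("signer backup write failed, continuing: {}", e);
            false
        }
    }
}
```

Because the wrapper returns a plain `bool` rather than a `Result`, a caller cannot accidentally propagate a backup failure with `?` and interrupt signing.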

@ihordiachenko ihordiachenko requested a review from cdecker May 1, 2026 12:04
@ihordiachenko ihordiachenko force-pushed the feature/state_backup branch from bf8917e to c557661 Compare May 1, 2026 12:07
@ihordiachenko ihordiachenko marked this pull request as ready for review May 1, 2026 12:12
Collaborator

@cdecker cdecker left a comment

Hm, not quite sure this is the direction we should go. Calling the client API from the signer is not necessary as far as I can see. The idea was to just take a snapshot of the signer state, which contains all the relevant information to recover on its own, whereas this change is a sprawling change, injecting new client connections in a variety of places, and adding strong coupling.

The original issue had the following line:

Conclusion: VLS state contains all SCB data plus much more. Storing VLS state snapshots should be sufficient for disaster recovery.

Comment thread libs/gl-client/src/signer/backup.rs Outdated
Comment on lines +5 to +6
use std::io::Write;
use std::path::Path;
Collaborator

This would prevent us from compiling in no_std environments, of which we target wasm as well as embedded environments. This means we need to gate the use and functionality behind a #[cfg(...)] guard, so we can exclude these parts for no_std envs.
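
A minimal sketch of such a guard, assuming a Cargo feature named `backup` (the name used later in this PR); `write_snapshot`, `after_state_update` and the file name are hypothetical and only illustrate where the guard sits:

```rust
// The std-only code lives in a module that is compiled only when the
// feature is enabled, so no_std builds never see these imports.
#[cfg(feature = "backup")]
mod backup {
    use std::io::Write;
    use std::path::Path;

    /// Hypothetical helper: write a serialized snapshot to `path`.
    pub fn write_snapshot(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
        let mut file = std::fs::File::create(path)?;
        file.write_all(bytes)
    }
}

/// Hypothetical call site. Returns true only if a backup was written,
/// which can only happen when the feature is enabled.
fn after_state_update(snapshot: &[u8]) -> bool {
    let mut written = false;
    #[cfg(feature = "backup")]
    {
        // Placeholder path, not a real default of the tool.
        let path = std::path::Path::new("signer-backup.bin");
        written = backup::write_snapshot(path, snapshot).is_ok();
    }
    let _ = snapshot; // keeps the parameter used when the feature is off
    written
}
```

With the feature disabled, both the module and the guarded block are removed before type checking, so neither `std::io` nor `std::path` is referenced.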

Collaborator Author

Good point. Done


mod approver;
mod auth;
mod backup;
Collaborator

We likely need to add a #[cfg(...)] guard to the mod; then we have a nice and clean separation.

Collaborator Author

Done

Comment thread libs/gl-client/src/signer/mod.rs Outdated
async fn process_request(
&self,
req: HsmRequest,
mut node_client: Option<&mut crate::node::ClnClient>,
Collaborator

I don't quite understand the logic behind pushing a backup side-effect into the processing itself, when we can do snapshot comparison in the caller.

Comment thread libs/gl-client/src/signer/mod.rs Outdated
}
}

fn backup_peerlist_client(&self, channel: Channel) -> Result<Option<node::ClnClient>, Error> {
Collaborator

Not sure why we need a node::ClnClient here at all, we have all the necessary data in the signerstate already, so let's just extract from there.

Collaborator Author

Following up on your feedback in the sibling MR:

Ah, now I see why retrieving extra information is necessary in the public PR. I think we can work around it though, since a funding must always be preceded by a connect command, whose IP address we can just stash away. Alternatively, a much cleaner solution would be to just add the IP, if known, to the VLS state itself. That would take a while to propagate through back to us, but it would mean the signer state is a true superset.

I went with adding peer data into the VLS state to make it a real superset.

@ihordiachenko ihordiachenko force-pushed the feature/state_backup branch from 2f9ce1d to 2237e89 Compare May 12, 2026 23:45
Collaborator

@cdecker cdecker left a comment

Very good, the functionality is all there. However, the implementation, and specifically where it hooks into the rest of the functionality, is rather strange to me. From what I understand, the backup is now interspersed with the signer state update, whereas we could keep the signer state processing untouched, pass the updated signer state into the backup so it can extract the changes (regenerate the backup), and then write to disk only when something has changed. This would be more aligned with the phase separation we currently have:

  1. Pre-flight checks in the form of the end-to-end verification on the requests
  2. Snapshot of the signer state
  3. Pass signer state and request to the VLS core for verification, state updates and response generation
  4. Diff between pre- and post-state
  5. Pass diff to backup so it can update itself
  6. Return (response, post_state) to gl-plugin
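
The phases above can be sketched as follows. This is an illustration under stated assumptions, not the gl-client code: `State`, `Backup`, `core_process` and `handle` are hypothetical stand-ins, the signer state is modelled as a map of versioned entries, and step 1 (pre-flight checks) is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the signer state: key -> (version, value).
type State = BTreeMap<String, (u64, Vec<u8>)>;

/// Step 4: entries that are new or changed in `post` relative to `pre`.
fn diff(pre: &State, post: &State) -> State {
    post.iter()
        .filter(|(k, v)| pre.get(*k) != Some(*v))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}

/// Hypothetical backup that counts how often it would write to disk.
#[derive(Default)]
struct Backup {
    entries: State,
    writes: usize,
}

impl Backup {
    /// Step 5: apply the diff and "write" only when something changed.
    fn apply(&mut self, changes: &State) -> bool {
        if changes.is_empty() {
            return false;
        }
        for (k, v) in changes {
            self.entries.insert(k.clone(), v.clone());
        }
        self.writes += 1; // a real implementation would persist here
        true
    }
}

/// Stand-in for step 3: the core mutates the state and builds a response.
fn core_process(state: &mut State, request: &str) -> String {
    if let Some(key) = request.strip_prefix("update:") {
        let entry = state.entry(key.to_string()).or_insert((0, Vec::new()));
        entry.0 += 1;
    }
    format!("ok:{}", request)
}

/// Steps 2 to 6 in order; the core call knows nothing about backups.
fn handle(state: &mut State, backup: &mut Backup, request: &str) -> (String, State) {
    let pre = state.clone(); // 2. snapshot
    let response = core_process(state, request); // 3. core processing
    let changes = diff(&pre, state); // 4. diff
    backup.apply(&changes); // 5. backup updates itself
    (response, state.clone()) // 6. response and post-state
}
```

In this arrangement the backup is a pure consumer of the diff, so a request that leaves the state untouched never causes a disk write.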

Can be a followup PR, but probably simpler to separate here, rather than to untangle once merged.

Comment thread libs/gl-client/src/signer/mod.rs Outdated
Comment on lines +761 to +765
let private_key = self
    .tls
    .private_key
    .clone()
    .ok_or_else(|| Error::Other(anyhow!("missing TLS private key for CLN auth")))?;
Collaborator

I wonder if this may ever happen actually 🤔

@@ -0,0 +1,111 @@
# Signer Backups

Greenlight signers can keep a local copy of the VLS signer state to enable disaster recovery or migration to a self-hosted node. This backup is opt-in and disabled by default. When enabled, the backup file contains signer state entries for recoverable channels and known peers.
Collaborator

Migrating to a new self-hosted node requires the Greenlight node to be forcefully disabled, otherwise we run into the split brain issue, and LN will penalize. The backup really is only for disaster recovery, never for migration, as an unattended and/or uncoordinated migration will result in loss of funds!

Please reword this intro as a safety net, not to be used for migrations.


## Convert for Core Lightning

Convert the signer backup to Core Lightning recovery input:
Collaborator

Please add a warning that this is only ever to be done if the service goes down, and that the signer MUST NEVER connect to Greenlight's hosted node, otherwise loss of funds may be inevitable.

Collaborator

Sort of. Actually, the backup we are building here is the SCB equivalent, not a real resumable backup. That would involve storing the shachain secrets and other related secrets in lockstep with the node, which this PR does not implement.

So technically, could be safe, but only because CLN will have to immediately close the channels in the backup to recover the funds, which makes concurrent operations less risky, but still quite risky should the GL node not have been immobilized.

Comment on lines +85 to +87
When VLS counterparty revocation secrets are present in the backup, the
converted CLN SCB entries include the shachain TLV. If that signer state is
absent, conversion still emits CLN recovery input without the shachain TLV.
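
The conditional behavior described in these lines can be sketched as below. Everything here is hypothetical: `RecoveryEntry` is a simplified stand-in and not CLN's SCB wire format, and the TLV type number is a placeholder rather than the value CLN uses.

```rust
/// Hypothetical, simplified recovery entry (not CLN's actual SCB encoding).
#[derive(Debug, PartialEq)]
struct RecoveryEntry {
    channel_id: String,
    /// Optional TLV records as (type, value) pairs.
    tlvs: Vec<(u64, Vec<u8>)>,
}

/// Placeholder type number, chosen only for this sketch.
const HYPOTHETICAL_SHACHAIN_TLV_TYPE: u64 = 1;

/// Emits an entry in both cases; the shachain TLV is attached only when
/// counterparty revocation secrets are present in the backup.
fn convert(channel_id: &str, revocation_secrets: Option<&[u8]>) -> RecoveryEntry {
    let mut tlvs = Vec::new();
    if let Some(secrets) = revocation_secrets {
        tlvs.push((HYPOTHETICAL_SHACHAIN_TLV_TYPE, secrets.to_vec()));
    }
    RecoveryEntry {
        channel_id: channel_id.to_string(),
        tlvs,
    }
}
```

Modelling the secrets as an `Option` keeps the "absent" case an ordinary code path instead of an error, which matches the documented behavior of still emitting recovery input.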
Collaborator

Very good. With the shachain available, I think there isn't a whole lot missing to be able to resume the channels in their current state without closing them.

@@ -0,0 +1,96 @@
#[cfg(feature = "backup")]
mod enabled {
Collaborator

I'm not sure I follow here. I was hoping we'd just have the following in src/signer/mod.rs:

#[cfg(feature = "backup")]
mod backup;

That's all we really need. At the call sites we then just prepend an inline cfg check. Having dummy stubs around is terrible DX.


network: Network,
state: Arc<Mutex<crate::persist::State>>,
backup: backup_runtime::Runtime,
Collaborator

#[cfg(feature = "backup")]

Here and the other call sites, and we avoid the weird roundabout way of disabling and enabling a Runtime.
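
Applied to a struct field, the suggestion might look like this. It is a sketch only: the `Signer` here is a cut-down hypothetical with a `String` network, and `backup::Runtime` is an empty placeholder for whatever type the backup module exposes.

```rust
#[cfg(feature = "backup")]
mod backup {
    /// Hypothetical placeholder for the backup runtime type.
    #[derive(Default)]
    pub struct Runtime;
}

/// Cut-down, hypothetical signer: the backup field exists only when the
/// feature is enabled, so no stub type is needed when it is disabled.
struct Signer {
    network: String,
    #[cfg(feature = "backup")]
    backup: backup::Runtime,
}

impl Signer {
    fn new(network: &str) -> Self {
        Signer {
            network: network.to_string(),
            // The same guard is repeated where the field is initialized.
            #[cfg(feature = "backup")]
            backup: backup::Runtime::default(),
        }
    }
}
```

The cost of this style is that every use of the field needs its own guard, but the disabled build then contains no backup code at all.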
