[Bug]: step ca renew --daemon caches keypair at startup; produces cert/key mismatch after external key replacement #1632

@euvitudo

Steps to Reproduce

  1. Issue an initial server cert/key:

    step ca certificate "test.local" /tmp/server.crt /tmp/server.key \
        --provisioner=admin --password-file=/tmp/pw --not-after=24h
    
  2. Start the renewal daemon. To make the cycle observable quickly, use --renew-period:

    step ca renew --daemon --renew-period=2m \
        --password-file=/tmp/pw \
        --exec /bin/true \
        /tmp/server.crt /tmp/server.key &
    
  3. Wait for at least one successful renewal (look for "certificate renewed, next in 2m0s" in the daemon log). Confirm the cert and key still match:

    diff <(openssl x509 -in /tmp/server.crt -pubkey -noout | openssl pkey -pubin -outform DER | sha256sum) \
         <(openssl pkey -in /tmp/server.key -pubout -outform DER | sha256sum)
    # → identical
    
  4. Externally replace the keypair without notifying the daemon — equivalent to a manual key rotation:

    step ca certificate "test.local" /tmp/server.crt /tmp/server.key \
        --provisioner=admin --password-file=/tmp/pw --not-after=24h --force
    
  5. Confirm cert and key still match (they were just minted together):

    diff <(openssl x509 -in /tmp/server.crt -pubkey -noout | openssl pkey -pubin -outform DER | sha256sum) \
         <(openssl pkey -in /tmp/server.key -pubout -outform DER | sha256sum)
    # → identical
    
  6. Wait for the renewal daemon's next cycle (~2 minutes). Re-compare:

    diff <(openssl x509 -in /tmp/server.crt -pubkey -noout | openssl pkey -pubin -outform DER | sha256sum) \
         <(openssl pkey -in /tmp/server.key -pubout -outform DER | sha256sum)
    # → DIFFER. cert.pubkey is the keypair from step 1 (cached in the daemon),
    #   key.pubkey is the keypair from step 4 (on disk).
    

The daemon's stdout reports the renewal as successful. No error, no warning.

Your Environment

  • step-cli 0.30.2 (Linux/amd64)
  • step-ca 0.30.2 (admin provisioner, JWK, badger v2 db)
  • Reproducible on Alpine 3.23 container, Linux 5.15 host

Expected Behavior

One of:

  • Per-renewal re-read. Before each renewal, re-read key_path from disk and use that key to sign the CSR. Slight added I/O cost per cycle, but matches the user expectation that the daemon "operates on the files at the given paths."
  • Detect change and exit / warn. Stat the key file at renewal time; if mtime or inode has changed since startup, log an error and exit (or refuse to renew) rather than silently producing a mismatched pair.
  • Documentation. If caching is intentional and considered correct behavior, the man page / docs should explicitly state: "the key file MUST NOT be modified while the daemon is running; replace it only by stopping the daemon, replacing the file, and restarting." Currently this constraint is invisible to users.
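
A minimal sketch of the "detect change" option, assuming the daemon keeps the os.FileInfo it obtained for the key file at startup. The helper name keyFileChanged and the temp-file demo are hypothetical and not part of step-cli; the sketch only shows that os.SameFile plus an mtime/size comparison catches a replace-by-rename.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// keyFileChanged reports whether the file at path differs from the snapshot
// taken at startup: a different underlying file (for example a new inode
// after an atomic rename), a different mtime, or a different size.
func keyFileChanged(startup os.FileInfo, path string) (bool, error) {
	now, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	if !os.SameFile(startup, now) {
		return true, nil
	}
	return !now.ModTime().Equal(startup.ModTime()) || now.Size() != startup.Size(), nil
}

// demo snapshots a key file, replaces it by rename, and returns the check
// result before and after the replacement.
func demo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "key-watch")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	keyFile := filepath.Join(dir, "server.key")
	if err := os.WriteFile(keyFile, []byte("key present at daemon startup\n"), 0o600); err != nil {
		return false, false, err
	}
	startup, err := os.Stat(keyFile) // what the daemon would record once
	if err != nil {
		return false, false, err
	}
	before, err := keyFileChanged(startup, keyFile)
	if err != nil {
		return false, false, err
	}

	// External rotation: write a new file, then rename it over the old one.
	tmp := keyFile + ".new"
	if err := os.WriteFile(tmp, []byte("externally rotated key material\n"), 0o600); err != nil {
		return false, false, err
	}
	if err := os.Rename(tmp, keyFile); err != nil {
		return false, false, err
	}
	after, err := keyFileChanged(startup, keyFile)
	return before, after, err
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("changed before replacement: %v\n", before)
	fmt.Printf("changed after replacement:  %v\n", after)
}
```

A stat-based check is cheap but not airtight: an in-place rewrite that keeps the same size within the filesystem's timestamp granularity could slip through, which is one argument for the per-renewal re-read instead.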

Actual Behavior

The daemon silently signs CSRs with its in-memory cached key indefinitely, producing certs that fail validation against the on-disk key for any downstream consumer.

Additional Context

step ca renew --daemon reads the certificate and private key once at process startup and re-uses the cached private key to sign the CSR for every subsequent renewal. It never re-reads the key file from disk. If the on-disk key is replaced externally (manual rotation, backup restore, GitOps deploy, a separate step ca certificate invocation, etc.) while the daemon is running, every following renewal produces a certificate whose pubkey matches the cached key, not the current on-disk key. The renewal itself reports success — exit 0, --exec hook fires, log says "certificate renewed" — and the mismatch only surfaces when a downstream consumer reloads TLS from the same files and fails with tls: private key does not match public key.
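
The downstream failure can be reproduced without a CA. The sketch below is an illustration using only the Go standard library, not step-cli code: it mints two self-signed pairs for the same name and loads the first certificate with the second key, which is the state the files are in after step 6. The helper names mintPair and demo are hypothetical.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
	"time"
)

// mintPair returns a self-signed certificate and its PKCS#8 private key,
// both PEM-encoded.
func mintPair(cn string) (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: cn},
		DNSNames:     []string{cn},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	certDER, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalPKCS8PrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// demo returns the load error for a matching pair and for a certificate
// paired with a key that was rotated underneath it.
func demo() (matched, mismatched, err error) {
	cachedCert, cachedKey, err := mintPair("test.local")
	if err != nil {
		return nil, nil, err
	}
	_, diskKey, err := mintPair("test.local") // stands in for the rotated on-disk key
	if err != nil {
		return nil, nil, err
	}
	_, matched = tls.X509KeyPair(cachedCert, cachedKey)
	_, mismatched = tls.X509KeyPair(cachedCert, diskKey)
	return matched, mismatched, nil
}

func main() {
	matched, mismatched, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("cert + its own key: ", matched)
	fmt.Println("cert + rotated key: ", mismatched)
}
```

A consumer that reloads with tls.X509KeyPair or tls.LoadX509KeyPair hits the second case, even though the daemon reported the renewal as successful.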

Impact

This bit our cert-rotation pipeline (multi-cert, openbao stack). Recovery procedure for any mismatch involves regenerating the keypair via a separate provisioner-direct issuance — which, when run while the daemon is alive, creates the very mismatch we were trying to fix. The pattern is identifiable: the bad cert's pubkey is identical across multiple consecutive failed renewals even though the on-disk key has been regenerated to different values between failures. The "ghost" pubkey is the value of the key file at daemon-startup time.

Anyone running the daemon long-lived and rotating keys out-of-band — GitOps pipelines, backup/restore drills, multi-cert rotators, our own setup — is exposed.

Proposed fix

PR #1441 (currently open against issue #1343) is already extending (r *renewer).Daemon in command/ca/renew.go with a per-cycle rekeyFunc callback:

// PR #1441 — current state
func (r *renewer) Daemon(
    outFile string,
    next, expiresIn, renewPeriod time.Duration,
    afterRenew func() error,
    rekeyFunc func() error,
) error {

rekeyFunc fires after each successful renewal cycle and, when set by rekey --daemon, generates a fresh key. For plain renew --daemon, rekeyFunc is nil, so the existing renewal path is untouched and the cached-key bug described above persists.

The natural extension is a symmetric callback that re-loads the on-disk key before the CSR is constructed:

// proposed extension on top of PR #1441
func (r *renewer) Daemon(
    outFile string,
    next, expiresIn, renewPeriod time.Duration,
    afterRenew func() error,
    rekeyFunc func() error,
    reloadKeyFunc func() error,  // <-- new; nil = preserve current behavior
) error {
    // ... existing loop ...
    for {
        select {
        case <-tickerC:
            if reloadKeyFunc != nil {
                if err := reloadKeyFunc(); err != nil {
                    errLog.Println(err)
                    continue  // skip this cycle; try again next tick
                }
            }
            // ... existing renew code that constructs the CSR + calls /renew ...
        }
    }
}

reloadKeyFunc would be implemented by renewCertificateAction (the renew command's setup) as something like:

reloadKey := func() error {
    signer, err := cryptoutil.CreateSigner(kmsURI, keyFile,
        pemutil.WithFilename(keyFile),
        pemutil.WithPasswordFile(passFile),
    )
    if err != nil {
        return err
    }
    r.signer = signer  // or whatever field the renewer stores the cached key in
    return nil
}
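
To illustrate the effect of such a callback without depending on step-cli internals, here is a self-contained sketch using only the standard library. Everything in it is a stand-in: toyRenewer, loadSigner, writeFreshKey and matchesDisk are hypothetical names, and the key is an unencrypted PKCS#8 PEM file rather than whatever the real renewer loads.

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// toyRenewer is a hypothetical stand-in for the real renewer; only the
// cached signer matters here.
type toyRenewer struct{ signer crypto.Signer }

// loadSigner reads an unencrypted PKCS#8 PEM private key from disk.
func loadSigner(path string) (crypto.Signer, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	block, _ := pem.Decode(raw)
	if block == nil {
		return nil, errors.New("no PEM block in " + path)
	}
	key, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if err != nil {
		return nil, err
	}
	signer, ok := key.(crypto.Signer)
	if !ok {
		return nil, errors.New("key does not implement crypto.Signer")
	}
	return signer, nil
}

// writeFreshKey overwrites path with a newly generated P-256 key.
func writeFreshKey(path string) error {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	der, err := x509.MarshalPKCS8PrivateKey(key)
	if err != nil {
		return err
	}
	block := &pem.Block{Type: "PRIVATE KEY", Bytes: der}
	return os.WriteFile(path, pem.EncodeToMemory(block), 0o600)
}

// matchesDisk reports whether the cached signer equals the key stored at path.
func (r *toyRenewer) matchesDisk(path string) (bool, error) {
	onDisk, err := loadSigner(path)
	if err != nil {
		return false, err
	}
	pub, ok := r.signer.Public().(interface{ Equal(crypto.PublicKey) bool })
	return ok && pub.Equal(onDisk.Public()), nil
}

// demo rotates the key file behind the renewer's back and returns whether
// the cached key matches the disk before and after the reload callback runs.
func demo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "reload-key")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	keyFile := filepath.Join(dir, "server.key")
	r := &toyRenewer{}
	reloadKeyFunc := func() error {
		signer, err := loadSigner(keyFile)
		if err != nil {
			return err
		}
		r.signer = signer
		return nil
	}
	if err := writeFreshKey(keyFile); err != nil {
		return false, false, err
	}
	if err := reloadKeyFunc(); err != nil { // startup load
		return false, false, err
	}
	if err := writeFreshKey(keyFile); err != nil { // external rotation
		return false, false, err
	}
	stale, err := r.matchesDisk(keyFile)
	if err != nil {
		return false, false, err
	}
	if err := reloadKeyFunc(); err != nil { // what each cycle would do
		return false, false, err
	}
	reloaded, err := r.matchesDisk(keyFile)
	return stale, reloaded, err
}

func main() {
	stale, reloaded, err := demo()
	fmt.Println(stale, reloaded, err)
}
```

Without the reload, the cached signer no longer matches the file after rotation; after the callback runs, it does. The real change would reuse the renewer's existing key-loading path rather than this simplified loader.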

Properties of this extension:

  • Backwards compatible. Callers that pass nil get exactly the current behavior. Go has no default arguments, so each existing call site needs a one-line update to pass nil explicitly, but nothing changes at runtime. No surprise regressions for users who rely on the cached-key semantics (if any do).
  • Mirrors PR #1441's structure. Same callback-injection pattern, same place in the loop, just on the input side of the renewal instead of the output side. Reviewer load is minimal.
  • Closes the failure class. Combined with PR #1441, both daemon modes (renew --daemon and rekey --daemon) become robust against external key changes between cycles.
  • Optional KMS consideration. If the key is in a KMS rather than a file, reloadKeyFunc is a no-op — the KMS handle is already a live reference, not a cached blob. Worth a sentence in the docs but no code change.

Workaround

Replace step ca renew --daemon with a bash poll loop that calls one-shot step ca renew --force each cycle. Each invocation re-reads the cert and key from disk, sidestepping the cache. We've shipped this workaround locally and it eliminates the failure class. It is not a long-term answer: step ca renew --daemon should Just Work for the documented use case.

