Add per-node secret rotation tracking with drift detection#1781

Open
lmiccini wants to merge 1 commit into openstack-k8s-operators:main from lmiccini:nodeset_rmqu_finalizer_configmap

Conversation

@lmiccini
Contributor

@lmiccini lmiccini commented Jan 27, 2026

Implements persistent tracking of secret versions deployed to each node
in OpenStackDataPlaneNodeSet to coordinate safe deletion of old
credentials during gradual rollouts.

Implementation:

- ConfigMap-based storage (`<nodeset-name>-secret-tracking`) records
  which secret versions are deployed to each node

- Tracks "Current" (deployed) vs "Expected" (cluster) secret states:
  - Current: Hash of secrets actually deployed to nodes
  - Expected: Hash of secrets currently in cluster
  - Drift detected when Current != Expected

- Deployment processing updates tracking data per node with secret
  hashes, skipping stale deployments (hash != cluster hash)

- Drift detection runs after each reconciliation, comparing cluster
  secrets against tracking ConfigMap, using APIReader to bypass cache

- Status field SecretDeployment reports:
  - UpdatedNodes: count of nodes on current secret versions
  - AllNodesUpdated: whether all nodes have current versions
  - ConfigMapName, TotalNodes, LastUpdateTime

- APIReader field added to reconciler to read directly from Kubernetes
  API, bypassing controller-runtime cache for accurate drift detection

This enables safe credential deletion only when all nodes across all
nodesets sharing the credentials have been updated.
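
The Current/Expected comparison described above can be sketched roughly as follows (illustrative only; the PR itself is Go operator code, and the field names here are taken from the ConfigMap layout shown later in the thread):

```python
# Illustrative sketch of the "Current" vs "Expected" tracking logic:
# recompute per-node update counts against the hash currently in the
# cluster, and only advance currentHash once every node matches it.
def detect_drift(tracking: dict, cluster_hash: str) -> dict:
    expected = cluster_hash
    nodes = tracking["nodes"]
    updated = [n for n, info in nodes.items() if info["secretHash"] == expected]
    all_updated = len(updated) == len(nodes)
    # currentHash only advances when ALL nodes carry the same version
    current = expected if all_updated else tracking["currentHash"]
    return {
        "currentHash": current,
        "expectedHash": expected,
        "updatedNodes": len(updated),
        "allNodesUpdated": all_updated,
        "drift": current != expected,
    }
```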

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/56ac80bd0e7547ad88350eb0206886b5

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 18m 47s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 23m 38s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 37m 31s
adoption-standalone-to-crc-ceph-provider FAILURE in 3h 01m 55s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 51m 23s
openstack-operator-docs-preview POST_FAILURE in 2m 32s

@stuggi stuggi requested a review from slagle January 28, 2026 08:13
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/db62c9cd33b34a538c7eccf243769b6a

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 02m 26s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 20m 56s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 36m 03s
adoption-standalone-to-crc-ceph-provider FAILURE in 1h 46m 57s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 34m 08s
openstack-operator-docs-preview POST_FAILURE in 3m 15s

@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 2 times, most recently from 3885c4a to c1fe8f8 Compare February 7, 2026 18:56
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/b5d3972863e64857b2da5055f867ef55

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 20m 43s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 21m 41s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 36m 22s
adoption-standalone-to-crc-ceph-provider FAILURE in 2h 05m 30s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 43m 01s
✔️ openstack-operator-docs-preview SUCCESS in 3m 14s

@lmiccini
Contributor Author

lmiccini commented Feb 8, 2026

/retest

@lmiccini
Contributor Author

lmiccini commented Feb 8, 2026

recheck

@lmiccini
Contributor Author

lmiccini commented Feb 8, 2026

/test openstack-operator-build-deploy-kuttl-4-18

@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 2 times, most recently from cbfbb7c to f52529a Compare February 8, 2026 15:01
@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from f52529a to 017d2ca Compare February 10, 2026 06:41
@lmiccini
Contributor Author

/test functional

@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 2 times, most recently from 97bb482 to b1d9350 Compare February 12, 2026 15:46
@lmiccini
Contributor Author

/test openstack-operator-build-deploy-kuttl-4-18

@lmiccini
Contributor Author

/test openstack-operator-build-deploy-kuttl-4-18

@slagle
Contributor

slagle commented Feb 17, 2026

Is preventing the deletion of in-use rabbitmq users the point of this PR? Why do we need these finalizers to enable "safe rotation"?

I'm concerned about the size and complexity of this PR. Personally, this is difficult to review. We might want to come up with a simpler design that we code without AI, and then let AI build on top of that. I'm having a hard time reasoning about all the different changes here.

This also adds some service specific code to the dataplane (nova, neutron, ironic). While we have some instances of that, we have really tried to avoid that in the past, and do things generically and let CRD fields drive the generic code.

I'm just brainstorming, but a simpler solution might be:

  • We know the Secret/ConfigMaps in use at service deployment time.
  • Services have a field whose value we use to inspect the Secret/ConfigMap and we save the value found (such as transportURL) on the NodeSet or Deployment Status when the Deployment succeeds
  • rabbitmq user deletion checks NodeSet or Deployment Status and, if it finds that user in use, blocks the deletion.

For example, the nova Service has in the spec:

serviceTrackingFields:
  - dataSource: # ConfigMapRef or SecretRef
    fieldPattern: "nova-transport-url-pattern"

Then during Service Deployment, there is similar logic to GetNovaCellRabbitMqUserFromSecret, we get the value of the user and save it on the NodeSet and/or Deployment Status. If we attempt to rotate or delete the user, and that user is still set on a Status, the operation is blocked.

I would also delay solving the problem of enforcing that all nodes in the nodeset have been updated by a Deployment. This is a wider problem that should be solved separately from the user rotation problem.
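
The proposed status-based check might look roughly like this (a hypothetical sketch; `user_from_transport_url` and `deletion_blocked` are illustrative names, not code from the PR):

```python
# Hypothetical sketch of the simpler flow proposed above: extract the
# rabbitmq user from the transport URL recorded at deployment time,
# then block deletion while any Status still records that user.
from urllib.parse import urlparse


def user_from_transport_url(url: str) -> str:
    # e.g. "rabbit://user7:pass@rabbitmq:5672/" -> "user7"
    return urlparse(url).username


def deletion_blocked(user: str, users_in_statuses: list) -> bool:
    # True while some NodeSet/Deployment Status still references the user
    return user in users_in_statuses
```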

Contributor

@slagle slagle left a comment

See previous comment

@slagle
Contributor

slagle commented Feb 17, 2026

Or even simpler...we already have the Secret and ConfigMap hashes saved in the Deployment statuses. If the rabbitmq user rotation see that those hashes are out of date, the rotation, or at least the old user deletion part of the rotation is blocked.

@lmiccini
Contributor Author

lmiccini commented Feb 18, 2026

Is preventing the deletion of in-use rabbitmq users the point of this PR? Why do we need these finalizers to enable "safe rotation"?

I'm concerned about the size and complexity of this PR. Personally, this is difficult to review. We might want to come up with a simpler design that we code without AI, and then let AI build on top of that. I'm having a hard time reasoning about all the different changes here.

This also adds some service specific code to the dataplane (nova, neutron, ironic). While we have some instances of that, we have really tried to avoid that in the past, and do things generically and let CRD fields drive the generic code.

I'm just brainstorming, but a simpler solution might be:

* We know the Secret/ConfigMaps in use at service deployment time.

* Services have a field whose value we use to inspect the Secret/ConfigMap and we save the value found (such as transportURL) on the NodeSet or Deployment Status when the Deployment succeeds

* rabbitmq user deletion checks NodeSet or Deployment Status and, if it finds that user in use, blocks the deletion.

For example, the nova Service has in the spec:

serviceTrackingFields:
  - dataSource: # ConfigMapRef or SecretRef
    fieldPattern: "nova-transport-url-pattern"

Then during Service Deployment, there is similar logic to GetNovaCellRabbitMqUserFromSecret, we get the value of the user and save it on the NodeSet and/or Deployment Status. If we attempt to rotate or delete the user, and that user is still set on a Status, the operation is blocked.

I would also delay solving the problem of enforcing that all nodes in the nodeset have been updated by a Deployment. This is a wider problem that should be solved separately from the user rotation problem.

Thanks @slagle, appreciate you taking the time.
The logic is more or less what you are proposing here.
We add finalizers to the rabbitmq users so that each service can "signal" that they are in use, and garbage-collect a user only once no finalizer is left, following the same pattern we use in other places, to avoid leftover credentials that could pose a security risk.

The additional tracking "on top" is required because nova_compute, neutron and ironic agents running on the dataplane could each use a different rabbitmq user, so I track which nodes in a nodeset ran a deployment for those services and store that in a configmap, updating it until all nodes have reconciled to the hashes you mention in your last comment. Here is how it could look:

[zuul@localhost ~]$ oc get configmap openstack-edpm-ipam-service-tracking -o yaml
apiVersion: v1
data:
  neutron.secretHash: 6e657574726f6e2d646863702d6167656e742d6e657574726f6e2d636f6e6669673a313737303632353235383b6e657574726f6e2d7372696f762d6167656e742d6e657574726f6e2d636f6e6669673a313737303632353235383b
  neutron.updatedNodes: '[]'
  nova.secretHash: 6e6f76612d63656c6c312d636f6d707574652d636f6e6669673a313737303634333733313b
  nova.updatedNodes: '["edpm-compute-0","edpm-compute-1"]'
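
The `*.secretHash` values above appear to be hex-encoded `name:resourceVersion;` pairs, so they can be decoded for inspection (a small helper sketch, not code from the PR):

```python
# The *.secretHash values in the tracking ConfigMap above are
# hex-encoded "name:resourceVersion;" pairs; decoding makes them
# human-readable when debugging.
def decode_tracking(value: str) -> str:
    return bytes.fromhex(value).decode()
```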

If I understand correctly, you would like to flip this around and have infra-operator track each nodeset's rabbitmq usage instead? I'm not sure having infra-operator introspect dataplane objects is my preferred approach, especially because we have no way of knowing whether another service that uses rabbitmq will be added tomorrow, which would leave us playing catch-up with the dataplane. That said, I can try to prototype something and see how ugly it gets.
Thanks again.

@stuggi
Contributor

stuggi commented Feb 23, 2026

If I understand correctly you would like to flip this around and have infra-operator track each nodeset rabbitmq usage instead?

We cannot do that: it would introduce a circular dependency, because infra-operator would gain a dependency on the openstack-operator.

Implements persistent tracking of secret versions deployed to each node
in OpenStackDataPlaneNodeSet to coordinate safe deletion of old
credentials during gradual rollouts.

Implementation:

- ConfigMap-based storage (`<nodeset-name>-secret-tracking`) records
  which secret versions are deployed to each node

- Tracks "Current" (deployed) vs "Expected" (cluster) secret states:
  - Current: Hash of secrets actually deployed to nodes
  - Expected: Hash of secrets currently in cluster
  - Drift detected when Current != Expected

- Deployment processing updates tracking data per node with secret
  hashes, skipping stale deployments (hash != cluster hash)

- Drift detection runs after each reconciliation, comparing cluster
  secrets against tracking ConfigMap, using APIReader to bypass cache

- Status field SecretDeployment reports:
  - UpdatedNodes: count of nodes on current secret versions
  - AllNodesUpdated: whether all nodes have current versions
  - ConfigMapName, TotalNodes, LastUpdateTime

- APIReader field added to reconciler to read directly from Kubernetes
  API, bypassing controller-runtime cache for accurate drift detection

This enables safe credential deletion only when all nodes across all
nodesets sharing the credentials have been updated.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@lmiccini lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from b1d9350 to cd2cef1 Compare February 23, 2026 14:10
@openshift-ci
Contributor

openshift-ci bot commented Feb 23, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lmiccini
Once this PR has been reviewed and has the lgtm label, please ask for approval from slagle. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lmiccini lmiccini changed the title Nodeset rabbitmquser finalizer management and status tracking via configmap Add per-node secret rotation tracking with drift detection Feb 23, 2026
@lmiccini
Contributor Author

  • NodeSet: openstack-edpm-ipam with 2 nodes: compute-0, compute-1
  • Shared Secret: nova-cell1-compute-config (contains RabbitMQ credentials)
  • Initial User: user7 (hash: n5b4h...)
  • Rotated User: user8 (hash: n656h...)

Stage 1: Initial State - All Nodes on Old Credentials

Secret (cluster state)

apiVersion: v1
kind: Secret
metadata:
  name: nova-cell1-compute-config
  resourceVersion: "12345"
data:
  transport_url: "rabbit://user7:pass@rabbitmq:5672/"  # Old credentials

ConfigMap (tracking state)

apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n5b4h...",
      "expectedHash": "n5b4h...",
      "nodes": {
        "compute-0": {
          "secretHash": "n5b4h...",
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        },
        "compute-1": {
          "secretHash": "n5b4h...",
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        }
      }
    }

NodeSet Status

status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 2              # ✓ All nodes on current version
    allNodesUpdated: true         # ✓ Safe to delete old credentials (if they existed)
    lastUpdateTime: "2026-02-20T10:00:00Z"

State: All nodes running with user7, no drift, system stable.


Stage 2: Credential Rotation - Cluster Secret Changes

Administrator rotates RabbitMQ credentials by updating the openstackcontrolplane, switching cell1 to use a different user.

Secret (cluster state) - CHANGED

apiVersion: v1
kind: Secret
metadata:
  name: nova-cell1-compute-config
  resourceVersion: "67890"      # ← Changed
data:
  transport_url: "rabbit://user8:pass@rabbitmq:5672/"  # ← New credentials

ConfigMap (tracking state) - UNCHANGED

apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n5b4h...",     # Still old hash
      "expectedHash": "n5b4h...",    # Still old hash
      "nodes": {
        "compute-0": {
          "secretHash": "n5b4h...",  # Still old hash
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        },
        "compute-1": {
          "secretHash": "n5b4h...",  # Still old hash
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        }
      }
    }

NodeSet Status - DRIFT DETECTED

status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 0               # ← Changed: drift detected, reset to 0
    allNodesUpdated: false        # ← Changed: drift exists
    lastUpdateTime: "2026-02-23T11:36:46Z"  # ← Updated by drift detection

State: Drift detected! Nodes still have user7, but cluster expects user8.
Action Required: Deploy to update nodes.
Credential Status: ⚠️ Cannot delete user7 - nodes still using it!


Stage 3: Partial Deployment - Update compute-0 Only

Administrator creates deployment with ansibleLimit: compute-0.

Deployment

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-c0-limit
spec:
  nodeSets:
    - openstack-edpm-ipam
  ansibleLimit: compute-0        # Only this node

After deployment completes:

Secret (cluster state) - UNCHANGED

data:
  transport_url: "rabbit://user8:pass@rabbitmq:5672/"  # Still user8

ConfigMap (tracking state) - PARTIALLY UPDATED

apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n5b4h...",     # ← NOT updated (compute-1 still on n5b4h)
      "expectedHash": "n656h...",    # ← Updated to cluster hash
      "nodes": {
        "compute-0": {
          "secretHash": "n656h...",  # ← Updated to user8
          "deploymentName": "edpm-deployment-c0-limit",
          "lastUpdated": "2026-02-23T12:00:00Z"
        },
        "compute-1": {
          "secretHash": "n5b4h...",  # ← Still on user7
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        }
      }
    }

NodeSet Status - PARTIAL UPDATE

status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 1               # Only 1 of 2 nodes updated
    allNodesUpdated: false        # ← Still false
    lastUpdateTime: "2026-02-23T12:00:00Z"

State: compute-0 now has user8, compute-1 still has user7.
Credential Status: ⚠️ CRITICAL - Cannot delete user7! compute-1 still needs it!


Stage 4: Full Deployment - Update All Remaining Nodes

Administrator deploys to all nodes (or remaining nodes).

Deployment

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-full
spec:
  nodeSets:
    - openstack-edpm-ipam
  # No ansibleLimit - all nodes

After deployment completes:

Secret (cluster state) - UNCHANGED

data:
  transport_url: "rabbit://user8:pass@rabbitmq:5672/"  # Still user8

ConfigMap (tracking state) - FULLY UPDATED

apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n656h...",     # ← Updated: all nodes on n656h
      "expectedHash": "n656h...",    # Matches cluster
      "nodes": {
        "compute-0": {
          "secretHash": "n656h...",  # user8
          "deploymentName": "edpm-deployment-full",
          "lastUpdated": "2026-02-23T13:00:00Z"
        },
        "compute-1": {
          "secretHash": "n656h...",  # ← Updated to user8
          "deploymentName": "edpm-deployment-full",
          "lastUpdated": "2026-02-23T13:00:00Z"
        }
      }
    }

NodeSet Status - ALL UPDATED

status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 2               # ← All nodes updated
    allNodesUpdated: true         # ← Safe to proceed!
    lastUpdateTime: "2026-02-23T13:00:00Z"

State: All nodes now have user8, no drift.
Credential Status: ✓ SAFE - Can now delete user7 credentials!


Stage 5: Multiple NodeSets Scenario

What if multiple NodeSets share the same credentials?

Setup

  • NodeSet 1: openstack-edpm-compute (2 nodes: compute-0, compute-1)
  • NodeSet 2: openstack-edpm-storage (2 nodes: storage-0, storage-1)
  • Shared Secret: nova-cell1-compute-config (both use it)
  • Total Nodes: 4 nodes across 2 NodeSets

After Partial Deployment (compute NodeSet only)

Compute NodeSet Status

status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 2
    allNodesUpdated: true         # ✓ Compute NodeSet is done

Storage NodeSet Status

status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 0
    allNodesUpdated: false        # ✗ Storage NodeSet not updated

Credential Status: ⚠️ BLOCKED - Even though compute NodeSet shows allNodesUpdated: true, storage nodes still need user7!

After Deploying Both NodeSets

Compute NodeSet Status

status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 2
    allNodesUpdated: true         #

Storage NodeSet Status

status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 2
    allNodesUpdated: true         #

Credential Status: ✓ SAFE - All 4 nodes across both NodeSets updated. Now safe to delete user7!


Key Observations

currentHash vs expectedHash

  • currentHash: The hash of secrets actually deployed to nodes

    • Only updated when ALL nodes have the same version
    • Used to detect when it's safe to delete old credentials
  • expectedHash: The hash of secrets in the cluster (desired state)

    • Always matches current cluster secret hash
    • Used to detect drift

Drift Detection Logic

if currentHash != expectedHash:
    drift_detected = true
    updatedNodes = 0
    allNodesUpdated = false
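
A runnable form of the pseudocode above (field names taken from the walkthrough; the actual implementation lives in the Go reconciler):

```python
# Runnable form of the drift-detection pseudocode above: when the
# deployed hash no longer matches the cluster hash, the status counters
# are reset so nothing treats the old credentials as deletable.
def apply_drift_detection(status: dict, current_hash: str, expected_hash: str) -> dict:
    if current_hash != expected_hash:
        status["updatedNodes"] = 0
        status["allNodesUpdated"] = False
    return status
```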

Credential Deletion Safety

Old credentials can ONLY be deleted when:

  1. ALL NodeSets sharing the secret show allNodesUpdated: true
  2. Each NodeSet's currentHash == expectedHash
  3. No deployments in progress
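
The three conditions above combine into a single cross-NodeSet check, sketched here (illustrative; `deploymentInProgress` is an assumed field name standing in for condition 3):

```python
# Sketch of the cross-NodeSet safety rule above: old credentials may
# only be removed when every NodeSet sharing the secret is fully
# updated, in sync with the cluster, and has no deployment running.
def safe_to_delete(nodesets: list) -> bool:
    return all(
        ns["allNodesUpdated"]
        and ns["currentHash"] == ns["expectedHash"]
        and not ns["deploymentInProgress"]
        for ns in nodesets
    )
```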

Stale Deployment Handling

If deployment edpm-deployment-old was created before rotation but completes after:

  • Deployment has stale secret hash (n5b4h) from when it was created
  • Cluster now has new secret hash (n656h)
  • Action: Skip this deployment entirely - don't update tracking
  • Reason: Prevents incorrectly marking nodes as "updated" when they got old credentials
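
The stale-deployment rule reduces to a single hash comparison at completion time, sketched here (illustrative helper name, not code from the PR):

```python
# Sketch of the stale-deployment rule above: a deployment created
# before a rotation carries the old secret hash, so its completion
# must NOT update the tracking data.
def should_record(deployment_hash: str, cluster_hash: str) -> bool:
    return deployment_hash == cluster_hash
```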

@lmiccini
Contributor Author

This new approach can be used directly by openstack-operator to set and remove finalizers on rabbitmq users, or by infra-operator to read the nodeset status and do the finalizer management (openstack-k8s-operators/infra-operator@main...lmiccini:infra-operator:track_dataplaneusers)

@openshift-ci
Contributor

openshift-ci bot commented Feb 23, 2026

@lmiccini: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/openstack-operator-build-deploy-kuttl 1698305 link true /test openstack-operator-build-deploy-kuttl
ci/prow/precommit-check cd2cef1 link true /test precommit-check
ci/prow/openstack-operator-build-deploy-kuttl-4-18 cd2cef1 link true /test openstack-operator-build-deploy-kuttl-4-18

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
