Add per-node secret rotation tracking with drift detection #1781

lmiccini wants to merge 1 commit into openstack-k8s-operators:main
Conversation
Build failed (check pipeline). https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/56ac80bd0e7547ad88350eb0206886b5 ✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 18m 47s
Is preventing the deletion of in-use rabbitmq users the point of this PR? Why do we need these finalizers to enable "safe rotation"? I'm concerned about the size and complexity of this PR; personally, I find it difficult to review. We might want to come up with a simpler design that we code without AI, and then let AI build on top of that. I'm having a hard time reasoning about all the different changes here. This also adds some service-specific code to the dataplane (nova, neutron, ironic). While we have some instances of that, we have really tried to avoid it in the past and instead do things generically, letting CRD fields drive the generic code. I'm just brainstorming, but a simpler solution might be:

For example, the nova Service has in the spec: serviceTrackingFields:

Then during Service deployment, with logic similar to GetNovaCellRabbitMqUserFromSecret, we get the value of the user and save it on the NodeSet and/or Deployment status. If we attempt to rotate or delete the user and that user is still set on a status, the operation is blocked. I would also delay solving the problem of enforcing that all nodes in the nodeset have been updated by a Deployment; that is a wider problem that should be solved separately from the user rotation problem.
Or even simpler... we already have the Secret and ConfigMap hashes saved in the Deployment statuses. If the rabbitmq user rotation sees that those hashes are out of date, the rotation, or at least the old-user-deletion part of the rotation, is blocked.
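To make the suggestion concrete, a minimal sketch of the hash-gating idea (all names here are illustrative, not the actual operator API; the real code would compare the hashes already stored in the Deployment statuses):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashSecret models the secret-hash computation; the real operators hash
// the full secret data, here we hash a single value for brevity.
func hashSecret(data string) string {
	sum := sha256.Sum256([]byte(data))
	return hex.EncodeToString(sum[:])
}

// canDeleteOldUser reports whether the old RabbitMQ user may be removed:
// every deployment status must already carry the hash of the current secret.
func canDeleteOldUser(clusterSecret string, deploymentHashes map[string]string) bool {
	want := hashSecret(clusterSecret)
	for _, got := range deploymentHashes {
		if got != want {
			return false // at least one deployment is stale: block deletion
		}
	}
	return true
}

func main() {
	current := "rabbit://user8:pass@rabbitmq:5672/"
	stale := map[string]string{
		"edpm-deployment-initial": hashSecret("rabbit://user7:pass@rabbitmq:5672/"),
	}
	fresh := map[string]string{
		"edpm-deployment-full": hashSecret(current),
	}
	fmt.Println(canDeleteOldUser(current, stale)) // false: a deployment is out of date
	fmt.Println(canDeleteOldUser(current, fresh)) // true: safe to delete the old user
}
```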
Thanks @slagle, appreciate you taking the time. The additional stuff "on top" is required because we could have different rabbitmq users for nova_compute, neutron and the ironic agents running on the dataplane, so I try to track which node in a nodeset ran a deployment for the aforementioned services and store that in a ConfigMap that we update until all nodes have reconciled to the hashes you mention in the last comment. Here is how it could look. If I understand correctly, you would like to flip this around and have infra-operator track each nodeset's rabbitmq usage instead? I'm not sure having infra-operator introspect dataplane objects is my preferred approach, especially because we have no way of knowing whether an additional service that could use rabbitmq will be added tomorrow, so we would have to play catch-up with the dataplane. That said, I can try to prototype something and see how ugly it gets.
We cannot do that: it would introduce a circular dependency, because infra would add a dependency on the openstack-operator.
Implements persistent tracking of secret versions deployed to each node in an OpenStackDataPlaneNodeSet to coordinate safe deletion of old credentials during gradual rollouts.

Implementation:

- ConfigMap-based storage (`<nodeset-name>-secret-tracking`) records which secret versions are deployed to each node
- Tracks "Current" (deployed) vs "Expected" (cluster) secret states:
  - Current: hash of secrets actually deployed to nodes
  - Expected: hash of secrets currently in the cluster
  - Drift is detected when Current != Expected
- Deployment processing updates tracking data per node with secret hashes, skipping stale deployments (hash != cluster hash)
- Drift detection runs after each reconciliation, comparing cluster secrets against the tracking ConfigMap, using the APIReader to bypass the cache
- Status field SecretDeployment reports:
  - UpdatedNodes: count of nodes on current secret versions
  - AllNodesUpdated: whether all nodes have current versions
  - ConfigMapName, TotalNodes, LastUpdateTime
- APIReader field added to the reconciler to read directly from the Kubernetes API, bypassing the controller-runtime cache for accurate drift detection

This enables safe credential deletion only when all nodes across all nodesets sharing the credentials have been updated.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
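The per-node bookkeeping the commit message describes could be modeled roughly as below. This is an in-memory sketch with illustrative names; the real code serializes this structure as JSON into the `<nodeset-name>-secret-tracking` ConfigMap and reads secrets through the APIReader:

```go
package main

import (
	"fmt"
	"time"
)

// NodeRecord mirrors one node entry in the tracking ConfigMap.
type NodeRecord struct {
	SecretHash     string
	DeploymentName string
	LastUpdated    time.Time
}

// Tracking mirrors the per-secret tracking blob.
type Tracking struct {
	CurrentHash  string
	ExpectedHash string
	Nodes        map[string]NodeRecord
}

// RecordDeployment updates one node after a deployment, skipping stale
// deployments whose secret hash no longer matches the cluster hash.
func (t *Tracking) RecordDeployment(node, deployment, deployedHash, clusterHash string) {
	if deployedHash != clusterHash {
		return // stale deployment: ignore, do not regress tracking data
	}
	t.ExpectedHash = clusterHash
	t.Nodes[node] = NodeRecord{
		SecretHash:     deployedHash,
		DeploymentName: deployment,
		LastUpdated:    time.Now(),
	}
	// CurrentHash only advances once every node carries the new hash.
	for _, rec := range t.Nodes {
		if rec.SecretHash != clusterHash {
			return
		}
	}
	t.CurrentHash = clusterHash
}

func main() {
	t := &Tracking{CurrentHash: "n5b4h", ExpectedHash: "n5b4h", Nodes: map[string]NodeRecord{
		"compute-0": {SecretHash: "n5b4h"},
		"compute-1": {SecretHash: "n5b4h"},
	}}
	t.RecordDeployment("compute-0", "edpm-deployment-c0-limit", "n656h", "n656h")
	fmt.Println(t.CurrentHash, t.ExpectedHash) // n5b4h n656h: partial update
	t.RecordDeployment("compute-1", "edpm-deployment-full", "n656h", "n656h")
	fmt.Println(t.CurrentHash, t.ExpectedHash) // n656h n656h: all nodes updated
}
```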
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: lmiccini.
### Stage 1: Initial State - All Nodes on Old Credentials

**Secret (cluster state)**

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: nova-cell1-compute-config
  resourceVersion: "12345"
data:
  transport_url: "rabbit://user7:pass@rabbitmq:5672/"  # Old credentials
```

**ConfigMap (tracking state)**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n5b4h...",
      "expectedHash": "n5b4h...",
      "nodes": {
        "compute-0": {
          "secretHash": "n5b4h...",
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        },
        "compute-1": {
          "secretHash": "n5b4h...",
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        }
      }
    }
```

**NodeSet Status**

```yaml
status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 2        # ✓ All nodes on current version
    allNodesUpdated: true  # ✓ Safe to delete old credentials (if they existed)
    lastUpdateTime: "2026-02-20T10:00:00Z"
```

State: All nodes running with user7, no drift, system stable.

### Stage 2: Credential Rotation - Cluster Secret Changes

Administrator rotates RabbitMQ credentials by updating the openstackcontrolplane, switching cell1 to use a different user.

**Secret (cluster state) - CHANGED**

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: nova-cell1-compute-config
  resourceVersion: "67890"  # ← Changed
data:
  transport_url: "rabbit://user8:pass@rabbitmq:5672/"  # ← New credentials
```

**ConfigMap (tracking state) - UNCHANGED**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n5b4h...",    # Still old hash
      "expectedHash": "n5b4h...",   # Still old hash
      "nodes": {
        "compute-0": {
          "secretHash": "n5b4h...", # Still old hash
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        },
        "compute-1": {
          "secretHash": "n5b4h...", # Still old hash
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        }
      }
    }
```

**NodeSet Status - DRIFT DETECTED**

```yaml
status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 0          # ← Changed: drift detected, reset to 0
    allNodesUpdated: false   # ← Changed: drift exists
    lastUpdateTime: "2026-02-23T11:36:46Z"  # ← Updated by drift detection
```

State: Drift detected! Nodes still have user7, but the cluster expects user8.

### Stage 3: Partial Deployment - Update compute-0 Only

Administrator creates a deployment with `ansibleLimit`.

**Deployment**

```yaml
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-c0-limit
spec:
  nodeSets:
  - openstack-edpm-ipam
  ansibleLimit: compute-0  # Only this node
```

After deployment completes:

**Secret (cluster state) - UNCHANGED**

```yaml
data:
  transport_url: "rabbit://user8:pass@rabbitmq:5672/"  # Still user8
```

**ConfigMap (tracking state) - PARTIALLY UPDATED**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n5b4h...",    # ← NOT updated (compute-1 still on n5b4h)
      "expectedHash": "n656h...",   # ← Updated to cluster hash
      "nodes": {
        "compute-0": {
          "secretHash": "n656h...", # ← Updated to user8
          "deploymentName": "edpm-deployment-c0-limit",
          "lastUpdated": "2026-02-23T12:00:00Z"
        },
        "compute-1": {
          "secretHash": "n5b4h...", # ← Still on user7
          "deploymentName": "edpm-deployment-initial",
          "lastUpdated": "2026-02-20T10:00:00Z"
        }
      }
    }
```

**NodeSet Status - PARTIAL UPDATE**

```yaml
status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 1          # Only 1 of 2 nodes updated
    allNodesUpdated: false   # ← Still false
    lastUpdateTime: "2026-02-23T12:00:00Z"
```

State: compute-0 now has user8, compute-1 still has user7.

### Stage 4: Full Deployment - Update All Remaining Nodes

Administrator deploys to all nodes (or remaining nodes).

**Deployment**

```yaml
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-full
spec:
  nodeSets:
  - openstack-edpm-ipam
  # No ansibleLimit - all nodes
```

After deployment completes:

**Secret (cluster state) - UNCHANGED**

```yaml
data:
  transport_url: "rabbit://user8:pass@rabbitmq:5672/"  # Still user8
```

**ConfigMap (tracking state) - FULLY UPDATED**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-edpm-ipam-secret-tracking
data:
  nova-cell1-compute-config: |
    {
      "currentHash": "n656h...",    # ← Updated: all nodes on n656h
      "expectedHash": "n656h...",   # Matches cluster
      "nodes": {
        "compute-0": {
          "secretHash": "n656h...", # user8
          "deploymentName": "edpm-deployment-full",
          "lastUpdated": "2026-02-23T13:00:00Z"
        },
        "compute-1": {
          "secretHash": "n656h...", # ← Updated to user8
          "deploymentName": "edpm-deployment-full",
          "lastUpdated": "2026-02-23T13:00:00Z"
        }
      }
    }
```

**NodeSet Status - ALL UPDATED**

```yaml
status:
  secretDeployment:
    configMapName: openstack-edpm-ipam-secret-tracking
    totalNodes: 2
    updatedNodes: 2        # ← All nodes updated
    allNodesUpdated: true  # ← Safe to proceed!
    lastUpdateTime: "2026-02-23T13:00:00Z"
```

State: All nodes now have user8, no drift.

### Stage 5: Multiple NodeSets Scenario

What if multiple NodeSets share the same credentials?

**Setup**: two NodeSets (compute and storage, two nodes each) consume the same secret.

**After Partial Deployment (compute NodeSet only)**

Compute NodeSet Status:

```yaml
status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 2
    allNodesUpdated: true   # ✓ Compute NodeSet is done
```

Storage NodeSet Status:

```yaml
status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 0
    allNodesUpdated: false  # ✗ Storage NodeSet not updated
```

Credential Status: ✗ NOT SAFE - the storage NodeSet nodes are still on user7.

**After Deploying Both NodeSets**

Compute NodeSet Status:

```yaml
status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 2
    allNodesUpdated: true   # ✓
```

Storage NodeSet Status:

```yaml
status:
  secretDeployment:
    totalNodes: 2
    updatedNodes: 2
    allNodesUpdated: true   # ✓
```

Credential Status: ✓ SAFE - All 4 nodes across both NodeSets updated. Now safe to delete user7!

### Key Observations

**currentHash vs expectedHash**: `expectedHash` follows the secret currently in the cluster, while `currentHash` only advances once every node in the NodeSet carries that hash; drift exists whenever the two differ.

**Drift Detection Logic**: after each reconciliation the cluster secrets (read via the APIReader to bypass the cache) are compared against the tracking ConfigMap, and `updatedNodes`/`allNodesUpdated` are recomputed.

**Credential Deletion Safety**: old credentials can ONLY be deleted when all nodes across all NodeSets sharing the credentials have been updated to the current secret version.

**Stale Deployment Handling**: if a deployment ran with a secret hash that no longer matches the cluster hash, it is treated as stale and skipped when updating the tracking data.
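The deletion-safety rule can be put in code. This is a sketch with illustrative names mirroring the `secretDeployment` status fields, not the actual controller code:

```go
package main

import "fmt"

// SecretDeploymentStatus mirrors the status.secretDeployment block.
type SecretDeploymentStatus struct {
	TotalNodes      int
	UpdatedNodes    int
	AllNodesUpdated bool
}

// computeStatus derives the status numbers from per-node secret hashes,
// as the drift detection pass does after each reconciliation.
func computeStatus(clusterHash string, nodeHashes map[string]string) SecretDeploymentStatus {
	s := SecretDeploymentStatus{TotalNodes: len(nodeHashes)}
	for _, h := range nodeHashes {
		if h == clusterHash {
			s.UpdatedNodes++
		}
	}
	s.AllNodesUpdated = s.UpdatedNodes == s.TotalNodes
	return s
}

// safeToDeleteOldCredentials is true only when every NodeSet sharing the
// credentials reports allNodesUpdated.
func safeToDeleteOldCredentials(nodeSets []SecretDeploymentStatus) bool {
	for _, s := range nodeSets {
		if !s.AllNodesUpdated {
			return false
		}
	}
	return true
}

func main() {
	compute := computeStatus("n656h", map[string]string{"compute-0": "n656h", "compute-1": "n656h"})
	storage := computeStatus("n656h", map[string]string{"storage-0": "n5b4h", "storage-1": "n5b4h"})
	fmt.Println(safeToDeleteOldCredentials([]SecretDeploymentStatus{compute, storage})) // false

	storage = computeStatus("n656h", map[string]string{"storage-0": "n656h", "storage-1": "n656h"})
	fmt.Println(safeToDeleteOldCredentials([]SecretDeploymentStatus{compute, storage})) // true
}
```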
This new approach can be used directly by openstack-operator to set and remove finalizers on rabbitmq users, or by infra-operator to read the nodeset status and do the finalizer management (openstack-k8s-operators/infra-operator@main...lmiccini:infra-operator:track_dataplaneusers).
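A sketch of the finalizer-management side (the finalizer name and function are hypothetical; the real version would manipulate `metadata.finalizers` on the RabbitMQ user object through controller-runtime):

```go
package main

import "fmt"

// finalizer is an illustrative name, not one used by the operators.
const finalizer = "openstack.org/rabbitmq-user-in-use"

// reconcileUserFinalizers adds the finalizer while any nodeset still reports
// the user's secret as not fully rolled out, and removes it once all nodesets
// are updated, allowing the user object to be deleted.
func reconcileUserFinalizers(finalizers []string, allNodeSetsUpdated bool) []string {
	has := false
	out := finalizers[:0:0]
	for _, f := range finalizers {
		if f == finalizer {
			has = true
			if allNodeSetsUpdated {
				continue // drop finalizer: user is no longer in use anywhere
			}
		}
		out = append(out, f)
	}
	if !has && !allNodeSetsUpdated {
		out = append(out, finalizer) // protect the in-use user from deletion
	}
	return out
}

func main() {
	fmt.Println(reconcileUserFinalizers(nil, false))                // finalizer added
	fmt.Println(reconcileUserFinalizers([]string{finalizer}, true)) // finalizer removed
}
```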
@lmiccini: The following tests failed.