Skip to content

fix(controllers): replace a persistently crash-looping member#336

Open
Andrey Kolkov (androndo) wants to merge 1 commit into
mainfrom
fix/fail-pvc-on-member
Open

fix(controllers): replace a persistently crash-looping member#336
Andrey Kolkov (androndo) wants to merge 1 commit into
mainfrom
fix/fail-pvc-on-member

Conversation

@androndo

@androndo Andrey Kolkov (androndo) commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

A non-bootstrap PVC member whose etcd cannot start — classically because its data dir was lost while the cluster membership moved on, leaving its frozen --initial-cluster stale (etcd refuses to boot: "error validating peerURLs ... member count is unequal") — crash-loops forever with no recovery path. The existing self-heal covers only memory-medium members (pod lost => data lost); for PVC members the PVC survives pod restarts, so "pod lost" != "data lost" and the stale --initial-cluster is never escaped.

Detect such a member (etcd container not ready and restarted past a threshold, excluding OOMKilled) and delete it for replacement — but only when the rest of the cluster still has quorum, so a cluster-wide outage never cascades into mass deletion (the finalizer's MemberRemove is quorum-gated too). The cluster controller then gap-fills a fresh member with a current --initial-cluster.

Also extend the Kamaji e2e to wait for readyMembers==3, then churn all three members one at a time and re-verify the tenant API still roundtrips through the fully replaced (hash-named) member set — guarding that member naming/replacement stays transparent to a Kamaji DataStore (stable -client Service + wildcard server-cert SAN).

Summary by CodeRabbit

  • Bug Fixes
    • Implemented automatic detection and recovery for persistently crashing etcd members, automatically replacing them when the cluster can safely maintain quorum.

A non-bootstrap PVC member whose etcd cannot start — classically because its
data dir was lost while the cluster membership moved on, leaving its frozen
--initial-cluster stale (etcd refuses to boot: "error validating peerURLs ...
member count is unequal") — crash-loops forever with no recovery path. The
existing self-heal covers only memory-medium members (pod lost => data lost);
for PVC members the PVC survives pod restarts, so "pod lost" != "data lost"
and the stale --initial-cluster is never escaped.

Detect such a member (etcd container not ready and restarted past a threshold,
excluding OOMKilled) and delete it for replacement — but only when the rest of
the cluster still has quorum, so a cluster-wide outage never cascades into mass
deletion (the finalizer's MemberRemove is quorum-gated too). The cluster
controller then gap-fills a fresh member with a current --initial-cluster.

Also extend the Kamaji e2e to wait for readyMembers==3, then churn all three
members one at a time and re-verify the tenant API still roundtrips through the
fully replaced (hash-named) member set — guarding that member naming/replacement
stays transparent to a Kamaji DataStore (stable <cluster>-client Service +
wildcard server-cert SAN).

Signed-off-by: Andrey Kolkov <androndo@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds self-healing for etcd members stuck in persistent crash loops. The controller gains etcdContainerStuck and clusterHasQuorumWithout helpers; updateStatus now deletes non-bootstrap EtcdMember CRs when the etcd container is persistently failing and the remaining cluster retains quorum. Unit tests and an e2e member-churn scenario are included.

Changes

Crash-loop self-heal for EtcdMember

Layer / File(s) Summary
Self-heal helpers and updateStatus integration
controllers/etcdmember_controller.go
Introduces dataLossRestartThreshold, etcdContainerStuck (detects stuck non-OOMKilled etcd container), and clusterHasQuorumWithout (quorum check excluding the current member). updateStatus deletes a non-bootstrap PVC-backed EtcdMember when both conditions hold.
Unit tests for stuck detection and updateStatus
controllers/etcdmember_controller_test.go
Adds TestEtcdContainerStuck for threshold/OOMKilled cases, crashLoopPod/clusterWithReady test helpers, and three updateStatus scenario tests: replace-on-quorum, preserve-without-quorum, preserve-bootstrap.
E2E member churn test
test/e2e/kamaji_datastore_test.go
Extends TestKamajiDataStore with a sequential member-deletion churn phase, waiting for each member to be gone and the cluster to recover to 3 ready members, then asserting no original names remain and tenant etcd writes survive. Adds memberNames helper.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 A bunny watched the etcd loop,
Crash, restart, crash — oh what a droop!
"Enough!" cried the controller with a hop,
"Quorum's safe, so let this member drop!"
New pod springs up, fresh from the ground,
And all the cluster data is found. 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main objective of the PR—introducing self-healing logic to replace persistently crash-looping etcd members.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/fail-pvc-on-member

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces self-healing capabilities for unrecoverable etcd members by automatically deleting and replacing persistently crash-looping members, provided the rest of the cluster maintains quorum. It includes helper functions to detect stuck containers and verify quorum, along with comprehensive unit and E2E tests. Feedback on the changes suggests checking the Pod's deletion timestamp and current container state to avoid false-positive self-healing triggers during graceful termination or active OOMKilled events. Additionally, it is recommended to account for stale status by decrementing the ready member count if the member under evaluation is still marked as ready, preventing potential race conditions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +917 to +931
func etcdContainerStuck(pod *corev1.Pod) bool {
for _, cs := range pod.Status.ContainerStatuses {
if cs.Name != "etcd" {
continue
}
if cs.Ready || cs.RestartCount < dataLossRestartThreshold {
return false
}
if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
return false
}
return true
}
return false
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Issue 1: Missing check for Pod deletion state

If a Pod is currently being deleted (e.g., due to a manual restart, node drain, or eviction), its DeletionTimestamp is set, and its containers will eventually terminate. If we evaluate it as "stuck" during this graceful termination window, we might prematurely delete the EtcdMember CR and trigger an expensive member replacement instead of letting the Pod naturally restart or reschedule.

Issue 2: Missing check for current OOMKilled state

The current implementation only checks cs.LastTerminationState.Terminated for the "OOMKilled" reason. However, if the container has just been OOMKilled and is currently in the Terminated state (but has not yet transitioned to Waiting / CrashLoopBackOff), the "OOMKilled" reason will be in cs.State.Terminated instead. Checking both states ensures we don't falsely trigger self-healing for resource-limit issues.

func etcdContainerStuck(pod *corev1.Pod) bool {
	if pod.DeletionTimestamp != nil {
		return false
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "etcd" {
			continue
		}
		if cs.Ready || cs.RestartCount < dataLossRestartThreshold {
			return false
		}
		if t := cs.State.Terminated; t != nil && t.Reason == "OOMKilled" {
			return false
		}
		if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
			return false
		}
		return true
	}
	return false
}

Comment on lines +939 to +955
func (r *EtcdMemberReconciler) clusterHasQuorumWithout(ctx context.Context, member *lll.EtcdMember) bool {
cluster, err := r.clusterFor(ctx, member)
if err != nil {
return false
}
desired := 0
if cluster.Status.Observed != nil {
desired = int(cluster.Status.Observed.Replicas)
}
if desired == 0 && cluster.Spec.Replicas != nil {
desired = int(*cluster.Spec.Replicas)
}
if desired == 0 {
return false
}
return int(cluster.Status.ReadyMembers) >= desired/2+1
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Issue: Potential race condition with stale ReadyMembers count

cluster.Status.ReadyMembers is updated asynchronously by the cluster controller. If this member was previously healthy (and thus counted as ready in ReadyMembers), but has recently started crash-looping, cluster.Status.ReadyMembers might still include it if the cluster controller hasn't reconciled yet.

If another member is also down, ReadyMembers might incorrectly indicate that we still have quorum to perform a deletion, when in reality we do not. To prevent cascading outages during multi-member failures, we should decrement the ready count if this member's CR status still lists it as Ready.

func (r *EtcdMemberReconciler) clusterHasQuorumWithout(ctx context.Context, member *lll.EtcdMember) bool {
	cluster, err := r.clusterFor(ctx, member)
	if err != nil {
		return false
	}
	desired := 0
	if cluster.Status.Observed != nil {
		desired = int(cluster.Status.Observed.Replicas)
	}
	if desired == 0 && cluster.Spec.Replicas != nil {
		desired = int(*cluster.Spec.Replicas)
	}
	if desired == 0 {
		return false
	}
	readyMembers := int(cluster.Status.ReadyMembers)
	for _, cond := range member.Status.Conditions {
		if cond.Type == lll.MemberReady && cond.Status == metav1.ConditionTrue {
			readyMembers--
			break
		}
	}
	return readyMembers >= desired/2+1
}

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/kamaji_datastore_test.go`:
- Around line 181-183: The memberNames helper function calls t.Fatalf internally
when encountering List errors, which causes the test to fail immediately when
invoked inside the waitFor callback, bypassing the retry logic. To fix this,
create a variant of memberNames that returns an error value instead of calling
t.Fatalf, or inline the list call directly within the waitFor callback body to
handle errors gracefully and allow waitFor to retry on transient API failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 76cd05b0-e161-44f0-8a10-4521832db9dd

📥 Commits

Reviewing files that changed from the base of the PR and between 4248ade and b54a52f.

📒 Files selected for processing (3)
  • controllers/etcdmember_controller.go
  • controllers/etcdmember_controller_test.go
  • test/e2e/kamaji_datastore_test.go

Comment on lines +181 to +183
if names := memberNames(ctx, t); len(names) != 3 {
return fmt.Errorf("have %d members, want 3: %v", len(names), names)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

memberNames calls t.Fatalf inside waitFor callback, bypassing retry logic.

The memberNames helper (line 406) calls t.Fatalf on List errors. When invoked inside the waitFor callback, a transient API error will immediately fail the test instead of allowing waitFor to retry. Consider using a local error-returning variant or inline the list call with error return.

Proposed inline fix
-			if names := memberNames(ctx, t); len(names) != 3 {
-				return fmt.Errorf("have %d members, want 3: %v", len(names), names)
-			}
+			list := &etcdv1alpha2.EtcdMemberList{}
+			if err := kube.List(ctx, list, client.InNamespace(e2eNamespace),
+				client.MatchingLabels{"etcd-operator.cozystack.io/cluster": clusterName}); err != nil {
+				return err
+			}
+			if len(list.Items) != 3 {
+				return fmt.Errorf("have %d members, want 3", len(list.Items))
+			}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if names := memberNames(ctx, t); len(names) != 3 {
return fmt.Errorf("have %d members, want 3: %v", len(names), names)
}
list := &etcdv1alpha2.EtcdMemberList{}
if err := kube.List(ctx, list, client.InNamespace(e2eNamespace),
client.MatchingLabels{"etcd-operator.cozystack.io/cluster": clusterName}); err != nil {
return err
}
if len(list.Items) != 3 {
return fmt.Errorf("have %d members, want 3", len(list.Items))
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/kamaji_datastore_test.go` around lines 181 - 183, The memberNames
helper function calls t.Fatalf internally when encountering List errors, which
causes the test to fail immediately when invoked inside the waitFor callback,
bypassing the retry logic. To fix this, create a variant of memberNames that
returns an error value instead of calling t.Fatalf, or inline the list call
directly within the waitFor callback body to handle errors gracefully and allow
waitFor to retry on transient API failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant