docs (k8s): Document rolling restart impact during Helm to Operator migration #1625
david-yu wants to merge 4 commits into redpanda-data:main
Conversation
When migrating from Helm to the Redpanda Operator, all broker pods are restarted sequentially even if `clusterSpec` values match existing Helm values. This adds a new section explaining why the restart happens, the impact by cluster configuration, and recommended producer settings to avoid message loss during migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

This change adds an "Expected impact" section to the Helm-to-operator migration documentation that explains the rolling restart behavior triggered when applying a Redpanda custom resource. The section details the adoption mechanism via server-side apply, the effects of re-rendering the StatefulSet spec, the pod deletion sequence with health checks, and timing considerations. It provides scenario-specific guidance based on cluster topology (broker count and replication factor) and recommends producer configuration adjustments for operational resilience during the transition.
- Fix passive voice ("is flagged" → "the operator flags")
- Fix nested list continuation under ordered list (add +/-- block)
- Use active voice consistently in list items
- Use "Apache Kafka" on first reference per terminology standards
- Add serial commas where missing
- Ensure parallel structure in list items
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🧹 Nitpick comments (2)
modules/migrate/pages/kubernetes/helm-to-operator.adoc (2)
109-117: Consider mentioning write caching as a tuning alternative.

The current recommendations prioritize maximum durability with `acks=all`. For users who prioritize lower latency during migration and can tolerate some risk, write caching (documented in `modules/develop/pages/config-topics.adoc#configure-write-caching`) offers a relaxed mode of `acks=all` that acknowledges on a majority without waiting for fsync.

This is optional: the current conservative recommendation is appropriate for most migration scenarios. However, mentioning write caching would give advanced users awareness of tuning options for their specific durability/performance trade-offs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc` around lines 109 - 117, Add a brief note after the "Recommended producer settings" block mentioning write caching as an optional, lower-latency tuning alternative for advanced users: reference the existing settings `acks=all`, `retries`, and `enable.idempotence=true`, explain that write caching relaxes fsync waits while still acknowledging to a majority (link to the existing docs at modules/develop/pages/config-topics.adoc#configure-write-caching), and state it is only recommended when users can tolerate reduced durability during migration.
93-108: Well-structured impact scenarios.

The table clearly differentiates impact based on broker count and replication factor. The explanation of Raft quorum semantics (2 of 3 brokers for RF=3) correctly justifies why writes continue during individual broker restarts.

Consider using the `glossterm` macro for domain-specific terms like "replication factor", "quorum", and "Raft consensus" for consistency with the repository's glossary usage. Based on learnings, the syntax would be `glossterm:replication-factor[]` or `glossterm:raft-consensus[]`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc` around lines 93 - 108, Update the document to wrap domain-specific terms with the repository glossary macro: replace occurrences of "replication factor", "quorum", and "Raft consensus" in the "Impact by cluster configuration" section (the table content in helm-to-operator.adoc) with the glossterm macro forms (e.g., glossterm:replication-factor[], glossterm:quorum[], glossterm:raft-consensus[]) to ensure consistent glossary linking and styling.
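The Raft majority arithmetic behind the "2 of 3 brokers for RF=3" claim can be sketched in a few lines of Python. `raft_fault_tolerance` is a hypothetical helper for illustration only, not part of any Redpanda codebase:

```python
def raft_fault_tolerance(replication_factor: int) -> int:
    """Return how many brokers can be down while writes still succeed.

    Raft requires a strict majority of replicas to acknowledge a write,
    so a replication factor of n tolerates n - (n // 2 + 1) failures.
    """
    quorum = replication_factor // 2 + 1
    return replication_factor - quorum

# RF=3: quorum of 2, so one broker can restart without blocking writes.
print(raft_fault_tolerance(3))  # → 1
# RF=5: quorum of 3, so two brokers can be down simultaneously.
print(raft_fault_tolerance(5))  # → 2
```

This is also why a single-broker cluster (RF=1) has no headroom: its quorum is the broker itself, so every restart in the rolling sequence blocks writes until the pod is healthy again.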
=== Recommended producer settings

To avoid message loss during the rolling restart, configure producers with:

- `acks=all` (or `-1`): Waits for the write to be committed by the Raft quorum (majority of replicas).
- `retries`: Set to a high value to handle `NOT_LEADER_FOR_PARTITION` errors during leader elections.
- `enable.idempotence=true`: Prevents duplicate messages from retries.

NOTE: Redpanda uses Raft-based replication, not Kafka's in-sync replica (ISR) mechanism. With Raft, writes require a majority quorum (2 of 3 for RF=3). There is no `min.insync.replicas` setting to configure. The replication factor alone determines fault tolerance.
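The three settings above translate directly into client configuration. A minimal sketch using librdkafka-style property names (as accepted by clients such as confluent-kafka; exact key names vary by client library, and the broker address is a placeholder):

```python
# Producer settings aimed at durability during a rolling restart.
# Property names follow librdkafka conventions; the Java client uses
# equivalent keys in ProducerConfig.
producer_config = {
    "bootstrap.servers": "redpanda-0.redpanda:9093",  # placeholder address
    "acks": "all",               # wait for the Raft majority to commit the write
    "retries": 2147483647,       # ride out NOT_LEADER_FOR_PARTITION during leader elections
    "enable.idempotence": True,  # retried sends cannot create duplicates
}
```

With a real client, these keys would be passed straight to the producer constructor (for example, `Producer(producer_config)` in confluent-kafka).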
🛠️ Refactor suggestion | 🟠 Major
Replace inline producer guidance with xref to comprehensive guide.
This section duplicates content from the existing comprehensive producer configuration guide at modules/develop/pages/produce-data/configure-producers.adoc. Inline duplication creates maintenance burden and risks inconsistency.
Additionally, the recommendations are missing critical context:

- Line 115 suggests `enable.idempotence=true` but doesn't mention that true idempotent behavior requires pairing with `acks=all`. The comprehensive guide (configure-producers.adoc:138-143) explicitly states: "To guarantee true idempotent behavior, you must also set `acks=all`."
- Line 114 suggests high retries without mentioning ordering risks related to `max.in.flight.requests.per.connection`, which is documented in configure-producers.adoc:91-116.
♻️ Proposed refactor to reference comprehensive producer guide
Replace lines 109-117 with a focused reference to the authoritative guide:
=== Recommended producer settings
-To avoid message loss during the rolling restart, configure producers with:
-
-- `acks=all` (or `-1`): Waits for the write to be committed by the Raft quorum (majority of replicas).
-- `retries`: Set to a high value to handle `NOT_LEADER_FOR_PARTITION` errors during leader elections.
-- `enable.idempotence=true`: Prevents duplicate messages from retries.
-
-NOTE: Redpanda uses Raft-based replication, not Kafka's in-sync replica (ISR) mechanism. With Raft, writes require a majority quorum (2 of 3 for RF=3). There is no `min.insync.replicas` setting to configure. The replication factor alone determines fault tolerance.
+To avoid message loss during the rolling restart, configure producers for durability and resilience. See xref:develop:produce-data/configure-producers.adoc[] for comprehensive guidance. Key settings include:
+
+- xref:develop:produce-data/configure-producers.adoc#acksall[`acks=all`] to ensure writes are committed by the Raft quorum (majority of replicas)
+- xref:develop:produce-data/configure-producers.adoc#retries[`retries`] set to a high value to handle transient errors during leader elections
+- xref:develop:produce-data/configure-producers.adoc#enableidempotence[`enable.idempotence=true`] paired with `acks=all` to prevent duplicate messages
+
+NOTE: Redpanda uses Raft-based replication, not Kafka's in-sync replica (ISR) mechanism. With Raft, writes require a majority quorum (2 of 3 for RF=3). The replication factor alone determines fault tolerance.

This approach maintains migration-specific context while linking to the authoritative source for configuration details. Based on learnings, xref links should use empty brackets `[]` to automatically pull the target document's title when publishing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc` around lines 109 -
117, Replace the duplicated producer configuration block (the recommendations
for `acks=all`, `retries`, and `enable.idempotence=true` and the NOTE about
Raft) with a single xref to the authoritative producer guide
`modules/develop/pages/produce-data/configure-producers.adoc` using
empty-bracket xref syntax
`xref:modules/develop/pages/produce-data/configure-producers.adoc[]`; in
addition, keep a short migration-specific callout that explicitly highlights two
critical bits of context from the guide: that `enable.idempotence=true` must be
used together with `acks=all` to guarantee idempotent behavior, and that high
`retries` can interact with `max.in.flight.requests.per.connection` to affect
ordering (point readers to the corresponding sections in the referenced doc for
full details).
- Rename "Why pods restart" to "How the rolling restart works"
- Remove NOTE about Raft vs ISR comparison as it is not needed here

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary

- Recommends producer settings (`acks=all`, `retries`, idempotence) for zero-downtime migration
- Notes that there is no `min.insync.replicas` to configure

Context

Customers migrating from Helm to Operator were surprised that broker pods restart during migration, even when `clusterSpec` values match existing Helm values and `updateStrategy.type: OnDelete` is set. The current docs don't warn about this. The restart is unavoidable because the operator re-renders the StatefulSet from its own templates, creating a new ControllerRevision that triggers pod rolling.

Test plan
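For reference, the `updateStrategy` value mentioned in the context above looks roughly like this in Helm values (key path assumed from the Redpanda Helm chart; as noted, it does not prevent the operator-triggered restart):

```yaml
statefulset:
  updateStrategy:
    type: OnDelete  # pods are only replaced when deleted, not on spec changes
```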
🤖 Generated with Claude Code