Skip to content

docs (k8s): Document rolling restart impact during Helm to Operator migration#1625

Open
david-yu wants to merge 4 commits intoredpanda-data:mainfrom
david-yu:doc/helm-to-operator-rolling-restart-impact
Open

docs (k8s): Document rolling restart impact during Helm to Operator migration#1625
david-yu wants to merge 4 commits intoredpanda-data:mainfrom
david-yu:doc/helm-to-operator-rolling-restart-impact

Conversation

@david-yu
Copy link
Contributor

Summary

  • Adds a new "Expected impact: rolling restart of broker pods" section to the Helm-to-Operator migration guide
  • Explains why pods restart (resource adoption, StatefulSet re-render, ControllerRevision mismatch)
  • Documents impact by cluster configuration (3+ brokers RF>=3, RF=1, single broker)
  • Adds recommended producer settings (acks=all, retries, idempotence) for zero-downtime migration
  • Clarifies that Redpanda uses Raft consensus (not Kafka ISR), so there is no min.insync.replicas to configure

Context

Customers migrating from Helm to Operator were surprised that broker pods restart during migration, even when clusterSpec values match existing Helm values and updateStrategy.type: OnDelete is set. The current docs don't warn about this. The restart is unavoidable because the operator re-renders the StatefulSet from its own templates, creating a new ControllerRevision that triggers pod rolling.

Test plan

  • Verify AsciiDoc renders correctly in the docs site preview
  • Confirm technical accuracy of the rolling restart sequence against operator source code
  • Review producer settings recommendations

🤖 Generated with Claude Code

When migrating from Helm to the Redpanda Operator, all broker pods are
restarted sequentially even if clusterSpec values match existing Helm
values. This adds a new section explaining why the restart happens,
impact by cluster configuration, and recommended producer settings
to avoid message loss during migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@david-yu david-yu requested a review from a team as a code owner March 20, 2026 16:15
@netlify
Copy link

netlify bot commented Mar 20, 2026

Deploy Preview for redpanda-docs-preview ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit f46c77b
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-docs-preview/deploys/69bd7a153b8e980008a97203
😎 Deploy Preview https://deploy-preview-1625--redpanda-docs-preview.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 725a73c3-66bf-492a-a37c-16de963c46c0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This change adds an "Expected impact" section to the Helm-to-operator migration documentation that explains the rolling restart behavior triggered when applying a Redpanda custom resource. The section details the adoption mechanism via server-side apply, StatefulSet spec re-rendering effects, pod deletion sequence with health checks, and timing considerations. It provides scenario-specific guidance based on cluster topology (broker count and replication factor) and recommends producer configuration adjustments for operational resilience during the transition.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested reviewers

  • JakeSCahill
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description provides context, motivation, and test plan items, but is missing required template sections: JIRA ticket link, review deadline, and page preview URL. Add the missing template sections: Resolves link with JIRA ticket number, review deadline, and Netlify page preview URL for the modified documentation page.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: documenting rolling restart impact during Helm to Operator migration, which matches the PR's core objective.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

- Fix passive voice ("is flagged" → "the operator flags")
- Fix nested list continuation under ordered list (add +/-- block)
- Use active voice consistently in list items
- Use "Apache Kafka" on first reference per terminology standards
- Add serial commas where missing
- Ensure parallel structure in list items

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
modules/migrate/pages/kubernetes/helm-to-operator.adoc (2)

109-117: Consider mentioning write caching as a tuning alternative.

The current recommendations prioritize maximum durability with acks=all. For users who prioritize lower latency during migration and can tolerate some risk, write caching (documented in modules/develop/pages/config-topics.adoc#configure-write-caching) offers a relaxed mode of acks=all that acknowledges on majority without waiting for fsync.

This is optional—the current conservative recommendation is appropriate for most migration scenarios. However, mentioning write caching would give advanced users awareness of tuning options for their specific durability/performance trade-offs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc` around lines 109 -
117, Add a brief note after the "Recommended producer settings" block mentioning
write caching as an optional, lower-latency tuning alternative for advanced
users: reference the existing settings `acks=all`, `retries`, and
`enable.idempotence=true`, explain that write caching relaxes fsync waits while
still acknowledging to a majority (link to the existing docs at
modules/develop/pages/config-topics.adoc#configure-write-caching), and state it
is only recommended when users can tolerate reduced durability during migration.

93-108: Well-structured impact scenarios.

The table clearly differentiates impact based on broker count and replication factor. The explanation of Raft quorum semantics (2 of 3 brokers for RF=3) correctly justifies why writes continue during individual broker restarts.

Consider using the glossterm macro for domain-specific terms like "replication factor", "quorum", and "Raft consensus" for consistency with the repository's glossary usage. Based on learnings, the syntax would be glossterm:replication-factor[] or glossterm:raft-consensus[].

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc` around lines 93 -
108, Update the document to wrap domain-specific terms with the repository
glossary macro: replace occurrences of "replication factor", "quorum", and "Raft
consensus" in the "Impact by cluster configuration" section (the table content
in helm-to-operator.adoc) with the glossterm macro forms (e.g.,
glossterm:replication-factor[], glossterm:quorum[], glossterm:raft-consensus[])
to ensure consistent glossary linking and styling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc`:
- Around line 109-117: Replace the duplicated producer configuration block (the
recommendations for `acks=all`, `retries`, and `enable.idempotence=true` and the
NOTE about Raft) with a single xref to the authoritative producer guide
`modules/develop/pages/produce-data/configure-producers.adoc` using
empty-bracket xref syntax
`xref:modules/develop/pages/produce-data/configure-producers.adoc[]`; in
addition, keep a short migration-specific callout that explicitly highlights two
critical bits of context from the guide: that `enable.idempotence=true` must be
used together with `acks=all` to guarantee idempotent behavior, and that high
`retries` can interact with `max.in.flight.requests.per.connection` to affect
ordering (point readers to the corresponding sections in the referenced doc for
full details).

---

Nitpick comments:
In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc`:
- Around line 109-117: Add a brief note after the "Recommended producer
settings" block mentioning write caching as an optional, lower-latency tuning
alternative for advanced users: reference the existing settings `acks=all`,
`retries`, and `enable.idempotence=true`, explain that write caching relaxes
fsync waits while still acknowledging to a majority (link to the existing docs
at modules/develop/pages/config-topics.adoc#configure-write-caching), and state
it is only recommended when users can tolerate reduced durability during
migration.
- Around line 93-108: Update the document to wrap domain-specific terms with the
repository glossary macro: replace occurrences of "replication factor",
"quorum", and "Raft consensus" in the "Impact by cluster configuration" section
(the table content in helm-to-operator.adoc) with the glossterm macro forms
(e.g., glossterm:replication-factor[], glossterm:quorum[],
glossterm:raft-consensus[]) to ensure consistent glossary linking and styling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 448631bc-9409-4fdb-918b-d99cd8987722

📥 Commits

Reviewing files that changed from the base of the PR and between 4230c3f and 43ab2e4.

📒 Files selected for processing (1)
  • modules/migrate/pages/kubernetes/helm-to-operator.adoc

Comment on lines +109 to +117
=== Recommended producer settings

To avoid message loss during the rolling restart, configure producers with:

- `acks=all` (or `-1`): Waits for the write to be committed by the Raft quorum (majority of replicas).
- `retries`: Set to a high value to handle `NOT_LEADER_FOR_PARTITION` errors during leader elections.
- `enable.idempotence=true`: Prevents duplicate messages from retries.

NOTE: Redpanda uses Raft-based replication, not Kafka's in-sync replica (ISR) mechanism. With Raft, writes require a majority quorum (2 of 3 for RF=3). There is no `min.insync.replicas` setting to configure. The replication factor alone determines fault tolerance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Replace inline producer guidance with xref to comprehensive guide.

This section duplicates content from the existing comprehensive producer configuration guide at modules/develop/pages/produce-data/configure-producers.adoc. Inline duplication creates maintenance burden and risks inconsistency.

Additionally, the recommendations are missing critical context:

  • Line 115 suggests enable.idempotence=true but doesn't mention that true idempotent behavior requires pairing with acks=all. The comprehensive guide (configure-producers.adoc:138-143) explicitly states: "To guarantee true idempotent behavior, you must also set acks=all."
  • Line 114 suggests high retries without mentioning ordering risks related to max.in.flight.requests.per.connection, which is documented in configure-producers.adoc:91-116.
♻️ Proposed refactor to reference comprehensive producer guide

Replace lines 109-117 with a focused reference to the authoritative guide:

 === Recommended producer settings
 
-To avoid message loss during the rolling restart, configure producers with:
-
-- `acks=all` (or `-1`): Waits for the write to be committed by the Raft quorum (majority of replicas).
-- `retries`: Set to a high value to handle `NOT_LEADER_FOR_PARTITION` errors during leader elections.
-- `enable.idempotence=true`: Prevents duplicate messages from retries.
-
-NOTE: Redpanda uses Raft-based replication, not Kafka's in-sync replica (ISR) mechanism. With Raft, writes require a majority quorum (2 of 3 for RF=3). There is no `min.insync.replicas` setting to configure. The replication factor alone determines fault tolerance.
+To avoid message loss during the rolling restart, configure producers for durability and resilience. See xref:develop:produce-data/configure-producers.adoc[] for comprehensive guidance. Key settings include:
+
+- xref:develop:produce-data/configure-producers.adoc#acksall[`acks=all`] to ensure writes are committed by the Raft quorum (majority of replicas)
+- xref:develop:produce-data/configure-producers.adoc#retries[`retries`] set to a high value to handle transient errors during leader elections
+- xref:develop:produce-data/configure-producers.adoc#enableidempotence[`enable.idempotence=true`] paired with `acks=all` to prevent duplicate messages
+
+NOTE: Redpanda uses Raft-based replication, not Kafka's in-sync replica (ISR) mechanism. With Raft, writes require a majority quorum (2 of 3 for RF=3). The replication factor alone determines fault tolerance.

This approach maintains migration-specific context while linking to the authoritative source for configuration details. Based on learnings, xref links should use empty brackets [] to automatically pull the target document's title when publishing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modules/migrate/pages/kubernetes/helm-to-operator.adoc` around lines 109 -
117, Replace the duplicated producer configuration block (the recommendations
for `acks=all`, `retries`, and `enable.idempotence=true` and the NOTE about
Raft) with a single xref to the authoritative producer guide
`modules/develop/pages/produce-data/configure-producers.adoc` using
empty-bracket xref syntax
`xref:modules/develop/pages/produce-data/configure-producers.adoc[]`; in
addition, keep a short migration-specific callout that explicitly highlights two
critical bits of context from the guide: that `enable.idempotence=true` must be
used together with `acks=all` to guarantee idempotent behavior, and that high
`retries` can interact with `max.in.flight.requests.per.connection` to affect
ordering (point readers to the corresponding sections in the referenced doc for
full details).

- Rename "Why pods restart" to "How the rolling restart works"
- Remove NOTE about Raft vs ISR comparison as it is not needed here

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@david-yu david-yu changed the title docs: Document rolling restart impact during Helm to Operator migration docs (k8s): Document rolling restart impact during Helm to Operator migration Mar 20, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant