feat(planner): re-apply config.toml as part of buildNodeUpdatePlan by bdchatham · Pull Request #374 · sei-protocol/sei-k8s-controller

bdchatham · 2026-05-30T17:39:27Z

Problem

buildNodeUpdatePlan (the image-drift plan) progression was:

ApplyStatefulSet → ApplyService → ReplacePod → ObserveImage → MarkReady

No TaskConfigApply. Result: any change to fields the controller owns via commonOverrides — most notably Spec.ExternalAddress from the publishable-P2P feature — never reached the running pod's config.toml. Inbound NLB→pod worked; outbound advertising (NodeInfo / PEX) didn't, because seid had external-address = ''.

Caught during the arctic-1 canary rollout (platform#764–771). With AWS LBC + harbor SEI_NLB_TARGET_TYPE=instance now in place, the NLB path is fully working, but seid still doesn't tell other peers how to reach it.

Fix

Add TaskConfigApply + TaskConfigValidate to the update progression, between ApplyService and ReplacePod:

ApplyStatefulSet → ApplyService → TaskConfigApply → TaskConfigValidate → ReplacePod → ObserveImage → MarkReady

Conceptual reframe: "deployment" stops being "swap image" and becomes "rebuild the running state from current spec." commonOverrides already produces the correct config from current Spec.ExternalAddress; we just weren't re-running TaskConfigApply outside init.

Two correctness fixes from pre-implementation expert review

Both Platform and K8s specialists independently flagged these before I touched a keyboard:

1. Real ConfigIntent (not nil). Old buildNodeUpdatePlan called paramsForTaskType(node, taskType, nil, nil). With configIntent=nil, the params builder returned an empty ConfigIntent{} — empty Mode — and sei-config's ResolveIntent rejects empty Mode. TaskConfigApply would have failed at runtime on the first image bump.

Fix: extracted NodePlanner.BuildConfigIntent(node) from each mode planner's BuildPlan. buildNodeUpdatePlan calls it via plannerForMode(node).BuildConfigIntent(node). Same source of truth for mode + overrides on init and day-2.

2. Incremental: true on the day-2 intent. The init path uses applyFull — regenerates config.toml from mode defaults + overrides. Running that on a Running node would wipe:

persistent-peers (set by the sidecar's writePeersToConfig)
State-sync trust point
Genesis-peers
Operator-managed TOML keys outside the controller's overrides set

applyIncremental reads on-disk config and patches only the overrides we own. Day-2 must use it.

Tests

TestBuildRunningPlan_ImageDrift_ReturnsNodeUpdatePlan — updated to expect the new 7-task progression.
TestBuildRunningPlan_ImageDriftWinsOverSidecar — same.
TestBuildNodeUpdatePlan_ConfigApplyHasIntentWithMode — asserts Mode is set + Overrides[KeyP2PExternalAddress] propagates from Spec.ExternalAddress.
TestBuildNodeUpdatePlan_ConfigApplyIsIncremental — pins Incremental: true.
TestBuildNodeUpdatePlan_ConfigApplyIntentMatchesModePlanner — table-driven coverage across full/archive/validator modes.

Existing unit + envtest suites: green.

Progression ordering — why between ApplyService and ReplacePod

ConfigApply must run before ReplacePod: the OLD pod's sidecar writes config.toml to the PVC; the new pod mounts the PVC and reads the fresh file on seid start.
ConfigApply after ReplacePod would race against new-pod sidecar liveness and waste a restart cycle.
external-address has no hot-reload path in seid today (seictl/sidecar/tasks/config_reload.go literally has a TODO: signal seid to re-read config), so the pod restart is the trigger.
Matches the init baseProgression order.

classifyPlan unchanged

classifyPlan (planner.go:217) identifies "node-update" plans by TaskTypeObserveImage presence — still in the progression, classification still works.

Out of scope

Autonomous Spec.ExternalAddress drift detection on Running nodes. With this change, ExternalAddress propagates on the next trigger of buildNodeUpdatePlan (image bump or other image drift). Operators currently trigger via an image SHA pin. If autonomous propagation becomes operationally necessary (NLB hostname rotation, Spec.Overrides drift, publishable-P2P opt-out), file a follow-up to add drift detection in buildRunningPlan that routes to the same buildNodeUpdatePlan.

Test plan

Unit + envtest green on CI.
After merge + ECR build + harbor controller image bump: pinning arctic-1 syncer image to the same digest (no-op semantically) triggers buildNodeUpdatePlan with the new progression. config.toml on the new pod has external-address = "syncer-0-0-p2p.arctic-1.harbor.platform.sei.io:26656". seid logs show PEX advertising the NLB hostname.

🤖 Generated with Claude Code

The image-drift plan previously ran ApplyStatefulSet → ApplyService → ReplacePod → ObserveImage → MarkReady. config.toml was never re-applied, so any change to fields the controller owns via commonOverrides — most notably Spec.ExternalAddress — never reached the running pod. Inbound publishable P2P could work (NLB to pod), but seid never advertised the NLB hostname via PEX/NodeInfo because external-address stayed empty. Add TaskConfigApply + TaskConfigValidate to the progression, between ApplyService and ReplacePod. ConfigApply runs against the OLD pod's sidecar (writes to the PVC); ReplacePod then cycles the pod and the new seid reads the fresh config.toml. external-address has no hot-reload path in seid today (seictl config_reload.go has a TODO for SIGHUP). Two correctness fixes folded in per pre-implementation expert review: 1. buildNodeUpdatePlan now constructs a real ConfigIntent via the mode planner. Previously paramsForTaskType received nil and returned an empty-Mode ConfigIntent — sei-config's ResolveIntent rejects empty Mode, so TaskConfigApply would have failed at runtime on first image bump. New NodePlanner.BuildConfigIntent method extracted from each mode's BuildPlan; same code path on init and day-2, shared source of truth for mode + overrides. 2. Day-2 intent sets Incremental=true. The init applyFull regenerates config.toml from mode defaults + overrides; running that on a Running node would wipe persistent-peers, state-sync trust point, and any operator-managed TOML keys outside the controller's overrides set. applyIncremental reads on-disk config and patches only the keys we own. Tests: - TestBuildRunningPlan_ImageDrift_ReturnsNodeUpdatePlan updated to expect the new 7-task progression (was 5). - TestBuildRunningPlan_ImageDriftWinsOverSidecar updated similarly. - TestBuildNodeUpdatePlan_ConfigApplyHasIntentWithMode — asserts Mode is set and Overrides carries KeyP2PExternalAddress from Spec.ExternalAddress. - TestBuildNodeUpdatePlan_ConfigApplyIsIncremental — pins Incremental=true. - TestBuildNodeUpdatePlan_ConfigApplyIntentMatchesModePlanner — table- driven coverage across full/archive/validator modes. Out of scope (separate follow-up, both experts flagged): - Autonomous ExternalAddress drift detection on Running nodes. With this change ExternalAddress propagates on next image bump (or any other trigger of buildNodeUpdatePlan). Operators currently trigger via an image SHA pin. If autonomous propagation becomes operationally necessary (e.g., NLB hostname rotation, Spec.Overrides drift, opt-out of publishable P2P), file a follow-up to add drift detection in buildRunningPlan that routes to the same buildNodeUpdatePlan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor · 2026-05-30T17:39:33Z

PR Summary

Medium Risk
Changes production rollout ordering and mutates on-disk TOML before pod restart; mis-patches could affect P2P advertising, though scope is limited to controller-owned keys versus full config re-apply.

Overview
Running-node image-drift plans now surgically update config.toml before pod replacement instead of only rolling the workload.

Day-2 planning moves from a shared buildNodeUpdatePlan into per-mode buildRunningPlan helpers (full, archive, replayer, validator) that call assembleDay2Plan. On drift, the progression becomes Apply STS/Service → config-patch → config-validate → replace pod → observe image → mark ready. The patch payload is built by externalAddressPatch, stamping p2p.external-address from Spec.ExternalAddress (including empty string for opt-out). Validators prepend signing/node/operator key validation tasks before any StatefulSet change.

A new task.ConfigPatchTask wires config-patch through the task registry to the sidecar’s generic TOML merge (no full sei-config apply on day-2). Tests cover the 7-task progression, external-address in patch params, archive/replayer parity, validator gates, and sidecar re-approval precedence.

^{Reviewed by Cursor Bugbot for commit b656d5e. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 1d4dc7d. Configure here.}

Cross-review feedback from kubernetes-specialist: 1. Defensive nil-guard on each mode planner's BuildConfigIntent. Today plannerForMode() gates which planner gets called, so the nil-deref isn't reachable. But the method is now part of the NodePlanner interface contract — making it independently safe means future refactors that call it through other paths don't accidentally NPE. 2. Pin init plans to Incremental=false via a new test. The day-2 path sets Incremental=true intentionally; the init path must stay on applyFull (regenerate from mode defaults). Without the assertion, a future refactor could silently flip init and we'd only notice when a fresh node came up with stale state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bdchatham · 2026-05-31T18:18:56Z

 	setNodeUpdateCondition(node, metav1.ConditionTrue, "UpdateStarted",
 		fmt.Sprintf("image drift detected: spec=%s current=%s", node.Spec.Image, node.Status.CurrentImage))

+	mode, err := plannerForMode(node)


can we refactor mode -> nodePlanner? Mode is very misleading but I understand we want to avoid shadowing.

bdchatham · 2026-05-31T18:19:24Z

+	if err != nil {
+		return nil, fmt.Errorf("resolving mode planner: %w", err)
+	}
+	intent, err := mode.BuildConfigIntent(node)


Having a separate method for this is a smell. We should be able to determine this in planners

bdchatham · 2026-05-31T18:31:53Z

+func (p *fullNodePlanner) BuildConfigIntent(node *seiv1alpha1.SeiNode) (*seiconfig.ConfigIntent, error) {
+	if node.Spec.FullNode == nil {
+		return nil, fmt.Errorf("fullNode sub-spec is nil")
+	}
+	return &seiconfig.ConfigIntent{
+		Mode:      seiconfig.ModeFull,
+		Overrides: mergeOverrides(mergeOverrides(commonOverrides(node), p.controllerOverrides(node)), node.Spec.Overrides),
+	}, nil
+}
+


We shouldn't need a new method to determine config intent. In my ideal scenario, we determine this in the planner itself. It's okay if we do this in multiple places, repeating code. Doing it this way makes config intent a special case, which it's really just another task/sequence of tasks

Address PR feedback: drop the BuildConfigIntent interface method (the abstraction made config-intent a special case when it's just task params), delete the free buildNodeUpdatePlan / buildRunningPlan helpers, and have each mode planner own its full BuildPlan including the day-2 dispatch. Reconciler/executor still just call BuildPlan; the startup-vs-day-2 distinction stays inside the planner. NodePlanner interface unchanged from main (Validate, BuildPlan, Mode). Switch the day-2 task from config-apply (Incremental=true) to config-patch: - config-patch already exists in seictl as a registered sidecar task (sidecar/engine/types.go:15, handler at sidecar/tasks/config.go). Generic TOML merge-patch — read file, merge, write — with no sei-config involvement. - The controller's job becomes "stamp these specific fields the controller directly owns" rather than "regenerate the config via sei-config's typed mode-defaulted resolver." Aligns with the forward-looking design where seid owns config-generation and migrations; controller just patches. - Surgical: only touches the keys we explicitly name. No forced logging.level=info reset on every image bump. No mode-defaulted backfill clobbering operator edits. - For publishable-P2P the day-2 patch is literally {"config.toml": {"p2p": {"external-address": Spec.ExternalAddress}}}. TaskConfigValidate stays after the patch as the parser-level gate (losing sei-config Diagnostics is an explicit, documented trade-off; operator-debt is "have seid grow doctor/migrate" not "keep coupling the controller to sei-config's typed registry"). Validator's day-2 progression prepends ValidateSigningKey / ValidateNodeKey / ValidateOperatorKeyring before ApplyStatefulSet — mirrors the validator's init plan and addresses the Cursor Bugbot high-severity finding (image-drift rollouts must not recreate the pod without verifying mounted key secrets first; kubelet mount failures with no controller-side error context). Notes documented in the planner doc: - pelletier/go-toml/v2 does not preserve comments / key ordering on re-encode. First day-2 patch erases operator-added comments. - "Init writes the whole file, day-2 patches named keys, never mix in one plan" — sharp boundary on Phase == Running. Tests rewritten: - Per-mode "day-2 progression matches expected slice" tests (full/archive share the 7-task list; validator gets 10 with the 3 validation-gate prefix). - "Config-patch payload carries external-address from Spec.ExternalAddress." - Determinism: two builds of the same spec produce identical patch payloads. - All existing buildRunningPlan free-function callsites updated to go through (&fullNodePlanner{}).BuildPlan(node). Drops: - BuildConfigIntent interface method + 4 implementations. - Free helpers buildRunningPlan, buildNodeUpdatePlan in planner.go. - TestBuildNodeUpdatePlan_ConfigApplyIsIncremental and friends (no longer applicable — day-2 doesn't go through ConfigIntent). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor Bot reviewed May 30, 2026

View reviewed changes

Comment thread internal/planner/planner.go Outdated

bdchatham commented May 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(planner): re-apply config.toml as part of buildNodeUpdatePlan#374

feat(planner): re-apply config.toml as part of buildNodeUpdatePlan#374
bdchatham wants to merge 3 commits into
mainfrom
feat/update-plan-includes-config-apply

bdchatham commented May 30, 2026

Uh oh!

cursor Bot commented May 30, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

bdchatham May 31, 2026

Uh oh!

bdchatham May 31, 2026

Uh oh!

bdchatham May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bdchatham commented May 30, 2026

Problem

Fix

Two correctness fixes from pre-implementation expert review

Tests

Progression ordering — why between ApplyService and ReplacePod

classifyPlan unchanged

Out of scope

Test plan

Uh oh!

cursor Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bdchatham May 31, 2026

Choose a reason for hiding this comment

Uh oh!

bdchatham May 31, 2026

Choose a reason for hiding this comment

Uh oh!

bdchatham May 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented May 30, 2026 •

edited

Loading