feat(controller/node): resolve label peers to NLB addresses + controller-side node_id (#368)#369
Conversation
… node_id controller-side Move node_id resolution for the LabelPeerSource flow from the sidecar's :26657/status query to the controller's existing per-peer sidecar gRPC. Status.ResolvedPeers wire format shifts to fully-composed <node_id>@<host>:<port> strings. Why: the sidecar's :26657/status path is in-cluster-only (NLBs only forward 26656). After PR #365, a peer's Spec.ExternalAddress is the publishable vanity hostname. Using that address in ResolvedPeers means the sidecar can no longer query :26657 to learn node_id externally. The controller already has clean in-cluster access to each peer's sidecar gRPC (port 7777) — same path the genesis CollectAndSetPeers uses to learn node_id from node_key.json. Changes: - internal/controller/node/peers.go: extend resolveLabelPeers to call the peer's sidecar via r.Planner.BuildSidecarClient + GetNodeID, and compose <node_id>@<host>:<port>. host is Spec.ExternalAddress when set, else headless Service DNS at p2p port. On per-peer failure the whole resolve errors and controller-runtime retries — matches the genesis path semantics. - internal/planner/planner.go: Label sources route to PeerSourceStatic (same source path the genesis-Static flow already uses) instead of PeerSourceDNSEndpoints. The sidecar's DNSEndpointsSource is unchanged and still serves EC2Tags peers. - internal/task/task.go: widen task.SidecarClient interface with GetNodeID(ctx) (string, error). The full *sidecar.SidecarClient already implements this; mocks in tests get one-line stubs. - api/v1alpha1/seinode_types.go: ResolvedPeers godoc updated for the new composed wire format. Tests: - New TestReconcilePeers_PrefersExternalAddress covers the Spec.ExternalAddress branch. - Existing peers tests updated for the new composed format. - Envtest StubSidecarClient gains a GetNodeID stub returning "stub-node-id". Suite still passes (~64s). Wire-format change to Status.ResolvedPeers is the only one-way door. The field was previously bare hosts; consumers are the planner only, which is updated in this same PR. Refs #368 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR SummaryMedium Risk Overview The discover-peers planner path for label sources now passes those strings as static addresses (replacing DNS-endpoint resolution in the sidecar). API/CRD docs and tests were updated for the new wire format and failure modes. Reviewed by Cursor Bugbot for commit 3b71ffc. Bugbot is set up for automated code reviews on this repo. Configure here. |
…ecar failure Per platform-engineer's HOLD on PR #369: atomic resolve semantics wedge fleet-wide reconciles when any one label-matched peer's sidecar is transient-unavailable. In pacific-1 (6+ SNDs with mutual label selectors), a single archive sidecar restart would block every consumer's reconcile loop — not just peer updates, all of it, because reconcilePeers runs before the status patch. Fix: - On per-peer BuildSidecarClient or GetNodeID error, preserve that peer's prior entry from Status.ResolvedPeers if available. Indexed by host:port to look up the existing <node_id>@host:port. - If no prior entry exists (peer never resolved), skip with a structured log line, do not fail the reconcile. - The genesis CollectAndSetPeers path keeps its atomic semantics — that's a one-shot bounded-cohort operation where retry is correct. - New tests: TestReconcilePeers_PreservesPriorEntryOnTransientFailure, TestReconcilePeers_SkipsNewPeerOnSidecarFailure. Addresses the platform-engineer HOLD reason on PR #369. K8s-specialist + sei-network-specialist already GO'd the atomic version; the "silent partial list" concern they raised is mitigated by the structured log line on every skip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Three-way cross-review verdict:
Reasoning from first principles per coral discipline: atomic is right for the genesis path (bounded cohort, one-shot, co-bootstrapping) but wrong for the steady-state label flow (every reconcile forever, arbitrary peer lifecycle). Platform-engineer's blocker is the load-bearing one. Fix pushed in
Tests + envtest still green. |
5-9 lines → 2-4 across resolveLabelPeers, indexResolvedPeersByHost, peerAddress, the planner-side Label→Static branch, the SidecarClient interface, Status.ResolvedPeers godoc, and the envtest stub. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 10a8cbf. Configure here.
…d contract
NodeResolver.BuildSidecarClient is documented as nilable ("Nil factory
skips the sidecar probe; used by tests") and ResolvePlan nil-guards it.
resolveLabelPeers was calling it unconditionally — a panic waiting on a
test or future refactor that leaves the factory unset. Treat nil as a
transient per-peer failure (matches existing fallback semantics: preserve
prior entry or skip with a log line). Caught by Cursor Bugbot review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
K8s-specialist cross-review on the nil-guard fix asked for the missing test variant: nil factory + prior entry present must route to preserve-prior, not skip. Locks errNoSidecarFactory into the same recovery path as runtime transient sidecar errors. Also: drop the var ( ) block at the call site for a single var + sentinel init — closer to the boring-clear-code style used elsewhere in the file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
errNoSidecarFactory doc 3 lines → 2; new test docstrings shortened to match the file's existing one-line convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier comment sweep on api/v1alpha1/seinode_types.go didn't carry into the generated CRD YAMLs; verify-generated caught the drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Summary
First slice of the peer-discovery harmonization (#368). Moves node_id resolution for the
LabelPeerSourceflow from the sidecar's:26657/statusquery to the controller's existing per-peer sidecar gRPC.Status.ResolvedPeersshifts to fully-composed<node_id>@<host>:<port>strings.Why
The sidecar's
:26657/statuslookup path is in-cluster-only — NLBs only forward 26656, so it can't reach a publishable peer over the public endpoint. After PR #365, a peer'sSpec.ExternalAddressis the publishable vanity hostname; if the controller writes that intoStatus.ResolvedPeersand the sidecar tries:26657/status, the lookup fails permanently.The controller already has in-cluster access to each peer's sidecar gRPC (port 7777) — the same path the genesis
CollectAndSetPeerstask uses to readnode_idfromnode_key.json. Reusing that path for the Label flow keeps node_id resolution authoritative (it's from the keyfile, not from:26657) and decouples discovery from publishable-vs-private addressing.Behavior
Spec.ExternalAddress<vanity>:26656)<node_id>@<vanity>:26656<node_id>@<peer>-0.<peer>.<ns>.svc.cluster.local:26656node_idis fetched via the existingSidecarClient.GetNodeID(in-cluster gRPC to port 7777). On per-peer failure the whole resolve errors and controller-runtime retries — same atomic semantic as the genesisCollectAndSetPeerstask. Seid only readspersistent_peersat boot, so transient peer-sidecar failures during reconcile don't disturb running pods.Files
internal/controller/node/peers.go—resolveLabelPeerscomposes the new wire format; newpeerAddresshelper picksSpec.ExternalAddressor falls back to headless DNS.internal/planner/planner.go— Label sources map toPeerSourceStaticinstead ofPeerSourceDNSEndpoints. The sidecar'sDNSEndpointsSourceis untouched — still servesEC2Tagspeers.internal/task/task.go— widenstask.SidecarClientinterface withGetNodeID(ctx) (string, error). The full*sidecar.SidecarClientalready implements it; mocks gain one-line stubs.api/v1alpha1/seinode_types.go—ResolvedPeersgodoc updated for the new composed wire format.Wire-format change (one-way door)
Status.ResolvedPeerswas previously a[]stringof bare hosts (<peer>-0.<peer>.<ns>.svc.cluster.local). It's now a[]stringof fully-composed<node_id>@<host>:<port>entries. The only consumer is the planner, which is updated in this same PR. No external consumer is affected.Backward compatibility
LabelPeerSourcekeep working — theirResolvedPeerswill repopulate with the new format on next reconcile, and the planner routes them through the sameStaticsidecar source path that genesis peers already use.Spec.Peersuser-facing shape is unchanged.EC2TagsandStaticPeerSources are unchanged.DNSEndpointsSource(the:26657/statuspath) stays inseictl— still servesEC2Tags. A follow-up could retire it onceEC2Tagsalso moves to controller-side; out of scope here.Test plan
TestReconcilePeers_PrefersExternalAddress. Existing peers tests updated for the new format.go test ./api/v1alpha1/... ./internal/...— green.make test-integration— envtest green (~64s).golangci-lint run --new-from-rev=main ./...— 0 issues.make manifests generate— clean.Refs
sei-protocol/platform#747(env wiring),sei-protocol/platform#748(SG opening)🤖 Generated with Claude Code