Skip to content

Commit d1eb79e

Browse files
authored
fix(helm): preserve STS serviceName + networkPolicy.egress back-compat (#4569)
* fix(helm): preserve STS serviceName + networkPolicy.egress back-compat Greptile flagged two real upgrade-breaking changes vs the prior chart: 1. statefulset-postgresql spec.serviceName flipped from <name>-postgresql to <name>-postgresql-headless. spec.serviceName is immutable, so any existing install would hit 'Forbidden: updates to statefulset spec ...' on helm upgrade. Revert to the original name (the headless Service in services.yaml is added alongside, not as a swap). 2. networkPolicy.egress changed from a list to a map ({extraRules, exceptCidrs}), silently dropping any custom egress list set by existing users. Restore the original list semantics for networkPolicy.egress and move cloud-metadata blocking to a sibling top-level field networkPolicy.egressExceptCidrs. Adds NOTES.txt upgrade-notes entry covering both + the ESO v1→v1beta1 default flip (functionally a no-op, but worth surfacing). * docs(helm): update README egress reference to new key name * fix(helm): revert copilot-postgresql STS serviceName too (same immutability issue) Audit caught that the main fix in d5c2e8e missed statefulset-copilot-postgres.yaml, which had the identical immutable-field rename from -copilot-postgresql to -copilot-postgresql-headless. Same upgrade-break vector for anyone running copilot.enabled=true on a prior chart version. Mirrors the fix and comment from the main postgresql STS. * improvement(helm): postgres startupProbe + otel-collector NetworkPolicy - add startupProbe defaults for both postgresql + copilot-postgresql STSs to shield liveness from slow first-boot (pgvector init, WAL replay) - render a dedicated NetworkPolicy for the otel-collector when telemetry.enabled=true (OTLP ingress from app/realtime/copilot, DNS + HTTPS egress for forwarding to external observability backends) - document why copilot + copilot-postgresql intentionally do NOT ship dedicated NetworkPolicies (Redis URL is unknowable at render time) - regression test pins the otel-collector NP at documentIndex 3 * test(helm): assert custom egress applied to realtime NP too The prior test claimed coverage of both app and realtime NPs but only asserted documentIndex 0. Split into two tests so a regression that drops custom egress from realtime would fail loudly. * docs(helm-skill): trim narrative bloat in values-model Cut the historical 'Layer 2 was added in chart 1.0.0' note and the generic 'single source of truth' framing. Kept the two actionable points: ESO requires mapping Layer 1 keys; app.env overrides envDefaults.
1 parent 05892f7 commit d1eb79e

8 files changed

Lines changed: 190 additions & 28 deletions

File tree

helm/sim/.claude/skills/sim-helm/references/values-model.md

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -44,11 +44,9 @@ The Sim chart splits configuration across **four** layers. Understanding which l
4444

4545
## Why this layering exists
4646

47-
**Single source of truth per concern.** Secrets live in a Secret. Operational defaults live where users can override them. Chart-computed values live where the chart can authoritatively compute them.
47+
**ESO compatibility.** When `externalSecrets.enabled=true`, the chart-managed Secret is **not rendered** — ESO renders one instead. Anything in Layer 1 must be mapped via `remoteRefs.app.<KEY>` or it's silently missing. Layers 2–4 are unaffected by ESO.
4848

49-
**ESO compatibility.** When `externalSecrets.enabled=true`, the chart-managed Secret is **not rendered** — ESO renders one instead. Anything in Layer 1 must be mapped via `remoteRefs.app.<KEY>` or it's silently missing. Layers 2–4 are unaffected by ESO. Putting operational tunables in `envDefaults` instead of `env` means ESO users don't have to map dozens of tunables — just the real secrets.
50-
51-
**Backwards compatibility.** Layer 2 was added in chart 1.0.0 (formerly all defaults lived in `app.env`). The override-skip logic in the Deployment template means existing users who set values in `app.env` continue to work — those values win over `envDefaults`.
49+
**Override precedence.** Values set in `app.env` (Layer 1 overrides) win over `envDefaults` (Layer 2) — so users who already had operational tunables in `app.env` continue to work.
5250

5351
## Where keys live — the canonical list
5452

helm/sim/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -217,7 +217,7 @@ Before installing in production, confirm each of the following:
217217
* **Pinned images** — override `image.tag` (or `image.digest`) with an explicit version. Do not rely on the chart's default tag in production.
218218
* **Secrets management** — provide secrets via External Secrets Operator (ESO) or pre-created Kubernetes Secrets. Never commit secrets to `values.yaml`.
219219
* **TLS / Ingress** — set the `cert-manager.io/cluster-issuer` annotation on the ingress and tune `proxy-body-size` / `proxy-read-timeout` for your workload. See commented examples in `values.yaml`.
220-
* **Network policy egress** — review `networkPolicy.egress.exceptCidrs`. Defaults block cloud metadata endpoints (`169.254.169.254/32`, `169.254.170.2/32`); add your cluster's API server CIDR for stronger isolation.
220+
* **Network policy egress** — review `networkPolicy.egressExceptCidrs`. Defaults block cloud metadata endpoints (`169.254.169.254/32`, `169.254.170.2/32`); add your cluster's API server CIDR for stronger isolation. Custom egress rules go in `networkPolicy.egress` (a list).
221221
* **Namespace hardening** — label the install namespace with Pod Security Standards `restricted` enforcement (`pod-security.kubernetes.io/enforce=restricted`).
222222
* **Env validation** — keys under `app.env`, `realtime.env`, and `copilot.env` are passed through to the application and validated at startup. The JSON Schema intentionally does not enforce `additionalProperties: false` (would break custom user envs), so typos like `OPENA_API_KEY` (instead of `OPENAI_API_KEY`) surface as missing-key errors at runtime, not at `helm install` time. Review your env block carefully.
223223
* **Set public URLs**`app.env.NEXT_PUBLIC_APP_URL` and `app.env.BETTER_AUTH_URL` must match your public origin (e.g. `https://sim.example.com`). Leaving them as `localhost` breaks sign-in.

helm/sim/templates/NOTES.txt

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -81,7 +81,16 @@ Your release is named {{ .Release.Name }} in namespace {{ .Release.Namespace }}.
8181
# Upgrade after changing values
8282
helm upgrade {{ .Release.Name }} ./helm/sim --namespace {{ .Release.Namespace }} -f your-values.yaml
8383

84-
5. Where to go next:
84+
5. Upgrade notes (read before upgrading from a chart version released before this one):
85+
86+
* externalSecrets.apiVersion default is "v1beta1" (was "v1"). v1beta1 is
87+
supported by every ESO release from v0.7+ through current. If you're on
88+
ESO v0.17+ and want the graduated v1 API, set externalSecrets.apiVersion: "v1".
89+
* networkPolicy.egress remains a list of custom egress rules (unchanged).
90+
Cloud-metadata CIDR blocking is now configured via networkPolicy.egressExceptCidrs
91+
(defaults to AWS/GCP/Azure IMDS + ECS task metadata).
92+
93+
6. Where to go next:
8594

8695
* Production checklist: helm/sim/README.md (search "Production checklist")
8796
* Troubleshooting: helm/sim/README.md (search "Troubleshooting")

helm/sim/templates/networkpolicy.yaml

Lines changed: 90 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -107,14 +107,14 @@ spec:
107107
- ipBlock:
108108
cidr: 0.0.0.0/0
109109
except:
110-
{{- range (default (list "169.254.169.254/32" "169.254.170.2/32") .Values.networkPolicy.egress.exceptCidrs) }}
110+
{{- range (default (list "169.254.169.254/32" "169.254.170.2/32") .Values.networkPolicy.egressExceptCidrs) }}
111111
- {{ . | quote }}
112112
{{- end }}
113113
ports:
114114
- protocol: TCP
115115
port: 443
116116
# Allow custom egress rules
117-
{{- with .Values.networkPolicy.egress.extraRules }}
117+
{{- with .Values.networkPolicy.egress }}
118118
{{- toYaml . | nindent 2 }}
119119
{{- end }}
120120

@@ -189,14 +189,14 @@ spec:
189189
- ipBlock:
190190
cidr: 0.0.0.0/0
191191
except:
192-
{{- range (default (list "169.254.169.254/32" "169.254.170.2/32") .Values.networkPolicy.egress.exceptCidrs) }}
192+
{{- range (default (list "169.254.169.254/32" "169.254.170.2/32") .Values.networkPolicy.egressExceptCidrs) }}
193193
- {{ . | quote }}
194194
{{- end }}
195195
ports:
196196
- protocol: TCP
197197
port: 443
198198
# Allow custom egress rules
199-
{{- with .Values.networkPolicy.egress.extraRules }}
199+
{{- with .Values.networkPolicy.egress }}
200200
{{- toYaml . | nindent 2 }}
201201
{{- end }}
202202
{{- end }}
@@ -296,11 +296,96 @@ spec:
296296
- ipBlock:
297297
cidr: 0.0.0.0/0
298298
except:
299-
{{- range (default (list "169.254.169.254/32" "169.254.170.2/32") .Values.networkPolicy.egress.exceptCidrs) }}
299+
{{- range (default (list "169.254.169.254/32" "169.254.170.2/32") .Values.networkPolicy.egressExceptCidrs) }}
300300
- {{ . | quote }}
301301
{{- end }}
302302
ports:
303303
- protocol: TCP
304304
port: 443
305305
{{- end }}
306+
307+
{{- if .Values.telemetry.enabled }}
308+
---
309+
# Network Policy for OpenTelemetry Collector
310+
apiVersion: networking.k8s.io/v1
311+
kind: NetworkPolicy
312+
metadata:
313+
name: {{ include "sim.fullname" . }}-otel-collector
314+
namespace: {{ .Release.Namespace }}
315+
labels:
316+
{{- include "sim.labels" . | nindent 4 }}
317+
app.kubernetes.io/component: telemetry
318+
spec:
319+
podSelector:
320+
matchLabels:
321+
{{- include "sim.selectorLabels" . | nindent 6 }}
322+
app.kubernetes.io/component: telemetry
323+
policyTypes:
324+
- Ingress
325+
- Egress
326+
ingress:
327+
# OTLP from app
328+
- from:
329+
- podSelector:
330+
matchLabels:
331+
{{- include "sim.app.selectorLabels" . | nindent 10 }}
332+
ports:
333+
- protocol: TCP
334+
port: 4317
335+
- protocol: TCP
336+
port: 4318
337+
# OTLP from realtime
338+
{{- if .Values.realtime.enabled }}
339+
- from:
340+
- podSelector:
341+
matchLabels:
342+
{{- include "sim.realtime.selectorLabels" . | nindent 10 }}
343+
ports:
344+
- protocol: TCP
345+
port: 4317
346+
- protocol: TCP
347+
port: 4318
348+
{{- end }}
349+
# OTLP from copilot
350+
{{- if .Values.copilot.enabled }}
351+
- from:
352+
- podSelector:
353+
matchLabels:
354+
{{- include "sim.selectorLabels" . | nindent 10 }}
355+
app.kubernetes.io/component: copilot
356+
ports:
357+
- protocol: TCP
358+
port: 4317
359+
- protocol: TCP
360+
port: 4318
361+
{{- end }}
362+
egress:
363+
# DNS
364+
- to: []
365+
ports:
366+
- protocol: UDP
367+
port: 53
368+
- protocol: TCP
369+
port: 53
370+
# HTTPS for forwarding to external observability backends (Datadog, Honeycomb, etc.)
371+
- to:
372+
- ipBlock:
373+
cidr: 0.0.0.0/0
374+
except:
375+
{{- range (default (list "169.254.169.254/32" "169.254.170.2/32") .Values.networkPolicy.egressExceptCidrs) }}
376+
- {{ . | quote }}
377+
{{- end }}
378+
ports:
379+
- protocol: TCP
380+
port: 443
381+
{{- end }}
382+
383+
{{- /*
384+
Copilot + copilot-postgresql intentionally do NOT ship dedicated NetworkPolicies.
385+
Copilot requires REDIS_URL (external Redis on a non-443 port), and the chart
386+
cannot know the user's Redis host/port at render time — a default egress rule
387+
would silently block Redis on most installs. Users running networkPolicy.enabled=true
388+
with copilot enabled should add their own NPs (or extend networkPolicy.egress
389+
with the appropriate egress rules).
390+
*/}}
306391
{{- end }}

helm/sim/templates/statefulset-copilot-postgres.yaml

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -66,7 +66,11 @@ metadata:
6666
{{- include "sim.labels" . | nindent 4 }}
6767
app.kubernetes.io/component: copilot-postgresql
6868
spec:
69-
serviceName: {{ include "sim.fullname" . }}-copilot-postgresql-headless
69+
# Must remain {{ include "sim.fullname" . }}-copilot-postgresql (not the
70+
# -headless name) — spec.serviceName is immutable on a StatefulSet, and
71+
# the prior chart shipped with this value. Same rationale as the main
72+
# postgresql STS; see statefulset-postgresql.yaml for details.
73+
serviceName: {{ include "sim.fullname" . }}-copilot-postgresql
7074
replicas: 1
7175
podManagementPolicy: OrderedReady
7276
updateStrategy:
@@ -111,6 +115,10 @@ spec:
111115
envFrom:
112116
- secretRef:
113117
name: {{ include "sim.fullname" . }}-copilot-postgresql-secret
118+
{{- if .Values.copilot.postgresql.startupProbe }}
119+
startupProbe:
120+
{{- toYaml .Values.copilot.postgresql.startupProbe | nindent 12 }}
121+
{{- end }}
114122
{{- if .Values.copilot.postgresql.livenessProbe }}
115123
livenessProbe:
116124
{{- toYaml .Values.copilot.postgresql.livenessProbe | nindent 12 }}

helm/sim/templates/statefulset-postgresql.yaml

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -90,7 +90,12 @@ metadata:
9090
labels:
9191
{{- include "sim.postgresql.labels" . | nindent 4 }}
9292
spec:
93-
serviceName: {{ include "sim.fullname" . }}-postgresql-headless
93+
# Must remain {{ include "sim.fullname" . }}-postgresql (not the -headless
94+
# name) — spec.serviceName is immutable on a StatefulSet, and the prior
95+
# chart shipped with this value. Changing it would break `helm upgrade` for
96+
# every existing install with `Forbidden: updates to statefulset spec ...`.
97+
# The headless Service in services.yaml is added alongside, not as a swap.
98+
serviceName: {{ include "sim.fullname" . }}-postgresql
9499
replicas: 1
95100
minReadySeconds: 10
96101
podManagementPolicy: OrderedReady
@@ -135,6 +140,10 @@ spec:
135140
name: {{ include "sim.fullname" . }}-postgresql-env
136141
- secretRef:
137142
name: {{ include "sim.postgresqlSecretName" . }}
143+
{{- if .Values.postgresql.startupProbe }}
144+
startupProbe:
145+
{{- toYaml .Values.postgresql.startupProbe | nindent 12 }}
146+
{{- end }}
138147
{{- if .Values.postgresql.livenessProbe }}
139148
livenessProbe:
140149
{{- toYaml .Values.postgresql.livenessProbe | nindent 12 }}

helm/sim/tests/networkpolicy_test.yaml

Lines changed: 33 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -128,10 +128,23 @@ tests:
128128
- protocol: TCP
129129
port: 3000
130130

131-
- it: egress.extraRules are appended to both app and realtime NetworkPolicies
131+
- it: telemetry collector NetworkPolicy renders when telemetry.enabled=true
132132
set:
133133
<<: *defaults
134-
networkPolicy.egress.extraRules:
134+
telemetry.enabled: true
135+
documentIndex: 3
136+
asserts:
137+
- equal:
138+
path: kind
139+
value: NetworkPolicy
140+
- equal:
141+
path: metadata.name
142+
value: t-sim-otel-collector
143+
144+
- it: networkPolicy.egress (custom rules) are appended to the app NetworkPolicy
145+
set:
146+
<<: *defaults
147+
networkPolicy.egress:
135148
- to: []
136149
ports:
137150
- protocol: TCP
@@ -145,3 +158,21 @@ tests:
145158
ports:
146159
- protocol: TCP
147160
port: 5432
161+
162+
- it: networkPolicy.egress (custom rules) are appended to the realtime NetworkPolicy
163+
set:
164+
<<: *defaults
165+
networkPolicy.egress:
166+
- to: []
167+
ports:
168+
- protocol: TCP
169+
port: 5432
170+
documentIndex: 1
171+
asserts:
172+
- contains:
173+
path: spec.egress
174+
content:
175+
to: []
176+
ports:
177+
- protocol: TCP
178+
port: 5432

helm/sim/values.yaml

Lines changed: 35 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -622,12 +622,22 @@ postgresql:
622622
targetPort: 5432
623623

624624
# Health checks
625+
# startupProbe shields liveness from slow first-boot scenarios (pgvector
626+
# extension init, WAL replay after a crash on a large data dir). Gives
627+
# postgres up to 150s (30 * 5s) to become ready before liveness takes over.
628+
startupProbe:
629+
exec:
630+
command: ["pg_isready", "-U", "postgres", "-d", "sim"]
631+
periodSeconds: 5
632+
failureThreshold: 30
633+
timeoutSeconds: 5
634+
625635
livenessProbe:
626636
exec:
627637
command: ["pg_isready", "-U", "postgres", "-d", "sim"]
628638
initialDelaySeconds: 10
629639
periodSeconds: 5
630-
640+
631641
readinessProbe:
632642
exec:
633643
command: ["pg_isready", "-U", "postgres", "-d", "sim"]
@@ -954,7 +964,7 @@ monitoring:
954964
# to each other and to required external services (DNS, HTTPS) while blocking
955965
# everything else. The egress block additionally blacklists cloud metadata
956966
# endpoints (169.254.169.254/32, 169.254.170.2/32) by default — extend
957-
# egress.exceptCidrs with your cluster's API server CIDR for tighter isolation.
967+
# egressExceptCidrs with your cluster's API server CIDR for tighter isolation.
958968
# Your CNI must support NetworkPolicy (Calico, Cilium, GKE Dataplane V2, etc.).
959969
networkPolicy:
960970
enabled: false
@@ -973,16 +983,18 @@ networkPolicy:
973983
# Custom ingress rules appended to the policy
974984
ingress: []
975985

976-
# Egress configuration
977-
egress:
978-
# CIDRs excluded from broad HTTPS (443) egress.
979-
# Defaults block AWS/GCP/Azure IMDS (169.254.169.254/32) and ECS task metadata
980-
# (169.254.170.2/32). Add your cluster's API server CIDR for stronger isolation.
981-
exceptCidrs:
982-
- "169.254.169.254/32"
983-
- "169.254.170.2/32"
984-
# Custom egress rules appended to the policy
985-
extraRules: []
986+
# Custom egress rules appended to the policy.
987+
# Kept as a top-level list (not a map) for backward compatibility with the
988+
# pre-1.0 chart that shipped `networkPolicy.egress: []`. Existing values
989+
# files continue to work without changes.
990+
egress: []
991+
992+
# CIDRs excluded from broad HTTPS (443) egress.
993+
# Defaults block AWS/GCP/Azure IMDS (169.254.169.254/32) and ECS task metadata
994+
# (169.254.170.2/32). Add your cluster's API server CIDR for stronger isolation.
995+
egressExceptCidrs:
996+
- "169.254.169.254/32"
997+
- "169.254.170.2/32"
986998

987999
# Shared storage for enterprise workflows requiring data sharing between pods
9881000
sharedStorage:
@@ -1438,14 +1450,24 @@ copilot:
14381450
targetPort: 5432
14391451

14401452
# Health checks
1453+
# startupProbe shields liveness from slow first-boot scenarios (pgvector
1454+
# extension init, WAL replay after a crash). Gives postgres up to 150s
1455+
# (30 * 5s) to become ready before liveness takes over.
1456+
startupProbe:
1457+
exec:
1458+
command: ["pg_isready", "-U", "copilot", "-d", "copilot"]
1459+
periodSeconds: 5
1460+
failureThreshold: 30
1461+
timeoutSeconds: 5
1462+
14411463
livenessProbe:
14421464
exec:
14431465
command: ["pg_isready", "-U", "copilot", "-d", "copilot"]
14441466
initialDelaySeconds: 10
14451467
periodSeconds: 5
14461468
timeoutSeconds: 5
14471469
failureThreshold: 10
1448-
1470+
14491471
readinessProbe:
14501472
exec:
14511473
command: ["pg_isready", "-U", "copilot", "-d", "copilot"]

0 commit comments

Comments
 (0)