Commit 9d2dd8f
improvement(helm): helm chart updates with security, ESO, and docs overhaul (#4565)
* improvement(helm): production-ready chart with security, ESO, and docs overhaul
Comprehensive Helm chart improvements bringing the chart up to industry
standards for security, secret management, and documentation.
Security
- Pod Security Standards "restricted" defaults on every pod and container
(runAsNonRoot, allowPrivilegeEscalation=false, capabilities.drop=[ALL],
seccompProfile=RuntimeDefault)
- automountServiceAccountToken=false on ServiceAccount and every pod
- NetworkPolicy egress blocks cloud metadata endpoints by default
- Sensitive app/realtime env keys auto-partitioned into chart-managed Secret
via envFrom; no more plaintext secrets on container specs
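For reference, the "restricted" settings listed above map onto standard Kubernetes securityContext fields. The fragment below is a minimal sketch of how they usually appear on a pod spec, not the chart's rendered output; the container name and image are placeholders.

```yaml
# Illustrative pod spec fragment; container name and image are placeholders.
spec:
  automountServiceAccountToken: false
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: example.invalid/app:1.0.0
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```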
Secret management
- Three modes: inline, existingSecret, ExternalSecrets Operator (ESO)
- ESO sync supports arbitrary sensitive keys
- Fail-fast template rendering when ESO enabled but sensitive key unmapped
- AWS/Azure/GCP example files document all three modes
Reliability
- Headless Services for both Postgres StatefulSets
- HPA-aware replicas (omits spec.replicas when autoscaling.enabled)
- PodDisruptionBudget auto-activates when replicaCount > 1
- Startup / liveness / readiness probes with distinct timings
- CronJob ttlSecondsAfterFinished for automatic cleanup
Chart hygiene
- Image tags default to Chart.AppVersion; pullPolicy IfNotPresent
- Optional image.digest pin for content-addressed deploys
- kubeVersion >=1.25.0-0 enforced
- Ollama pinned to 0.23.2; mount moved to /data
Documentation
- README rewritten in cert-manager / Bitnami style
- NOTES.txt with post-install guidance
- Example values files annotated with usage and secret-strategy guidance
* fix(helm): correct resource names in README (sim-sim-* → sim-*)
The sim.fullname helper collapses to the release name when the release
name contains the chart name. With the documented release name 'sim',
actual resources are 'sim-app', 'sim-postgresql', etc. — not the
'sim-sim-*' form previously documented. Fixes copy-paste commands in the
pre-1.0.0 upgrade walkthrough and several troubleshooting snippets.
Also expands the cronjobs component description to reflect the full set
of 13 scheduled jobs (was understated as just Gmail/Outlook polling).
* improvement(helm): split app/realtime env into Secret-bound + inline defaults
- Add app.envDefaults / realtime.envDefaults for chart-shipped operational
tunables (rate limits, timeouts, IVM, feature-flag defaults, localhost URL
fallbacks). Rendered inline on the container, not into the Secret
- Remove operational defaults from app.env / realtime.env so the chart-managed
Secret stays minimal and External Secrets Operator users only map keys they
actually set, not every chart default
- Skip an envDefaults key when the user explicitly sets it in env (K8s `env`
overrides `envFrom`, so an inline default would otherwise mask a Secret
value at runtime)
- Relax values.schema.json to allow empty strings on NEXT_PUBLIC_APP_URL,
BETTER_AUTH_URL, NEXT_PUBLIC_SUPPORT_EMAIL (defaults supplied via envDefaults)
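The override-skip rule exists because of how Kubernetes resolves duplicate keys: an entry under `env` takes precedence over the same key loaded through `envFrom`. A minimal sketch of the collision the chart avoids, with a hypothetical Secret name and an assumed localhost default:

```yaml
# Illustrative container fragment; Secret name and URL value are hypothetical.
containers:
  - name: app
    envFrom:
      - secretRef:
          name: sim-app-secrets          # holds the user's NEXT_PUBLIC_APP_URL
    env:
      - name: NEXT_PUBLIC_APP_URL
        value: "http://localhost:3000"   # would win over the Secret value at runtime
```

Skipping the inline default whenever the user sets the key in env leaves the Secret value as the only source.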
* fix(helm): address PR review — cronjob validation, ESO apiVersion, secret merge order, image guard
- CronJobs reference CRON_SECRET via secretKeyRef; fail-fast at template
time when cronjobs.enabled=true and app.env.CRON_SECRET is empty so users
get a clear error instead of a CreateContainerConfigError loop
- Default externalSecrets.apiVersion to "v1beta1" (supported by every ESO
release since v0.7). The previous "v1" default targets only ESO v0.17+
- Swap merge order in secrets-app.yaml so app.env wins over realtime.env
for shared keys (BETTER_AUTH_SECRET, BETTER_AUTH_URL, …) — both pods
consume the same Secret via envFrom, so the app value must be canonical
- Add `required` guard on sim.image so an empty tag + empty digest +
empty Chart.AppVersion surfaces as a clear template-time error instead
of rendering an invalid `repo:` reference
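For orientation, an ExternalSecret at the "v1beta1" API version generally looks like the sketch below. This is a generic External Secrets Operator manifest rather than the chart's template; the resource names, store, and remote key path are hypothetical.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sim-app-secrets                # hypothetical
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: example-store                # hypothetical
  target:
    name: sim-app-secrets              # Secret that ESO creates and keeps in sync
  data:
    - secretKey: BETTER_AUTH_SECRET
      remoteRef:
        key: sim/better-auth-secret    # hypothetical path in the backing store
```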
* fix(helm): require critical secrets to be mapped when ESO is enabled
Previously, enabling externalSecrets without mapping BETTER_AUTH_SECRET /
ENCRYPTION_KEY / INTERNAL_API_SECRET (and CRON_SECRET when cronjobs are
on) rendered cleanly but produced CrashLoopBackOff at runtime with
cryptic missing-env errors. Fail at template time instead.
* fix(helm): auto-enable PDB when HPA minReplicas > 1
Previously the auto-enable predicate only checked the static
app.replicaCount, which defaults to 1 even when autoscaling is on
(HPA owns spec.replicas). PDB now also activates when
autoscaling.enabled=true and minReplicas > 1.
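The replica guard matters because a PodDisruptionBudget with `minAvailable: 1` over a single-replica workload blocks voluntary evictions such as node drains. A minimal sketch of the kind of object that would be rendered once more than one replica is guaranteed; the name, labels, and `minAvailable` value are assumptions, not the chart's output:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sim-app                            # hypothetical
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: sim          # hypothetical labels
      app.kubernetes.io/component: app
```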
* fix(helm): prevent realtime envDefaults from masking app.env Secret values; add StatefulSet upgrade NOTES
- Realtime override-skip now considers keys set in either app.env or
realtime.env. The shared app Secret is mounted via envFrom on the
realtime pod, so a key set in app.env (e.g. NEXT_PUBLIC_APP_URL) would
previously be masked by the realtime envDefault (inline env overrides
envFrom in K8s).
- NOTES.txt now prints a StatefulSet orphan-delete reminder on upgrade,
surfacing the immutable serviceName issue documented in the README.
* feat(helm): add Claude Skill for chart deployment
Adds a skill at helm/sim/.claude/skills/sim-helm/ that teaches agents how
to deploy and troubleshoot the Sim Helm chart: install path selection
(inline / existingSecret / ESO), secret generation, the values.yaml
four-layer mental model, common-failure troubleshooting, and the
pre-1.0.0 StatefulSet orphan-delete upgrade procedure.
Skill is loadable by Claude Code, Codex, and OpenCode via the standard
skills convention (directory name matches frontmatter name).
* docs(helm): add CRON_SECRET to TL;DR, dry-run, and example install headers
The validateSecrets guard requires CRON_SECRET when cronjobs.enabled=true
(the default), but the quickstart and example file install commands
omitted it — users following the docs hit a hard template-render failure.
Adds CRON_SECRET to README TL;DR, validate-the-install dry-run snippet,
and the install command headers in all example values files.
* fix(helm): require INTERNAL_API_SECRET in inline secret mode
The ESO coverage validator already required INTERNAL_API_SECRET, but the
inline validateSecrets path only checked BETTER_AUTH_SECRET, ENCRYPTION_KEY,
and CRON_SECRET — letting inline installs render successfully and then
crash at runtime when the realtime↔app shared auth secret was missing.
Adds the same fail-fast check to the inline path.
* docs(helm): surface INTERNAL_API_SECRET upgrade requirement in NOTES.txt
The new validateSecrets check makes app.env.INTERNAL_API_SECRET mandatory
on upgrade. Existing installs that never set it would hit a template
render failure with no in-context guidance. Adds an upgrade-only note
with the generation snippet and storage guidance alongside the existing
StatefulSet orphan-delete instructions.
* fix(helm): NetworkPolicy egress to OTEL collector + external-db example format
- Add app/realtime NetworkPolicy egress rules for the OpenTelemetry
collector pod on ports 4317 (OTLP gRPC) and 4318 (OTLP HTTP) when
telemetry.enabled=true. Without these, traces and metrics were silently
dropped with connection-refused errors when both telemetry and
networkPolicy were enabled.
- Migrate values-external-db.yaml from the legacy list-shaped egress
format to the new {exceptCidrs, extraRules} object. The list form would
replace the default object on merge and crash template rendering when
the chart tried to access .exceptCidrs on a list.
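A sketch of the object-shaped egress values, assuming `exceptCidrs` takes a list of CIDR strings and `extraRules` takes standard NetworkPolicy egress rules; the inner structure and the database subnet are assumptions, not confirmed chart defaults:

```yaml
networkPolicy:
  enabled: true
  egress:
    exceptCidrs:
      - 169.254.169.254/32        # well-known cloud metadata address
    extraRules:
      - to:
          - ipBlock:
              cidr: 10.0.0.0/16   # hypothetical external database subnet
        ports:
          - protocol: TCP
            port: 5432
```

Because `egress` is now a map, supplying a list under the same key replaces the whole object instead of merging into it, which is the failure the migration avoids.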
* fix(helm): NOTES.txt no longer prints false secret warning for ESO users
The secrets-empty warning only checked app.secrets.existingSecret.enabled
before scanning app.env. ESO users intentionally leave app.env empty —
secrets come from the ESO-synced Secret — so every ESO install/upgrade
printed a misleading 'pods will fail to start' warning.
Reorders the branches so externalSecrets.enabled takes precedence: ESO
users now see a confirmation message with kubectl commands to verify the
ExternalSecret has synced. The empty-app.env warning only fires when
both ESO and existingSecret are disabled.
* fix(helm): existingSecret mode no longer drops app.env / realtime.env values
In existingSecret mode the chart-managed Secret is not rendered, so non-empty
values in app.env / realtime.env had nowhere to land — yet the envDefaults
skip logic still suppressed the matching defaults. Result: keys like
NEXT_PUBLIC_APP_URL, BETTER_AUTH_URL, and NODE_ENV silently went missing
on both pods (the example values-existing-secret.yaml hit this directly).
Both app and realtime deployments now inline non-empty values from app.env
(plus realtime.env on the realtime container) when existingSecret is enabled
and ESO is not. Inline / ESO modes are unchanged: inline still flows through
the chart-managed Secret, ESO still owns the synced Secret.
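A values sketch of the case this fix covers. `app.secrets.existingSecret.enabled` is taken from this commit; the `name` key and the Secret name are assumptions for illustration:

```yaml
app:
  secrets:
    existingSecret:
      enabled: true
      name: my-sim-secrets     # assumed key; hypothetical pre-created Secret
  env:
    # Non-empty values here are inlined on the app and realtime containers,
    # since no chart-managed Secret is rendered in this mode.
    NEXT_PUBLIC_APP_URL: "https://sim.example.com"
    BETTER_AUTH_URL: "https://sim.example.com"
```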
* fix(helm): correct realtime env overlay + filter chart-computed keys in existingSecret mode
Realtime: Sprig merge gives the first source precedence and treats "" as a
real value, so realtime.env empty defaults for shared keys shadowed
non-empty app.env values. Replace with deepCopy($appEnv) base + manual
non-empty overlay of $rtEnv.
Both deployments: exclude DATABASE_URL/SOCKET_SERVER_URL/OLLAMA_URL from
the existingSecret inline path so user-supplied values can't override
chart-computed ones via last-wins env semantics.
* fix(helm): skip envDefaults in existingSecret mode + document egress rename
In existingSecret mode the user's pre-existing Secret is the source of
truth (loaded via envFrom). Inlining localhost envDefaults for URL keys
(BETTER_AUTH_URL, NEXT_PUBLIC_APP_URL, ALLOWED_ORIGINS) silently shadowed
the Secret-bound values because K8s env always wins over envFrom. Skip
envDefaults entirely on both deployments when existingSecret is enabled.
Also call out the networkPolicy.egress shape change (list -> map with
exceptCidrs + extraRules) in the NOTES.txt upgrade block so operators
migrate their custom rules rather than silently losing them.
* fix(helm): copy-pasteable install commands in copilot + ESO examples
values-copilot.yaml: the install header was missing every required
copilot.server.env.* secret (AGENT_API_DB_ENCRYPTION_KEY, INTERNAL_API_SECRET,
LICENSE_KEY, SIM_BASE_URL, SIM_AGENT_API_KEY, REDIS_URL, one model key) plus
copilot.postgresql.auth.password. Pasting it as-is failed at template render.
values-external-secrets.yaml: NEXT_PUBLIC_APP_URL, BETTER_AUTH_URL, etc. were
declared under app.env / realtime.env. In ESO mode the chart-managed Secret
isn't rendered, so the validator (rightly) rejects keys in app.env that
aren't mapped under externalSecrets.remoteRefs. Moved non-secret URL/config
to envDefaults, which is inlined and not subject to the ESO mapping rule.
* polish(helm): configurable NetworkPolicy ingress peers + clearer API_ENCRYPTION_KEY comment
- networkPolicy.ingressFrom lets operators scope the ingress-controller
rule to a specific namespace/podSelector. Defaults to a single empty
peer (`- {}`), which is the explicit form of "any source" — same
effective behavior as the old `from: []` but unambiguous across CNIs.
To restrict, override with e.g.:
    networkPolicy:
      ingressFrom:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
- API_ENCRYPTION_KEY comment: drop the "must be exactly 64 hex
characters" phrasing that sat awkwardly next to `openssl rand -hex 32`.
The generation command already produces the required length.
* test(helm): add helm-unittest suites + CI workflow + ci values matrix
- 7 helm-unittest suites covering smoke, validators, secret modes,
envDefaults secret-mode-aware inlining (round-9 regression net),
chart-computed env keys (round-8 regression net), NetworkPolicy
shape, and PDB/HPA conditional rendering (38 tests, ~265ms).
- ci/*.yaml render fixtures for default, production, existingSecret,
ESO, and external-db install modes.
- GitHub Actions workflow runs helm lint --strict, helm unittest,
helm template across the ci matrix, and kubeconform validation
against Kubernetes 1.30 schemas.
- CONTRIBUTING.md documents how to run the same gates locally.
* test(helm): add helm test hook + kind apiserver dry-run in CI
- New templates/tests/test-connection.yaml renders a Pod with
helm.sh/hook=test that wgets the app Service (and realtime when
enabled). Lets users run `helm test <release>` after install for
a real in-cluster connectivity check. Restricted PSS context.
- tests.* values block (image, timeoutSeconds, resources) is the
knob to disable or tune the probe; documented in values.schema.json.
- 3 helm-unittest tests cover the hook annotations, PSS context,
and tests.enabled=false skip path (41 tests total).
- New CI job spins up a kind v1.30 cluster and runs
`kubectl apply --dry-run=server` against the rendered manifests
for the CRD-free ci fixtures (default / existing-secret /
external-db). Catches admission and validation issues the static
kubeconform schema check can't see.
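A Helm test hook is an ordinary Pod carrying the `helm.sh/hook: test` annotation. The sketch below shows the general shape only; the pod name, image, Service name, and port are assumptions, and the restricted security context is omitted for brevity:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sim-test-connection        # hypothetical
  annotations:
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
    - name: wget
      image: busybox:1.36          # hypothetical image
      # Service name and port are assumed; the pod succeeds if the request does.
      command: ["wget", "-q", "-O", "/dev/null", "http://sim-app:3000/"]
```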
* chore(helm): remove pre-1.0.0 upgrade fluff + tighten .helmignore
This is the 1.0.0 release of the chart — there is no pre-1.0.0
predecessor for users to upgrade from, so all of the dedicated upgrade
narration was hypothetical.
- Drop the 'Upgrading from a pre-1.0.0 build' README section and the
matching troubleshooting entry.
- Drop the .Release.IsUpgrade block from NOTES.txt: items 5 (StatefulSet
orphan-delete), 6 (INTERNAL_API_SECRET 'new in 1.0.0'), 7
(networkPolicy.egress shape change). Each described a migration off a
chart version that never shipped.
- Delete references/upgrade-pre-1.0.0.md and remove the corresponding
pointers from SKILL.md.
- Anchor .helmignore patterns to chart root so /tests/ (unit suites)
and /examples/ are dropped from the packaged tarball without also
dropping templates/tests/ (the helm test hook).
* chore(helm): drop CI workflow + ci/ fixtures + CONTRIBUTING.md
The helm-unittest suites in helm/sim/tests/ and the helm test hook
in helm/sim/templates/tests/ stay — those are chart-internal quality
scaffolding, not CI. Removed:
- .github/workflows/helm-chart.yml
- helm/sim/ci/*.yaml (5 render fixtures used only by the workflow)
- helm/sim/CONTRIBUTING.md (mostly documented those gates)
- dead /ci/ and /CONTRIBUTING.md entries in .helmignore
* feat(helm): pod rollout on Secret change + topologySpreadConstraints
- Add checksum/secret pod annotations on app, realtime, and copilot
Deployments (plus checksum/config on app when branding ConfigMap is
enabled). Closes the long-standing footgun where 'helm upgrade' with
a changed Secret would silently leave pods running the old values
until a manual rollout restart.
- New top-level topologySpreadConstraints value (and sim.topologySpreadConstraints
helper) applied to app and realtime Deployments. Plumbed the same way as
affinity and tolerations; users supply their own labelSelector, following
the Bitnami convention.
- 5 helm-unittest cases cover the checksum annotations and topology
spread rendering (46 tests total).
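The usual Helm idiom for this hashes the rendered Secret template into a pod annotation, so any change to the Secret changes the pod template and triggers a rollout. A sketch assuming the chart hashes `secrets-app.yaml`; the exact expression the chart uses is not shown in this commit message:

```yaml
# Deployment pod template fragment (Helm template syntax).
spec:
  template:
    metadata:
      annotations:
        checksum/secret: {{ include (print $.Template.BasePath "/secrets-app.yaml") . | sha256sum }}
```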
* fix(helm): drop empty-string shadowing in app/realtime env merge
Sprig 'merge' treats "" as a real value, so a default-empty
app.env.BETTER_AUTH_URL would shadow a non-empty realtime.env override
and the URL would never reach the rendered Secret. Replace 'merge'
with an explicit two-pass overlay that filters empties before writing,
mirroring the same pattern already used in deployment-realtime.yaml's
existingSecret block.
Adds two regression tests: realtime.env-only value reaches the Secret
when app.env is empty, and app.env still wins on collision when both
are non-empty (48 tests total).
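The two regression cases can be pictured with values like these; all values are placeholders, and the expected result in the comments restates the behavior described above rather than captured output:

```yaml
app:
  env:
    BETTER_AUTH_URL: ""                          # empty: filtered out, no longer shadows
    BETTER_AUTH_SECRET: "app-value"              # non-empty: wins on collision
realtime:
  env:
    BETTER_AUTH_URL: "https://sim.example.com"   # reaches the Secret
    BETTER_AUTH_SECRET: "realtime-value"         # loses to app.env
# Secret contents expected from the description above:
#   BETTER_AUTH_URL=https://sim.example.com
#   BETTER_AUTH_SECRET=app-value
```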
* fix(helm): make topologySpreadConstraints per-component to match docstring
Greptile flagged that sim.topologySpreadConstraints helper docstring promised
per-component config (.Values.app, .Values.realtime, ...) but call sites
passed .Values, so any app.topologySpreadConstraints / realtime.topologySpreadConstraints
set by the user was silently dropped. The single global key also prevented
distinct app-vs-realtime spread rules.
Pass .Values.app / .Values.realtime to the helper at each call site; move
the top-level topologySpreadConstraints key into both component sections in
values.yaml. Adds a regression test that app constraints don't leak onto
the realtime pod.
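With the key moved under each component, distinct spread rules become possible. A sketch using standard Kubernetes topologySpreadConstraints fields; the label selectors are hypothetical and should match the labels the chart actually sets on each pod:

```yaml
app:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: app        # hypothetical label
realtime:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: realtime   # hypothetical label
```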
* fix(helm): allow cron pods through app NetworkPolicy
Cursor flagged that when networkPolicy.enabled=true and cronjobs.enabled=true
(the recommended production config), the app NetworkPolicy only allowed
ingress from realtime and the ingress controller — silently blocking every
cron pod's HTTP call to /api/schedules/execute, webhook polls, etc. All 13
default cronjobs would fail.
Tag cron pods with a stable simstudio.ai/component-group: cronjob label so
the app NetworkPolicy can allow them with a single rule (no per-job
enumeration). Rule is conditional on cronjobs.enabled. Adds positive and
negative regression tests.

1 parent 1b94424 commit 9d2dd8f
49 files changed
Lines changed: 3266 additions & 1024 deletions
File tree
- helm/sim
  - .claude/skills/sim-helm
    - references
  - examples
  - templates
    - tests
  - tests