Commit 9d2dd8f

improvement(helm): helm chart updates with security, ESO, and docs overhaul (#4565)
* improvement(helm): production-ready chart with security, ESO, and docs overhaul

Comprehensive Helm chart improvements bringing the chart up to industry standards for security, secret management, and documentation.

Security
- Pod Security Standards "restricted" defaults on every pod and container (runAsNonRoot, allowPrivilegeEscalation=false, capabilities.drop=[ALL], seccompProfile=RuntimeDefault)
- automountServiceAccountToken=false on ServiceAccount and every pod
- NetworkPolicy egress blocks cloud metadata endpoints by default
- Sensitive app/realtime env keys auto-partitioned into chart-managed Secret via envFrom; no more plaintext secrets on container specs

Secret management
- Three modes: inline, existingSecret, ExternalSecrets Operator (ESO)
- ESO sync supports arbitrary sensitive keys
- Fail-fast template rendering when ESO enabled but sensitive key unmapped
- AWS/Azure/GCP example files document all three modes

Reliability
- Headless Services for both Postgres StatefulSets
- HPA-aware replicas (omits spec.replicas when autoscaling.enabled)
- PodDisruptionBudget auto-activates when replicaCount > 1
- Startup / liveness / readiness probes with distinct timings
- CronJob ttlSecondsAfterFinished for automatic cleanup

Chart hygiene
- Image tags default to Chart.AppVersion; pullPolicy IfNotPresent
- Optional image.digest pin for content-addressed deploys
- kubeVersion >=1.25.0-0 enforced
- Ollama pinned to 0.23.2; mount moved to /data

Documentation
- README rewritten in cert-manager / Bitnami style
- NOTES.txt with post-install guidance
- Example values files annotated with usage and secret-strategy guidance

* fix(helm): correct resource names in README (sim-sim-* → sim-*)

The sim.fullname helper collapses to the release name when the release name contains the chart name. With the documented release name 'sim', actual resources are 'sim-app', 'sim-postgresql', etc., not the 'sim-sim-*' form previously documented.

Fixes copy-paste commands in the pre-1.0.0 upgrade walkthrough and several troubleshooting snippets. Also expands the cronjobs component description to reflect the full set of 13 scheduled jobs (was understated as just Gmail/Outlook polling).

* improvement(helm): split app/realtime env into Secret-bound + inline defaults

- Add app.envDefaults / realtime.envDefaults for chart-shipped operational tunables (rate limits, timeouts, IVM, feature-flag defaults, localhost URL fallbacks). Rendered inline on the container, not into the Secret
- Remove operational defaults from app.env / realtime.env so the chart-managed Secret stays minimal and External Secrets Operator users only map keys they actually set, not every chart default
- Skip an envDefaults key when the user explicitly sets it in env (K8s `env` overrides `envFrom`, so an inline default would otherwise mask a Secret value at runtime)
- Relax values.schema.json to allow empty strings on NEXT_PUBLIC_APP_URL, BETTER_AUTH_URL, NEXT_PUBLIC_SUPPORT_EMAIL (defaults supplied via envDefaults)

* fix(helm): address PR review: cronjob validation, ESO apiVersion, secret merge order, image guard

- CronJobs reference CRON_SECRET via secretKeyRef; fail-fast at template time when cronjobs.enabled=true and app.env.CRON_SECRET is empty so users get a clear error instead of a CreateContainerConfigError loop
- Default externalSecrets.apiVersion to "v1beta1" (supported by every ESO release since v0.7). The previous "v1" default targets only ESO v0.17+
- Swap merge order in secrets-app.yaml so app.env wins over realtime.env for shared keys (BETTER_AUTH_SECRET, BETTER_AUTH_URL, …): both pods consume the same Secret via envFrom, so the app value must be canonical
- Add `required` guard on sim.image so an empty tag + empty digest + empty Chart.AppVersion surfaces as a clear template-time error instead of rendering an invalid `repo:` reference

* fix(helm): require critical secrets to be mapped when ESO is enabled

Previously, enabling externalSecrets without mapping BETTER_AUTH_SECRET / ENCRYPTION_KEY / INTERNAL_API_SECRET (and CRON_SECRET when cronjobs are on) rendered cleanly but produced CrashLoopBackOff at runtime with cryptic missing-env errors. Fail at template time instead.

* fix(helm): auto-enable PDB when HPA minReplicas > 1

Previously the auto-enable predicate only checked the static app.replicaCount, which defaults to 1 even when autoscaling is on (HPA owns spec.replicas). PDB now also activates when autoscaling.enabled=true and minReplicas > 1.

* fix(helm): prevent realtime envDefaults from masking app.env Secret values; add StatefulSet upgrade NOTES

- Realtime override-skip now considers keys set in either app.env or realtime.env. The shared app Secret is mounted via envFrom on the realtime pod, so a key set in app.env (e.g. NEXT_PUBLIC_APP_URL) would previously be masked by the realtime envDefault (inline env overrides envFrom in K8s).
- NOTES.txt now prints a StatefulSet orphan-delete reminder on upgrade, surfacing the immutable serviceName issue documented in the README.

* feat(helm): add Claude Skill for chart deployment

Adds a skill at helm/sim/.claude/skills/sim-helm/ that teaches agents how to deploy and troubleshoot the Sim Helm chart: install path selection (inline / existingSecret / ESO), secret generation, the values.yaml four-layer mental model, common-failure troubleshooting, and the pre-1.0.0 StatefulSet orphan-delete upgrade procedure.

Skill is loadable by Claude Code, Codex, and OpenCode via the standard skills convention (directory name matches frontmatter name).

* docs(helm): add CRON_SECRET to TL;DR, dry-run, and example install headers

The validateSecrets guard requires CRON_SECRET when cronjobs.enabled=true (the default), but the quickstart and example file install commands omitted it, so users following the docs hit a hard template-render failure. Adds CRON_SECRET to README TL;DR, validate-the-install dry-run snippet, and the install command headers in all example values files.

* fix(helm): require INTERNAL_API_SECRET in inline secret mode

The ESO coverage validator already required INTERNAL_API_SECRET, but the inline validateSecrets path only checked BETTER_AUTH_SECRET, ENCRYPTION_KEY, and CRON_SECRET, letting inline installs render successfully and then crash at runtime when the realtime↔app shared auth secret was missing. Adds the same fail-fast check to the inline path.

* docs(helm): surface INTERNAL_API_SECRET upgrade requirement in NOTES.txt

The new validateSecrets check makes app.env.INTERNAL_API_SECRET mandatory on upgrade. Existing installs that never set it would hit a template render failure with no in-context guidance. Adds an upgrade-only note with the generation snippet and storage guidance alongside the existing StatefulSet orphan-delete instructions.

* fix(helm): NetworkPolicy egress to OTEL collector + external-db example format

- Add app/realtime NetworkPolicy egress rules for the OpenTelemetry collector pod on ports 4317 (OTLP gRPC) and 4318 (OTLP HTTP) when telemetry.enabled=true. Without these, traces and metrics were silently dropped with connection-refused errors when both telemetry and networkPolicy were enabled.
- Migrate values-external-db.yaml from the legacy list-shaped egress format to the new {exceptCidrs, extraRules} object. The list form would replace the default object on merge and crash template rendering when the chart tried to access .exceptCidrs on a list.

* fix(helm): NOTES.txt no longer prints false secret warning for ESO users

The secrets-empty warning only checked app.secrets.existingSecret.enabled before scanning app.env. ESO users intentionally leave app.env empty (secrets come from the ESO-synced Secret), so every ESO install/upgrade printed a misleading 'pods will fail to start' warning.

Reorders the branches so externalSecrets.enabled takes precedence: ESO users now see a confirmation message with kubectl commands to verify the ExternalSecret has synced. The empty-app.env warning only fires when both ESO and existingSecret are disabled.

* fix(helm): existingSecret mode no longer drops app.env / realtime.env values

In existingSecret mode the chart-managed Secret is not rendered, so non-empty values in app.env / realtime.env had nowhere to land, yet the envDefaults skip logic still suppressed the matching defaults. Result: keys like NEXT_PUBLIC_APP_URL, BETTER_AUTH_URL, and NODE_ENV silently went missing on both pods (the example values-existing-secret.yaml hit this directly).

Both app and realtime deployments now inline non-empty values from app.env (plus realtime.env on the realtime container) when existingSecret is enabled and ESO is not. Inline / ESO modes are unchanged: inline still flows through the chart-managed Secret, ESO still owns the synced Secret.

* fix(helm): correct realtime env overlay + filter chart-computed keys in existingSecret mode

Realtime: Sprig merge gives the first source precedence and treats "" as a real value, so realtime.env empty defaults for shared keys shadowed non-empty app.env values. Replace with deepCopy($appEnv) base + manual non-empty overlay of $rtEnv.

Both deployments: exclude DATABASE_URL/SOCKET_SERVER_URL/OLLAMA_URL from the existingSecret inline path so user-supplied values can't override chart-computed ones via last-wins env semantics.

* fix(helm): skip envDefaults in existingSecret mode + document egress rename

In existingSecret mode the user's pre-existing Secret is the source of truth (loaded via envFrom). Inlining localhost envDefaults for URL keys (BETTER_AUTH_URL, NEXT_PUBLIC_APP_URL, ALLOWED_ORIGINS) silently shadowed the Secret-bound values because K8s env always wins over envFrom. Skip envDefaults entirely on both deployments when existingSecret is enabled.

Also call out the networkPolicy.egress shape change (list -> map with exceptCidrs + extraRules) in the NOTES.txt upgrade block so operators migrate their custom rules rather than silently losing them.

* fix(helm): copy-pasteable install commands in copilot + ESO examples

values-copilot.yaml: the install header was missing every required copilot.server.env.* secret (AGENT_API_DB_ENCRYPTION_KEY, INTERNAL_API_SECRET, LICENSE_KEY, SIM_BASE_URL, SIM_AGENT_API_KEY, REDIS_URL, one model key) plus copilot.postgresql.auth.password. Pasting it as-is failed at template render.

values-external-secrets.yaml: NEXT_PUBLIC_APP_URL, BETTER_AUTH_URL, etc. were declared under app.env / realtime.env. In ESO mode the chart-managed Secret isn't rendered, so the validator (rightly) rejects keys in app.env that aren't mapped under externalSecrets.remoteRefs. Moved non-secret URL/config to envDefaults, which is inlined and not subject to the ESO mapping rule.

* polish(helm): configurable NetworkPolicy ingress peers + clearer API_ENCRYPTION_KEY comment

- networkPolicy.ingressFrom lets operators scope the ingress-controller rule to a specific namespace/podSelector. Defaults to a single empty peer (`- {}`), which is the explicit form of "any source": same effective behavior as the old `from: []` but unambiguous across CNIs. To restrict, override with e.g.:

    networkPolicy:
      ingressFrom:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx

- API_ENCRYPTION_KEY comment: drop the "must be exactly 64 hex characters" phrasing that sat awkwardly next to `openssl rand -hex 32`. The generation command already produces the required length.

* test(helm): add helm-unittest suites + CI workflow + ci values matrix

- 7 helm-unittest suites covering smoke, validators, secret modes, envDefaults secret-mode-aware inlining (round-9 regression net), chart-computed env keys (round-8 regression net), NetworkPolicy shape, and PDB/HPA conditional rendering (38 tests, ~265ms).
- ci/*.yaml render fixtures for default, production, existingSecret, ESO, and external-db install modes.
- GitHub Actions workflow runs helm lint --strict, helm unittest, helm template across the ci matrix, and kubeconform validation against Kubernetes 1.30 schemas.
- CONTRIBUTING.md documents how to run the same gates locally.

* test(helm): add helm test hook + kind apiserver dry-run in CI

- New templates/tests/test-connection.yaml renders a Pod with helm.sh/hook=test that wgets the app Service (and realtime when enabled). Lets users run `helm test <release>` after install for a real in-cluster connectivity check. Restricted PSS context.
- tests.* values block (image, timeoutSeconds, resources) is the knob to disable or tune the probe; documented in values.schema.json.
- 3 helm-unittest tests cover the hook annotations, PSS context, and tests.enabled=false skip path (41 tests total).
- New CI job spins up a kind v1.30 cluster and runs `kubectl apply --dry-run=server` against the rendered manifests for the CRD-free ci fixtures (default / existing-secret / external-db). Catches admission and validation issues the static kubeconform schema check can't see.

* chore(helm): remove pre-1.0.0 upgrade fluff + tighten .helmignore

This is the 1.0.0 release of the chart. There is no pre-1.0.0 predecessor for users to upgrade from, so all of the dedicated upgrade narration was hypothetical.

- Drop the 'Upgrading from a pre-1.0.0 build' README section and the matching troubleshooting entry.
- Drop the .Release.IsUpgrade block from NOTES.txt: items 5 (StatefulSet orphan-delete), 6 (INTERNAL_API_SECRET 'new in 1.0.0'), 7 (networkPolicy.egress shape change). Each described a migration off a chart version that never shipped.
- Delete references/upgrade-pre-1.0.0.md and remove the corresponding pointers from SKILL.md.
- Anchor .helmignore patterns to chart root so /tests/ (unit suites) and /examples/ are dropped from the packaged tarball without also dropping templates/tests/ (the helm test hook).

* chore(helm): drop CI workflow + ci/ fixtures + CONTRIBUTING.md

The helm-unittest suites in helm/sim/tests/ and the helm test hook in helm/sim/templates/tests/ stay; those are chart-internal quality scaffolding, not CI.

Removed:
- .github/workflows/helm-chart.yml
- helm/sim/ci/*.yaml (5 render fixtures used only by the workflow)
- helm/sim/CONTRIBUTING.md (mostly documented those gates)
- dead /ci/ and /CONTRIBUTING.md entries in .helmignore

* feat(helm): pod rollout on Secret change + topologySpreadConstraints

- Add checksum/secret pod annotations on app, realtime, and copilot Deployments (plus checksum/config on app when branding ConfigMap is enabled). Closes the long-standing footgun where 'helm upgrade' with a changed Secret would silently leave pods running the old values until a manual rollout restart.
- New top-level topologySpreadConstraints value (and sim.topologySpreadConstraints helper) applied to app and realtime Deployments. Mirrors how affinity and tolerations are plumbed; users supply their own labelSelector to mirror Bitnami convention.
- 5 helm-unittest cases cover the checksum annotations and topology spread rendering (46 tests total).

* fix(helm): drop empty-string shadowing in app/realtime env merge

Sprig 'merge' treats "" as a real value, so a default-empty app.env.BETTER_AUTH_URL would shadow a non-empty realtime.env override and the URL would never reach the rendered Secret. Replace 'merge' with an explicit two-pass overlay that filters empties before writing, mirroring the same pattern already used in deployment-realtime.yaml's existingSecret block.

Adds two regression tests: realtime.env-only value reaches the Secret when app.env is empty, and app.env still wins on collision when both are non-empty (48 tests total).

* fix(helm): make topologySpreadConstraints per-component to match docstring

Greptile flagged that sim.topologySpreadConstraints helper docstring promised per-component config (.Values.app, .Values.realtime, ...) but call sites passed .Values, so any app.topologySpreadConstraints / realtime.topologySpreadConstraints set by the user was silently dropped. The single global key also prevented distinct app-vs-realtime spread rules.

Pass .Values.app / .Values.realtime to the helper at each call site; move the top-level topologySpreadConstraints key into both component sections in values.yaml. Adds a regression test that app constraints don't leak onto the realtime pod.

* fix(helm): allow cron pods through app NetworkPolicy

Cursor flagged that when networkPolicy.enabled=true and cronjobs.enabled=true (the recommended production config), the app NetworkPolicy only allowed ingress from realtime and the ingress controller, silently blocking every cron pod's HTTP call to /api/schedules/execute, webhook polls, etc. All 13 default cronjobs would fail.

Tag cron pods with a stable simstudio.ai/component-group: cronjob label so the app NetworkPolicy can allow them with a single rule (no per-job enumeration). Rule is conditional on cronjobs.enabled. Adds positive and negative regression tests.
1 parent 1b94424 commit 9d2dd8f

49 files changed

Lines changed: 3266 additions & 1024 deletions

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
---
name: sim-helm
description: Install, upgrade, and operate the Sim Helm chart on Kubernetes. Covers install path selection (inline / existingSecret / External Secrets Operator), required secret generation, the values.yaml mental model (env vs envDefaults vs Secret), and common failure triage. Invoke when a user asks about deploying Sim to a cluster, authoring a Sim values.yaml, debugging a Sim pod that won't start, upgrading a Sim release, or wiring Sim into a secret manager.
license: Apache-2.0
---

# Sim Helm Chart: Operations Skill

This skill helps an agent deploy and operate the **Sim** Helm chart at `helm/sim/` in the [simstudioai/sim](https://github.com/simstudioai/sim) repository. Use it when the user is installing, upgrading, troubleshooting, or authoring values for the Sim chart.

The skill is **diagnostic-first**: capture context, classify the situation, load only the references that apply, then act. Do not dump the README at the user. Do not invent values that are not in their current state.

---

## Workflow: follow in order

### 1. Capture context

Before recommending anything, ask (or infer from the conversation) all of these. **Never skip this step.** A wrong assumption here corrupts every downstream step.

| Question | Why it matters |
|---|---|
| Cluster: EKS / GKE / AKS / OpenShift / kind / other? | Storage class, ingress class, identity provider differ |
| Secret strategy: inline `--set`, pre-existing K8s Secret, or External Secrets Operator (ESO)? | The chart has three distinct code paths |
| Postgres: chart-bundled, or external (RDS / Cloud SQL / Azure DB)? | Different value blocks (`postgresql.*` vs `externalDatabase.*`) |
| Public-facing? Ingress class? TLS? | `ingress.enabled`, `ingress.className`, cert-manager wiring |
| HA? (target replicas) | Drives `autoscaling.enabled`, `app.replicaCount`, PDB activation |
| Existing values.yaml the user is editing? | Always read it before proposing a diff; never write blind |

If the user has a `values.yaml`, read it. If they don't, ask before writing one.

### 2. Diagnose

Map the user's request to one of these categories and load the matching reference(s):

| Situation | Reference |
|---|---|
| User wants to install for the first time | `references/install-paths.md` then `references/secrets.md` |
| User needs to generate the required secrets | `references/secrets.md` |
| User asks "what does this value do" / wants to author values.yaml | `references/values-model.md` |
| Pod won't start, error message, `CrashLoopBackOff`, image pull error, ingress not routing | `references/troubleshooting.md` |
| User asks about ESO / Vault / AWS Secrets Manager / Azure Key Vault / GCP Secret Manager | `references/install-paths.md` (ESO section) |
| User asks "is X production-ready" / autoscaling / network policy / security context | Read the README's "Production checklist" section directly (no separate reference) |

Load **only** what the situation requires. Loading every reference burns tokens and produces vague answers.

### 3. Propose

When proposing values changes:

- Show the **minimal diff** against the user's current values.yaml. Don't rewrite the file.
- Name the **risk** (e.g., "this puts the secret in `helm get values` output: fine for dev, not for prod").
- Name the **rollback** (e.g., "if this breaks, `helm rollback sim 1` reverts").
- Cite the canonical source (`helm/sim/values.yaml` line numbers, README section, or this skill's reference file).

### 4. Validate before applying

Always run these before telling the user to `helm install` / `helm upgrade`:

```bash
# Schema + value validation
helm lint helm/sim --values <user-values>.yaml

# Render full manifest set to catch template errors
helm template sim helm/sim --values <user-values>.yaml > /tmp/render.yaml

# For upgrades, render against the live release first
helm upgrade --dry-run sim helm/sim --values <user-values>.yaml
```

If lint or template fails, fix the values; do not work around chart validation. The chart's `fail` statements exist to catch misconfigurations that would otherwise surface as `CrashLoopBackOff` at runtime.
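
The three checks above could be wrapped in a small shell function so that the first failure stops the sequence. This is only a sketch, not part of the chart: the function name `validate_sim_values` and its arguments are hypothetical, and it assumes `helm` is on `PATH`, the chart lives at `helm/sim`, and the release is called `sim`, as in the commands above.

```bash
# Hypothetical helper: run lint, then template, then (optionally) an upgrade dry-run.
# Usage: validate_sim_values <values-file> [upgrade]
validate_sim_values() {
  local values="$1" mode="${2:-install}"
  local chart="helm/sim" release="sim"

  # 1. Schema + value validation.
  helm lint "$chart" --values "$values" || { echo "lint failed" >&2; return 1; }

  # 2. Full render; template-time `fail` guards surface here.
  helm template "$release" "$chart" --values "$values" > /tmp/render.yaml \
    || { echo "template failed" >&2; return 2; }

  # 3. Only for upgrades: dry-run against the live release.
  if [ "$mode" = "upgrade" ]; then
    helm upgrade --dry-run "$release" "$chart" --values "$values" \
      || { echo "upgrade dry-run failed" >&2; return 3; }
  fi
  echo "validation passed"
}
```

The distinct return codes (1, 2, 3) let a caller tell which stage rejected the values without parsing Helm's output.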

### 5. Deliver

Every recommendation should include:

- The exact command(s) to run
- A one-line summary of what will change
- The success signal (e.g., "`kubectl rollout status deploy/sim-app` reports `successfully rolled out`")
- The rollback command if something breaks
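
One way to bundle the command, the success signal, and the rollback is a wrapper like the one below. It is a sketch under stated assumptions: the function name, the 300-second timeout, and the default Deployment name `<release>-app` (which holds for a release named `sim`) are illustrative choices, not chart behavior.

```bash
# Hypothetical helper: upgrade, wait for the app Deployment, roll back on failure.
# Usage: upgrade_or_rollback <release> <namespace> <values-file> [deployment]
upgrade_or_rollback() {
  local release="$1" ns="$2" values="$3"
  local deploy="${4:-${release}-app}"   # assumption: correct for release "sim"

  echo "Upgrading release ${release} in namespace ${ns} with ${values}"
  helm upgrade "$release" helm/sim --namespace "$ns" --values "$values" || return 1

  # Success signal: rollout status exits 0 once the Deployment has rolled out.
  if kubectl --namespace "$ns" rollout status "deploy/${deploy}" --timeout=300s; then
    echo "OK: ${deploy} rolled out"
    return 0
  fi

  # Rollback: with no revision argument, helm rolls back to the previous release.
  echo "Rollout failed; rolling back ${release}" >&2
  helm rollback "$release" --namespace "$ns"
  return 1
}
```

Helm's own `--atomic` flag on `helm upgrade` covers similar ground; the explicit version here simply keeps each of the four deliverables visible.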

---

## Quick reference: the three secret modes

| Mode | When | Code path |
|---|---|---|
| **Inline (`--set`)** | Dev / kind / dry-run only. Values leak into `helm get values`. | `app.env.<KEY>: "..."` |
| **Pre-existing Secret** | GitOps with Sealed Secrets / SOPS, or hand-managed Secrets. Chart references a Secret you create. | `app.secrets.existingSecret.enabled: true` + `.name` |
| **External Secrets Operator (recommended for prod)** | Vault, AWS SM, Azure KV, GCP SM. Chart renders an `ExternalSecret` that ESO syncs. | `externalSecrets.enabled: true` + `secretStoreRef` + `remoteRefs.app.<KEY>` |

These modes are **mutually exclusive** for the app Secret. ESO takes precedence over inline. `existingSecret` takes precedence over inline. The chart **fails template rendering** when ESO is enabled and a required key (`BETTER_AUTH_SECRET`, `ENCRYPTION_KEY`, `INTERNAL_API_SECRET`, plus `CRON_SECRET` when cronjobs are enabled) is neither in `app.env` nor mapped in `remoteRefs.app` (see `references/install-paths.md`).

---

## Quick reference: the four required secrets

| Key | Generate with | Notes |
|---|---|---|
| `BETTER_AUTH_SECRET` | `openssl rand -hex 32` | Session signing |
| `ENCRYPTION_KEY` | `openssl rand -hex 32` | App-level encryption |
| `INTERNAL_API_SECRET` | `openssl rand -hex 32` | Service-to-service auth (app ↔ realtime) |
| `CRON_SECRET` | `openssl rand -hex 32` | Required iff `cronjobs.enabled=true` (default true) |

Optional but commonly needed:

| Key | Generate with | Notes |
|---|---|---|
| `API_ENCRYPTION_KEY` | `openssl rand -hex 32` | Must be **exactly 64 hex chars**. Required to encrypt user API keys at rest. |
| `postgresql.auth.password` | `openssl rand -base64 24 \| tr -d '/+='` | Only if using chart-bundled Postgres. Must match `^[a-zA-Z0-9._-]+$` for DATABASE_URL compatibility. |

See `references/secrets.md` for storage patterns and rotation guidance.
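
The two format constraints in the tables (64 hex characters, and the Postgres password pattern) can be checked before a value is handed to Helm. The helpers below are a hypothetical sketch in bash; the function names are not part of the chart.

```bash
# Hypothetical helpers for sanity-checking generated values before use.

# True when the argument is exactly 64 hexadecimal characters,
# which is what `openssl rand -hex 32` emits (32 bytes, 2 hex chars each).
is_hex64() {
  [[ "$1" =~ ^[0-9a-fA-F]{64}$ ]]
}

# True when the argument matches the password pattern quoted in the table.
is_safe_db_password() {
  [[ "$1" =~ ^[a-zA-Z0-9._-]+$ ]]
}
```

A typical use would be `key=$(openssl rand -hex 32); is_hex64 "$key" || echo "unexpected key format" >&2`.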

---

## Rules of engagement

These are non-negotiable. Violating any of these has burned users in the past.

1. **Never recommend `--set` for production secrets.** They land in `helm get values` and Helm release history. Direct users to `existingSecret` or ESO.
2. **Never set `image.tag: latest`.** The chart defaults to `Chart.AppVersion` for a reason: reproducible rollouts. If the user pinned `latest`, push back.
3. **Never edit chart templates to work around a `fail` statement.** The validation exists because a misconfiguration would otherwise surface as a runtime CrashLoopBackOff with cryptic env errors.
4. **Never drop `automountServiceAccountToken: false`** unless the workload genuinely needs in-cluster API access (Sim's app/realtime/postgres pods do not).
5. **Never `kubectl delete sts` without `--cascade=orphan`** on a live Postgres. A plain delete also removes the Postgres pod and takes the database offline; `--cascade=orphan` removes only the StatefulSet object and leaves the pod running. PVCs created from `volumeClaimTemplates` are retained by default, unless the StatefulSet's `persistentVolumeClaimRetentionPolicy` says otherwise.
6. **Never tell a user "the chart works on your cluster" without `helm lint` + `helm template` against their values.** Static reading is not validation.
7. **Always confirm before `helm uninstall` in a shared namespace.** PVCs survive but other namespace resources may not.

---

## When the user is stuck and you can't diagnose

Get logs from every component in parallel. This single block answers ~80% of "it's broken" questions:

```bash
kubectl --namespace <ns> get pods,events --sort-by='.lastTimestamp'
kubectl --namespace <ns> logs deploy/sim-app --tail=200
kubectl --namespace <ns> logs deploy/sim-realtime --tail=200
kubectl --namespace <ns> logs sts/sim-postgresql --tail=200
kubectl --namespace <ns> logs job/sim-migrations --tail=200 2>/dev/null
kubectl --namespace <ns> describe pod -l app.kubernetes.io/name=sim
```

Then map the symptom to `references/troubleshooting.md`.

---

## What this skill does **not** cover

- Sim application configuration beyond env vars (provider keys, knowledge base setup, etc.): that's the Sim app docs at https://docs.sim.ai
- Kubernetes cluster setup (creating an EKS cluster, installing ingress-nginx, etc.): that's cloud-provider docs
- Authoring new chart templates: that's `helm/sim/templates/_helpers.tpl` and the chart's own contributor docs
- Running Sim outside Kubernetes (Docker Compose, bare-metal): see the root `README.md`

If the user's question falls outside this scope, say so and point them at the right doc.
Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
# Install Path Selection

Three mutually-exclusive paths for the app Secret. Pick exactly one. The chart enforces this at template time.

## Decision tree

```
Is this a production install?
├── No (dev / kind / minikube / dry-run)
│   → Inline `--set` is fine. Skip to "Path A".

└── Yes

    Do you already manage secrets with Vault / AWS Secrets Manager /
    Azure Key Vault / GCP Secret Manager / 1Password Connect?

    ├── Yes → External Secrets Operator. Path C.

    └── No

        Do you use GitOps with Sealed Secrets, SOPS, or
        hand-managed Kubernetes Secrets?

        ├── Yes → Pre-existing Secret. Path B.

        └── No → Install ESO and go to Path C.
                 (Don't skip to inline `--set` for prod:
                  secrets land in `helm get values` and release history.)
```

---

## Path A: Inline `--set` (dev only)

```bash
helm install sim ./helm/sim \
  --namespace sim --create-namespace \
  --set app.env.BETTER_AUTH_SECRET=$(openssl rand -hex 32) \
  --set app.env.ENCRYPTION_KEY=$(openssl rand -hex 32) \
  --set app.env.INTERNAL_API_SECRET=$(openssl rand -hex 32) \
  --set app.env.CRON_SECRET=$(openssl rand -hex 32) \
  --set postgresql.auth.password=$(openssl rand -base64 24 | tr -d '/+=')
```

The chart generates a `Secret` named `<release>-app-secrets` containing every non-empty key from `app.env` + `realtime.env`. Both `app` and `realtime` Deployments mount it via `envFrom`.

**Risks:**
- Secrets are visible in `helm get values <release>` and `helm history <release>`.
- Anyone with read access to the Helm release Secret (`sh.helm.release.v1.<release>.v<N>`; Helm 3 stores release state in Secrets by default) can recover the values. They are only base64-encoded and gzip-compressed inside, not encrypted.
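
To make the second risk concrete: Helm 3 keeps each revision's payload in the `release` field of a Secret named `sh.helm.release.v1.<release>.v<N>`, gzip-compressed and base64-encoded, on top of the base64 encoding Kubernetes applies to all Secret data. A sketch of the decoding follows, assuming a `base64` that accepts `-d` and a `gunzip` on `PATH`; the function name is hypothetical.

```bash
# Hypothetical helper: decode a Helm 3 release payload read from stdin.
# Input is the value of .data.release as returned by the Kubernetes API.
decode_helm_release() {
  base64 -d |   # undo the Kubernetes Secret encoding
  base64 -d |   # undo Helm's own base64 layer
  gunzip -c     # decompress; the result is JSON that includes the user-supplied values
}

# Example against a live cluster (release "sim", revision 1, namespace "sim"):
#   kubectl get secret sh.helm.release.v1.sim.v1 -n sim \
#     -o jsonpath='{.data.release}' | decode_helm_release
```

No key or password is involved at any step, which is why read access to that Secret is equivalent to read access to every value passed with `--set`.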

---

## Path B: Pre-existing Kubernetes Secret

Create the Secret first, then point the chart at it.

```bash
kubectl create namespace sim
kubectl create secret generic sim-app-secrets --namespace sim \
  --from-literal=BETTER_AUTH_SECRET=$(openssl rand -hex 32) \
  --from-literal=ENCRYPTION_KEY=$(openssl rand -hex 32) \
  --from-literal=INTERNAL_API_SECRET=$(openssl rand -hex 32) \
  --from-literal=CRON_SECRET=$(openssl rand -hex 32)

kubectl create secret generic sim-postgres-secret --namespace sim \
  --from-literal=POSTGRES_PASSWORD=$(openssl rand -base64 24 | tr -d '/+=')
```

```yaml
# values.yaml
app:
  secrets:
    existingSecret:
      enabled: true
      name: sim-app-secrets

postgresql:
  auth:
    existingSecret:
      enabled: true
      name: sim-postgres-secret
      passwordKey: POSTGRES_PASSWORD
```

**The chart cannot introspect your Secret.** If you forget a required key, nothing fails until runtime: a container that reads the key through `secretKeyRef` stalls in `CreateContainerConfigError`, and a key loaded through `envFrom` is simply absent, so the process crashes on startup. The required keys are: `BETTER_AUTH_SECRET`, `ENCRYPTION_KEY`, `INTERNAL_API_SECRET`, plus `CRON_SECRET` when cronjobs are enabled.
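
Because the chart cannot do this check, it may be worth doing by hand before installing. The function below is a hypothetical sketch: it reads key names on stdin and reports which of the required keys listed above are missing. The function name and its single argument are illustrative, not part of the chart.

```bash
# Hypothetical helper: read Secret key names on stdin (one per line) and
# print every required key that is missing. Returns 1 if any are missing.
# Usage: missing_required_keys [cronjobs_enabled]   (default: true)
missing_required_keys() {
  local cron="${1:-true}" present key rc=0
  local required="BETTER_AUTH_SECRET ENCRYPTION_KEY INTERNAL_API_SECRET"
  [ "$cron" = "true" ] && required="$required CRON_SECRET"

  present="$(cat)"
  for key in $required; do
    # -x matches whole lines only, so FOO_CRON_SECRET does not count as CRON_SECRET.
    if ! printf '%s\n' "$present" | grep -qx "$key"; then
      echo "$key"
      rc=1
    fi
  done
  return "$rc"
}

# Example against the Secret created above:
#   kubectl get secret sim-app-secrets -n sim \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_required_keys true
```

An empty output with exit status 0 means every required key name is present; the check says nothing about whether the values themselves are valid.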

For GitOps (Sealed Secrets / SOPS), seal/encrypt the Secret YAML before committing; never commit a plain `kubectl create secret` output.

---

## Path C: External Secrets Operator (production recommended)

ESO syncs from your existing secret store (Vault, AWS SM, Azure KV, GCP SM, etc.) into a Kubernetes Secret on a refresh interval. The chart renders the `ExternalSecret` resource; ESO does the syncing.

### Prerequisites

1. Install ESO once per cluster:
   ```bash
   helm repo add external-secrets https://charts.external-secrets.io
   helm install external-secrets external-secrets/external-secrets \
     -n external-secrets --create-namespace
   ```
2. Create a `ClusterSecretStore` (or namespace-scoped `SecretStore`) that points at your secret manager. ESO's docs cover the auth wiring for each provider.

### Values

```yaml
externalSecrets:
  enabled: true
  apiVersion: v1beta1  # v1beta1 works on ESO >= 0.7. Bump to v1 only on ESO >= 0.17.
  refreshInterval: 1h
  secretStoreRef:
    name: my-cluster-secret-store
    kind: ClusterSecretStore  # or SecretStore for namespace-scoped
  remoteRefs:
    app:
      BETTER_AUTH_SECRET: sim/app/better-auth-secret
      ENCRYPTION_KEY: sim/app/encryption-key
      INTERNAL_API_SECRET: sim/app/internal-api-secret
      CRON_SECRET: sim/app/cron-secret  # required iff cronjobs.enabled
      # Optional but commonly mapped:
      API_ENCRYPTION_KEY: sim/app/api-encryption-key
      OPENAI_API_KEY: sim/providers/openai
    postgresql:
      password: sim/postgresql/password  # required if postgresql.enabled
    externalDatabase:
      password: sim/postgresql/password  # required if externalDatabase.enabled

# Leave app.env empty (or only set non-secret values like NEXT_PUBLIC_APP_URL).
app:
  env: {}
```

### Fail-fast behavior

The chart will refuse to render if:

- `externalSecrets.enabled=true` and any of `BETTER_AUTH_SECRET`, `ENCRYPTION_KEY`, `INTERNAL_API_SECRET` (or `CRON_SECRET` when cronjobs are enabled) is **neither** set in `app.env` **nor** mapped in `remoteRefs.app`. Error message names the missing key.
- A key is set in `app.env` with a non-empty value but not mapped in `remoteRefs.app` (would be silently dropped from the rendered Secret).

These checks catch the "renders cleanly, CrashLoopBackOffs at runtime" failure mode that plagued earlier chart versions.
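
The two rules can be modelled outside Helm to see how they interact. The function below is a hypothetical shell model of the checks as described in the bullets above; it is not the chart's template code, and the function name, argument order, and message wording are all assumptions.

```bash
# Hypothetical model of the two template-time checks, not the chart's code.
# $1: keys set (non-empty) in app.env, space-separated
# $2: keys mapped under externalSecrets.remoteRefs.app, space-separated
# $3: "true" when cronjobs are enabled (default: true)
eso_coverage_errors() {
  local env_keys=" $1 " mapped=" $2 " cron="${3:-true}" key rc=0
  local required="BETTER_AUTH_SECRET ENCRYPTION_KEY INTERNAL_API_SECRET"
  [ "$cron" = "true" ] && required="$required CRON_SECRET"

  # Check 1: a required key is neither set in app.env nor mapped.
  for key in $required; do
    case "$env_keys$mapped" in
      *" $key "*) ;;
      *) echo "required key not covered: $key"; rc=1 ;;
    esac
  done

  # Check 2: a key is set in app.env but has no remoteRefs.app mapping.
  for key in $1; do
    case "$mapped" in
      *" $key "*) ;;
      *) echo "set in app.env but unmapped: $key"; rc=1 ;;
    esac
  done
  return "$rc"
}
```

In this model a required key that is set only in `app.env` passes the first check and then trips the second, so in ESO mode the practical outcome is that every required key needs a `remoteRefs.app` mapping.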

### Remote ref shapes

Each `remoteRefs.app.<KEY>` value can be either:

```yaml
# Shorthand: just the path/key in the store
BETTER_AUTH_SECRET: sim/app/better-auth-secret
```

```yaml
# Full form: pass any field ESO supports
BETTER_AUTH_SECRET:
  key: sim/app/better-auth-secret
  property: value            # for stores that return JSON
  version: "v3"              # pin a specific version
  decodingStrategy: Base64   # for base64-stored values
```

---

## Cross-cutting: things that are NOT secrets

Operational tunables (rate limits, timeouts, IVM pool size, branding) live in `app.envDefaults` and `realtime.envDefaults`. They're rendered as **inline `env:`** on the Deployment, not written to the Secret. See `values-model.md` for the full mental model.

Don't try to push these into ESO: they're not sensitive and would only bloat the secret store.

---

## Verifying your choice

Before `helm install`, render the chart with your values and inspect the output:

```bash
# What Secret will the pods mount?
helm template sim helm/sim -f my-values.yaml | grep -A2 "envFrom:"

# For ESO: did the ExternalSecret render?
helm template sim helm/sim -f my-values.yaml | grep -B1 -A10 "kind: ExternalSecret"

# For existingSecret: is your pre-created Secret referenced?
helm template sim helm/sim -f my-values.yaml | grep -E "name: .*-app-secrets"
```

For ESO, after `helm install`, verify the sync:

```bash
kubectl get externalsecret -n sim
kubectl describe externalsecret <release>-app-secrets -n sim
# Status should show condition Ready=True with reason SecretSynced
```
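
For scripted installs, the manual check above could be turned into a blocking wait. The sketch below is hypothetical: the function name and the 120-second default are illustrative, and it assumes the synced Secret carries the same `<release>-app-secrets` name as the ExternalSecret, which should be confirmed against the rendered manifests.

```bash
# Hypothetical helper: block until the ExternalSecret reports Ready, then
# confirm the target Secret exists.
# Assumption: ExternalSecret and target Secret are both named <release>-app-secrets.
# Usage: wait_for_eso_sync <release> <namespace> [timeout]
wait_for_eso_sync() {
  local release="$1" ns="$2" timeout="${3:-120s}"
  local name="${release}-app-secrets"

  kubectl --namespace "$ns" wait "externalsecret/${name}" \
    --for=condition=Ready --timeout="$timeout" || {
      echo "ExternalSecret ${name} not Ready after ${timeout}" >&2
      return 1
    }
  kubectl --namespace "$ns" get secret "$name" >/dev/null || {
    echo "Secret ${name} not found" >&2
    return 2
  }
  echo "synced: ${name}"
}
```

Calling `wait_for_eso_sync sim sim` right after `helm install` gives a single pass/fail result before the rollout of the app pods is checked.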
