v0.6.75: scheduler claim-budget drain, helm chart hardening, mothership md polish #4568
TheodoreSpeaks commented May 12, 2026
- improvement(helm): helm chart updates with security, ESO, and docs overhaul (#4565)
- improvement(mothership): align markdown blockquote, img, em, del with design tokens (#4566)
- improvement(scheduler): raise per-tick claim budget to drain backlog (#4567)
… design tokens (#4566)

* improvement(mothership): align markdown blockquote, img, em, del with design tokens
* fix(mothership): correctly scope blockquote paragraph margin reset to first/last child
* improvement(mothership): restore italic on blockquotes
* fix(mothership): widen img component prop type to satisfy Streamdown Components
…erhaul (#4565)

* improvement(helm): production-ready chart with security, ESO, and docs overhaul

Comprehensive Helm chart improvements bringing the chart up to industry standards for security, secret management, and documentation.

Security
- Pod Security Standards "restricted" defaults on every pod and container (runAsNonRoot, allowPrivilegeEscalation=false, capabilities.drop=[ALL], seccompProfile=RuntimeDefault)
- automountServiceAccountToken=false on ServiceAccount and every pod
- NetworkPolicy egress blocks cloud metadata endpoints by default
- Sensitive app/realtime env keys auto-partitioned into chart-managed Secret via envFrom; no more plaintext secrets on container specs

Secret management
- Three modes: inline, existingSecret, ExternalSecrets Operator (ESO)
- ESO sync supports arbitrary sensitive keys
- Fail-fast template rendering when ESO enabled but sensitive key unmapped
- AWS/Azure/GCP example files document all three modes

Reliability
- Headless Services for both Postgres StatefulSets
- HPA-aware replicas (omits spec.replicas when autoscaling.enabled)
- PodDisruptionBudget auto-activates when replicaCount > 1
- Startup / liveness / readiness probes with distinct timings
- CronJob ttlSecondsAfterFinished for automatic cleanup

Chart hygiene
- Image tags default to Chart.AppVersion; pullPolicy IfNotPresent
- Optional image.digest pin for content-addressed deploys
- kubeVersion >=1.25.0-0 enforced
- Ollama pinned to 0.23.2; mount moved to /data

Documentation
- README rewritten in cert-manager / Bitnami style
- NOTES.txt with post-install guidance
- Example values files annotated with usage and secret-strategy guidance

* fix(helm): correct resource names in README (sim-sim-* → sim-*)

The sim.fullname helper collapses to the release name when the release name contains the chart name. With the documented release name 'sim', actual resources are 'sim-app', 'sim-postgresql', etc., not the 'sim-sim-*' form previously documented.
Fixes copy-paste commands in the pre-1.0.0 upgrade walkthrough and several troubleshooting snippets. Also expands the cronjobs component description to reflect the full set of 13 scheduled jobs (was understated as just Gmail/Outlook polling).

* improvement(helm): split app/realtime env into Secret-bound + inline defaults

- Add app.envDefaults / realtime.envDefaults for chart-shipped operational tunables (rate limits, timeouts, IVM, feature-flag defaults, localhost URL fallbacks). Rendered inline on the container, not into the Secret
- Remove operational defaults from app.env / realtime.env so the chart-managed Secret stays minimal and External Secrets Operator users only map keys they actually set, not every chart default
- Skip an envDefaults key when the user explicitly sets it in env (K8s `env` overrides `envFrom`, so an inline default would otherwise mask a Secret value at runtime)
- Relax values.schema.json to allow empty strings on NEXT_PUBLIC_APP_URL, BETTER_AUTH_URL, NEXT_PUBLIC_SUPPORT_EMAIL (defaults supplied via envDefaults)

* fix(helm): address PR review: cronjob validation, ESO apiVersion, secret merge order, image guard

- CronJobs reference CRON_SECRET via secretKeyRef; fail-fast at template time when cronjobs.enabled=true and app.env.CRON_SECRET is empty so users get a clear error instead of a CreateContainerConfigError loop
- Default externalSecrets.apiVersion to "v1beta1" (supported by every ESO release since v0.7).
The previous "v1" default targets only ESO v0.17+
- Swap merge order in secrets-app.yaml so app.env wins over realtime.env for shared keys (BETTER_AUTH_SECRET, BETTER_AUTH_URL, …): both pods consume the same Secret via envFrom, so the app value must be canonical
- Add `required` guard on sim.image so an empty tag + empty digest + empty Chart.AppVersion surfaces as a clear template-time error instead of rendering an invalid `repo:` reference

* fix(helm): require critical secrets to be mapped when ESO is enabled

Previously, enabling externalSecrets without mapping BETTER_AUTH_SECRET / ENCRYPTION_KEY / INTERNAL_API_SECRET (and CRON_SECRET when cronjobs are on) rendered cleanly but produced CrashLoopBackOff at runtime with cryptic missing-env errors. Fail at template time instead.

* fix(helm): auto-enable PDB when HPA minReplicas > 1

Previously the auto-enable predicate only checked the static app.replicaCount, which defaults to 1 even when autoscaling is on (HPA owns spec.replicas). PDB now also activates when autoscaling.enabled=true and minReplicas > 1.

* fix(helm): prevent realtime envDefaults from masking app.env Secret values; add StatefulSet upgrade NOTES

- Realtime override-skip now considers keys set in either app.env or realtime.env. The shared app Secret is mounted via envFrom on the realtime pod, so a key set in app.env (e.g. NEXT_PUBLIC_APP_URL) would previously be masked by the realtime envDefault (inline env overrides envFrom in K8s).
- NOTES.txt now prints a StatefulSet orphan-delete reminder on upgrade, surfacing the immutable serviceName issue documented in the README.

* feat(helm): add Claude Skill for chart deployment

Adds a skill at helm/sim/.claude/skills/sim-helm/ that teaches agents how to deploy and troubleshoot the Sim Helm chart: install path selection (inline / existingSecret / ESO), secret generation, the values.yaml four-layer mental model, common-failure troubleshooting, and the pre-1.0.0 StatefulSet orphan-delete upgrade procedure.
Skill is loadable by Claude Code, Codex, and OpenCode via the standard skills convention (directory name matches frontmatter name).

* docs(helm): add CRON_SECRET to TL;DR, dry-run, and example install headers

The validateSecrets guard requires CRON_SECRET when cronjobs.enabled=true (the default), but the quickstart and example file install commands omitted it, so users following the docs hit a hard template-render failure. Adds CRON_SECRET to README TL;DR, validate-the-install dry-run snippet, and the install command headers in all example values files.

* fix(helm): require INTERNAL_API_SECRET in inline secret mode

The ESO coverage validator already required INTERNAL_API_SECRET, but the inline validateSecrets path only checked BETTER_AUTH_SECRET, ENCRYPTION_KEY, and CRON_SECRET, letting inline installs render successfully and then crash at runtime when the realtime↔app shared auth secret was missing. Adds the same fail-fast check to the inline path.

* docs(helm): surface INTERNAL_API_SECRET upgrade requirement in NOTES.txt

The new validateSecrets check makes app.env.INTERNAL_API_SECRET mandatory on upgrade. Existing installs that never set it would hit a template render failure with no in-context guidance. Adds an upgrade-only note with the generation snippet and storage guidance alongside the existing StatefulSet orphan-delete instructions.

* fix(helm): NetworkPolicy egress to OTEL collector + external-db example format

- Add app/realtime NetworkPolicy egress rules for the OpenTelemetry collector pod on ports 4317 (OTLP gRPC) and 4318 (OTLP HTTP) when telemetry.enabled=true. Without these, traces and metrics were silently dropped with connection-refused errors when both telemetry and networkPolicy were enabled.
- Migrate values-external-db.yaml from the legacy list-shaped egress format to the new {exceptCidrs, extraRules} object.
The list form would replace the default object on merge and crash template rendering when the chart tried to access .exceptCidrs on a list.

* fix(helm): NOTES.txt no longer prints false secret warning for ESO users

The secrets-empty warning only checked app.secrets.existingSecret.enabled before scanning app.env. ESO users intentionally leave app.env empty (secrets come from the ESO-synced Secret), so every ESO install/upgrade printed a misleading 'pods will fail to start' warning. Reorders the branches so externalSecrets.enabled takes precedence: ESO users now see a confirmation message with kubectl commands to verify the ExternalSecret has synced. The empty-app.env warning only fires when both ESO and existingSecret are disabled.

* fix(helm): existingSecret mode no longer drops app.env / realtime.env values

In existingSecret mode the chart-managed Secret is not rendered, so non-empty values in app.env / realtime.env had nowhere to land, yet the envDefaults skip logic still suppressed the matching defaults. Result: keys like NEXT_PUBLIC_APP_URL, BETTER_AUTH_URL, and NODE_ENV silently went missing on both pods (the example values-existing-secret.yaml hit this directly). Both app and realtime deployments now inline non-empty values from app.env (plus realtime.env on the realtime container) when existingSecret is enabled and ESO is not. Inline / ESO modes are unchanged: inline still flows through the chart-managed Secret, ESO still owns the synced Secret.

* fix(helm): correct realtime env overlay + filter chart-computed keys in existingSecret mode

Realtime: Sprig merge gives the first source precedence and treats "" as a real value, so realtime.env empty defaults for shared keys shadowed non-empty app.env values. Replace with deepCopy($appEnv) base + manual non-empty overlay of $rtEnv.
Both deployments: exclude DATABASE_URL/SOCKET_SERVER_URL/OLLAMA_URL from the existingSecret inline path so user-supplied values can't override chart-computed ones via last-wins env semantics.

* fix(helm): skip envDefaults in existingSecret mode + document egress rename

In existingSecret mode the user's pre-existing Secret is the source of truth (loaded via envFrom). Inlining localhost envDefaults for URL keys (BETTER_AUTH_URL, NEXT_PUBLIC_APP_URL, ALLOWED_ORIGINS) silently shadowed the Secret-bound values because K8s env always wins over envFrom. Skip envDefaults entirely on both deployments when existingSecret is enabled.

Also call out the networkPolicy.egress shape change (list -> map with exceptCidrs + extraRules) in the NOTES.txt upgrade block so operators migrate their custom rules rather than silently losing them.

* fix(helm): copy-pasteable install commands in copilot + ESO examples

values-copilot.yaml: the install header was missing every required copilot.server.env.* secret (AGENT_API_DB_ENCRYPTION_KEY, INTERNAL_API_SECRET, LICENSE_KEY, SIM_BASE_URL, SIM_AGENT_API_KEY, REDIS_URL, one model key) plus copilot.postgresql.auth.password. Pasting it as-is failed at template render.

values-external-secrets.yaml: NEXT_PUBLIC_APP_URL, BETTER_AUTH_URL, etc. were declared under app.env / realtime.env. In ESO mode the chart-managed Secret isn't rendered, so the validator (rightly) rejects keys in app.env that aren't mapped under externalSecrets.remoteRefs. Moved non-secret URL/config to envDefaults, which is inlined and not subject to the ESO mapping rule.

* polish(helm): configurable NetworkPolicy ingress peers + clearer API_ENCRYPTION_KEY comment

- networkPolicy.ingressFrom lets operators scope the ingress-controller rule to a specific namespace/podSelector. Defaults to a single empty peer (`- {}`), which is the explicit form of "any source": same effective behavior as the old `from: []` but unambiguous across CNIs.
  To restrict, override with e.g.:

      networkPolicy:
        ingressFrom:
          - namespaceSelector:
              matchLabels:
                kubernetes.io/metadata.name: ingress-nginx

- API_ENCRYPTION_KEY comment: drop the "must be exactly 64 hex characters" phrasing that sat awkwardly next to `openssl rand -hex 32`. The generation command already produces the required length.

* test(helm): add helm-unittest suites + CI workflow + ci values matrix

- 7 helm-unittest suites covering smoke, validators, secret modes, envDefaults secret-mode-aware inlining (round-9 regression net), chart-computed env keys (round-8 regression net), NetworkPolicy shape, and PDB/HPA conditional rendering (38 tests, ~265ms).
- ci/*.yaml render fixtures for default, production, existingSecret, ESO, and external-db install modes.
- GitHub Actions workflow runs helm lint --strict, helm unittest, helm template across the ci matrix, and kubeconform validation against Kubernetes 1.30 schemas.
- CONTRIBUTING.md documents how to run the same gates locally.

* test(helm): add helm test hook + kind apiserver dry-run in CI

- New templates/tests/test-connection.yaml renders a Pod with helm.sh/hook=test that wgets the app Service (and realtime when enabled). Lets users run `helm test <release>` after install for a real in-cluster connectivity check. Restricted PSS context.
- tests.* values block (image, timeoutSeconds, resources) is the knob to disable or tune the probe; documented in values.schema.json.
- 3 helm-unittest tests cover the hook annotations, PSS context, and tests.enabled=false skip path (41 tests total).
- New CI job spins up a kind v1.30 cluster and runs `kubectl apply --dry-run=server` against the rendered manifests for the CRD-free ci fixtures (default / existing-secret / external-db). Catches admission and validation issues the static kubeconform schema check can't see.
* chore(helm): remove pre-1.0.0 upgrade fluff + tighten .helmignore

This is the 1.0.0 release of the chart; there is no pre-1.0.0 predecessor for users to upgrade from, so all of the dedicated upgrade narration was hypothetical.
- Drop the 'Upgrading from a pre-1.0.0 build' README section and the matching troubleshooting entry.
- Drop the .Release.IsUpgrade block from NOTES.txt: items 5 (StatefulSet orphan-delete), 6 (INTERNAL_API_SECRET 'new in 1.0.0'), 7 (networkPolicy.egress shape change). Each described a migration off a chart version that never shipped.
- Delete references/upgrade-pre-1.0.0.md and remove the corresponding pointers from SKILL.md.
- Anchor .helmignore patterns to chart root so /tests/ (unit suites) and /examples/ are dropped from the packaged tarball without also dropping templates/tests/ (the helm test hook).

* chore(helm): drop CI workflow + ci/ fixtures + CONTRIBUTING.md

The helm-unittest suites in helm/sim/tests/ and the helm test hook in helm/sim/templates/tests/ stay; those are chart-internal quality scaffolding, not CI. Removed:
- .github/workflows/helm-chart.yml
- helm/sim/ci/*.yaml (5 render fixtures used only by the workflow)
- helm/sim/CONTRIBUTING.md (mostly documented those gates)
- dead /ci/ and /CONTRIBUTING.md entries in .helmignore

* feat(helm): pod rollout on Secret change + topologySpreadConstraints

- Add checksum/secret pod annotations on app, realtime, and copilot Deployments (plus checksum/config on app when branding ConfigMap is enabled). Closes the long-standing footgun where 'helm upgrade' with a changed Secret would silently leave pods running the old values until a manual rollout restart.
- New top-level topologySpreadConstraints value (and sim.topologySpreadConstraints helper) applied to app and realtime Deployments. Mirrors how affinity and tolerations are plumbed; users supply their own labelSelector to mirror Bitnami convention.
- 5 helm-unittest cases cover the checksum annotations and topology spread rendering (46 tests total).

* fix(helm): drop empty-string shadowing in app/realtime env merge

Sprig 'merge' treats "" as a real value, so a default-empty app.env.BETTER_AUTH_URL would shadow a non-empty realtime.env override and the URL would never reach the rendered Secret. Replace 'merge' with an explicit two-pass overlay that filters empties before writing, mirroring the same pattern already used in deployment-realtime.yaml's existingSecret block. Adds two regression tests: realtime.env-only value reaches the Secret when app.env is empty, and app.env still wins on collision when both are non-empty (48 tests total).

* fix(helm): make topologySpreadConstraints per-component to match docstring

Greptile flagged that sim.topologySpreadConstraints helper docstring promised per-component config (.Values.app, .Values.realtime, ...) but call sites passed .Values, so any app.topologySpreadConstraints / realtime.topologySpreadConstraints set by the user was silently dropped. The single global key also prevented distinct app-vs-realtime spread rules. Pass .Values.app / .Values.realtime to the helper at each call site; move the top-level topologySpreadConstraints key into both component sections in values.yaml. Adds a regression test that app constraints don't leak onto the realtime pod.

* fix(helm): allow cron pods through app NetworkPolicy

Cursor flagged that when networkPolicy.enabled=true and cronjobs.enabled=true (the recommended production config), the app NetworkPolicy only allowed ingress from realtime and the ingress controller, silently blocking every cron pod's HTTP call to /api/schedules/execute, webhook polls, etc. All 13 default cronjobs would fail. Tag cron pods with a stable simstudio.ai/component-group: cronjob label so the app NetworkPolicy can allow them with a single rule (no per-job enumeration). Rule is conditional on cronjobs.enabled.
Adds positive and negative regression tests.
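
The per-component topologySpreadConstraints keys described in the commit notes above could be set along these lines. This is only a sketch under stated assumptions: the label keys and values inside labelSelector are placeholders (the chart's actual pod labels are not shown in this PR), and the topology keys are the well-known Kubernetes node labels rather than anything chart-specific.

```yaml
# Hypothetical values override; labelSelector contents are placeholders
# and should be replaced with the labels your rendered pods actually carry.
app:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: app        # assumption, not from the chart
realtime:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: realtime   # assumption, not from the chart
```

Because the constraints now live under each component, the app rule above should not appear on the realtime pod, which is the behavior the regression test in the last topologySpreadConstraints fix is said to cover.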
…4567)

* improvement(scheduler): raise per-tick claim budget to drain backlog

MAX_CRON_CLAIMS 20 -> 100; reserved workflow/job slots 10/10 -> 50/50. Throughput was capped at 20 schedules/tick which created a 20+ hour backlog when due work exceeded ~1 item per cron-second.

* improvement(scheduler): raise per-tick claim budget to 200

Bumps MAX_CRON_CLAIMS 100 -> 200 (workflow/job split 100/100). Pairs with the fire-and-forget cron Lambda change so per-tick processing time is no longer bounded by the Lambda's 50s HTTP timeout.
PR Summary
Medium Risk
Overview
Aligns markdown rendering across chat/file/note viewers by standardizing … Overhauls the …
Reviewed by Cursor Bugbot for commit 05892f7.
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 32763747 | Triggered | Generic Password | 9d2dd8f | helm/sim/tests/validators_test.yaml | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace the secret and store the new one safely, following best practices for secret storage.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection as a pre-commit hook to catch secrets before they leave your machine and to ease remediation.
Greptile Summary
This release bundles three changes: a 10× increase to the scheduler's per-tick claim budget to drain backlogs faster; a Helm chart major-version overhaul (0.1.0 → 1.0.0) adding security hardening, External Secrets Operator improvements, and extensive documentation; and alignment of blockquote,
Confidence Score: 3/5
Safe for new installs and the app/scheduler changes; existing PostgreSQL deployments will fail to upgrade without manual StatefulSet deletion.

The PostgreSQL StatefulSet spec.serviceName change from the ClusterIP service to the new headless service is a technically correct fix, but Kubernetes treats that field as immutable after creation. Any operator running helm upgrade on an existing release will hit a hard API rejection and need to delete and recreate the StatefulSet. Additionally, the networkPolicy.egress schema shifted from a flat list to a nested object; users who previously enabled NetworkPolicy with custom egress rules will find those rules silently absent after the upgrade. Both issues are contained to the Helm chart and do not affect the application or scheduler logic, which are straightforward and low-risk.

Files needing attention: helm/sim/templates/statefulset-postgresql.yaml (immutable serviceName change), helm/sim/values.yaml (networkPolicy.egress schema change)
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[helm upgrade] --> B{statefulset-postgresql spec.serviceName changed?}
B -- existing install --> C[Forbidden: immutable field - helm upgrade fails]
B -- fresh install --> D[StatefulSet created with headless governing service]
E[app.env keys] --> F{secret mode?}
F -- inline --> G[chart-managed Secret plus envDefaults inline env]
F -- existingSecret --> H[user Secret plus app.env inline env]
F -- ESO --> I[ExternalSecret syncs remoteRefs.app to Secret]
J[networkPolicy.egress] --> K{key type}
K -- list pre-1.0 --> L[Rules silently dropped - now expects egress.extraRules]
K -- egress.extraRules --> M[Rules applied]
Reviews (1): Last reviewed commit: "improvement(scheduler): raise per-tick c..."
     {{- include "sim.postgresql.labels" . | nindent 4 }}
 spec:
-  serviceName: {{ include "sim.fullname" . }}-postgresql
+  serviceName: {{ include "sim.fullname" . }}-postgresql-headless
Immutable StatefulSet field change breaks helm upgrade
spec.serviceName is an immutable field on a StatefulSet — Kubernetes only allows mutations to replicas, template, updateStrategy, persistentVolumeClaimRetentionPolicy, and minReadySeconds. Changing it from <release>-postgresql to <release>-postgresql-headless will cause helm upgrade to fail with a Forbidden: updates to statefulset spec for fields other than … error on any installation that already has the PostgreSQL StatefulSet. Operators would need to delete the StatefulSet (and re-attach the PVC) before the upgrade can proceed. Upgrade documentation or a pre-upgrade hook that deletes the old StatefulSet gracefully should be provided.
+  ingressFrom:
+    - {}

   # Custom ingress rules appended to the policy
   ingress: []

-  # Custom egress rules
-  egress: []

+  # Egress configuration
+  egress:
+    # CIDRs excluded from broad HTTPS (443) egress.
+    # Defaults block AWS/GCP/Azure IMDS (169.254.169.254/32) and ECS task metadata
+    # (169.254.170.2/32). Add your cluster's API server CIDR for stronger isolation.
+    exceptCidrs:
+      - "169.254.169.254/32"
+      - "169.254.170.2/32"
+    # Custom egress rules appended to the policy
Breaking schema change for networkPolicy.egress
The networkPolicy.egress key changed type from a plain list to a map with extraRules and exceptCidrs sub-keys. Any existing deployment that set networkPolicy.egress: [...] will have those rules silently dropped because the template now reads networkPolicy.egress.extraRules rather than networkPolicy.egress. Since networkPolicy defaults to enabled: false, only deployments that explicitly enabled it are affected, but those users will lose their custom egress rules without any error or warning.
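
For operators carrying custom egress rules, the migration described in this comment might look like the following values override. It is a sketch, not the chart's documented procedure: it assumes entries under extraRules use the standard Kubernetes NetworkPolicy egress-rule shape, and the CIDR and port shown are placeholders rather than values from the chart.

```yaml
# Before (list shape; rules placed here are no longer read by the template):
# networkPolicy:
#   enabled: true
#   egress:
#     - to:
#         - ipBlock:
#             cidr: 10.20.0.0/16
#       ports:
#         - protocol: TCP
#           port: 5432

# After (map shape): custom rules move under egress.extraRules.
networkPolicy:
  enabled: true
  egress:
    # Defaults from values.yaml, repeated so the override keeps them.
    exceptCidrs:
      - "169.254.169.254/32"
      - "169.254.170.2/32"
    extraRules:
      - to:
          - ipBlock:
              cidr: 10.20.0.0/16   # placeholder CIDR (assumption)
        ports:
          - protocol: TCP
            port: 5432             # placeholder port (assumption)
```

Helm replaces list values rather than merging them, so an override that sets exceptCidrs needs to repeat any default entries it wants to keep.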
   blockquote: ({ children }: React.HTMLAttributes<HTMLQuoteElement>) => (
-    <blockquote className='my-4 border-gray-300 border-l-4 py-1 pl-4 font-sans text-gray-700 italic dark:border-gray-600 dark:text-gray-300'>
+    <blockquote className='my-4 break-words border-[var(--divider)] border-l-2 pl-4 font-sans text-[var(--text-primary)] italic [&>p]:my-2 [&>p:first-child]:mt-0 [&>p:last-child]:mb-0'>
       {children}
     </blockquote>
   ),
CSS variable mismatch with the rest of the file's styling
Every other element in COMPONENTS (paragraphs, headings, lists, tables, img, etc.) uses concrete Tailwind palette classes with explicit dark: variants (e.g. text-gray-700 dark:text-gray-300, border-gray-300 dark:border-gray-600). The updated blockquote is now the sole element using var(--divider) and var(--text-primary). In the rendering context of this component (the public chat / landing page, where the wrapper already uses var(--landing-text)), if --divider or --text-primary are not defined, the border and text color will fall through to the browser default rather than the intended dark-mode-aware value.
+  # ESO API version. Default "v1beta1", supported by every ESO release from
+  # v0.7+ (mid-2023) through current. Set to "v1" only when targeting ESO
+  # v0.17+ clusters where the v1 API has graduated.
+  apiVersion: "v1beta1"
The default ESO apiVersion was changed from "v1" to "v1beta1". Existing installations that relied on the previous default (and are running ESO ≥ v0.17 where the v1 API is stable) will now render ExternalSecret resources at the v1beta1 API version on the next helm upgrade. While v1beta1 is functionally equivalent and this is a safe downgrade, the change is not surfaced to the user through a default-diff. Adding a note in NOTES.txt about this API version shift would reduce surprise.
Suggested change:
-  # ESO API version. Default "v1beta1", supported by every ESO release from
-  # v0.7+ (mid-2023) through current. Set to "v1" only when targeting ESO
-  # v0.17+ clusters where the v1 API has graduated.
-  apiVersion: "v1beta1"
+  # ESO API version. Use "v1beta1" for broad compatibility (ESO v0.7+).
+  # Set to "v1" when targeting ESO >= v0.17 clusters where the v1 API has graduated.
+  apiVersion: "v1beta1"
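
A cluster already on ESO v0.17+ that wants to keep rendering ExternalSecret resources at v1 could pin the version explicitly instead of relying on the chart default. A minimal sketch, assuming only the externalSecrets.enabled and externalSecrets.apiVersion keys named in this PR; any other ESO settings (such as the remoteRefs mappings) are omitted here and would still be required.

```yaml
# Hypothetical values override for clusters running ESO v0.17 or newer.
externalSecrets:
  enabled: true
  # Pin explicitly so a change in the chart default does not alter
  # the apiVersion of the rendered ExternalSecret on helm upgrade.
  apiVersion: "v1"
```

Pinning the value in the operator's own values file makes the choice visible in review, which addresses the "not surfaced through a default-diff" concern raised above.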