
feat(storagebox): replace Bitnami Cassandra with K8ssandra operator#100

Open
adamancini wants to merge 1 commit into main from adamancini/replace-bitnami

Conversation

@adamancini commented Feb 9, 2026

Summary

Replaces the Bitnami Cassandra Helm subchart with the K8ssandra operator, modernizes the KOTS application configuration, refactors preflight/support bundle architecture, and adds proxy registry support. This addresses the Bitnami transition to paid Broadcom subscriptions after September 2025.

K8ssandra migration

  • Adds k8ssandra-operator (v1.22.0) as an EC extension alongside existing CNPG, MinIO, cert-manager, and ingress-nginx operators
  • Removes ~970 lines of Bitnami chart configuration, replacing with a K8ssandraCluster CRD template and ~50 lines of values
  • Introduces deployment mode selector: simple (Cassandra only) or full (Cassandra + Reaper repairs)
  • Simplifies TLS to a single cert-manager toggle
  • Configures k8ssandra-operator with global.clusterScoped: true (required to watch CRs in the app namespace)
  • Moves cert-manager to first position in ec.yaml (k8ssandra cass-operator webhooks depend on cert-manager CRDs)
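
For orientation, a minimal K8ssandraCluster of the kind the new CRD template renders could look like this (a sketch; the release name, server version, and sizes are illustrative, not copied from the chart):

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: storagebox          # hypothetical release name
spec:
  cassandra:
    serverVersion: "4.1.5"  # illustrative version
    datacenters:
      - metadata:
          name: dc1
        size: 1
        storageConfig:
          cassandraDataVolumeClaimSpec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 10Gi
  # "full" mode would additionally enable repairs:
  # reaper: {}
```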

Preflight and support bundle refactoring

  • Moves specs from duplicate KOTS/Helm sources into Helm define blocks (_preflight.tpl, _supportbundle.tpl) included as Secrets, following Replicated's recommended pattern
  • Deletes standalone kots-preflight.yaml and kots-support-bundle.yaml (were duplicate sources of truth)
  • Adds spec.uri fields pointing to raw GitHub permalinks for online spec updates
  • Adds preflight checks for all subcharts when enabled: PostgreSQL (memory, K8s version), MinIO (memory, storage), rqlite (CPU), NFS (kernel module), Cassandra (CPU cluster/node)
  • Includes preflight re-checks in support bundles for post-install environment drift detection
  • Pins busybox:latest to busybox:1.37.0 for air-gap compatibility
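
The define-block-plus-Secret pattern looks roughly like this (a sketch following Replicated's documented convention; the helper and file names are illustrative):

```yaml
# templates/_preflight.tpl
{{- define "storagebox.preflight" -}}
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: storagebox
spec:
  analyzers: []
{{- end }}

# templates/preflight-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: storagebox-preflight
  labels:
    troubleshoot.sh/kind: preflight   # label KOTS uses to discover the spec
stringData:
  preflight.yaml: |
{{ include "storagebox.preflight" . | indent 4 }}
```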

Proxy registry and image management

  • Adds proxy.xyyzx.net image overrides for all images (MinIO, rqlite, PostgreSQL, busybox) with air-gap fallback via HasLocalRegistry
  • Adds additionalNamespaces for all operator namespaces (cert-manager, ingress-nginx, cnpg, k8ssandra-operator, minio) so KOTS provisions imagePullSecrets
  • Adds additionalImages for busybox through proxy
  • Extracts busybox image reference to chart values for proxy override support
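
The proxy-with-airgap-fallback pattern in the HelmChart CR values typically takes this shape (a sketch; the values structure and the app-slug path segment are assumptions):

```yaml
# kots/storagebox-helmchart.yaml (sketch)
values:
  busybox:
    image:
      # Falls back to the local registry in air-gap installs
      registry: '{{repl HasLocalRegistry | ternary LocalRegistryHost "proxy.xyyzx.net" }}'
      repository: '{{repl HasLocalRegistry | ternary LocalRegistryNamespace "proxy/APP_SLUG/docker.io/library" }}/busybox'
      tag: 1.37.0
```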

KOTS config expansion (17 new options)

  • Cassandra: datacenter replica count, storage class, storage size, JVM heap size, Prometheus metrics toggle
  • PostgreSQL: instance count, storage size, storage class, log level, external PostgreSQL support (host, port, database)
  • MinIO: TLS auto-cert toggle, API ingress hostname, console ingress hostname
  • rqlite: storage size, storage class
  • Adds conditional statusInformers for all subcharts
  • Populates kots-app.yaml spec fields (icon, ports, graphs, additionalImages)
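
A representative config item, as a sketch (the option names and default are assumptions, not necessarily the ones used in this PR):

```yaml
# kots/kots-config.yaml (sketch)
- name: cassandra_heap_size
  title: Cassandra JVM heap size
  type: text
  default: "512M"
  when: repl{{ ConfigOptionEquals "cassandra_enabled" "1" }}
```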

Values cleanup

  • Trims values.yaml from 640 to ~190 lines (removes values that match subchart defaults)
  • Cleans up HelmChart CR tenant section (removes subchart defaults)
  • Syncs development-values.yaml with all new config defaults

Build and testing

  • Adds customer management workflow for EC testing in Makefile
  • Adds CI workflow and smoke tests for all storage components
  • Fixes Makefile .SHELLFLAGS for GNU Make compatibility on Linux CI runners
  • Bumps EC version to 2.13.4+k8s-1.33, chart version to 0.24.0
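
In Embedded Cluster terms, the operator addition described above might be sketched as follows (chart coordinates from K8ssandra's public Helm repository; namespace and layout illustrative):

```yaml
# ec.yaml (sketch)
apiVersion: embeddedcluster.replicated.com/v1beta1
kind: Config
spec:
  version: 2.13.4+k8s-1.33
  extensions:
    helm:
      repositories:
        - name: k8ssandra
          url: https://helm.k8ssandra.io/stable
      charts:
        - name: k8ssandra-operator
          chartname: k8ssandra/k8ssandra-operator
          version: 1.22.0
          namespace: k8ssandra-operator
          values: |
            # required to watch CRs in the app namespace
            global:
              clusterScoped: true
```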

Resolves #96

Test plan

  • helm lint charts/storagebox passes
  • helm template storagebox charts/storagebox --debug renders all templates correctly
  • replicated release lint --yaml-dir kots passes (no errors)
  • All ConfigOption references exist in kots-config.yaml
  • development-values.yaml contains all config items with matching defaults
  • Deploy with cassandra_mode=simple and verify CassandraDatacenter comes up healthy
  • Verify support bundle collects k8ssandra-operator managed pod logs
  • CI workflow passes (lint-and-template + helm-install-test on k3s 1.32)
  • EC headless install succeeds on CMX VM (Ubuntu 22.04, EC 2.13.4+k8s-1.33)
  • All 5 EC extensions install successfully
  • Admin Console reachable after headless install with development-values.yaml

@adamancini

This one requires a little more testing than the other changes - I need to implement a Cassandra consumer.

@adamancini force-pushed the adamancini/replace-bitnami branch 10 times, most recently from e01ddbe to 994b975 on February 13, 2026 at 22:00
@adamancini

this one requires a little bit more testing than the other changes - I need to implement a cassandra consumer

Use the helm chart tests to make a small request to the /health endpoint and check for a 200 OK.
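
A Helm test hook for that could be sketched like this (the consumer service name and port are assumptions, since the consumer isn't implemented yet):

```yaml
# templates/tests/health-test.yaml (sketch)
apiVersion: v1
kind: Pod
metadata:
  name: "{{ .Release.Name }}-health-test"
  annotations:
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
    - name: health-check
      image: curlimages/curl:8.11.0
      # -f makes curl exit non-zero on HTTP errors, failing the test
      command: ["curl", "-fsS", "http://{{ .Release.Name }}-consumer:8080/health"]
```

Run it with `helm test <release>` after install.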

@jmboby commented Feb 17, 2026

@adamancini I've deployed to EC and I see an issue with Cassandra Status Informer -

[screenshot: Cassandra status informer showing Unavailable]

The cassandradc exists but the Pod is Pending due to insufficient CPU. Should we update the preflight check to include a CPU spec, and the README to give a minimum VM spec? (I'm currently on an r1.small with 2 CPUs.)

k get cassandradatacenter -A
NAMESPACE   NAME   AGE
kotsadm     dc1    7m33s

12m         Warning   FailedScheduling              pod/storagebox-cassandra-dc1-default-sts-0                                 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
8m50s       Warning   FailedScheduling              pod/storagebox-cassandra-dc1-default-sts-0                                 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

The k8ssandra-client init container requests a whole CPU, while CPU requests are already at 61% on a two-CPU VM.

image: cr.k8ssandra.io/k8ssandra/k8ssandra-client:v0.6.4
        imagePullPolicy: IfNotPresent
        name: server-config-init
        resources:
          limits:
            cpu: "1"
            memory: 384M
          requests:
            cpu: "1"
            memory: 256M

@jmboby commented Feb 17, 2026

@adamancini do we care about missing logs in the support bundle for the below?

❯ k -n k8ssandra-operator logs k8ssandra-operator-84c788cb8f-sc4m8
Error from server (NotFound): the server could not find the requested resource (get pods k8ssandra-operator-84c788cb8f-sc4m8)

❯ k -n k8ssandra-operator logs k8ssandra-operator-cass-operator-6d75b4959b-5lvbd
Error from server (NotFound): the server could not find the requested resource (get pods k8ssandra-operator-cass-operator-6d75b4959b-5lvbd)

❯ k -n cnpg logs cloudnative-pg-76bdfd4497-lkppn
Error from server (NotFound): the server could not find the requested resource (get pods cloudnative-pg-76bdfd4497-lkppn)

❯ k -n cert-manager logs cert-manager-fd4f89f9b-ncr42
Error from server (NotFound): the server could not find the requested resource (get pods cert-manager-fd4f89f9b-ncr42)

❯ k -n minio logs minio-operator-57b9ccf48c-6k4wt
Error from server (NotFound): the server could not find the requested resource (get pods minio-operator-57b9ccf48c-6k4wt)

The support bundle is looking for:

kind: SupportBundle
apiVersion: troubleshoot.sh/v1beta2
metadata:
  name: cassandra
spec:
  collectors:
    - clusterInfo: {}
    - clusterResources: {}
    - logs:
        selector:
          - app.kubernetes.io/managed-by=cassandra-operator

However the Pods have these labels:

Labels:      app.kubernetes.io/instance=k8ssandra-operator
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=cass-operator
                  app.kubernetes.io/part-of=k8ssandra-k8ssandra-operator-k8ssandra-operator
                  control-plane=k8ssandra-operator-cass-operator-controller-manager
                  helm.sh/chart=cass-operator-0.56.0

@jmboby commented Feb 17, 2026

I built on an r1.large VM and got the Cassandra Pod running; however, the status informer still seems to say it's down. I wonder if it can't read the status field of the CRD properly?

status:
  cassandraOperatorProgress: Ready
  conditions:
  - lastTransitionTime: "2026-02-17T20:51:40Z"
    message: ""
    reason: ""
    status: "True"
    type: Healthy
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "False"
    type: ResizingVolumes
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "False"
    type: Stopped
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "False"
    type: ReplacingNodes
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "False"
    type: Updating
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "False"
    type: RollingRestart
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "False"
    type: Resuming
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "False"
    type: ScalingDown
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "True"
    type: Valid
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "True"
    type: Initialized
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "True"
    type: Ready
  - lastTransitionTime: "2026-02-17T20:51:41Z"
    message: ""
    reason: ""
    status: "False"
    type: RequiresUpdate
  datacenterName: dc1
  lastRollingRestart: "1970-01-01T00:00:01Z"
  lastServerNodeStarted: "2026-02-17T20:50:28Z"
  nodeStatuses:
    storagebox-cassandra-dc1-default-sts-0:
      hostID: 076321b9-a528-4c2f-ab6c-cb971c69ce72
      ip: 10.244.3.220
      rack: default
  observedGeneration: 1

@jmboby commented Feb 17, 2026

I tried switching from Cassandra simple to full as per your test checklist. I did this via the Admin Console after the initial build and then selected 'Deploy' on the new version. Is this a valid test, or should I start with Cassandra full and build from scratch?

The storagebox-cassandra-dc1-default-sts-0 Pod never fully starts up and the logs show liveness probes succeeding then failing in a loop:

INFO  [nioEventLoopGroup-2-2] 2026-02-17 23:35:43,300 Cli.java:672 - address=/169.254.170.0:53326 url=/api/v0/probes/readiness status=500 Internal Server Error
INFO  [nioEventLoopGroup-2-1] 2026-02-17 23:35:43,777 Cli.java:672 - address=/169.254.170.0:53340 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2026-02-17 23:35:53,299 Cli.java:672 - address=/169.254.170.0:34806 url=/api/v0/probes/readiness status=500 Internal Server Error
INFO  [nioEventLoopGroup-2-1] 2026-02-17 23:35:58,777 Cli.java:672 - address=/169.254.170.0:34808 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2026-02-17 23:36:03,300 Cli.java:672 - address=/169.254.170.0:53334 url=/api/v0/probes/readiness status=500 Internal Server Error

However, I eventually found that after some time (quite a long time) it does start, the Reaper pod also comes up, and the logs are all OK. So maybe some patience is needed and this does work after all.

root@0ada8fa9:/home/jmboby# k get po -n kotsadm
NAME                                               READY   STATUS    RESTARTS        AGE
kotsadm-5f64d757b9-n5fkg                           1/1     Running   0               5h58m
kotsadm-rqlite-0                                   1/1     Running   0               5h59m
kurl-proxy-kotsadm-689d84f878-ztklz                1/1     Running   1 (5h59m ago)   5h59m
postgres-1                                         1/1     Running   0               5h45m
replicated-6bf8df9cf-v88pw                         1/1     Running   0               5h45m
storagebox-cassandra-dc1-default-sts-0             2/2     Running   0               172m
storagebox-cassandra-dc1-reaper-67fc45c5b8-prhpq   1/1     Running   0               171m

@jmboby commented Feb 18, 2026

When I try to enable TLS within the Admin Console I get 'failed to render archive directory' when attempting to save the config.

I guess this is related to your comment 'helm template --set cassandra.tls.enabled=true: Fails with expected guard message (TLS not yet implemented)'. But I see you have implemented cert-manager in the ec.yaml extensions, yet you haven't listed it in the main Chart.yaml? cert-manager is running fine within EC, so I'm confused about the implementation details.

helm template storagebox ./charts/storagebox --set cassandra.tls.enabled=true --debug

level=DEBUG msg="Original chart version" version=""
level=DEBUG msg="Chart path" path=/Users/jwilson/git/platform-examples/applications/storagebox/charts/storagebox
level=DEBUG msg="number of dependencies in the chart" chart=storagebox dependencies=3
level=DEBUG msg="number of dependencies in the chart" chart=replicated dependencies=0
level=DEBUG msg="number of dependencies in the chart" chart=nfs-server dependencies=1
level=DEBUG msg="number of dependencies in the chart" chart=common dependencies=0
level=DEBUG msg="number of dependencies in the chart" chart=tenant dependencies=0

Error: execution error at (storagebox/templates/cassandra-cluster.yaml:3:4): Cassandra TLS via cert-manager is not yet implemented. Set cassandra.tls.enabled=false or remove cassandra_tls_enabled from KOTS config.

@adamancini

> When I try to enable TLS within the Admin console I get 'failed to render archive directory' when attempting to save config [...]

I haven't implemented anything to do with TLS in Cassandra yet, so I'll make this config a no-op for now.

adamancini added a commit that referenced this pull request Feb 18, 2026
Fix support bundle label selectors to match actual K8ssandra pod labels
(managed-by=cass-operator, not cassandra-operator) and add log collectors
for all EC extension operators in their correct namespaces (k8ssandra-operator,
cnpg, minio, cert-manager, ingress-nginx).

Fix status informer for Cassandra by switching from CassandraDatacenter CRD
(unsupported by KOTS status informers) to the underlying StatefulSet created
by cass-operator.

Add conditional CPU preflight checks when Cassandra is enabled: cluster-wide
capacity (sum >= 4 cores) and per-node allocatable check (max >= 2 cores) to
catch under-provisioned VMs like r1.small before deployment.

Hide TLS config toggle from admin console UI since TLS via cert-manager is
not yet implemented -- the exposed toggle caused confusing render failures.
@adamancini

@jmboby Thanks for the thorough testing! Pushed fixes for all the issues you found:

Support bundle label mismatch

Fixed the Cassandra data pod selector from managed-by=cassandra-operator (wrong) to managed-by=cass-operator. Also added log collectors for all EC extension operators in their correct namespaces — k8ssandra-operator, cass-operator, CNPG, MinIO, cert-manager, and ingress-nginx were all missing. Both the KOTS and Helm template versions of the support bundle are updated.
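
The corrected collectors presumably take a shape like this (namespaces from the discussion above; the operator selector uses the standard chart labels and is shown as an illustration):

```yaml
collectors:
  - logs:
      name: cass-operator
      namespace: k8ssandra-operator
      selector:
        - app.kubernetes.io/name=cass-operator
  - logs:
      name: cassandra-data-pods
      selector:
        # cass-operator, not cassandra-operator, manages the data pods
        - app.kubernetes.io/managed-by=cass-operator
```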

Status informer showing unavailable

Root cause: KOTS status informers only support six built-in resource types (Deployment, StatefulSet, DaemonSet, Service, Ingress, PVC). cassandradatacenter/dc1 is a CRD and was silently ignored — KOTS logs "Informer requested for unsupported resource kind" and reports Unavailable.

Fixed by switching to statefulset/storagebox-cassandra-dc1-default-sts — the StatefulSet that cass-operator creates underneath the CassandraDatacenter CR. KOTS natively understands StatefulSet health (readyReplicas == desiredReplicas).
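
In kots-app.yaml the informer entry would then read something like this (sketch; namespace prefix optional):

```yaml
# kots/kots-app.yaml (sketch)
spec:
  statusInformers:
    - statefulset/storagebox-cassandra-dc1-default-sts
```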

CPU preflight check

Added two conditional preflight checks (only active when Cassandra is enabled):

  • Cluster-wide: sum(cpuCapacity) >= 4 (fail) / >= 6 (warn for production with Reaper)
  • Per-node: max(cpuAllocatable) >= 2 (fail) / >= 3 (warn) — this catches the r1.small scenario where the init container's 1 CPU request can't be satisfied after system reservations

Both KOTS and Helm template preflights updated.
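
With troubleshoot.sh's nodeResources analyzer, the two checks can be sketched as follows (messages abbreviated, thresholds from the bullets above):

```yaml
analyzers:
  - nodeResources:
      checkName: Cluster CPU capacity for Cassandra
      outcomes:
        - fail:
            when: "sum(cpuCapacity) < 4"
            message: At least 4 CPU cores are required across the cluster.
        - warn:
            when: "sum(cpuCapacity) < 6"
            message: 6 cores are recommended for production with Reaper.
        - pass:
            message: Cluster has sufficient CPU capacity.
  - nodeResources:
      checkName: Per-node allocatable CPU
      outcomes:
        - fail:
            when: "max(cpuAllocatable) < 2"
            message: No node has 2 allocatable cores for the Cassandra pod.
        - pass:
            message: A node with sufficient allocatable CPU exists.
```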

TLS toggle render failure

Hid the TLS config item in the admin console UI (when: 'false') since TLS via cert-manager isn't implemented yet. The fail guard in the Helm template stays as a safety net, but users won't hit the confusing "failed to render archive directory" error anymore. Left a comment in kots-config.yaml to re-enable it when TLS templating is complete.
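
Hiding an item while keeping its value machinery intact is a one-line change, roughly (sketch; the item name is inferred from the error message quoted earlier):

```yaml
# kots/kots-config.yaml (sketch)
- name: cassandra_tls_enabled
  title: Enable Cassandra TLS via cert-manager
  type: bool
  default: "0"
  when: "false"   # hidden until TLS templating is implemented
```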

Simple → full transition

Noted the slow startup — readiness probes return 500 for an extended period before Cassandra fully initializes with Reaper. This is expected K8ssandra behavior, not a bug on our side.

Cleanup

Also renamed preflight metadata.name from cassandra to storagebox (matching the support bundle rename) and fixed a stale cassandra-operator label selector in CLAUDE.md testing docs.

Commits: 1e9ffe7, 2ae4789

@adamancini force-pushed the adamancini/replace-bitnami branch from b1a2d06 to 3bfbd9b on February 18, 2026 at 21:16
@adamancini


Replace Bitnami Cassandra subchart with K8ssandra operator and
modernize the KOTS application configuration, preflight/support
bundle architecture, and image proxy support.

K8ssandra migration:
- Add k8ssandra-operator (v1.22.0) as EC extension
- Remove ~970 lines of Bitnami chart config, replace with
  K8ssandraCluster CRD template
- Add deployment mode selector: simple (Cassandra only) or
  full (Cassandra + Reaper repairs)
- Simplify TLS to single cert-manager toggle

Preflight and support bundle refactoring:
- Move specs from duplicate KOTS/Helm sources into Helm define
  blocks (_preflight.tpl, _supportbundle.tpl) included as Secrets
- Delete standalone kots-preflight.yaml and kots-support-bundle.yaml
- Add spec.uri fields for online spec updates without app upgrades
- Add preflight checks for all subcharts when enabled (PostgreSQL,
  MinIO, rqlite, NFS kernel module, Cassandra CPU/memory)
- Include preflight re-checks in support bundles for drift detection
- Use storagebox.labels helper for common Kubernetes labels
- Pin busybox:latest to busybox:1.37.0 for air-gap compatibility

Proxy registry and image management:
- Add proxy.xyyzx.net image overrides for all images (MinIO, rqlite,
  PostgreSQL, busybox) with air-gap fallback via HasLocalRegistry
- Add additionalNamespaces for all operator namespaces so KOTS
  provisions imagePullSecrets
- Add additionalImages for busybox through proxy
- Extract busybox image to values for proxy override support

KOTS config expansion:
- Add storage sizing and storage class for all backends
- Add Cassandra datacenter replica count, JVM heap size, Prometheus
- Add PostgreSQL instance count, log level, external PG support
- Add MinIO TLS auto-cert toggle and ingress hostname config
- Add rqlite storage sizing and storage class
- Add conditional statusInformers for all subcharts
- Populate kots-app.yaml spec fields (icon, ports, graphs)

Values cleanup:
- Trim values.yaml to only subchart overrides (640 -> 190 lines)
- Clean up HelmChart CR tenant section (remove subchart defaults)
- Sync development-values.yaml with all new config defaults

Build and testing:
- Add customer management workflow for EC testing
- Add CI workflow and smoke tests for all storage components
- Fix Makefile .SHELLFLAGS for GNU Make compatibility
- Bump EC version to 2.13.4+k8s-1.33, chart version to 0.24.0

Resolves #96
@adamancini force-pushed the adamancini/replace-bitnami branch from 37dce7e to 3f264d5 on February 20, 2026 at 21:03

Successfully merging this pull request may close these issues.

Replace Bitnami Cassandra chart in storagebox application
