feat(storagebox): replace Bitnami Cassandra with K8ssandra operator #100
adamancini wants to merge 1 commit into main from […]
This one requires a little more testing than the other changes: I need to implement a Cassandra consumer.
e01ddbe to
994b975
Use the Helm chart tests to make a small request to the /health endpoint and check for 200 OK.
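A Helm test hook is one way to wire up the /health check described above. The sketch below is a minimal test Pod, not something taken from this PR: the template path, the Service name suffix `-consumer`, and port `8080` are placeholders, since the consumer does not exist yet. The `busybox:1.37.0` tag is borrowed from the tag this PR pins elsewhere.

```yaml
# templates/tests/health-test.yaml (hypothetical path)
apiVersion: v1
kind: Pod
metadata:
  name: "{{ .Release.Name }}-health-test"
  annotations:
    "helm.sh/hook": test   # run only when `helm test` is invoked
spec:
  restartPolicy: Never
  containers:
    - name: health
      image: busybox:1.37.0
      # busybox wget exits non-zero on an HTTP error response,
      # which marks the test Pod (and the Helm test) as failed.
      # Service name and port below are assumptions.
      command:
        - wget
        - -q
        - -O
        - /dev/null
        - "http://{{ .Release.Name }}-consumer:8080/health"
```

Running `helm test <release>` after install would then report the Pod's exit status.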
@adamancini I've deployed to EC and I see an issue with the Cassandra status informer.
The CassandraDatacenter exists, but the Pod is pending due to insufficient CPU. Should we update the preflight check to include a CPU spec, and the README to give a minimum VM spec? (I'm currently on the r1.small with 2 CPUs.)

    k get cassandradatacenter -A
    NAMESPACE   NAME   AGE
    kotsadm     dc1    7m33s

    12m    Warning FailedScheduling pod/storagebox-cassandra-dc1-default-sts-0 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
    8m50s  Warning FailedScheduling pod/storagebox-cassandra-dc1-default-sts-0 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

The Cassandra client container requests a whole CPU, while CPU requests are already at 61% on a two-CPU VM.

    image: cr.k8ssandra.io/k8ssandra/k8ssandra-client:v0.6.4
    imagePullPolicy: IfNotPresent
    name: server-config-init
    resources:
      limits:
        cpu: "1"
        memory: 384M
      requests:
        cpu: "1"
        memory: 256M
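For reference, a CPU preflight of the kind asked about here could be sketched with Troubleshoot's `nodeResources` analyzer. The thresholds (cluster-wide sum of 4 cores, at least one node with 2 allocatable cores) are the ones named in the fix commit later in this thread. The metadata name, check names, and messages are illustrative, and the "only when Cassandra is enabled" conditional is left out.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: cassandra-cpu   # hypothetical name
spec:
  analyzers:
    # Cluster-wide check: total CPU capacity across all nodes.
    - nodeResources:
        checkName: Cluster CPU capacity for Cassandra
        outcomes:
          - fail:
              when: "sum(cpuCapacity) < 4"
              message: Cassandra needs at least 4 CPU cores across the cluster.
          - pass:
              message: The cluster has at least 4 CPU cores.
    # Per-node check: the largest node must offer 2 allocatable cores.
    - nodeResources:
        checkName: Per-node allocatable CPU for Cassandra
        outcomes:
          - fail:
              when: "max(cpuAllocatable) < 2"
              message: At least one node needs 2 or more allocatable CPU cores.
          - pass:
              message: At least one node has 2 or more allocatable CPU cores.
```

Note that `cpuAllocatable` reflects what the node can hand out in total, not what is left after existing pod requests, so a check like this would not by itself catch the "already at 61%" situation on a node that passes both thresholds.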
@adamancini do we care about missing logs in the support bundle for the below?

    jwilson ~ ⎈ default 09:41 ❯ k -n k8ssandra-operator logs k8ssandra-operator-84c788cb8f-sc4m8
    Error from server (NotFound): the server could not find the requested resource (get pods k8ssandra-operator-84c788cb8f-sc4m8)
    jwilson ~ ⎈ default 09:41 ❯ k -n k8ssandra-operator logs k8ssandra-operator-cass-operator-6d75b4959b-5lvbd
    Error from server (NotFound): the server could not find the requested resource (get pods k8ssandra-operator-cass-operator-6d75b4959b-5lvbd)
    jwilson ~ ⎈ default 09:41 ❯ k -n cnpg logs cloudnative-pg-76bdfd4497-lkppn
    Error from server (NotFound): the server could not find the requested resource (get pods cloudnative-pg-76bdfd4497-lkppn)
    jwilson ~ ⎈ default 09:41 ❯ k -n cert-manager logs cert-manager-fd4f89f9b-ncr42
    Error from server (NotFound): the server could not find the requested resource (get pods cert-manager-fd4f89f9b-ncr42)
    jwilson ~ ⎈ default 09:42 ❯ k -n minio logs minio-operator-57b9ccf48c-6k4wt
    Error from server (NotFound): the server could not find the requested resource (get pods minio-operator-57b9ccf48c-6k4wt)

The support bundle is looking for:

    kind: SupportBundle
    apiVersion: troubleshoot.sh/v1beta2
    metadata:
      name: cassandra
    spec:
      collectors:
        - clusterInfo: {}
        - clusterResources: {}
        - logs:
            selector:
              - app.kubernetes.io/managed-by=cassandra-operator

However, the Pods have these labels:

    Labels: app.kubernetes.io/instance=k8ssandra-operator
            app.kubernetes.io/managed-by=Helm
            app.kubernetes.io/name=cass-operator
            app.kubernetes.io/part-of=k8ssandra-k8ssandra-operator-k8ssandra-operator
            control-plane=k8ssandra-operator-cass-operator-controller-manager
            helm.sh/chart=cass-operator-0.56.0
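To make the mismatch concrete, here is a sketch of a `logs` collector that would match the cass-operator Pod labels listed above. The namespace and the `app.kubernetes.io/name=cass-operator` label come from the output in this comment. The collector `name` is a made-up bundle path, and this is not necessarily the selector the PR ends up using.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: cassandra
spec:
  collectors:
    - clusterInfo: {}
    - clusterResources: {}
    - logs:
        name: cass-operator/logs       # hypothetical directory inside the bundle
        namespace: k8ssandra-operator  # operator Pods do not run in the app namespace
        selector:
          # Label actually present on the cass-operator Pod shown above.
          - app.kubernetes.io/name=cass-operator
```

The other operators (cnpg, cert-manager, minio) would need one collector each with their own namespace and labels, which are not shown in this thread.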
I built on an r1.large VM and got the Cassandra Pod running; however, the status informer still seems to say it's down. I wonder if it can't read the status field of the CRD properly?

    status:
      cassandraOperatorProgress: Ready
      conditions:
      - lastTransitionTime: "2026-02-17T20:51:40Z"
        message: ""
        reason: ""
        status: "True"
        type: Healthy
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "False"
        type: ResizingVolumes
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "False"
        type: Stopped
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "False"
        type: ReplacingNodes
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "False"
        type: Updating
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "False"
        type: RollingRestart
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "False"
        type: Resuming
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "False"
        type: ScalingDown
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "True"
        type: Valid
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "True"
        type: Initialized
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "True"
        type: Ready
      - lastTransitionTime: "2026-02-17T20:51:41Z"
        message: ""
        reason: ""
        status: "False"
        type: RequiresUpdate
      datacenterName: dc1
      lastRollingRestart: "1970-01-01T00:00:01Z"
      lastServerNodeStarted: "2026-02-17T20:50:28Z"
      nodeStatuses:
        storagebox-cassandra-dc1-default-sts-0:
          hostID: 076321b9-a528-4c2f-ab6c-cb971c69ce72
          ip: 10.244.3.220
          rack: default
      observedGeneration: 1
Fix support bundle label selectors to match actual K8ssandra pod labels (managed-by=cass-operator, not cassandra-operator) and add log collectors for all EC extension operators in their correct namespaces (k8ssandra-operator, cnpg, minio, cert-manager, ingress-nginx).

Fix status informer for Cassandra by switching from CassandraDatacenter CRD (unsupported by KOTS status informers) to the underlying StatefulSet created by cass-operator.

Add conditional CPU preflight checks when Cassandra is enabled: cluster-wide capacity (sum >= 4 cores) and per-node allocatable check (max >= 2 cores) to catch under-provisioned VMs like r1.small before deployment.

Hide TLS config toggle from admin console UI since TLS via cert-manager is not yet implemented -- the exposed toggle caused confusing render failures.
@jmboby Thanks for the thorough testing! Pushed fixes for all the issues you found:

Support bundle label mismatch
Fixed the Cassandra data pod selector from […]

Status informer showing unavailable
Root cause: KOTS status informers only support six built-in resource types (Deployment, StatefulSet, DaemonSet, Service, Ingress, PVC). Fixed by switching to […]

CPU preflight check
Added two conditional preflight checks (only active when Cassandra is enabled): […]
Both KOTS and Helm template preflights updated.

TLS toggle render failure
Hidden the TLS config item from the admin console UI ([…]

Simple → full transition
Noted the slow startup: readiness probes return 500 for an extended period before Cassandra fully initializes with Reaper. This is expected K8ssandra behavior, not a bug on our side.

Cleanup
Also renamed preflight […]
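As an illustration of the status informer change, a KOTS Application spec pointing at the StatefulSet rather than the CassandraDatacenter might look like the sketch below. The StatefulSet name is inferred from the Pod name `storagebox-cassandra-dc1-default-sts-0` seen earlier in the thread, and the application name is assumed. The conditional wrapper this PR describes is omitted.

```yaml
apiVersion: kots.io/v1beta1
kind: Application
metadata:
  name: storagebox   # assumed application name
spec:
  statusInformers:
    # StatefulSet created by cass-operator for datacenter dc1, rack "default".
    # Name inferred from the Pod name in this thread, not copied from the PR diff.
    - statefulset/storagebox-cassandra-dc1-default-sts
```

Because a StatefulSet is one of the built-in types listed above, KOTS can derive readiness from its replica counts without needing to understand the CRD's `status.conditions`.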
b1a2d06 to
3bfbd9b
Replace Bitnami Cassandra subchart with K8ssandra operator and modernize the KOTS application configuration, preflight/support bundle architecture, and image proxy support.

K8ssandra migration:
- Add k8ssandra-operator (v1.22.0) as EC extension
- Remove ~970 lines of Bitnami chart config, replace with K8ssandraCluster CRD template
- Add deployment mode selector: simple (Cassandra only) or full (Cassandra + Reaper repairs)
- Simplify TLS to single cert-manager toggle

Preflight and support bundle refactoring:
- Move specs from duplicate KOTS/Helm sources into Helm define blocks (_preflight.tpl, _supportbundle.tpl) included as Secrets
- Delete standalone kots-preflight.yaml and kots-support-bundle.yaml
- Add spec.uri fields for online spec updates without app upgrades
- Add preflight checks for all subcharts when enabled (PostgreSQL, MinIO, rqlite, NFS kernel module, Cassandra CPU/memory)
- Include preflight re-checks in support bundles for drift detection
- Use storagebox.labels helper for common Kubernetes labels
- Pin busybox:latest to busybox:1.37.0 for air-gap compatibility

Proxy registry and image management:
- Add proxy.xyyzx.net image overrides for all images (MinIO, rqlite, PostgreSQL, busybox) with air-gap fallback via HasLocalRegistry
- Add additionalNamespaces for all operator namespaces so KOTS provisions imagePullSecrets
- Add additionalImages for busybox through proxy
- Extract busybox image to values for proxy override support

KOTS config expansion:
- Add storage sizing and storage class for all backends
- Add Cassandra datacenter replica count, JVM heap size, Prometheus
- Add PostgreSQL instance count, log level, external PG support
- Add MinIO TLS auto-cert toggle and ingress hostname config
- Add rqlite storage sizing and storage class
- Add conditional statusInformers for all subcharts
- Populate kots-app.yaml spec fields (icon, ports, graphs)

Values cleanup:
- Trim values.yaml to only subchart overrides (640 -> 190 lines)
- Clean up HelmChart CR tenant section (remove subchart defaults)
- Sync development-values.yaml with all new config defaults

Build and testing:
- Add customer management workflow for EC testing
- Add CI workflow and smoke tests for all storage components
- Fix Makefile .SHELLFLAGS for GNU Make compatibility
- Bump EC version to 2.13.4+k8s-1.33, chart version to 0.24.0

Resolves #96
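The proxy override with air-gap fallback mentioned in this commit can be pictured roughly as follows. This is a sketch of a HelmChart CR `values` fragment using the KOTS `HasLocalRegistry`, `LocalRegistryHost`, and `LocalRegistryNamespace` template functions. The `busybox.image.*` key layout and the app slug `storagebox` in the proxy path are assumptions, not taken from the diff.

```yaml
# Fragment of a KOTS HelmChart CR `spec.values` section (key layout is hypothetical)
busybox:
  image:
    # Air-gapped installs have a local registry; online installs go through the proxy.
    registry: '{{repl HasLocalRegistry | ternary LocalRegistryHost "proxy.xyyzx.net" }}'
    # Proxy path follows the proxy/<app-slug>/<upstream registry path> shape;
    # the slug "storagebox" is an assumption.
    repository: '{{repl HasLocalRegistry | ternary LocalRegistryNamespace "proxy/storagebox/index.docker.io/library" }}/busybox'
    tag: "1.37.0"   # pinned tag from this commit
```

Pinning the tag matters here because an air-gap bundle can only contain images that are named exactly at build time.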
37dce7e to
3f264d5





Summary
Replaces the Bitnami Cassandra Helm subchart with the K8ssandra operator, modernizes the KOTS application configuration, refactors preflight/support bundle architecture, and adds proxy registry support. This addresses the Bitnami transition to paid Broadcom subscriptions after September 2025.
K8ssandra migration
- […] global.clusterScoped: true (required to watch CRs in the app namespace)

Preflight and support bundle refactoring
- Move specs into Helm define blocks (_preflight.tpl, _supportbundle.tpl) included as Secrets, following Replicated's recommended pattern
- Delete standalone kots-preflight.yaml and kots-support-bundle.yaml (were duplicate sources of truth)
- Add spec.uri fields pointing to raw GitHub permalinks for online spec updates
- Pin busybox:latest to busybox:1.37.0 for air-gap compatibility

Proxy registry and image management
- Add proxy.xyyzx.net image overrides for all images (MinIO, rqlite, PostgreSQL, busybox) with air-gap fallback via HasLocalRegistry
- Add additionalNamespaces for all operator namespaces (cert-manager, ingress-nginx, cnpg, k8ssandra-operator, minio) so KOTS provisions imagePullSecrets
- Add additionalImages for busybox through proxy

KOTS config expansion (17 new options)
- Add conditional statusInformers for all subcharts
- Populate kots-app.yaml spec fields (icon, ports, graphs, additionalImages)

Values cleanup
- Trim values.yaml from 640 to ~190 lines (removes values that match subchart defaults)
- Sync development-values.yaml with all new config defaults

Build and testing
- Fix Makefile .SHELLFLAGS for GNU Make compatibility on Linux CI runners

Resolves #96

Test plan
- helm lint charts/storagebox passes
- helm template storagebox charts/storagebox --debug renders all templates correctly
- replicated release lint --yaml-dir kots passes (no errors)
- […] cassandra_mode=simple and verify CassandraDatacenter comes up healthy