Add Fluent Bit DaemonSet to CI for persistent log collection #2350
bert-e merged 4 commits into development/2.14 from
Pushing a tentative fix to make sure the log files have the same name. Let's see how it goes.
Working fine. Ready.
Seems fine for the CI use case, but I wonder whether this is appropriate for the local (or codespace) cases, where the platform may run for much longer and disk space may be at a premium. Should this use of Fluent Bit be configurable or optional? Or do we estimate that the impact will still be acceptable in these contexts?
@delthas did you see this comment?
Deploy a Fluent Bit DaemonSet on all kind nodes to tail container logs in real-time and write them to persistent storage. This ensures logs from deleted pods (rolling restarts, operator reconciliation) are preserved in CI artifacts. Currently, kind export logs only captures logs from pods alive at export time. When pods are replaced (e.g. during operator reconciliation loops), their logs are deleted by kubelet and lost. Issue: ZENKO-5216
Strip the kube.var.log.containers. prefix from fluentbit output filenames so they match the naming used by kind export logs. Issue: ZENKO-5216
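The renaming described in this commit can be pictured with a small shell sketch. It is not the PR's actual implementation (the commit does not show whether the prefix is stripped in the Fluent Bit configuration or in a script); the function name `strip_fluentbit_prefix` is hypothetical and only illustrates the intended mapping between file names.

```shell
# Hypothetical helper (not from the PR): map a Fluent Bit output file name
# to the name that `kind export logs` would use for the same container.
strip_fluentbit_prefix() {
    name="$1"
    # ${name#pattern} removes the prefix if present; other names pass through unchanged.
    printf '%s\n' "${name#kube.var.log.containers.}"
}
```

For example, `strip_fluentbit_prefix kube.var.log.containers.pod_ns_app-abc.log` prints `pod_ns_app-abc.log`.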
At archive creation time, iterate through kind-export container log files and replace them with symlinks to the corresponding fluentbit log files when present. This deduplicates the ~118 overlapping container logs while keeping the file paths ksnap expects. Issue: ZENKO-5216
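A minimal sketch of that deduplication step follows, assuming the two directories are passed as absolute paths and that matching files share the same base name. The function name `dedupe_container_logs` and both arguments are hypothetical; the actual logic lives in the archive-artifacts action and may differ.

```shell
# Hypothetical helper (not from the PR): replace each kind-export container log
# with a symlink to the Fluent Bit log of the same name, when one exists.
# Both arguments are assumed to be absolute directory paths.
dedupe_container_logs() {
    export_dir="$1"
    fluentbit_dir="$2"
    # -type f skips entries that are already symlinks.
    find "$export_dir" -type f -name '*.log' | while IFS= read -r log; do
        target="$fluentbit_dir/$(basename "$log")"
        if [ -f "$target" ]; then
            # -f removes the existing regular file before creating the link.
            ln -sf "$target" "$log"
        fi
    done
}
```

Logs with no Fluent Bit counterpart are left untouched, so the paths ksnap expects still exist either way. Absolute link targets only resolve on the machine that created them; links meant to survive inside a tarball would need targets relative to the link location.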
Skip the Fluent Bit DaemonSet in local and codespace environments where long-running clusters and limited disk space make persistent log collection undesirable. Issue: ZENKO-5216
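One way such a gate could look in a shell script is sketched below. It is an assumption, not the check the PR uses: it relies on the `CI=true` variable that GitHub Actions sets and the `CODESPACES=true` variable that GitHub Codespaces sets, and the function name `should_deploy_fluentbit` is hypothetical. A local devcontainer sets neither variable and is therefore skipped as well.

```shell
# Hypothetical gate (not from the PR): succeed only when running in CI.
should_deploy_fluentbit() {
    # GitHub Actions sets CI=true; GitHub Codespaces sets CODESPACES=true.
    [ "${CI:-}" = "true" ] && [ "${CODESPACES:-}" != "true" ]
}
```

A deploy script could then wrap the Fluent Bit installation in `if should_deploy_fluentbit; then ...; fi`, leaving local and codespace clusters without the DaemonSet.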
1236514 to a09c100
Addressed. fluentbit will only be deployed in CI (which makes sense anyway because we are not running the archive artifacts step in local/devcontainer).
Ready for review. Symlinks resolve correctly.
/approve
I have successfully merged the changeset of this pull request.
Please check the status of the associated issue ZENKO-5216. Goodbye delthas.
Summary
/data hostPath volume and included in the logs-volumes.tgz artifact
Why this is needed
When the Zenko operator reconciles, it triggers rolling restarts that replace pods. Kubelet deletes container logs from /var/log/pods/ on pod deletion, and kind export logs only captures logs from pods alive at export time. Every log from before the restart is permanently lost.
Concrete example
While investigating ctst-end2end-sharded #66912102578, all 11 Azure archive tests failed because objects were stuck at transition-in-progress: true. The transition pipeline (lifecycle → cloudserver → Kafka → sorbet-fwd → sorbet-azure) was broken, but we could not determine where because:
kind export logs ran at 08:22, capturing only the post-restart pod logs
With Fluent Bit tailing logs continuously, all of these would have been preserved regardless of pod lifecycle.
Changes
.github/scripts/end2end/configs/fluentbit.yaml: ConfigMap + DaemonSet manifest. Tails /var/log/containers/*.log, writes raw CRI log lines (with timestamps) to /data/fluentbit-logs/, one file per container instance
.github/scripts/end2end/install-kind-dependencies.sh: Deploys Fluent Bit early (before nginx-controller) so it starts capturing logs before any other workloads
.github/actions/archive-artifacts/action.yaml: Copies Fluent Bit output into kind-logs/fluentbit-logs/ before tarring
Issue: ZENKO-5216