
Add Fluent Bit DaemonSet to CI for persistent log collection #2350

Merged
bert-e merged 4 commits into development/2.14 from improvement/ZENKO-5216/fluentbit-ci-logs
Mar 20, 2026


@delthas
Contributor

@delthas delthas commented Mar 13, 2026

Summary

  • Deploy a Fluent Bit DaemonSet on all kind nodes to capture container logs in real-time
  • Logs are written to the shared /data hostPath volume and included in the logs-volumes.tgz artifact
  • Lightweight: ~50m CPU / 64Mi RAM per node, estimated ~100-200 MB compressed output

Why this is needed

When the Zenko operator reconciles, it triggers rolling restarts that replace pods. Kubelet deletes container logs from /var/log/pods/ on pod deletion, and kind export logs only captures logs from pods alive at export time. Every log from before the restart is permanently lost.

Concrete example

While investigating ctst-end2end-sharded #66912102578, all 11 Azure archive tests failed because objects were stuck at transition-in-progress: true. The transition pipeline (lifecycle → cloudserver → Kafka → sorbet-fwd → sorbet-azure) was broken, but we could not determine where because:

  • The operator's reconciliation loop triggered a mass rolling restart at 07:20, replacing ALL data service pods
  • kind export logs ran at 08:22, capturing only the post-restart pod logs
  • All container logs from the 06:36–06:38 failure window were gone
  • We could not determine whether sorbet-fwd crashed (it has a known pattern of committing Kafka offsets during shutdown, permanently losing in-flight messages), whether sorbet-azure was unreachable, or whether internal-cloudserver had the wrong config

With Fluent Bit tailing logs continuously, all of these would have been preserved regardless of pod lifecycle.
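
As a hedged illustration of how the preserved files could be used: raw CRI log lines start with an RFC 3339 timestamp, so a failure window can be cut out of a collected file with a plain string comparison. The function name `filter_window` is hypothetical and not part of this PR, and the sketch assumes the nodes log in UTC.

```shell
#!/usr/bin/env bash
# Print CRI log lines whose timestamp falls in the window [start, end).
# A CRI line looks like: <RFC 3339 timestamp> <stdout|stderr> <P|F> <message>
#
# Bounds are given as timestamp prefixes at minute precision without a
# trailing "Z" (e.g. 2026-03-13T06:36). Fractional seconds have variable
# length in CRI logs, so comparing against a short prefix keeps the plain
# string comparison correct.
# Usage: filter_window <start> <end> [file...]   (reads stdin if no file)
filter_window() {
    local start="$1" end="$2"
    shift 2
    awk -v start="$start" -v end="$end" '$1 >= start && $1 < end' "$@"
}
```

For the incident above, a call such as `filter_window <date>T06:36 <date>T06:39 <file>` would keep only the lines from the 06:36 to 06:38 window; the date and file name are placeholders.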

Changes

  • .github/scripts/end2end/configs/fluentbit.yaml — ConfigMap + DaemonSet manifest. Tails /var/log/containers/*.log, writes raw CRI log lines (with timestamps) to /data/fluentbit-logs/, one file per container instance
  • .github/scripts/end2end/install-kind-dependencies.sh — Deploys Fluent Bit early (before nginx-controller) so it starts capturing logs before any other workloads
  • .github/actions/archive-artifacts/action.yaml — Copies Fluent Bit output into kind-logs/fluentbit-logs/ before tarring
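
The archive step in the last bullet could look roughly like the sketch below. It assumes the Fluent Bit output directory is reachable from the runner; the function name `archive_logs`, its arguments, and the archive name are hypothetical, and the actual action.yaml may differ.

```shell
#!/usr/bin/env bash
# Copy Fluent Bit output next to the kind-exported logs, then tar the result.
# Usage: archive_logs <fluentbit-dir> <kind-logs-dir> <output.tgz>
archive_logs() {
    local fluentbit_dir="$1" kind_logs_dir="$2" archive="$3"
    if [ -d "$fluentbit_dir" ]; then
        mkdir -p "$kind_logs_dir/fluentbit-logs"
        # The trailing "/." copies the directory contents, dotfiles included.
        cp -R "$fluentbit_dir/." "$kind_logs_dir/fluentbit-logs/"
    fi
    # Archive relative to the parent directory so paths start with the
    # kind logs directory name rather than an absolute path.
    tar -czf "$archive" -C "$(dirname "$kind_logs_dir")" "$(basename "$kind_logs_dir")"
}
```

If the Fluent Bit directory is missing (for example because the DaemonSet was never deployed), the sketch still produces an archive of the kind logs alone.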

Issue: ZENKO-5216

@bert-e
Contributor

bert-e commented Mar 13, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Mar 13, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

Comment thread .github/actions/archive-artifacts/action.yaml
@delthas
Contributor Author

delthas commented Mar 13, 2026

Pushing a tentative fix to make sure the log files have the same name. Let's see how it goes.
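
For context, the matching commit message describes the fix as stripping the `kube.var.log.containers.` prefix from the Fluent Bit output file names. The PR presumably does this in the Fluent Bit configuration itself; the shell sketch below only illustrates the name mapping. The function name `strip_tag_prefix` and the example file name are made up.

```shell
#!/usr/bin/env bash
# Map a Fluent Bit tag-derived file name to the name kind export logs uses.
# Names without the prefix are returned unchanged.
strip_tag_prefix() {
    local name="$1"
    printf '%s\n' "${name#kube.var.log.containers.}"
}

# Example (hypothetical container log name):
#   strip_tag_prefix kube.var.log.containers.cloudserver-0_default_cloudserver-abc123.log
# prints:
#   cloudserver-0_default_cloudserver-abc123.log
```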

@delthas
Contributor Author

delthas commented Mar 13, 2026

Working fine. Ready.

@delthas delthas requested review from a team, SylvainSenechal and benzekrimaha March 13, 2026 23:54
@francoisferrand
Contributor

Seems fine for the CI use case, but I wonder whether this is appropriate for the local (or codespace) cases, where the platform may run for much longer and where disk space may be at a premium.

Should this use of fluentbit be configurable/optional? Or do we estimate the impact will still be acceptable in these contexts?

@francoisferrand
Contributor

Seems fine for the CI use case, but I wonder whether this is appropriate for the local (or codespace) cases, where the platform may run for much longer and where disk space may be at a premium.

Should this use of fluentbit be configurable/optional? Or do we estimate the impact will still be acceptable in these contexts?

@delthas did you see this comment?

delthas added 4 commits March 18, 2026 09:46

Deploy a Fluent Bit DaemonSet on all kind nodes to tail container
logs in real-time and write them to persistent storage. This ensures
logs from deleted pods (rolling restarts, operator reconciliation)
are preserved in CI artifacts.

Currently, kind export logs only captures logs from pods alive at
export time. When pods are replaced (e.g. during operator
reconciliation loops), their logs are deleted by kubelet and lost.

Issue: ZENKO-5216

Strip the kube.var.log.containers. prefix from fluentbit output
filenames so they match the naming used by kind export logs.

Issue: ZENKO-5216

At archive creation time, iterate through kind-export container log
files and replace them with symlinks to the corresponding fluentbit
log files when present. This deduplicates the ~118 overlapping
container logs while keeping the file paths ksnap expects.

Issue: ZENKO-5216

Skip the Fluent Bit DaemonSet in local and codespace environments
where long-running clusters and limited disk space make persistent
log collection undesirable.

Issue: ZENKO-5216

@delthas delthas force-pushed the improvement/ZENKO-5216/fluentbit-ci-logs branch from 1236514 to a09c100 Compare March 18, 2026 08:46
@delthas
Contributor Author

delthas commented Mar 18, 2026

Seems fine for the CI use case, but I wonder whether this is appropriate for the local (or codespace) cases, where the platform may run for much longer and where disk space may be at a premium.

Should this use of fluentbit be configurable/optional? Or do we estimate the impact will still be acceptable in these contexts?

Addressed. fluentbit will only be deployed in CI (which makes sense anyway because we are not running the archive artifacts step in local/devcontainer).
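
One way such a CI-only gate could be written is sketched below. `GITHUB_ACTIONS` and `CODESPACES` are environment variables that GitHub sets to `true` in the respective environments; whether this PR checks these exact variables is an assumption, and the function name `should_deploy_fluentbit` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed only when running inside GitHub Actions and not in a Codespace.
# Local runs set neither variable, so they are skipped as well.
should_deploy_fluentbit() {
    [ "${GITHUB_ACTIONS:-}" = "true" ] && [ "${CODESPACES:-}" != "true" ]
}

# Possible use in the install script (manifest path taken from the PR):
#   if should_deploy_fluentbit; then
#       kubectl apply -f .github/scripts/end2end/configs/fluentbit.yaml
#   fi
```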

@delthas delthas requested a review from francoisferrand March 18, 2026 08:47
@delthas
Contributor Author

delthas commented Mar 18, 2026

Ready for review. Symlinks resolve correctly.
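
A minimal sketch of the symlink deduplication described in the commit message, assuming both directories hold files with identical names. The function name `dedupe_with_symlinks` and the directory layout in the usage note are hypothetical; the real step in action.yaml may be structured differently.

```shell
#!/usr/bin/env bash
# Replace each kind-export container log with a symlink to the Fluent Bit
# file of the same name, when one exists. Logs without a Fluent Bit
# counterpart are left untouched.
# Usage: dedupe_with_symlinks <kind-containers-dir> <target-dir>
#   <target-dir> is absolute, or relative to <kind-containers-dir>; a
#   relative target keeps the links valid after the archive is extracted.
dedupe_with_symlinks() {
    local kind_dir="$1" target_dir="$2" log
    (
        # Work inside the kind directory so relative targets resolve the
        # same way for the existence check and for the link itself.
        cd "$kind_dir" || exit 1
        for log in *.log; do
            [ -e "$log" ] || continue      # the glob matched nothing
            if [ -f "$target_dir/$log" ] && [ ! -L "$log" ]; then
                ln -sf "$target_dir/$log" "$log"
            fi
        done
    )
}
```

With a layout such as `kind-logs/containers/` next to `kind-logs/fluentbit-logs/`, calling `dedupe_with_symlinks kind-logs/containers ../fluentbit-logs` keeps every path under `containers/` in place while storing each overlapping log only once.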

Comment thread .github/actions/archive-artifacts/action.yaml
@SylvainSenechal SylvainSenechal self-requested a review March 18, 2026 10:21
@scality scality deleted a comment from bert-e Mar 19, 2026
@delthas
Contributor Author

delthas commented Mar 19, 2026

/approve

@bert-e
Contributor

bert-e commented Mar 19, 2026

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.14

The following branches will NOT be impacted:

  • development/2.10
  • development/2.11
  • development/2.12
  • development/2.13
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

@scality scality deleted a comment from bert-e Mar 20, 2026
@bert-e
Contributor

bert-e commented Mar 20, 2026

I have successfully merged the changeset of this pull request
into targeted development branches:

  • ✔️ development/2.14

The following branches have NOT changed:

  • development/2.10
  • development/2.11
  • development/2.12
  • development/2.13
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

Please check the status of the associated issue ZENKO-5216.

Goodbye delthas.

@bert-e bert-e merged commit 763d223 into development/2.14 Mar 20, 2026
52 of 56 checks passed
@bert-e bert-e deleted the improvement/ZENKO-5216/fluentbit-ci-logs branch March 20, 2026 10:50
4 participants