feat(metrics): add CollectorSourceDiscardedLogs alert for discarded source logs by vparfonov · Pull Request #3293 · openshift/cluster-logging-operator

vparfonov · 2026-05-21T13:04:37Z

Description

Add a warning alert that fires when Vector source components discard logs (e.g. lines exceeding max_line_bytes).
The alert groups by namespace, app_kubernetes_io_instance, component_id, and component_type so users can identify the affected log stream.

/cc @Clee2691
/assign @jcantrill

Links

Reference on PR(s): docs(runbooks): add CollectorSourceDiscardedLogs runbook runbooks#438
GitHub issue:
JIRA: https://redhat.atlassian.net/browse/LOG-6384
Enhancement proposal:

Summary by CodeRabbit

New Features
- Added a monitoring alert (severity: warning) that notifies when collector sources discard logs due to oversized lines.
Tests
- Added unit and functional tests to ensure the alert exists, targets source discards, and that the collector emits the discard metric when oversized lines occur.

coderabbitai · 2026-05-21T13:04:50Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a PrometheusRule alert for discarded source logs, moves the discarded-events metric into the alerts allowlist, and adds unit and functional tests validating the rule and that the collector emits the metric when lines exceed max_line_bytes.

Changes

Collector Source Discarded Logs

Layer / File(s)	Summary
Metric Allowlist Configuration `internal/metrics/relabel.go`	`vector_component_discarded_events_total` is moved into the "Metrics used in alerts" section of `collectorMinimalAllowlist` so alert expressions may reference it.
Alert Rule Definition `config/prometheus/collector_alerts.yaml`, `bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml`	New `CollectorSourceDiscardedLogs` alert fires when `irate(vector_component_discarded_events_total[2m])` for `component_kind="source"` > 0, with `for: 1m`, templated annotations referencing `max_line_bytes`, and labels `service: collector`, `severity: warning`.
Alert Validation Tests (unit) `internal/metrics/alerts_test.go`	Ginkgo tests load the PrometheusRule YAML, assert metrics referenced by alert expressions exist in `collectorMinimalAllowlist.allowedMetrics`, and validate `CollectorSourceDiscardedLogs` expression contents, grouping labels, and `severity: warning`.
Functional metrics test `test/functional/metrics/discarded_metrics_test.go`	Functional test configures collector `max_line_bytes` low, writes oversized audit log lines, creates necessary RBAC, and polls collector `/metrics` until `vector_component_discarded_events_total` with `component_kind="source"` appears.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble lines too long and watch the counter climb,
A rule now listens, measures rate, and waits a patient time,
The allowlist opens up the gate, tests search the metrics' sign,
Tiny max_line_bytes makes discarded counts align,
Hop—alerts blink bright; the rabbit leaves a rhyme.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding a new alert for discarded source logs, which matches the changeset's core objective.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description follows the template structure with a clear description of the feature, assigned reviewers and approvers, and relevant links to related issues and documentation.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-75: The test that iterates rule.Spec.Groups and checks the
"CollectorSourceDiscardedLogs" rule is missing an assertion for the grouped
label app_kubernetes_io_instance; inside the loop where r.Alert ==
"CollectorSourceDiscardedLogs" (using the expr variable from r.Expr.String()),
add an expectation like
Expect(expr).To(ContainSubstring("app_kubernetes_io_instance")) alongside the
existing namespace/component_id/component_type assertions so the grouping
contract is fully asserted.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f2254321-9b6f-4e3f-b6ae-524497669203

📥 Commits

Reviewing files that changed from the base of the PR and between c5945ee and 8a204c7.

📒 Files selected for processing (4)

bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
config/prometheus/collector_alerts.yaml
internal/metrics/alerts_test.go
internal/metrics/relabel.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-89: The tests iterate rule.Spec.Groups and check properties
when a rule with Alert == "CollectorSourceDiscardedLogs" is found but never
assert that such a rule actually existed; update both It blocks in
internal/metrics/alerts_test.go to set a found flag (e.g., foundCollector :=
false) while scanning rule.Spec.Groups/Rules for r.Alert ==
"CollectorSourceDiscardedLogs", perform the existing assertions inside that
conditional, then after the loop add Expect(foundCollector).To(BeTrue()) (or
equivalent) so the test fails when the alert is missing; reference the
rule.Spec.Groups and the alert name "CollectorSourceDiscardedLogs" when adding
the flag and final assertion.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a62fb8e0-4737-4a08-9abf-6ca6f7136916

📥 Commits

Reviewing files that changed from the base of the PR and between 8a204c7 and 0b47d55.

📒 Files selected for processing (4)

bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
config/prometheus/collector_alerts.yaml
internal/metrics/alerts_test.go
internal/metrics/relabel.go

jcantrill · 2026-05-21T18:22:32Z

/hold

jcantrill · 2026-05-21T18:23:49Z

Overall, looks good to me. @vparfonov Please confirm vector generates this metric

vparfonov · 2026-05-25T08:43:30Z

Overall, looks good to me. @vparfonov Please confirm vector generates this metric

added functional test

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 21-24: The test currently ignores the error returned by os.Getwd()
when setting up mdir; change the setup to check and fail on that error (e.g.,
assert/Expect that err is nil) before using mdir so setup failures are surfaced
early — specifically update the block that assigns mdir and err from os.Getwd()
and then uses path.Dir/path.Join (variables mdir and err) to call os.ReadFile,
adding an explicit Expect(err).NotTo(HaveOccurred()) or equivalent assertion
right after os.Getwd() and before modifying mdir.

In `@test/functional/metrics/discarded_metrics_test.go`:
- Around line 94-103: Change the Eventually closure to capture and assert the
RunCommand error instead of ignoring it: call metrics, err :=
framework.RunCommand(constants.CollectorName, "sh", "-c", grepDiscardCmd) inside
the func, then Assert the command succeeded with Expect(err).To(BeNil()) (or
otherwise fail the test immediately) before returning metrics; keep the
subsequent Should assertions (e.g., ContainSubstring checks) the same so each
poll both verifies successful scrape/auth and inspects /metrics output.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28386eb4-9cb5-49d6-a046-eb31d1d63d91

📥 Commits

Reviewing files that changed from the base of the PR and between 0b47d55 and 5c4b4af.

📒 Files selected for processing (5)

bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
config/prometheus/collector_alerts.yaml
internal/metrics/alerts_test.go
internal/metrics/relabel.go
test/functional/metrics/discarded_metrics_test.go

jcantrill · 2026-05-26T18:32:24Z

+		rule *monitoringv1.PrometheusRule
+	)
+
+	BeforeEach(func() {


Probably can change this to "BeforeAll" if all the subsequent tests rely on upon the same file

jcantrill · 2026-05-26T18:33:36Z

+		Expect(err).NotTo(HaveOccurred())
+	})
+
+	It("should parse as a valid PrometheusRule with alert groups", func() {


Should this be moved into the Before?

vparfonov · 2026-05-27T12:45:50Z

/retest

vparfonov · 2026-05-27T13:28:21Z

/test e2e-target

jcantrill · 2026-06-04T19:01:08Z

+      annotations:
+        description: |-
+          The collector source "{{ $labels.component_id }}" owned by ClusterLogForwarder "{{ $labels.namespace }}/{{ $labels.app_kubernetes_io_instance }}"
+          is discarding logs. This typically occurs when log lines exceed the configured max_line_bytes limit.


Suggested change

is discarding logs. This typically occurs when log lines exceed the configured max_line_bytes limit.

is discarding logs. This typically occurs when log lines exceed the configured maxMessageSize.

jcantrill · 2026-06-04T19:01:45Z

+          Collector source "{{ $labels.component_id }}" in namespace "{{ $labels.namespace }}" is discarding logs.
+      expr: |
+        sum by(namespace, app_kubernetes_io_instance, component_id, component_type)(irate(vector_component_discarded_events_total{component_kind="source"}[2m])) > 0
+      for: 1m


consider dropping so that it immediately fires since we expire metrics after 60s

…ource logs Add a warning alert that fires when Vector source components discard logs (e.g. lines exceeding max_line_bytes). The alert groups by namespace, component_id, and component_type so users can identify the affected log stream. Use increase() window 10m for better balance between responsiveness and reducing transient noise. Alert now uses both discard and error metrics with proper error code filtering. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jcantrill · 2026-06-05T19:42:52Z

/hold cancel
/approve
/lgtm

openshift-ci · 2026-06-05T19:43:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcantrill, vparfonov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jcantrill]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jcantrill · 2026-06-05T19:44:46Z

/retest

jcantrill · 2026-06-05T20:22:42Z

/verified by @jcantrill

openshift-ci-robot · 2026-06-05T20:22:53Z

@jcantrill: This PR has been marked as verified by @jcantrill.

Details

In response to this:

/verified by @jcantrill

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-06-06T02:37:11Z

@vparfonov: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from Clee2691 and cahartma May 21, 2026 13:05

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

Comment thread internal/metrics/alerts_test.go Outdated

vparfonov force-pushed the log6384 branch from 8a204c7 to 0b47d55 Compare May 21, 2026 13:42

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

Comment thread internal/metrics/alerts_test.go Outdated

openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2026

vparfonov force-pushed the log6384 branch from 0b47d55 to 5c4b4af Compare May 25, 2026 08:46

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

Comment thread internal/metrics/alerts_test.go Outdated

Comment thread test/functional/metrics/discarded_metrics_test.go

vparfonov force-pushed the log6384 branch from 5c4b4af to e8649e5 Compare May 25, 2026 09:39

jcantrill reviewed May 26, 2026

View reviewed changes

vparfonov force-pushed the log6384 branch from e8649e5 to 82bba44 Compare May 27, 2026 12:20

vparfonov requested a review from jcantrill May 27, 2026 20:04

jcantrill reviewed Jun 4, 2026

View reviewed changes

vparfonov force-pushed the log6384 branch from 113aff6 to 609c1b4 Compare June 5, 2026 17:21

openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2026

openshift-ci Bot assigned jcantrill Jun 5, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2026

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2026

jcantrill added the release/6.6 label Jun 5, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 5, 2026

	is discarding logs. This typically occurs when log lines exceed the configured max_line_bytes limit.
	is discarding logs. This typically occurs when log lines exceed the configured maxMessageSize.

Conversation

vparfonov commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Links

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jcantrill commented May 21, 2026

Uh oh!

jcantrill commented May 21, 2026

Uh oh!

vparfonov commented May 25, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jcantrill May 26, 2026

Choose a reason for hiding this comment

Uh oh!

jcantrill May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

vparfonov commented May 27, 2026

Uh oh!

vparfonov commented May 27, 2026

Uh oh!

jcantrill Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

jcantrill Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

jcantrill commented Jun 5, 2026

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

jcantrill commented Jun 5, 2026

Uh oh!

jcantrill commented Jun 5, 2026

Uh oh!

openshift-ci-robot commented Jun 5, 2026

Uh oh!

openshift-ci Bot commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vparfonov commented May 21, 2026 •

edited

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading