Skip to content

feat(metrics): add CollectorSourceDiscardedLogs alert for discarded source logs#3293

Open
vparfonov wants to merge 1 commit into
openshift:masterfrom
vparfonov:log6384
Open

feat(metrics): add CollectorSourceDiscardedLogs alert for discarded source logs#3293
vparfonov wants to merge 1 commit into
openshift:masterfrom
vparfonov:log6384

Conversation

@vparfonov
Copy link
Copy Markdown
Contributor

@vparfonov vparfonov commented May 21, 2026

Description

Add a warning alert that fires when Vector source components discard logs (e.g. lines exceeding max_line_bytes).
The alert groups by namespace, app_kubernetes_io_instance, component_id, and component_type so users can identify the affected log stream.

/cc @Clee2691
/assign @jcantrill

Links

Summary by CodeRabbit

  • New Features

    • Added a monitoring alert (severity: warning) that notifies when collector sources discard logs due to oversized lines.
  • Tests

    • Added unit and functional tests to ensure the alert exists, targets source discards, and that the collector emits the discard metric when oversized lines occur.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a PrometheusRule alert for discarded source logs, moves the discarded-events metric into the alerts allowlist, and adds unit and functional tests validating the rule and that the collector emits the metric when lines exceed max_line_bytes.

Changes

Collector Source Discarded Logs

Layer / File(s) Summary
Metric Allowlist Configuration
internal/metrics/relabel.go
vector_component_discarded_events_total is moved into the "Metrics used in alerts" section of collectorMinimalAllowlist so alert expressions may reference it.
Alert Rule Definition
config/prometheus/collector_alerts.yaml, bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
New CollectorSourceDiscardedLogs alert fires when irate(vector_component_discarded_events_total[2m]) for component_kind="source" > 0, with for: 1m, templated annotations referencing max_line_bytes, and labels service: collector, severity: warning.
Alert Validation Tests (unit)
internal/metrics/alerts_test.go
Ginkgo tests load the PrometheusRule YAML, assert metrics referenced by alert expressions exist in collectorMinimalAllowlist.allowedMetrics, and validate CollectorSourceDiscardedLogs expression contents, grouping labels, and severity: warning.
Functional metrics test
test/functional/metrics/discarded_metrics_test.go
Functional test configures collector max_line_bytes low, writes oversized audit log lines, creates necessary RBAC, and polls collector /metrics until vector_component_discarded_events_total with component_kind="source" appears.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble lines too long and watch the counter climb,
A rule now listens, measures rate, and waits a patient time,
The allowlist opens up the gate, tests search the metrics' sign,
Tiny max_line_bytes makes discarded counts align,
Hop—alerts blink bright; the rabbit leaves a rhyme.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding a new alert for discarded source logs, which matches the changeset's core objective.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description follows the template structure with a clear description of the feature, assigned reviewers and approvers, and relevant links to related issues and documentation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from Clee2691 and cahartma May 21, 2026 13:05
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-75: The test that iterates rule.Spec.Groups and checks the
"CollectorSourceDiscardedLogs" rule is missing an assertion for the grouped
label app_kubernetes_io_instance; inside the loop where r.Alert ==
"CollectorSourceDiscardedLogs" (using the expr variable from r.Expr.String()),
add an expectation like
Expect(expr).To(ContainSubstring("app_kubernetes_io_instance")) alongside the
existing namespace/component_id/component_type assertions so the grouping
contract is fully asserted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f2254321-9b6f-4e3f-b6ae-524497669203

📥 Commits

Reviewing files that changed from the base of the PR and between c5945ee and 8a204c7.

📒 Files selected for processing (4)
  • bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
  • config/prometheus/collector_alerts.yaml
  • internal/metrics/alerts_test.go
  • internal/metrics/relabel.go

Comment thread internal/metrics/alerts_test.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-89: The tests iterate rule.Spec.Groups and check properties
when a rule with Alert == "CollectorSourceDiscardedLogs" is found but never
assert that such a rule actually existed; update both It blocks in
internal/metrics/alerts_test.go to set a found flag (e.g., foundCollector :=
false) while scanning rule.Spec.Groups/Rules for r.Alert ==
"CollectorSourceDiscardedLogs", perform the existing assertions inside that
conditional, then after the loop add Expect(foundCollector).To(BeTrue()) (or
equivalent) so the test fails when the alert is missing; reference the
rule.Spec.Groups and the alert name "CollectorSourceDiscardedLogs" when adding
the flag and final assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a62fb8e0-4737-4a08-9abf-6ca6f7136916

📥 Commits

Reviewing files that changed from the base of the PR and between 8a204c7 and 0b47d55.

📒 Files selected for processing (4)
  • bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
  • config/prometheus/collector_alerts.yaml
  • internal/metrics/alerts_test.go
  • internal/metrics/relabel.go

Comment thread internal/metrics/alerts_test.go Outdated
@jcantrill
Copy link
Copy Markdown
Contributor

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2026
@jcantrill
Copy link
Copy Markdown
Contributor

Overall, looks good to me. @vparfonov Please confirm vector generates this metric

@vparfonov
Copy link
Copy Markdown
Contributor Author

Overall, looks good to me. @vparfonov Please confirm vector generates this metric

added functional test

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 21-24: The test currently ignores the error returned by os.Getwd()
when setting up mdir; change the setup to check and fail on that error (e.g.,
assert/Expect that err is nil) before using mdir so setup failures are surfaced
early — specifically update the block that assigns mdir and err from os.Getwd()
and then uses path.Dir/path.Join (variables mdir and err) to call os.ReadFile,
adding an explicit Expect(err).NotTo(HaveOccurred()) or equivalent assertion
right after os.Getwd() and before modifying mdir.

In `@test/functional/metrics/discarded_metrics_test.go`:
- Around line 94-103: Change the Eventually closure to capture and assert the
RunCommand error instead of ignoring it: call metrics, err :=
framework.RunCommand(constants.CollectorName, "sh", "-c", grepDiscardCmd) inside
the func, then Assert the command succeeded with Expect(err).To(BeNil()) (or
otherwise fail the test immediately) before returning metrics; keep the
subsequent Should assertions (e.g., ContainSubstring checks) the same so each
poll both verifies successful scrape/auth and inspects /metrics output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28386eb4-9cb5-49d6-a046-eb31d1d63d91

📥 Commits

Reviewing files that changed from the base of the PR and between 0b47d55 and 5c4b4af.

📒 Files selected for processing (5)
  • bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
  • config/prometheus/collector_alerts.yaml
  • internal/metrics/alerts_test.go
  • internal/metrics/relabel.go
  • test/functional/metrics/discarded_metrics_test.go

Comment thread internal/metrics/alerts_test.go Outdated
Comment thread test/functional/metrics/discarded_metrics_test.go
Comment thread internal/metrics/alerts_test.go Outdated
rule *monitoringv1.PrometheusRule
)

BeforeEach(func() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably can change this to "BeforeAll" if all the subsequent tests rely on upon the same file

Comment thread internal/metrics/alerts_test.go Outdated
Expect(err).NotTo(HaveOccurred())
})

It("should parse as a valid PrometheusRule with alert groups", func() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved into the Before?

Comment thread internal/metrics/alerts_test.go Outdated
@vparfonov
Copy link
Copy Markdown
Contributor Author

/retest

@vparfonov
Copy link
Copy Markdown
Contributor Author

/test e2e-target

@vparfonov vparfonov requested a review from jcantrill May 27, 2026 20:04
Comment thread config/prometheus/collector_alerts.yaml Outdated
annotations:
description: |-
The collector source "{{ $labels.component_id }}" owned by ClusterLogForwarder "{{ $labels.namespace }}/{{ $labels.app_kubernetes_io_instance }}"
is discarding logs. This typically occurs when log lines exceed the configured max_line_bytes limit.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is discarding logs. This typically occurs when log lines exceed the configured max_line_bytes limit.
is discarding logs. This typically occurs when log lines exceed the configured maxMessageSize.

Comment thread config/prometheus/collector_alerts.yaml Outdated
Collector source "{{ $labels.component_id }}" in namespace "{{ $labels.namespace }}" is discarding logs.
expr: |
sum by(namespace, app_kubernetes_io_instance, component_id, component_type)(irate(vector_component_discarded_events_total{component_kind="source"}[2m])) > 0
for: 1m
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider dropping so that it immediately fires since we expire metrics after 60s

…ource logs

Add a warning alert that fires when Vector source components discard
logs (e.g. lines exceeding max_line_bytes). The alert groups by
namespace, component_id, and component_type so users can identify
the affected log stream.

Use increase() window 10m for better balance between
responsiveness and reducing transient noise. Alert now uses both
discard and error metrics with proper error code filtering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jcantrill
Copy link
Copy Markdown
Contributor

/hold cancel
/approve
/lgtm

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2026
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcantrill, vparfonov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2026
@jcantrill
Copy link
Copy Markdown
Contributor

/retest

@jcantrill
Copy link
Copy Markdown
Contributor

/verified by @jcantrill

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 5, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@jcantrill: This PR has been marked as verified by @jcantrill.

Details

In response to this:

/verified by @jcantrill

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 6, 2026

@vparfonov: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. release/6.6 verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants