feat(metrics): add CollectorSourceDiscardedLogs alert for discarded source logs#3293
feat(metrics): add CollectorSourceDiscardedLogs alert for discarded source logs#3293vparfonov wants to merge 1 commit into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a PrometheusRule alert for discarded source logs, moves the discarded-events metric into the alerts allowlist, and adds unit and functional tests validating the rule and that the collector emits the metric when lines exceed max_line_bytes. ChangesCollector Source Discarded Logs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-75: The test that iterates rule.Spec.Groups and checks the
"CollectorSourceDiscardedLogs" rule is missing an assertion for the grouped
label app_kubernetes_io_instance; inside the loop where r.Alert ==
"CollectorSourceDiscardedLogs" (using the expr variable from r.Expr.String()),
add an expectation like
Expect(expr).To(ContainSubstring("app_kubernetes_io_instance")) alongside the
existing namespace/component_id/component_type assertions so the grouping
contract is fully asserted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f2254321-9b6f-4e3f-b6ae-524497669203
📒 Files selected for processing (4)
bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yamlconfig/prometheus/collector_alerts.yamlinternal/metrics/alerts_test.gointernal/metrics/relabel.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-89: The tests iterate rule.Spec.Groups and check properties
when a rule with Alert == "CollectorSourceDiscardedLogs" is found but never
assert that such a rule actually existed; update both It blocks in
internal/metrics/alerts_test.go to set a found flag (e.g., foundCollector :=
false) while scanning rule.Spec.Groups/Rules for r.Alert ==
"CollectorSourceDiscardedLogs", perform the existing assertions inside that
conditional, then after the loop add Expect(foundCollector).To(BeTrue()) (or
equivalent) so the test fails when the alert is missing; reference the
rule.Spec.Groups and the alert name "CollectorSourceDiscardedLogs" when adding
the flag and final assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a62fb8e0-4737-4a08-9abf-6ca6f7136916
📒 Files selected for processing (4)
bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yamlconfig/prometheus/collector_alerts.yamlinternal/metrics/alerts_test.gointernal/metrics/relabel.go
|
/hold |
|
Overall, looks good to me. @vparfonov Please confirm vector generates this metric |
added functional test |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 21-24: The test currently ignores the error returned by os.Getwd()
when setting up mdir; change the setup to check and fail on that error (e.g.,
assert/Expect that err is nil) before using mdir so setup failures are surfaced
early — specifically update the block that assigns mdir and err from os.Getwd()
and then uses path.Dir/path.Join (variables mdir and err) to call os.ReadFile,
adding an explicit Expect(err).NotTo(HaveOccurred()) or equivalent assertion
right after os.Getwd() and before modifying mdir.
In `@test/functional/metrics/discarded_metrics_test.go`:
- Around line 94-103: Change the Eventually closure to capture and assert the
RunCommand error instead of ignoring it: call metrics, err :=
framework.RunCommand(constants.CollectorName, "sh", "-c", grepDiscardCmd) inside
the func, then Assert the command succeeded with Expect(err).To(BeNil()) (or
otherwise fail the test immediately) before returning metrics; keep the
subsequent Should assertions (e.g., ContainSubstring checks) the same so each
poll both verifies successful scrape/auth and inspects /metrics output.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 28386eb4-9cb5-49d6-a046-eb31d1d63d91
📒 Files selected for processing (5)
bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yamlconfig/prometheus/collector_alerts.yamlinternal/metrics/alerts_test.gointernal/metrics/relabel.gotest/functional/metrics/discarded_metrics_test.go
| rule *monitoringv1.PrometheusRule | ||
| ) | ||
|
|
||
| BeforeEach(func() { |
There was a problem hiding this comment.
Probably can change this to "BeforeAll" if all the subsequent tests rely on upon the same file
| Expect(err).NotTo(HaveOccurred()) | ||
| }) | ||
|
|
||
| It("should parse as a valid PrometheusRule with alert groups", func() { |
There was a problem hiding this comment.
Should this be moved into the Before?
|
/retest |
|
/test e2e-target |
| annotations: | ||
| description: |- | ||
| The collector source "{{ $labels.component_id }}" owned by ClusterLogForwarder "{{ $labels.namespace }}/{{ $labels.app_kubernetes_io_instance }}" | ||
| is discarding logs. This typically occurs when log lines exceed the configured max_line_bytes limit. |
There was a problem hiding this comment.
| is discarding logs. This typically occurs when log lines exceed the configured max_line_bytes limit. | |
| is discarding logs. This typically occurs when log lines exceed the configured maxMessageSize. |
| Collector source "{{ $labels.component_id }}" in namespace "{{ $labels.namespace }}" is discarding logs. | ||
| expr: | | ||
| sum by(namespace, app_kubernetes_io_instance, component_id, component_type)(irate(vector_component_discarded_events_total{component_kind="source"}[2m])) > 0 | ||
| for: 1m |
There was a problem hiding this comment.
consider dropping so that it immediately fires since we expire metrics after 60s
…ource logs Add a warning alert that fires when Vector source components discard logs (e.g. lines exceeding max_line_bytes). The alert groups by namespace, component_id, and component_type so users can identify the affected log stream. Use increase() window 10m for better balance between responsiveness and reducing transient noise. Alert now uses both discard and error metrics with proper error code filtering. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/hold cancel |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jcantrill, vparfonov The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
/verified by @jcantrill |
|
@jcantrill: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@vparfonov: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Description
Add a warning alert that fires when Vector source components discard logs (e.g. lines exceeding
max_line_bytes).The alert groups by
namespace,app_kubernetes_io_instance,component_id, andcomponent_typeso users can identify the affected log stream./cc @Clee2691
/assign @jcantrill
Links
Summary by CodeRabbit
New Features
Tests