feat(sources): add source latency metric #24987
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7646d065be
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e287c34ee
    let send_batch_start = Instant::now();

    for events in array::events_into_arrays(events, Some(CHUNK_SIZE)) {
        self.send(events, &mut unsent_event_count)
            .await
Measure batch latency at channel-send boundaries
source_send_batch_latency_seconds is intended (and documented) as downstream channel blocking time, but the timer starts before the loop and wraps self.send(...), which includes per-event lag metric emission and metadata mutation before send_with_timeout. On large/complex batches, this inflates the histogram with source-side CPU work and can mislead backpressure debugging or alerting even when channel wait time is unchanged.
Yes, we discussed that in the previous comment. I think it is good to have a measurement that encompasses most of the logic shared across all sources, to get a better idea of how slow the source is to respond. We can combine source_send_latency_seconds and source_send_batch_latency_seconds to measure the cost of that source-side CPU work alone.
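As a rough illustration of that combination, here is a minimal sketch that subtracts the summed per-send waits from the batch duration. The function name source_side_overhead and the sample values are hypothetical and not part of this PR; with aggregated histograms, the same subtraction would presumably be applied to the summed series rather than to individual samples.

```rust
use std::time::Duration;

/// Hypothetical helper: estimates the source-side work inside one batch by
/// removing the time spent waiting on individual sends from the batch total.
fn source_side_overhead(batch_latency: Duration, send_latencies: &[Duration]) -> Duration {
    // Total time attributed to the per-array sends.
    let channel_wait: Duration = send_latencies.iter().sum();
    // Saturate at zero so that timer jitter can never cause a panic on underflow.
    batch_latency.saturating_sub(channel_wait)
}

fn main() {
    // Made-up sample values, for illustration only.
    let sends = [Duration::from_millis(40), Duration::from_millis(35)];
    let batch = Duration::from_millis(90);
    let overhead = source_side_overhead(batch, &sends);
    println!("source-side overhead: {:?}", overhead);
}
```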
Summary
This PR generalizes the metric http_server_handler_duration_seconds to all sources by adding metrics that give insight into how long batches of events take to be handled, in order to help debug latency and backpressure issues.
It adds two distributions, source_send_latency_seconds and source_send_batch_latency_seconds, recording the time spent waiting for a single array/event, and for a full payload, to be pulled by the buffer.
For the sake of simplicity, those metrics are added directly to the source_sender's Output object so that they are transparently available in all existing sources. This means that the metrics account neither for the processing time of the source itself (decoding and enriching the events) nor for the time elapsed before returning an ack. The former is usually difficult to record without a complex refactoring of the source implementation, because decoding is buried in the source reader channel, but the computing time used by the source itself is usually small compared to the delay caused by downstream latency. The latter (acks), however, might be significant when using E2E acks, and since not all sources support acknowledgments (and those that do implement them differently), it will probably require a case-by-case implementation.
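The relationship between the two distributions can be sketched with a standard-library bounded channel. This is a simplified illustration, not the code in this PR: LatencySamples and send_batch are hypothetical names, a blocking std::sync::mpsc channel stands in for the async source sender, and plain vectors stand in for the histograms.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the two distributions.
#[derive(Default, Debug)]
struct LatencySamples {
    send_latency: Vec<Duration>,       // one sample per chunk sent
    send_batch_latency: Vec<Duration>, // one sample per full payload
}

/// Sends `events` in chunks over a bounded channel, timing each send and the whole batch.
/// `chunk_size` must be non-zero.
fn send_batch(
    tx: &SyncSender<Vec<u64>>,
    events: Vec<u64>,
    chunk_size: usize,
    samples: &mut LatencySamples,
) {
    let batch_start = Instant::now();
    for chunk in events.chunks(chunk_size) {
        let send_start = Instant::now();
        // Blocks while the channel is full, i.e. while downstream applies backpressure.
        tx.send(chunk.to_vec()).expect("receiver dropped");
        samples.send_latency.push(send_start.elapsed());
    }
    samples.send_batch_latency.push(batch_start.elapsed());
}

fn main() {
    // Capacity of 1 so that a slow consumer quickly blocks the sender.
    let (tx, rx) = sync_channel::<Vec<u64>>(1);
    let consumer = thread::spawn(move || {
        let mut received = 0;
        for chunk in rx {
            thread::sleep(Duration::from_millis(5)); // simulated slow downstream
            received += chunk.len();
        }
        received
    });

    let mut samples = LatencySamples::default();
    send_batch(&tx, (0..10).collect(), 3, &mut samples);
    drop(tx); // close the channel so the consumer loop ends
    let received = consumer.join().unwrap();

    println!("received {} events", received);
    println!("per-send samples: {:?}", samples.send_latency);
    println!("batch samples: {:?}", samples.send_batch_latency);
}
```

In this sketch the batch sample is always at least the sum of the per-send samples, since the batch timer starts before the first send and stops after the last one.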
Vector configuration
n/a