fix(topology): Fix for issue causing stalling on shutdown for sinks configured with disk buffers #24949
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb5630a0fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review this PR
💡 Codex Review
Reviewed commit: a62a56a08b
@graphcareful any movement in this fix?
💡 Codex Review
Reviewed commit: 522821e093
The latest round of changes sends the cancel signal to the sinks to force them to shut down gracefully, instead of pausing and then reconnecting buffers. A bit of extra code was added to make it possible to wait for a sink to fully shut down: there was previously no way to know when the retain count of a buffer would drop to 0, because fanout.rs performed delayed shutdown of channels in the background with no way to wait on that work's completion. I added that after observing that manual testing was fine but my new unit tests failed, so something must have been racing and making things flaky.
tobz left a comment:
Seems straightforward enough to me if it fixes the glitch. 👍🏻
@codex review
💡 Codex Review
Reviewed commit: 9f6ea320f0
- Changing a disk buffer's configuration (e.g. `max_size`) during a
live config reload would stall indefinitely or fail with a `buffer.lock` error.
This was caused by two issues:
1. The sink's detach trigger was only cancelled for buffer-reuse
cases, so sinks with changed disk buffer configs never had their input
stream terminated — disk buffer readers do not return `None` when the
writer disconnects, causing the old sink task to hang forever.
2. The source output pump only processed fanout control messages
(Remove/Pause) during active sends, so idle sources would never drop their
`BufferSender` clone, keeping the `Arc<Ledger>` file lock alive even after the
sink task completed.
- Cancel the detach trigger for changed disk buffer sinks so the old
sink task can complete.
- Make the source output pump `select!` on both events and fanout
control messages, so Remove/Pause is processed promptly even when the source
is idle.
- Add a retry loop (30s timeout) when acquiring the disk buffer lock
to handle the small race window between the sink task completing and the
fanout releasing the writer.
- Add `BufferConfig::has_disk_stage()` helper for identifying disk
buffer configurations.
@codex review
💡 Codex Review
Reviewed commit: 79c27c3290
```rust
if reuse_buffers.contains(key) || changed_disk_buffer_sinks.contains(key) {
    self.detach_triggers
        .remove(key)
        .unwrap()
        .into_inner()
        .cancel();
}
```
Restrict sink cancel to avoid dropping buffered events
This unconditionally calls cancel() for every changed sink with a prior disk stage, even when this is a normal reload path and the buffer is not being reused. Because sink input is wrapped in take_until_if(tripwire), cancellation stops consumption immediately; later, when reuse_buffers is false, the old receiver is dropped instead of handed off, so queued records are not gracefully drained. In practical terms, changing a sink from disk-backed buffering to a different topology (for example disk→memory) can lose in-flight buffered data that previously continued draining in the detached old sink.
I'm not so sure about this one; it will require more research to confirm or deny.
Here are my thoughts on this:

> Because sink input is wrapped in take_until_if(tripwire), cancellation stops consumption immediately

Yes, consumption is stopped, but not abruptly: the stream will still flush partial batches.

> so queued records are not gracefully drained

In this specific case, where we are calling cancel on a disk buffer sink, it is guaranteed that all queued records are gracefully drained, because the stream will not end until the buffer has been emptied.
Summary
This PR includes two fixes for the shutdown of sinks configured with buffers. The first issue is observed when reloading a config that contains a sink with a disk buffer: Vector will wait until `batch.timeout_secs` has elapsed, which causes buffers to flush and unblocks the reload process. The fix is to send the `cancel()` signal to the sink so it doesn't block on its buffer not being flushed downstream.

The second fix is for an issue with the same root cause. I noticed the same hang on Ctrl-C with the `aws_s3` sink. I employed the same solution there, calling `cancel()` early in the `stop()` method.

Vector configuration
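The configuration used for testing is collapsed in the original PR. A minimal illustrative config matching the description (an HTTP source feeding an `aws_s3` sink with a disk buffer; every name and value here is made up) might look like:

```toml
[sources.http_in]
type = "http_server"
address = "0.0.0.0:8080"

[sinks.s3_out]
type = "aws_s3"
inputs = ["http_in"]
bucket = "my-test-bucket"   # hypothetical bucket name
region = "us-east-1"

  [sinks.s3_out.buffer]
  type = "disk"
  max_size = 268435456      # modify and re-save to trigger a reload
```

Editing the `buffer` settings and re-saving the file triggers the live-reload path that previously stalled.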
How did you test this PR?
By using the config above and a simple HTTP traffic generator: modifying the sink buffer, re-saving the file, and looking for errors from the source.
Change Type
Is this a breaking change?
Does this PR include user-facing changes?
References
- Datadog and Prometheus and S3 sink hangs forever (#17666)