fix(topology): Fix for issue causing stalling on shutdown for sinks configured with disk buffers #24949
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb5630a0fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review this PR
💡 Codex Review
Reviewed commit: a62a56a08b
@graphcareful any movement in this fix?
💡 Codex Review
Reviewed commit: 522821e093
The latest round of changes sends the cancel signal to the sinks to force them to shut down gracefully, instead of pausing and then reconnecting buffers. A bit of extra code was added to make it possible to wait for a sink to fully shut down: there was previously no way to know when the retain count of a buffer would drop to 0, because fanout.rs performed delayed shutdown of channels in the background with no way to wait on that work's completion. I added that after observing that manual testing was fine but my new unit tests failed, so something must have been racing and making things flaky.
tobz left a comment:
Seems straightforward enough to me if it fixes the glitch. 👍🏻
@codex review
💡 Codex Review
Reviewed commit: 9f6ea320f0
- Changing a disk buffer's configuration (e.g. `max_size`) during a
live config reload would stall indefinitely or fail with a `buffer.lock` error.
This was caused by two issues:
1. The sink's detach trigger was only cancelled for buffer-reuse
cases, so sinks with changed disk buffer configs never had their input
stream terminated — disk buffer readers do not return `None` when the
writer disconnects, causing the old sink task to hang forever.
2. The source output pump only processed fanout control messages
(Remove/Pause) during active sends, so idle sources would never drop their
`BufferSender` clone, keeping the `Arc<Ledger>` file lock alive even after the
sink task completed.
- Cancel the detach trigger for changed disk buffer sinks so the old
sink task can complete.
- Make the source output pump `select!` on both events and fanout
control messages, so Remove/Pause is processed promptly even when the source
is idle.
- Add a retry loop (30s timeout) when acquiring the disk buffer lock
to handle the small race window between the sink task completing and the
fanout releasing the writer.
- Add `BufferConfig::has_disk_stage()` helper for identifying disk
buffer configurations.
@codex review
💡 Codex Review
Reviewed commit: 79c27c3290
```rust
if reuse_buffers.contains(key) || changed_disk_buffer_sinks.contains(key) {
    self.detach_triggers
        .remove(key)
        .unwrap()
        .into_inner()
        .cancel();
}
```
Restrict sink cancel to avoid dropping buffered events
This unconditionally calls cancel() for every changed sink with a prior disk stage, even when this is a normal reload path and the buffer is not being reused. Because sink input is wrapped in take_until_if(tripwire), cancellation stops consumption immediately; later, when reuse_buffers is false, the old receiver is dropped instead of handed off, so queued records are not gracefully drained. In practical terms, changing a sink from disk-backed buffering to a different topology (for example disk→memory) can lose in-flight buffered data that previously continued draining in the detached old sink.
I'm not so sure about this one; it will require more research to confirm or deny.
Here are my thoughts on this:

> Because sink input is wrapped in take_until_if(tripwire), cancellation stops consumption immediately

Yes, consumption is stopped, but not abruptly: the stream will still flush partial batches.

> so queued records are not gracefully drained

In this specific case, where we are calling cancel on a disk buffer sink, it is guaranteed that all queued records are gracefully drained, because the stream will not end until the buffer has been emptied.
Summary
This PR includes two fixes for the shutdown of sinks configured with buffers. The first issue is observed when reloading a config that contains a sink with a disk buffer: Vector will wait until `batch.timeout_secs` has elapsed, which causes buffers to flush and unblocks the reload process. The fix is to send the `cancel()` signal to the sink so it doesn't block on its buffer not being flushed downstream.

The second fix is for an issue with the same root cause. I noticed the same hang on Ctrl-C with the `aws_s3` sink. I employed the same solution there, calling `cancel()` early in the `stop()` method.

Vector configuration
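The configuration used for testing is collapsed in the original PR. A minimal illustrative config matching the description (an HTTP source feeding an `aws_s3` sink with a disk buffer; every name and value here is made up) might look like:

```toml
[sources.http_in]
type = "http_server"
address = "0.0.0.0:8080"

[sinks.s3_out]
type = "aws_s3"
inputs = ["http_in"]
bucket = "my-test-bucket"   # hypothetical bucket name
region = "us-east-1"

  [sinks.s3_out.buffer]
  type = "disk"
  max_size = 268435456      # modify and re-save to trigger a reload
```

Editing the `buffer` settings and re-saving the file triggers the live-reload path that previously stalled.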
How did you test this PR?
By using the config above and a simple HTTP traffic generator: modifying the sink buffer, re-saving the file, and looking for errors from the source.
Change Type
Is this a breaking change?
Does this PR include user-facing changes?
References
- Datadog and Prometheus and S3 sink hangs forever (#17666)