Skip to content

Expand vector-buffers property model to include crashes, writebacks#25639

Draft
blt wants to merge 1 commit into
blt/antithesis_vector_end-to-end_ack_scenariofrom
blt/expand_the_vector-buffers_property_model_to_include_crashes_os_writes
Draft

Expand vector-buffers property model to include crashes, writebacks#25639
blt wants to merge 1 commit into
blt/antithesis_vector_end-to-end_ack_scenariofrom
blt/expand_the_vector-buffers_property_model_to_include_crashes_os_writes

Conversation

@blt

@blt blt commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

This commit expands on the property model for disk buffers v2, taking inspiration from the paper All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. I have expanded Action to have two new variants, Writeback and Crash. Crash is the simpler, it causes the PBT to simulate a restart of the buffer. The filesystem is updated to have a notion of 'durable' writes, those writes which are not durable are dropped after a crash action and restart. Writeback allows us to simulate OS non-determinism around when mmaps are flushed to disk, directories are synced etc without explicit syscalls to force that. Similar also to the ALICE paper the PBT Filesystem is updated to have a notion of atomicity in writes, whether either block sized or sector sized.

The model remains ideal, that is, the model is built such that all writes are durable once a writer ack is sent backward. This means the test now fails because the SUT does not behave this way, now easily demonstrated. A minimal failure sequence:

[
  WriteRecord(Record { id: 0, size: 0, event_count: 1, encoded_len: 12, archived_len: 64, .. }),
  Crash,
  FlushWrites,
]

Currently disk buffers v2 are not crash safe. This was flagged by our antithesis test introduced in PR #25562 and is reproduced here also, although these faults are hidden behind the more cheap-to-reproduce failure above.

I believe we are missing a few things in the SUT:

  • synchronization: of directory changes, that is, file names do not become durable until synced
  • either bulk commits on flush interval with ack handlers not releasing until that commit or flush interval is dropped entirely with acks == sync, which is very slow.

Vector configuration

How did you test this PR?

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

blt commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@datadog-vectordotdev

datadog-vectordotdev Bot commented Jun 16, 2026

Copy link
Copy Markdown

Tests

⚠️ Warnings

🧪 1 Test failed

variants::disk_v2::tests::model::model_check from vector-buffers   View in Datadog  
Test has failed

ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 8a11daa | Docs | Give us feedback!

@blt blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 6efa8f9 to bfe43c7 Compare June 16, 2026 20:21
@blt blt force-pushed the blt/expand_the_vector-buffers_property_model_to_include_crashes_os_writes branch 2 times, most recently from 9ed8d56 to b8dfc5c Compare June 16, 2026 20:44
@blt blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from bfe43c7 to 36555b5 Compare June 16, 2026 20:44
This commit expands on the property model for disk buffers v2, taking
inspiration from the paper [All File Systems Are Not Created Equal: On
the Complexity of Crafting Crash-Consistent
Applications](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai).
I have expanded `Action` to have two new variants, `Writeback` and
`Crash`. Crash is the simpler, it causes the PBT to simulate a restart
of the buffer. The filesystem is updated to have a notion of 'durable'
writes, those writes which are not durable are dropped after a crash
action and restart. Writeback allows us to simulate OS non-determinism
around when mmaps are flushed to disk, directories are synced etc
without explicit syscalls to force that.  Similar also to the ALICE
paper the PBT Filesystem is updated to have a notion of atomicity in
writes, whether either block sized or sector sized.

The model remains ideal, that is, the model is built such that all
writes are durable once a writer ack is sent backward. This means the
test now fails because the SUT does not behave this way, now easily
demonstrated. A minimal failure sequence:

```
[
  WriteRecord(Record { id: 0, size: 0, event_count: 1, encoded_len: 12, archived_len: 64, .. }),
  Crash,
  FlushWrites,
]
```

Currently disk buffers v2 are not crash safe. This was flagged by our
antithesis test introduced in PR #25562 and is reproduced here also,
although these faults are hidden behind the more cheap-to-reproduce
failure above.

I believe we are missing a few things in the SUT:

* synchronization: of directory changes, that is, file names do not
  become durable until synced
* _either_ bulk commits on flush interval with ack handlers not
 releasing until that commit _or_ flush interval is dropped entirely with
 acks == sync, which is very slow.
@blt blt force-pushed the blt/expand_the_vector-buffers_property_model_to_include_crashes_os_writes branch from b8dfc5c to 8a11daa Compare June 16, 2026 20:58
@blt blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 36555b5 to 97cc74d Compare June 16, 2026 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant