iceberg: reduce per-batch overhead in write_data_files#35179
Merged
DAlperin merged 1 commit intoMaterializeInc:mainfrom Feb 24, 2026
Merged
iceberg: reduce per-batch overhead in write_data_files#35179DAlperin merged 1 commit intoMaterializeInc:mainfrom
DAlperin merged 1 commit intoMaterializeInc:mainfrom
Conversation
|
Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone. PR title guidelines
Pre-merge checklist
|
f4ac92a to
9f86c0c
Compare
Member
Author
|
@codex review |
|
Codex Review: Didn't find any major issues. Can't wait for the next one! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
martykulma
approved these changes
Feb 24, 2026
src/storage/src/sink/iceberg.rs
Outdated
Comment on lines
1104
to
1108
Contributor
There was a problem hiding this comment.
Looking at this again, I don't think you need any of these clones.
Box the DeltaWriter values stored in the in_flight_batches HashMap. DeltaWriter is a large struct (containing multiple sub-writers, a 100k-entry seen_rows HashMap, VecDeque, Arrow schemas, etc.), and storing it inline means every HashMap insert, resize, and extract_if memcpy's the entire struct. Boxing reduces this to moving an pointer. Also hoist invariant work out of the per-batch create_delta_writer closure: - Position delete and equality delete Iceberg schemas (arrow_schema_to_schema) are now computed once upfront rather than on every batch. - EqualityDeleteWriterConfig (which internally calls schema_to_arrow_schema and builds a RecordBatchProjector) is now constructed once and cloned, avoiding redundant schema conversions per batch.
9f86c0c to
8afd475
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Box the DeltaWriter values stored in the in_flight_batches HashMap. DeltaWriter is a large struct (containing multiple sub-writers, a 100k-entry seen_rows HashMap, VecDeque, Arrow schemas, etc.), and storing it inline means every HashMap insert, resize, and extract_if memcpy's the entire struct. Boxing reduces this to moving an pointer.
Also hoist invariant work out of the per-batch create_delta_writer closure:
(The update to iceberg-rs just includes a commit that makes
EqualityDeleteWriterConfigClone)