-
Notifications
You must be signed in to change notification settings - Fork 76
feat: add partition summary and write added/existing/deleted entries to manifest writer #317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
138cee5 to
7a4c48e
Compare
|
Some unclear points while porting Java's TestManifestWriterVersions:
I've temporarily commented out some failing assertions, but the PR seems ready for review, just pay special attention to the above two questions. |
src/iceberg/manifest_writer.h
Outdated
| /// \param entry Manifest entry to write. | ||
| /// \return Status::OK() if entry was written successfully | ||
| Status Add(const ManifestEntry& entry); | ||
| Status AddEntry(const ManifestEntry& entry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding following functions to deal with different use cases:
// Write the entry that all its fields are populated correctly.
Status WriteEntry(const ManifestEntry& entry);
// Write the input as a new entry to add.
Status WriteAddedEntry(std::shared_ptr<DataFile> file,
std::optional<int64_t> data_sequence_number);
Status WriteAddedEntry(const ManifestEntry& entry);
// Write the input as an existing entry.
Status WriteExistingEntry(std::shared_ptr<DataFile> file, int64_t file_snapshot_id,
int64_t data_sequence_number,
std::optional<int64_t> file_sequence_number);
Status WriteExistingEntry(const ManifestEntry& entry);
// Write the input as a deleted entry.
Status WriteDeletedEntry(std::shared_ptr<DataFile> file, int64_t data_sequence_number,
std::optional<int64_t> file_sequence_number);
Status WriteDeletedEntry(const ManifestEntry& entry);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WriteEntry can be the delegate function to accept writes from other functions. Inside each WriteXXXEntry function, we can just create a new entry and the copy should be cheap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I tried this, please take a look again.
including AddEntry, AddExistingEntry, AddDeleteEntry
7a4c48e to
e0d03e6
Compare
src/iceberg/manifest_adapter.cc
Outdated
| entry.snapshot_id = snapshot_id_; | ||
| if (entry.sequence_number.has_value() && | ||
| entry.sequence_number.value() < TableMetadata::kInitialSequenceNumber) { | ||
| entry.sequence_number = std::nullopt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment why sequence_number should be initialized here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic has been removed since GetSequenceNumber handle this, the reason for it been null is because for newly added data, its sequence number is inherited from manifest metadata, see [1].
[1] https://iceberg.apache.org/spec/#sequence-number-inheritance
src/iceberg/manifest_adapter.cc
Outdated
| manifest_file.added_rows_count = add_rows_count_; | ||
| manifest_file.existing_rows_count = existing_rows_count_; | ||
| manifest_file.deleted_rows_count = delete_rows_count_; | ||
| manifest_file.partitions = std::move(partition_summary_->Summaries()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please clarify the ownership movement. Please add a comment why it is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been changed to:
ICEBERG_ASSIGN_OR_RAISE(auto partition_summary, partition_summary_->Summaries());
manifest_file.partitions = std::move(partition_summary);
I agree at the version of you comment, the move is not necessary, I guess meson build would warn under warning_level=2. But after changing, the move can avoid an extra copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After seeing Gang's comment, I change it to the following:
ICEBERG_ASSIGN_OR_RAISE(manifest_file.partitions, partition_summary_->Summaries());
src/iceberg/manifest_list.h
Outdated
|
|
||
| /// \brief Get the relative manifest content type from name | ||
| ICEBERG_EXPORT constexpr Result<ManifestFile::Content> ManifestFileContentFromString( | ||
| ICEBERG_EXPORT constexpr Result<ManifestContent> ManifestContentFromString( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make it a static function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused here, are you worried that this function will be included in multiple translation units?
If so, constexpr implies inline, so this should be fine. We already use this pattern in several other places, correct me if I'm wrong.
|
|
||
| Status Update(const Literal& value); | ||
|
|
||
| PartitionFieldSummary Finish() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment.
|
|
||
| private: | ||
| /// \brief Create a PartitionSummary with the given field stats. | ||
| explicit PartitionSummary(std::vector<PartitionFieldStats> field_stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we hide the ctor? Please add a comment for reasoning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason was there was a Make method, but now I remove the Make and bring the ctor to public.
src/iceberg/v3_metadata.cc
Outdated
|
|
||
| Status ManifestFileAdapterV3::Append(const ManifestFile& file) { | ||
| Status ManifestFileAdapterV3::Append(ManifestFile& file) { | ||
| // TODO(zhjwpku): Should we set sequence_number/first_row_id here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please file an issue and use the tracking id instead of your id. I know other code already used personal id, but it's not helpful to follow what to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These TODOs were meant to attract reviewers' closer attention. I've removed them now.
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! I think it generally looks good now. I've left some inline comments.
- unify partition summary collection - fix manifest reader to inherit metadata - fix first row id round trip
|
Thanks @zhjwpku for working on this and adding sufficient test cases! I've pushed some changes to fix first_row_id and sequence_number inheritance. Thanks for your review, @Jinchul81! Do you have any other comment? |
including AddEntry, AddExistingEntry, AddDeleteEntry