feat(aws_s3 sink): add Parquet encoder with schema_file and auto infer schema support#25156
feat(aws_s3 sink): add Parquet encoder with schema_file and auto infer schema support#25156petere-datadog wants to merge 24 commits intomasterfrom
Conversation
Keep JSON-based build_record_batch/find_null_non_nullable_fields for Parquet compatibility. Drop unused serde_arrow dep. Regenerate Cargo.lock. Made-with: Cursor
|
All contributors have signed the CLA ✍️ ✅ |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1eaa921f96
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 03a4544905
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
Hey @petere-datadog, while I am taking a look at this PR please see this #25156 (comment) and the codex review comments. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ca201c5078
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6dcc854103
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08b9b547f4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
recheck |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2c3a74d8d3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d6c9c2ea9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if json_values.is_empty() { | ||
| return Ok(()); | ||
| } |
There was a problem hiding this comment.
Reject Parquet batches that serialize to zero rows
When a non-empty batch contains only events that fail serde_json::to_value (for example logs with non-finite floats), json_values becomes empty and this branch returns success. The request-builder path then proceeds with an empty payload and finalizes the whole batch as delivered, which silently drops all events and can create empty .parquet objects. This should return an error (or otherwise skip request creation) whenever input events were present but no rows were encodable.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
It's not really an error if there's no encodable events, if anything this should be handled upstream
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0f8dde57d0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
recheck |
|
I have read the CLA Document and I hereby sign the CLA |
pront
left a comment
There was a problem hiding this comment.
Some things that stood out. I will take another look.
|
Hey, we are currently deploying vector on your UAT instances to test few workflows. One of our use-case consists of having logs in the parquet format with gzip compression. I did check some documentation but couldn't find any good resources on it. Could you let me know when this PR is going to be merged or if there's any resource that I can read on how to set-up. |
tessneau
left a comment
There was a problem hiding this comment.
nice ! overall seems good to me, just some non-blockers, thanks for all the tests
| if !self.schema_field_names.contains(top_level.as_str()) { | ||
| return Err(Box::new(ArrowEncodingError::SchemaFetchError { |
There was a problem hiding this comment.
maybe we should emit the events dropped here, it feels a bit ambiguous since the user is choosing strict mode so it may be expected that not all events match the schema but since we'd be dropping the whole batch I think it makes sense
| if !self.schema_field_names.contains(top_level.as_str()) { | |
| return Err(Box::new(ArrowEncodingError::SchemaFetchError { | |
| if !self.schema_field_names.contains(top_level.as_str()) { | |
| self.events_dropped_handle.emit(Count(events.len())); | |
| return Err(Box::new(ArrowEncodingError::SchemaFetchError { |
There was a problem hiding this comment.
So I looked and we don't emit dropped events metric when encode returns an err and that's a problem I think, we should be emitting that higher up and I think we should fix this there. If I update that specific metric count here then I have to do it everywehre we use a "?" and that's not clean at all. So for now I'd rather not emit dropped events metric after schema validation fails, but we definitely need to include that upstream
There was a problem hiding this comment.
All points are correct here 😅 Yes, ideally we would have this emitted upstream but that is a project on its own. Practically, the strict-mode path knows exactly how many events are being dropped and should emit the metric. We are already emitting event droppped events in this function already, so it would be inconsistent not emitting here.
I understand there are many cases that can fail here so we can wrap this whole logic like so:
fn encode(&mut self, events: Vec<Event>, buffer: &mut BytesMut) -> Result<(), Self::Error> {
if events.is_empty() {
return Ok(());
}
let count = events.len();
let result = self.try_encode(events, buffer);
if result.is_err() {
self.events_dropped_handle.emit(Count(count));
}
result
}This way, we will emit the metric no matter what error happens.
There was a problem hiding this comment.
Hmm the AutoInfer and build_record_batch paths already handle it themselves, maybe the simplest solution is to handle it manually here then in all paths.
There was a problem hiding this comment.
yeah I think we can revisit this in another PR, do we know anyone from vector/documentaiton who can approve this?
There was a problem hiding this comment.
Hi @petere-datadog, since dropped events is actually an important aspect of Vector's observability contract, I'd like to see this addressed before we merge this PR. It's not a big lift, we just need to cover the error paths in this function that drop events, starting with the one @tessneau pointed out.
It should be merged later today and yeah this will support encoding batched events with parquet and gzip compression |
…ik/aws-s3-parquet-encoding
| futures.workspace = true | ||
| influxdb-line-protocol = { version = "2", default-features = false } | ||
| lookup = { package = "vector-lookup", path = "../vector-lookup", default-features = false, features = ["test"] } | ||
| lookup = { package = "vector-lookup", path = "../vector-lookup", default-features = false, features = [ |
There was a problem hiding this comment.
Nit: please revert unrelated formatting changes
Summary
This PR was initially started by @szibis: #24706
Vector configuration
auto-infer.yaml
schema-file.yaml
apache-common.schema
How did you test this PR?
Change Type
Is this a breaking change?
All the new stuff is added under features: codec-parquet.
Does this PR include user facing changes?
no-changeloglabel to this PR.Notes
@vectordotdev/vectorto reach out to us regarding this PR.pre-pushhook, please see this template.make fmtmake check-clippy(if there are failures it's possible some of them can be fixed withmake clippy-fix)make testgit merge origin masterandgit push.Cargo.lock), pleaserun
make build-licensesto regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.