
feat(aws_s3 sink): add Parquet encoder with schema_file and auto infer schema support#25156

Open

petere-datadog wants to merge 24 commits into master from peter.ehik/aws-s3-parquet-encoding

Conversation

@petere-datadog

@petere-datadog petere-datadog commented Apr 9, 2026

Summary

  • Added support for parquet encoding for the aws s3 sink
  • Options supported:
    • schema_mode: auto_infer, relaxed, strict
    • schema_file: Only needed if schema mode is not auto_infer
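
For completeness, the third mode could be sketched as follows. This is a hypothetical fragment based on the options above (a `relaxed` mode, like `strict`, takes a `schema_file`); the path is a placeholder, not a real file from this PR:

```yaml
# Hypothetical fragment: `relaxed` mode also requires a schema_file,
# per the options above. The path below is a placeholder.
batch_encoding:
  codec: parquet
  schema_mode: relaxed
  schema_file: /path/to/events.schema
```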

This PR was initially started by @szibis: #24706

Vector configuration

auto-infer.yaml

sources:
  demo:
    type: demo_logs
    format: apache_common
    interval: 0.1

sinks:
  s3_parquet:
    type: aws_s3
    inputs:
      - demo
    bucket: obs-pipelines-e2e-tests
    region: us-east-1
    key_prefix: "peter-test/demo_logs/dt=%Y%m%d/hour=%H/"
    filename_time_format: "%s"
    filename_append_uuid: true
    compression: none  # Parquet handles its own compression internally

    # Standard per-event encoding is still required by the field even when
    # batch_encoding takes over. Set it to text as a no-op placeholder.
    encoding:
      codec: text

    batch_encoding:
      codec: parquet
      schema_mode: auto_infer
      compression:
        algorithm: gzip
        level: 9

    batch:
      max_events: 10000
      timeout_secs: 5

schema-file.yaml

sources:
  demo:
    type: demo_logs
    format: apache_common
    interval: 0.1

sinks:
  s3_parquet:
    type: aws_s3
    inputs:
      - demo
    bucket: obs-pipelines-e2e-tests
    region: us-east-1
    key_prefix: "peter-test/demo_logs/dt=%Y%m%d/hour=%H/"
    filename_time_format: "%s"
    filename_append_uuid: true
    compression: none  # Parquet handles its own compression internally

    # Standard per-event encoding is still required by the field even when
    # batch_encoding takes over. Set it to text as a no-op placeholder.
    encoding:
      codec: text

    batch_encoding:
      codec: parquet
      schema_mode: strict
      schema_file: /Users/peter.ehikhuemen/go/src/github.com/DataDog/vectordotdev/vector/local/apache-common.schema
      compression:
        algorithm: snappy

    batch:
      max_events: 10000
      timeout_secs: 5

apache-common.schema

message arrow_schema {
  optional binary host (STRING);
  optional binary message (STRING);
  optional binary service (STRING);
  optional binary source_type (STRING);
  optional int64 timestamp (TIMESTAMP(MICROS,true));
}
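
For intuition, the strict-mode validation discussed in the review comments below boils down to checking each event's top-level fields against the fields declared in a schema file like the one above. A minimal, std-only Rust sketch; the name `validate_strict` is illustrative, not the PR's actual API:

```rust
use std::collections::HashSet;

// Illustrative sketch (not the PR's actual code): strict mode rejects any
// event that carries a top-level field not declared in the Parquet schema.
fn validate_strict(schema_fields: &HashSet<&str>, event_fields: &[&str]) -> Result<(), String> {
    for field in event_fields {
        if !schema_fields.contains(field) {
            return Err(format!("field `{field}` is not present in the schema"));
        }
    }
    Ok(())
}

fn main() {
    let schema: HashSet<&str> =
        ["host", "message", "service", "source_type", "timestamp"].into();
    // These fields all appear in apache-common.schema, so validation passes.
    assert!(validate_strict(&schema, &["host", "message", "timestamp"]).is_ok());
    // One undeclared field fails the whole batch in strict mode.
    assert!(validate_strict(&schema, &["host", "extra_field"]).is_err());
    println!("strict-mode sketch ok");
}
```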

How did you test this PR?

cargo run --features "codecs-parquet" --  --config local/configs/aws-s3-sink-parquet-encoding.yaml

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

All of the new functionality is gated behind the codec-parquet feature.

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector's dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

Keep JSON-based build_record_batch/find_null_non_nullable_fields for
Parquet compatibility. Drop unused serde_arrow dep. Regenerate Cargo.lock.

Made-with: Cursor
@petere-datadog petere-datadog requested review from a team as code owners April 9, 2026 19:32
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@github-actions github-actions bot added labels domain: sinks, domain: ci, and domain: external docs on Apr 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1eaa921f96

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 03a4544905.

@pront
Member

pront commented Apr 10, 2026

Hey @petere-datadog, while I take a look at this PR, please see this #25156 (comment) and the Codex review comments.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit ca201c5078.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 6dcc854103.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 08b9b547f4.

@datadog-vectordotdev

datadog-vectordotdev bot commented Apr 10, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 589f483

@petere-datadog petere-datadog changed the title from "feat(codecs): add Parquet encoder with schema_file and schema_mode: strict, relaxed or auto_infer" to "feat(aws_s3_sink): add Parquet encoder with schema_file and auto infer schema support" on Apr 10, 2026
@petere-datadog petere-datadog changed the title from "feat(aws_s3_sink): add Parquet encoder with schema_file and auto infer schema support" to "feat(aws_s3 sink): add Parquet encoder with schema_file and auto infer schema support" on Apr 10, 2026
@petere-datadog
Author

recheck

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 2c3a74d8d3.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 8d6c9c2ea9.

Comment on lines +297 to +299

if json_values.is_empty() {
    return Ok(());
}

P1 Badge Reject Parquet batches that serialize to zero rows

When a non-empty batch contains only events that fail serde_json::to_value (for example logs with non-finite floats), json_values becomes empty and this branch returns success. The request-builder path then proceeds with an empty payload and finalizes the whole batch as delivered, which silently drops all events and can create empty .parquet objects. This should return an error (or otherwise skip request creation) whenever input events were present but no rows were encodable.

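
The guard this review suggests can be sketched as follows. This is a minimal illustration, not the PR's code; `check_encodable` is a hypothetical helper that fails the batch when events arrived but none produced a row:

```rust
// Illustrative sketch of the suggested guard (check_encodable is hypothetical):
// succeed on a genuinely empty batch, but error when events were received
// and none of them serialized to a Parquet row.
fn check_encodable(input_events: usize, encoded_rows: usize) -> Result<(), String> {
    if input_events > 0 && encoded_rows == 0 {
        return Err(format!(
            "{input_events} events received but none were encodable to Parquet rows"
        ));
    }
    Ok(())
}

fn main() {
    assert!(check_encodable(0, 0).is_ok()); // empty batch: nothing to encode
    assert!(check_encodable(10, 10).is_ok()); // all rows encoded
    assert!(check_encodable(10, 0).is_err()); // would otherwise silently drop events
    println!("guard sketch ok");
}
```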

Author

It's not really an error if there are no encodable events; if anything, this should be handled upstream.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review: reviewed commit 0f8dde57d0.

@petere-datadog
Author

recheck

@petere-datadog
Author

I have read the CLA Document and I hereby sign the CLA

Member

@pront pront left a comment


Some things that stood out. I will take another look.

@Himan10

Himan10 commented Apr 14, 2026

Hey, we are currently deploying Vector on our UAT instances to test a few workflows. One of our use cases is having logs in Parquet format with gzip compression. I checked some documentation but couldn't find any good resources on it. Could you let me know when this PR is going to be merged, or whether there's any resource I can read on how to set it up?

Contributor

@tessneau tessneau left a comment


Nice! Overall this seems good to me, just some non-blockers. Thanks for all the tests.

Comment on lines +308 to +309

if !self.schema_field_names.contains(top_level.as_str()) {
    return Err(Box::new(ArrowEncodingError::SchemaFetchError {
Contributor

Maybe we should emit the events-dropped metric here. It feels a bit ambiguous, since the user is choosing strict mode it may be expected that not all events match the schema, but since we'd be dropping the whole batch I think it makes sense.

Suggested change

if !self.schema_field_names.contains(top_level.as_str()) {
    self.events_dropped_handle.emit(Count(events.len()));
    return Err(Box::new(ArrowEncodingError::SchemaFetchError {
Author

So I looked, and we don't emit the dropped-events metric when encode returns an Err, and that's a problem, I think; we should be emitting that higher up, and we should fix it there. If I update that specific metric count here, then I have to do it everywhere we use a "?", and that's not clean at all. So for now I'd rather not emit the dropped-events metric after schema validation fails, but we definitely need to include that upstream.

Author

cc: @pront

Member

All points are correct here 😅 Yes, ideally we would have this emitted upstream, but that is a project on its own. Practically, the strict-mode path knows exactly how many events are being dropped and should emit the metric. We are already emitting dropped-events elsewhere in this function, so it would be inconsistent not to emit here.

I understand there are many cases that can fail here, so we can wrap this whole logic like so:

fn encode(&mut self, events: Vec<Event>, buffer: &mut BytesMut) -> Result<(), Self::Error> {
    if events.is_empty() {
        return Ok(());
    }

    let count = events.len();
    let result = self.try_encode(events, buffer);
    if result.is_err() {
        self.events_dropped_handle.emit(Count(count));
    }
    result
}

This way, we will emit the metric no matter what error happens.

Member

Hmm, the AutoInfer and build_record_batch paths already handle it themselves; maybe the simplest solution is then to handle it manually here in all paths.

Author

Yeah, I think we can revisit this in another PR. Do we know anyone from vector/documentation who can approve this?

Member

Hi @petere-datadog, since dropped events are actually an important aspect of Vector's observability contract, I'd like to see this addressed before we merge this PR. It's not a big lift; we just need to cover the error paths in this function that drop events, starting with the one @tessneau pointed out.

@petere-datadog
Author

Hey, we are currently deploying vector on your UAT instances to test few workflows. One of our use-case consists of having logs in the parquet format with gzip compression. I did check some documentation but couldn't find any good resources on it. Could you let me know when this PR is going to be merged or if there's any resource that I can read on how to set-up.

It should be merged later today, and yes, this will support encoding batched events with Parquet and gzip compression.

@maycmlee maycmlee self-assigned this Apr 14, 2026
futures.workspace = true
influxdb-line-protocol = { version = "2", default-features = false }
lookup = { package = "vector-lookup", path = "../vector-lookup", default-features = false, features = ["test"] }
lookup = { package = "vector-lookup", path = "../vector-lookup", default-features = false, features = [
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please revert unrelated formatting changes


Labels

domain: ci, domain: external docs, domain: sinks, under_review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants