@rorylshanks rorylshanks commented Dec 12, 2025

Summary

This PR adds Apache Parquet encoding support to the AWS S3 sink, enabling Vector to write columnar Parquet files optimized for analytics workloads.

Parquet is a columnar storage format that provides efficient compression and encoding, making it ideal for long-term storage and query performance with tools like AWS Athena, Apache Spark, and Presto. This implementation allows users to write properly formatted Parquet files with configurable schemas, compression, and row group sizing.

Key features:

  • Complete Parquet encoder implementation with Apache Arrow integration
  • YAML schema configuration support (field names → data types)
  • Support for all common data types (strings, integers, floats, timestamps, booleans, etc.)
  • Configurable compression algorithms (snappy, gzip, zstd, lz4, brotli)
  • Row group size control for query parallelization
  • Nullable field support
  • Comprehensive test suite (9 unit tests)
  • Full documentation for schema configuration and Parquet options
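The "field names → data types" schema mapping can be pictured as a small lookup from type names to column types. The sketch below is illustrative only and is not taken from the PR's code: `FieldType`, `parse_field_type`, and `parse_schema` are hypothetical names, and of the type names shown only `utf8` and `timestamp_microsecond` appear in the configuration in this PR; `int64`, `float64`, and `boolean` are assumed spellings.

```rust
use std::collections::BTreeMap;

/// Hypothetical subset of column types a schema entry could name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldType {
    Utf8,
    Int64,
    Float64,
    Boolean,
    TimestampMicrosecond,
}

/// Parse one type name as it might be written in the YAML schema.
/// Only "utf8" and "timestamp_microsecond" come from the PR's example;
/// the other spellings are assumptions made for this sketch.
pub fn parse_field_type(name: &str) -> Result<FieldType, String> {
    match name {
        "utf8" => Ok(FieldType::Utf8),
        "int64" => Ok(FieldType::Int64),
        "float64" => Ok(FieldType::Float64),
        "boolean" => Ok(FieldType::Boolean),
        "timestamp_microsecond" => Ok(FieldType::TimestampMicrosecond),
        other => Err(format!("unsupported field type: {}", other)),
    }
}

/// Resolve a whole `field name -> type name` mapping,
/// failing on the first type name that is not recognized.
pub fn parse_schema(raw: &[(&str, &str)]) -> Result<BTreeMap<String, FieldType>, String> {
    let mut schema = BTreeMap::new();
    for (field, type_name) in raw {
        schema.insert(field.to_string(), parse_field_type(type_name)?);
    }
    Ok(schema)
}
```

Rejecting unknown type names when the configuration is loaded, rather than when the first batch is encoded, would surface schema typos before any events are consumed.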

Vector configuration

sources:
  events:
    type: kafka
    bootstrap_servers: "kafka:9092"
    topics:
      - events

transforms:
  prepare:
    inputs:
      - events
    type: remap
    source: |
      parsed = parse_json(.message) ?? {}
      .uuid = parsed.uuid
      .properties = parsed.properties
      

sinks:
  s3_events:
    type: aws_s3
    inputs:
      - prepare
    bucket: my-bucket
    region: us-east-1
    compression: none  # Parquet handles compression internally

    batch:
      max_events: 50000
      timeout_secs: 60

    encoding:
      codec: parquet
      parquet:
        compression: zstd
        allow_nullable_fields: true
        schema:
          timestamp: timestamp_microsecond
          uuid: utf8
          properties: utf8
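One plausible reading of `allow_nullable_fields` is that an event missing a schema field is written with a null in that column instead of being rejected. The sketch below illustrates that reading only; it is an assumption about the option's semantics, and `project_event` is a hypothetical helper, not part of the PR.

```rust
use std::collections::HashMap;

/// Align one event with the schema's column order.
/// Assumed semantics: a missing field becomes `None` (a null) when
/// `allow_nullable_fields` is true, and an error otherwise.
pub fn project_event(
    columns: &[&str],
    event: &HashMap<String, String>,
    allow_nullable_fields: bool,
) -> Result<Vec<Option<String>>, String> {
    let mut row = Vec::with_capacity(columns.len());
    for column in columns {
        match event.get(*column) {
            Some(value) => row.push(Some(value.clone())),
            None if allow_nullable_fields => row.push(None),
            None => return Err(format!("missing required field: {}", column)),
        }
    }
    Ok(row)
}
```

With the schema above, an event that carries `uuid` but no `properties` would yield `[Some(..), None]` under this reading, and an error if nullable fields were disabled.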

How did you test this PR?

I tested it against production Kafka data, and it produced correctly formatted Parquet files in S3.
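A cheap first sanity check on an object downloaded from S3, short of opening it in Athena or Spark, is to inspect the Parquet framing: an unencrypted Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The helper names below are hypothetical, and this sketch does not validate the schema or the data pages.

```rust
/// Magic bytes that open and close an unencrypted Parquet file.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Return true if the buffer has the leading and trailing Parquet magic.
/// The smallest possible framing is magic + footer length + magic = 12 bytes.
pub fn looks_like_parquet(bytes: &[u8]) -> bool {
    bytes.len() >= 12
        && bytes.starts_with(PARQUET_MAGIC)
        && bytes.ends_with(PARQUET_MAGIC)
}

/// Read the footer metadata length stored just before the trailing magic.
pub fn footer_len(bytes: &[u8]) -> Option<u32> {
    if !looks_like_parquet(bytes) {
        return None;
    }
    let start = bytes.len() - 8;
    let raw = [
        bytes[start],
        bytes[start + 1],
        bytes[start + 2],
        bytes[start + 3],
    ];
    Some(u32::from_le_bytes(raw))
}
```

A file that passes this check can still be malformed, so it complements rather than replaces reading the file back with a real Parquet reader.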

Change Type

  • Bug fix
  • New feature (Parquet encoder for AWS S3 sink)
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

@rorylshanks rorylshanks requested review from a team as code owners December 12, 2025 13:07
@github-actions github-actions bot added the domain: sinks, domain: codecs, and domain: external docs labels Dec 12, 2025
@rorylshanks rorylshanks changed the title from "Added parquet encoding to Vector AWS S3 Output" to "feat(aws_s3 sink): Add Apache Parquet encoder support" Dec 12, 2025

github-actions bot commented Dec 12, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@rorylshanks
Author

I have read the CLA Document and I hereby sign the CLA

@drichards-87 drichards-87 self-assigned this Dec 12, 2025
@drichards-87 drichards-87 removed their assignment Dec 12, 2025
@rorylshanks rorylshanks marked this pull request as draft December 14, 2025 15:34
@rorylshanks rorylshanks marked this pull request as ready for review December 16, 2025 06:34
@github-actions github-actions bot added the domain: ci label Dec 22, 2025
Contributor

@thomasqueirozb thomasqueirozb left a comment

Hey @rorylshanks, thanks for your contribution! It looks like there are failing checks (run make check-clippy, for example). This also fails to compile after merging master, because you removed BatchSerializerConfig::build, which is used by the clickhouse sink. I'll circle back to this PR and review it once I see commits pushed to this branch.

@thomasqueirozb thomasqueirozb added the meta: awaiting author label Dec 22, 2025
Co-authored-by: Thomas <thomasqueirozb@gmail.com>
@github-actions github-actions bot removed the meta: awaiting author label Dec 23, 2025
@thomasqueirozb thomasqueirozb added the meta: awaiting author label Dec 23, 2025
@rorylshanks rorylshanks requested a review from a team as a code owner December 24, 2025 12:02
@github-actions github-actions bot removed the meta: awaiting author label Dec 24, 2025
Labels

domain: ci (Anything related to Vector's CI environment)
domain: codecs (Anything related to Vector's codecs (encoding/decoding))
domain: external docs (Anything related to Vector's external, public documentation)
domain: sinks (Anything related to the Vector's sinks)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support parquet columnar format in the aws_s3 sink

3 participants