
Add JavaParquetSink execution operator for the Java platform #731

Merged
zkaoudi merged 2 commits into apache:main from gknz:feature/java-parquet-sink
Apr 16, 2026

Conversation


@gknz gknz commented Mar 23, 2026

Java Parquet Sink

Hey everyone! I built a Java platform execution operator for the existing ParquetSink logical operator, which previously only had a Spark implementation (SparkParquetSink). Now, the optimizer can choose between Java and Spark when writing Parquet files.

What this PR adds

JavaParquetSink (wayang-java/operators/)

  • Writes Wayang Records to Parquet files using the parquet-avro library
  • Follows the same pattern/logic as SparkParquetSink
  • Infers Avro schema automatically by sampling up to 50 records
  • Uses RecordType field names when available, falls back to field0, field1, etc.
  • Uses Snappy compression
  • Handles file overwrite mode
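
The schema-inference step described above (sampling up to 50 records, mapping Java types to Avro types, and falling back to field0, field1, … when no RecordType names are available) could be sketched roughly like this in plain Java. The class and method names here are illustrative, not the PR's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the schema-inference idea: sample up to 50
// records, map each column's Java value type to an Avro primitive type
// name, and generate fallback column names when the RecordType carries
// none. Not the PR's actual implementation.
public class SchemaInferenceSketch {

    static final int SAMPLE_SIZE = 50;

    // Map a sampled Java value to an Avro primitive type name.
    static String inferAvroType(Object value) {
        if (value instanceof Integer) return "int";
        if (value instanceof Long) return "long";
        if (value instanceof Double) return "double";
        if (value instanceof Float) return "float";
        if (value instanceof Boolean) return "boolean";
        return "string"; // default for String and anything else
    }

    // Infer one type per column from up to SAMPLE_SIZE sampled rows,
    // using the first non-null value seen in each column.
    static List<String> inferColumnTypes(List<Object[]> rows) {
        int limit = Math.min(rows.size(), SAMPLE_SIZE);
        int width = rows.get(0).length;
        List<String> types = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            String type = "string";
            for (int row = 0; row < limit; row++) {
                Object v = rows.get(row)[col];
                if (v != null) {
                    type = inferAvroType(v);
                    break;
                }
            }
            types.add(type);
        }
        return types;
    }

    // Fallback names (field0, field1, ...) when no RecordType names exist.
    static List<String> fallbackFieldNames(int width) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < width; i++) names.add("field" + i);
        return names;
    }
}
```

In the real operator the inferred types would feed into an Avro Schema (e.g. via parquet-avro) rather than a list of strings; the sketch only shows the sampling and fallback-naming logic.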

ParquetSinkMapping (wayang-java/mapping/)

  • Connects the logical ParquetSink to JavaParquetSink
  • Registered in Java platform Mappings.java

Fluent API (DataQuantaBuilder.scala)

  • Added a writeParquet method to DataQuantaBuilder so users can call it from the fluent API (from my understanding this was missing; only the DataQuanta.scala layer had it)

Unit Tests (JavaParquetSinkTest.java)

  • testWriteStringRecords: verifies basic write and read-back
  • testWriteMixedTypeRecords: verifies type inference (Int, String, Double)
  • testWriteWithRecordType: verifies column names from RecordType are preserved in the Parquet schema
  • testOverwriteExistingFile: verifies overwrite mode replaces data
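
The overwrite behavior exercised by testOverwriteExistingFile could, in the simplest case, amount to removing any existing target file before the Parquet writer opens it. This is a minimal sketch under that assumption; the PR's actual handling may differ:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch of overwrite handling: if the target file already
// exists, delete it before opening the writer so the new data fully
// replaces the old; otherwise fail fast. Illustrative only.
public class OverwriteSketch {

    static void prepareTarget(Path target, boolean overwrite) throws IOException {
        if (Files.exists(target)) {
            if (!overwrite) {
                throw new IOException("Target exists and overwrite is disabled: " + target);
            }
            Files.delete(target);
        }
    }
}
```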

Testing

All 4 unit tests pass. I also tested the operator in a different pipeline (not included here) that writes hourly aggregated data to Parquet and reads it back for further processing, and it worked as intended.

- Add JavaParquetSink: writes Records to Parquet using parquet-avro
  with Snappy compression and automatic schema inference
- Add ParquetSinkMapping and register in Java Mappings
- Add writeParquet to DataQuantaBuilder for Java fluent API
- Add JavaParquetSinkTest with 4 unit tests
Comment thread on .gitignore (outdated):
# Scala Plugin for VSCode
.metals
.bloop/
.metals/
Contributor


maybe we do not need the metals twice

Author


Hello @zkaoudi, I removed the duplicate.

Contributor

zkaoudi commented Apr 16, 2026

Thank you @gknz and congrats on your first contribution :)

@zkaoudi zkaoudi merged commit 3335ef0 into apache:main Apr 16, 2026
5 of 6 checks passed