
Add JavaParquetSink execution operator for the Java platform #731

Merged
zkaoudi merged 2 commits into apache:main from gknz:feature/java-parquet-sink
Apr 16, 2026

Conversation


@gknz gknz commented Mar 23, 2026

Java Parquet Sink

Hey everyone! I built a Java platform execution operator for the existing ParquetSink logical operator, which previously only had a Spark implementation (SparkParquetSink). Now, the optimizer can choose between Java and Spark when writing Parquet files.

What this PR adds

JavaParquetSink (wayang-java/operators/)

  • Writes Wayang Records to Parquet files using the parquet-avro library
  • Follows the same pattern/logic as SparkParquetSink
  • Infers Avro schema automatically by sampling up to 50 records
  • Uses RecordType field names when available, falls back to field0, field1, etc.
  • Uses Snappy compression
  • Handles file overwrite mode
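
The schema-inference step described above (sampling up to 50 records, mapping Java types to Avro types, and falling back to field0, field1, … when no RecordType names are available) could be sketched roughly like this in plain Java. The class and method names here are illustrative, not the PR's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the schema-inference idea: sample up to 50
// records, map each column's Java value type to an Avro primitive type
// name, and generate fallback column names when the RecordType carries
// none. Not the PR's actual implementation.
public class SchemaInferenceSketch {

    static final int SAMPLE_SIZE = 50;

    // Map a sampled Java value to an Avro primitive type name.
    static String inferAvroType(Object value) {
        if (value instanceof Integer) return "int";
        if (value instanceof Long) return "long";
        if (value instanceof Double) return "double";
        if (value instanceof Float) return "float";
        if (value instanceof Boolean) return "boolean";
        return "string"; // default for String and anything else
    }

    // Infer one type per column from up to SAMPLE_SIZE sampled rows,
    // using the first non-null value seen in each column.
    static List<String> inferColumnTypes(List<Object[]> rows) {
        int limit = Math.min(rows.size(), SAMPLE_SIZE);
        int width = rows.get(0).length;
        List<String> types = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            String type = "string";
            for (int row = 0; row < limit; row++) {
                Object v = rows.get(row)[col];
                if (v != null) {
                    type = inferAvroType(v);
                    break;
                }
            }
            types.add(type);
        }
        return types;
    }

    // Fallback names (field0, field1, ...) when no RecordType names exist.
    static List<String> fallbackFieldNames(int width) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < width; i++) names.add("field" + i);
        return names;
    }
}
```

In the real operator the inferred types would feed into an Avro Schema (e.g. via parquet-avro) rather than a list of strings; the sketch only shows the sampling and fallback-naming logic.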

ParquetSinkMapping (wayang-java/mapping/)

  • Connects the logical ParquetSink to JavaParquetSink
  • Registered in Java platform Mappings.java

Fluent API (DataQuantaBuilder.scala)

  • Added a writeParquet method to DataQuantaBuilder so users can call it from the fluent API (from my understanding this was missing; only the DataQuanta.scala layer had it)

Unit Tests (JavaParquetSinkTest.java)

  • testWriteStringRecords: verifies basic write and read-back
  • testWriteMixedTypeRecords: verifies type inference (Int, String, Double)
  • testWriteWithRecordType: verifies column names from RecordType are preserved in the Parquet schema
  • testOverwriteExistingFile: verifies overwrite mode replaces data
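
The overwrite behavior exercised by testOverwriteExistingFile could, in the simplest case, amount to removing any existing target file before the Parquet writer opens it. This is a minimal sketch under that assumption; the PR's actual handling may differ:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch of overwrite handling: if the target file already
// exists, delete it before opening the writer so the new data fully
// replaces the old; otherwise fail fast. Illustrative only.
public class OverwriteSketch {

    static void prepareTarget(Path target, boolean overwrite) throws IOException {
        if (Files.exists(target)) {
            if (!overwrite) {
                throw new IOException("Target exists and overwrite is disabled: " + target);
            }
            Files.delete(target);
        }
    }
}
```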

Testing

All 4 unit tests pass. I also tested the operator in a different pipeline (not included here) that writes hourly aggregated data to Parquet and reads it back for further processing, and it worked as intended.

- Add JavaParquetSink: writes Records to Parquet using parquet-avro
  with Snappy compression and automatic schema inference
- Add ParquetSinkMapping and register in Java Mappings
- Add writeParquet to DataQuantaBuilder for Java fluent API
- Add JavaParquetSinkTest with 4 unit tests
Comment thread on .gitignore (outdated):
# Scala Plugin for VSCode
.metals
.bloop/
.metals/
Contributor


maybe we do not need the metals twice

Author


Hello @zkaoudi, I removed the duplicate.

Contributor

zkaoudi commented Apr 16, 2026

Thank you @gknz and congrats on your first contribution :)

@zkaoudi zkaoudi merged commit 3335ef0 into apache:main Apr 16, 2026
5 of 6 checks passed