refactor(parquet-datasource): split sink and schema_coercion out of file_format.rs#22347
Merged
adriangb merged 2 commits intoMay 19, 2026
Merged
Conversation
…ile_format.rs Pure code motion, no behavior change. `file_format.rs` had grown to ~2,000 LOC bundling several responsibilities; this extracts two of them into focused modules (file_format.rs drops to ~660 LOC): - `sink.rs` — `ParquetSink` and the parallel-write machinery (`column_serializer_task`, `spawn_column_parallel_row_group_writer`, `output_single_parquet_file_parallelized`, `concatenate_parallel_row_groups`, etc.). - `schema_coercion.rs` — the Arrow-schema coercion utilities (`apply_file_schema_type_coercions`, `coerce_int96_to_resolution`, `coerce_file_schema_to_view_type`, `coerce_file_schema_to_string_type`, `transform_schema_to_view`, `transform_binary_to_string`, `field_with_new_type`) and their tests. No public API change. Every previously-public item is still reachable at the same path: the crate root re-exports `sink::ParquetSink` and the `schema_coercion::*` functions, and the historical `file_format::ParquetSink` path is preserved via `pub use` (datafusion-proto depends on it). Split out of apache#22156. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
@xudong963 wonder if you could review this refactor? |
xudong963
approved these changes
May 19, 2026
Comment on lines
+37
to
+38
| pub mod schema_coercion; | ||
| pub mod sink; |
Contributor
Author
There was a problem hiding this comment.
no, no need to have them public. good catch, thank you
These modules were declared `pub mod`, which exposed two new public module paths (`sink::*`, `schema_coercion::*`) that did not exist before this refactor. Since every public item is already re-exported at the crate root via `pub use`, declaring them `mod` keeps the PR a true no-public-API-change code motion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Relates to the discussion in #22024 about the Parquet datasource crate becoming hard to navigate. Split out of #22156, which bundled several code-motion moves into one PR — this is one of three smaller, independently-reviewable PRs that replace it.
Rationale for this change
file_format.rshad grown to ~2,000 LOC, bundling several distinct responsibilities into one file. That makes it hard to read and hard to review changes in isolation. This PR is pure code motion: no behavior change and no public API change.What changes are included in this PR?
Extracts two responsibilities from
file_format.rsinto focused modules (file_format.rsdrops to ~660 LOC):sink.rs—ParquetSinkand the parallel-write machinery (column_serializer_task,spawn_column_parallel_row_group_writer,output_single_parquet_file_parallelized,concatenate_parallel_row_groups, etc.).schema_coercion.rs— the Arrow-schema coercion utilities (apply_file_schema_type_coercions,coerce_int96_to_resolution,coerce_file_schema_to_view_type,coerce_file_schema_to_string_type,transform_schema_to_view,transform_binary_to_string,field_with_new_type) and their tests.Every previously-public item is still reachable at the same path: the crate root re-exports
sink::ParquetSinkand theschema_coercion::*functions, and the historicalfile_format::ParquetSinkpath is preserved viapub use(datafusion-proto depends on it).Are these changes tested?
Yes, covered by existing tests (the
coerce_int96_to_resolution_*tests moved with the function toschema_coercion.rs).cargo test -p datafusion-datasource-parquet --all-features(122 passing) andcargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warningsboth pass.datafusion-proto(a downstreamParquetSinkconsumer) builds clean.Are there any user-facing changes?
No. Public API is unchanged — every previously-public item is still reachable at the same crate-root path. The only difference is the file organization inside the crate.
🤖 Generated with Claude Code