Skip to content

refactor(parquet-datasource): split sink and schema_coercion out of file_format.rs#22347

Merged
adriangb merged 2 commits into
apache:mainfrom
adriangb:refactor/split-parquet-file-format
May 19, 2026
Merged

refactor(parquet-datasource): split sink and schema_coercion out of file_format.rs#22347
adriangb merged 2 commits into
apache:mainfrom
adriangb:refactor/split-parquet-file-format

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Relates to the discussion in #22024 about the Parquet datasource crate becoming hard to navigate. Split out of #22156, which bundled several code-motion moves into one PR — this is one of three smaller, independently-reviewable PRs that replace it.

Rationale for this change

file_format.rs had grown to ~2,000 LOC, bundling several distinct responsibilities into one file. That makes it hard to read and hard to review changes in isolation. This PR is pure code motion: no behavior change and no public API change.

What changes are included in this PR?

Extracts two responsibilities from file_format.rs into focused modules (file_format.rs drops to ~660 LOC):

  • sink.rsParquetSink and the parallel-write machinery (column_serializer_task, spawn_column_parallel_row_group_writer, output_single_parquet_file_parallelized, concatenate_parallel_row_groups, etc.).
  • schema_coercion.rs — the Arrow-schema coercion utilities (apply_file_schema_type_coercions, coerce_int96_to_resolution, coerce_file_schema_to_view_type, coerce_file_schema_to_string_type, transform_schema_to_view, transform_binary_to_string, field_with_new_type) and their tests.

Every previously-public item is still reachable at the same path: the crate root re-exports sink::ParquetSink and the schema_coercion::* functions, and the historical file_format::ParquetSink path is preserved via pub use (datafusion-proto depends on it).

Are these changes tested?

Yes, covered by existing tests (the coerce_int96_to_resolution_* tests moved with the function to schema_coercion.rs). cargo test -p datafusion-datasource-parquet --all-features (122 passing) and cargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warnings both pass. datafusion-proto (a downstream ParquetSink consumer) builds clean.

Are there any user-facing changes?

No. Public API is unchanged — every previously-public item is still reachable at the same crate-root path. The only difference is the file organization inside the crate.

🤖 Generated with Claude Code

…ile_format.rs

Pure code motion, no behavior change. `file_format.rs` had grown to
~2,000 LOC bundling several responsibilities; this extracts two of
them into focused modules (file_format.rs drops to ~660 LOC):

- `sink.rs` — `ParquetSink` and the parallel-write machinery
  (`column_serializer_task`, `spawn_column_parallel_row_group_writer`,
  `output_single_parquet_file_parallelized`,
  `concatenate_parallel_row_groups`, etc.).
- `schema_coercion.rs` — the Arrow-schema coercion utilities
  (`apply_file_schema_type_coercions`, `coerce_int96_to_resolution`,
  `coerce_file_schema_to_view_type`, `coerce_file_schema_to_string_type`,
  `transform_schema_to_view`, `transform_binary_to_string`,
  `field_with_new_type`) and their tests.

No public API change. Every previously-public item is still reachable
at the same path: the crate root re-exports `sink::ParquetSink` and the
`schema_coercion::*` functions, and the historical
`file_format::ParquetSink` path is preserved via `pub use`
(datafusion-proto depends on it).

Split out of apache#22156.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb
Copy link
Copy Markdown
Contributor Author

@xudong963 wonder if you could review this refactor?

Comment on lines +37 to +38
pub mod schema_coercion;
pub mod sink;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the two pub expected?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, no need to have them public. good catch, thank you

These modules were declared `pub mod`, which exposed two new public
module paths (`sink::*`, `schema_coercion::*`) that did not exist
before this refactor. Since every public item is already re-exported
at the crate root via `pub use`, declaring them `mod` keeps the PR a
true no-public-API-change code motion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb enabled auto-merge May 19, 2026 03:54
@adriangb adriangb added this pull request to the merge queue May 19, 2026
Merged via the queue into apache:main with commit c8b784a May 19, 2026
35 checks passed
@adriangb adriangb deleted the refactor/split-parquet-file-format branch May 19, 2026 04:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants