GH-47662: [C++][Parquet] Reject metadata with null_count on required column#49909
GH-47662: [C++][Parquet] Reject metadata with null_count on required column#49909ArnavBalyan wants to merge 1 commit intoapache:mainfrom
Conversation
|
|
f669161 to
d0043f4
Compare
|
cc @pitrou could you ptal thanks! |
| column_metadata.__isset.statistics = true; | ||
|
|
||
| EXPECT_THROW(ColumnChunkMetaData::Make(&column_chunk, schema_descr.Column(0)), | ||
| ParquetException); |
There was a problem hiding this comment.
Can we please test something about the exception message? You can use the EXPECT_THROW_THAT macro.
|
|
||
| // Per the Parquet spec, a column with max_definition_level == 0 cannot | ||
| // have nulls, so null_count must be 0. Reject inconsistent metadata | ||
| // from writers that skip ValidatingRecordConsumer checks for missing |
There was a problem hiding this comment.
Readers of the Parquet C++ code will not know anything about ValidatingRecordConsumer, can we make this comment more generic?
|
@wgtmac Could you please take a look? |
d0043f4 to
76111a6
Compare
|
updated thanks |
76111a6 to
018c631
Compare
| ss << "Malformed Parquet file: column '" << descr_->path()->ToDotString() | ||
| << "' has max_definition_level == 0 but statistics report null_count=" | ||
| << column_metadata_->statistics.null_count; | ||
| throw ParquetException(ss.str()); |
There was a problem hiding this comment.
Instead of throwing an exception here, is it better to unset the null_count field when building the statistics?
Just FYI that I recently reviewed apache/parquet-java#3458 which has dealt with a similar case. From downstream users' standpoint, it is a bad experience if the Parquet file cannot be read simply because of malformed statistics.
There was a problem hiding this comment.
Well, is it possible to get such a malformed file using the regular Parquet writing APIs?
There was a problem hiding this comment.
Looks like some legacy versions can get it through with disabled validating record consumer
There was a problem hiding this comment.
unset the null_count field when building the statistics
Yeah agreed, thanks for sharing, updated to maintain parity with java
018c631 to
6cd1b89
Compare
Rationale for this change
REQUIREDcolumns can't encode nulls (max_definition_level = 0 always has null_count = 0).What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?