GH-49502: [Parquet][C++] Fix missing overflow check for dictionary encoder indices count#49513

Open
aryansri05 wants to merge 1 commit into apache:main from aryansri05:fix/dict-encoder-index-overflow
Conversation

@aryansri05 aryansri05 commented Mar 14, 2026

Closes #49502
Rationale for this change

When writing large dictionary-encoded Parquet data with
ARROW_LARGE_MEMORY_TESTS=ON, two tests were failing:

  • TestColumnWriter.WriteLargeDictEncodedPage — expected 2 pages, got 7501
  • TestColumnWriter.ThrowsOnDictIndicesTooLarge — expected ParquetException,
    got nothing thrown

The root cause is that PutIndicesTyped() in DictEncoderImpl had no check
for the case where the total number of buffered dictionary indices exceeds
INT32_MAX. The existing overflow check in FlushValues() only validates the
buffer size in bytes, not the index count, so it never triggered here.

What changes are included in this PR?

Added an overflow check in DictEncoderImpl::PutIndicesTyped() immediately
after buffered_indices_.resize():

  if (buffered_indices_.size() >
      static_cast<size_t>(std::numeric_limits<int32_t>::max())) {
    throw ParquetException("Total dictionary indices count (",
                           buffered_indices_.size(),
                           ") exceeds maximum int value");
  }

This makes the encoder throw a ParquetException with a message containing
"exceeds maximum int value" when the index count overflows, which is exactly
what ThrowsOnDictIndicesTooLarge expects.

Are these changes tested?

Yes — the existing tests in column_writer_test.cc cover this fix:

  • TestColumnWriter.ThrowsOnDictIndicesTooLarge
  • TestColumnWriter.WriteLargeDictEncodedPage

Both tests were failing before this fix and should pass after.
Tests require building with ARROW_LARGE_MEMORY_TESTS=ON.

This PR contains a "Critical Fix": previously, writing dictionary-encoded
data with more than INT32_MAX indices would silently produce incorrect output
(a wrong page count) instead of raising an error. With this fix, the encoder
correctly throws a ParquetException in that scenario.

@aryansri05 aryansri05 requested a review from wgtmac as a code owner March 14, 2026 20:46
@github-actions
Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@aryansri05 aryansri05 changed the title ARROW-PARQUET: Fix missing overflow check for dict encoder indices count GH-49502: [Parquet][C++] Fix missing overflow check for dictionary encoder indices count Mar 14, 2026
@github-actions
⚠️ GitHub issue #49502 has been automatically assigned in GitHub to PR creator.
