Skip to content

[paimon-common] Fix ApplyBitmapIndexRecordReader to stop readBatch after bitmap selection is exhausted.#7994

Open
wwj6591812 wants to merge 1 commit into
apache:masterfrom
wwj6591812:fix_bug_of_ApplyBitmapIndexRecordReader_0527
Open

[paimon-common] Fix ApplyBitmapIndexRecordReader to stop readBatch after bitmap selection is exhausted.#7994
wwj6591812 wants to merge 1 commit into
apache:masterfrom
wwj6591812:fix_bug_of_ApplyBitmapIndexRecordReader_0527

Conversation

@wwj6591812
Copy link
Copy Markdown
Contributor

Problem

When file-index.read.enabled=true (default) and a query has only LIMIT N (no filter), Paimon pushes the limit down as a bitmap selection and wraps the ORC reader with ApplyBitmapIndexRecordReader.

After row N, ApplyBitmapIndexFileRecordIterator.next() returns null because position > last. However, RecordReaderIterator treats iterator null as batch exhaustion, not reader exhaustion, and keeps calling readBatch() until EOF. This causes unnecessary full-file ORC I/O even though only N rows are needed.

This affects any caller using RecordReader.toCloseableIterator() or forEachRemaining(), not only Flink.

Related Flink-side fix: #7991
That PR stops the dedicated split path from calling hasNext() after the limit is reached in ReadOperator. This PR fixes the root cause in the core reader layer and benefits all read paths.

Root Cause

ApplyBitmapIndexFileRecordIterator stops yielding rows once position > last, but ApplyBitmapIndexRecordReader.readBatch() had no global exhausted state. So RecordReaderIterator.advanceIfNeeded() interpreted the null as "current batch finished, open the next batch" and looped until EOF.

Fix

Track bitmap selection exhaustion in ApplyBitmapIndexRecordReader with an AtomicBoolean:

  1. When ApplyBitmapIndexFileRecordIterator sees position > last, it sets exhausted = true.
  2. Subsequent readBatch() calls return null immediately, so RecordReaderIterator stops without scanning the rest of the file.

Minimal change: only ApplyBitmapIndexRecordReader and ApplyBitmapIndexFileRecordIterator are modified.

Relationship to #7991

PR Layer Scope
#7991 Flink ReadOperator Dedicated split path; stops extra hasNext() after limit
This PR Core ApplyBitmapIndexRecordReader All callers of toCloseableIterator() / forEachRemaining()

The two PRs are independent and can be reviewed/merged separately. Either one fixes the reported LIMIT N performance issue on the dedicated path; together they provide defense in depth.

Testing

  • ApplyBitmapIndexRecordReaderTest — mock reader with 100 rows (batch size 20), limit 10 → 10 rows returned, underlying readBatch called once; covers toCloseableIterator() and forEachRemaining(); sparse bitmap case
  • AppendOnlySimpleTableTest.testLimitWithCloseableIterator — 5000-row Parquet table, limit 10 via toCloseableIterator()
  • BatchFileStoreITCase.testDedicatedPathLimitTenOnManyRows — 100 rows INSERT, dedicated split + LIMIT 10 → 10 rows
mvn test -pl paimon-common -Dtest=ApplyBitmapIndexRecordReaderTest

mvn test -pl paimon-core -Dtest=AppendOnlySimpleTableTest#testLimitWithCloseableIterator

mvn test -pl paimon-flink/paimon-flink-common \
  -Dtest=BatchFileStoreITCase#testDedicatedPathLimitTenOnManyRows

Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. I think the current exhausted flag does not cover the real RawFileSplitRead path yet.

In RawFileSplitRead#createFileReader, the same BitmapIndexResult is first passed into FormatReaderContext and then DataFileRecordReader#readBatchInternal applies it again via iterator.selection(selection) before the reader is wrapped by ApplyBitmapIndexRecordReader. Because of that inner selection, the wrapped iterator can return null as soon as the bitmap iterator is exhausted, without ever exposing a row whose returnedPosition() > last to ApplyBitmapIndexFileRecordIterator. In that case exhausted remains false, and RecordReaderIterator still keeps calling readBatch() until EOF.

The new unit test does not model this because its underlying reader returns unfiltered rows. I added a local variant where CountingFileRecordReader#readBatch() returns batch.selection(selection), which is equivalent to the DataFileRecordReader behavior above. With totalRows = 100, batchSize = 20, and limit = 10, the result rows are correct, but the underlying readBatchCount() is 6 instead of 1.

A minimal fix could be to mark the reader exhausted when returning the last selected row, for example when bitmap.contains(position) and position >= last, or otherwise avoid/apply only one layer of bitmap selection and add a test that simulates the already-selected underlying iterator.

@wwj6591812 wwj6591812 force-pushed the fix_bug_of_ApplyBitmapIndexRecordReader_0527 branch from 8b47063 to 6e7bb2d Compare May 29, 2026 00:50
@wwj6591812
Copy link
Copy Markdown
Contributor Author

Thanks for the fix. I think the current exhausted flag does not cover the real RawFileSplitRead path yet.

In RawFileSplitRead#createFileReader, the same BitmapIndexResult is first passed into FormatReaderContext and then DataFileRecordReader#readBatchInternal applies it again via iterator.selection(selection) before the reader is wrapped by ApplyBitmapIndexRecordReader. Because of that inner selection, the wrapped iterator can return null as soon as the bitmap iterator is exhausted, without ever exposing a row whose returnedPosition() > last to ApplyBitmapIndexFileRecordIterator. In that case exhausted remains false, and RecordReaderIterator still keeps calling readBatch() until EOF.

The new unit test does not model this because its underlying reader returns unfiltered rows. I added a local variant where CountingFileRecordReader#readBatch() returns batch.selection(selection), which is equivalent to the DataFileRecordReader behavior above. With totalRows = 100, batchSize = 20, and limit = 10, the result rows are correct, but the underlying readBatchCount() is 6 instead of 1.

A minimal fix could be to mark the reader exhausted when returning the last selected row, for example when bitmap.contains(position) and position >= last, or otherwise avoid/apply only one layer of bitmap selection and add a test that simulates the already-selected underlying iterator.

Thanks again for the careful review and for pointing out the gap in the real RawFileSplitRead path. Your feedback was very helpful.

You were right that our original exhausted flag did not cover the case where the underlying iterator is already selection-filtered in DataFileRecordReader. In that path, the wrapped iterator can return null once the selected rows in the current batch are consumed, without ever reaching position > last, so readBatch() could still continue until EOF.

We have updated the fix accordingly:

We still keep the original reader-level exhausted guard in ApplyBitmapIndexRecordReader.
In ApplyBitmapIndexFileRecordIterator, we now also mark the reader exhausted when returning the last selected row (position >= last), in addition to the existing position > last handling.
We added a unit test where the underlying reader returns batch.selection(selection), matching the DataFileRecordReader behavior you described. With totalRows = 100, batchSize = 20, and limit = 10, the result is still 10 rows, and readBatch() is now called only once.

We really appreciate you taking the time to review this in detail. Could you please take another look when convenient? Thx.

@leaves12138

@wwj6591812 wwj6591812 requested a review from leaves12138 May 29, 2026 03:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants