[paimon-common] Fix ApplyBitmapIndexRecordReader to stop readBatch after bitmap selection is exhausted.#7994
Conversation
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the fix. I think the current exhausted flag does not cover the real RawFileSplitRead path yet.
In RawFileSplitRead#createFileReader, the same BitmapIndexResult is first passed into FormatReaderContext and then DataFileRecordReader#readBatchInternal applies it again via iterator.selection(selection) before the reader is wrapped by ApplyBitmapIndexRecordReader. Because of that inner selection, the wrapped iterator can return null as soon as the bitmap iterator is exhausted, without ever exposing a row whose returnedPosition() > last to ApplyBitmapIndexFileRecordIterator. In that case exhausted remains false, and RecordReaderIterator still keeps calling readBatch() until EOF.
The new unit test does not model this because its underlying reader returns unfiltered rows. I added a local variant where CountingFileRecordReader#readBatch() returns batch.selection(selection), which is equivalent to the DataFileRecordReader behavior above. With totalRows = 100, batchSize = 20, and limit = 10, the result rows are correct, but the underlying readBatchCount() is 6 instead of 1.
A minimal fix could be to mark the reader exhausted when returning the last selected row, for example when bitmap.contains(position) and position >= last, or otherwise avoid/apply only one layer of bitmap selection and add a test that simulates the already-selected underlying iterator.
…ter bitmap selection is exhausted.
8b47063 to
6e7bb2d
Compare
Thanks again for the careful review and for pointing out the gap in the real RawFileSplitRead path. Your feedback was very helpful. You were right that our original exhausted flag did not cover the case where the underlying iterator is already selection-filtered in DataFileRecordReader. In that path, the wrapped iterator can return null once the selected rows in the current batch are consumed, without ever reaching position > last, so readBatch() could still continue until EOF. We have updated the fix accordingly: We still keep the original reader-level exhausted guard in ApplyBitmapIndexRecordReader. We really appreciate you taking the time to review this in detail. Could you please take another look when convenient? Thx. |
Problem
When
file-index.read.enabled=true(default) and a query has onlyLIMIT N(no filter), Paimon pushes the limit down as a bitmap selection and wraps the ORC reader withApplyBitmapIndexRecordReader.After row N,
ApplyBitmapIndexFileRecordIterator.next()returnsnullbecauseposition > last. However,RecordReaderIteratortreats iteratornullas batch exhaustion, not reader exhaustion, and keeps callingreadBatch()until EOF. This causes unnecessary full-file ORC I/O even though only N rows are needed.This affects any caller using
RecordReader.toCloseableIterator()orforEachRemaining(), not only Flink.Related Flink-side fix: #7991
That PR stops the dedicated split path from calling
hasNext()after the limit is reached inReadOperator. This PR fixes the root cause in the core reader layer and benefits all read paths.Root Cause
ApplyBitmapIndexFileRecordIteratorstops yielding rows onceposition > last, butApplyBitmapIndexRecordReader.readBatch()had no global exhausted state. SoRecordReaderIterator.advanceIfNeeded()interpreted thenullas "current batch finished, open the next batch" and looped until EOF.Fix
Track bitmap selection exhaustion in
ApplyBitmapIndexRecordReaderwith anAtomicBoolean:ApplyBitmapIndexFileRecordIteratorseesposition > last, it setsexhausted = true.readBatch()calls returnnullimmediately, soRecordReaderIteratorstops without scanning the rest of the file.Minimal change: only
ApplyBitmapIndexRecordReaderandApplyBitmapIndexFileRecordIteratorare modified.Relationship to #7991
ReadOperatorhasNext()after limitApplyBitmapIndexRecordReadertoCloseableIterator()/forEachRemaining()The two PRs are independent and can be reviewed/merged separately. Either one fixes the reported
LIMIT Nperformance issue on the dedicated path; together they provide defense in depth.Testing
ApplyBitmapIndexRecordReaderTest— mock reader with 100 rows (batch size 20), limit 10 → 10 rows returned, underlyingreadBatchcalled once; coverstoCloseableIterator()andforEachRemaining(); sparse bitmap caseAppendOnlySimpleTableTest.testLimitWithCloseableIterator— 5000-row Parquet table, limit 10 viatoCloseableIterator()BatchFileStoreITCase.testDedicatedPathLimitTenOnManyRows— 100 rows INSERT, dedicated split +LIMIT 10→ 10 rows