Fix Iceberg read optimization returning NULLs for stats-less manifests #1895
Open
il9ue wants to merge 1 commit into
When an Iceberg manifest's per-file column statistics are absent or empty (common for non-Spark writers like pyiceberg with default settings), DataFileMetaInfo::columns_info is empty. The optimization in StorageObjectStorageSource::createReader misread this as "all columns are absent from the file" and returned constant NULLs for every row while still returning the correct row count. Result: silent data loss on icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Gate the optimization's absent-NULL loop directly on columns_info.empty() instead of introducing a separate stats-presence flag. When no usable per-column stats were parsed -- whether the manifest omitted the stats fields entirely or declared them but left them empty -- fall through to the Parquet reader, which correctly handles physically-present columns (read normally) and schema-evolved-absent columns (handled by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header).

columns_info is already serialized to workers in the cluster JSON path, so this changes no serialization format and keeps the fork's DataFileMetaInfo serde identical to upstream.

Closes #1545. Mirror of #1688 (antalya-25.8 fix).

Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
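The gating described in this commit message can be illustrated with a small Python model. This is only a sketch: the real logic is C++ inside StorageObjectStorageSource::createReader, and plan_read, its parameters, and the gate_on_empty switch are hypothetical names introduced here. The model covers only the absent-NULL loop and assumes that a column missing from columns_info is the sole signal used to treat it as absent.

```python
from typing import Dict, List, Tuple


def plan_read(
    requested: List[str],
    columns_info: Dict[str, dict],
    gate_on_empty: bool = True,
) -> Tuple[List[str], List[str]]:
    """Split requested columns into (constant_null, read_from_file).

    Hypothetical model of the absent-NULL loop: a requested column that
    has no entry in columns_info is treated as absent from the data file
    and replaced by a constant NULL.
    """
    if gate_on_empty and not columns_info:
        # No usable per-column stats were parsed, so nothing can be
        # inferred about which columns exist: read them all from the file.
        return [], list(requested)

    constant_null = [name for name in requested if name not in columns_info]
    read_from_file = [name for name in requested if name in columns_info]
    return constant_null, read_from_file


# Ungated behavior (the bug): a stats-less manifest turns every column into NULL.
buggy = plan_read(["id", "name"], {}, gate_on_empty=False)

# Gated behavior (the fix): the same manifest falls through to the file reader.
fixed = plan_read(["id", "name"], {})

# With stats present, a column missing from the stats is still treated as absent.
evolved = plan_read(["id", "added_later"], {"id": {"value_count": 3}})
```

In this model, buggy is (["id", "name"], []), fixed is ([], ["id", "name"]), and evolved is (["added_later"], ["id"]), which mirrors the distinction between "no stats were written" and "this column is absent".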
ianton-ru
approved these changes
Jun 8, 2026
Re-opened from Altinity/ClickHouse (instead of my fork) so CI publishes direct .deb package URLs for clickhouse-regression. Same commit (8b597ed) as #1814. Fork PRs don't receive repo secrets, so the S3 upload step was skipped on #1814 and CI only emitted the DEB_ARM_RELEASE artifact zip.

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix Iceberg read optimization returning NULL for every column when reading from manifests written without per-file column statistics (typical of non-Spark writers like pyiceberg with default settings). Affects icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants. Antalya 26.3 fix for #1545.

Documentation entry for user-facing changes
Antalya-specific bug fix on antalya-26.3. No upstream cherry-pick: this bug exists only on Antalya, introduced by #1069 ("Read optimization using Iceberg metadata"). Mirror of the 25.8 fix in #1688.

Why this fires
When reading an Iceberg table written by a non-Spark writer that omits per-file column statistics from the manifest's Avro schema (pyiceberg with default settings, format v1 writers, and others), the allow_experimental_iceberg_read_optimization path produces silent data loss: correct row counts, every column value NULL. Confirmed on icebergLocal; the same code path fires for icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Root cause
IcebergIterator always populates file_meta_info before yielding objects, so the file_meta_data.has_value() check in the optimization passes. The problem is what's inside the populated DataFileMetaInfo: when the manifest's data_file.value_counts / column_sizes / null_value_counts Avro fields are all absent (all three are optional per the Iceberg spec), DataFileMetaInfo::columns_info stays empty.

The optimization's second loop in StorageObjectStorageSource::createReader then iterates every requested column, finds none in the empty columns_info map, and adds them all to constant_columns_with_values with Field() (NULL). requested_columns_copy is cleared, need_only_count = true, the Parquet reader returns row count only, and generate() injects every column as a constant-NULL column at the correct row count. The optimization conflates "no stats were written" with "all columns are absent", but absent stats tell us nothing about which columns are physically present.

The fix
Add any_stats_field_present (bool) to DataFileMetaInfo, populated during manifest parsing in AvroForIcebergDeserializer.cpp: true if any of value_counts, column_sizes, or null_value_counts were emitted. Gate the optimization's absent-NULL loop on this flag: when no stats were emitted, skip the loop and fall through to the Parquet reader, which correctly handles both physically-present columns (read normally) and schema-evolved-absent columns (handled upstream by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header).

A per-column presence set was considered but is unnecessary: schema evolution is already handled upstream of the optimization, so the boolean is sufficient. JSON serialization (cluster reads via toJson() / JSON-ptr constructor) round-trips the new field; missing-on-deserialization defaults to false, matching pre-fix behavior.

Files changed

- src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.h: added any_stats_field_present to DataFileMetaInfo; constructor signature updated.
- src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.cpp: JSON serde round-trips the new field; defaults to false on missing.
- src/Storages/ObjectStorage/DataLakes/Iceberg/ManifestFile.h: header updates for ParsedManifestFileEntry.
- src/Storages/ObjectStorage/DataLakes/Common/AvroForIcebergDeserializer.cpp: tracks whether any stats Avro field was present during manifest parsing on 26.3.
- src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp: forwards the new bool when constructing DataFileMetaInfo.
- src/Storages/ObjectStorage/StorageObjectStorageSource.cpp: the absent-NULL loop now skips when any_stats_field_present is false.

Tested
Integration test tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py, ported from the 25.8 PR. Four scenarios:

- test_iceberg_local_returns_actual_rows_with_stats_less_manifest: reproducer, fails without the fix.
- test_iceberg_local_returns_correct_rows_when_optimization_disabled: control.
- test_iceberg_local_partial_stats_manifest_reads_correctly: manifest with value_counts only.
- test_iceberg_local_full_stats_manifest_reads_correctly: full Spark-style stats regression guard.

Closes #1545
Mirror of #1688 (antalya-25.8 fix).
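The flag derivation and the JSON default described under "The fix" can be sketched in Python. This is a loose model, not the C++ in IDataLakeMetadata.cpp or AvroForIcebergDeserializer.cpp: FileMetaInfoModel, stats_field_present, and the JSON key layout are assumptions made for illustration, and "emitted" is interpreted here as "the key exists in the parsed data_file entry".

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

# The three optional per-file stats fields named in the description.
STATS_FIELDS = ("value_counts", "column_sizes", "null_value_counts")


def stats_field_present(data_file: Dict[str, Any]) -> bool:
    """Return True if any stats field exists in the parsed manifest entry."""
    return any(name in data_file for name in STATS_FIELDS)


@dataclass
class FileMetaInfoModel:
    """Hypothetical stand-in for DataFileMetaInfo (JSON layout is assumed)."""

    columns_info: Dict[str, Any] = field(default_factory=dict)
    any_stats_field_present: bool = False

    def to_json(self) -> str:
        return json.dumps(
            {
                "columns_info": self.columns_info,
                "any_stats_field_present": self.any_stats_field_present,
            }
        )

    @classmethod
    def from_json(cls, text: str) -> "FileMetaInfoModel":
        raw = json.loads(text)
        return cls(
            columns_info=raw.get("columns_info", {}),
            # A payload without the field deserializes to False.
            any_stats_field_present=raw.get("any_stats_field_present", False),
        )


# Stats-less entry, as in the reproducer scenario.
stats_less = stats_field_present({"file_path": "data/a.parquet"})

# Partial stats: value_counts only, as in the partial-stats scenario.
partial = stats_field_present({"value_counts": {"1": 3}})

# Round trip keeps the flag; a payload lacking the field defaults to False.
round_tripped = FileMetaInfoModel.from_json(
    FileMetaInfoModel({"id": {}}, True).to_json()
)
legacy = FileMetaInfoModel.from_json('{"columns_info": {}}')
```

Here stats_less is False and partial is True, round_tripped.any_stats_field_present stays True, and legacy.any_stats_field_present is False, which corresponds to the "missing-on-deserialization defaults to false" behavior described above.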
CI/CD Options
Exclude tests:
Regression jobs to run: