Fix Iceberg read optimization returning NULLs for stats-less manifests by il9ue · Pull Request #1895 · Altinity/ClickHouse

il9ue · 2026-06-08T14:12:34Z

Re-opened from Altinity/ClickHouse (instead of my fork) so CI publishes direct .deb package URLs for clickhouse-regression. Same commit (8b597ed) as #1814. Fork PRs don't receive repo secrets, so the S3 upload step was skipped on #1814 and CI only emitted the DEB_ARM_RELEASE artifact zip.

Changelog category (leave one):

Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix Iceberg read optimization returning NULL for every column when reading from manifests written without per-file column statistics (typical of non-Spark writers like pyiceberg with default settings). Affects icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants. Antalya 26.3 fix for #1545.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Antalya-specific bug fix on antalya-26.3. No upstream cherry-pick — this bug exists only on Antalya, introduced by #1069 ("Read optimization using Iceberg metadata"). Mirror of the 25.8 fix in #1688.

Why this fires

When reading an Iceberg table written by a non-Spark writer that omits per-file column statistics from the manifest's Avro schema (pyiceberg with default settings, format v1 writers, and others), the allow_experimental_iceberg_read_optimization path produces silent data loss: correct row counts, every column value NULL. Confirmed on icebergLocal; the same code path fires for icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Root cause

IcebergIterator always populates file_meta_info before yielding objects, so the file_meta_data.has_value() check in the optimization passes. The problem is what's inside the populated DataFileMetaInfo: when the manifest's data_file.value_counts / column_sizes / null_value_counts Avro fields are all absent (all three are optional per the Iceberg spec), DataFileMetaInfo::columns_info stays empty.

The optimization's second loop in StorageObjectStorageSource::createReader then iterates every requested column, finds none in the empty columns_info map, and adds them all to constant_columns_with_values with Field() (NULL). requested_columns_copy is cleared, need_only_count = true, the Parquet reader returns row count only, and generate() injects every column as a constant-NULL column at the correct row count. The optimization conflates "no stats were written" with "all columns are absent" — but absent stats tell us nothing about which columns are physically present.
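The conflation described above can be reproduced in a few lines. The following is a minimal C++17 sketch, not the actual ClickHouse code: `ColumnStats`, `FileMetaSketch`, `ReadPlan`, and `planReadBeforeFix` are hypothetical names, and the map is keyed by column name purely for readability.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-column stats parsed from a manifest entry.
struct ColumnStats
{
    std::optional<std::int64_t> rows_count;
    std::optional<std::int64_t> nulls_count;
};

// Hypothetical, heavily simplified stand-in for DataFileMetaInfo.
struct FileMetaSketch
{
    // Stays empty when the manifest carries no per-file column statistics.
    std::map<std::string, ColumnStats> columns_info;
};

struct ReadPlan
{
    std::vector<std::string> columns_to_read;       // still handed to the Parquet reader
    std::vector<std::string> constant_null_columns; // injected as constant NULLs
    bool need_only_count = false;
};

// Pre-fix behaviour as described in the text: any requested column that is
// missing from columns_info is assumed to be physically absent from the file.
ReadPlan planReadBeforeFix(const std::vector<std::string> & requested, const FileMetaSketch & meta)
{
    ReadPlan plan;
    for (const auto & name : requested)
    {
        if (meta.columns_info.count(name) == 0)
            plan.constant_null_columns.push_back(name);
        else
            plan.columns_to_read.push_back(name);
    }
    // Nothing left to read: only the row count is taken from the file.
    plan.need_only_count = plan.columns_to_read.empty();
    return plan;
}
```

With an empty `columns_info`, every requested column ends up in `constant_null_columns` and `need_only_count` becomes true, which matches the reported symptom of correct row counts with all-NULL values.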

The fix

Add any_stats_field_present (bool) to DataFileMetaInfo, populated during manifest parsing in AvroForIcebergDeserializer.cpp — true if any of value_counts, column_sizes, or null_value_counts were emitted. Gate the optimization's absent-NULL loop on this flag: when no stats were emitted, skip the loop and fall through to the Parquet reader, which correctly handles both physically-present columns (read normally) and schema-evolved-absent columns (handled upstream by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header).
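A rough C++17 sketch of the flag-based gate follows, assuming a simplified manifest entry. `ManifestEntrySketch`, `FileMetaSketch`, `parseEntry`, and `columnsToNullify` are hypothetical names, not the real deserializer or source code, and the stats maps are keyed by column name here although Iceberg keys them by field id.

```cpp
#include <cstdint>
#include <initializer_list>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

using StatsMap = std::map<std::string, std::int64_t>;

// Hypothetical view of one manifest entry: all three stats fields are optional.
struct ManifestEntrySketch
{
    std::optional<StatsMap> value_counts;
    std::optional<StatsMap> column_sizes;
    std::optional<StatsMap> null_value_counts;
};

// Hypothetical, simplified stand-in for DataFileMetaInfo with the new flag.
struct FileMetaSketch
{
    std::set<std::string> columns_with_stats;
    bool any_stats_field_present = false;
};

// Sets the flag if at least one of the three stats fields was emitted.
FileMetaSketch parseEntry(const ManifestEntrySketch & entry)
{
    FileMetaSketch meta;
    for (const auto * field : {&entry.value_counts, &entry.column_sizes, &entry.null_value_counts})
    {
        if (!field->has_value())
            continue;
        meta.any_stats_field_present = true;
        for (const auto & kv : **field)
            meta.columns_with_stats.insert(kv.first);
    }
    return meta;
}

// Gated absent-NULL loop: without any stats field, nothing is nullified and
// every requested column falls through to the Parquet reader.
std::vector<std::string> columnsToNullify(const std::vector<std::string> & requested, const FileMetaSketch & meta)
{
    std::vector<std::string> result;
    if (!meta.any_stats_field_present)
        return result;
    for (const auto & name : requested)
        if (meta.columns_with_stats.count(name) == 0)
            result.push_back(name);
    return result;
}
```

In this model a stats-less entry nullifies nothing, while an entry carrying only `value_counts` still lets the loop treat columns without stats as absent, which is roughly the split the partial-stats and stats-less test scenarios exercise.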

A per-column presence set was considered but is unnecessary — schema evolution is already handled upstream of the optimization, so the boolean is sufficient. JSON serialization (cluster reads via toJson() / JSON-ptr constructor) round-trips the new field; missing-on-deserialization defaults to false, matching pre-fix behavior.

Files changed

src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.h — added any_stats_field_present to DataFileMetaInfo; constructor signature updated.
src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.cpp — JSON serde round-trips the new field; defaults to false on missing.
src/Storages/ObjectStorage/DataLakes/Iceberg/ManifestFile.h — header updates for ParsedManifestFileEntry.
src/Storages/ObjectStorage/DataLakes/Common/AvroForIcebergDeserializer.cpp — tracks whether any stats Avro field was present during manifest parsing on 26.3.
src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp — forwards the new bool when constructing DataFileMetaInfo.
src/Storages/ObjectStorage/StorageObjectStorageSource.cpp — the absent-NULL loop now skips when any_stats_field_present is false.

Note: 26.3 uses AvroForIcebergDeserializer.cpp for manifest parsing where 25.8 / 26.1 use ManifestFile.cpp (file was split upstream). Same logic, different file.

Tested

Integration test tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py, ported from the 25.8 PR. Four scenarios:

test_iceberg_local_returns_actual_rows_with_stats_less_manifest — reproducer, fails without the fix.
test_iceberg_local_returns_correct_rows_when_optimization_disabled — control.
test_iceberg_local_partial_stats_manifest_reads_correctly — manifest with value_counts only.
test_iceberg_local_full_stats_manifest_reads_correctly — full Spark-style stats regression guard.

Closes #1545
Mirror of #1688 (antalya-25.8 fix).

CI/CD Options

Exclude tests:

Regression jobs to run:

When an Iceberg manifest's per-file column statistics are absent or empty (common for non-Spark writers like pyiceberg with default settings), DataFileMetaInfo::columns_info is empty. The optimization in StorageObjectStorageSource::createReader misread this as "all columns are absent from the file" and returned constant NULLs for every row while still returning the correct row count. Result: silent data loss on icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Gate the optimization's absent-NULL loop directly on columns_info.empty() instead of introducing a separate stats-presence flag. When no usable per-column stats were parsed -- whether the manifest omitted the stats fields entirely or declared them but left them empty -- fall through to the Parquet reader, which correctly handles physically-present columns (read normally) and schema-evolved-absent columns (handled by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header).

columns_info is already serialized to workers in the cluster JSON path, so this changes no serialization format and keeps the fork's DataFileMetaInfo serde identical to upstream.

Closes #1545.
Mirror of #1688 (antalya-25.8 fix).

Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
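The gate named in the commit message can be illustrated with a small C++17 sketch. `collectColumnsInfo` and `mayInferAbsentColumns` are hypothetical helpers invented for this illustration, and the maps are keyed by column name for readability. The point is only that an omitted stats field and a declared-but-empty one both leave the merged map empty, so a single `empty()` check covers both.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

using StatsMap = std::map<std::string, std::int64_t>;

// Hypothetical: merge whatever per-column stats a manifest entry carries.
// An omitted field (nullopt) and a declared-but-empty field both add nothing.
StatsMap collectColumnsInfo(const std::vector<std::optional<StatsMap>> & stats_fields)
{
    StatsMap columns_info;
    for (const auto & field : stats_fields)
        if (field.has_value())
            columns_info.insert(field->begin(), field->end());
    return columns_info;
}

// Run the absent-NULL loop only when at least one usable per-column entry exists.
bool mayInferAbsentColumns(const StatsMap & columns_info)
{
    return !columns_info.empty();
}
```

Because the decision depends only on `columns_info`, which the commit message says is already serialized to workers, this variant needs no extra field in the cluster JSON path.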

github-actions · 2026-06-08T14:13:54Z

Workflow [PR], commit [5ec03a9]

il9ue mentioned this pull request Jun 8, 2026

Fix Iceberg read optimization returning NULLs for stats-less manifests (#1545) — antalya-26.3 #1814

Closed


ianton-ru approved these changes Jun 8, 2026

il9ue wants to merge 1 commit into antalya-26.3 from fix/iceberg-empty-stats-26.3


3 participants
