Skip to content

Fix Iceberg read optimization returning NULLs for stats-less manifests#1895

Open
il9ue wants to merge 1 commit into
antalya-26.3from
fix/iceberg-empty-stats-26.3
Open

Fix Iceberg read optimization returning NULLs for stats-less manifests#1895
il9ue wants to merge 1 commit into
antalya-26.3from
fix/iceberg-empty-stats-26.3

Conversation

@il9ue

@il9ue il9ue commented Jun 8, 2026

Copy link
Copy Markdown

Re-opened from Altinity/ClickHouse (instead of my fork) so CI publishes direct .deb package URLs for clickhouse-regression. Same commit (8b597ed) as #1814. Fork PRs don't receive repo secrets, so the S3 upload step was skipped on #1814 and CI only emitted the DEB_ARM_RELEASE artifact zip.

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix Iceberg read optimization returning NULL for every column when reading from manifests written without per-file column statistics (typical of non-Spark writers like pyiceberg with default settings). Affects icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants. Antalya 26.3 fix for #1545.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Antalya-specific bug fix on antalya-26.3. No upstream cherry-pick — this bug exists only on Antalya, introduced by #1069 ("Read optimization using Iceberg metadata"). Mirror of the 25.8 fix in #1688.

Why this fires

When reading an Iceberg table written by a non-Spark writer that omits per-file column statistics from the manifest's Avro schema (pyiceberg with default settings, format v1 writers, and others), the allow_experimental_iceberg_read_optimization path produces silent data loss: correct row counts, every column value NULL. Confirmed on icebergLocal; the same code path fires for icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Root cause

IcebergIterator always populates file_meta_info before yielding objects, so the file_meta_data.has_value() check in the optimization passes. The problem is what's inside the populated DataFileMetaInfo: when the manifest's data_file.value_counts / column_sizes / null_value_counts Avro fields are all absent (all three are optional per the Iceberg spec), DataFileMetaInfo::columns_info stays empty.

The optimization's second loop in StorageObjectStorageSource::createReader then iterates every requested column, finds none in the empty columns_info map, and adds them all to constant_columns_with_values with Field() (NULL). requested_columns_copy is cleared, need_only_count = true, the Parquet reader returns row count only, and generate() injects every column as a constant-NULL column at the correct row count. The optimization conflates "no stats were written" with "all columns are absent" — but absent stats tell us nothing about which columns are physically present.

The fix

Add any_stats_field_present (bool) to DataFileMetaInfo, populated during manifest parsing in AvroForIcebergDeserializer.cpptrue if any of value_counts, column_sizes, or null_value_counts were emitted. Gate the optimization's absent-NULL loop on this flag: when no stats were emitted, skip the loop and fall through to the Parquet reader, which correctly handles both physically-present columns (read normally) and schema-evolved-absent columns (handled upstream by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header).

A per-column presence set was considered but is unnecessary — schema evolution is already handled upstream of the optimization, so the boolean is sufficient. JSON serialization (cluster reads via toJson() / JSON-ptr constructor) round-trips the new field; missing-on-deserialization defaults to false, matching pre-fix behavior.

Files changed

  • src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.h — added any_stats_field_present to DataFileMetaInfo; constructor signature updated.
  • src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.cpp — JSON serde round-trips the new field; defaults to false on missing.
  • src/Storages/ObjectStorage/DataLakes/Iceberg/ManifestFile.h — header updates for ParsedManifestFileEntry.
  • src/Storages/ObjectStorage/DataLakes/Common/AvroForIcebergDeserializer.cpp — tracks whether any stats Avro field was present during manifest parsing on 26.3.
  • src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp — forwards the new bool when constructing DataFileMetaInfo.
  • src/Storages/ObjectStorage/StorageObjectStorageSource.cpp — the absent-NULL loop now skips when any_stats_field_present is false.

Note: 26.3 uses AvroForIcebergDeserializer.cpp for manifest parsing where 25.8 / 26.1 use ManifestFile.cpp (file was split upstream). Same logic, different file.

Tested

Integration test tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py, ported from the 25.8 PR. Four scenarios:

  • test_iceberg_local_returns_actual_rows_with_stats_less_manifest — reproducer, fails without the fix.
  • test_iceberg_local_returns_correct_rows_when_optimization_disabled — control.
  • test_iceberg_local_partial_stats_manifest_reads_correctly — manifest with value_counts only.
  • test_iceberg_local_full_stats_manifest_reads_correctly — full Spark-style stats regression guard.

Closes #1545
Mirror of #1688 (antalya-25.8 fix).


CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

When an Iceberg manifest's per-file column statistics are absent or
empty (common for non-Spark writers like pyiceberg with default
settings), DataFileMetaInfo::columns_info is empty. The optimization
in StorageObjectStorageSource::createReader misread this as "all
columns are absent from the file" and returned constant NULLs for
every row while still returning the correct row count. Result: silent
data loss on icebergLocal, icebergS3, icebergAzure, icebergHDFS, and
all *Cluster variants.

Gate the optimization's absent-NULL loop directly on
columns_info.empty() instead of introducing a separate stats-presence
flag. When no usable per-column stats were parsed -- whether the
manifest omitted the stats fields entirely or declared them but left
them empty -- fall through to the Parquet reader, which correctly
handles physically-present columns (read normally) and
schema-evolved-absent columns (handled by
IcebergMetadata::getInitialSchemaByPath setting the file's own schema
as initial_header). columns_info is already serialized to workers in
the cluster JSON path, so this changes no serialization format and
keeps the fork's DataFileMetaInfo serde identical to upstream.

Closes #1545.
Mirror of #1688 (antalya-25.8 fix).

Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Workflow [PR], commit [5ec03a9]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read using IcebergLocal returns Nulls instead of actual rows

3 participants