Skip to content

Fix row policies silently ignored on Iceberg tables with PREWHERE enabled#1597

Open
mkmkme wants to merge 1 commit intobackports/antalya-26.1/95476from
mkmkme/antalya-26.1/iceberg-fix-prewhere
Open

Fix row policies silently ignored on Iceberg tables with PREWHERE enabled#1597
mkmkme wants to merge 1 commit intobackports/antalya-26.1/95476from
mkmkme/antalya-26.1/iceberg-fix-prewhere

Conversation

@mkmkme
Copy link
Copy Markdown
Collaborator

@mkmkme mkmkme commented Mar 30, 2026

The Iceberg read optimization (allow_experimental_iceberg_read_optimization) identifies constant columns from Iceberg metadata and removes them from the read request. When all requested columns become constant, it sets need_only_count = true, which tells the Parquet reader to skip all initialization — including preparePrewhere — and just return the raw row count from file metadata.

This completely bypasses row_level_filter (row policies) and prewhere_info, returning unfiltered row counts. The InterpreterSelectQuery relies on the storage to apply these filters when supportsPrewhere is true and does not add a fallback FilterStep to the query plan, so the filter is silently lost.

The fix prevents need_only_count from being set when an active row_level_filter or prewhere_info exists in the format filter info.

Fixes #1595

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix row policies silently ignored on Iceberg tables with PREWHERE enabled

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

…bled

The Iceberg read optimization (`allow_experimental_iceberg_read_optimization`)
identifies constant columns from Iceberg metadata and removes them from the
read request. When all requested columns become constant, it sets
`need_only_count = true`, which tells the Parquet reader to skip all
initialization — including `preparePrewhere` — and just return the raw row
count from file metadata.

This completely bypasses `row_level_filter` (row policies) and `prewhere_info`,
returning unfiltered row counts. The InterpreterSelectQuery relies on the
storage to apply these filters when `supportsPrewhere` is true and does not
add a fallback FilterStep to the query plan, so the filter is silently lost.

The fix prevents `need_only_count` from being set when an active
`row_level_filter` or `prewhere_info` exists in the format filter info.

Fixes #1595
@mkmkme mkmkme added bugfix port-antalya PRs to be ported to all new Antalya releases antalya-26.1 antalya-26.1.6.20001 labels Mar 30, 2026
@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented Mar 30, 2026

AI audit note: This review comment was generated by AI (claude-4.6-opus-high-thinking).

Audit update for PR #1597 (Fix row policies silently ignored on Iceberg tables with PREWHERE enabled):

Confirmed defects

No confirmed defects in reviewed scope.

The fix is correct and minimal. The analysis follows.

Bug being fixed (pre-existing): When allow_experimental_iceberg_read_optimization is enabled and Iceberg metadata identifies all requested columns as constant, the code unconditionally set need_only_count = true. In the ParquetV3 reader, this causes read() (at ParquetV3BlockInputFormat.cpp:105) to skip initializeIfNeeded() entirely — reading only file_metadata.num_rows from Parquet metadata and returning raw row counts. This completely bypasses preparePrewhere() (called via ReadManager), which is where row_level_filter (row policies) and prewhere_info are applied. Since InterpreterSelectQuery does not add a fallback FilterStep for these when supportsPrewhere() is true, the filters are silently lost.

Fix correctness: The added guard (!format_filter_info || (!format_filter_info->row_level_filter && !format_filter_info->prewhere_info)) prevents the need_only_count override when either filter exists. When need_only_count stays false:

  • The Parquet reader initializes normally, reads filter columns via format_header, and applies prewhere/row-level filters during reading
  • ExtractColumnsTransform then strips filter-only columns (producing correct row counts)
  • generate() restores constant columns from constant_columns_with_values with the filtered row count

Null safety: The fix correctly short-circuits on !format_filter_info before dereferencing members.

Consistency with caller: The caller in StorageObjectStorage.cpp:448-452 already guards need_only_count with !read_from_format_info.prewhere_info && !read_from_format_info.row_level_filter. Both are populated from query_info, matching the format_filter_info fields checked by the fix.

filter_actions_dag not checked (correct by design): The fix intentionally does not guard against filter_actions_dag. This is correct because WHERE-clause pushdown via filter_actions_dag has a plan-level FilterStep fallback. Even if the format reader skips it, the plan re-evaluates the WHERE on the constant columns restored by generate(), producing correct results.

Coverage summary

  • Scope reviewed: Single-file change in StorageObjectStorageSource::createReader (static overload, lines 573–964); callers in StorageObjectStorage::read and ReadFromObjectStorageStep::initializePipeline; downstream consumers in ParquetV3BlockInputFormat::read, ParquetBlockInputFormat, ReadManager::preparePrewhere; FormatFilterInfo struct definition.
  • Categories passed: Null-pointer safety, guard condition logic, consistency with caller-level guard, constant-column restoration correctness, cache-path (num_rows_from_cache) correctness, thread-safety of shared_ptr parameter, C++ lifetime/RAII (no new objects or ownership changes).
  • Categories failed: None.
  • Assumptions/limits: No regression test is included in the PR; correctness of the fix relies on the plan-level FilterStep fallback for filter_actions_dag (verified by code inspection, not runtime).

@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented Mar 30, 2026

I've also checked the regression test is passing with this change.

Copy link
Copy Markdown

@ianton-ru ianton-ru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

antalya-26.1 antalya-26.1.6.20001 bugfix port-antalya PRs to be ported to all new Antalya releases

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants