Skip to content

Antalya 26.1 Backport of #97062 - Improve catalog show tables query#1552

Merged
zvonand merged 3 commits intoantalya-26.1from
backports/antalya-26.1/97062
Mar 26, 2026
Merged

Antalya 26.1 Backport of #97062 - Improve catalog show tables query#1552
zvonand merged 3 commits intoantalya-26.1from
backports/antalya-26.1/97062

Conversation

@mkmkme
Copy link
Copy Markdown
Collaborator

@mkmkme mkmkme commented Mar 19, 2026

Revert "Revert "Improve catalog show tables query""

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Improved processing 'show tables' query by fetching only names of tables and improved getLightweightTablesIterator to return structure containing only table names (ClickHouse#97062 by @SmitaRKulkarni)

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

SmitaRKulkarni and others added 2 commits March 19, 2026 12:42
…rt-94467-Improve_catalog_show_tables

Revert "Revert "Improve catalog show tables query""
The original test leaks the failpoint and all the tests in this module
after this test can hit a 10-second sleep on `tryGetTableImpl`. It
doesn't happen in practice, because this is the last test in the module.

However, for the sake of keeping CI stable we'll fix that leak,
considering the fix is trivial.
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 19, 2026

Workflow [PR], commit [d08223b]

@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented Mar 19, 2026

I also ran an audit check and let Claude analyze the output. Here's the result:

Concern Valid? Upstream too? Worth fixing?
1. Unbounded block size Yes Yes Nice-to-have, low practical impact
2. require_metadata_access bypassed Technically yes Yes Arguable — current behavior is arguably correct
3. Failpoint leak Yes Yes Yes, trivial fix

I have fixed the last one.

Copy link
Copy Markdown
Collaborator

@arthurpassos arthurpassos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my eyes are sharp, this seems to be a 1:1 match. LGTM

@ianton-ru
Copy link
Copy Markdown

Just to history

Audit Report: PR #1552 (Altinity/ClickHouse) - Improve catalog show tables query

PR: Antalya 26.1 Backport of #97062 - Improve catalog show tables query #1552
Commits in scope: 3a441ec3115 ("Merge pull request ClickHouse#97062 from ClickHouse/revert-96826-revert-94467-Improve_catalog_show_tables"), 94cfd2a0808 ("test_database_glue: fix the failpoint leak")


AI audit note: This review comment was generated by AI.

Confirmed defects

Low: name-only system.tables path no longer honors max_block_size

Impact: A single datalake database with many visible tables can be emitted as one oversized chunk for SHOW TABLES-style queries, causing unnecessary memory spikes and defeating block-size based pacing.
Anchor: src/Storages/System/StorageSystemTables.cpp / fillTableNamesOnly(), generate()
Trigger: Run SHOW TABLES FROM <datalake_db> or SELECT name[, database] FROM system.tables ... when one database has more matching tables than max_block_size.
Why defect: The normal iterator path stops at rows_count < max_block_size, but the new fast path appends every matching table in the database before returning, then adds the whole count back to rows_count.

Affected subsystem and blast radius: name-only system.tables and rewritten SHOW TABLES queries over datalake catalogs.

Code evidence:

size_t rows_added = fillTableNamesOnly(res_columns);
rows_count += rows_added;
continue;
for (const auto & table_detail: table_details)
{
    ...
    res_columns[res_index++]->insert(table_detail.name);
    ++count;
}

Fix direction (short): cap the fast path at remaining block capacity and keep per-database iteration state across generate() calls.
Regression test direction (short): create more than max_block_size datalake tables and assert system.tables/SHOW TABLES produces bounded chunks instead of one oversized block.

Low: database_datalake_require_metadata_access=1 is bypassed for name-only table listing

Impact: In strict mode, SHOW TABLES and equivalent name-only system.tables queries can now succeed and reveal table names even when fetching table metadata would fail; operators lose the documented fail-on-metadata-inaccessibility behavior for those queries.
Anchor: src/Storages/System/StorageSystemTables.cpp / name-only fast path, src/Databases/DataLake/DatabaseDataLake.cpp / getLightweightTablesIterator(), src/Core/Settings.cpp / database_datalake_require_metadata_access
Trigger: A datalake catalog lists a table, metadata access for that table fails, database_datalake_require_metadata_access=1, and the query only requests table names.
Why defect: The old lightweight path still called tryGetTableImpl(..., true, ...) and enforced the setting on metadata-fetch failures. The new implementation returns names directly from catalog->getTables() and never performs metadata access, so the strict setting cannot fire.

Affected subsystem and blast radius: strict metadata-access behavior for datalake-backed SHOW TABLES and name-only system.tables reads.

Code evidence:

iceberg_tables = catalog->getTables();
...
result.emplace_back(table_name);
DECLARE(Bool, database_datalake_require_metadata_access, true, R"(
Either to throw error or not if we don't have rights to get table's metadata ...
)", 0)

Fix direction (short): either preserve strict-mode checks for name-only listing when the setting is enabled, or explicitly redefine/document the setting so this becomes intentional.
Regression test direction (short): inject metadata-access failure and assert SHOW TABLES FROM <catalog> still errors when database_datalake_require_metadata_access=1.

Coverage summary

Scope reviewed: Full call graph for the PR's changed paths: InterpreterShowTablesQuery rewrite into system.tables, StorageSystemTables filtering/generation fast path, IDatabase lightweight iterator contract, DatabaseDataLake full and lightweight iterator behavior, and the failpoint/test fix in test_database_glue.
Categories failed: Block-size enforcement; strict metadata-access setting contract.
Categories passed: Caller migration back to full iterators where StoragePtr is required (system.columns, system.completions, system.iceberg_history, SYSTEM DROP REPLICA); datalake error propagation in full iterator path; failpoint leak fix; reviewed C++ lifetime/race/deadlock/exception-safety concerns in touched code showed no confirmed defects.
Assumptions/limits: Static audit only; no runtime reproduction or CI execution performed.

@alsugiliazova
Copy link
Copy Markdown
Member

alsugiliazova commented Mar 25, 2026

PR #1552 CI Verification Report

CI Results Overview

Category Count
Success 55
Skipped 39 (excluded sanitizer suites)
Failure 2
Total (result_pr.json) 96

CI Test-Level Failures

1. test_s3_cluster/test.py::test_graceful_shutdown[6-10] — Known Flaky (Altinity-specific)

Job: Integration tests (amd_asan, targeted)
Duration: 46.3s
Retry: Not retried (no retry_ok label)

The targeted integration test was selected to run previously-failed tests: test_graceful_shutdown and test_kafka_formats_with_broken_message. The test ran 10 parameterized instances — 9 passed, 1 failed ([6-10] with assert 4 == 0).

This is the known pre-existing UAF bug in StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica() (investigated in PRs #1550, #1561). It is Altinity-specific (not in the upstream ClickHouse CI database). The fix is tracked in PR #1561.

Related to PR: No

2. 01171_mv_select_insert_isolation_long — Known Flaky (28 upstream failures)

Job: Stateless tests (amd_asan, distributed plan, parallel, 1/2)
Duration: 467.5s
Retry: Passed on retry (retry_ok label)

Known flaky test with 28 documented failures in the upstream ClickHouse CI database in the last 60 days.

Related to PR: No

3. GrypeScanServer (-alpine) — CVE Scan

CVE-2026-2673 found in OpenSSL packages (libcrypto3/libssl3 3.5.5-r0) in the Alpine base image.

Related to PR: No — Base image vulnerability scan

PR's New Test Validation

The PR adds test_show_tables_optimization to tests/integration/test_database_glue/test.py. This test was successfully executed and passed in two independent CI configurations:

Job Shard Result Log
Integration tests (amd_binary, 2/5) 2/5 OK job/68583547762
Integration tests (amd_asan, db disk, old analyzer, 5/6) 5/6 OK job/68583547824

Conclusion

All CI failures are unrelated to the PR:

Failure Root Cause PR-Related
test_graceful_shutdown[6-10] Known UAF bug (PR #1561) No
01171_mv_select_insert_isolation_long Known flaky (28 upstream failures), passed on retry No
Settings regression Snapshot baseline mismatch No (no settings changed)
Swarms regression Docker Hub auth failure No (infrastructure)
Parquet regression Pre-existing null handling failure No (different subsystem)
S3Export regression MinIO stress timeout, cancelled No (different subsystem)
GrypeScan CVE in Alpine OpenSSL No (base image)

Verdict: Ready to merge — no PR-related failures detected, and the new test is validated.

@alsugiliazova alsugiliazova added the verified Verified by QA label Mar 26, 2026
@zvonand zvonand merged commit c291584 into antalya-26.1 Mar 26, 2026
487 of 509 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants