Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 1 addition & 4 deletions src/Storages/ColumnsDescription.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -976,10 +976,7 @@ std::vector<String> ColumnsDescription::getAllRegisteredNames() const
std::vector<String> names;
names.reserve(columns.size());
for (const auto & column : columns)
{
if (!column.name.contains('.'))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now PR only remove this single line.
This code was added in ClickHouse#35852
Can this change break hints?
For example return all database.table combinations instead of only database, or something like this?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this check was added, it is possible it was added by some reason in that PR. Why?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked into both — why the filter was there, and whether removing it breaks hints. Short answer: it's safe, and the database.table case can't arise here.

Why ClickHouse#35852 added it. That PR introduced ColumnsDescription::getAllRegisteredNames together with the dot-filter as part of a new feature (column-name hints). The filter was a design choice to keep flattened-Nested subcolumns (e.g. n.a) out of the suggestion list — not a guard added to fix a specific breakage. So removing it doesn't re-open a known bug; it only makes dotted / Nested-subcolumn names eligible as suggestions.

It's column-scoped — so no database.table. This method iterates one table's columns and returns each column.name. It only ever returns column names of a single table, never database or table names. The . in a.b is part of a literal column name, not a db.table separator, so no database.table combination can originate here.

No consumer splits on .. The list feeds only the IHints prompter (NamePrompter::getHints), which is pure Levenshtein edit-distance over whole candidate strings — it never tokenizes on .. The consumers are the "no such column" / ALTER MODIFY|RENAME|DROP hint messages, plus StorageSystemZooKeeper (whose columns are hardcoded and contain no dot, so the change is a no-op there). A dotted name is just one more intact candidate; it appears only if it's an edit-distance near-miss of the typed name.

Observed output (filter removed):

MODIFY `a.x`      -> Maybe you meant: ['a.b'].
multi-dot `a.b.x` -> Maybe you meant: ['a.b.c'].   (intact, not split)
Nested `n.x`      -> Maybe you meant: ['n.a'].
non-dotted `idd`  -> Maybe you meant: ['id'].       (machinery unchanged)
far typo `xyz`    -> (no suggestion)                (never fabricates combos)

Multi-dot names stay whole; a far typo yields no suggestion. Removing the filter strictly adds correct near-match candidates and never malforms one. (I inferred the "before" from the single removed line plus the xyz no-match behavior rather than building a separate base binary — happy to produce a literal side-by-side if you'd like one.)

I also tightened 04260 to assert the dotted name appears specifically inside the Maybe you meant: [...] list, so if the filter were ever re-added (emptying that list) the test fails directly.

names.push_back(column.name);
}
names.emplace_back(column.name);
return names;
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -216,3 +216,101 @@ def test_deeply_nested_struct_with_dotted_names(started_cluster_iceberg_with_spa
).strip()
expected = "deep_value1\ndeep_value2\ndeep_value3"
assert result == expected, f"Expected:\n{expected}\nGot:\n{result}"


@pytest.mark.parametrize("storage_type", ["s3", "azure", "local"])
def test_dotted_array_column(started_cluster_iceberg_with_spark, storage_type):
"""
Regression test for issue #90731.
A top-level ARRAY column whose name literally contains a dot (e.g. `a.b`)
must be returned with its actual values, not as an empty array.
"""
instance = started_cluster_iceberg_with_spark.instances["node1"]
spark = started_cluster_iceberg_with_spark.spark_session
TABLE_NAME = "test_dotted_array_column_" + storage_type + "_" + get_uuid_str()

from pyspark.sql.types import ArrayType

data = [(["a", "b", "c"],)]
schema = StructType([
StructField("a.b", ArrayType(StringType())),
])
df = spark.createDataFrame(data=data, schema=schema)

write_iceberg_from_df(spark, df, TABLE_NAME, mode="overwrite", format_version="2")

default_upload_directory(
started_cluster_iceberg_with_spark,
storage_type,
f"/iceberg_data/default/{TABLE_NAME}/",
f"/iceberg_data/default/{TABLE_NAME}/",
)

# Test via table function
table_function_expr = get_creation_expression(
storage_type, TABLE_NAME, started_cluster_iceberg_with_spark, table_function=True
)

result = instance.query(
f"SELECT `a.b` FROM {table_function_expr}"
).strip()
assert result == "['a','b','c']", f"Expected ['a','b','c'], got: {result}"

# Test via table engine
create_iceberg_table(storage_type, instance, TABLE_NAME, started_cluster_iceberg_with_spark)

result = instance.query(
f"SELECT `a.b` FROM {TABLE_NAME}"
).strip()
assert result == "['a','b','c']", f"Expected ['a','b','c'], got: {result}"


@pytest.mark.parametrize("storage_type", ["s3", "azure", "local"])
def test_dotted_array_alongside_real_nested(started_cluster_iceberg_with_spark, storage_type):
"""
Regression guard: a lone dotted Array column (`a.b`) must not interfere with
a genuine flat-Nested group (`c.x`, `c.y`) that shares a different prefix.
All three columns must round-trip correctly.
"""
instance = started_cluster_iceberg_with_spark.instances["node1"]
spark = started_cluster_iceberg_with_spark.spark_session
TABLE_NAME = "test_dotted_array_alongside_real_nested_" + storage_type + "_" + get_uuid_str()

from pyspark.sql.types import ArrayType, IntegerType as SparkIntegerType

data = [(["a", "b", "c"], [1, 2], ["p", "q"])]
schema = StructType([
StructField("a.b", ArrayType(StringType())),
StructField("c.x", ArrayType(SparkIntegerType())),
StructField("c.y", ArrayType(StringType())),
])
df = spark.createDataFrame(data=data, schema=schema)

write_iceberg_from_df(spark, df, TABLE_NAME, mode="overwrite", format_version="2")

default_upload_directory(
started_cluster_iceberg_with_spark,
storage_type,
f"/iceberg_data/default/{TABLE_NAME}/",
f"/iceberg_data/default/{TABLE_NAME}/",
)

# Test via table function
table_function_expr = get_creation_expression(
storage_type, TABLE_NAME, started_cluster_iceberg_with_spark, table_function=True
)

result = instance.query(
f"SELECT `a.b`, `c.x`, `c.y` FROM {table_function_expr}"
).strip()
assert result == "['a','b','c']\t[1,2]\t['p','q']", \
f"Unexpected result via table function: {result}"

# Test via table engine
create_iceberg_table(storage_type, instance, TABLE_NAME, started_cluster_iceberg_with_spark)

result = instance.query(
f"SELECT `a.b`, `c.x`, `c.y` FROM {TABLE_NAME}"
).strip()
assert result == "['a','b','c']\t[1,2]\t['p','q']", \
f"Unexpected result via table engine: {result}"
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
ok
22 changes: 22 additions & 0 deletions tests/queries/0_stateless/04260_dotted_column_in_hints.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
#!/usr/bin/env bash
# Regression test for #90731.
# ColumnsDescription::getAllRegisteredNames must include columns whose names
# contain a dot, so they appear in IHints suggestions after a typo.

CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CURDIR"/../shell_config.sh

$CLICKHOUSE_CLIENT -q "
CREATE TABLE t_dotted_hint (\`a.b\` Array(String))
ENGINE = MergeTree ORDER BY tuple();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with this test. It's testing MergeTree whilst the issue seemed to be for Iceberg specifically. Or am I getting it wrong?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point. The fix isn't in Iceberg code — the bug surfaced via Iceberg, but the root cause was in engine-agnostic machinery:

  • ColumnsDescription::getAllRegisteredNames (the dot-name filter) and NestedUtils. This test targets exactly the line this PR changes (the hint path for a dotted column name), and MergeTree is just the simplest engine to exercise it deterministically without a Spark+S3 setup.
  • It's also what answers @ianton-ru's question above — it pins the hint behavior for a dotted name so we don't silently regress Refactoring of hints for column descriptor ClickHouse/ClickHouse#35852. I'll make sure it asserts the actual hint output (not just readability) and link the before/after here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also tells the missing hole of Iceberg test suite I've shared with you earlier and I will share my findings into our team channel

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix isn't in Iceberg-specific code — the bug surfaced via Iceberg, but the root cause was in engine-agnostic machinery: the dot-name filter in ColumnsDescription::getAllRegisteredNames. This test targets exactly the line this PR removes (the hint path for a dotted column name), and MergeTree is just the simplest engine to exercise it deterministically without a Spark+S3 setup. It's also what backs @ianton-ru's question above — it pins the hint behavior for a dotted name so re-adding the filter (which would empty the Maybe you meant: [...] list) fails the test directly. I tightened it to assert the dotted name appears specifically inside that suggestion list, not just anywhere in the message.

"

# Misspell the column name; the dotted name must appear in the IHints suggestion
# list ("Maybe you meant: [...]"), which is produced from
# ColumnsDescription::getAllRegisteredNames. If that method filtered out dotted
# names again, the suggestion list would be empty and this would fail.
$CLICKHOUSE_CLIENT -q "ALTER TABLE t_dotted_hint MODIFY COLUMN a_b Array(String);" 2>&1 \
| grep -qF "Maybe you meant: ['a.b']" && echo "ok" || echo "FAIL"

$CLICKHOUSE_CLIENT -q "DROP TABLE t_dotted_hint;"
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
['x','y','z']
['x','y','z']
['a','b','c'] [1,2] ['p','q']
['a','b','c'] [1,2] ['p','q']
42 changes: 42 additions & 0 deletions tests/queries/0_stateless/04261_iceberg_dotted_array.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
# Tags: no-fasttest
# Regression test for https://github.com/ClickHouse/ClickHouse/issues/90731.
# An Iceberg ARRAY column whose name literally contains a dot (e.g. `a.b`) must
# read back its stored values, not an empty array. The mixed case also checks a
# lone dotted column next to several dotted columns that share a prefix
# (`c.x`, `c.y`): all of them must round-trip correctly.

CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CUR_DIR"/../shell_config.sh

LONE_PATH="${CLICKHOUSE_USER_FILES}/lakehouses/${CLICKHOUSE_DATABASE}_iceberg_dotted_lone"
MIXED_PATH="${CLICKHOUSE_USER_FILES}/lakehouses/${CLICKHOUSE_DATABASE}_iceberg_dotted_mixed"
rm -rf "${LONE_PATH}" "${MIXED_PATH}"

# 1) A lone dotted Array column must read its stored values, not [].
${CLICKHOUSE_CLIENT} --query "
SET allow_experimental_insert_into_iceberg = 1;
DROP TABLE IF EXISTS t_iceberg_dotted_lone;
CREATE TABLE t_iceberg_dotted_lone (\`a.b\` Array(String)) ENGINE = IcebergLocal('${LONE_PATH}', 'Parquet');
INSERT INTO t_iceberg_dotted_lone (\`a.b\`) SELECT ['x', 'y', 'z'];
"
# Read through the Iceberg engine table ...
${CLICKHOUSE_CLIENT} --query "SELECT \`a.b\` FROM t_iceberg_dotted_lone"
# ... and through the icebergLocal table function with an explicitly-declared schema.
${CLICKHOUSE_CLIENT} --query "SELECT \`a.b\` FROM icebergLocal('${LONE_PATH}', 'Parquet', '\`a.b\` Array(String)')"

# 2) A lone dotted column alongside dotted columns that share a prefix (c.x, c.y).
${CLICKHOUSE_CLIENT} --query "
SET allow_experimental_insert_into_iceberg = 1;
DROP TABLE IF EXISTS t_iceberg_dotted_mixed;
CREATE TABLE t_iceberg_dotted_mixed (\`a.b\` Array(String), \`c.x\` Array(Int32), \`c.y\` Array(String)) ENGINE = IcebergLocal('${MIXED_PATH}', 'Parquet');
INSERT INTO t_iceberg_dotted_mixed (\`a.b\`, \`c.x\`, \`c.y\`) SELECT ['a', 'b', 'c'], [1, 2], ['p', 'q'];
"
${CLICKHOUSE_CLIENT} --query "SELECT \`a.b\`, \`c.x\`, \`c.y\` FROM t_iceberg_dotted_mixed"
${CLICKHOUSE_CLIENT} --query "SELECT \`a.b\`, \`c.x\`, \`c.y\` FROM icebergLocal('${MIXED_PATH}', 'Parquet', '\`a.b\` Array(String), \`c.x\` Array(Int32), \`c.y\` Array(String)')"

# Cleanup
${CLICKHOUSE_CLIENT} --query "DROP TABLE IF EXISTS t_iceberg_dotted_lone"
${CLICKHOUSE_CLIENT} --query "DROP TABLE IF EXISTS t_iceberg_dotted_mixed"
rm -rf "${LONE_PATH}" "${MIXED_PATH}"
Loading