[SPARK-57068][SQL] Make SaveMode.Overwrite create the table when missing for SupportsCatalogOptions sources by LuciferYang · Pull Request #56111 · apache/spark

LuciferYang · 2026-05-26T07:46:35Z

What changes were proposed in this pull request?

DataFrameWriter.saveCommand calls catalog.loadTable(ident) for SaveMode.Append | SaveMode.Overwrite without a try/catch when the V2 source implements SupportsCatalogOptions, so writing to a brand-new identifier throws:

org.apache.spark.sql.catalyst.analysis.NoSuchTableException: ...
  at DataFrameWriter.saveCommand(DataFrameWriter.scala:179)

This PR catches NoSuchTableException in the Overwrite case and falls back to CreateTableAsSelect. Append keeps throwing, because it is defined as appending to existing data and silently creating a table would hide user mistakes. A new internal conf spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows restores the old behavior. The shared CreateTableAsSelect construction and its schema-evolution guard are pulled into small private helpers (createTableAsSelectForCatalogOptions and assertSchemaEvolutionNotEnabledForCreateTableWrite, mirroring the existing assertSchemaEvolutionNotEnabledForV1Write), leaving the ErrorIfExists/Ignore path's behavior and error precedence unchanged.

Why are the changes needed?

df.write.format(provider).mode("overwrite").save("/some/new/path") should work on a brand-new path the same way it does for parquet/json/orc. Today it throws NoSuchTableException for any V2 source implementing SupportsCatalogOptions (Iceberg, Lance, custom connectors). The asymmetry has been around since SPARK-29219 introduced SupportsCatalogOptions in Spark 3.0: the V1 branch never goes through loadTable, so this only shows up on the V2 path. The full behavior matrix after this PR:

Mode × target	V1	V2 before	V2 after
Overwrite, missing	creates	throws	creates
Overwrite, existing	truncate + write	overwrite	unchanged
Append, missing	creates	throws	throws
Append, existing	append	append	unchanged
ErrorIfExists, missing	creates	creates	unchanged
ErrorIfExists, existing	throws	throws	unchanged
Ignore, missing	creates	creates	unchanged
Ignore, existing	no-op	no-op	unchanged

Only the first row changes. Append-on-missing intentionally stays a strict failure; aligning it with V1 would silently create a table the user expected to already exist.

Does this PR introduce any user-facing change?

Yes. Writing with mode("overwrite") to a not-yet-existing table now creates it instead of throwing NoSuchTableException; the migration guide is updated to reflect this. Setting spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows=true restores the old behavior, and mode("append") on a missing table still throws.

How was this patch tested?

New tests in SupportsCatalogOptionsSuite. Four reuse the existing testCreateAndRead helper, so Overwrite-on-missing now has the same coverage shape (session/testcat catalog, with and without partitioning) as the existing ErrorIfExists and Ignore tests. Five more pin specific edges:

Append-on-missing still throws NoSuchTableException.
The legacy conf restores the throw on Overwrite-missing.
The legacy conf leaves Overwrite of an existing table untouched (truncate + write), since the flag is only consulted inside the missing-table catch.
withSchemaEvolution() + mode("overwrite") on a missing table raises UNSUPPORTED_SCHEMA_EVOLUTION.CREATE_TABLE.
With the legacy conf on, the NoSuchTableException throw takes precedence over the schema-evolution gate, even when withSchemaEvolution() is set.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

LuciferYang · 2026-05-26T07:50:43Z

+      .doc("When set to true, SaveMode.Overwrite against a missing table on a " +
+        "SupportsCatalogOptions source throws NoSuchTableException instead of " +
+        "creating the table. Restores the pre-SPARK-57068 behavior.")
+      .version("4.3.0")


4.3.0 or 4.2.0？

…ing for SupportsCatalogOptions sources ### What changes were proposed in this pull request? In `DataFrameWriter.saveCommand`, the `SaveMode.Append | SaveMode.Overwrite` branch calls `catalog.loadTable(ident)` without catching `NoSuchTableException` when the V2 source implements `SupportsCatalogOptions`. The exception propagates straight to the user, even though `SaveMode.ErrorIfExists` and `SaveMode.Ignore` on the same call succeed by routing to `CreateTableAsSelect`. This change catches `NoSuchTableException` for `SaveMode.Overwrite` only and routes to `CreateTableAsSelect(ignoreIfExists = false)`, mirroring the `createMode` arm immediately below. `SaveMode.Append` on a non-existent identifier intentionally continues to throw, because Append explicitly expects an existing table and silently creating would mask user mistakes. A new internal SQL conf `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows` restores the pre-fix behavior for users who depend on it. The `CreateTableAsSelect` construction shared between the new fall-back path and the existing `createMode` arm is extracted into a private helper `createTableAsSelectForCatalogOptions` to keep both sites in sync. ### Why are the changes needed? The most idiomatic write call for any V2 connector, df.write.format(provider).mode("overwrite").save(newPath) fails with `NoSuchTableException` when `newPath` does not yet exist, whereas the equivalent V1 call (e.g. `format("parquet")`) succeeds by creating the table. V2 sources that implement `SupportsCatalogOptions` (Iceberg, Lance, and custom connectors) all hit this asymmetry. The fix aligns V2 `SaveMode.Overwrite` semantics with V1: overwrite-on-missing creates the table, overwrite-on-existing truncates and writes. Behavior matrix after this change: | Mode × Target | V1 | V2 before | V2 after | |------------------------|---------------|--------------|------------| | Overwrite, missing | creates | **throws** | creates | | Overwrite, existing | truncate+write| overwrite | unchanged | | Append, missing | creates | throws | throws* | | Append, existing | append | append | unchanged | | ErrorIfExists, missing | creates | creates | unchanged | | ErrorIfExists, existing| throws | throws | unchanged | | Ignore, missing | creates | creates | unchanged | | Ignore, existing | no-op | no-op | unchanged | \* Intentional V1 divergence — see PR description. There is an inherent race window between `loadTable` (throws) and `CreateTableAsSelect`: a concurrent writer creating the table in between will cause `TableAlreadyExistsException` rather than overwriting. This is acceptable; V1's filesystem-atomic path doesn't expose it because V1 never consults a catalog. Users retry. ### Does this PR introduce _any_ user-facing change? Yes. `df.write.format(<V2 SupportsCatalogOptions source>).mode("overwrite") .save(<new identifier>)` now creates the table instead of throwing `NoSuchTableException`. No behavior change for paths that already exist. The migration guide has been updated. The legacy flag `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows` restores the prior behavior. ### How was this patch tested? New tests in `SupportsCatalogOptionsSuite`: - `save works with Overwrite - no table, no partitioning, session catalog` - `save works with Overwrite - no table, with partitioning, session catalog` - `save works with Overwrite - no table, no partitioning, testcat catalog` - `save works with Overwrite - no table, with partitioning, testcat catalog` These reuse the existing `testCreateAndRead` helper, which verifies catalog state (table identity, partitioning, columns) in addition to data. Plus three behavior-pinning tests: - `Append mode still fails when table is missing - testcat catalog` (pins the intentional Append divergence) - `legacy flag restores throw on Overwrite-missing` (verifies the new conf) - `Overwrite + withSchemaEvolution on missing table is rejected` (verifies the schema-evolution gate fires with the expected error class) Existing tests continue to pass. ### Was this patch authored or co-authored using generative AI tooling? No.

…ecedence and strengthen tests - Hoist the "creating a table rejects schema evolution" check into a dedicated assertSchemaEvolutionNotEnabledForCreateTableWrite() helper (mirroring the existing assertSchemaEvolutionNotEnabledForV1Write), called before identifier/catalog extraction on the createMode arm. The previous refactor had moved this check after extraction, changing which error surfaces when a withSchemaEvolution() write also has a bad catalog/identifier. This restores the original precedence. createTableAsSelectForCatalogOptions is now a pure plan builder. - Add two behavior-pinning tests: the legacy flag is inert for Overwrite of an existing table (it is only consulted inside the missing-table catch), and the legacy throw takes precedence over the schema-evolution gate on a missing table. - Drop the redundant `s` interpolator on the four new test names. - Reword the migration-guide entry so it no longer implies save() takes the identifier as a positional argument.

gengliangwang · 2026-06-03T03:50:41Z

The changes look reasonable. However, I am not sure about whether this was intentional for some reasons. cc @cloud-fan @brkyvz @aokolnychyi WDYT?

…ptions

cloud-fan

1 blocking, 0 non-blocking, 0 nits. Well-implemented and well-tested, but I think the design premise needs reconsidering.

Design / architecture (1)

DataFrameWriter.scala:182: for a SupportsCatalogOptions source, save() should follow SQL table semantics, not the path-based "overwrite creates" model — see inline

For a SupportsCatalogOptions source, save() resolves a real catalog table identity and is already DML: mode("append") builds AppendData (= SQL INSERT INTO) and mode("overwrite") on an existing table builds OverwriteByExpression(Literal(true)) (= SQL INSERT OVERWRITE — truncate + write, schema preserved; DataFrameWriter.scala:210-214). Both SQL DML verbs require an existing table — a missing target resolves through UnresolvedRelation/ResolveRelations and fails with TABLE_OR_VIEW_NOT_FOUND/NoSuchTableException; there is no auto-create on insert (the only create-on-write SQL surface is CREATE [OR REPLACE] TABLE ... AS SELECT).

So under the SQL/catalog model that fits this source, both Append-missing and Overwrite-missing should throw — which is exactly the pre-PR behavior, and it's internally consistent. This PR special-cases only Overwrite-missing → CreateTableAsSelect, importing the path-based model (apt for V1 pathless sources) into a catalog-backed source and breaking the Append/Overwrite symmetry. The create-or-replace verb already exists as a separate tool: saveAsTable's Overwrite uses ReplaceTableAsSelect(orCreate = true) (DataFrameWriter.scala:523-550) = CREATE OR REPLACE TABLE AS SELECT.

Suggestion: keep save().mode("overwrite") throwing on a missing catalog table (INSERT OVERWRITE semantics); users who want create-or-replace should use saveAsTable. (cc @gengliangwang — I think this answers your question on whether the change is intentional.)

cloud-fan · 2026-06-03T06:22:40Z

+              try {
+                (catalog.loadTable(ident), Some(catalog), Some(ident))
+              } catch {
+                // SPARK-57068: align Overwrite-on-missing with V1 source behavior


I'd reconsider aligning with V1 source behavior here. A SupportsCatalogOptions source resolves a real catalog table, so save() is DML on that table, not path-based data writing: mode("overwrite") on an existing table is OverwriteByExpression(Literal(true)) just below (= SQL INSERT OVERWRITE, truncate + write), and mode("append") is AppendData (= INSERT INTO).

SQL INSERT OVERWRITE/INSERT INTO both require an existing table and throw TABLE_OR_VIEW_NOT_FOUND on a missing one — there's no auto-create on insert. So under SQL table semantics both Overwrite-missing and Append-missing should throw, which is the current (consistent) behavior. Creating only on Overwrite-missing breaks that symmetry and mixes in the V1 pathless-source model.

The create-or-replace verb already exists via saveAsTable (ReplaceTableAsSelect(orCreate = true) = CREATE OR REPLACE TABLE AS SELECT). My suggestion is to leave save().mode("overwrite") throwing on a missing catalog table and point users at saveAsTable for create-or-replace.

Agreed. Will close this pr. Thanks @cloud-fan and @gengliangwang

LuciferYang commented May 26, 2026

View reviewed changes

LuciferYang mentioned this pull request May 26, 2026

SaveMode.Overwrite on a non-existent path throws NoSuchTableException lance-format/lance-spark#555

Open

LuciferYang force-pushed the fix-overwrite-nonexistent-catalog-options branch from 4a2a359 to 0b3a0ef Compare May 26, 2026 11:50

LuciferYang requested review from cloud-fan and gengliangwang June 2, 2026 05:50

LuciferYang marked this pull request as draft June 2, 2026 11:30

LuciferYang marked this pull request as ready for review June 2, 2026 13:00

Merge branch 'apache:master' into fix-overwrite-nonexistent-catalog-o…

310d398

…ptions

cloud-fan requested changes Jun 3, 2026

View reviewed changes

LuciferYang closed this Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57068][SQL] Make SaveMode.Overwrite create the table when missing for SupportsCatalogOptions sources#56111

[SPARK-57068][SQL] Make SaveMode.Overwrite create the table when missing for SupportsCatalogOptions sources#56111
LuciferYang wants to merge 3 commits into
apache:masterfrom
LuciferYang:fix-overwrite-nonexistent-catalog-options

LuciferYang commented May 26, 2026 •

edited

Loading

Uh oh!

LuciferYang May 26, 2026

Uh oh!

gengliangwang commented Jun 3, 2026

Uh oh!

cloud-fan left a comment

Uh oh!

cloud-fan Jun 3, 2026

Uh oh!

LuciferYang Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

LuciferYang commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

LuciferYang May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gengliangwang commented Jun 3, 2026

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Design / architecture (1)

Uh oh!

cloud-fan Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

LuciferYang Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

LuciferYang commented May 26, 2026 •

edited

Loading