Skip to content

[SPARK-57068][SQL] Make SaveMode.Overwrite create the table when missing for SupportsCatalogOptions sources#56111

Closed
LuciferYang wants to merge 3 commits into
apache:masterfrom
LuciferYang:fix-overwrite-nonexistent-catalog-options
Closed

[SPARK-57068][SQL] Make SaveMode.Overwrite create the table when missing for SupportsCatalogOptions sources#56111
LuciferYang wants to merge 3 commits into
apache:masterfrom
LuciferYang:fix-overwrite-nonexistent-catalog-options

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

@LuciferYang LuciferYang commented May 26, 2026

What changes were proposed in this pull request?

DataFrameWriter.saveCommand calls catalog.loadTable(ident) for SaveMode.Append | SaveMode.Overwrite without a try/catch when the V2 source implements SupportsCatalogOptions, so writing to a brand-new identifier throws:

org.apache.spark.sql.catalyst.analysis.NoSuchTableException: ...
  at DataFrameWriter.saveCommand(DataFrameWriter.scala:179)

This PR catches NoSuchTableException in the Overwrite case and falls back to CreateTableAsSelect. Append keeps throwing, because it is defined as appending to existing data and silently creating a table would hide user mistakes. A new internal conf spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows restores the old behavior. The shared CreateTableAsSelect construction and its schema-evolution guard are pulled into small private helpers (createTableAsSelectForCatalogOptions and assertSchemaEvolutionNotEnabledForCreateTableWrite, mirroring the existing assertSchemaEvolutionNotEnabledForV1Write), leaving the ErrorIfExists/Ignore path's behavior and error precedence unchanged.

Why are the changes needed?

df.write.format(provider).mode("overwrite").save("/some/new/path") should work on a brand-new path the same way it does for parquet/json/orc. Today it throws NoSuchTableException for any V2 source implementing SupportsCatalogOptions (Iceberg, Lance, custom connectors). The asymmetry has been around since SPARK-29219 introduced SupportsCatalogOptions in Spark 3.0: the V1 branch never goes through loadTable, so this only shows up on the V2 path. The full behavior matrix after this PR:

Mode × target V1 V2 before V2 after
Overwrite, missing creates throws creates
Overwrite, existing truncate + write overwrite unchanged
Append, missing creates throws throws
Append, existing append append unchanged
ErrorIfExists, missing creates creates unchanged
ErrorIfExists, existing throws throws unchanged
Ignore, missing creates creates unchanged
Ignore, existing no-op no-op unchanged

Only the first row changes. Append-on-missing intentionally stays a strict failure; aligning it with V1 would silently create a table the user expected to already exist.

Does this PR introduce any user-facing change?

Yes. Writing with mode("overwrite") to a not-yet-existing table now creates it instead of throwing NoSuchTableException; the migration guide is updated to reflect this. Setting spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows=true restores the old behavior, and mode("append") on a missing table still throws.

How was this patch tested?

New tests in SupportsCatalogOptionsSuite. Four reuse the existing testCreateAndRead helper, so Overwrite-on-missing now has the same coverage shape (session/testcat catalog, with and without partitioning) as the existing ErrorIfExists and Ignore tests. Five more pin specific edges:

  • Append-on-missing still throws NoSuchTableException.
  • The legacy conf restores the throw on Overwrite-missing.
  • The legacy conf leaves Overwrite of an existing table untouched (truncate + write), since the flag is only consulted inside the missing-table catch.
  • withSchemaEvolution() + mode("overwrite") on a missing table raises UNSUPPORTED_SCHEMA_EVOLUTION.CREATE_TABLE.
  • With the legacy conf on, the NoSuchTableException throw takes precedence over the schema-evolution gate, even when withSchemaEvolution() is set.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

.doc("When set to true, SaveMode.Overwrite against a missing table on a " +
"SupportsCatalogOptions source throws NoSuchTableException instead of " +
"creating the table. Restores the pre-SPARK-57068 behavior.")
.version("4.3.0")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4.3.0 or 4.2.0?

…ing for SupportsCatalogOptions sources

### What changes were proposed in this pull request?

In `DataFrameWriter.saveCommand`, the `SaveMode.Append | SaveMode.Overwrite`
branch calls `catalog.loadTable(ident)` without catching `NoSuchTableException`
when the V2 source implements `SupportsCatalogOptions`. The exception
propagates straight to the user, even though `SaveMode.ErrorIfExists` and
`SaveMode.Ignore` on the same call succeed by routing to `CreateTableAsSelect`.

This change catches `NoSuchTableException` for `SaveMode.Overwrite` only and
routes to `CreateTableAsSelect(ignoreIfExists = false)`, mirroring the
`createMode` arm immediately below. `SaveMode.Append` on a non-existent
identifier intentionally continues to throw, because Append explicitly expects
an existing table and silently creating would mask user mistakes.

A new internal SQL conf `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows`
restores the pre-fix behavior for users who depend on it.

The `CreateTableAsSelect` construction shared between the new fall-back path
and the existing `createMode` arm is extracted into a private helper
`createTableAsSelectForCatalogOptions` to keep both sites in sync.

### Why are the changes needed?

The most idiomatic write call for any V2 connector,

    df.write.format(provider).mode("overwrite").save(newPath)

fails with `NoSuchTableException` when `newPath` does not yet exist, whereas
the equivalent V1 call (e.g. `format("parquet")`) succeeds by creating the
table. V2 sources that implement `SupportsCatalogOptions` (Iceberg, Lance, and
custom connectors) all hit this asymmetry. The fix aligns V2
`SaveMode.Overwrite` semantics with V1: overwrite-on-missing creates the
table, overwrite-on-existing truncates and writes.

Behavior matrix after this change:

| Mode × Target          | V1            | V2 before    | V2 after   |
|------------------------|---------------|--------------|------------|
| Overwrite, missing     | creates       | **throws**   | creates    |
| Overwrite, existing    | truncate+write| overwrite    | unchanged  |
| Append, missing        | creates       | throws       | throws*    |
| Append, existing       | append        | append       | unchanged  |
| ErrorIfExists, missing | creates       | creates      | unchanged  |
| ErrorIfExists, existing| throws        | throws       | unchanged  |
| Ignore, missing        | creates       | creates      | unchanged  |
| Ignore, existing       | no-op         | no-op        | unchanged  |

\* Intentional V1 divergence — see PR description.

There is an inherent race window between `loadTable` (throws) and
`CreateTableAsSelect`: a concurrent writer creating the table in between
will cause `TableAlreadyExistsException` rather than overwriting. This is
acceptable; V1's filesystem-atomic path doesn't expose it because V1 never
consults a catalog. Users retry.

### Does this PR introduce _any_ user-facing change?

Yes. `df.write.format(<V2 SupportsCatalogOptions source>).mode("overwrite")
.save(<new identifier>)` now creates the table instead of throwing
`NoSuchTableException`. No behavior change for paths that already exist. The
migration guide has been updated. The legacy flag
`spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows` restores the
prior behavior.

### How was this patch tested?

New tests in `SupportsCatalogOptionsSuite`:
- `save works with Overwrite - no table, no partitioning, session catalog`
- `save works with Overwrite - no table, with partitioning, session catalog`
- `save works with Overwrite - no table, no partitioning, testcat catalog`
- `save works with Overwrite - no table, with partitioning, testcat catalog`

These reuse the existing `testCreateAndRead` helper, which verifies catalog
state (table identity, partitioning, columns) in addition to data.

Plus three behavior-pinning tests:
- `Append mode still fails when table is missing - testcat catalog` (pins
  the intentional Append divergence)
- `legacy flag restores throw on Overwrite-missing` (verifies the new conf)
- `Overwrite + withSchemaEvolution on missing table is rejected` (verifies
  the schema-evolution gate fires with the expected error class)

Existing tests continue to pass.

### Was this patch authored or co-authored using generative AI tooling?

No.
@LuciferYang LuciferYang force-pushed the fix-overwrite-nonexistent-catalog-options branch from 4a2a359 to 0b3a0ef Compare May 26, 2026 11:50
@LuciferYang LuciferYang marked this pull request as draft June 2, 2026 11:30
…ecedence and strengthen tests

- Hoist the "creating a table rejects schema evolution" check into a dedicated
  assertSchemaEvolutionNotEnabledForCreateTableWrite() helper (mirroring the
  existing assertSchemaEvolutionNotEnabledForV1Write), called before
  identifier/catalog extraction on the createMode arm. The previous refactor had
  moved this check after extraction, changing which error surfaces when a
  withSchemaEvolution() write also has a bad catalog/identifier. This restores
  the original precedence. createTableAsSelectForCatalogOptions is now a pure
  plan builder.
- Add two behavior-pinning tests: the legacy flag is inert for Overwrite of an
  existing table (it is only consulted inside the missing-table catch), and the
  legacy throw takes precedence over the schema-evolution gate on a missing table.
- Drop the redundant `s` interpolator on the four new test names.
- Reword the migration-guide entry so it no longer implies save() takes the
  identifier as a positional argument.
@LuciferYang LuciferYang marked this pull request as ready for review June 2, 2026 13:00
@gengliangwang
Copy link
Copy Markdown
Member

The changes look reasonable. However, I am not sure about whether this was intentional for some reasons. cc @cloud-fan @brkyvz @aokolnychyi WDYT?

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 blocking, 0 non-blocking, 0 nits. Well-implemented and well-tested, but I think the design premise needs reconsidering.

Design / architecture (1)

  • DataFrameWriter.scala:182: for a SupportsCatalogOptions source, save() should follow SQL table semantics, not the path-based "overwrite creates" model — see inline

For a SupportsCatalogOptions source, save() resolves a real catalog table identity and is already DML: mode("append") builds AppendData (= SQL INSERT INTO) and mode("overwrite") on an existing table builds OverwriteByExpression(Literal(true)) (= SQL INSERT OVERWRITE — truncate + write, schema preserved; DataFrameWriter.scala:210-214). Both SQL DML verbs require an existing table — a missing target resolves through UnresolvedRelation/ResolveRelations and fails with TABLE_OR_VIEW_NOT_FOUND/NoSuchTableException; there is no auto-create on insert (the only create-on-write SQL surface is CREATE [OR REPLACE] TABLE ... AS SELECT).

So under the SQL/catalog model that fits this source, both Append-missing and Overwrite-missing should throw — which is exactly the pre-PR behavior, and it's internally consistent. This PR special-cases only Overwrite-missing → CreateTableAsSelect, importing the path-based model (apt for V1 pathless sources) into a catalog-backed source and breaking the Append/Overwrite symmetry. The create-or-replace verb already exists as a separate tool: saveAsTable's Overwrite uses ReplaceTableAsSelect(orCreate = true) (DataFrameWriter.scala:523-550) = CREATE OR REPLACE TABLE AS SELECT.

Suggestion: keep save().mode("overwrite") throwing on a missing catalog table (INSERT OVERWRITE semantics); users who want create-or-replace should use saveAsTable. (cc @gengliangwang — I think this answers your question on whether the change is intentional.)

try {
(catalog.loadTable(ident), Some(catalog), Some(ident))
} catch {
// SPARK-57068: align Overwrite-on-missing with V1 source behavior
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd reconsider aligning with V1 source behavior here. A SupportsCatalogOptions source resolves a real catalog table, so save() is DML on that table, not path-based data writing: mode("overwrite") on an existing table is OverwriteByExpression(Literal(true)) just below (= SQL INSERT OVERWRITE, truncate + write), and mode("append") is AppendData (= INSERT INTO).

SQL INSERT OVERWRITE/INSERT INTO both require an existing table and throw TABLE_OR_VIEW_NOT_FOUND on a missing one — there's no auto-create on insert. So under SQL table semantics both Overwrite-missing and Append-missing should throw, which is the current (consistent) behavior. Creating only on Overwrite-missing breaks that symmetry and mixes in the V1 pathless-source model.

The create-or-replace verb already exists via saveAsTable (ReplaceTableAsSelect(orCreate = true) = CREATE OR REPLACE TABLE AS SELECT). My suggestion is to leave save().mode("overwrite") throwing on a missing catalog table and point users at saveAsTable for create-or-replace.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will close this pr. Thanks @cloud-fan and @gengliangwang

@LuciferYang LuciferYang closed this Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants