[SPARK-57068][SQL] Make SaveMode.Overwrite create the table when missing for SupportsCatalogOptions sources#56111
Conversation
| .doc("When set to true, SaveMode.Overwrite against a missing table on a " + | ||
| "SupportsCatalogOptions source throws NoSuchTableException instead of " + | ||
| "creating the table. Restores the pre-SPARK-57068 behavior.") | ||
| .version("4.3.0") |
There was a problem hiding this comment.
4.3.0 or 4.2.0?
…ing for SupportsCatalogOptions sources
### What changes were proposed in this pull request?
In `DataFrameWriter.saveCommand`, the `SaveMode.Append | SaveMode.Overwrite`
branch calls `catalog.loadTable(ident)` without catching `NoSuchTableException`
when the V2 source implements `SupportsCatalogOptions`. The exception
propagates straight to the user, even though `SaveMode.ErrorIfExists` and
`SaveMode.Ignore` on the same call succeed by routing to `CreateTableAsSelect`.
This change catches `NoSuchTableException` for `SaveMode.Overwrite` only and
routes to `CreateTableAsSelect(ignoreIfExists = false)`, mirroring the
`createMode` arm immediately below. `SaveMode.Append` on a non-existent
identifier intentionally continues to throw, because Append explicitly expects
an existing table and silently creating would mask user mistakes.
A new internal SQL conf `spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows`
restores the pre-fix behavior for users who depend on it.
The `CreateTableAsSelect` construction shared between the new fall-back path
and the existing `createMode` arm is extracted into a private helper
`createTableAsSelectForCatalogOptions` to keep both sites in sync.
### Why are the changes needed?
The most idiomatic write call for any V2 connector,
df.write.format(provider).mode("overwrite").save(newPath)
fails with `NoSuchTableException` when `newPath` does not yet exist, whereas
the equivalent V1 call (e.g. `format("parquet")`) succeeds by creating the
table. V2 sources that implement `SupportsCatalogOptions` (Iceberg, Lance, and
custom connectors) all hit this asymmetry. The fix aligns V2
`SaveMode.Overwrite` semantics with V1: overwrite-on-missing creates the
table, overwrite-on-existing truncates and writes.
Behavior matrix after this change:
| Mode × Target | V1 | V2 before | V2 after |
|------------------------|---------------|--------------|------------|
| Overwrite, missing | creates | **throws** | creates |
| Overwrite, existing | truncate+write| overwrite | unchanged |
| Append, missing | creates | throws | throws* |
| Append, existing | append | append | unchanged |
| ErrorIfExists, missing | creates | creates | unchanged |
| ErrorIfExists, existing| throws | throws | unchanged |
| Ignore, missing | creates | creates | unchanged |
| Ignore, existing | no-op | no-op | unchanged |
\* Intentional V1 divergence — see PR description.
There is an inherent race window between `loadTable` (throws) and
`CreateTableAsSelect`: a concurrent writer creating the table in between
will cause `TableAlreadyExistsException` rather than overwriting. This is
acceptable; V1's filesystem-atomic path doesn't expose it because V1 never
consults a catalog. Users retry.
### Does this PR introduce _any_ user-facing change?
Yes. `df.write.format(<V2 SupportsCatalogOptions source>).mode("overwrite")
.save(<new identifier>)` now creates the table instead of throwing
`NoSuchTableException`. No behavior change for paths that already exist. The
migration guide has been updated. The legacy flag
`spark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows` restores the
prior behavior.
### How was this patch tested?
New tests in `SupportsCatalogOptionsSuite`:
- `save works with Overwrite - no table, no partitioning, session catalog`
- `save works with Overwrite - no table, with partitioning, session catalog`
- `save works with Overwrite - no table, no partitioning, testcat catalog`
- `save works with Overwrite - no table, with partitioning, testcat catalog`
These reuse the existing `testCreateAndRead` helper, which verifies catalog
state (table identity, partitioning, columns) in addition to data.
Plus three behavior-pinning tests:
- `Append mode still fails when table is missing - testcat catalog` (pins
the intentional Append divergence)
- `legacy flag restores throw on Overwrite-missing` (verifies the new conf)
- `Overwrite + withSchemaEvolution on missing table is rejected` (verifies
the schema-evolution gate fires with the expected error class)
Existing tests continue to pass.
### Was this patch authored or co-authored using generative AI tooling?
No.
4a2a359 to
0b3a0ef
Compare
…ecedence and strengthen tests - Hoist the "creating a table rejects schema evolution" check into a dedicated assertSchemaEvolutionNotEnabledForCreateTableWrite() helper (mirroring the existing assertSchemaEvolutionNotEnabledForV1Write), called before identifier/catalog extraction on the createMode arm. The previous refactor had moved this check after extraction, changing which error surfaces when a withSchemaEvolution() write also has a bad catalog/identifier. This restores the original precedence. createTableAsSelectForCatalogOptions is now a pure plan builder. - Add two behavior-pinning tests: the legacy flag is inert for Overwrite of an existing table (it is only consulted inside the missing-table catch), and the legacy throw takes precedence over the schema-evolution gate on a missing table. - Drop the redundant `s` interpolator on the four new test names. - Reword the migration-guide entry so it no longer implies save() takes the identifier as a positional argument.
|
The changes look reasonable. However, I am not sure about whether this was intentional for some reasons. cc @cloud-fan @brkyvz @aokolnychyi WDYT? |
cloud-fan
left a comment
There was a problem hiding this comment.
1 blocking, 0 non-blocking, 0 nits. Well-implemented and well-tested, but I think the design premise needs reconsidering.
Design / architecture (1)
- DataFrameWriter.scala:182: for a
SupportsCatalogOptionssource,save()should follow SQL table semantics, not the path-based "overwrite creates" model — see inline
For a SupportsCatalogOptions source, save() resolves a real catalog table identity and is already DML: mode("append") builds AppendData (= SQL INSERT INTO) and mode("overwrite") on an existing table builds OverwriteByExpression(Literal(true)) (= SQL INSERT OVERWRITE — truncate + write, schema preserved; DataFrameWriter.scala:210-214). Both SQL DML verbs require an existing table — a missing target resolves through UnresolvedRelation/ResolveRelations and fails with TABLE_OR_VIEW_NOT_FOUND/NoSuchTableException; there is no auto-create on insert (the only create-on-write SQL surface is CREATE [OR REPLACE] TABLE ... AS SELECT).
So under the SQL/catalog model that fits this source, both Append-missing and Overwrite-missing should throw — which is exactly the pre-PR behavior, and it's internally consistent. This PR special-cases only Overwrite-missing → CreateTableAsSelect, importing the path-based model (apt for V1 pathless sources) into a catalog-backed source and breaking the Append/Overwrite symmetry. The create-or-replace verb already exists as a separate tool: saveAsTable's Overwrite uses ReplaceTableAsSelect(orCreate = true) (DataFrameWriter.scala:523-550) = CREATE OR REPLACE TABLE AS SELECT.
Suggestion: keep save().mode("overwrite") throwing on a missing catalog table (INSERT OVERWRITE semantics); users who want create-or-replace should use saveAsTable. (cc @gengliangwang — I think this answers your question on whether the change is intentional.)
| try { | ||
| (catalog.loadTable(ident), Some(catalog), Some(ident)) | ||
| } catch { | ||
| // SPARK-57068: align Overwrite-on-missing with V1 source behavior |
There was a problem hiding this comment.
I'd reconsider aligning with V1 source behavior here. A SupportsCatalogOptions source resolves a real catalog table, so save() is DML on that table, not path-based data writing: mode("overwrite") on an existing table is OverwriteByExpression(Literal(true)) just below (= SQL INSERT OVERWRITE, truncate + write), and mode("append") is AppendData (= INSERT INTO).
SQL INSERT OVERWRITE/INSERT INTO both require an existing table and throw TABLE_OR_VIEW_NOT_FOUND on a missing one — there's no auto-create on insert. So under SQL table semantics both Overwrite-missing and Append-missing should throw, which is the current (consistent) behavior. Creating only on Overwrite-missing breaks that symmetry and mixes in the V1 pathless-source model.
The create-or-replace verb already exists via saveAsTable (ReplaceTableAsSelect(orCreate = true) = CREATE OR REPLACE TABLE AS SELECT). My suggestion is to leave save().mode("overwrite") throwing on a missing catalog table and point users at saveAsTable for create-or-replace.
There was a problem hiding this comment.
Agreed. Will close this pr. Thanks @cloud-fan and @gengliangwang
What changes were proposed in this pull request?
DataFrameWriter.saveCommandcallscatalog.loadTable(ident)forSaveMode.Append | SaveMode.Overwritewithout a try/catch when the V2 source implementsSupportsCatalogOptions, so writing to a brand-new identifier throws:This PR catches
NoSuchTableExceptionin theOverwritecase and falls back toCreateTableAsSelect.Appendkeeps throwing, because it is defined as appending to existing data and silently creating a table would hide user mistakes. A new internal confspark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrowsrestores the old behavior. The sharedCreateTableAsSelectconstruction and its schema-evolution guard are pulled into small private helpers (createTableAsSelectForCatalogOptionsandassertSchemaEvolutionNotEnabledForCreateTableWrite, mirroring the existingassertSchemaEvolutionNotEnabledForV1Write), leaving theErrorIfExists/Ignorepath's behavior and error precedence unchanged.Why are the changes needed?
df.write.format(provider).mode("overwrite").save("/some/new/path")should work on a brand-new path the same way it does for parquet/json/orc. Today it throwsNoSuchTableExceptionfor any V2 source implementingSupportsCatalogOptions(Iceberg, Lance, custom connectors). The asymmetry has been around since SPARK-29219 introducedSupportsCatalogOptionsin Spark 3.0: the V1 branch never goes throughloadTable, so this only shows up on the V2 path. The full behavior matrix after this PR:Only the first row changes. Append-on-missing intentionally stays a strict failure; aligning it with V1 would silently create a table the user expected to already exist.
Does this PR introduce any user-facing change?
Yes. Writing with
mode("overwrite")to a not-yet-existing table now creates it instead of throwingNoSuchTableException; the migration guide is updated to reflect this. Settingspark.sql.legacy.dataFrameWriter.overwriteOnMissingTableThrows=truerestores the old behavior, andmode("append")on a missing table still throws.How was this patch tested?
New tests in
SupportsCatalogOptionsSuite. Four reuse the existingtestCreateAndReadhelper, so Overwrite-on-missing now has the same coverage shape (session/testcat catalog, with and without partitioning) as the existingErrorIfExistsandIgnoretests. Five more pin specific edges:NoSuchTableException.withSchemaEvolution() + mode("overwrite")on a missing table raisesUNSUPPORTED_SCHEMA_EVOLUTION.CREATE_TABLE.NoSuchTableExceptionthrow takes precedence over the schema-evolution gate, even whenwithSchemaEvolution()is set.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code