[SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace#56240
[SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace#56240pchintar wants to merge 1 commit into
Conversation
e203506 to
e9a27de
Compare
cloud-fan
left a comment
There was a problem hiding this comment.
1 blocking, 0 non-blocking, 1 nit.
Clean, well-tested API addition that faithfully mirrors the existing 3-arg pattern. The one blocking item is the @since / versionchanged version.
Design / architecture (1)
- (blocking)
@since/versionchangedshould be 4.3.0, not4.2.0.branch-4.2is already cut and master is5.0.0-SNAPSHOT(SPARK-56740), so the next open feature release this new API can ship in is 4.3.0. This is the same fix made in SPARK-56820 forcounter_diff("was not merged in 4.2"). Affects functions.scala (2 sites) and PySpark builtin.py (1 site) — see inline.
Nits: 1 minor item (Scaladoc position vs pos param) — see inline.
| * `position`. | ||
| * | ||
| * @group string_funcs | ||
| * @since 4.2.0 |
There was a problem hiding this comment.
@since should be 4.3.0, not 4.2.0: 4.2 is already cut (branch-4.2 exists) and master is on 5.0.0-SNAPSHOT, so this new overload first ships in 4.3.0. Same correction as SPARK-56820 for counter_diff.
| * @since 4.2.0 | |
| * @since 4.3.0 |
| * `position`. | ||
| * | ||
| * @group string_funcs | ||
| * @since 4.2.0 |
There was a problem hiding this comment.
Same here — @since should be 4.3.0.
| * @since 4.2.0 | |
| * @since 4.3.0 |
|
|
||
| .. versionchanged:: 3.4.0 | ||
| Supports Spark Connect. | ||
| .. versionchanged:: 4.2.0 |
There was a problem hiding this comment.
versionchanged should be 4.3.0 — the position parameter first appears in 4.3.0, not the already-cut 4.2.0.
| .. versionchanged:: 4.2.0 | |
| .. versionchanged:: 4.3.0 |
| regexp_replace(e, lit(pattern), lit(replacement)) | ||
|
|
||
| /** | ||
| * Replace all substrings of the specified string value that match regexp with rep, starting at |
There was a problem hiding this comment.
Nit: the prose says "starting at position", but the parameter is named pos — the backtick'd position implies an identifier that doesn't exist in this signature. Consider matching the prose to the param name (or renaming pos to position to align with the PySpark binding).
|
The change LGTM. It is a clean API addition that mirrors the existing 3-arg pattern.
(Minor / non-blocking - just for docs completeness.) |
What changes were proposed in this pull request?
Spark SQL already supports the 4-argument form of
regexp_replace:However, the corresponding Scala, PySpark, and Spark Connect APIs currently expose only the 3-argument variants.
This PR exposes the existing 4-argument functionality through:
functions.regexp_replace)functions.regexp_replace)functions.regexp_replace)and adds corresponding Scala, PySpark, and Connect test coverage.
Why are the changes needed?
The underlying SQL functionality already exists and is available through SQL expressions, but it is not accessible through the public Scala, PySpark, and Spark Connect APIs.
This creates an inconsistency between SQL and programmatic interfaces. Exposing the optional
positionargument aligns the public APIs with existing SQL functionality.Does this PR introduce any user-facing change?
Yes.
Users can now specify the optional
positionargument through the Scala, PySpark, and Spark Connect APIs.Before:
After:
Similarly, PySpark users can now call:
How was this patch tested?
Added test coverage in:
StringFunctionsSuiteFunctionsTests.test_regexp_replaceSparkConnectFunctionTests.test_string_functions_multi_argsVerified with:
Was this patch authored or co-authored using generative AI tooling?
Developed with assistance from GPT-5.