Skip to content

[SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace#56240

Open
pchintar wants to merge 1 commit into
apache:masterfrom
pchintar:regexp_replace
Open

[SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace#56240
pchintar wants to merge 1 commit into
apache:masterfrom
pchintar:regexp_replace

Conversation

@pchintar
Copy link
Copy Markdown

@pchintar pchintar commented Jun 1, 2026

What changes were proposed in this pull request?

Spark SQL already supports the 4-argument form of regexp_replace:

regexp_replace(str, regexp, replacement, position)

However, the corresponding Scala, PySpark, and Spark Connect APIs currently expose only the 3-argument variants.

This PR exposes the existing 4-argument functionality through:

  • Scala API (functions.regexp_replace)
  • PySpark API (functions.regexp_replace)
  • Spark Connect API (functions.regexp_replace)

and adds corresponding Scala, PySpark, and Connect test coverage.

Why are the changes needed?

The underlying SQL functionality already exists and is available through SQL expressions, but it is not accessible through the public Scala, PySpark, and Spark Connect APIs.

This creates an inconsistency between SQL and programmatic interfaces. Exposing the optional position argument aligns the public APIs with existing SQL functionality.

Does this PR introduce any user-facing change?

Yes.

Users can now specify the optional position argument through the Scala, PySpark, and Spark Connect APIs.

Before:

regexp_replace(col("s"), "(\\d+)", "--")

After:

regexp_replace(col("s"), "(\\d+)", "--", 5)

Similarly, PySpark users can now call:

F.regexp_replace("s", r"(\d+)", "--", 5)

How was this patch tested?

Added test coverage in:

  • StringFunctionsSuite
  • FunctionsTests.test_regexp_replace
  • SparkConnectFunctionTests.test_string_functions_multi_args

Verified with:

./build/sbt "sql-api/compile"

./build/sbt "sql/Test/compile"

./build/sbt \
"sql/testOnly org.apache.spark.sql.StringFunctionsSuite -- -z \"regex_replace / regex_extract\""

python3.11 -m pytest \
python/pyspark/sql/tests/test_functions.py::FunctionsTests::test_regexp_replace -v

python3.11 -m pytest \
python/pyspark/sql/tests/connect/test_connect_function.py::SparkConnectFunctionTests::test_string_functions_multi_args -v

Was this patch authored or co-authored using generative AI tooling?

Developed with assistance from GPT-5.

@pchintar pchintar force-pushed the regexp_replace branch 3 times, most recently from e203506 to e9a27de Compare June 1, 2026 08:57
@pchintar pchintar changed the title [SPARK-57190][SQL] Resolve API inconsistency for 4-argument regexp_replace [SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace Jun 1, 2026
@pchintar
Copy link
Copy Markdown
Author

pchintar commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 blocking, 0 non-blocking, 1 nit.
Clean, well-tested API addition that faithfully mirrors the existing 3-arg pattern. The one blocking item is the @since / versionchanged version.

Design / architecture (1)

  • (blocking) @since / versionchanged should be 4.3.0, not 4.2.0. branch-4.2 is already cut and master is 5.0.0-SNAPSHOT (SPARK-56740), so the next open feature release this new API can ship in is 4.3.0. This is the same fix made in SPARK-56820 for counter_diff ("was not merged in 4.2"). Affects functions.scala (2 sites) and PySpark builtin.py (1 site) — see inline.

Nits: 1 minor item (Scaladoc position vs pos param) — see inline.

* `position`.
*
* @group string_funcs
* @since 4.2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@since should be 4.3.0, not 4.2.0: 4.2 is already cut (branch-4.2 exists) and master is on 5.0.0-SNAPSHOT, so this new overload first ships in 4.3.0. Same correction as SPARK-56820 for counter_diff.

Suggested change
* @since 4.2.0
* @since 4.3.0

* `position`.
*
* @group string_funcs
* @since 4.2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here — @since should be 4.3.0.

Suggested change
* @since 4.2.0
* @since 4.3.0

Comment thread python/pyspark/sql/functions/builtin.py Outdated

.. versionchanged:: 3.4.0
Supports Spark Connect.
.. versionchanged:: 4.2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionchanged should be 4.3.0 — the position parameter first appears in 4.3.0, not the already-cut 4.2.0.

Suggested change
.. versionchanged:: 4.2.0
.. versionchanged:: 4.3.0

regexp_replace(e, lit(pattern), lit(replacement))

/**
* Replace all substrings of the specified string value that match regexp with rep, starting at
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the prose says "starting at position", but the parameter is named pos — the backtick'd position implies an identifier that doesn't exist in this signature. Consider matching the prose to the param name (or renaming pos to position to align with the PySpark binding).

@shrirangmhalgi
Copy link
Copy Markdown
Contributor

The change LGTM. It is a clean API addition that mirrors the existing 3-arg pattern.

One minor suggestion: The PySpark docstring mentions the position parameter but the Examples section only shows 3-arg usage. Consider adding a doctest for the 4-arg form so that users discover it via help(F.regexp_replace):

>>> df = spark.createDataFrame([("100-200",)], ["str"])
>>> df.select(regexp_replace('str', r'(\d+)', '--', 5).alias('d')).collect()
[Row(d='100---')]

(Minor / non-blocking - just for docs completeness.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants