Feat/regexp extract #20985

Open
olsemeno wants to merge 2 commits into apache:main from olsemeno:feat/regexp-extract

@olsemeno

Which issue does this PR close?

  • Closes #.

Rationale for this change

Apache DataFusion is missing a regexp_extract scalar function that is available in Apache Spark (via functions.regexp_extract). This function is commonly used in ETL pipelines to extract specific capture groups from strings using regular expressions. Adding it improves compatibility with Spark workloads migrating to DataFusion.

What changes are included in this PR?

  • Added a new scalar UDF regexp_extract(str, regexp, idx) in datafusion/functions/src/regex/regexpextract.rs
    • Accepts Utf8, LargeUtf8, and Utf8View string types
    • Third argument idx (Int64) selects the capture group: 0 returns the full match, ≥1 returns the corresponding group
    • Returns an empty string "" when the pattern does not match or the requested group is absent (Spark-compatible behavior)
    • Returns NULL when the input string is NULL
    • Returns an error for negative idx values
    • Supports per-row idx values (column argument), not just scalar literals
    • Uses the existing compile_and_cache_regex helper for regex compilation and caching
  • Registered the function in datafusion/functions/src/regex/mod.rs
  • Added regexp_extract usage examples to datafusion-examples/examples/builtin_functions/regexp.rs
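The behavior listed above can be summarized with a small executable reference model. This is a sketch in Python, not the Rust implementation from this PR: it uses Python's `re` module, whose pattern dialect differs in places from the Rust `regex` crate, and the order of the checks (negative `idx` is rejected before the NULL check) is an assumption, since the description does not say which takes precedence.

```python
import re
from typing import Optional


def regexp_extract(s: Optional[str], pattern: str, idx: int) -> Optional[str]:
    """Illustrative model of the semantics described in this PR (hypothetical helper)."""
    # Negative group indexes are an error.
    # Assumption: this check runs before the NULL check.
    if idx < 0:
        raise ValueError(f"regexp_extract: idx must be non-negative, got {idx}")
    # NULL input string yields NULL output.
    if s is None:
        return None
    match = re.search(pattern, s)
    # No match yields the empty string rather than NULL.
    if match is None:
        return ""
    # A group index beyond the pattern's group count yields the empty string.
    if idx > match.re.groups:
        return ""
    # A group that exists but did not participate in the match is None in
    # Python; map it to the empty string as well.
    return match.group(idx) or ""
```

For example, `regexp_extract("2024-03-16", r"(\d{4})-(\d{2})-(\d{2})", 2)` evaluates to `"03"` in this model, and the same call with `idx=7` evaluates to `""`.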

Are these changes tested?

Yes. Ten unit tests are included in regexpextract.rs covering:

| Test | Scenario |
| --- | --- |
| test_basic_group | Extract a specific capture group |
| test_idx_zero_returns_whole_match | idx=0 returns the full match |
| test_no_match_returns_empty_string | No match → "" |
| test_null_input_returns_null | NULL input → NULL output |
| test_empty_pattern_returns_empty_string | Empty pattern → "" |
| test_idx_out_of_range_returns_empty_string | Group index beyond available groups → "" |
| test_negative_idx_returns_error | Negative idx → error |
| test_multiple_groups | Multiple capture groups, select second |
| test_idx_as_column | idx as a per-row column (not scalar) |
| test_batch_mixed_nulls | Batch with mixed NULL and non-NULL rows |
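The last two scenarios (per-row idx and mixed NULL rows) are easiest to picture at batch level. The following is a rough Python model, assuming one scalar pattern per batch and one idx per row. The names `_compile` and `regexp_extract_batch` are hypothetical, and `lru_cache` only stands in for the idea of compiling a pattern once and reusing it. It does not reflect the signature or behavior of DataFusion's `compile_and_cache_regex` helper.

```python
import re
from functools import lru_cache
from typing import List, Optional


@lru_cache(maxsize=None)
def _compile(pattern: str):
    # Hypothetical stand-in for a compile-once regex cache.
    return re.compile(pattern)


def regexp_extract_batch(
    values: List[Optional[str]], pattern: str, idxs: List[int]
) -> List[Optional[str]]:
    """Apply the extraction row by row, with one group index per row."""
    if len(values) != len(idxs):
        raise ValueError("values and idxs must have the same length")
    regex = _compile(pattern)  # compiled once, reused for every row
    out: List[Optional[str]] = []
    for s, idx in zip(values, idxs):
        if idx < 0:
            raise ValueError(f"idx must be non-negative, got {idx}")
        if s is None:
            out.append(None)  # NULL rows stay NULL
            continue
        match = regex.search(s)
        if match is None or idx > regex.groups:
            out.append("")  # no match, or group index out of range
        else:
            out.append(match.group(idx) or "")
    return out


result = regexp_extract_batch(
    ["a-1", None, "zzz", "b-2"], r"([a-z])-(\d)", [1, 1, 0, 2]
)
print(result)  # ['a', None, '', '2']
```

Only the NULL row produces NULL. The non-matching row produces an empty string, which is the distinction the mixed-NULL test is described as covering.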

Are there any user-facing changes?

Yes. A new scalar function regexp_extract(str, regexp, idx) is now available in both SQL and the DataFrame API.

SELECT regexp_extract('2024-03-16', '(\d{4})-(\d{2})-(\d{2})', 1);
-- returns: '2024'

@github-actions github-actions bot added the functions Changes to functions implementation label Mar 17, 2026