Skip to content

Conversation

@shuch3ng
Copy link

Which issue does this PR close?

Closes #2976.

Rationale for this change

What changes are included in this PR?

This PR implements a custom spark_translate function optimised for Spark's translate.

How are these changes tested?

Unit tests are included

Benchmark:

OpenJDK 64-Bit Server VM 17.0.17+10 on Mac OS X 26.2
Apple M2 Pro
translate:                                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                             39348          39545         280          0.0       37524.8       1.0X
Comet (Scan)                                      38891          38904          18          0.0       37089.8       1.0X
Comet (Scan + Exec)                               22343          22442         139          0.0       21308.1       1.8X

@shuch3ng shuch3ng force-pushed the improve-translate-perf branch from 21ebf69 to 0363685 Compare December 27, 2025 09:26
@coderfender
Copy link
Contributor

coderfender commented Dec 27, 2025

Thank you for the PR @shuch3ng , Any reason why we cant improve upstream (datafusion)'s spark_translate to not repeat ourselves ?

@shuch3ng
Copy link
Author

Hi @coderfender.

That's a good question. I implemented spark_translate here because Andy created issue #2976 in this project so I didn’t think too much at the time.

Upon checking DataFusion translate, I noticed two differences compared to Spark's:

  1. It sets graphemes(true), while Spark uses codepoints, which might lead to incompatibility
  2. It uses last-occurrence-wins for duplicates, while Spark uses first-occurrence-wins.

For example:

❯ datafusion-cli
DataFusion CLI v51.0.0
> SELECT translate('abcbcb', 'bcbc', '1234') AS result;
+--------+
| result |
+--------+
| a34343 |
+--------+
1 row(s) fetched.
Elapsed 0.005 seconds.
 ❯ spark-sql
spark-sql (default)> SELECT translate('abcbcb', 'bcbc', '1234');
a12121
Time taken: 1.273 seconds, Fetched 1 row(s)

I'm not very familiar with the DataFusion project so I could be wrong, but pushing the changes upstream could affect other use cases in DataFusion.

@shuch3ng
Copy link
Author

See my comment on #2975, it seems like a non-issue.

w.r.t. the two differences mentioned above, they are out of the scope

@codecov-commenter
Copy link

codecov-commenter commented Dec 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.24%. Comparing base (f09f8af) to head (0363685).
⚠️ Report is 803 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2993      +/-   ##
============================================
+ Coverage     56.12%   59.24%   +3.11%     
- Complexity      976     1375     +399     
============================================
  Files           119      167      +48     
  Lines         11743    15493    +3750     
  Branches       2251     2569     +318     
============================================
+ Hits           6591     9179    +2588     
- Misses         4012     5011     +999     
- Partials       1140     1303     +163     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance of translate expression

3 participants