feat: add Java scalar and table UDF runtime APIs#598

Closed
lfkpoa wants to merge 3 commits into duckdb:main from lfkpoa:feature/java-udf-single

Conversation

@lfkpoa
Contributor

@lfkpoa lfkpoa commented Mar 9, 2026

Summary

  • Add end-to-end Java UDF runtime support for scalar and table functions, including JNI bindings, Java API types, and registration plumbing.
  • Extend scalar UDF ergonomics with arity/zero-arg/varargs/class-mapped overloads and document usage in UDF.MD and README.md.
  • Add/expand coverage in TestBindings and TestDuckDBJDBC, and include runtime fixes required for Variant and appender/thread-safety integration.

@staticlibs
Collaborator

Hi, thanks for the PR! This is a highly requested feature.

I will need a few days to do a full review. Just a quick question so far: is there a fundamental reason to have functions like register_scalar_udf in duckdb_java.cpp instead of pushing all this logic to the Java side on top of DuckDBBindings calls?

@lfkpoa
Contributor Author

lfkpoa commented Mar 10, 2026

Hi! Thank you for the feedback.

I really want this feature, that's why I tried to build it.
I had to make some choices, but feel free to question them or point to another direction.

  • DuckDB executes UDF callbacks from native execution threads and expects C function pointers plus native callback state (extra_info).
  • Java cannot directly provide those C callbacks; JNI glue must own thread attach/detach, global references, callback lifecycle, and error translation.
     
I implemented functions like register_scalar_udf and table-function registration in duckdb_java.cpp because that is where the callback bridge (DuckDB C callback <-> JNI <-> Java objects) can be implemented safely.

But if you rather have them in DuckBindings *.cpp files, I can try to refactor it, but I think it has to stay on the C side.

@staticlibs
Collaborator

@lfkpoa

Thanks for the details!

> But if you rather have them in DuckBindings *.cpp files

No, by DuckDBBindings I meant DuckDBBindings.java, which exposes the C API from duckdb.h with as few additions on the C++ JNI side as possible.

> I had to make some choices, but feel free to question them or point to another direction.

The ideal direction would be to do everything (or almost everything) on the Java side, calling C API methods from DuckDBBindings.java. I may be underestimating the amount of plumbing required, and in that case it may indeed be better to keep the logic in the C++ JNI layer, but let's look into the details point by point.

> DuckDB executes UDF callbacks from native execution threads and expects C function pointers plus native callback state (extra_info).

I would think we can just attach DuckDB native threads to the JVM? I mean, when duckdb_table_function_set_function is called from Java, we could wrap the Java callback in a native callback that attaches the thread and handles passing extra_info.

> Java cannot directly provide those C callbacks; JNI glue must own thread attach/detach, global references, callback lifecycle, and error translation.

On error translation: I would think it may be enough to catch exceptions in a Java wrapper and call duckdb_function_set_error in it. I am not ready to comment on the other points.
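The error-translation idea above can be sketched in Java as follows. This is a minimal illustration, not code from the PR: `ErrorSink` is a hypothetical stand-in for whatever binding ends up forwarding the message to duckdb_function_set_error, and `ScalarCallback` and `SafeCallback` are made-up names for the user callback and the wrapper.

```java
/** Hypothetical stand-in for a binding that forwards a message to DuckDB's set-error call. */
interface ErrorSink {
    void setError(String message);
}

/** Hypothetical user-facing callback shape; the PR's real API may look different. */
interface ScalarCallback {
    void apply(int[] input, int[] output) throws Exception;
}

final class SafeCallback {
    private SafeCallback() {
    }

    /**
     * Runs the user callback and returns true on success. On failure the
     * error text is handed to the sink and false is returned, so that no
     * Java exception propagates back into the native caller.
     */
    static boolean invoke(ScalarCallback callback, int[] input, int[] output, ErrorSink sink) {
        try {
            callback.apply(input, output);
            return true;
        } catch (Throwable t) {
            // Assumption: at a native boundary it is safer to catch Throwable
            // than only Exception, since nothing may escape into C code.
            String message = t.getMessage();
            sink.setError(t.getClass().getName() + (message == null ? "" : ": " + message));
            return false;
        }
    }
}
```

With this shape the native side only has to attach the thread and call one Java method; whether an error occurred is then visible both through the return value and through the message passed to the sink.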

> I implemented functions like register_scalar_udf and table-function registration in duckdb_java.cpp because that is where the callback bridge (DuckDB C callback <-> JNI <-> Java objects) can be implemented safely.

The problem here is that we want to get rid of the native calls in DuckDBNative.java. That is not possible right now, but we certainly don't want new JNI calls there unless absolutely necessary. The ideal case would be to have the plumbing logic in util.hpp/util.cpp (or separate headers) and to use it from the bindings_*.cpp calls, keeping the overall logic as close as possible to the logic of the original C API calls.

Again, it is possible that I don't properly understand all the complexity involved in the required translation, as the above is written from the general approach we want for JDBC. So if you can immediately see blockers with this approach, let's look into the details of these blockers.

@lfkpoa
Contributor Author

lfkpoa commented Mar 10, 2026

Thanks a lot for the detailed feedback — this is very helpful.
I don't see any blockers.

I understand your point now: the goal is to keep DuckDBBindings.java as the main Java-side surface for C API usage, with minimal JNI-specific additions, and avoid adding new DuckDBNative calls unless absolutely necessary. That direction makes sense to me.

I’ll do a deeper pass on the current UDF registration path and evaluate how much of the orchestration can be moved/refactored toward the DuckDBBindings approach you described, while keeping only the unavoidable native callback plumbing.

@lfkpoa lfkpoa force-pushed the feature/java-udf-single branch from 0cc24d5 to c4eeec1 on March 11, 2026 18:10
@lfkpoa
Contributor Author

lfkpoa commented Mar 11, 2026

Hi.
I refactored the code to align with your feedback.
Please let me know if you have any questions or feedback.

@staticlibs
Collaborator

@lfkpoa

Thanks for the update! Just FYI, I will be able to go through the PR in detail only after the 1.5.1 update freeze next week.

@staticlibs
Collaborator

@lfkpoa

Just FYI, I am looking at this change now and will provide feedback as soon as I have it.

@staticlibs
Collaborator

@lfkpoa

Sorry for the delay, I should have more time for this during this week and the next.

We have about 10 days before the 1.5.2 freeze, and I think it would be great to get at least part of this PR in.

Two main comments so far:

  1. Can we please split this into multiple parts? It will be much easier to review and integrate. I suggest scoping the first partial PR as "scalar functions only, supporting basic types".

  2. Can we move more of the function registration logic to the Java side, for example calling duckdb_scalar_function_add_parameter (and the like) from Java rather than from the native register_scalar_udf_on_function?
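As a rough illustration of the second point, the registration sequence could be driven from Java along these lines. This is a sketch under stated assumptions: `ScalarFunctionApi` is a hypothetical interface whose method names mirror the C API functions of the same name, the handle type `H` stands for however pointers are wrapped on the Java side, and `attachCallback` represents whatever native glue sets the function pointer. The actual signatures in DuckDBBindings.java may differ.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical Java view of the scalar-function part of the C API. */
interface ScalarFunctionApi<H> {
    H duckdb_create_scalar_function();

    void duckdb_scalar_function_set_name(H function, String name);

    void duckdb_scalar_function_add_parameter(H function, H logicalType);

    void duckdb_scalar_function_set_return_type(H function, H logicalType);

    /** Returns the duckdb_state value; 0 means success. */
    int duckdb_register_scalar_function(H connection, H function);

    void duckdb_destroy_scalar_function(H function);
}

final class ScalarRegistration {
    private ScalarRegistration() {
    }

    /** Builds and registers a scalar function, always releasing the handle afterwards. */
    static <H> void register(ScalarFunctionApi<H> api, H connection, String name,
                             List<H> parameterTypes, H returnType, Consumer<H> attachCallback) {
        H function = api.duckdb_create_scalar_function();
        try {
            api.duckdb_scalar_function_set_name(function, name);
            for (H type : parameterTypes) {
                api.duckdb_scalar_function_add_parameter(function, type);
            }
            api.duckdb_scalar_function_set_return_type(function, returnType);
            // Assumption: the native function pointer is installed here by
            // whatever minimal JNI glue remains on the C++ side.
            attachCallback.accept(function);
            int state = api.duckdb_register_scalar_function(connection, function);
            if (state != 0) {
                throw new IllegalStateException("Failed to register scalar function: " + name);
            }
        } finally {
            api.duckdb_destroy_scalar_function(function);
        }
    }
}
```

Keeping the sequence in Java means the native side only needs to expose thin one-to-one wrappers, which matches the goal of staying close to the original C API call logic.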

@staticlibs
Collaborator

One more comment on scalar functions:

Instead of the native create_scalar_udf_input_reader, can we just pass the DataChunk to the Java side (as a ByteBuffer-wrapped pointer) and create the Java reader over this chunk? And do the same for create_scalar_udf_output_writer?
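A Java-side reader of this kind might look roughly like the sketch below for an INTEGER column. `IntVectorReader` is a hypothetical name, not a class from this PR. It assumes the two buffers wrap the memory returned by duckdb_vector_get_data and duckdb_vector_get_validity, and it relies on the C API convention that the validity mask is an array of 64-bit words where a set bit marks a valid row and a NULL mask means all rows are valid.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical reader over the native memory of one INTEGER vector. */
final class IntVectorReader {
    private final ByteBuffer data;     // rowCount * 4 bytes of int32 values
    private final ByteBuffer validity; // 64-bit mask words, or null if all rows are valid
    private final int rowCount;

    IntVectorReader(ByteBuffer data, ByteBuffer validity, int rowCount) {
        // Native memory uses the platform byte order, while ByteBuffer
        // defaults to big-endian, so the order is set explicitly.
        this.data = data.duplicate().order(ByteOrder.nativeOrder());
        this.validity = validity == null ? null : validity.duplicate().order(ByteOrder.nativeOrder());
        this.rowCount = rowCount;
    }

    /** True when the validity bit for the row is cleared. */
    boolean isNull(int row) {
        checkRow(row);
        if (validity == null) {
            return false;
        }
        long word = validity.getLong((row / 64) * Long.BYTES);
        return (word & (1L << (row % 64))) == 0;
    }

    /** Reads the int32 value at the given row using an absolute offset. */
    int getInt(int row) {
        checkRow(row);
        return data.getInt(row * Integer.BYTES);
    }

    private void checkRow(int row) {
        if (row < 0 || row >= rowCount) {
            throw new IndexOutOfBoundsException("row " + row + " out of " + rowCount);
        }
    }
}
```

Absolute reads leave the buffer position untouched, so one reader can be reused across rows without per-row allocation. An output writer would be the mirror image, using `putInt` and setting or clearing validity bits.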

@lfkpoa
Contributor Author

lfkpoa commented Mar 26, 2026

Ok. I'll try to do it.
Should I open a new PR just for Scalar Functions?

@staticlibs
Collaborator

Either a new PR or a from-scratch rebase of this PR will be fine.

@lfkpoa
Contributor Author

lfkpoa commented Mar 30, 2026

I decided to review some of my choices and to try to minimize new structures.
I thought of reusing DuckDBVector in scalar functions to read parameters in chunks (using ProcessVector), but it is currently package-private. Its advantage is that it already implements reading typed values from a chunk.
For writing results, DuckDBAppender has methods for all the different types, but it is only used for appending data to tables. Maybe we could refactor it so that the same code can be reused for appending data, for writing scalar function results to chunks, and later for writing table function results.
Any thoughts?

@staticlibs
Collaborator

Hi! About DuckDBVector and ProcessVector: they are planned to be reworked on top of the C API. I believe only the vector cast is currently missing from the C API to do so. I would suggest not refactoring them, and not using the C++ API operations they use. Using (copying) other necessary bits from them is fine.

About the DuckDBAppender refactor: yes, I think it is a good idea to factor out some common duckdb_vector processing logic from there. The bits that were factored out in this PR looked good.

@lfkpoa
Contributor Author

lfkpoa commented Mar 31, 2026

I created a new PR for scalar functions.
I ran some performance benchmarks and ended up creating new classes for reading and writing data chunks that avoid some of the Java overhead.
In comparison, the new approach was over 20% faster than using the classes from this PR.
I also tried to keep the C side to a minimum.

@staticlibs
Collaborator

@lfkpoa

Hi, thanks for the update! I am looking at it now, will add the first comments (will split review in multiple passes) in a few mins.

@staticlibs
Collaborator

Closing as superseded by #630.

@staticlibs staticlibs closed this Apr 9, 2026