Add Java vectorized scalar function support #630
Conversation
staticlibs
left a comment
Thanks for the PR! The comments below are only for native side. I will look at the Java side separately and will comment on them, most probably tomorrow.
staticlibs
left a comment
Thanks for the update! I think the native part is very close to final. The 3 requests raised in inline comments:
- do not use C++ API for logical type handling
- add Java wrapper to user-specified functor
- introduce `duckdb_scalar_function_set_function` in `bindings_scalar_function.cpp` (move the existing impl there)
There may be more nits (like function naming), but they are unimportant.
I will comment on the Java part separately. I believe the function registration is mostly fine (if the points about logical types and a wrapper are addressed). But I would like to take a closer look at the user-facing API (functor signature). If a wrapper for it is added, I would think possible changes to that signature should not affect the native part.
|
New commit: the implemented changes were driven by a set of architectural and API-alignment requests, with the goal of keeping Java as the orchestration layer and JNI as a minimal bridge.
|
staticlibs
left a comment
Thanks for the update! I think the native part is almost ready, added 2 minor comments about it.
On Java part, I think vector reader and writer are fine (bar possible minor things like method names or method chaining support).
Three other big things are:
- function registration
- logical type handling
- user-facing callable signature/API
I collected some thoughts on function registration (will post below now), but will need more time on logical types and on callable signature.
|
On function registration: In general, with the new Java API we want to support both streamlined usage ("just let me call this method from SQL") and "power-user" usage ("my specially crafted hi-perf method does not allocate, so the wrapper should not either"). Suggested registration logic uses a dedicated "builder" object to register a function (examples 1, 2):
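The referenced examples are not reproduced in this thread. As a rough, hypothetical sketch of what such a builder-based registration could look like (all names below, `ScalarFunctionBuilder`, `withName`, `withParameter`, `withFunction`, `register`, are illustrative stand-ins, not the final API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a fluent builder for registering a scalar function.
public class ScalarFunctionBuilderSketch {
    public interface ScalarCallback {
        void apply(Object ctx) throws Exception;
    }

    public static class ScalarFunctionBuilder {
        private String name;
        private final List<Class<?>> paramTypes = new ArrayList<>();
        private ScalarCallback callback;

        public ScalarFunctionBuilder withName(String name) { this.name = name; return this; }
        public ScalarFunctionBuilder withParameter(Class<?> type) { paramTypes.add(type); return this; }
        public ScalarFunctionBuilder withFunction(ScalarCallback cb) { this.callback = cb; return this; }

        // In a real API this would take a Connection and register with the engine;
        // here it just returns a description of what would be registered.
        public String register() {
            return name + "(" + paramTypes.size() + " args)";
        }
    }

    public static void main(String[] args) {
        String registered = new ScalarFunctionBuilder()
            .withName("plus_one")
            .withParameter(long.class)
            .withFunction(ctx -> { /* vectorized body would go here */ })
            .register();
        System.out.println(registered);
    }
}
```

The builder shape lets the streamlined path add sugar on top (type inference, per-row wrappers) while the power-user path configures everything explicitly.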
|
Thoughts on the callable user-facing signature:

```java
(ctx) -> {
    DuckDBReadableVector in = ctx.input().vector(0);
    for (int i = 0; i < ctx.inputRowCount(); i++) {
        ctx.output().setLong(i, in.getLong(i) + 1);
    }
});
```
Please let me know what you think on this and on the registration! |
|
About registration: we can do that. About the user-facing signature: on the other hand, Python allows registering just any function. About using a context instead of passing input vectors + an output vector. |
|
I've just read that Python actually has 2 types of functions:
|
I don't think there is much difference; we don't expect to have half-constructed builder objects passed around. So there should not be much difference in throwing from the builder. I will comment on function signatures separately. |
|
I wonder if you can share a few examples of Java scalar functions that could be applicable to your domain area/application? Would it be more like a call to some external service/Kafka/DB, or a call to a library to generate a file (image, PDF) and write it to disk, or something else? |
|
Should I commit the changes from the C part now and start working on the Java part? |
I think the native part can be pushed to the PR branch now. In general, please feel free to push as frequently as convenient. |
I must say that I am more interested in table functions, not that scalar functions are not needed. |
Replace the vectorized callback interface with DuckDBScalarFunction, remove explicit rowCount from callback signatures, and derive row count from input chunks in Java. Update JNI callback invocation signature accordingly, align connection registration and wrapper plumbing, remove legacy RowCountScalarFunction test adapters, and migrate scalar UDF tests/docs to the new (input, out) callback format.
|
new commit:
I'll start working on the proposed changes on the Java side now. |
A few questions before I start:
|
I think this naming will work. An alternative may be to have a …
I believe only the last step (…)
I think … |
The builder will return a DuckDBScalarFunction instance after calling build(connection), but it will not be used for anything after that, right? |
Yes, perhaps we should destroy the native scalar function on the … Alternatively we may use … Not sure which variant is less confusing. |
|
On function signatures: I am lacking first-hand experience here, so my view can be off. Still, I would assume that a substantial number of scalar functions are never going to be vectorized. So the idea is to allow the function implementation to "just get the input arguments" and "just set the result" without dealing with vectors. Assume a user writes their first scalar function that fetches a JSON by ID and writes it to disk. The idea is to not burden this with … We still don't want …
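The row-wise shape described above can be illustrated with a small sketch; the `LongUnaryScalar` interface and `applyToChunk` wrapper below are hypothetical stand-ins, not the PR's API. The user writes a per-row function and never touches vectors; a wrapper does the chunk loop:

```java
// Hypothetical sketch: a row-wise scalar callback plus the vectorized
// wrapper that would sit between it and the engine.
public class RowWiseScalarSketch {
    // Row-wise callback: the user "just gets the argument" for one row.
    public interface LongUnaryScalar {
        long apply(long arg);
    }

    // What a wrapper could do under the hood: loop over the chunk itself,
    // so the user-facing function stays vector-free.
    public static long[] applyToChunk(LongUnaryScalar fn, long[] input) {
        long[] out = new long[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = fn.apply(input[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] result = applyToChunk(x -> x + 1, new long[] {1, 2, 3});
        System.out.println(java.util.Arrays.toString(result));
    }
}
```

The vectorized variant stays available for power users, while the row-wise variant covers the common "fetch and format one value" case.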
I would think we can overload …
I think you are correct, the difference is minimal. The idea was to allow creating the reader and writer lazily, and sometimes not creating them at all (for example, if there is no input), as creating them may incur additional JNI calls. But we only just now changed the JNI entry point to not pass the … |
I think the majority of points discussed in this PR also apply to table functions. If you have cycles for that, I think it is well possible to get table functions into 1.5.2 in a follow-up PR to this PR.
Unfortunately, Arrow support in JDBC was put on the back burner. Here is my comment in Discord from more than half a year ago (there have been zero Arrow changes since then):
|
About … This is my last pending comment on the Java API. |
|
Just FYI, I was in touch with our dev-rel; there is quite a bit of interest in this feature, especially in Java table functions. If you are interested in writing a blog post about it, we can publish it in the same format as the recent post about UDFs in C#. With the usual notes that there is no guaranteed publishing timeline and that editorial requirements may be pretty strict. |
|
I'm trying to fit Function and BiFunction into the builder. One problem is type erasure from generics, so we might not be able to get types from the function via reflection. A different approach could be something like below. This gives the user more control over types. I think this last one would be safer (without type guessing), but it is not as easy as the first. |
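As a minimal illustration of the erasure problem: a `BiFunction<Integer, Integer, Integer>` lambda exposes no usable generic type information at runtime, so explicit class tokens have to carry the SQL type mapping instead. The `describe` helper below is purely illustrative, not part of the proposed API:

```java
import java.util.function.BiFunction;

// Hypothetical sketch: type information comes from explicit Class tokens,
// not from reflecting on the lambda, whose generic parameters are erased
// to Object at runtime.
public class TypeTokenSketch {
    public static String describe(Class<?> a, Class<?> b, Class<?> ret,
                                  BiFunction<Integer, Integer, Integer> fn) {
        // fn is accepted but never inspected; the tokens are authoritative.
        return a.getSimpleName() + "," + b.getSimpleName() + "->" + ret.getSimpleName();
    }

    public static void main(String[] args) {
        BiFunction<Integer, Integer, Integer> add = (x, y) -> x + y;
        System.out.println(describe(int.class, int.class, int.class, add));
    }
}
```

This is why a builder call that takes the parameter and return classes explicitly is safer than trying to guess types from the functional interface.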
+1 to this builder API naming, this one looks consistent to me.
I think we actually can get the reflection information from there, despite type erasure, by requiring an additional call …
The point is that we can relax the type handling (add conversion of …)
As I understand this, we can add any number of … From a usage point of view, I think supporting actual return values (instead of …) |
|
Hi! I added all the direct code comments I have. I have a few more comments on API/naming, null handling, and exceptions. I will collect them in a more structured way and post in a few hours. |
|
On exceptions: both catching …
About null handling: I would think that …
About interfaces: the amount of functionality exposed to function callbacks is much bigger now, so there is no "minimal interface" anymore. And the current usage of interfaces for the reader and writer, but not for ctx and row, is inconsistent. So we need to either wrap everything in interfaces or remove the interfaces from the reader and writer; I suggest the latter.
About naming: let's do another pass over the naming of the new classes when everything else is ready and only the naming is left. |
I decided to add propagateNulls for row streams and for non-primitive functional interfaces because I kept adding null checks that would set the output to null in every function callback, and I thought that in most cases it would be nice to only deal with non-null values. |
Let me prototype this and I will elaborate more. The idea was to use … |
Proposed null handling change: commit. I suggest cherry-picking it into your branch. It moves setting the null-propagation flag from the builder to the context. The flag still exists on the builder (and is passed down to the function wrapper), but it is false by default and is only enabled automatically for primitive function registrations. On the row, the primitive getters are made null-aware. I would think it makes sense to move all these getters into the reader and just forward calls to them from the row. Please let me know what you think! |
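The null-aware behaviour for primitive callbacks can be sketched as follows; `applyWithNullPropagation` and the boxed-array representation below are illustrative stand-ins for the actual wrapper plumbing, not the committed code:

```java
import java.util.function.LongUnaryOperator;

// Hypothetical sketch of null propagation for a primitive callback:
// the wrapper checks validity itself and emits NULL, so the user's
// primitive function never has to deal with a null input.
public class NullPropagationSketch {
    // Boxed input/output: a null element stands for SQL NULL.
    public static Long[] applyWithNullPropagation(LongUnaryOperator fn, Long[] input) {
        Long[] out = new Long[input.length];
        for (int i = 0; i < input.length; i++) {
            // NULL in, NULL out; otherwise call the primitive function.
            out[i] = (input[i] == null) ? null : fn.applyAsLong(input[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        Long[] out = applyWithNullPropagation(x -> x * 2, new Long[] {1L, null, 3L});
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

With this split, object callbacks can still opt in to seeing NULLs, while primitive callbacks get the null-in/null-out behaviour automatically.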
Squash all local scalar-function work into a single cohesive change set guided by review feedback. This adopts context-driven null propagation, moves null-aware primitive handling into vector readers, unifies callback-time failures under DuckDBFunctionException (while documenting bounds violations as IndexOutOfBoundsException), and removes checked SQLExceptions from callback runtime APIs. It also adds complete HUGEINT/UHUGEINT callback read/write support, keeps BigInteger class mapping on HUGEINT, hoists vector native byte-order setup, syncs scalar JNI sources in CMake templates, and adds JNI ExceptionCheck guards. Tests and docs are updated to cover the new behavior across scalar callbacks, bindings, null handling, and unsigned 128-bit round-trips.
|
This commit consolidates all scalar-UDF branch work into one cohesive change set driven by review feedback. |
Run project formatter and keep only style/layout changes required by format-check, with no functional modifications.
|
Thanks for the update! I think it can be integrated in this form. I have a few minor comments on the Java API and also would like to change some naming (like …). 1.5.2 is planned for next week; I am going to backport this change from … |
|
Please, feel free to do as you see fit.
This is fine. |
|
Merging this now. I will follow-up with renaming and minor API changes. In general, after 1.5.2, when adding non-trivial changes in this area I will wait a day before merging new PRs. Your input on such PRs is more than welcome! For any questions about the UDF blog publishing please contact me on alexkasko@duckdblabs.com - I can connect you with relevant blogging people. |
This is a follow-up to PR duckdb#630. It makes the following changes to the newly added Scalar Functions Java API:
- moves exception and registered-shell classes into `DuckDBFunction.java`
- removes abstract classes for vector reader and writer
- renames `DuckDBScalarContext` into `DuckDBScalarFunctionCallData`
- removes `DuckDBScalarRow` in favour of streaming plain indices of the input vector rows (as a `LongStream`); the row object interface appeared to have an unintended overhead of creating a Java object for every input row that we would like to avoid.
This is a follow-up to PR duckdb#630. It makes the following changes to the newly added Scalar Functions Java API:
- moves exception and registered-shell classes into `DuckDBFunction.java`
- removes abstract classes for vector reader and writer
- removes `DuckDBScalarContext` and `DuckDBScalarRow` in favour of streaming plain indices (as a `LongStream`) from the input data chunk; the row object interface appeared to have an unintended overhead of creating a Java object for every input row that we would like to avoid, and without it the context abstraction appeared to be unnecessary.

Null-propagation handling is changed the following way:
- null propagation on the Java side is enabled only for primitive callbacks (set automatically) and not exposed to the user; null-propagation support for object callbacks is removed
- null propagation on the DuckDB engine side (skipping the `duckdb_scalar_function_set_special_handling` C API call) is also enabled automatically only for primitive callbacks, but it is additionally exposed to users as a `withNullInNullOut()` builder call (replacing the awkwardly named `withSpecialHandling()`); in some cases NULLs can still be passed to callbacks when `withNullInNullOut()` is set, so callbacks still must check for nulls

Testing: more tests added around the null handling
|
Hi! Thanks! Could you share a bit more detail on how you think the implementation of table functions should be done? Also, just as a heads-up, I will not be available next week, so I will only be able to start working on this the following week. |
|
Hi! Thanks for the info! Table functions are intended to be done in a similar way to scalar ones:

```java
@FunctionalInterface
public interface DuckDBTableFunctionBind<BIND_DATA> {
    BIND_DATA apply(DuckDBTableFunctionBindInfo info) throws Exception;
}

@FunctionalInterface
public interface DuckDBTableFunction<BIND_DATA> {
    void apply(BIND_DATA bindData, DuckDBDataChunkWriter output) throws Exception;
}

public class DuckDBTableFunctionBindInfo {
    private final ByteBuffer bindInfoRef;

    public DuckDBTableFunctionBindInfo(ByteBuffer bindInfoRef) {
        this.bindInfoRef = bindInfoRef;
    }

    public void addResultColumn(String name, DuckDBLogicalType type) {
        byte[] name_bytes = name.getBytes(UTF_8);
        duckdb_bind_add_result_column(bindInfoRef, name_bytes, type.logicalTypeRef());
    }

    public DuckDBValue getPositionalParameter(long index) {
        ByteBuffer valueRef = duckdb_bind_get_parameter(bindInfoRef, index);
        return new DuckDBValue(valueRef);
    }

    public DuckDBValue getNamedParameter(String name) {
        byte[] name_bytes = name.getBytes(UTF_8);
        ByteBuffer valueRef = duckdb_bind_get_named_parameter(bindInfoRef, name_bytes);
        return new DuckDBValue(valueRef);
    }

    public long parametersCount() {
        return duckdb_bind_get_parameter_count(bindInfoRef);
    }
}
```

I am working on it now, but don't have it running yet, so the API may change. Will tag you in a PR when it is up! |
This PR is a follow-up to duckdb#630 and duckdb#637. It removes JNI utilities specific to scalar functions in favour of a more generic `GlobalRefHolder` utility. Testing: no functional changes, no new tests
This PR is a continuation of duckdb#630; it adds support for writing DuckDB table functions in Java. Example:

```java
try (Connection conn = DriverManager.getConnection(JDBC_URL);
     Statement stmt = conn.createStatement()) {
    DuckDBFunctions.tableFunction()
        .withName("java_table_basic")
        .withParameter(int.class)
        .withFunction(new DuckDBTableFunction<Integer, AtomicBoolean, Object>() {
            @Override
            public Integer bind(DuckDBTableFunctionBindInfo info) throws Exception {
                info.addResultColumn("col1", Integer.TYPE)
                    .addResultColumn("col2", String.class);
                return info.getParameter(0).getInt();
            }

            @Override
            public AtomicBoolean init(DuckDBTableFunctionInitInfo info) throws Exception {
                info.setMaxThreads(1);
                return new AtomicBoolean(false);
            }

            @Override
            public long apply(DuckDBTableFunctionCallInfo info, DuckDBDataChunkWriter output) throws Exception {
                Integer bindData = info.getBindData();
                AtomicBoolean done = info.getInitData();
                if (done.get()) {
                    return 0;
                }
                output.vector(0).setInt(0, bindData);
                output.vector(1).setString(0, "foo");
                output.vector(0).setNull(1);
                output.vector(1).setString(1, "bar");
                done.set(true);
                return 2;
            }
        })
        .register(conn);
    ...
}
```

```sql
FROM java_table_basic(42);
```

Documentation is pending. Testing: new test added
Summary
This PR adds the implementation of Java Scalar Functions (UDFs) in duckdb-java, using a vectorized callback model for execution.
It introduces function registration, callback bridging, typed vector read/write APIs, documentation, and test coverage for supported types.
What this PR adds
Main design decisions
1) Prioritize Java-side logic
The design keeps most registration and type-wiring logic in Java, with JNI used only for unavoidable callback-bridging responsibilities.
2) Keep JNI additions minimal and essential
JNI is limited to:
3) Performance-focused vector path
The UDF execution path uses dedicated typed vector classes (`DuckDBReadableVector`/`DuckDBWritableVector`) instead of generic JDBC row/object paths, to reduce overhead in callback hot loops:
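As a stand-in illustration of why typed primitive accessors matter in the hot loop: the `ReadableVector`/`WritableVector` classes below are simplified mock-ups, not the actual `DuckDBReadableVector`/`DuckDBWritableVector` implementations, but they show the boxing-free access pattern:

```java
// Hypothetical mock-up of typed vector access in a vectorized callback:
// primitive getLong/setLong keep the loop free of per-row boxing and
// object allocation.
public class VectorHotLoopSketch {
    public static class ReadableVector {
        private final long[] data;
        public ReadableVector(long[] data) { this.data = data; }
        public long getLong(int row) { return data[row]; }
        public int rowCount() { return data.length; }
    }

    public static class WritableVector {
        public final long[] data;
        public WritableVector(int rows) { this.data = new long[rows]; }
        public void setLong(int row, long v) { data[row] = v; }
    }

    // The (input, out)-style callback body: one tight primitive loop.
    public static WritableVector plusOne(ReadableVector in) {
        WritableVector out = new WritableVector(in.rowCount());
        for (int i = 0; i < in.rowCount(); i++) {
            out.setLong(i, in.getLong(i) + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        WritableVector out = plusOne(new ReadableVector(new long[] {10, 20}));
        System.out.println(out.data[0] + "," + out.data[1]);
    }
}
```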
Correctness and hardening included
Testing