
Add Java vectorized scalar function support #630

Merged
staticlibs merged 9 commits into duckdb:main from lfkpoa:feature/scalar-function
Apr 8, 2026

Conversation

@lfkpoa
Contributor

@lfkpoa lfkpoa commented Mar 31, 2026

Summary

This PR adds an implementation of Java scalar functions (UDFs) to duckdb-java, using a vectorized callback model for execution.

It introduces function registration, callback bridging, typed vector read/write APIs, documentation, and test coverage for supported types.

What this PR adds

  • New public API on DuckDBConnection:
    • registerScalarFunction(String name, String[] parameterTypes, String returnType, DuckDBVectorizedScalarFunction function)
  • New callback contract and vector APIs:
    • DuckDBVectorizedScalarFunction
    • DuckDBDataChunkReader
    • DuckDBReadableVector
    • DuckDBWritableVector
  • JNI/C bridge needed to connect Java callbacks to DuckDB native scalar callback execution
  • SQL type parsing helper used by the string-based Java registration API
  • Scalar UDF documentation (UDF.MD) and README reference
  • Dedicated test suite (TestScalarFunctions) plus binding-level regression tests

Main design decisions

1) Prioritize Java-side logic

The design keeps most registration and type-wiring logic in Java, with JNI used only for unavoidable callback-bridging responsibilities.

2) Keep JNI additions minimal and essential

JNI is limited to:

  • native callback pointer/state installation
  • JVM thread attach/detach from DuckDB execution threads
  • callback lifecycle and error propagation
  • required helpers for logical type parsing and safe VARCHAR extraction

3) Performance-focused vector path

The UDF execution path uses dedicated typed vector classes (DuckDBReadableVector/DuckDBWritableVector) instead of generic JDBC row/object paths, to reduce overhead in callback hot loops:

  • primitive typed access/write APIs
  • direct output vector writes
  • explicit null-mask handling
  • reduced boxing/unboxing and object allocation
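
The shape of such a hot loop can be sketched in plain Java, with primitive arrays standing in for the PR's vector classes (this is a conceptual illustration, not the actual API):

```java
// Conceptual sketch (not the PR's actual classes): a vectorized scalar
// callback processes a whole chunk per call, reading and writing primitives
// directly so the hot loop avoids boxing and per-row JNI crossings.
class VectorizedLoopSketch {
    // Simulated input/output chunks: primitive values plus a null mask.
    static void addOne(long[] input, boolean[] inputNull, long[] output, boolean[] outputNull) {
        for (int row = 0; row < input.length; row++) {
            if (inputNull[row]) {
                outputNull[row] = true;       // explicit null-mask handling
            } else {
                output[row] = input[row] + 1; // direct primitive write, no boxing
            }
        }
    }

    public static void main(String[] args) {
        long[] in = {1, 2, 3};
        boolean[] inNull = {false, true, false};
        long[] out = new long[3];
        boolean[] outNull = new boolean[3];
        addOne(in, inNull, out, outNull);
        System.out.println(out[0] + " " + outNull[1] + " " + out[2]); // 2 true 4
    }
}
```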

Correctness and hardening included

  • DECIMAL output validates declared precision/scale
  • VARCHAR helper validates row bounds
  • VARCHAR null rows are guarded in Java and JNI
  • Vector code uses ByteOrder.nativeOrder() consistently
  • UBIGINT read/write is endian-correct
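
The byte-order point can be illustrated in plain Java (this is not the PR's vector code, just the underlying ByteBuffer mechanics): DuckDB writes vector data in the machine's native byte order, so a buffer viewed from Java must be switched from the JVM default (big-endian) to `ByteOrder.nativeOrder()`, and UBIGINT bits read into a Java `long` stay unsigned via the `Long.*unsigned*` helpers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Reading an unsigned 64-bit value from a native-order buffer.
class NativeOrderSketch {
    static String readUbigint(ByteBuffer buf, int row) {
        long raw = buf.order(ByteOrder.nativeOrder()).getLong(row * Long.BYTES);
        return Long.toUnsignedString(raw); // interpret the 64 bits as unsigned
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder());
        buf.putLong(0, -1L); // bit pattern of UBIGINT max
        System.out.println(readUbigint(buf, 0)); // 18446744073709551615
    }
}
```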

Testing

  • Added broad scalar UDF coverage in TestScalarFunctions

Collaborator

@staticlibs staticlibs left a comment

Thanks for the PR! The comments below are only for the native side. I will look at the Java side separately and will comment on it, most probably tomorrow.

Collaborator

@staticlibs staticlibs left a comment

Thanks for the update! I think the native part is very close to final. The three requests raised in inline comments:

  • do not use C++ API for logical type handling
  • add Java wrapper to user-specified functor
  • introduce duckdb_scalar_function_set_function in bindings_scalar_function.cpp (move there existing impl)

There may be more nits (like function naming), but they are unimportant.

I will comment on Java part separately. I believe the function registration is mostly fine (if the points about logical types and a wrapper are added). But I would like to take a closer look at user facing API (functor signature). If wrapper for it is added - I would think possible changes to that signature should not affect the native part.

@lfkpoa
Contributor Author

lfkpoa commented Apr 1, 2026

New commit: the implemented changes were driven by a set of architectural and API-alignment requests, with the goal of keeping Java as the orchestration layer and JNI as a minimal bridge.

  1. Logical type handling aligned to C API-first design
  • Scalar function registration now uses typed logical types (DuckDBLogicalType) instead of SQL type strings.
  • Logical type creation is done through C API bindings (duckdb_create_logical_type, duckdb_create_decimal_type) exposed in DuckDBBindings.
  • The string-based logical-type parsing path used during scalar registration was removed.
  • Java-side validation remains for constraints such as DECIMAL width/scale, with native-side validation kept consistent for safety.
  2. Callback execution moved toward Java wrapper ownership
  • A dedicated Java wrapper (DuckDBScalarFunctionWrapper) was introduced around the user functor.
  • The wrapper receives raw native pointers (as ByteBuffers), constructs DuckDBDataChunkReader and DuckDBWritableVector, invokes the user callback, and reports failures via
    duckdb_scalar_function_set_error.
  • This shifts argument preparation and user-exception translation out of the C++ execution body and into Java, reducing native callback complexity.
  3. Scalar callback binding organization
  • The JNI entrypoint for installing scalar callbacks is now exposed as duckdb_scalar_function_set_function and lives in bindings_scalar_function.cpp.
  • JNI plumbing internals (thread attach/detach lifecycle, global refs, callback state ownership) remain isolated in scalar_functions.cpp/.hpp.
  • Naming and placement were aligned so binding entrypoints stay with binding files while helper plumbing remains reusable and internal.
  4. Native callback path minimized and cleaned
  • The scalar execute path no longer relies on duckdb.hpp DataChunk/Vector references for callback forwarding.
  • It now forwards C API pointers (duckdb_data_chunk, duckdb_vector) directly to Java wrapper execution.
  • Row count is obtained via C API (duckdb_data_chunk_get_size), keeping callback execution closer to C API semantics and reducing C++ API coupling.
  5. API surface and tests updated for typed usage
  • Scalar registration call sites were updated to use typed logical types end-to-end.
  • Scalar-function tests were kept in the dedicated test file and migrated away from string helper parsing.
  • Coverage includes success paths across implemented basic types, temporal variants, DECIMAL validations, null behavior, and Java-exception propagation into SQL errors.

Collaborator

@staticlibs staticlibs left a comment

Thanks for the update! I think the native part is almost ready; I added 2 minor comments about it.

On the Java part, I think the vector reader and writer are fine (barring possible minor things like method names or method-chaining support).

Three other big things are:

  • function registration
  • logical type handling
  • user-facing callable signature/API

I collected some thoughts on function registration (will post below now), but will need more time on logical types and on callable signature.

@staticlibs
Collaborator

On function registration:

In general, with the new Java API we want to support both streamlined usage ("just let me call this method from SQL") and "power user" usage ("my specially crafted high-perf method does not allocate, so the wrapper should not either").

Suggested registration logic using a dedicated "builder" object to register a function (examples 1, 2):

  • DuckDB[Vectorized]ScalarFunction (let's remove the 'Vectorized' part)
  • DuckDBScalarFunction.builder() (creates DuckDBScalarFunctionBuilder, calls duckdb_create_scalar_function in its constructor)
  • builder methods (return DuckDBScalarFunctionBuilder):
    • setName()
    • setReturnType()
    • setParameter()
    • setFunction()
    • setVarArgs()
    • setVolatile()
    • setSpecialHandling()
    • build() - calls duckdb_register_scalar_function, returns DuckDBScalarFunction
  • DuckDBScalarFunction is immutable once built
  • builder.setFunction() can be overloaded (not necessary for initial version) to take @FunctionalInterface (to call apply), Callable, Runnable, method reference (can infer return params and args and call with reflection if necessary) etc.
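
The proposed builder shape could be sketched in plain Java along these lines (all names, checks, and the absence of any native calls here are illustrative stand-ins, not the PR's actual classes):

```java
// Minimal self-contained sketch of the proposed builder: mutators return the
// builder for chaining, build() validates and returns an immutable function
// object. The real builder would call the DuckDB C API at these steps.
class ScalarFunctionSketch {
    final String name;
    final String returnType;

    private ScalarFunctionSketch(String name, String returnType) {
        this.name = name;
        this.returnType = returnType;
    }

    static Builder builder() { return new Builder(); }

    static class Builder {
        private String name;
        private String returnType;

        Builder withName(String name) { this.name = name; return this; }
        Builder withReturnType(String returnType) { this.returnType = returnType; return this; }

        // build() would call duckdb_register_scalar_function and return an
        // immutable "shell" describing the registered function.
        ScalarFunctionSketch build() {
            if (name == null || returnType == null) throw new IllegalStateException("incomplete");
            return new ScalarFunctionSketch(name, returnType);
        }
    }

    public static void main(String[] args) {
        ScalarFunctionSketch f = builder().withName("plus_one").withReturnType("BIGINT").build();
        System.out.println(f.name); // plus_one
    }
}
```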

@staticlibs
Collaborator

Thoughts on the callable user-facing signature:

  • we can support multiple signatures, with different wrappers for different signatures (Runnable, Callable, java.util.function.*)
  • we do not need to support all possible ones now - more can be added later
  • in the initial version we need to support a signature that is "easy to use" and does not add too much overhead (it does not need to be zero-overhead - we can provide a raw ByteBuffer input/output signature for that in the future)
  • IMO the current signature is a bit too rigid: it takes a reader and a writer that may be too complex for a simple scalar function. Also, the reader and writer have not been tried out yet; we may want to evolve them in the future.
  • suggested initial signature, passing a single DuckDBScalarFunctionCallContext object that holds both input and output and can also provide more methods for different kinds of input/output access:
(ctx) -> {
    DuckDBReadableVector in = ctx.input().vector(0);
    for (int i = 0; i < ctx.inputRowCount(); i++) {
        ctx.output().setLong(i, in.getLong(i) + 1);
    }
}
  • the context itself and all objects it returns need to be hidden behind interfaces (with non-public impls), as we will highly likely want to extend it in the future
  • I am not sure if this signature is enough for a streamlined example, for example a non-vectorized call that can take arguments as int ctx.getIntArg(int pos) and void ctx.setIntResult(int val) (we will likely need primitive specializations so as not to wrap everything by default while also not requiring use of the vector reader/writer)

Please let me know what you think on this and on the registration!

@lfkpoa
Contributor Author

lfkpoa commented Apr 1, 2026

About registration, we can do that.
I'm just not sure about calling duckdb_create_scalar_function in the builder constructor.
Maybe it is because you would like to call the C API on each builder method.
But should we do that, or just gather all the information in the builder and call everything on build()?

About the user-facing signature:
I chose to use vectors because I want maximum performance.
Java functions can add a lot of overhead, such as boxing/unboxing, object creation, garbage collection, etc.
I just checked DuckDB.NET and it uses a similar approach for the callback, passing input, output and chunk size.
https://github.com/Giorgi/DuckDB.NET/blob/develop/DuckDB.NET.Data/DuckDBConnection.ScalarFunction.cs#L110
And the tests use this approach:
https://github.com/Giorgi/DuckDB.NET/blob/develop/DuckDB.NET.Test/ScalarFunctionTests.cs

On the other hand, Python allows registering just about any function:
https://duckdb.org/docs/current/clients/python/function
Maybe we can allow both: a very simple approach like Python's, with all the overhead, and another dealing with chunks/vectors.
For this, we could accept Supplier, BiFunction<T, U, R>, and maybe create a TriFunction.
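
The JDK's functional interfaces stop at two arguments, so the TriFunction mentioned here would have to be defined by the driver. A minimal self-contained sketch (the interface name is the suggestion from this comment, not an existing JDK type):

```java
import java.util.function.BiFunction;
import java.util.function.Supplier;

// Zero-, two-, and three-argument callback shapes side by side; only the
// three-argument TriFunction needs a custom @FunctionalInterface.
class TriFunctionSketch {
    @FunctionalInterface
    interface TriFunction<A, B, C, R> {
        R apply(A a, B b, C c);
    }

    public static void main(String[] args) {
        Supplier<Integer> zeroArg = () -> 42;
        BiFunction<Integer, Integer, Integer> twoArg = (a, b) -> a + b;
        TriFunction<Integer, Integer, Integer, Integer> threeArg = (a, b, c) -> a + b + c;
        System.out.println(zeroArg.get() + " " + twoArg.apply(1, 2) + " " + threeArg.apply(1, 2, 3));
        // 42 3 6
    }
}
```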

About using a context instead of passing input vectors + an output vector:
I really don't see the difference.
You mentioned using interfaces to hide implementations; that works both for a context and for vectors directly.
You mentioned int ctx.getIntArg(int pos), but remember that the function can have multiple inputs, and IMO I'd rather use input.vector(0).getInt(row) or even ctx.input.vector(0).getInt(row).
I really don't see the gain in using a context.

@lfkpoa
Contributor Author

lfkpoa commented Apr 1, 2026

I've just read that Python actually has 2 types of functions:

@staticlibs
Collaborator

About registration, we can do that.
I'm just not sure about calling duckdb_create_scalar_function in the builder constructor.
Maybe it is because you would like to call the C API on each builder method.
But should we do that, or just gather all the information in the builder and call everything on build()?

I don't think there is much difference; we don't expect half-constructed builder objects to be passed around, so there should not be much difference between throwing from .build() or from some earlier call. The idea was to stay closer to the C API, so each builder method is more or less the same as a corresponding C API call. There is a concern about leaking a non-registered native scalar function object, but this is not expected to be a problem in practice, as registration is likely to happen on app startup, and we can make the builder auto-closeable or add a dispose in its finalizer (even if deprecated, it may work in this case). Having everything in .build() is fine too; it just stays farther from the C API and we will likely need to duplicate some of the checks in earlier method calls. Also, this can be changed later (it does not affect the public API), so please do as you see fit.

I will comment on function signatures separately.

@staticlibs
Collaborator

I wonder if you can share a few examples of Java scalar functions that would be applicable to your domain area/application? Would it be more like a call to some external service/Kafka/DB, or a call to a library to generate a file (image, PDF) and write it to disk, or something else?

@lfkpoa
Contributor Author

lfkpoa commented Apr 1, 2026

Should I commit the changes from the C part now and start working on the Java part?
Or do you prefer that I only commit after all the changes?

@staticlibs
Collaborator

Should I commit the changes from the C part now and start working on the Java part?
Or do you prefer that I only commit after all the changes?

I think native part can be pushed to the PR branch now. In general, please feel free to push as frequently as convenient.

@lfkpoa
Contributor Author

lfkpoa commented Apr 1, 2026

I wonder if you can share a few examples of Java scalar functions that would be applicable to your domain area/application? Would it be more like a call to some external service/Kafka/DB, or a call to a library to generate a file (image, PDF) and write it to disk, or something else?

I must say that I am more interested in table functions, not that scalar functions are not needed.
But DuckDB and its extensions provide so many functions already that what I miss the most is being able to provide tables generated from Java. I already use registerArrowStream, but it is not enough.
For scalar functions, I would use them to add my own custom functions/calculations locally or from a Java library, but also calls to remote services.

Replace the vectorized callback interface with DuckDBScalarFunction, remove explicit rowCount from callback signatures, and derive row count from input chunks in Java.

Update JNI callback invocation signature accordingly, align connection registration and wrapper plumbing, remove legacy RowCountScalarFunction test adapters, and migrate scalar UDF tests/docs to the new (input, out) callback format.
@lfkpoa
Contributor Author

lfkpoa commented Apr 1, 2026

new commit:
Implemented updates for the first Java scalar UDF API pass:

  1. Callback API simplification
  • Removed the explicit rowCount argument from the Java callback.
  • New user callback is DuckDBScalarFunction.apply(DuckDBDataChunkReader input, DuckDBWritableVector out).
  • Row count is now derived from the input chunk (input.rowCount()), so Java and JNI signatures are simpler and less error-prone.
  2. Java-side wrapper for user functors
  • Added/updated Java wrapper execution path so callback preparation and exception handling are done on the Java side.
  • The wrapper builds chunk/vector readers, invokes the user function, and reports failures through duckdb_scalar_function_set_error.
  3. Naming and API cleanup
  • Renamed public interface from DuckDBVectorizedScalarFunction to DuckDBScalarFunction.
  • Removed remaining test adapters tied to the old (input, rowCount, out) shape and migrated all scalar-function tests to the new callback form.
  4. Type/registration direction
  • Registration remains C API-oriented (duckdb_create_scalar_function, add parameters, set return type, register).
  • Logical type handling follows the typed Java path (DuckDBLogicalType) instead of SQL type-name parsing in this first implementation.
  5. Validation
  • Formatting checks pass.
  • Scalar function and bindings test suites pass.

I'll start working on the proposed changes on the Java side now.

@lfkpoa
Contributor Author

lfkpoa commented Apr 1, 2026

(quoting the function-registration builder proposal from the comment above)

Few questions before I start this:

@staticlibs
Collaborator

Should I use DuckDBScalarFunction for this and change the callback interface name used to process chunks to something like DuckDBScalarFunctionCallback?

I think this naming will work. An alternative may be to have a DuckDBFunctions class with scalarBuilder()/tableBuilder()/aggregateBuilder(). I think DuckDBScalarFunction.builder() + DuckDBScalarFunctionCallback is more straightforward.

The builder needs the connection. How do you prefer to pass the connection? DuckDBScalarFunction.builder(connection)? Or DuckDBScalarFunction.builder()....build(connection)?

I believe only the last step (duckdb_register_scalar_function) needs connection, so .build(connection) (or even .register(connection)) will be preferred.

I've seen many builders just use the name of the param, like .name() instead of .setName(). Which do you prefer?

I think .name() is fine for accessors (as a replacement for getName()), but for mutators some prefix is preferred. If .setName() does not look suitable, perhaps .withName() can be used. A prefix may also provide slightly better autocomplete. In general I am fine with any naming here, as long as it is consistent and resembles the corresponding C API calls.

@lfkpoa
Contributor Author

lfkpoa commented Apr 1, 2026

(quoting the naming and registration questions and answers from the exchange above)

The builder will return a DuckDBScalarFunction instance after calling build(connection), but it will not be used for anything after that, right?
Should it contain all the properties passed to builder?

@staticlibs
Collaborator

The builder will return a DuckDBScalarFunction instance after calling build(connection), but it will not be used for anything after that, right?
Should it contain all the properties passed to builder?

Yes, perhaps we should destroy the native scalar function on the build() call, and then return an immutable "shell" with read-only details about the registered function. Not sure if it will be useful, but it should not harm.

Alternatively may use DuckDBScalarFunction.registrar(), DuckDBScalarFunctionRegistrar and the void register(connection) call.

Not sure which variant is less confusing.

@staticlibs
Collaborator

On function signatures:

I am lacking first-hand experience here, so my view can be off. Still, I would assume that a substantial number of scalar functions are never going to be vectorized. So the idea is to allow the function implementation to "just get the input arguments" and "just set the result" without dealing with vectors. Assume a user writes his first scalar function that fetches a JSON by ID and writes it to disk: the idea is to not burden this with vector() calls until the user actually needs a vector.

We still don't want Object[] args, to avoid autoboxing. So the suggestion was to have some kind of input.intArg(0) and input.boolArg(1) methods, and to cover common types, like input.timestampArg(0) -> LocalDateTime.

For this, we could accept Supplier, BiFunction<T, U, R> and maybe create TriFunction.

I would think we can overload builder.setFunction() with a few input types (with its own wrapper for each type).

About using context instead of passing input vectors + output vector.
I really don't get the difference.

I think you are correct, the difference is minimal. The idea was to allow creating the reader and writer lazily, and sometimes not creating them at all (for example if there is no input), as creating them may incur additional JNI calls. But we only just now changed the JNI entry point to not pass the rowCount (fetching that will incur an extra JNI call). Perhaps it would be better to pass opaque input and output objects protected by interfaces, and to allow both explicitly vectorized and "simple" access on them: basically adding new methods on them (and deprecating existing methods) as needed while keeping the function signature the same.
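
The lazy-creation idea could look like the following plain-Java memoization sketch (names and the construction counter are illustrative; in the real wrapper the factory would be the JNI-backed reader/writer construction):

```java
import java.util.function.Supplier;

// The context would only construct the reader/writer (each construction
// possibly costing a JNI call) on first use; a function that never touches
// its reader never pays for it.
class LazySketch<T> {
    static int constructions = 0; // counts how often the expensive factory ran

    private final Supplier<T> factory;
    private T value;

    LazySketch(Supplier<T> factory) { this.factory = factory; }

    T get() {
        if (value == null) {
            value = factory.get(); // created only when actually requested
        }
        return value;
    }

    public static void main(String[] args) {
        LazySketch<String> reader = new LazySketch<>(() -> { constructions++; return "reader"; });
        // Not requested yet, so the expensive construction never happened.
        System.out.println(constructions); // 0
        reader.get();
        reader.get(); // second call reuses the memoized instance
        System.out.println(constructions); // 1
    }
}
```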

@staticlibs
Collaborator

I must say that I am more interested in table functions, not that scalar functions are not needed.

I think the majority of points discussed in this PR also apply to table functions. If you have cycles for that, I think it is well possible to get table functions into 1.5.2 in a follow-up PR to this one.

But DuckDB and its extensions provide so many functions already, that what I miss the most is to be able to provide tables generated from Java. I already use the registerArrowStream, but it is not enough.

Unfortunately, Arrow support in JDBC was put on the back burner. Here is my comment in Discord from more than half a year ago (there have been zero Arrow changes since then):

No changes were done to Arrow interface so far. The idea about the Arrow support is to expose enough of ADBC API in Java to allow client apps to implement something like GizmoSQL in Java over ADBC and without per-row overhead. But this is not in short-term plans.

@staticlibs
Collaborator

staticlibs commented Apr 1, 2026

About DuckDBLogicalType - let's make it completely optional, so the user can use DuckDBColumnType enum values directly when primitive arguments and return types are registered, forcing DuckDBLogicalType usage only for composite types. Not sure if the class can be made opaque with some kind of generics trick like createMapType(new TypeToken<Map<String, Long>>()) -> DuckDBLogicalType (like in Gson). But it is better to not require it (construct/destroy it automatically) for primitive types anyway.

This is my last pending comment on Java API.

@staticlibs
Collaborator

@lfkpoa

Just FYI, I was in touch with our dev-rel, there is quite a bit of interest in this feature, especially in Java table functions. If you are interested in writing a blog post about it, we can publish it in the same format as the recent post about UDFs in C#. With usual notes that there is no guaranteed publishing timing and that editorial requirements may be pretty strict.

@lfkpoa
Contributor Author

lfkpoa commented Apr 2, 2026

I'm trying to fit Function and BiFunction into the builder.
Would something like the below be acceptable?
DuckDBFunctions.scalarFunction()
.withName("g")
.withParameter(DuckDBColumnType.INTEGER)
.withParameter(DuckDBColumnType.INTEGER)
.withReturnType(DuckDBColumnType.INTEGER)
.withFunction((BiFunction<Integer, Integer, Integer>) (a, b) -> a + b)
.register(conn);

One problem is type erasure from generics, so we might not be able to get the types from the function via reflection.
We will have to guess the types from the DuckDB types, but I think that is OK.

A different approach could be something like below. This gives the user more control over types.
DuckDBFunctions.scalarFunction()
.withName("concat_id")
.withParameter(DuckDBColumnType.INTEGER)
.withParameter(DuckDBColumnType.VARCHAR)
.withReturnType(DuckDBColumnType.VARCHAR)
.withFunction((id, txt, out) -> out.setString(id.getInt() + "_" + txt.getString()) )
.register(conn);

I think this last one would be safer (no type guessing), but it is not as easy to use as the first.

@staticlibs
Collaborator

staticlibs commented Apr 2, 2026

Would something like below be acceptable?

+1 to this builder API naming, this one looks consistent to me.

One problem is type erasure from generics so we might not be able to get types from the function reflection.

Hm, I think we actually can get the reflection information there, despite type erasure, by requiring an additional call .setFunctionTypeToken(TypeToken<BiFunction<Integer, Integer, Integer>>); I cannot see why this won't work. It's just that I do not think we actually need the type information from the function callback. It may be better to follow a strict type-checking approach similar to the one used in Appender:

  • require explicit declaration of input and output types (LogicalType, passed as now with DuckDBColumnType enum values)
  • one-to-one strict mapping between DuckDBColumnType and Java types
  • in wrapper get Java values from the vector and call the type-erased callback
  • if it fails with ClassCastException - just report it immediately

The point is that we can relax the type handling (add conversion of LocalDateTime to java.util.Date, etc.) later without breaking existing code. And we can also later add an alternative way to specify parameters and return types through reflection (internally just converting it to withParameter and withReturnType).
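
The "report ClassCastException immediately" behavior falls out of Java's type erasure, as this self-contained sketch shows (plain java.util.function, no DuckDB classes):

```java
import java.util.function.BiFunction;

// A lambda's erased apply(Object, Object) casts its arguments to the declared
// generic types, so calling the type-erased callback with the wrong Java type
// fails immediately with a ClassCastException that a wrapper can report.
class ErasureSketch {
    @SuppressWarnings({"rawtypes", "unchecked"})
    public static void main(String[] args) {
        BiFunction<Integer, Integer, Integer> add = (a, b) -> a + b;
        BiFunction erased = add; // how a wrapper would hold the user callback

        System.out.println(erased.apply(1, 2)); // 3

        try {
            erased.apply("x", "y"); // wrong Java type for this callback
            System.out.println("no error");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException"); // reported to the user
        }
    }
}
```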

I think this last one would be safer (without type guessing), but it is not as easy as the first.

As I understand this, we can add any number of withFunction variations later. Even if overloading cannot be used in some cases, we can just introduce things like withFunctionAsUnaryOperatorCallback(). So for the initial impl I suggest choosing the variant that is easier to implement/has fewer gotchas.

From a usage point of view, I think supporting actual return values (instead of out.setString) is more convenient for casual/initial use.

@staticlibs
Collaborator

Hi! I added all the direct code comments I have. I have a few more comments on API/naming, null handling, and exceptions. I will collect them in a more structured way and post in a few hours.

@staticlibs
Collaborator

On exceptions: both catching SQLExceptions and throwing IllegalStateExceptions look wrong to me. Let's keep SQLException only on the outside boundary of the call - basically only in the builder. For exceptions inside the reader/writer/context/row, let's create a new runtime exception, DuckDBFunctionException, and use it everywhere during the call when something needs to be thrown.

About null handling - I would think that propagateNulls will be useful with primitive streams, but not with the row stream. Instead, for null fields in a row, let's return null from object-typed getters, throw from get(colIdx) getters, and provide additional null-aware get(colIdx, defaultValue) getters.
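
The proposed getter semantics can be sketched with a stand-in row class (plain Java, not PR code; the real implementation would throw the PR's own runtime exception rather than IllegalStateException):

```java
// Stand-in for a row with SQL NULL handling: object getters pass null through,
// plain primitive getters throw on NULL, and a defaulting overload is null-aware.
class NullAwareRowSketch {
    private final Integer[] fields; // null element == SQL NULL

    NullAwareRowSketch(Integer... fields) { this.fields = fields; }

    Integer getObject(int colIdx) {            // object getter: null passes through
        return fields[colIdx];
    }

    int getInt(int colIdx) {                   // primitive getter: throws on NULL
        Integer v = fields[colIdx];
        if (v == null) throw new IllegalStateException("NULL in column " + colIdx);
        return v;
    }

    int getInt(int colIdx, int defaultValue) { // null-aware defaulting overload
        Integer v = fields[colIdx];
        return v == null ? defaultValue : v;
    }

    public static void main(String[] args) {
        NullAwareRowSketch row = new NullAwareRowSketch(7, null);
        System.out.println(row.getInt(0));     // 7
        System.out.println(row.getObject(1));  // null
        System.out.println(row.getInt(1, -1)); // -1
    }
}
```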

About interfaces - the amount of functionality exposed to function callbacks is much bigger now, so there is no "minimal interface" anymore. And the current use of interfaces for the reader and writer, but not for ctx and row, is inconsistent. So we need to either wrap everything in interfaces or remove the interfaces from the reader and writer - I suggest the latter.

About naming - let's do another pass over the naming of the new classes when everything else is ready and only the naming is left.

@lfkpoa
Contributor Author

lfkpoa commented Apr 7, 2026

About null handling - I would think that propagateNulls will be useful with primitive streams, but not with the row stream. Instead, for null handling in row for null fields lets return null objects for object-typed getters, throw from get(colIdx) getters, and provide additional null-aware get(colIdx, defaultValue) getters.

When I decided to add propagateNulls, it was for row streams and for non-primitive functional interfaces, because I kept adding null checks that would set the output to null in every function callback, and I thought that in most cases it would be nice to deal only with non-null values.
Primitive streams need propagateNulls to be true, so there it is not really an option.
I might be misunderstanding your comment.

@staticlibs
Collaborator

I might be misunderstanding your comment.

Let me prototype this and I will elaborate more. The idea was to use propagateNulls only for primitive streams - I need to check whether it will actually be better.

@staticlibs
Collaborator

Let me prototype this and I will elaborate more.

Proposed null handling change - commit. I suggest to cherry-pick it into your branch.

It moves setting the null-propagation flag from the builder to the context. The flag still exists on the builder (and is passed down to the function wrapper), but it is false by default and is only enabled automatically for primitive function registrations. On the row, the primitive getters are made null-aware. I think it makes sense to move all these getters into the reader and just forward calls to them from the row.

Please let me know what you think!

Squash all local scalar-function work into a single cohesive change set guided by review feedback. This adopts context-driven null propagation, moves null-aware primitive handling into vector readers, unifies callback-time failures under DuckDBFunctionException (while documenting bounds violations as IndexOutOfBoundsException), and removes checked SQLExceptions from callback runtime APIs. It also adds complete HUGEINT/UHUGEINT callback read/write support, keeps BigInteger class mapping on HUGEINT, hoists vector native byte-order setup, syncs scalar JNI sources in CMake templates, and adds JNI ExceptionCheck guards. Tests and docs are updated to cover the new behavior across scalar callbacks, bindings, null handling, and unsigned 128-bit round-trips.
@lfkpoa
Copy link
Copy Markdown
Contributor Author

lfkpoa commented Apr 8, 2026

This commit consolidates all scalar-UDF branch work into one cohesive change set driven by review feedback.
It adopts context-driven null propagation behavior, moves null-aware primitive default handling into vector readers (with row objects forwarding to readers), and removes checked SQLException from callback-runtime reader/writer/context APIs. Callback-time type/value failures are unified under DuckDBFunctionException, while bounds violations remain IndexOutOfBoundsException and are now explicitly documented.
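The unchecked-exception pattern can be illustrated with a standalone sketch (the classes below are stand-ins, not the actual library code):

```java
// Standalone sketch of the unchecked-exception unification described above;
// DuckDBFunctionException and readIntStrict here are stand-ins, not the
// actual library classes.
public class ExceptionSketch {
    static class DuckDBFunctionException extends RuntimeException {
        DuckDBFunctionException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    // Readers wrap type/value failures in an unchecked exception, so the
    // callback-runtime APIs need no checked SQLException in their signatures.
    static int readIntStrict(Object cell) {
        if (cell instanceof Integer) {
            return (Integer) cell;
        }
        throw new DuckDBFunctionException("Expected INTEGER value, got: " + cell, null);
    }
}
```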
It also finalizes type support updates by mapping BigInteger callbacks to HUGEINT and adding complete UHUGEINT callback support across reader/writer/row accessors and adapter codecs, including unsigned 128-bit conversions and round-trip coverage. The vector path is optimized by setting native byte order once and removing duplicate-buffer patterns in hot paths. JNI/CMake integration is aligned by mirroring scalar JNI sources in CMakeLists.txt.in and adding JNI ExceptionCheck() guards after Java string conversion calls.
Documentation and tests are updated comprehensively (UDF.MD, scalar-function tests, bindings tests) to validate null handling, runtime error semantics, HUGEINT/UHUGEINT behavior, and vector read/write correctness.
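For reference, the unsigned 128-bit reconstruction follows the usual two-longs-to-BigInteger pattern (a hedged sketch; method names are illustrative, not the adapter codec API):

```java
import java.math.BigInteger;

// Illustrative sketch of reconstructing 128-bit values from two 64-bit
// halves; method names are stand-ins, not the adapter codec API.
public class UInt128Sketch {
    // UHUGEINT: both halves are unsigned, result is in [0, 2^128).
    static BigInteger uhugeintToBigInteger(long lower, long upper) {
        BigInteger hi = new BigInteger(Long.toUnsignedString(upper));
        BigInteger lo = new BigInteger(Long.toUnsignedString(lower));
        return hi.shiftLeft(64).or(lo);
    }

    // HUGEINT: the upper half carries the sign, the lower half is unsigned.
    static BigInteger hugeintToBigInteger(long lower, long upper) {
        return BigInteger.valueOf(upper).shiftLeft(64)
                         .or(new BigInteger(Long.toUnsignedString(lower)));
    }
}
```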

Run project formatter and keep only style/layout changes required by format-check, with no functional modifications.
@staticlibs
Copy link
Copy Markdown
Collaborator

Thanks for the update! I think it can be integrated in this form. I have a few minor comments on the Java API and would also like to change some naming (e.g. Context should be CallInfo, to stay closer to the C API naming and to avoid confusion with the native ClientContext). If convenient, I can merge this now and follow up with these minor things myself in subsequent PRs. Alternatively, I can just post that as a review in this PR?

1.5.2 is planned for next week, I am going to backport this change from main branch to v1.5-variegata branch at the end of this week.

@lfkpoa
Copy link
Copy Markdown
Contributor Author

lfkpoa commented Apr 8, 2026

Please, feel free to do as you see fit.

If convenient I can merge this now and follow-up with these minor things myself in subsequent PRs?

This is fine.

@staticlibs
Copy link
Copy Markdown
Collaborator

Merging this now. I will follow-up with renaming and minor API changes.

In general, after 1.5.2, when adding non-trivial changes in this area I will wait a day before merging new PRs. Your input on such PRs is more than welcome!

For any questions about the UDF blog publishing please contact me on alexkasko@duckdblabs.com - I can connect you with relevant blogging people.

@staticlibs staticlibs merged commit f97500d into duckdb:main Apr 8, 2026
14 checks passed
staticlibs added a commit to staticlibs/duckdb-java that referenced this pull request Apr 9, 2026
This is a follow-up to PR duckdb#630.

It makes the following changes to newly added Scalar Functions Java
API:

 - moves exception and registered shell classes into
   `DuckDBFunction.java`
 - removes abstract classes for vector reader and writer
 - renames `DuckDBScalarContext` into `DuckDBScalarFunctionCallData`
 - removes `DuckDBScalarRow` in favour of streaming plain indices of the
   input vector rows (as a `LongStream`); the row object interface
   appeared to have an unintended overhead of creating a Java object
   for every input row, which we would like to avoid.
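The index-streaming shape described above can be sketched as follows (standalone; `timesTen` is illustrative, not part of the API):

```java
import java.util.stream.LongStream;

// Standalone sketch of the index-streaming callback shape; timesTen is
// illustrative, not duckdb-java API.
public class IndexStreamSketch {
    static int[] timesTen(int[] input) {
        int[] output = new int[input.length];
        // Stream plain row indices instead of allocating a row object per input row
        LongStream.range(0, input.length)
                  .forEach(i -> output[(int) i] = input[(int) i] * 10);
        return output;
    }
}
```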
staticlibs added a commit to staticlibs/duckdb-java that referenced this pull request Apr 9, 2026
This is a follow-up to PR duckdb#630.

It makes the following changes to newly added Scalar Functions Java
API:

 - moves exception and registered shell classes into
   `DuckDBFunction.java`
 - removes abstract classes for vector reader and writer
 - removes `DuckDBScalarContext` and `DuckDBScalarRow` in favour of
   streaming plain indices (as a `LongStream`) from the input data
   chunk; the row object interface appeared to have an unintended
   overhead of creating a Java object for every input row, which we
   would like to avoid. Without it the context abstraction appeared
   to be unnecessary.

Null-propagation handling is changed the following way:

 - null propagation on Java side is enabled only for primitive callbacks
   (set automatically) and not exposed to the user, null propagation
   support for object callbacks is removed
 - null propagation on the DuckDB engine side (the
   `duckdb_scalar_function_set_special_handling` C API call) is also
   enabled automatically only for primitive callbacks, but it is
   additionally exposed to users as a `withNullInNullOut()` builder call
   (replacing the awkwardly named `withSpecialHandling()`); in some
   cases NULLs can still be passed to callbacks when
   `withNullInNullOut()` is set, so the callback still must check for
   nulls

Testing: more tests added around the null handling
staticlibs added a commit that referenced this pull request Apr 9, 2026
This is a follow-up to PR #630.

It makes the following changes to newly added Scalar Functions Java
API:

 - moves exception and registered shell classes into
   `DuckDBFunction.java`
 - removes abstract classes for vector reader and writer
 - removes `DuckDBScalarContext` and `DuckDBScalarRow` in favour of
   streaming plain indices (as a `LongStream`) from the input data
   chunk; the row object interface appeared to have an unintended
   overhead of creating a Java object for every input row, which we
   would like to avoid. Without it the context abstraction appeared
   to be unnecessary.

Null-propagation handling is changed the following way:

 - null propagation on Java side is enabled only for primitive callbacks
   (set automatically) and not exposed to the user, null propagation
   support for object callbacks is removed
 - null propagation on the DuckDB engine side (the
   `duckdb_scalar_function_set_special_handling` C API call) is also
   enabled automatically only for primitive callbacks, but it is
   additionally exposed to users as a `withNullInNullOut()` builder call
   (replacing the awkwardly named `withSpecialHandling()`); in some
   cases NULLs can still be passed to callbacks when
   `withNullInNullOut()` is set, so the callback still must check for
   nulls

Testing: more tests added around the null handling
@lfkpoa
Copy link
Copy Markdown
Contributor Author

lfkpoa commented Apr 10, 2026

Hi! Thanks!

Could you share a bit more detail on how you think the implementation of table functions should be done?
I would like to better understand the direction you have in mind, both in terms of design and API, so I can follow the approach that you expect.

Also, just as a heads-up, I will not be available next week, so I will only be able to start working on this in the following week.

@staticlibs
Copy link
Copy Markdown
Collaborator

Hi!

Thanks for the info! Table functions are intended to be done in a similar way to scalar ones:

```java
@FunctionalInterface
public interface DuckDBTableFunctionBind<BIND_DATA> {
    BIND_DATA apply(DuckDBTableFunctionBindInfo info) throws Exception;
}

@FunctionalInterface
public interface DuckDBTableFunction<BIND_DATA> {
    void apply(BIND_DATA bindData, DuckDBDataChunkWriter output) throws Exception;
}

public class DuckDBTableFunctionBindInfo {
    private final ByteBuffer bindInfoRef;

    public DuckDBTableFunctionBindInfo(ByteBuffer bindInfoRef) {
        this.bindInfoRef = bindInfoRef;
    }

    public void addResultColumn(String name, DuckDBLogicalType type) {
        byte[] name_bytes = name.getBytes(UTF_8);
        duckdb_bind_add_result_column(bindInfoRef, name_bytes, type.logicalTypeRef());
    }

    public DuckDBValue getPositionalParameter(long index) {
        ByteBuffer valueRef = duckdb_bind_get_parameter(bindInfoRef, index);
        return new DuckDBValue(valueRef);
    }

    public DuckDBValue getNamedParameter(String name) {
        byte[] name_bytes = name.getBytes(UTF_8);
        ByteBuffer valueRef = duckdb_bind_get_named_parameter(bindInfoRef, name_bytes);
        return new DuckDBValue(valueRef);
    }

    public long parametersCount() {
        return duckdb_bind_get_parameter_count(bindInfoRef);
    }
}
```

I am working on it now, but don't have it running yet so the API may change. Will tag you in a PR when it is up!

staticlibs added a commit to staticlibs/duckdb-java that referenced this pull request Apr 10, 2026
This PR is a follow-up to duckdb#630 and duckdb#637.

It removes JNI utilities specific to scalar functions in favour of more
generic `GlobalRefHolder` utility.

Testing: no functional changes, no new tests
staticlibs added a commit that referenced this pull request Apr 11, 2026
This PR is a follow-up to #630 and #637.

It removes JNI utilities specific to scalar functions in favour of more
generic `GlobalRefHolder` utility.

Testing: no functional changes, no new tests
staticlibs added a commit to staticlibs/duckdb-java that referenced this pull request Apr 11, 2026
This PR is a continuation of duckdb#630, it adds support for writing DuckDB
table functions in Java.

Example:

```java
try (Connection conn = DriverManager.getConnection(JDBC_URL);
    Statement stmt = conn.createStatement()) {
    DuckDBFunctions.tableFunction()
        .withName("java_table_basic")
        .withParameter(int.class)
        .withFunction(new DuckDBTableFunction<Integer, AtomicBoolean, Object>() {
            @Override
            public Integer bind(DuckDBTableFunctionBindInfo info) throws Exception {
                info.addResultColumn("col1", Integer.TYPE)
                    .addResultColumn("col2", String.class);
                return info.getParameter(0).getInt();
            }

            @Override
            public AtomicBoolean init(DuckDBTableFunctionInitInfo info) throws Exception {
                info.setMaxThreads(1);
                return new AtomicBoolean(false);
            }

            @Override
            public long apply(DuckDBTableFunctionCallInfo info, DuckDBDataChunkWriter output) throws Exception {
                Integer bindData = info.getBindData();
                AtomicBoolean done = info.getInitData();
                if (done.get()) {
                    return 0;
                }
                output.vector(0).setInt(0, bindData);
                output.vector(1).setString(0, "foo");
                output.vector(0).setNull(1);
                output.vector(1).setString(1, "bar");
                done.set(true);
                return 2;
            }
        })
        .register(conn);
...
}
```
```sql
FROM java_table_basic(42);
```

Documentation is pending.

Testing: new test added