Skip to content

Introduce tool calling support#116

Merged
orionpapadakis merged 18 commits into
beehive-lab:mainfrom
orionpapadakis:feat/tool-calling
Jun 16, 2026
Merged

Introduce tool calling support#116
orionpapadakis merged 18 commits into
beehive-lab:mainfrom
orionpapadakis:feat/tool-calling

Conversation

@orionpapadakis

@orionpapadakis orionpapadakis commented May 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds tool calling to the engine through a model-agnostic ChatFormat API: the engine
handles prompt encoding and tool-call detection in strings/tokens, while orchestration
(LangChain4j ToolExecutionRequests, multi-turn loops) lives in the quarkus-langchain4j
gpu-llama3 provider (separate PR). Validated against Qwen3-0.6B (f16) and Llama-3.2-1B-Instruct.

Tool calling

  • ChatFormat gains the tool methods as defaults (no-op; supportsToolCalling() returns false), so existing formats are unchanged and families opt in by overriding: toolSystemPromptSuffix,
    encodeToolCallAssistantTurn (single + batch), encodeToolResultTurn, extractToolCall, extractAllToolCalls, getToolAwareStopTokens.
  • ToolCallExtract record (name, argumentsJson, Optional<String> id) — the hand-off type between engine and caller.
  • ToolCallParserUtils — stateless parsing of <tool_call>…</tool_call> (Qwen3 / Llama 3.2, closed and unclosed), <|python_tag|> (Llama 3.1), and raw / fenced JSON fallbacks. Brace counting is
    string-aware (skips braces inside JSON strings) so arguments containing code/braces aren't truncated.
  • LlamaChatFormat (3.1 + 3.2; tools injected into the first user message) and Qwen3ChatFormat (system message; <tool_call> / <tool_response> tags).

Complementary features

  • Thinking controlChatFormat.supportsThinking() / encodeThinkingControl(boolean) (default no-op). Qwen3ChatFormat primes a pre-closed <think>\n\n</think> block to skip reasoning, using the
    canonical <think>/</think> token ids (now captured by Qwen3Tokenizer before they're stripped from the special-token map). DeepSeek-R1 reports false and is never forced off.
  • Per-format default temperature / top-p, with related Options validation tidy-up.

Testing

Unit: new ToolCallParserUtilsTest (16 cases — tags, python_tag, raw/fenced JSON, unclosed blocks, batch calls, brace-in-string, escaped quotes).

End-to-end via the quarkus-langchain4j weather-agent sample (geocoding → forecast tool chain).

0. Environment

sdk use java 25.0.2-open
sdk use tornadovm 4.0.1-jdk25-ptx
  1. GPULlama3.java — build the current branch:
  cd ~/GPULlama3.java && ./mvnw clean install -DskipTests -q
  1. quarkus-langchain4j — clone the PR: https://github.com/quarkiverse/quarkus-langchain4j/pull/2604

Wire the sample's pom.xml to the local snapshot, swap OpenAI for gpu-llama3, and pass the TornadoVM argfile:

<quarkus-langchain4j.version>999-SNAPSHOT</quarkus-langchain4j.version>
...
<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-gpu-llama3</artifactId>
    <version>${quarkus-langchain4j.version}</version>
</dependency>
...
<configuration>
    <jvmArgs>@${env.TORNADOVM_HOME}/tornado-argfile</jvmArgs>
</configuration>

Configure the sample's application.properties:

quarkus.langchain4j.log-requests=true
quarkus.langchain4j.log-responses=true

quarkus.rest-client.geocoding.url=https://geocoding-api.open-meteo.com
quarkus.rest-client.openmeteo.url=https://api.open-meteo.com

# Qwen3 (tool-calling driver)
quarkus.langchain4j.gpu-llama3.chat-model.model-name=ggml-org/Qwen3-0.6B-GGUF
quarkus.langchain4j.gpu-llama3.chat-model.quantization=f16
# or Llama 3.2:
#quarkus.langchain4j.gpu-llama3.chat-model.model-name=unsloth/Llama-3.2-1B-Instruct-GGUF
#quarkus.langchain4j.gpu-llama3.chat-model.quantization=F16

quarkus.langchain4j.gpu-llama3.chat-model.temperature=0.6
quarkus.langchain4j.gpu-llama3.chat-model.top-p=0.95
quarkus.langchain4j.gpu-llama3.chat-model.max-tokens=2048
quarkus.langchain4j.gpu-llama3.chat-model.device-memory=5GB
quarkus.langchain4j.gpu-llama3.chat-model.enable-thinking=false
quarkus.langchain4j.gpu-llama3.chat-model.prefill-decode=true
quarkus.langchain4j.gpu-llama3.chat-model.prefill-batch-size=32

# tool-call / prompt tracing:
#quarkus.log.category."io.quarkiverse.langchain4j.gpullama3".level=DEBUG

Build the gpu-llama3 provider against the engine:

  cd ~/quarkus-langchain4j && mvn install -pl model-providers/gpu-llama3/runtime,model-providers/gpu-llama3/deployment -am -DskipTests -q

Run the weather-agent sample:

 cd ~/quarkus-langchain4j/samples/weather-agent && mvn quarkus:dev

Ask for a city's weather; the model emits a tool call, the agent runs geocoding → forecast, and the final answer is grounded in the tool results:

curl "http://localhost:8080/weather?city=Chania"

Notes

  • No behavioural change for non-tool, non-thinking paths — all new ChatFormat surface is default-implemented.

This comment was marked as outdated.

@orionpapadakis orionpapadakis marked this pull request as ready for review June 15, 2026 20:48
…l calls, user message integration, and enhanced response parsing.

@stratika stratika left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@orionpapadakis orionpapadakis merged commit 60cf0f4 into beehive-lab:main Jun 16, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants