Llamaj.cpp

Llamaj.cpp is a Java/JVM port of llama.cpp. It uses jextract-generated bindings and the Foreign Function & Memory (FFM) API to run large language model (LLM) inference locally through native interop. It supports macOS M-series and Linux x86_64 with GPU acceleration out of the box. Other platforms and hardware (Windows, ARM, CUDA, etc.) can be supported through custom builds.

Keywords

llama.cpp · java · jvm · llm · large language models · inference · ai · native interop · foreign function & memory api · jextract

Requirements

Java 25
Maven (mvn)
macOS M-series or Linux x86_64 (CPU). If your platform is not listed, see the last section.

How to use

Include the dependency in your pom.xml

    <dependencies>
        ...
        <dependency>
            <groupId>io.gravitee.llama.cpp</groupId>
            <artifactId>llamaj.cpp</artifactId>
            <version>x.x.x</version>
        </dependency>
    </dependencies>

Note: All examples below use LlamaVocab to handle tokenization. It's obtained from a loaded LlamaModel and is essential for converting between tokens and text representations.

Example 1: Basic Conversation

import io.gravitee.llama.cpp.*;
import java.lang.foreign.Arena;
import java.nio.file.Path;

public class BasicExample {
    public static void main(String[] args) {
        var arena = Arena.ofConfined();

        // Initialize runtime
        LlamaRuntime.llama_backend_init();

        // Load model
        var modelParams = new LlamaModelParams(arena);
        var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);

        // Create context
        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
        var context = new LlamaContext(model, contextParams);

        // Set up tokenizer and sampler
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena)
            .temperature(0.7f)
            .topK(40)
            .topP(0.9f, 1)
            .seed(42);

        // Create conversation state
        var state = ConversationState.create(arena, context, tokenizer, sampler, 0)
            .setMaxTokens(100)
            .initialize("What is the capital of France?");

        // Generate response
        var iterator = new DefaultLlamaIterator(state);
        while (iterator.hasNext()) {
            var output = iterator.next();
            System.out.print(output.text());
        }

        // Cleanup
        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}
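
The cleanup calls in Example 1 only run if generation finishes without throwing. The following is a minimal sketch of one way to make the same cleanup order hold on failure too. CleanupStack is a hypothetical helper, not part of llamaj.cpp, and the lambdas are stand-ins for the free() calls shown above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper (not part of llamaj.cpp): runs registered cleanup
// actions last-registered-first, like nested try/finally blocks.
final class CleanupStack implements AutoCloseable {
    private final Deque<Runnable> actions = new ArrayDeque<>();

    // Register a cleanup action; later registrations run earlier.
    CleanupStack defer(Runnable action) {
        actions.push(action);
        return this;
    }

    // Run every action even if one of them throws; rethrow the first failure.
    @Override
    public void close() {
        RuntimeException first = null;
        while (!actions.isEmpty()) {
            try {
                actions.pop().run();
            } catch (RuntimeException e) {
                if (first == null) first = e; else first.addSuppressed(e);
            }
        }
        if (first != null) throw first;
    }

    // Demo with stand-in actions; real code would pass e.g. model::free.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        try (var cleanup = new CleanupStack()) {
            cleanup.defer(() -> log.add("backend_free"));
            cleanup.defer(() -> log.add("model.free"));
            cleanup.defer(() -> log.add("sampler.free"));
            cleanup.defer(() -> log.add("context.free"));
            log.add("generate"); // stand-in for the iterator loop
        }
        return log;
    }
}
```

The actions are registered so that they run in the same order as the Cleanup section of the example: context, sampler, model, then the backend.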

Example 2: Log Probabilities

Enable log-probability collection to inspect the model's confidence at each token position. Set topLogprobs to the number of top-alternative tokens you want alongside the sampled one (0 = disabled, no overhead):

import io.gravitee.llama.cpp.*;
import java.lang.foreign.Arena;
import java.nio.file.Path;

public class LogprobsExample {
    public static void main(String[] args) {
        var arena = Arena.ofConfined();
        LlamaRuntime.llama_backend_init();

        var model = new LlamaModel(arena, Path.of("models/model.gguf"), new LlamaModelParams(arena));
        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
        var context = new LlamaContext(arena, model, contextParams);
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);

        var state = ConversationState.create(arena, context, tokenizer, sampler)
            .setMaxTokens(50)
            .setTopLogprobs(5)   // return top-5 alternatives at every token position
            .initialize("What is the capital of France?");

        var iterator = new DefaultLlamaIterator(state);
        while (iterator.hasNext()) {
            var output = iterator.next();
            System.out.print(output.text());

            Logprobs lp = output.logprobs();
            System.out.printf("%n  chosen: \"%s\"  logprob=%.4f%n",
                lp.chosenToken().token(), lp.chosenToken().logprob());
            lp.topLogprobs().forEach(t ->
                System.out.printf("    alt: \"%s\"  logprob=%.4f%n", t.token(), t.logprob()));
        }

        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}

Each LlamaOutput carries a Logprobs object with:

chosenToken() — the token that was sampled, its text, vocabulary ID, log-probability, and raw UTF-8 bytes
topLogprobs() — up to N alternatives sorted by descending log-probability; the chosen token is always included

When topLogprobs is 0 (the default), output.logprobs() is null and no logit processing is done.
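
Log-probabilities are easier to read once converted to plain probabilities or summarized as a perplexity. The helper below is a sketch, not part of llamaj.cpp. It works on raw double values and assumes the reported log-probabilities are natural logarithms, which is the usual convention but is not stated above.

```java
import java.util.List;

// Hypothetical helper class for interpreting log-probabilities.
final class LogprobMath {

    // Convert a natural-log probability back to a probability in [0, 1].
    static double toProbability(double logprob) {
        return Math.exp(logprob);
    }

    // Perplexity of a generated sequence: exp of the negative mean logprob
    // of the chosen tokens. Lower means the model was more confident.
    static double perplexity(List<Double> chosenLogprobs) {
        if (chosenLogprobs.isEmpty()) {
            throw new IllegalArgumentException("at least one logprob is required");
        }
        double sum = 0.0;
        for (double lp : chosenLogprobs) {
            sum += lp;
        }
        return Math.exp(-sum / chosenLogprobs.size());
    }
}
```

Inside the generation loop of Example 2, this could be called as LogprobMath.toProbability(lp.chosenToken().logprob()), collecting the chosen log-probabilities in a list to compute the perplexity at the end.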

Example 3: Parallel Conversations

Process multiple conversations simultaneously in a single batch:

import io.gravitee.llama.cpp.*;

import java.lang.foreign.Arena;
import java.nio.file.Path;

public class ParallelExample {
    public static void main(String[] args) {
        var arena = Arena.ofConfined();

        // Initialize runtime
        LlamaRuntime.llama_backend_init();

        // Load model
        var modelParams = new LlamaModelParams(arena);
        var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);

        // Create context with multi-sequence support
        var contextParams = new LlamaContextParams(arena)
                .nCtx(2048)
                .nBatch(512)
                .nSeqMax(4);  // Support up to 4 parallel conversations
        var context = new LlamaContext(model, contextParams);

        // Set up shared tokenizer and sampler
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);

        // Create multiple conversation states with unique sequence IDs
        var state1 = ConversationState.create(arena, context, tokenizer, sampler, 0)
                .setMaxTokens(100).initialize("What is the capital of France?");
        var state2 = ConversationState.create(arena, context, tokenizer, sampler, 1)
                .setMaxTokens(100).initialize("What is the capital of England?");
        var state3 = ConversationState.create(arena, context, tokenizer, sampler, 2)
                .setMaxTokens(100).initialize("What is the capital of Poland?");

        // Create parallel iterator - prompts are auto-processed when states are added
        var parallel = new BatchIterator(arena, context, 512, 4)
                .addState(state1)
                .addState(state2)
                .addState(state3);

        // Generate tokens in parallel
        System.out.println("=== Parallel Generation ===");
        while (parallel.hasNext()) {
            // Each hasNext() generates tokens for all active conversations
            // Get all outputs from this batch (one per active conversation)
            var outputs = parallel.getOutputs();
            for (var output : outputs) {
                System.out.println("Seq " + output.sequenceId() + ": " + output.text());
            }
        }
        System.out.println();

        // Print results
        System.out.println("Conversation 1: " + state1.getAnswer());
        System.out.println("  Tokens: " + state1.getAnswerTokens());
        System.out.println("Conversation 2: " + state2.getAnswer());
        System.out.println("  Tokens: " + state2.getAnswerTokens());
        System.out.println("Conversation 3: " + state3.getAnswer());
        System.out.println("  Tokens: " + state3.getAnswerTokens());

        // Cleanup
        parallel.free();
        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}

Example 4: Distributed Inference with RPC

Offload model weights and KV-cache to remote machines using the RPC backend. When using --rpc, weights are loaded exclusively on the remote servers -- the local GPU is not used.

Start RPC server nodes first (see containers/README.md):

# On the remote machine (or another terminal)
./scripts/start-rpc-server.sh

Then connect from Java:

import io.gravitee.llama.cpp.*;
import io.gravitee.llama.cpp.nativelib.LlamaLibLoader;
import java.lang.foreign.Arena;
import java.nio.file.Path;

public class RpcExample {
    public static void main(String[] args) {
        var arena = Arena.ofConfined();

        // Initialize runtime
        String libPath = LlamaLibLoader.load();
        LlamaRuntime.llama_backend_init();

        // Register remote RPC servers -- returns their device handles
        var rpcDevices = BackendRegistry.addRpcServer(arena, "127.0.0.1:50052");

        // Print all discovered backends and devices
        BackendRegistry.printSummary();

        // Load model, restricting offloading to only the RPC devices
        var modelParams = new LlamaModelParams(arena)
            .devices(arena, rpcDevices)
            .nGpuLayers(999);
        var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);

        // Everything else works exactly the same as local inference
        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
        var context = new LlamaContext(model, contextParams);
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);

        var state = ConversationState.create(arena, context, tokenizer, sampler, 0)
            .setMaxTokens(100)
            .initialize("What is the capital of France?");

        var iterator = new DefaultLlamaIterator(state);
        while (iterator.hasNext()) {
            System.out.print(iterator.next().text());
        }

        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}

Or from the CLI:

$ java --enable-preview --enable-native-access=ALL-UNNAMED \
  -jar llamaj.cpp-<version>.jar \
  --model models/model.gguf \
  --rpc 127.0.0.1:50052

Multiple RPC servers:

$ java --enable-preview --enable-native-access=ALL-UNNAMED \
  -jar llamaj.cpp-<version>.jar \
  --model models/model.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052
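
When the endpoint list comes from configuration rather than a hand-typed command line, it can help to validate it before starting inference. The sketch below shows one way to split and check a comma-separated host:port list. RpcEndpoints and Endpoint are hypothetical names, and this is not how the CLI itself parses --rpc.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator for a comma-separated "host:port" list.
final class RpcEndpoints {

    record Endpoint(String host, int port) {}

    static List<Endpoint> parse(String value) {
        List<Endpoint> result = new ArrayList<>();
        for (String part : value.split(",")) {
            String item = part.trim();
            if (item.isEmpty()) {
                continue; // tolerate stray commas
            }
            int colon = item.lastIndexOf(':');
            if (colon <= 0 || colon == item.length() - 1) {
                throw new IllegalArgumentException("Expected host:port but got '" + item + "'");
            }
            int port;
            try {
                port = Integer.parseInt(item.substring(colon + 1));
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid port in '" + item + "'", e);
            }
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("Port out of range in '" + item + "'");
            }
            result.add(new Endpoint(item.substring(0, colon), port));
        }
        return result;
    }
}
```

For example, RpcEndpoints.parse("192.168.1.10:50052,192.168.1.11:50052") yields two endpoints, while a value such as "192.168.1.10" is rejected with an IllegalArgumentException.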

Build

The build uses a platform-specific Maven profile to download the correct jextract tool and pre-built llama.cpp native libraries, generate the Java FFM bindings, format the code, apply license headers, and install the artifact to your local Maven repository.

macOS (Apple Silicon):

cd llamaj.cpp/
mvn prettier:write license:format clean generate-sources -Pmacosx-aarch64 install

Linux (x86_64):

cd llamaj.cpp/
mvn prettier:write license:format clean generate-sources -Plinux-x86_64 install

On Linux, you also need to set the library path at runtime:
export LD_LIBRARY_PATH="$HOME/.llama.cpp:$LD_LIBRARY_PATH"

Run

$ mvn exec:java -Dexec.mainClass=io.gravitee.llama.cpp.Main \
    -Dexec.args="--model /path/to/model/model.gguf --system 'You are a helpful assistant. Answer question to the best of your ability'"

or

$ java --enable-preview -jar llamaj.cpp-<version>.jar \
  --model models/model.gguf \
  --system 'You are a helpful assistant. Answer question to the best of your ability'

On Linux, remember to add the native libraries to the library path with the environment variable below:

$ export LD_LIBRARY_PATH="$HOME/.llama.cpp:$LD_LIBRARY_PATH"

There are plenty of models on Hugging Face; we suggest the one here

Usage

Usage: java -jar llamaj.cpp-<version>.jar --model <path_to_gguf_model> [options...]
Options:
--system <message>       : System message (default: "You are a helpful AI assistant.")
--n_gpu_layers <int>     : Number of GPU layers (default: 999)
--use_mlock <boolean>    : Use mlock (default: true)
--use_mmap <boolean>     : Use mmap (default: true)
--rpc <endpoints>        : Comma-separated RPC server endpoints for distributed inference
                           (e.g., "127.0.0.1:50052,192.168.1.11:50052")
                           When set, weights are offloaded exclusively to the remote servers
--temperature <float>    : Sampler temperature (default: 0.4)
--min_p <float>          : Sampler min_p (default: 0.1)
--min_p_window <int>     : Sampler min_p_window (default: 40)
--top_k <int>            : Sampler top_k (default: 10)
--top_p <float>          : Sampler top_p (default: 0.2)
--top_p_window <int>     : Sampler top_p_window (default: 10)
--seed <long>            : Sampler seed (default: random)
--n_ctx <int>            : Context size (default: 512)
--n_batch <int>          : Batch size (default: 512)
--n_seq_max <int>        : Max number of parallel sequences (default: 512)
--quota <int>            : Iterator quota (default: 512)
--n_keep <int>           : Tokens to keep when exceeding ctx size (default: 256)
--log_level <level>      : Logging level (ERROR, WARN, INFO, DEBUG, default: ERROR)
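
To give an intuition for what --temperature, --top_k and --top_p control, here is a textbook sketch of temperature scaling followed by top-k and top-p (nucleus) filtering. It is not the library's or llama.cpp's implementation. The order of the steps follows the sampler chain in Example 1, and SamplingSketch is a hypothetical name. The window options are not modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Textbook illustration only; not how llamaj.cpp samples internally.
final class SamplingSketch {

    // Returns renormalized probabilities indexed by token id;
    // tokens removed by top-k or top-p get probability 0.
    static double[] filter(double[] logits, double temperature, int topK, double topP) {
        if (logits.length == 0 || temperature <= 0 || topK < 1 || topP <= 0) {
            throw new IllegalArgumentException("need logits, temperature > 0, topK >= 1, topP > 0");
        }
        int n = logits.length;

        // 1. Temperature-scaled softmax (max subtracted for numerical stability).
        double max = Double.NEGATIVE_INFINITY;
        for (double logit : logits) max = Math.max(max, logit);
        double[] probs = new double[n];
        double total = 0;
        for (int i = 0; i < n; i++) {
            probs[i] = Math.exp((logits[i] - max) / temperature);
            total += probs[i];
        }
        for (int i = 0; i < n; i++) probs[i] /= total;

        // 2. Token ids sorted by descending probability.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) order.add(i);
        order.sort(Comparator.comparingDouble((Integer id) -> probs[id]).reversed());

        // 3. Keep at most topK tokens, stopping once their mass reaches topP.
        double[] kept = new double[n];
        double cumulative = 0;
        int count = 0;
        for (int id : order) {
            if (count >= topK || cumulative >= topP) break;
            kept[id] = probs[id];
            cumulative += probs[id];
            count++;
        }

        // 4. Renormalize the surviving tokens so they sum to 1.
        for (int i = 0; i < n; i++) kept[i] /= cumulative;
        return kept;
    }
}
```

Lower temperature sharpens the distribution, while smaller top_k and top_p values shrink the set of candidate tokens, which is why the defaults above tend to give focused, repeatable answers.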

Use your own llama.cpp build

Clone llama.cpp repository

Make sure the jextract folder is in the same path level as your repository

$ git clone https://github.com/ggml-org/llama.cpp
$ cd llama.cpp

Compile sources

Make sure you have the gcc / g++ compilers installed:

$ gcc --help
$ g++ --help

On Linux:

$ cmake -B build
$ cmake --build build --config Release -j $(nproc)

On macOS:

$ cmake -B build
$ cmake --build build --config Release -j $(sysctl -n hw.ncpu)

If you wish to build llama.cpp with a particular configuration (CUDA, OpenBLAS, AVX2, AVX512, ...), please refer to the llama.cpp documentation.

Link sources

Set the environment variable LLAMA_CPP_LIB_PATH=/path/to/llama.cpp/build/bin/ to load the shared library files (.so on Linux, .dylib on macOS) directly from that directory. You can also set LLAMA_CPP_USE_TMP_LIB_PATH=true to copy these files into a temporary folder, in which case the libraries are loaded from that temporary path.
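
Before launching, it can be useful to check what these variables will resolve to. The sketch below only mirrors the description above. It is not the library's loader, LibPathSketch is a hypothetical name, and how the two variables interact inside llamaj.cpp is not specified here.

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical pre-flight check for the environment variables described above.
final class LibPathSketch {

    // Directory named by LLAMA_CPP_LIB_PATH, if the variable is set and non-blank.
    static Optional<Path> libraryDirectory(Map<String, String> env) {
        String value = env.get("LLAMA_CPP_LIB_PATH");
        if (value == null || value.isBlank()) {
            return Optional.empty();
        }
        return Optional.of(Path.of(value).toAbsolutePath().normalize());
    }

    // True only when LLAMA_CPP_USE_TMP_LIB_PATH is set to "true" (case-insensitive).
    static boolean useTemporaryCopy(Map<String, String> env) {
        return Boolean.parseBoolean(env.get("LLAMA_CPP_USE_TMP_LIB_PATH"));
    }

    // Shared library extension for the two platforms named in this README.
    static String sharedLibraryExtension(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac")) return ".dylib";
        if (os.contains("linux")) return ".so";
        throw new IllegalArgumentException("Unsupported OS for this sketch: " + osName);
    }
}
```

A typical call would be LibPathSketch.libraryDirectory(System.getenv()) together with LibPathSketch.sharedLibraryExtension(System.getProperty("os.name")), then listing the files with that extension in the returned directory.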

Beyond Apple M-Series and Linux x86_64

To add support for other platforms (Windows, ARM, CUDA, etc.), follow this approach:

1. Build llama.cpp

Clone and build llama.cpp for your target platform:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release

2. Generate FFM API Bindings with jextract

Download jextract for your platform from OpenJDK early-access builds, then generate the Java bindings:

# Example for Windows x86_64
jextract -t io.gravitee.llama.cpp.windows.x86_64 \
  --include-dir /path/to/llama.cpp/ggml/include \
  --include-dir /path/to/llama.cpp/include \
  --output src/main/java \
  --header-class-name llama_h \
  /path/to/llama.cpp/tools/mtmd/mtmd.h \
  /path/to/llama.cpp/tools/mtmd/mtmd-helper.h \
  /path/to/llama.cpp/include/llama.h \
  /path/to/llama.cpp/ggml/include/ggml-rpc.h

3. Post-process Generated Sources

Check the generated sources and apply any necessary fixes (e.g., visibility modifiers, fully-qualified method calls).

4. Build the Bindings JAR

Compile the generated sources and build a JAR using your own build system (Maven, Gradle, etc.).

5. Integrate into Your Classpath

Add the generated JAR to your project's classpath and ensure the native libraries from step 1 are available at runtime.
