
Commit d92d279

Bump version to 0.3.21
Signed-off-by: JamePeng <jame_peng@sina.com>
1 parent: 3a2e49c

2 files changed: +26 -4 lines

CHANGELOG.md

Lines changed: 25 additions & 3 deletions
@@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.3.21]
+- perf: Optimize tokenization and detokenization logic
+  - Refactor the `tokenize`, `token_to_piece`, and `detokenize` methods in `_internals.py` to significantly reduce Python loop overhead and improve the batch-processing performance and stability of `load`/`prompt-eval`.
+
+  - Key changes:
+    - Replace the O(N) `token_to_piece` Python loop in `detokenize` with the native llama.cpp batch C API (`llama_detokenize`).
+    - Implement dynamic buffer allocation to safely handle arbitrary token lengths (removing the hardcoded 32-byte limit).
+    - Add automatic buffer resizing for `tokenize` to prevent overflow errors.
+
+  - Performance observations (based on simple benchmarks):
+    - Small batch processing (127 tokens):
+      latency reduced from ~117 ms to ~37 ms (approx. 3.1x speedup in the processing loop).
+    - Large batch processing (2420 tokens):
+      throughput improved from ~6905 t/s to ~8258 t/s.
+    - General latency:
+      total execution time for standard chat scenarios reduced by ~1.1 s (from 8.4 s to 7.3 s).
+  - Comparative test results: https://github.com/JamePeng/llama-cpp-python/issues/47#issuecomment-3731055087
+
+- feat: Add `Granite-Docling` multimodal support with `GraniteDoclingChatHandler`
+- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/b1377188784f9aea26b8abde56d4aee8c733eec7](https://github.com/ggml-org/llama.cpp/commit/b1377188784f9aea26b8abde56d4aee8c733eec7)
+- feat: Sync llama.cpp llama/mtmd API Binding 20260110
+
 ## [0.3.20]
 - feat: Update llama.cpp to [ggml-org/llama.cpp/commit/cef1d23c5a33156c44a206c1f4bc146f4db729f9](https://github.com/ggml-org/llama.cpp/commit/cef1d23c5a33156c44a206c1f4bc146f4db729f9)
 - feat: Update llama_context_params and fixed some embeddings typo
@@ -37,9 +59,9 @@ More information see: https://github.com/JamePeng/llama-cpp-python/compare/2efaa
 ## [0.3.18]
 - feat: Update llama.cpp to [ggml-org/llama.cpp/commit/ce734a8a2f9fb6eb4f0383ab1370a1b0014ab787](https://github.com/ggml-org/llama.cpp/commit/ce734a8a2f9fb6eb4f0383ab1370a1b0014ab787)
 - feat: Sync llama.cpp llama/mtmd API Binding 20251215
-- feat: **implement `GLM46VChatHandler` for GLM-4.6V Series Model**
-- feat: **implement `LFM2VLChatHandler` for LFM2-VL series models**
-- feat: **implement `GLM41VChatHandler` for GLM-4.1V-9B-Thinking Model**
+- feat: **implement `GLM46VChatHandler` for the GLM-4.6V series multimodal models**
+- feat: **implement `LFM2VLChatHandler` for the LFM2-VL series multimodal models**
+- feat: **implement `GLM41VChatHandler` for the GLM-4.1V-9B-Thinking multimodal model**
 - workflow: Added workflows for compiling with CUDA 13.0.2 on Windows and Linux.
 - feat: Added the scan path for CUDA 13.0+ dynamic link libraries under Windows system ($env:CUDA_PATH\bin\x64)
 - Optimization: Improved batch token processing logic in Llava15ChatHandler.
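The buffer-handling changes described in the 0.3.21 entry above come down to a resize-on-negative-return pattern around llama.cpp's batch tokenize/detokenize C API. Below is a minimal sketch of that pattern, not the actual `_internals.py` implementation: the function names (`detokenize_batch`, `tokenize_resizing`), the `vocab` handle, the initial buffer-size guesses, and the exact ctypes binding signatures for `llama_cpp.llama_detokenize` / `llama_cpp.llama_tokenize` are assumptions based on the upstream llama.cpp API.

```python
import ctypes

import llama_cpp  # low-level ctypes bindings (signatures assumed from upstream llama.cpp)


def detokenize_batch(vocab, tokens: list[int], special: bool = False) -> bytes:
    """One native llama_detokenize call instead of a per-token Python loop."""
    n_tokens = len(tokens)
    tokens_arr = (llama_cpp.llama_token * n_tokens)(*tokens)
    buf_size = max(n_tokens * 8, 32)  # dynamic initial guess, not a fixed 32-byte buffer
    while True:
        buf = ctypes.create_string_buffer(buf_size)
        n = llama_cpp.llama_detokenize(
            vocab, tokens_arr, n_tokens, buf, buf_size, False, special
        )
        if n < 0:
            buf_size = -n  # buffer too small; -n is the required size, so retry
            continue
        return buf.raw[:n]


def tokenize_resizing(vocab, text: bytes, add_special: bool, parse_special: bool) -> list[int]:
    """Tokenize with automatic buffer resizing to avoid overflow errors."""
    n_max = len(text) + 8  # rough upper bound: about one token per byte
    while True:
        tokens = (llama_cpp.llama_token * n_max)()
        n = llama_cpp.llama_tokenize(
            vocab, text, len(text), tokens, n_max, add_special, parse_special
        )
        if n < 0:
            n_max = -n  # negative return reports the required capacity, so retry
            continue
        return list(tokens[:n])
```

The retry loop terminates because a negative return value reports the exact capacity the call needs, so the second attempt always succeeds.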

llama_cpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 from .llama_cpp import *
 from .llama import *
 
-__version__ = "0.3.20"
+__version__ = "0.3.21"
