CHANGELOG.md: 25 additions & 3 deletions
@@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

+## [0.3.21]
+- perf: Optimize tokenization and detokenization logic
+  - Refactor the `tokenize`, `token_to_piece`, and `detokenize` methods in `_internals.py` to significantly reduce Python loop overhead and improve the batch-processing performance and stability of `load`/`prompt-eval`.
+
+  - Key changes (an illustrative sketch of the buffer-resizing pattern follows the 0.3.21 entries below):
+    - Replace the O(N) `token_to_piece` Python loop in `detokenize` with the native batch C API of `llama.cpp` (`llama_detokenize`).
+    - Implement dynamic buffer allocation to safely handle arbitrary token lengths (removing the hardcoded 32-byte limit).
+    - Add automatic buffer resizing for `tokenize` to prevent overflow errors.
+
+  - Performance observations (based on simple benchmarks):
+    - Small batch processing (127 tokens):
+      latency reduced from ~117ms to ~37ms (approx. 3.1x speedup in the processing loop).
+    - Large batch processing (2420 tokens):
+      throughput improved from ~6905 t/s to ~8258 t/s.
+    - General latency:
+      total execution time for standard chat scenarios reduced by ~1.1s (from 8.4s to 7.3s).
+  - The comparative test results are here: https://github.com/JamePeng/llama-cpp-python/issues/47#issuecomment-3731055087
+
+- feat: Add `Granite-Docling` multimodal support with `GraniteDoclingChatHandler`
+- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/b1377188784f9aea26b8abde56d4aee8c733eec7](https://github.com/ggml-org/llama.cpp/commit/b1377188784f9aea26b8abde56d4aee8c733eec7)
+- feat: Sync llama.cpp llama/mtmd API Binding 20260110
+
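The snippet below is a minimal, illustrative sketch of the dynamic-buffer pattern described in the 0.3.21 perf entry above. It is not the actual `_internals.py` code; the function names are hypothetical, and it assumes the low-level `llama_cpp` bindings expose `llama_tokenize` / `llama_detokenize` with the current llama.cpp C signatures, where a negative return value is the negated buffer size the call would have needed.

```python
# Hypothetical sketch of the buffer-resizing pattern (not the fork's actual code).
import ctypes

import llama_cpp


def detokenize_batch(vocab, tokens, special: bool = False) -> bytes:
    """Detokenize a whole batch with one C call, growing the buffer on demand."""
    n_tokens = len(tokens)
    token_arr = (llama_cpp.llama_token * n_tokens)(*tokens)
    buf_size = max(32, n_tokens * 8)  # initial guess instead of a hard 32-byte cap
    while True:
        buf = ctypes.create_string_buffer(buf_size)
        n = llama_cpp.llama_detokenize(
            vocab, token_arr, n_tokens, buf, buf_size,
            False,    # remove_special
            special,  # unparse_special
        )
        if n >= 0:
            return buf.raw[:n]
        buf_size = -n  # buffer too small: retry with the reported size


def tokenize_auto_resize(vocab, text: bytes, add_special: bool = True) -> list:
    """Tokenize with automatic retry when the token buffer would overflow."""
    n_max = len(text) + 16  # rough upper-bound guess
    while True:
        tokens = (llama_cpp.llama_token * n_max)()
        n = llama_cpp.llama_tokenize(
            vocab, text, len(text), tokens, n_max, add_special, False
        )
        if n >= 0:
            return list(tokens[:n])
        n_max = -n  # not enough room: retry with the required capacity
```

The point of the pattern is that the whole token batch crosses the C boundary once, and the Python side only loops in the rare case where the initial buffer guess was too small.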
## [0.3.20]
- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/cef1d23c5a33156c44a206c1f4bc146f4db729f9](https://github.com/ggml-org/llama.cpp/commit/cef1d23c5a33156c44a206c1f4bc146f4db729f9)
- feat: Update llama_context_params and fixed some embedding typos
@@ -37,9 +59,9 @@ More information see: https://github.com/JamePeng/llama-cpp-python/compare/2efaa
## [0.3.18]
- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/ce734a8a2f9fb6eb4f0383ab1370a1b0014ab787](https://github.com/ggml-org/llama.cpp/commit/ce734a8a2f9fb6eb4f0383ab1370a1b0014ab787)
- feat: Sync llama.cpp llama/mtmd API Binding 20251215
-- feat: **implement `GLM46VChatHandler` for GLM-4.6V Series Model**
-- feat: **implement `LFM2VLChatHandler` for LFM2-VL series models**
-- feat: **implement `GLM41VChatHandler` for GLM-4.1V-9B-Thinking Model**
+- feat: **implement `GLM46VChatHandler` for the GLM-4.6V series multimodal models**
+- feat: **implement `LFM2VLChatHandler` for the LFM2-VL series multimodal models**
+- feat: **implement `GLM41VChatHandler` for the GLM-4.1V-9B-Thinking multimodal model**
- workflow: Added workflows for compiling with CUDA 13.0.2 on Windows and Linux.
- feat: Added the scan path for CUDA 13.0+ dynamic-link libraries on Windows (`$env:CUDA_PATH\bin\x64`); a rough sketch of the idea follows this list
- Optimization: Improved batch token processing logic in Llava15ChatHandler.
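As a hedged illustration of the CUDA 13.0+ scan-path entry above: newer CUDA toolkits on Windows place runtime DLLs under `$env:CUDA_PATH\bin\x64`, so that directory has to be registered with `os.add_dll_directory` before the compiled backend is loaded. The helper name below is made up; the real lookup lives in the package's shared-library loading code.

```python
# Illustrative only: register CUDA DLL directories on Windows before loading
# the native llama backend. CUDA 13.0+ ships its runtime DLLs under
# CUDA_PATH\bin\x64 in addition to CUDA_PATH\bin.
import os
import sys


def add_cuda_dll_dirs() -> None:
    if sys.platform != "win32":
        return  # DLL search paths are a Windows-only concern
    cuda_path = os.environ.get("CUDA_PATH")
    if not cuda_path:
        return
    for sub in ("bin", os.path.join("bin", "x64")):
        dll_dir = os.path.join(cuda_path, sub)
        if os.path.isdir(dll_dir):
            os.add_dll_directory(dll_dir)
```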