Add kv_cache size to pipeline generation log #3989
Conversation
mzegla
left a comment
Reporting 100% of X in every log does not seem to make sense for a dynamic cache size.
It's minor, but could we drop the percentage for the dynamic cache size and only report the size in MB/GB etc.?
Pull request overview
Updates OVMS GenAI continuous batching logging to include kv-cache size (and whether the cache is treated as dynamic vs static), and bumps pinned OpenVINO/GenAI/tokenizers dependencies to versions that provide the required metrics field.
Changes:
- Extend continuous batching metrics log line to print cache usage along with kv-cache size (bytes formatted as B/KB/MB/GB/TB).
- Propagate a "dynamic kv-cache" flag into `LLMExecutorWrapper` based on `schedulerConfig.cache_size == 0`.
- Bump OpenVINO / OpenVINO GenAI / OpenVINO tokenizers pins (Makefile, Windows deps script, demo export requirements) to `dev20260305` / updated commit hashes.
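The first bullet mentions formatting the kv-cache size as B/KB/MB/GB/TB. The PR's actual helper is not fully visible in this diff, but the idea can be sketched as a small standalone function; the name `formatBytes` and the 1024-based unit ladder are assumptions, not the confirmed implementation:

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Sketch (hypothetical name): render a byte count with one decimal place
// and the largest unit below 1024, e.g. "1022.4 MB".
std::string formatBytes(size_t bytes) {
    static const char* units[] = {"B", "KB", "MB", "GB", "TB"};
    double value = static_cast<double>(bytes);
    size_t unit = 0;
    while (value >= 1024.0 && unit < 4) {  // stop at TB
        value /= 1024.0;
        ++unit;
    }
    std::ostringstream oss;
    oss << std::fixed << std::setprecision(1) << value << " " << units[unit];
    return oss.str();
}
```

This keeps the log line compact regardless of whether the cache is a few MB (dynamic, freshly grown) or tens of GB (static, preallocated).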
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/llm/language_model/continuous_batching/llm_executor.hpp` | Adds formatting helpers and logs kv-cache size + dynamic/static label in metrics output. |
| `src/llm/language_model/continuous_batching/servable_initializer.cpp` | Passes dynamic/static cache indicator into `LLMExecutorWrapper`. |
| `Makefile` | Updates pinned OpenVINO/GenAI/tokenizers branches and nightly package URLs. |
| `windows_install_build_dependencies.bat` | Updates default GenAI nightly ZIP URL and pinned source branches on Windows. |
| `demos/common/export_models/requirements.txt` | Updates OpenVINO and tokenizers Python package versions to `dev20260305`. |
```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

#include <openvino/genai/continuous_batching_pipeline.hpp>

#include "../../../logging.hpp"
#include "../../../profiler.hpp"

namespace ovms {
struct LLMExecutor {
    bool isDynamicKVCache;
    // For logging purposes we could have more information about graph and node here
    std::mutex mutex;
    std::condition_variable cv;
    std::shared_ptr<ov::genai::ContinuousBatchingPipeline> pipe = nullptr;

    LLMExecutor(std::shared_ptr<ov::genai::ContinuousBatchingPipeline> pipe, bool isDynamicKVCacheSet = false) {
        this->pipe = std::move(pipe);
        this->isDynamicKVCache = isDynamicKVCacheSet;
    }

    bool hasRequests() {
        return (pipe->has_non_finished_requests());
    }

    void step() {
        OVMS_PROFILE_FUNCTION();
        pipe->step();
    }

    void waitForRequests(std::atomic<bool>* receivedEndSignal) {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [this, receivedEndSignal] { return (pipe->has_non_finished_requests() || *receivedEndSignal); });
    }

    void notify() {
        std::unique_lock<std::mutex> lock(mutex);
        cv.notify_one();
    }

    std::string formatCacheInfo(float cacheUsage, size_t cacheBytes, bool isCacheDynamic) {
        std::ostringstream oss;
        oss << std::fixed << std::setprecision(1);
        // ...
```
This header now uses `std::ostringstream` and I/O manipulators (`std::fixed` / `std::setprecision`) but does not include the required standard headers (`<sstream>` and `<iomanip>`). This will fail to compile in TUs that don't already include them indirectly; add the missing includes here (and consider `<cstddef>` for `size_t` to keep the header self-contained).
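The fix the comment asks for amounts to adding the headers the new helper actually uses. A minimal self-contained sketch of the pattern (the helper name `onePlace` is illustrative, not from the PR):

```cpp
#include <cstddef>   // size_t
#include <iomanip>   // std::fixed, std::setprecision
#include <sstream>   // std::ostringstream
#include <string>

// With the includes above, manipulator-based formatting compiles in any
// translation unit that pulls in this header, regardless of what else
// that TU happens to include.
std::string onePlace(float value) {
    std::ostringstream oss;
    oss << std::fixed << std::setprecision(1) << value;
    return oss.str();
}
```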
### 🛠 Summary
JIRA: CVS-181044

Depends on https://github.com/openvinotoolkit/openvino.genai/pull/3352/changes and the OVMS GenAI update.

Example output:
- `Cache usage static 18.6% of 1022.4 MB;` (static cache of 1 GB)
- `Cache usage dynamic 98.6% of 189.9 MB;` (dynamic cache, `cache_size = 0`)

For usage `{"prompt_tokens":2299,"completion_tokens":30,"total_tokens":2329}`.
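The dynamic/static label in the output above follows from the rule stated in the review overview: `schedulerConfig.cache_size == 0` means the KV cache grows on demand. A minimal sketch of that wiring; the struct and helper names here are illustrative stand-ins for the GenAI `SchedulerConfig`, not its real definition:

```cpp
#include <cstddef>
#include <string>

// Illustrative stand-in for ov::genai::SchedulerConfig.
struct SchedulerConfigLite {
    size_t cache_size = 0;  // KV-cache size in GB; 0 requests a dynamic cache
};

// Rule from the PR description: cache_size == 0 => dynamic KV cache.
bool isDynamicKVCache(const SchedulerConfigLite& cfg) {
    return cfg.cache_size == 0;
}

// Label used in the metrics log line ("dynamic" vs "static").
std::string cacheLabel(const SchedulerConfigLite& cfg) {
    return isDynamicKVCache(cfg) ? "dynamic" : "static";
}
```

This matches the reviewer's point: with a dynamic cache the pipeline allocates roughly what it needs, so usage hovers near 100% and the percentage carries little information, while the absolute size (e.g. 189.9 MB) does.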
### 🧪 Checklist