From 678e7c68345ff8b40c02803f01f4a0500788ac70 Mon Sep 17 00:00:00 2001
From: Pawel
Date: Tue, 3 Mar 2026 13:59:02 +0100
Subject: [PATCH 1/6] save

---
 demos/continuous_batching/agentic_ai/README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 4963110568..514f673962 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -570,6 +570,12 @@ pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/ma
```
Make sure nodejs and npx are installed. On ubuntu it would require `sudo apt install nodejs npm`. On windows, visit https://nodejs.org/en/download. It is needed for the `file system` MCP server.
+For Windows applications, it may be required to set an environment variable to enforce UTF-8 encoding in Python:
+
+```bat
+set PYTHONUTF8=1
+```
+
Run the agentic application:

From ace7b26f0312b696c6d35671355692673a5ed83d Mon Sep 17 00:00:00 2001
From: Pawel
Date: Wed, 4 Mar 2026 13:03:28 +0100
Subject: [PATCH 2/6] draft

---
 .../continuous_batching/agentic_ai/README.md | 412 +++---------
 1 file changed, 57 insertions(+), 355 deletions(-)

diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md
index 514f673962..dda38439f4 100644
--- a/demos/continuous_batching/agentic_ai/README.md
+++ b/demos/continuous_batching/agentic_ai/README.md
@@ -12,165 +12,6 @@ The tools can also be used for automation purposes based on input in text format

> **Note:** On Windows, make sure to use the weekly or 2025.4 release packages for proper functionality.
-## Export LLM model -Currently supported models: -- Qwen/Qwen3-8B -- Qwen/Qwen3-4B -- meta-llama/Llama-3.1-8B-Instruct -- meta-llama/Llama-3.2-3B-Instruct -- NousResearch/Hermes-3-Llama-3.1-8B -- mistralai/Mistral-7B-Instruct-v0.3 -- microsoft/Phi-4-mini-instruct -- Qwen/Qwen3-Coder-30B-A3B-Instruct -- openai/gpt-oss-20b - - -### Export using python script - -Use those steps to convert the model from HugginFace Hub to OpenVINO format and export it to a local storage. - -```console -# Download export script, install its dependencies and create directory for the models -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py -pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt -mkdir models -``` -Run `export_model.py` script to download and quantize the model: - -> **Note:** The users in China need to set environment variable HF_ENDPOINT="https://hf-mirror.com" or "https://www.modelscope.cn/models" before running the export script to connect to the HF Hub. 
- -::::{tab-set} -:::{tab-item} Qwen3-8B -:sync: Qwen3-8B -```console -python export_model.py text_generation --source_model Qwen/Qwen3-8B --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser hermes3 -``` -::: -:::{tab-item} Qwen3-4B -:sync: Qwen3-4B -```console -python export_model.py text_generation --source_model Qwen/Qwen3-4B --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser hermes3 -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```console -python export_model.py text_generation --source_model meta-llama/Llama-3.1-8B-Instruct --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser llama3 -curl -L -o models/meta-llama/Llama-3.1-8B-Instruct/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.1_json.jinja -``` -::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```console -python export_model.py text_generation --source_model meta-llama/Llama-3.2-3B-Instruct --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser llama3 -curl -L -o models/meta-llama/Llama-3.2-3B-Instruct/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.2_json.jinja -``` -::: -:::{tab-item} Hermes-3-Llama-3.1-8B -:sync: Hermes-3-Llama-3.1-8B -```console -python export_model.py text_generation --source_model NousResearch/Hermes-3-Llama-3.1-8B --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser hermes3 -curl -L -o models/NousResearch/Hermes-3-Llama-3.1-8B/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_hermes.jinja -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: 
Mistral-7B-Instruct-v0.3 -```console -python export_model.py text_generation --source_model mistralai/Mistral-7B-Instruct-v0.3 --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser mistral --extra_quantization_params "--task text-generation-with-past" -curl -L -o models/mistralai/Mistral-7B-Instruct-v0.3/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.10.1.1/examples/tool_chat_template_mistral_parallel.jinja -``` -::: -:::{tab-item} Qwen3-Coder-30B-A3B-Instruct -:sync: Qwen3-Coder-30B-A3B-Instruct -```console -python export_model.py text_generation --source_model Qwen/Qwen3-Coder-30B-A3B-Instruct --weight-format int4 --config_file_path models/config.json --model_repository_path models --tool_parser qwen3coder -curl -L -o models/Qwen/Qwen3-Coder-30B-A3B-Instruct/chat_template.jinja https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/extras/chat_template_examples/chat_template_qwen3coder_instruct.jinja -``` -::: -:::{tab-item} gpt-oss-20b -:sync: gpt-oss-20b -```console -python export_model.py text_generation --source_model openai/gpt-oss-20b --weight-format int4 --config_file_path models/config.json --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss -curl -L -o models/openai/gpt-oss-20b/chat_template.jinja https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/extras/chat_template_examples/chat_template_gpt_oss.jinja -``` -> **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. - -::: -:::{tab-item} Phi-4-mini-instruct -:sync: microsoft/Phi-4-mini-instruct -Note: This model requires a fix in optimum-intel which is currently on a fork. 
-```console -pip3 install transformers==4.53.3 --force-reinstall -pip3 install "optimum-intel[openvino]"@git+https://github.com/helena-intel/optimum-intel/@ea/lonrope_exp -python export_model.py text_generation --source_model microsoft/Phi-4-mini-instruct --weight-format int4 --config_file_path models/config.json --model_repository_path models --tool_parser phi4 --max_num_batched_tokens 99999 -curl -L -o models/microsoft/Phi-4-mini-instruct/chat_template.jinja https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/extras/chat_template_examples/chat_template_phi4_mini.jinja -``` -::: -:::: - -> **Note:** To use these models on NPU, set `--weight-format` to either **int4** or **nf4**. When specifying `--extra_quantization_params`, ensure that `ratio` is set to **1.0** and `group_size` is set to **-1** or **128**. For more details, see [OpenVINO GenAI on NPU](https://docs.openvino.ai/nightly/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html). 
- -### Direct pulling of pre-configured HuggingFace models from docker containers - -This procedure can be used to pull preconfigured models from OpenVINO organization in HuggingFace Hub -::::{tab-set} -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bash -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --pull --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation --tool_parser hermes3 -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov -:sync: Mistral-7B-Instruct-v0.3-int4-ov -```bash -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --pull --model_repository_path /models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --task text_generation --tool_parser mistral -curl -L -o models/OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.10.1.1/examples/tool_chat_template_mistral_parallel.jinja -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bash -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --pull --model_repository_path /models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --task text_generation --tool_parser phi4 -curl -L -o models/OpenVINO/Phi-4-mini-instruct-int4-ov/chat_template.jinja https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/extras/chat_template_examples/chat_template_phi4_mini.jinja -``` -::: -:::: - - -### Direct pulling of pre-configured HuggingFace models on Windows - -Assuming you have unpacked model server package with python enabled version, make sure to run `setupvars` script -as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. 
-::::{tab-set}
-:::{tab-item} Qwen3-8B-int4-ov
-:sync: Qwen3-8B-int4-ov
-```bat
-ovms.exe --pull --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation --tool_parser hermes3 --enable_prefix_caching true
-```
-:::
-:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov
-:sync: Mistral-7B-Instruct-v0.3-int4-ov
-```bat
-ovms.exe --pull --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --task text_generation --tool_parser mistral --enable_prefix_caching true
-curl -L -o models\OpenVINO\Mistral-7B-Instruct-v0.3-int4-ov\chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.10.1.1/examples/tool_chat_template_mistral_parallel.jinja
-```
-:::
-:::{tab-item} Phi-4-mini-instruct-int4-ov
-:sync: Phi-4-mini-instruct-int4-ov
-```bat
-ovms.exe --pull --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --task text_generation --tool_parser phi4 --enable_prefix_caching true
-curl -L -o models\OpenVINO\Phi-4-mini-instruct-int4-ov\chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_phi4_mini.jinja
-```
-:::
-::::
-
-You can use similar commands for different models and precision. Change the source_model and other configuration parameters.
-> **Note:** Some models give more reliable responses with tuned chat template.
-> **Note:** Currently tool parsers are supported for formats compatible with Phi4, Llama3, Mistral, Devstral, Hermes3 or GPT-OSS.
-
-
-
## Start OVMS
This deployment procedure assumes the model was pulled or exported using the procedure above. The exceptions are models from the OpenVINO organization that support tools correctly with the default template, like "OpenVINO/Qwen3-8B-int4-ov" - these can be deployed with a single command that pulls the model and starts the server.
@@ -184,68 +25,44 @@ as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md), :::{tab-item} Qwen3-8B :sync: Qwen3-8B ```bat -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-8B --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bat -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-4B --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bat -ovms.exe --rest_port 8000 --source_model meta-llama/Llama-3.1-8B-Instruct --model_repository_path models --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bat -ovms.exe --rest_port 8000 --source_model meta-llama/Llama-3.2-3B-Instruct --model_repository_path models --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --model_repository_path models --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} 
Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bat -ovms.exe --rest_port 8000 --source_model mistralai/Mistral-7B-Instruct-v0.3 --model_repository_path models --tool_parser mistral --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --model_repository_path models --tool_parser mistral --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bat -ovms.exe --rest_port 8000 --source_model microsoft/Phi-4-mini-instruct --model_repository_path models --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --max_num_batched_tokens 99999 --enable_prefix_caching true -``` -::: -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov -:sync: Mistral-7B-Instruct-v0.3-int4-ov -```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --model_repository_path models --tool_parser mistral --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --model_repository_path models --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --model_repository_path models --tool_parser phi4 --target_device GPU 
--task text_generation --enable_tool_guided_generation true --cache_dir .cache --max_num_batched_tokens 99999 --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bat set MOE_USE_MICRO_GEMM_PREFILL=0 -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-Coder-30B-A3B-Instruct --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true ``` ::: :::{tab-item} gpt-oss-20b :sync: gpt-oss-20b ```bat -ovms.exe --rest_port 8000 --source_model openai/gpt-oss-20b --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --enable_prefix_caching true --target_device GPU +ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --enable_prefix_caching true --target_device GPU ``` > **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. 
@@ -258,46 +75,27 @@ ovms.exe --rest_port 8000 --source_model openai/gpt-oss-20b --model_repository_p :::{tab-item} Qwen3-8B :sync: Qwen3-8B ```bat -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-8B --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 +ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bat -ovms.exe --rest_port 8000 --source_model Qwen/Qwen3-4B --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bat -ovms.exe --rest_port 8000 --source_model meta-llama/Llama-3.1-8B-Instruct --model_repository_path models --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 +ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bat -ovms.exe --rest_port 8000 --source_model meta-llama/Llama-3.2-3B-Instruct --model_repository_path models --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 +ovms.exe --rest_port 8000 --source_model llmware/llama-3.2-3b-instruct-npu-ov --model_repository_path models --tool_parser llama3 --target_device NPU 
--task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
```
:::
:::{tab-item} Mistral-7B-Instruct-v0.3
:sync: Mistral-7B-Instruct-v0.3
```bat
-ovms.exe --rest_port 8000 --source_model mistralai/Mistral-7B-Instruct-v0.3 --model_repository_path models --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
-```
-:::
-:::{tab-item} Qwen3-4B-int4-ov
-:sync: Qwen3-4B-int4-ov
-```bat
-ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
-```
-:::
-:::{tab-item} Mistral-7B-Instruct-v0.3-cw-int4-ov
-:sync: Mistral-7B-Instruct-v0.3-cw-int4-ov
-```bat
ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --model_repository_path models --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
```
:::
::::
@@ -308,77 +106,42 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4- :sync: Qwen3-8B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-8B --tool_parser hermes3 --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-4B --tool_parser hermes3 --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.1-8B-Instruct --tool_parser llama3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.2-3B-Instruct --tool_parser llama3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true -``` -::: -:::{tab-item} Hermes-3-Llama-3.1-8B -:sync: Hermes-3-Llama-3.1-8B -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v 
$(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model NousResearch/Hermes-3-Llama-3.1-8B --tool_parser hermes3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true +--rest_port 8000 --model_repository_path models --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --tool_parser llama3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model mistralai/Mistral-7B-Instruct-v0.3 --tool_parser mistral --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model microsoft/Phi-4-mini-instruct --tool_parser phi4 --task text_generation --enable_prefix_caching true --max_num_batched_tokens 99999 --enable_tool_guided_generation true -``` -::: -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov -:sync: Mistral-7B-Instruct-v0.3-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models 
openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --task text_generation --enable_prefix_caching true --max_num_batched_tokens 99999 --enable_tool_guided_generation true ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --source_model Qwen/Qwen3-Coder-30B-A3B-Instruct --model_repository_path models --tool_parser qwen3coder --task text_generation --enable_prefix_caching true +--rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --task text_generation --enable_prefix_caching true ``` ::: :::: @@ -395,84 +158,49 @@ It can be applied using the commands below: :sync: Qwen3-8B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-8B --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov 
--tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-4B --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.1-8B-Instruct --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.2-3B-Instruct --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true -``` -::: -:::{tab-item} Hermes-3-Llama-3.1-8B -:sync: Hermes-3-Llama-3.1-8B -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 
--model_repository_path models --source_model NousResearch/Hermes-3-Llama-3.1-8B --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model mistralai/Mistral-7B-Instruct-v0.3 --tool_parser mistral --target_device GPU --task text_generation --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation --enable_prefix_caching true ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model microsoft/Phi-4-mini-instruct --tool_parser phi4 --target_device GPU --task text_generation --max_num_batched_tokens 99999 --enable_prefix_caching true -``` -::: -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --target_device GPU --task 
text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-ov -:sync: Mistral-7B-Instruct-v0.3-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --target_device GPU --task text_generation --max_num_batched_tokens 99999 --enable_prefix_caching true ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) -e MOE_USE_MICRO_GEMM_PREFILL=0 --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --source_model Qwen/Qwen3-Coder-30B-A3B-Instruct --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true +--rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task 
text_generation --enable_tool_guided_generation true --enable_prefix_caching true ``` ::: :::{tab-item} gpt-oss-20b :sync: gpt-oss-20b ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --source_model openai/gpt-oss-20b --model_repository_path models \ +--rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models \ --tool_parser gptoss --reasoning_parser gptoss --target_device GPU --task text_generation --enable_prefix_caching true ``` > **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. @@ -491,52 +219,30 @@ It can be applied using the commands below: :sync: Qwen3-8B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-8B --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model Qwen/Qwen3-4B --tool_parser hermes3 --target_device NPU --task 
text_generation --enable_prefix_caching true --max_prompt_len 4000 -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.1-8B-Instruct --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --max_prompt_len 4000 +--rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model meta-llama/Llama-3.2-3B-Instruct --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --max_prompt_len 4000 +--rest_port 8000 --model_repository_path models --source_model llmware/llama-3.2-3b-instruct-npu-ov --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --max_prompt_len 4000 ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bash docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model mistralai/Mistral-7B-Instruct-v0.3 --tool_parser mistral --target_device NPU 
--task text_generation --enable_prefix_caching true --max_prompt_len 4000 -``` -::: -:::{tab-item} Qwen3-8B-int4-cw-ov -:sync: Qwen3-8B-int4-cw-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3-int4-cw-ov -:sync: Mistral-7B-Instruct-v0.3-int4-cw-ov -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` ::: -:::: ### Deploy all models in a single container Those steps deploy all the models exported earlier. The python script added the models to `models/config.json` so just the remaining models pulled directly from HuggingFace Hub are to be added: @@ -583,73 +289,49 @@ Run the agentic application: :::{tab-item} Qwen3-8B :sync: Qwen3-8B ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model Qwen/Qwen3-8B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking ``` ```bash -python openai_agent.py --query "List the files in folder /root" --model Qwen/Qwen3-8B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all +python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model Qwen/Qwen3-4B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream ``` ```bash -python openai_agent.py --query "List the files in folder /root" --model Qwen/Qwen3-4B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all -``` -::: -:::{tab-item} Llama-3.1-8B-Instruct -:sync: Llama-3.1-8B-Instruct -```bash -python openai_agent.py --query "List the files in folder /root" --model meta-llama/Llama-3.1-8B-Instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all +python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 ```bash -python openai_agent.py --query "List the files in folder /root" --model mistralai/Mistral-7B-Instruct-v0.3 --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all 
--tool_choice required +python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all --tool_choice required ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct ```bash -python openai_agent.py --query "List the files in folder /root" --model meta-llama/Llama-3.2-3B-Instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all +python openai_agent.py --query "List the files in folder /root" --model srang992/Llama-3.2-3B-Instruct-ov-INT4 --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model microsoft/Phi-4-mini-instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather -``` -::: -:::{tab-item} Qwen3-8B-int4-ov -:sync: Qwen3-8B-int4-ov -```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather -``` -::: -:::{tab-item} OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov -:sync: OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov -```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --tool-choice required -``` -::: -:::{tab-item} Phi-4-mini-instruct-int4-ov -:sync: Phi-4-mini-instruct-int4-ov -```bash python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash -python openai_agent.py --query "What is the current weather in Tokyo?" --model Qwen3/Qwen3-Coder-30B-A3B-Instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: :::{tab-item} gpt-oss-20b :sync: gpt-oss-20b ```console -python openai_agent.py --query "What is the current weather in Tokyo?" --model openai/gpt-oss-20b --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/gpt-oss-20b-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: :::: @@ -663,7 +345,6 @@ You can try also similar implementation based on llama_index library working the pip install llama-index-llms-openai-like==0.5.3 llama-index-core==0.14.5 llama-index-tools-mcp==0.4.2 curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/llama_index_agent.py -o llama_index_agent.py python llama_index_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking - ``` @@ -685,12 +366,12 @@ mv 1184.txt.utf-8 pg1184.txt docker run -d --name ovms --user $(id -u):$(id -g) --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --task text_generation --target_device GPU -python benchmark_serving_multi_turn.py -m Qwen/Qwen3-8B --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 1 -n 50 +python benchmark_serving_multi_turn.py -m OpenVINO/Qwen3-8B-int4-ov --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 1 -n 50 # Testing high concurrency, for example on Xeon CPU with constrained resources (in case of memory constrains, reduce cache_size) docker run -d --name ovms --cpuset-cpus 0-15 --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --cache_size 20 --task text_generation -python benchmark_serving_multi_turn.py -m Qwen/Qwen3-8B --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 24 +python benchmark_serving_multi_turn.py -m OpenVINO/Qwen3-8B-int4-ov --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 24 ``` Below is an example of the output captured on iGPU: ``` @@ -741,3 +422,24 @@ Here is example of the response from the OpenVINO/Qwen3-8B-int4-ov model: ``` Models can be also compared using the [leaderboard 
reports](https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard). + +### Export using Python script + +Use these steps to convert the model from HuggingFace Hub to OpenVINO format and export it to local storage. + +```console +# Download export script, install its dependencies and create directory for the models +curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py +pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt +mkdir models +``` +Run the `export_model.py` script to download and quantize the model: + +> **Note:** Users in China need to set the environment variable HF_ENDPOINT="https://hf-mirror.com" or "https://www.modelscope.cn/models" before running the export script to connect to the HF Hub. + +```console +python export_model.py text_generation --source_model meta-llama/Llama-3.2-3B-Instruct --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser llama3 +curl -L -o models/meta-llama/Llama-3.2-3B-Instruct/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.2_json.jinja +``` + +> **Note:** To use these models on NPU, set `--weight-format` to either **int4** or **nf4**. When specifying `--extra_quantization_params`, ensure that `ratio` is set to **1.0** and `group_size` is set to **-1** or **128**. For more details, see [OpenVINO GenAI on NPU](https://docs.openvino.ai/nightly/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html).
\ No newline at end of file From 35177a3297b4ab23b1e4022b2c95a5a60a8beee9 Mon Sep 17 00:00:00 2001 From: Pawel Date: Wed, 4 Mar 2026 13:08:11 +0100 Subject: [PATCH 3/6] save --- demos/continuous_batching/agentic_ai/README.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index dda38439f4..fab1429200 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -288,43 +288,43 @@ Run the agentic application: ::::{tab-set} :::{tab-item} Qwen3-8B :sync: Qwen3-8B -```bash +```text python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking ``` -```bash +```text python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B -```bash +```text python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream ``` -```bash +```text python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Mistral-7B-Instruct-v0.3 :sync: Mistral-7B-Instruct-v0.3 -```bash +```text python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all --tool_choice required ``` ::: :::{tab-item} Llama-3.2-3B-Instruct :sync: Llama-3.2-3B-Instruct -```bash +```text python openai_agent.py --query "List the files in folder /root" --model srang992/Llama-3.2-3B-Instruct-ov-INT4 --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all ``` ::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct -```bash +```text python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct -```bash +```text python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` ::: From 7a87239a5948500c84e5709ea5cca68e7fdd30ee Mon Sep 17 00:00:00 2001 From: Pawel Date: Wed, 4 Mar 2026 13:32:31 +0100 Subject: [PATCH 4/6] save --- demos/continuous_batching/agentic_ai/README.md | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index fab1429200..10c32a1bb3 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -244,16 +244,6 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/mode ``` ::: -### Deploy all models in a single container -Those steps deploy all the models exported earlier. The python script added the models to `models/config.json` so just the remaining models pulled directly from HuggingFace Hub are to be added: -```bash -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --add_to_config --model_name OpenVINO/Qwen3-8B-int4-ov --model_path OpenVINO/Qwen3-8B-int4-ov --config_path /models/config.json -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --add_to_config --model_name OpenVINO/Phi-4-mini-instruct-int4-ov --model_path OpenVINO/Phi-4-mini-instruct-int4-ov --config_path /models/config.json -docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:weekly --add_to_config --model_name OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --model_path OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov--config_path /models/config.json -docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models:ro openvino/model_server:weekly --rest_port 8000 --config_path /models/config.json -``` - - ## Start MCP server with SSE interface ### Linux From d19be9af43aea543dbe75b38cf1bf1d53d5762bb Mon 
Sep 17 00:00:00 2001 From: Pawel Date: Fri, 6 Mar 2026 07:49:59 +0100 Subject: [PATCH 5/6] save --- .../continuous_batching/agentic_ai/README.md | 258 +++++------------- 1 file changed, 61 insertions(+), 197 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 10c32a1bb3..62637275d1 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -12,6 +12,34 @@ The tools can also be used for automation purposes based on input in text format > **Note:** On Windows, make sure to use the weekly or 2025.4 release packages for proper functionality. +## Start MCP server with SSE interface + +### Linux +```bash +git clone https://github.com/isdaniel/mcp_weather_server +cd mcp_weather_server && git checkout v0.5.0 +docker build -t mcp-weather-server:sse . +docker run -d -p 8080:8080 -e PORT=8080 mcp-weather-server:sse uv run python -m mcp_weather_server --mode sse +``` + +> **Note:** On Windows, the MCP server is demonstrated as an instance with a stdio interface inside the agent application. + +## Start the agent + +Install the application requirements: + +```console +curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -o openai_agent.py +pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/requirements.txt +``` +Make sure Node.js and npx are installed; they are needed for the `file system` MCP server. On Ubuntu, run `sudo apt install nodejs npm`. On Windows, visit https://nodejs.org/en/download. + +On Windows, it may be required to set an environment variable to enforce UTF-8 encoding in Python: + +```bat +set PYTHONUTF8=1 +``` + ## Start OVMS This deployment procedure assumes the model was pulled or exported using the procedure above.
The exception is models from the OpenVINO organization that support tools correctly with the default template, like "OpenVINO/Qwen3-8B-int4-ov" - they can be deployed in a single command pulling and starting the server. @@ -84,18 +112,6 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_re ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 ``` ::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```bat -ovms.exe --rest_port 8000 --source_model llmware/llama-3.2-3b-instruct-npu-ov --model_repository_path models --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --model_repository_path models --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000 -``` -::: > **Note:** Setting the `--max_prompt_len` parameter too high may lead to performance degradation. It is recommended to use the smallest value that meets your requirements. @@ -108,6 +124,14 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4- docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is -1.5°C. Wind is blowing from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover, and visibility is 24.1 km. +``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B @@ -115,26 +139,13 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true ``` -::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --tool_parser llama3 --task text_generation --enable_prefix_caching true --enable_tool_guided_generation true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 + ```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --task text_generation --enable_prefix_caching true +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather ``` -::: -:::{tab-item} Phi-4-mini-instruct -:sync: Phi-4-mini-instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --task text_generation --enable_prefix_caching true --max_num_batched_tokens 99999 --enable_tool_guided_generation true + +```text +The current weather in Tokyo is clear with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. Winds are coming from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover. Visibility is 24.1 km. ``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct @@ -143,6 +154,14 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/mode docker run -d --user $(id -u):$(id -g) --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --task text_generation --enable_prefix_caching true ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is clear sky with a temperature of 5.5°C (feels like 2.8°C). The relative humidity is at 64%, and the dew point is -0.8°C. Wind is blowing from the NNE at 3.2 km/h with gusts up to 10.8 km/h. The atmospheric pressure is 1023.4 hPa with 0% cloud cover. Visibility is 24.1 km. 
+``` ::: :::: @@ -168,27 +187,6 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` ::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --tool_parser mistral --target_device GPU --task text_generation --enable_prefix_caching true -``` -::: -:::{tab-item} Phi-4-mini-instruct -:sync: Phi-4-mini-instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser phi4 --target_device GPU --task text_generation --max_num_batched_tokens 99999 --enable_prefix_caching true -``` -::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash @@ -221,6 +219,14 @@ It can be applied using the commands below: docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 
-v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-cw-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. The wind is blowing from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover, and the visibility is 24.1 km. +``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B @@ -228,100 +234,13 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model FluidInference/qwen3-4b-int4-ov-npu --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 ``` -::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model llmware/llama-3.2-3b-instruct-npu-ov --tool_parser llama3 --target_device NPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true --max_prompt_len 4000 -``` -::: 
-:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -```bash -docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path models --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-cw-ov --tool_parser mistral --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 4000 -``` -::: - -## Start MCP server with SSE interface -### Linux ```bash -git clone https://github.com/isdaniel/mcp_weather_server -cd mcp_weather_server && git checkout v0.5.0 -docker build -t mcp-weather-server:sse . -docker run -d -p 8080:8080 -e PORT=8080 mcp-weather-server:sse uv run python -m mcp_weather_server --mode sse -``` - -> **Note:** On Windows the MCP server will be demonstrated as an instance with stdio interface inside the agent application - -## Start the agent - -Install the application requirements - -```console -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -o openai_agent.py -pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/requirements.txt -``` -Make sure nodejs and npx are installed. On ubuntu it would require `sudo apt install nodejs npm`. On windows, visit https://nodejs.org/en/download. It is needed for the `file system` MCP server. - -For windows applications it may be required to set environmental variable to enforce utf-8 encodeing in python: - -```bat -set PYTHONUTF8=1 +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model FluidInference/qwen3-4b-int4-ov-npu --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream ``` -Run the agentic application: - - -::::{tab-set} -:::{tab-item} Qwen3-8B -:sync: Qwen3-8B ```text -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking -``` -```text -python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all -``` -::: -:::{tab-item} Qwen3-4B -:sync: Qwen3-4B -```text -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream -``` -```text -python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all -``` -::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -```text -python openai_agent.py --query "List the files in folder /root" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all --tool_choice required -``` -::: -:::{tab-item} Llama-3.2-3B-Instruct -:sync: Llama-3.2-3B-Instruct -```text -python openai_agent.py --query "List the files in folder /root" --model srang992/Llama-3.2-3B-Instruct-ov-INT4 --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all -``` -::: -:::{tab-item} Phi-4-mini-instruct -:sync: Phi-4-mini-instruct -```text -python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather -``` -::: -:::{tab-item} Qwen3-Coder-30B-A3B-Instruct -:sync: Qwen3-Coder-30B-A3B-Instruct -```text -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather -``` -::: -:::{tab-item} gpt-oss-20b -:sync: gpt-oss-20b -```console -python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/gpt-oss-20b-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +The current weather in Tokyo is clear sky with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. There is a wind blowing from the NNW at 6.8 km/h with gusts up to 21.2 km/h. The atmospheric pressure is 1021.5 hPa with 0% cloud cover. The visibility is 24.1 km. ``` ::: :::: @@ -337,61 +256,6 @@ curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/c python llama_index_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking ``` - -## Testing efficiency in agentic use case - -Using LLM models with AI agents has a unique load characteristics with multi-turn communication and resending bit parts of the prompt as the previous conversation. -To simulate such type of load, we should use a dedicated tool [multi_turn benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks/multi_turn). 
-```bash -git clone -b v0.10.2 https://github.com/vllm-project/vllm -cd vllm/benchmarks/multi_turn -pip install -r requirements.txt -sed -i -e 's/if not os.path.exists(args.model)/if 1 == 0/g' benchmark_serving_multi_turn.py - -#Download the following text file (used for generation of synthetic conversations) -wget https://www.gutenberg.org/ebooks/1184.txt.utf-8 -mv 1184.txt.utf-8 pg1184.txt - -# Testing single client scenario, for example with GPU execution -docker run -d --name ovms --user $(id -u):$(id -g) --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ ---rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --task text_generation --target_device GPU - -python benchmark_serving_multi_turn.py -m OpenVINO/Qwen3-8B-int4-ov --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 1 -n 50 - -# Testing high concurrency, for example on Xeon CPU with constrained resources (in case of memory constrains, reduce cache_size) -docker run -d --name ovms --cpuset-cpus 0-15 --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --cache_size 20 --task text_generation - -python benchmark_serving_multi_turn.py -m OpenVINO/Qwen3-8B-int4-ov --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 24 -``` -Below is an example of the output captured on iGPU: -``` -Parameters: -model=OpenVINO/Qwen3-8B-int4-ov -num_clients=1 -num_conversations=100 -active_conversations=None -seed=0 -Conversations Generation Parameters: -text_files=pg1184.txt -input_num_turns=UniformDistribution[12, 18] -input_common_prefix_num_tokens=Constant[500] 
-input_prefix_num_tokens=LognormalDistribution[6, 4] -input_num_tokens=UniformDistribution[120, 160] -output_num_tokens=UniformDistribution[80, 120] ----------------------------------------------------------------------------------------------------- -Statistics summary: -runtime_sec = 307.569 -requests_per_sec = 0.163 ----------------------------------------------------------------------------------------------------- - count mean std min 25% 50% 75% 90% max -ttft_ms 50.0 1052.97 987.30 200.61 595.29 852.08 1038.50 1193.38 4265.27 -tpot_ms 50.0 51.37 2.37 47.03 49.67 51.45 53.16 54.42 55.23 -latency_ms 50.0 6128.26 1093.40 4603.86 5330.43 5995.30 6485.20 7333.73 9505.51 -input_num_turns 50.0 7.64 4.72 1.00 3.00 7.00 11.00 15.00 17.00 -input_num_tokens 50.0 2298.92 973.02 520.00 1556.50 2367.00 3100.75 3477.70 3867.00 -``` - - ## Testing accuracy Testing model accuracy is critical for a successful adoption in AI application. The recommended methodology is to use BFCL tool like describe in the [testing guide](../accuracy/README.md#running-the-tests-for-agentic-models-with-function-calls). @@ -417,7 +281,7 @@ Models can be also compared using the [leaderboard reports](https://gorilla.cs.b Use those steps to convert the model from HugginFace Hub to OpenVINO format and export it to a local storage. -```console +```text # Download export script, install its dependencies and create directory for the models curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt @@ -427,7 +291,7 @@ Run `export_model.py` script to download and quantize the model: > **Note:** The users in China need to set environment variable HF_ENDPOINT="https://hf-mirror.com" or "https://www.modelscope.cn/models" before running the export script to connect to the HF Hub. 
-```console +```text python export_model.py text_generation --source_model meta-llama/Llama-3.2-3B-Instruct --weight-format int8 --config_file_path models/config.json --model_repository_path models --tool_parser llama3 curl -L -o models/meta-llama/Llama-3.2-3B-Instruct/chat_template.jinja https://raw.githubusercontent.com/vllm-project/vllm/refs/tags/v0.9.0/examples/tool_chat_template_llama3.2_json.jinja ``` From 991b4865289195446d292180adef08f060701ed6 Mon Sep 17 00:00:00 2001 From: Pawel Date: Fri, 6 Mar 2026 12:12:45 +0100 Subject: [PATCH 6/6] save --- .../continuous_batching/agentic_ai/README.md | 101 ++++++++++++++++-- 1 file changed, 94 insertions(+), 7 deletions(-) diff --git a/demos/continuous_batching/agentic_ai/README.md b/demos/continuous_batching/agentic_ai/README.md index 62637275d1..21156fd078 100644 --- a/demos/continuous_batching/agentic_ai/README.md +++ b/demos/continuous_batching/agentic_ai/README.md @@ -68,12 +68,6 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-4B-int4-ov --model_repos ovms.exe --rest_port 8000 --source_model srang992/Llama-3.2-3B-Instruct-ov-INT4 --model_repository_path models --tool_parser llama3 --target_device GPU --task text_generation --enable_tool_guided_generation true --cache_dir .cache --enable_prefix_caching true ``` ::: -:::{tab-item} Mistral-7B-Instruct-v0.3 -:sync: Mistral-7B-Instruct-v0.3 -```bat -ovms.exe --rest_port 8000 --source_model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --model_repository_path models --tool_parser mistral --target_device GPU --task text_generation --cache_dir .cache --enable_prefix_caching true -``` -::: :::{tab-item} Phi-4-mini-instruct :sync: Phi-4-mini-instruct ```bat @@ -105,13 +99,30 @@ ovms.exe --rest_port 8000 --source_model OpenVINO/gpt-oss-20b-int4-ov --model_re ```bat ovms.exe --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4-cw-ov --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true 
--cache_dir .cache --max_prompt_len 4000
 ```
+
+```bat
+python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-cw-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
+```
+
+```text
+The current weather in Tokyo is mainly clear with a temperature of 11.5°C. The relative humidity is at 82%, and the dew point is 8.5°C. The wind is blowing from the S at 6.8 km/h, with gusts up to 13.7 km/h. The atmospheric pressure is 1017.1 hPa, and there is 21% cloud cover. Visibility is 24.1 km.
+```
 :::
 :::{tab-item} Qwen3-4B
 :sync: Qwen3-4B
 ```bat
 ovms.exe --rest_port 8000 --source_model FluidInference/qwen3-4b-int4-ov-npu --model_repository_path models --tool_parser hermes3 --target_device NPU --task text_generation --enable_prefix_caching true --cache_dir .cache --max_prompt_len 4000
 ```
+
+```bat
+python openai_agent.py --query "What is the current weather in Tokyo?" --model FluidInference/qwen3-4b-int4-ov-npu --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather
+```
+
+```text
+The current weather in Tokyo is mainly clear, with a temperature of 11.5°C. The relative humidity is at 82%, and the dew point is at 8.5°C. There is a wind blowing from the south at 6.8 km/h, with gusts up to 13.7 km/h. The atmospheric pressure is 1017.1 hPa, and there is 21% cloud cover. The visibility is 24.1 km.
+```
 :::
+::::
 
 > **Note:** Setting the `--max_prompt_len` parameter too high may lead to performance degradation. It is recommended to use the smallest value that meets your requirements.
 
@@ -148,6 +159,21 @@ python openai_agent.py --query "What is the current weather in Tokyo?" --model O
 The current weather in Tokyo is clear with a temperature of 8.3°C (feels like 5.0°C). The relative humidity is at 50%, and the dew point is at -1.5°C. Winds are coming from the NNW at 6.8 km/h with gusts up to 21.2 km/h.
The atmospheric pressure is 1021.5 hPa with 0% cloud cover. Visibility is 24.1 km. ``` ::: +:::{tab-item} Phi-4-mini-instruct +:sync: Phi-4-mini-instruct +```bash +docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:weekly \ +--rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-4-mini-instruct-int4-ov --tool_parser hermes3 --task text_generation --enable_prefix_caching true +``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --tool-choice required +``` + +```text +The current weather in Tokyo is mostly clear with a temperature of 12.4°C. The relative humidity is at 68%, and the dew point is at 6.7°C. Winds are coming from the SSE at a speed of 5.3 km/h, with gusts reaching up to 25.2 km/h. The atmospheric pressure is 1017.9 hPa, and there is a 23% cloud cover. Visibility is good at 24.1 km. +``` +::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct ```bash @@ -160,7 +186,18 @@ python openai_agent.py --query "What is the current weather in Tokyo?" --model O ``` ```text -The current weather in Tokyo is clear sky with a temperature of 5.5°C (feels like 2.8°C). The relative humidity is at 64%, and the dew point is -0.8°C. Wind is blowing from the NNE at 3.2 km/h with gusts up to 10.8 km/h. The atmospheric pressure is 1023.4 hPa with 0% cloud cover. Visibility is 24.1 km. +The current weather in Tokyo is as follows: +- **Condition**: Mainly clear +- **Temperature**: 11.8°C +- **Relative Humidity**: 78% +- **Dew Point**: 8.1°C +- **Wind**: Blowing from the SSE at 6.4 km/h with gusts up to 9.7 km/h +- **Atmospheric Pressure**: 1017.5 hPa +- **Cloud Cover**: 22% +- **Visibility**: 24.1 km +- **UV Index**: Not specified + +It's a relatively pleasant day with clear skies and mild temperatures. 
``` ::: :::: @@ -179,6 +216,14 @@ It can be applied using the commands below: docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is mainly clear with a temperature of 11.7°C. The relative humidity is at 74%, and the dew point is 7.2°C. The wind is blowing from the southeast at 4.2 km/h, with gusts up to 22.7 km/h. The atmospheric pressure is 1018.0 hPa, and there is 44% cloud cover. Visibility is 24.1 km. +``` ::: :::{tab-item} Qwen3-4B :sync: Qwen3-4B @@ -186,6 +231,14 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --model_repository_path models --source_model OpenVINO/Qwen3-4B-int4-ov --tool_parser hermes3 --target_device GPU --task text_generation --enable_prefix_caching true ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-4B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is mainly clear. The temperature is 11.7°C, with a relative humidity of 74% and a dew point of 7.2°C. The wind is coming from the SSE at 4.2 km/h, with gusts up to 22.7 km/h. 
The atmospheric pressure is 1018.0 hPa, with 44% cloud cover. Visibility is 24.1 km. +``` ::: :::{tab-item} Qwen3-Coder-30B-A3B-Instruct :sync: Qwen3-Coder-30B-A3B-Instruct @@ -193,6 +246,22 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model docker run -d --user $(id -u):$(id -g) -e MOE_USE_MICRO_GEMM_PREFILL=0 --rm -p 8000:8000 -v $(pwd)/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly \ --rest_port 8000 --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --tool_parser qwen3coder --target_device GPU --task text_generation --enable_tool_guided_generation true --enable_prefix_caching true ``` + +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +The current weather in Tokyo is as follows: +- **Condition**: Mainly clear +- **Temperature**: 11.7°C +- **Relative Humidity**: 74% +- **Dew Point**: 7.2°C +- **Wind**: SSE at 4.2 km/h, with gusts up to 22.7 km/h +- **Atmospheric Pressure**: 1018.0 hPa +- **Cloud Cover**: 44% +- **Visibility**: 24.1 km +``` ::: :::{tab-item} gpt-oss-20b :sync: gpt-oss-20b @@ -203,6 +272,24 @@ docker run -d --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/model ``` > **Note:** Continuous batching and paged attention are supported for GPT‑OSS. However, when deployed on GPU, the model may experience reduced accuracy under high‑concurrency workloads. This issue will be resolved in version 2026.1 and in the upcoming weekly release. CPU execution is not affected. +```bash +python openai_agent.py --query "What is the current weather in Tokyo?" 
--model OpenVINO/gpt-oss-20b-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather +``` + +```text +**Tokyo – Current Weather** + +- **Condition:** Mainly clear +- **Temperature:** 11.7 °C +- **Humidity:** 74 % +- **Dew Point:** 7.2 °C +- **Wind:** 4.2 km/h from the SSE, gusts up to 22.7 km/h +- **Pressure:** 1018.0 hPa +- **Cloud Cover:** 44 % +- **Visibility:** 24.1 km + +Enjoy your day! +``` ::: ::::
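Independently of the agent script, the serving endpoint shown in the examples above can be sanity-checked with a plain HTTP request before wiring up any MCP servers. The sketch below uses only the Python standard library; the base URL and model name mirror the GPU example above, and the `get_weather` tool schema is a hypothetical placeholder, not the schema the weather MCP server actually advertises.

```python
# Minimal sanity check of the OVMS OpenAI-compatible endpoint.
# Assumptions: the server from the steps above listens on http://localhost:8000/v3
# and serves OpenVINO/Qwen3-8B-int4-ov; the get_weather tool is hypothetical.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v3"


def build_request_body(model: str, query: str) -> dict:
    """Build a chat/completions payload carrying one example tool definition."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": query}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool, for illustration
                    "description": "Get the current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }


def chat(model: str, query: str) -> dict:
    """POST the payload to the OpenAI-compatible endpoint and return the parsed JSON."""
    data = json.dumps(build_request_body(model, query)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Usage (requires the server to be running); with a working --tool_parser the
# first choice should contain tool_calls instead of a plain text answer:
#   reply = chat("OpenVINO/Qwen3-8B-int4-ov", "What is the weather in Tokyo?")
#   print(reply["choices"][0]["message"].get("tool_calls"))
```

If the response contains `tool_calls`, the configured `--tool_parser` is extracting function calls correctly and the model is ready to drive the agent.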