Description
Describe the bug
When loading various models with the GPU as the target device, OVMS exhibits runaway memory consumption. For example, loading OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov with the CPU target works fine: CPU RAM usage is as expected and the model functions correctly. With the GPU target, GPU RAM is not utilized; instead, CPU RAM fills until it far exceeds the model size and the system runs out of RAM, crashing OVMS.
To Reproduce
Steps to reproduce the behavior:
- Use https://huggingface.co/OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov
- .\ovms.exe --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --rest_port 8000 --task text_generation --target_device GPU --metrics_enable --log_level DEBUG
- Error: Exception from src\inference\src\dev\plugin.cpp:53:
Check 'false' failed at src\plugins\intel_gpu\src\plugin\program_builder.cpp:163:
[GPU] ProgramBuilder build failed!
[CL ext] Can not allocate 402653184 bytes for USM Device. ptr: 0000000000000000, error: 0
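For reference, the byte count in the failed USM allocation above can be converted to MiB (a quick sketch; the byte value is taken verbatim from the log, and the interpretation that this is a single buffer rather than the full weight set is an assumption):

```python
# Size of the single USM device allocation that failed, copied from the OVMS log
failed_alloc_bytes = 402653184

# Convert to MiB: the failed allocation is one 384 MiB buffer, far smaller
# than the whole int4 model, so the GPU plugin appears to fail on an
# individual buffer rather than on the total weight size.
failed_alloc_mib = failed_alloc_bytes / (1024 ** 2)
print(failed_alloc_mib)  # 384.0
```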
Expected behavior
I would expect the model to be loaded into GPU memory and to consume a comparable amount of memory to running on the CPU.
Logs
12900HK.txt
Configuration
- OVMS version: 2026
- OVMS config.json file: Default
- CPU, accelerator's versions if applicable: Attempting to run on the 12900HK iGPU
- Model repository directory structure: Default from HF
- Model or publicly available similar model that reproduces the issue: https://huggingface.co/OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov
Additional context
This is what it looks like in Task Manager.
Whereas loading the model on the CPU uses a normal amount of RAM and performs as expected.
This also occurs with:
- OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov
- OpenVINO/gpt-oss-20b-int4-ov