Description
Describe the bug
When loading various models with the GPU as the target device, OVMS exhibits runaway memory consumption. For example, loading OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov with the CPU target works fine: CPU RAM usage is as expected and the model functions correctly. With the GPU target, GPU RAM is not utilized; instead, CPU RAM fills until it far exceeds the model size and the system runs out of RAM, crashing OVMS.
To Reproduce
Steps to reproduce the behavior:
- Use https://huggingface.co/OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov
- .\ovms.exe --source_model OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov --model_repository_path models --rest_port 8000 --task text_generation --target_device GPU --metrics_enable --log_level DEBUG
- Error: Exception from src\inference\src\dev\plugin.cpp:53:
Check 'false' failed at src\plugins\intel_gpu\src\plugin\program_builder.cpp:163:
[GPU] ProgramBuilder build failed!
[CL ext] Can not allocate 402653184 bytes for USM Device. ptr: 0000000000000000, error: 0
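For reference, the byte count in the failed USM allocation above can be converted to MiB (a quick sketch; the byte value is taken verbatim from the log, and the interpretation that this is a single buffer rather than the full weight set is an assumption):

```python
# Size of the single USM device allocation that failed, copied from the OVMS log
failed_alloc_bytes = 402653184

# Convert to MiB: the failed allocation is one 384 MiB buffer, far smaller
# than the whole int4 model, so the GPU plugin appears to fail on an
# individual buffer rather than on the total weight size.
failed_alloc_mib = failed_alloc_bytes / (1024 ** 2)
print(failed_alloc_mib)  # 384.0
```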
Expected behavior
I would expect the model to be loaded into GPU memory and to consume a comparable amount of memory to running on the CPU.
Logs
12900HK.txt
Configuration
- OVMS version: 2026
- OVMS config.json file: Default
- CPU, accelerator's versions if applicable: Attempting to run on the 12900HK iGPU
- Model repository directory structure: Default from HF
- Model or publicly available similar model that reproduces the issue: https://huggingface.co/OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov
Additional context
This is what it looks like in Task Manager.
Whereas loading the model on the CPU uses a normal amount of RAM and performs as expected.
This also occurs with:
- OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov
- OpenVINO/gpt-oss-20b-int4-ov