Conversation

@jeffbolznv (Collaborator)

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Reduce the number of pipeline variants and spec constants; pass the parameters as push constants instead.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified.

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       269.32 ± 13.22 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        260.52 ± 1.17 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        267.10 ± 5.18 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       340.67 ± 22.33 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        356.88 ± 9.24 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       333.40 ± 12.02 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       288.13 ± 13.10 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        284.81 ± 2.36 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        289.09 ± 3.86 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       343.03 ± 19.78 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        355.02 ± 4.88 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        353.27 ± 0.69 |

github-actions bot added labels on Dec 22, 2025: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
@ggerganov (Member) left a comment:

Ack on the ggml-backend changes
