Conversation

@jeffbolznv (Collaborator)

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Reduce the number of pipeline variants and spec constants; pass the parameters as push constants instead.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified.

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       269.32 ± 13.22 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        260.52 ± 1.17 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        267.10 ± 5.18 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       340.67 ± 22.33 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        356.88 ± 9.24 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       333.40 ± 12.02 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       288.13 ± 13.10 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        284.81 ± 2.36 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        289.09 ± 3.86 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       343.03 ± 19.78 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        355.02 ± 4.88 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        353.27 ± 0.69 |

github-actions bot added labels on Dec 22, 2025: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
@ggerganov (Member) left a comment:

Ack on the ggml-backend changes
