Is your feature request related to a problem?
SM120 (RTX PRO 6000) and SM121 (DGX Spark GB10) have 99 KiB of shared memory per block, versus 228 KiB on SM100 (B200). The default CUTLASS MoE GEMM backend uses TMA warp-specialized grouped GEMM tile configs sized for 228 KiB, so most autotuner tactics fail and execution falls back to a slow kernel (4.6-4.8 tok/s on the Qwen3 NVFP4 models benchmarked below).
Setting TRTLLM_MOE_BACKEND=TRITON bypasses CUTLASS entirely and achieves 32-40 tok/s on the same models, roughly a 7-8x speedup, which also beats the llama.cpp GGUF baseline (24 tok/s).
This is not documented anywhere.
Proposed solution
Add documentation (e.g., in the DGX Spark/consumer Blackwell section of docs, or as a note in the MoE backend selection logic) recommending TRTLLM_MOE_BACKEND=TRITON for SM120/SM121 devices.
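To illustrate what such a recommendation could look like in code, here is a minimal sketch of a selection hint. The helper names `recommended_moe_backend` and `apply_moe_backend_hint` are hypothetical and not part of TensorRT-LLM; the sketch also assumes that SM120 and SM121 report compute capabilities (12, 0) and (12, 1). Only the `TRTLLM_MOE_BACKEND` variable and the `TRITON` value come from this issue.

```python
import os

# Assumption: SM120 and SM121 report compute capability (12, 0) and (12, 1).
SMALL_SMEM_BLACKWELL = {(12, 0), (12, 1)}


def recommended_moe_backend(capability):
    """Return the suggested MoE backend for a (major, minor) capability.

    Returns None when the default backend should be left alone.
    Hypothetical helper, not a TensorRT-LLM API.
    """
    return "TRITON" if tuple(capability) in SMALL_SMEM_BLACKWELL else None


def apply_moe_backend_hint(capability, env=os.environ):
    """Set TRTLLM_MOE_BACKEND in `env` unless the user already chose a value."""
    backend = recommended_moe_backend(capability)
    if backend is not None:
        # setdefault keeps an explicit user override intact.
        env.setdefault("TRTLLM_MOE_BACKEND", backend)
    return env.get("TRTLLM_MOE_BACKEND")
```

With PyTorch available, the capability tuple could come from `torch.cuda.get_device_capability()`. The hint would presumably need to run before TensorRT-LLM reads the variable, which is an assumption about load order rather than documented behavior.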
Benchmark data (DGX Spark GB10, SM121, 128GB UMA)
| Model | MoE Backend | tok/s |
|---|---|---|
| Qwen3-30B-A3B-NVFP4 | CUTLASS (default) | 4.8 |
| Qwen3-30B-A3B-NVFP4 | TRITON | 40.0 |
| Qwen3-Next-80B-A3B-NVFP4 | CUTLASS (default) | 4.6 |
| Qwen3-Next-80B-A3B-NVFP4 | TRITON | 32.1 |
| Qwen3-30B-A3B-NVFP4, batch=8 | TRITON | 235.8 (aggregate) |
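For reference, the per-model speedup can be recomputed from the single-request rows of the table. This is only a small arithmetic sketch; the dictionary layout and the `speedups` function name are illustrative choices, and the batch=8 row is left out because it has no CUTLASS counterpart.

```python
# Single-request throughput (tok/s) copied from the benchmark table above.
results = {
    "Qwen3-30B-A3B-NVFP4": {"CUTLASS": 4.8, "TRITON": 40.0},
    "Qwen3-Next-80B-A3B-NVFP4": {"CUTLASS": 4.6, "TRITON": 32.1},
}


def speedups(table):
    """Return the TRITON / CUTLASS throughput ratio per model, rounded to 0.1."""
    return {
        model: round(row["TRITON"] / row["CUTLASS"], 1)
        for model, row in table.items()
    }


if __name__ == "__main__":
    for model, ratio in speedups(results).items():
        print(f"{model}: {ratio}x")
```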
Additional context
The TRITON backend JIT-compiles kernels that adapt to SM121's hardware constraints. The first 2-3 inference calls are slow because of Triton JIT compilation; after that, performance stabilizes.
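Because of this warm-up cost, benchmarks should exclude the first few calls. A minimal sketch of how that could be measured follows; `generate` stands for any callable that runs one inference request and is a placeholder, not a TensorRT-LLM API. The default of three warm-up calls follows the 2-3 calls mentioned above.

```python
import time


def time_calls(generate, prompt, n_calls=6, clock=time.perf_counter):
    """Call generate(prompt) n_calls times and return each call's latency in seconds."""
    latencies = []
    for _ in range(n_calls):
        start = clock()
        generate(prompt)
        latencies.append(clock() - start)
    return latencies


def steady_state(latencies, n_warmup=3):
    """Mean latency after dropping the first n_warmup calls (JIT compilation).

    Returns None if no calls remain after the warm-up window.
    """
    tail = latencies[n_warmup:]
    return sum(tail) / len(tail) if tail else None
```

Comparing the first entries of `time_calls(...)` with `steady_state(...)` shows how much of the early latency is compilation rather than inference.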
Related PRs: