[Feature]: Document TRITON MoE backend as recommended for SM120/SM121 (DGX Spark) #12706

@mihai-chiorean

Is your feature request related to a problem?

SM120/SM121 devices (DGX Spark GB10, RTX PRO 6000) have 99 KiB of shared memory per block, versus 228 KiB on SM100 (B200). The default CUTLASS MoE GEMM backend uses TMA warp-specialized grouped GEMM tile configs sized for 228 KiB, so most autotuner tactics fail and execution falls back to a slow kernel (~4.6-4.8 tok/s on the Qwen3 NVFP4 models benchmarked below).
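To illustrate why a tile config sized for 228 KiB may not fit in 99 KiB, here is a rough sketch of a shared-memory budget check. The two per-block limits come from this issue; the tile shape, stage counts, and the `tile_smem_bytes` / `fits` helpers are hypothetical and are not taken from the CUTLASS configs. The estimate counts only the A and B operand tiles and ignores scale factors, epilogue storage, and barriers.

```python
KIB = 1024

# Per-block shared memory limits as stated in this issue.
SMEM_LIMIT_BYTES = {
    "SM100": 228 * KIB,
    "SM120/SM121": 99 * KIB,
}


def tile_smem_bytes(tile_m, tile_n, tile_k, stages, bits_per_element=4):
    """Estimate operand storage for a multi-stage GEMM pipeline.

    Counts one A tile (M x K) and one B tile (N x K) per stage.
    bits_per_element=4 models FP4 operands; everything else is ignored.
    """
    bits_per_stage = (tile_m * tile_k + tile_n * tile_k) * bits_per_element
    return stages * bits_per_stage // 8


def fits(arch, tile_m, tile_n, tile_k, stages):
    """Return True if the estimated footprint is within the arch's limit."""
    return tile_smem_bytes(tile_m, tile_n, tile_k, stages) <= SMEM_LIMIT_BYTES[arch]


if __name__ == "__main__":
    # Hypothetical 128x128x128 tile with a deep (8-stage) and a shallow (4-stage) pipeline.
    for stages in (8, 4):
        size_kib = tile_smem_bytes(128, 128, 128, stages) / KIB
        for arch in SMEM_LIMIT_BYTES:
            print(f"{stages} stages, {size_kib:.0f} KiB on {arch}: fits={fits(arch, 128, 128, 128, stages)}")
```

Under these assumptions the 8-stage config needs 128 KiB, which fits on SM100 but not on SM120/SM121, while the 4-stage config (64 KiB) fits on both.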

Setting TRTLLM_MOE_BACKEND=TRITON bypasses CUTLASS entirely and achieves 32-40 tok/s on the same models, roughly a 7-8x speedup over the CUTLASS default, and it also beats the llama.cpp GGUF baseline (24 tok/s).
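A minimal sketch of how a launcher script might apply this setting only on the affected devices. The environment variable name and the SM120/SM121 targets come from this issue; the helper names are hypothetical, and it is an assumption that the variable must be set before the library is imported.

```python
import os
from typing import MutableMapping, Optional

# Compute capabilities reported as affected in this issue: SM120 and SM121.
AFFECTED_CAPABILITIES = {(12, 0), (12, 1)}


def recommended_moe_backend(major: int, minor: int) -> Optional[str]:
    """Return "TRITON" for SM120/SM121, or None to keep the library default."""
    return "TRITON" if (major, minor) in AFFECTED_CAPABILITIES else None


def apply_moe_backend(
    major: int,
    minor: int,
    env: MutableMapping[str, str] = os.environ,
) -> Optional[str]:
    """Set TRTLLM_MOE_BACKEND unless the user already chose a value.

    Returns the value that is in effect afterwards (None if unset).
    """
    backend = recommended_moe_backend(major, minor)
    if backend is not None:
        # setdefault keeps an explicit user override intact.
        env.setdefault("TRTLLM_MOE_BACKEND", backend)
    return env.get("TRTLLM_MOE_BACKEND")
```

In practice the capability pair could come from `torch.cuda.get_device_capability()`, and the call would presumably go before the first TensorRT-LLM import so the setting is visible when the backend is chosen.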

This is not documented anywhere.

Proposed solution

Add documentation (e.g., in the DGX Spark/consumer Blackwell section of docs, or as a note in the MoE backend selection logic) recommending TRTLLM_MOE_BACKEND=TRITON for SM120/SM121 devices.

Benchmark data (DGX Spark GB10, SM121, 128GB UMA)

| Model | MoE Backend | tok/s |
|---|---|---|
| Qwen3-30B-A3B-NVFP4 | CUTLASS (default) | 4.8 |
| Qwen3-30B-A3B-NVFP4 | TRITON | 40.0 |
| Qwen3-Next-80B-A3B-NVFP4 | CUTLASS (default) | 4.6 |
| Qwen3-Next-80B-A3B-NVFP4 | TRITON | 32.1 |
| Qwen3-30B-A3B-NVFP4, batch=8 | TRITON | 235.8 (aggregate) |
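The per-model speedups implied by the single-request rows above can be checked with a few lines of arithmetic. The throughput figures are copied from the table; the `speedups` helper is only illustrative.

```python
# Single-request throughput (tok/s) from the benchmark table above.
RESULTS = {
    "Qwen3-30B-A3B-NVFP4": {"CUTLASS": 4.8, "TRITON": 40.0},
    "Qwen3-Next-80B-A3B-NVFP4": {"CUTLASS": 4.6, "TRITON": 32.1},
}


def speedups(results):
    """Return TRITON / CUTLASS throughput ratio per model."""
    return {
        model: rates["TRITON"] / rates["CUTLASS"]
        for model, rates in results.items()
    }


if __name__ == "__main__":
    for model, ratio in speedups(RESULTS).items():
        print(f"{model}: {ratio:.1f}x")
```

This gives about 8.3x for Qwen3-30B-A3B and about 7.0x for Qwen3-Next-80B-A3B.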

Additional context

The TRITON backend JIT-compiles kernels that adapt to SM121's hardware constraints. The first 2-3 inference calls are slow because of Triton JIT compilation; after that, performance stabilizes.
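Because of the JIT warm-up, throughput measurements should exclude the first few calls. The following is a sketch of a timing helper that does this; `generate` is a hypothetical zero-argument callable that runs one inference request and returns the number of tokens produced, and the default of three warm-up calls follows the 2-3 calls mentioned above.

```python
import time


def measure_tok_per_s(generate, warmup_calls=3, timed_calls=5, clock=time.perf_counter):
    """Return tokens per second over timed_calls, after discarding warm-up calls.

    generate: hypothetical callable that runs one request and returns a token count.
    clock: injectable time source, mainly to make the helper easy to test.
    """
    # Warm-up: triggers JIT compilation; results and timings are discarded.
    for _ in range(warmup_calls):
        generate()

    total_tokens = 0
    start = clock()
    for _ in range(timed_calls):
        total_tokens += generate()
    elapsed = clock() - start

    return total_tokens / elapsed
```

The same pattern applies to any benchmark harness: run the warm-up requests with the same shapes as the timed ones so the compiled kernels are actually reused.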

Related PRs:

Labels: Doc, Triton backend