[Feature]: Document TRITON MoE backend as recommended for SM120/SM121 (DGX Spark) #12706

### Is your feature request related to a problem?

SM120/SM121 (DGX Spark GB10, RTX PRO 6000) has 99 KiB of shared memory per block vs 228 KiB on SM100 (B200). The default CUTLASS MoE GEMM backend uses TMA warp-specialized grouped GEMM tile configs sized for 228 KiB, so most autotuner tactics fail and execution falls back to a slow kernel (~4.6-4.8 tok/s on the Qwen3 NVFP4 models benchmarked below).

Setting `TRTLLM_MOE_BACKEND=TRITON` bypasses CUTLASS entirely and achieves **32-40 tok/s** on the same models, a roughly 7-8x speedup that also beats the llama.cpp GGUF baseline (24 tok/s).

This is not documented anywhere.

### Proposed solution

Add documentation (e.g., in the DGX Spark/consumer Blackwell section of docs, or as a note in the MoE backend selection logic) recommending `TRTLLM_MOE_BACKEND=TRITON` for SM120/SM121 devices.
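As a starting point for what the documentation could show, here is a minimal sketch of selecting the backend from the device's compute capability. Only the `TRTLLM_MOE_BACKEND` variable and the SM120/SM121 targets come from this issue; the helper names are hypothetical, and the sketch assumes the variable is read when the runtime starts, so it would have to be set before the library is imported.

```python
import os

# Compute capabilities this issue reports as affected: SM120 and SM121.
AFFECTED_CAPABILITIES = {(12, 0), (12, 1)}


def recommended_moe_backend(major, minor):
    """Return "TRITON" for SM120/SM121, else None (keep the default backend)."""
    return "TRITON" if (major, minor) in AFFECTED_CAPABILITIES else None


def apply_moe_backend(major, minor, env=os.environ):
    """Set TRTLLM_MOE_BACKEND unless the user has already chosen a value."""
    backend = recommended_moe_backend(major, minor)
    if backend is not None:
        # setdefault keeps an explicit user override intact.
        env.setdefault("TRTLLM_MOE_BACKEND", backend)
    return env.get("TRTLLM_MOE_BACKEND")
```

The `(major, minor)` pair could come from `torch.cuda.get_device_capability()`, for example `apply_moe_backend(*torch.cuda.get_device_capability())`. Using `setdefault` is a deliberate choice so that a user who exports a different backend is not silently overridden.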

### Benchmark data (DGX Spark GB10, SM121, 128GB UMA)

| Model | MoE Backend | tok/s |
|-------|------------|-------|
| Qwen3-30B-A3B-NVFP4 | CUTLASS (default) | 4.8 |
| Qwen3-30B-A3B-NVFP4 | TRITON | **40.0** |
| Qwen3-Next-80B-A3B-NVFP4 | CUTLASS (default) | 4.6 |
| Qwen3-Next-80B-A3B-NVFP4 | TRITON | **32.1** |
| Qwen3-30B-A3B-NVFP4, batch=8 | TRITON | **235.8** (aggregate) |
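For reference, the per-model speedups implied by the table can be recomputed with a few lines of Python. The numbers are copied from the rows above; the per-stream figure for the batch=8 row assumes the aggregate throughput is shared evenly across the 8 sequences, which the table does not state.

```python
# Single-stream throughput (tok/s) copied from the benchmark table.
results = {
    "Qwen3-30B-A3B-NVFP4": {"CUTLASS": 4.8, "TRITON": 40.0},
    "Qwen3-Next-80B-A3B-NVFP4": {"CUTLASS": 4.6, "TRITON": 32.1},
}


def speedup(model):
    """Ratio of TRITON to CUTLASS throughput for one model."""
    row = results[model]
    return row["TRITON"] / row["CUTLASS"]


for model in results:
    print(f"{model}: {speedup(model):.1f}x")
# Qwen3-30B-A3B-NVFP4: 8.3x
# Qwen3-Next-80B-A3B-NVFP4: 7.0x

# Assumption: the batch=8 aggregate splits evenly across sequences.
per_stream_batch8 = 235.8 / 8
print(f"batch=8 per-stream: {per_stream_batch8:.1f} tok/s")
# batch=8 per-stream: 29.5 tok/s
```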

### Additional context

The TRITON backend JIT-compiles kernels that adapt to SM121's hardware constraints. The first 2-3 inference calls are slow because of Triton JIT compilation; performance stabilizes after that.
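Because of that warm-up cost, throughput measurements should exclude the first few calls. The sketch below shows one way to do this; `generate` is a hypothetical zero-argument callable standing in for a single inference request, and the default of 3 warm-up calls is taken from the "2-3 calls" observation above rather than from any library setting.

```python
import time


def time_calls(generate, n_calls=6):
    """Run generate() n_calls times and return per-call latency in seconds."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        generate()
        latencies.append(time.perf_counter() - start)
    return latencies


def steady_state_mean(latencies, n_warmup=3):
    """Mean latency after discarding the first n_warmup (JIT-affected) calls."""
    steady = latencies[n_warmup:]
    if not steady:
        raise ValueError("need more calls than n_warmup to measure steady state")
    return sum(steady) / len(steady)
```

For example, `steady_state_mean(time_calls(generate, n_calls=10))` would average the last 7 of 10 calls. Comparing the discarded latencies against the steady-state mean is also a quick way to confirm that the slow start really is compilation and not something persistent.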

Related PRs:
- #12704: CUTLASS MoE SMEM filter for SM121
- #12705: Enable CUDA core NVFP4 fast path for SM121
- #12310: Autotuner bounds checking for SM121
- #12301: KV cache unified memory optimization
