[None][fix] Enable CUDA core fast path for SM121 (DGX Spark)#12705
mihai-chiorean wants to merge 1 commit into NVIDIA:main
Conversation
📝 Walkthrough

Updated CUDA-core enablement logic in the Linear module to recognize additional GPU architectures. The condition now enables CUDA cores for compute capabilities (12, 0) and (12, 1) in addition to (8, 9).
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes. Pre-merge checks: ✅ 3 passed.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/linear.py (1)
`2572-2574`: **Centralize the CUDA-core capability allowlist to avoid cross-file drift.** This SM121 enablement is correct, but the same capability check is duplicated elsewhere and is already inconsistent (`tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py:109-120` still excludes `(12, 1)`). Consider moving the allowlist into a shared helper/constant and reusing it in both places.

Suggested direction:

```diff
+# e.g., in a shared module
+CUDA_CORE_CAPABILITIES = {(8, 9), (12, 0), (12, 1)}
+
 # in Linear.__init__
 self.enable_cuda_core = False
 if torch.cuda.is_available():
     capability = torch.cuda.get_device_capability(torch.device('cuda:0'))
-    self.enable_cuda_core = (capability[0] == 8 and capability[1] == 9) \
-        or (capability[0] == 12 and capability[1] in (0, 1))
+    self.enable_cuda_core = capability in CUDA_CORE_CAPABILITIES
```

Apply the same shared constant/helper in `tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py` to keep behavior aligned.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/linear.py` around lines 2572 - 2574, Centralize the CUDA core allowlist by adding a shared constant or helper (e.g., CUDA_CORE_ALLOWLIST or is_cuda_core_supported(capability)) and replace the inline capability check in LinearModule (where enable_cuda_core is set using capability) with a call/reference to that helper; then update the quantization code in tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py to use the same helper so (12,1) is included consistently across both places. Ensure the shared symbol explicitly contains the tuples {(8,9), (12,0), (12,1)} (or logic that yields the same) and update references in the LinearModule (enable_cuda_core) and the quant module to use that single source of truth.
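The shared allowlist the review suggests could look like the following minimal sketch. The constant and helper names are illustrative assumptions, not the actual TensorRT-LLM symbols:

```python
# Hypothetical shared allowlist per the review suggestion; the names here
# are illustrative, not the actual TensorRT-LLM API.

# Compute capabilities that get the CUDA-core scaled-mm fast path:
# SM89, SM120, and (after this PR) SM121 (DGX Spark GB10).
CUDA_CORE_CAPABILITIES = {(8, 9), (12, 0), (12, 1)}


def is_cuda_core_supported(capability) -> bool:
    """True if a (major, minor) compute capability is on the allowlist.

    Callers would pass torch.cuda.get_device_capability(device) here, so
    linear.py and quant.py share one source of truth.
    """
    return tuple(capability) in CUDA_CORE_CAPABILITIES
```

A set-membership test also keeps future additions to the allowlist one-line changes instead of edits to a compound boolean expression.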
📒 Files selected for processing (1)
tensorrt_llm/_torch/modules/linear.py
The `enable_cuda_core` check in `NVFP4Linear` only matches SM89 and SM120 but not SM121 (DGX Spark GB10). `CudaCoreNVFP4Runner.MIN_SM_VERSION` is 100 so SM121 qualifies, but the linear module early-exit optimization bypasses the autotuner for small M dimensions (M <= 8), and this path was dead on SM121. Add capability (12, 1) to the check.

Signed-off-by: Mihai Chiorean <mihai.v.chiorean@gmail.com>
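The interaction the commit message describes can be sketched as follows. This is a simplified illustration of the gating logic, with a hypothetical `choose_gemm_path` function rather than the real module code:

```python
# Simplified sketch of the small-M gating described in the commit message;
# the function and its return labels are illustrative, not TensorRT-LLM code.
def choose_gemm_path(m: int, enable_cuda_core: bool) -> str:
    # For skinny GEMMs (M <= 8) the module skips the autotuner; the
    # CUDA-core scaled-mm path is only reachable if the init-time
    # capability check set enable_cuda_core. Before this fix the flag
    # was always False on SM121, so small-M calls fell through.
    if m <= 8 and enable_cuda_core:
        return "cuda_core_scaled_mm"
    return "autotuned_gemm"
```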
Force-pushed from 81ea2be to ee80a41
Summary
`enable_cuda_core` in `NVFP4Linear.__init__` and `_trtllm_fp8_prequant_linear_core()` gate the CUDA core scaled-mm fast path for small M dimensions (M <= 8). Both checks match SM89 `(8, 9)` and SM120 `(12, 0)` but not SM121 `(12, 1)`, leaving the fast path dead on DGX Spark GB10.

This adds `(12, 1)` to both checks so SM121 gets the same fast path as SM120.

Note on device-0 hardcoding
Both `linear.py:2571` (`torch.device("cuda:0")`) and `quant.py:111` (`torch.cuda.get_device_capability(0)`) query device 0 at init time rather than the device the tensors will actually live on. This is a pre-existing issue that affects SM89 and SM120 equally; fixing it requires moving the check to dispatch time, which is a larger refactor beyond this PR's scope.

Test plan
- `enable_cuda_core=True` on DGX Spark GB10 (SM121) after fix
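The dispatch-time alternative mentioned in the device-0 note could look roughly like this sketch; the function names and caching strategy are assumptions for illustration, not the planned refactor:

```python
# Hedged sketch of a dispatch-time check: query the capability of the
# tensor's own device instead of hardcoding cuda:0 at init time.
import functools


@functools.lru_cache(maxsize=None)
def _capability_of(device_index: int):
    # Imported lazily so the helper stays importable without CUDA present.
    import torch
    return torch.cuda.get_device_capability(device_index)


def cuda_core_enabled_for(tensor) -> bool:
    """Decide the fast path per tensor, caching one query per device."""
    if tensor.device.type != "cuda":
        return False
    return _capability_of(tensor.device.index or 0) in {(8, 9), (12, 0), (12, 1)}
```

Caching the per-device capability keeps the dispatch-time check as cheap as the current init-time one after the first call on each device.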