[None][fix] Enable CUDA core fast path for SM121 (DGX Spark) #12705

Open

mihai-chiorean wants to merge 1 commit into NVIDIA:main from mihai-chiorean:fix/enable-cuda-core-sm121

Conversation

@mihai-chiorean
Contributor

@mihai-chiorean mihai-chiorean commented Apr 2, 2026

Summary

The enable_cuda_core checks in NVFP4Linear.__init__ and _trtllm_fp8_prequant_linear_core() gate the CUDA core scaled-mm fast path for small M dimensions (M <= 8). Both checks match SM89 (8,9) and SM120 (12,0) but not SM121 (12,1), leaving the fast path dead on DGX Spark GB10.

This adds (12,1) to both checks so SM121 gets the same fast path as SM120.
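For reference, the gate amounts to a capability allowlist. A minimal sketch of the semantics after the fix (the function name and constant below are illustrative, not the actual TensorRT-LLM symbols):

```python
# Hypothetical sketch of the capability allowlist described above;
# names are illustrative, not TensorRT-LLM API.
CUDA_CORE_CAPABILITIES = {(8, 9), (12, 0), (12, 1)}  # SM89, SM120, SM121


def cuda_core_fast_path_enabled(capability) -> bool:
    """Return True when this (major, minor) SM capability should take the
    CUDA core scaled-mm fast path for small M dimensions (M <= 8)."""
    return tuple(capability) in CUDA_CORE_CAPABILITIES
```

Before this PR, the inline checks effectively matched only (8, 9) and (12, 0), so (12, 1) fell through to the slower path.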

Note on device-0 hardcoding

Both linear.py:2571 (torch.device("cuda:0")) and quant.py:111 (torch.cuda.get_device_capability(0)) query device 0 at init time rather than the device the tensors will actually live on. This is a pre-existing issue that affects SM89 and SM120 equally — fixing it requires moving the check to dispatch time, which is a larger refactor beyond this PR scope.
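One possible shape of that larger refactor, sketched only to illustrate the idea: query the capability of the device the input tensor actually lives on at dispatch time, caching per device index. The helper names below are made up for this sketch, not proposed API.

```python
import torch
from functools import lru_cache


@lru_cache(maxsize=None)
def _capability_of(device_index: int):
    # Hypothetical helper: query the capability of a specific device
    # once, instead of always asking device 0 at init time.
    return torch.cuda.get_device_capability(device_index)


def enable_cuda_core_for(x: torch.Tensor) -> bool:
    # Decide at dispatch time, based on where the tensor lives.
    if x.device.type != "cuda":
        return False
    major, minor = _capability_of(x.device.index or 0)
    return (major, minor) in {(8, 9), (12, 0), (12, 1)}
```

On a host without a CUDA device, or for a CPU tensor, this simply returns False; the capability query is only paid once per CUDA device thanks to the cache.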

Test plan

  • Verified enable_cuda_core=True on DGX Spark GB10 (SM121) after fix
  • CI: no regression on SM89/SM120

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough


Updated CUDA-core enablement logic in the Linear module to recognize additional GPU architectures. The condition now enables CUDA cores for both (major=12, minor=0) and (major=12, minor=1) GPU capabilities, extending support beyond the previously supported (12,0) architecture.

Changes

Cohort / File(s): CUDA Core Architecture Support — tensorrt_llm/_torch/modules/linear.py
Summary: Modified Linear.__init__ to include GPU architecture (12,1) alongside existing (12,0) in the CUDA-core enablement condition.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Docstring Coverage — Passed: docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Title check — Passed: the title accurately summarizes the main change, enabling the CUDA core fast path for SM121 (DGX Spark), which is the core objective of the PR.
  • Description check — Passed: the PR description includes a clear summary of the issue, solution, and testing approach, though the Description and Test Coverage sections are not explicitly separated with headers as the template suggests.


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/linear.py (1)

2572-2574: Centralize the CUDA-core capability allowlist to avoid cross-file drift.

This SM121 enablement is correct, but the same capability check is duplicated elsewhere and is already inconsistent (tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py:109-120 still excludes (12,1)). Consider moving the allowlist into a shared helper/constant and reusing it in both places.

Suggested direction
+# e.g., in a shared module
+CUDA_CORE_CAPABILITIES = {(8, 9), (12, 0), (12, 1)}
+
 # in Linear.__init__
 self.enable_cuda_core = False
 if torch.cuda.is_available():
     capability = torch.cuda.get_device_capability(torch.device('cuda:0'))
-    self.enable_cuda_core = (capability[0] == 8 and capability[1] == 9) \
-        or (capability[0] == 12 and capability[1] in (0, 1))
+    self.enable_cuda_core = capability in CUDA_CORE_CAPABILITIES

Apply the same shared constant/helper in tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py to keep behavior aligned.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/linear.py` around lines 2572 - 2574, Centralize
the CUDA core allowlist by adding a shared constant or helper (e.g.,
CUDA_CORE_ALLOWLIST or is_cuda_core_supported(capability)) and replace the
inline capability check in LinearModule (where enable_cuda_core is set using
capability) with a call/reference to that helper; then update the quantization
code in tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py to use
the same helper so (12,1) is included consistently across both places. Ensure
the shared symbol explicitly contains the tuples {(8,9), (12,0), (12,1)} (or
logic that yields the same) and update references in the LinearModule
(enable_cuda_core) and the quant module to use that single source of truth.

ℹ️ Review info
⚙️ Run configuration: Path: .coderabbit.yaml · Review profile: CHILL · Plan: Pro · Run ID: e59cfc02-a160-49f6-a50f-c9d5c7c47013

📥 Commits

Reviewing files that changed from the base of the PR and between 11c40bb and 81ea2be.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/linear.py

The enable_cuda_core check in NVFP4Linear only matches SM89 and SM120
but not SM121 (DGX Spark GB10). CudaCoreNVFP4Runner.MIN_SM_VERSION is
100 so SM121 qualifies, but the linear module early-exit optimization
bypasses the autotuner for small M dimensions (M <= 8) and this path
was dead on SM121.

Add capability (12, 1) to the check.

Signed-off-by: Mihai Chiorean <mihai.v.chiorean@gmail.com>
@mihai-chiorean mihai-chiorean force-pushed the fix/enable-cuda-core-sm121 branch from 81ea2be to ee80a41 on April 2, 2026 at 18:59
@mihai-chiorean mihai-chiorean requested a review from a team as a code owner April 2, 2026 18:59
@mihai-chiorean mihai-chiorean requested a review from galagam April 2, 2026 18:59
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Apr 2, 2026
@mihai-chiorean mihai-chiorean changed the title [None][fix] Enable CUDA core NVFP4 fast path for SM121 (DGX Spark) [None][fix] Enable CUDA core fast path for SM121 (DGX Spark) Apr 2, 2026
