[None][feat] Optimize super-v3 nvfp4 for better perf #11273
Wanli-Jiang merged 16 commits into NVIDIA:main from
Conversation
/bot run --disable-fail-fast
PR_Github #34754 [ run ] triggered by Bot. Commit:
PR_Github #34820 [ run ] triggered by Bot. Commit:
PR_Github #34820 [ run ] completed with state
Force-pushed from e0a300f to 082ba85 (Compare)
/bot run --disable-fail-fast
PR_Github #34964 [ run ] triggered by Bot. Commit:
PR_Github #34964 [ run ] completed with state
Force-pushed from 15c7bf2 to 6cd2cae (Compare)
/bot run --disable-fail-fast
PR_Github #35091 [ run ] triggered by Bot. Commit:
📝 Walkthrough
This pull request introduces CUDA kernel optimizations for causal convolution, implements fused ReLU-2 activation with FP4 quantization, extends layernorm kernels with high-precision output handling, and integrates FP4 quantization support into PyTorch modules and the Nemotron model. It includes comprehensive test coverage for new and modified operations.
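For orientation, a minimal pure-PyTorch sketch of the ReLU-2 + per-block scaling pattern the new fused kernel targets; the helper name, scale convention, and the omission of true E2M1 rounding/packing are simplifications, not the kernel's exact behavior.

```python
import torch

def relu2_blockscale_ref(x: torch.Tensor, sf_vec_size: int = 16):
    """Reference only: ReLU-squared activation, then per-16-element block scales.

    The real fused kernel additionally rounds to the E2M1 (FP4) grid and packs
    two values per byte; that step is omitted here for brevity.
    """
    y = torch.relu(x).float()
    y = y * y                                   # relu2 activation
    m, n = y.shape
    blocks = y.view(m, n // sf_vec_size, sf_vec_size)
    amax = blocks.abs().amax(dim=-1, keepdim=True)
    scale = (amax / 6.0).clamp(min=1e-12)       # 6.0 = largest E2M1 magnitude
    scaled = (blocks / scale).clamp(-6.0, 6.0)  # values now fit the FP4 range
    return scaled.view(m, n), scale.squeeze(-1)
```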
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (1 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tensorrt_llm/_torch/modules/mamba/ssd_state_passing.py (1)
4-4: ⚠️ Potential issue | 🟡 Minor: Copyright year should be updated to 2026.
This file (and the other three files in this review) has `2022-2024` in the NVIDIA copyright header, but the files are being meaningfully modified in 2026. As per coding guidelines, "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification." Update to `2022-2026` across all four modified files.
tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor: Copyright year should be updated to 2026.
The file has meaningful modifications in this PR. As per coding guidelines, source files should "contain an NVIDIA copyright header with the year of latest meaningful modification."
cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm_fp4_traits.cu (1)
1-2: ⚠️ Potential issue | 🟡 Minor: Copyright year should be updated to 2026.
This file has meaningful modifications (new template parameter, new branching logic). As per coding guidelines, the copyright header should reflect the year of the latest meaningful modification.
tensorrt_llm/_torch/models/modeling_nemotron_h.py (2)
707-709: ⚠️ Potential issue | 🟡 Minor: Python 3.10+ type syntax breaks 3.8 compatibility.
Lines 707 and 709 use `torch.Tensor | None` and lowercase `tuple[...]`, which require Python 3.10+ at runtime. The rest of the file (and the imports at line 17) consistently uses `Optional[...]` and `Tuple[...]` from `typing`. There is no `from __future__ import annotations` to enable deferred evaluation.
Proposed fix:
-        residual: torch.Tensor | None = None,
+        residual: Optional[torch.Tensor] = None,
         attn_metadata: Optional[AttentionMetadata] = None,
-    ) -> tuple[torch.Tensor, torch.Tensor | None]:
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
As per coding guidelines, "The code developed for TensorRT-LLM should conform to Python 3.8+."
623-652: ⚠️ Potential issue | 🔴 Critical: Remove dead code that references non-existent attributes.
Lines 623–638 contain code copied from `NemotronHForCausalLM.__init__` that references `self.model`, `self.draft_model`, `self.epilogue`, and `self.spec_worker`, none of which exist in `NemotronHMTPDecoderLayer` (which inherits from `NemotronHLayer` → `DecoderLayer`).
This block is currently unreachable because `sublayer_model_config` (lines 785–791) does not include `spec_config`, so `model_config.spec_config` is always `None`. However, if `spec_config` were ever passed to the sublayer config, this code would crash with `AttributeError`. Remove this block to prevent confusion and latent bugs.
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py`:
- Around line 253-261: The branch that handles non-tuple hidden_states can
receive an Fp4QuantizedTensor which lacks .view(), so update the else branch in
the block around hidden_states / hidden_states_hp (and before creating
hidden_states_hp_2d) to either tighten the type union or detect and unwrap the
quantized wrapper: if isinstance(hidden_states_hp, Fp4QuantizedTensor) (or check
hasattr(hidden_states_hp, "tensor"/"values"/"to_torch")), assign
hidden_states_hp = hidden_states_hp.tensor (or the correct backing
attribute/method) to get a regular torch.Tensor, then assert shape and call
.view(-1, self.hidden_dim) to produce hidden_states_hp_2d; alternatively, remove
Fp4QuantizedTensor from the accepted non-tuple type so only tuple paths reach
this code.
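A minimal sketch of the unwrap suggested above, written as a standalone helper; the backing attribute on `Fp4QuantizedTensor` (`.fp4_tensor` here) is an assumption, since the review itself lists several candidates, and the helper name is hypothetical.

```python
import torch

def to_plain_2d(hidden_states_hp, hidden_dim: int) -> torch.Tensor:
    """Return a (tokens, hidden_dim) view, unwrapping an FP4 wrapper if needed."""
    if not isinstance(hidden_states_hp, torch.Tensor):
        # Assumed unwrap path for Fp4QuantizedTensor-like wrappers; the real
        # attribute/method must be checked against the class definition.
        hidden_states_hp = hidden_states_hp.fp4_tensor  # hypothetical attribute
    assert hidden_states_hp.shape[-1] == hidden_dim
    return hidden_states_hp.view(-1, hidden_dim)
```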
In `@tensorrt_llm/_torch/modules/mlp.py`:
- Around line 127-142: In _fused_relu2_quant, the code always casts inputs to
torch.bfloat16 which can change precision for float16 models and add overhead;
update the logic in _fused_relu2_quant to preserve the input dtype when it is
torch.float16 or torch.bfloat16 (only convert if dtype is neither), then call
torch.ops.trtllm.fused_relu2_quantize with that preserved dtype path (the C++
invokeFusedRelu2Quantize supports both fp16 and bf16), and optionally emit a
debug log when an unexpected dtype is converted; reference _fused_relu2_quant,
torch.ops.trtllm.fused_relu2_quantize, and self.down_proj.input_scale to locate
where to adjust the cast and add the log.
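A sketch of the dtype-preserving path described above; `torch.ops.trtllm.fused_relu2_quantize` is the op named in the comment, but its exact argument list and the standalone function name here are assumptions.

```python
import torch

def fused_relu2_quant_preserving_dtype(x: torch.Tensor, input_scale: torch.Tensor):
    """Only cast when the input is neither fp16 nor bf16 (both assumed supported)."""
    if x.dtype not in (torch.float16, torch.bfloat16):
        # Unexpected dtype: fall back to bfloat16 (this is where a debug log
        # could be emitted, as the comment suggests).
        x = x.to(torch.bfloat16)
    # Argument order is illustrative; match the real op registration.
    return torch.ops.trtllm.fused_relu2_quantize(x, input_scale)
```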
In `@tensorrt_llm/_torch/modules/rms_norm.py`:
- Around line 140-152: Update the method's return type annotation to reflect the
three actual possible signatures: Fp4QuantizedTensor, (Fp4QuantizedTensor,
Tensor) and (Fp4QuantizedTensor, Tensor, Tensor) — i.e. include the 3-tuple
variant returned when return_hp_output is True and has_residual is True;
reference the symbols return_hp_output, has_residual, hidden_states_fused,
residual_out and high_precision_normed_output to locate the forward method's
annotation and remove the impossible (no-residual + return_hp_output) variant
from the union.
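For reference, the three-way union described above could be written roughly as the type alias below (alias name is made up; `Fp4QuantizedTensor` is the wrapper referenced in the comment and is left as a forward reference).

```python
from typing import Tuple, Union

import torch

RMSNormQuantOutput = Union[
    "Fp4QuantizedTensor",                                     # quantized output only
    Tuple["Fp4QuantizedTensor", torch.Tensor],                # + residual_out
    Tuple["Fp4QuantizedTensor", torch.Tensor, torch.Tensor],  # + high-precision normed output
]
```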
In `@tests/unittest/_torch/modules/mamba/test_causal_conv1d.py`:
- Around line 44-47: Remove the duplicated assertion that repeats
conv_weight.shape[0] == dim; in the test where conv_weight is validated
(variable conv_weight in test_causal_conv1d), delete the redundant assert on
conv_weight.shape[0] so only the unique shape checks remain (assert
conv_weight.shape[0] == dim, assert conv_weight.shape[1] == 1, assert
conv_weight.shape[2] == dconv).
In `@tests/unittest/_torch/modules/test_fused_activation_quant.py`:
- Around line 72-87: The helper dequantize_fp4 function currently ignores its
scale and scale2 parameters and only unpacks FP4 indices via E2M1_VALUES,
producing incorrect values and is unused; either remove dequantize_fp4 or
implement proper dequantization by applying the scales to the unpacked values
(use scale and scale2 to reconstruct floating outputs for the two interleaved
channels) and ensure tests call it; locate the function dequantize_fp4 and
E2M1_VALUES in this test file and either delete the unused function or update it
to map low/high indices through E2M1_VALUES and then multiply/add the
appropriate scale/scale2 per element so the returned tensor reflects correct
dequantized floats.
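To make the last point concrete, a hedged sketch of a scale-aware `dequantize_fp4`; the nibble order, the per-block scale layout (assumed already de-swizzled), and the meaning of `scale2` as a global scale are assumptions that must be matched against the actual quantize op used by the test.

```python
import torch

# E2M1 lookup for a 4-bit code: sign bit + {0, 0.5, 1, 1.5, 2, 3, 4, 6}
E2M1_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_fp4(packed: torch.Tensor, scale: torch.Tensor,
                   scale2: torch.Tensor, sf_vec_size: int = 16) -> torch.Tensor:
    """Unpack two FP4 codes per byte and apply per-block and global scales.

    Assumes packed has shape (m, n // 2) in uint8, low nibble = even element,
    high nibble = odd element, and scale holds one factor per sf_vec_size
    elements per row.
    """
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    vals = torch.stack(
        [E2M1_VALUES[low.long()], E2M1_VALUES[high.long()]], dim=-1
    ).reshape(packed.shape[0], -1)                  # interleave low/high nibbles
    block_scale = scale.to(torch.float32).reshape(packed.shape[0], -1)
    block_scale = block_scale.repeat_interleave(sf_vec_size, dim=-1)
    return vals * block_scale * scale2.to(torch.float32)
```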
🧹 Nitpick comments (12)
tensorrt_llm/_torch/modules/mamba/ssd_chunk_state.py (1)
375-457: Autotune expansion for `_chunk_state_varlen_kernel` is consistent with the other kernels. Same config structure as `_chunk_state_fwd_kernel`. No issues.
Note: the `tl.exp()` calls in this file (e.g., lines 344, 349, 564, 611) were not converted to `tl.math.exp2(... * 1.4426950408889634)` like in `ssd_chunk_scan.py`. If the exp2 optimization yields measurable gains in `ssd_chunk_scan`, it may be worth applying here as well for consistency.
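The exp-to-exp2 rewrite relies on the identity exp(x) = exp2(x * log2(e)); a quick PyTorch check of the constant used in `ssd_chunk_scan.py`:

```python
import torch

LOG2_E = 1.4426950408889634  # log2(e); the constant used in the exp2 rewrite
x = torch.randn(4096, dtype=torch.float32)
# tl.exp(x) and tl.math.exp2(x * LOG2_E) compute the same value; exp2 maps
# more directly to the GPU's exp2 hardware path, which is typically why it
# is faster than the generic exp.
assert torch.allclose(torch.exp(x), torch.exp2(x * LOG2_E))
```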
cpp/tensorrt_llm/kernels/fusedLayernormKernels/low_latency_layernorm.cuh (1)
1-2: Copyright year may need updating. The copyright header says 2024, but this file contains meaningful modifications for 2026. As per coding guidelines, source files should have the year of latest meaningful modification in the copyright header.
Suggested fix:
-* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+* Copyright (c) 2024-2026, NVIDIA CORPORATION. All rights reserved.
tests/unittest/_torch/modules/test_fused_add_rms_norm_quant.py (1)
34-63: `tuple[...]` syntax requires Python 3.9+.
Lines 39 and 66 use lowercase `tuple[...]` for type hints. Per coding guidelines, Python 3.8+ compatibility is expected. Use `Tuple` from `typing` for consistency with the rest of the codebase (e.g., `cpp_custom_ops.py` uses `Tuple`).
Proposed fix:
+from typing import Tuple
+
 def rms_norm_ref(
     hidden_states: torch.Tensor,
     residual: torch.Tensor,
     gamma: torch.Tensor,
     eps: float,
-) -> tuple[torch.Tensor, torch.Tensor]:
+) -> Tuple[torch.Tensor, torch.Tensor]:
And similarly for `get_swizzled_sf_indices`:
-def get_swizzled_sf_indices(m: int, n: int, sf_vec_size: int = 16) -> list[int]:
+def get_swizzled_sf_indices(m: int, n: int, sf_vec_size: int = 16) -> List[int]:
As per coding guidelines, "The code developed for TensorRT-LLM should conform to Python 3.8+".
cpp/tensorrt_llm/thop/fusedActivationQuant.cpp (1)
42-88: `int64_t` → `int` narrowing cast on dimensions `m` and `n`.
Lines 69–70 (and 76–77) cast `m` and `n` from `int64_t` to `int` via `static_cast<int>()`. If a tensor dimension ever exceeds `INT_MAX`, this silently truncates. While unlikely for FP4 quantization workloads, it's worth adding a guard, similar to the existing `TORCH_CHECK` assertions, to reject oversized inputs early.
Proposed guard:
     TORCH_CHECK(n % sf_vec_size == 0, "N must be divisible by sf_vec_size.");
+    TORCH_CHECK(m <= INT_MAX && n <= INT_MAX, "Dimensions exceed int range.");
tensorrt_llm/_torch/modules/mamba/fuse_elementwise_ops.py (1)
146-146: Unused unpacked variable `conv_dim`. Ruff flags `conv_dim` as unused (RUF059). Prefix with `_` to clarify intent.
Proposed fix:
-    conv_dim, num_prefill_tokens = xbc.shape
+    _conv_dim, num_prefill_tokens = xbc.shape
tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py (2)
26-60: `num_seqs` as `tl.constexpr` will trigger Triton recompilation for every distinct batch size.
Since `num_seqs` is declared `tl.constexpr`, the loop `for seq_idx in range(num_seqs - 1)` is unrolled at compile time. Every time the batch size changes, Triton must re-specialize and recompile this kernel. If the serving workload sees variable batch sizes (common in production), the recompilation overhead could be noticeable on the first occurrence of each batch size.
Consider whether `num_seqs` can be a runtime parameter instead, with a `tl.range`-based loop (or a `while` loop with a dynamic bound). Alternatively, if the max batch size is small and bounded, the current approach is fine; just be aware of the trade-off.
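A minimal sketch of the two signatures being contrasted; the kernel names and bodies below are illustrative only, not the actual `mamba2_metadata` kernel.

```python
import triton
import triton.language as tl

@triton.jit
def _cu_chunk_offsets_constexpr(out_ptr, seqlens_ptr, num_seqs: tl.constexpr):
    # constexpr bound: the loop is unrolled and the kernel is re-specialized
    # (recompiled) for every distinct num_seqs value seen at runtime.
    for i in range(num_seqs):
        tl.store(out_ptr + i, tl.load(seqlens_ptr + i))

@triton.jit
def _cu_chunk_offsets_runtime(out_ptr, seqlens_ptr, num_seqs):
    # runtime bound: one compiled kernel serves all batch sizes; the loop
    # becomes a dynamic loop instead of being unrolled.
    for i in range(num_seqs):
        tl.store(out_ptr + i, tl.load(seqlens_ptr + i))
```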
225-233: Per-element Python-loop access to `num_cached_tokens_per_seq` causes host-device syncs.
Lines 225–227 index into `num_cached_tokens_per_seq` (a CUDA tensor) in a Python `for` loop, triggering a host-device synchronization on each iteration. For small `num_contexts` this is typically acceptable, but it partially negates the benefit of the Triton-based chunk computation.
If `num_cached_tokens_per_seq` is a CUDA tensor, a vectorized approach avoids the per-element syncs. Suggested vectorized path:
-        initial_states = [
-            num_cached_tokens_per_seq[i] > 0 for i in range(num_contexts)
-        ]
-        self.use_initial_states = any(initial_states)
+        initial_states_tensor = num_cached_tokens_per_seq[:num_contexts] > 0
+        self.use_initial_states = initial_states_tensor.any().item()
         if self.use_initial_states:
-            self.has_initial_states[:num_contexts] = torch.tensor(
-                initial_states, dtype=torch.bool)
+            self.has_initial_states[:num_contexts] = initial_states_tensor
cpp/tensorrt_llm/kernels/fusedActivationQuant.cu (2)
42-50: Missing braces and `const` qualifiers on local variables.
Per coding guidelines:
- Lines 49–50: The `if` body must be a compound (brace-delimited) statement.
- Lines 44–47: `SFScaleVal`, `numColThreads`, `numColVecs`, `rowIdx` are never modified after initialization and should be `const`. `SFScaleVal` and other identifiers starting with uppercase violate the camelCase-with-lowercase-first convention for local variables (e.g., `sfScaleVal`).
Proposed fix for the early-return guard:
-    if (rowIdx >= m)
-        return;
+    if (rowIdx >= m)
+    {
+        return;
+    }
As per coding guidelines: "The statement forming the body of a `switch`, `while`, `do..while`, or `for` statement must be a compound statement", "`If` and `else` statements should always be followed by brace-delimited statements", and "A variable that is not modified after its initialization should be declared as `const`".
70-71: Warp shuffle reduction only reduces across lane pairs; verify this is intentional.
`__shfl_xor_sync(0xffffffff, localMax, 1)` exchanges with exactly the adjacent lane (XOR with 1). Since `kEltsPerThread = 8` and `kSfVecSize = 16`, each pair of consecutive threads covers one 16-element SF vector, so a single XOR-1 shuffle is sufficient. This is correct but subtle; a brief comment noting the pairing would help future readers.
cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm_fp4_traits.cu (1)
122-148: Four-way branching is correct but could be simplified.
The `{rms, non-rms} × {hp, non-hp}` matrix produces 4 nearly identical `_invoke(...)` calls differing only in two `bool` template args. A small helper template or constexpr dispatch could deduplicate this, but since template instantiation requires compile-time constants, the current explicit branching is functionally correct and not unusual for CUDA kernel selection code.
tensorrt_llm/_torch/models/modeling_nemotron_h.py (2)
1-1: Copyright year is outdated.
The header says `2022-2024` but this file has meaningful modifications in 2026. As per coding guidelines, "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification."
Proposed fix:
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
421-422: Redundant `zeros_like` allocation; dead code in practice.
`NemotronHModel.forward` (line 527) already initializes `residual = torch.zeros_like(hidden_states)` before the layer loop and always passes it, so this fallback is never reached in the normal call path. The `NemotronHMTPDecoderLayer` also handles `None` residual in its own `forward` without calling `super().forward()`.
If this is kept as a defensive guard, consider a cheaper sentinel or an assertion instead, to avoid silently masking caller bugs. Suggested alternative:
-        if residual is None:
-            residual = torch.zeros_like(hidden_states)
+        assert residual is not None, (
+            "NemotronHLayer.forward expects a residual tensor; "
+            "caller must initialize it (e.g. torch.zeros_like(hidden_states))."
+        )
PR_Github #35091 [ run ] completed with state
PR_Github #35691 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #35691 [ run ] completed with state
PR_Github #35735 [ run ] triggered by Bot. Commit:
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Force-pushed from d397bbc to 66f64d0 (Compare)
/bot run --stage-list "A30-AutoDeploy-1,RTX5090-PackageSanityCheck-PY312-UB2404,A10-PackageSanityCheck-PY310-UB2204,GB200-4_GPUs-PyTorch-1,DGX_B200-4_GPUs-PyTorch-1,DGX_H100-4_GPUs-PyTorch-Others-1" --disable-fail-fast
PR_Github #35746 [ run ] triggered by Bot. Commit:
PR_Github #35746 [ run ] completed with state
/bot skip --comment "all tests are passed within different runs and passed at local runs"
PR_Github #35787 [ skip ] triggered by Bot. Commit:
PR_Github #35787 [ skip ] completed with state
Detailed Kernel Analysis
1. Kernels Completely Removed (Optimized Out)
These kernels were present in the base profile but are absent from the optimized version:
- elementwise_kernel (direct_copy BFloat16)
- cutlass FP4 GEMM (4_4_1_2SM, 256x256x256)
- cutlass3x_sm100 bias_bf16_relu
- flashinfer RMSNormKernel
- vectorized_elementwise_kernel (clamp)
- vectorized_elementwise_kernel (pow)
Total time saved from removed kernels: ~1,410.76 ms
2. Kernels Added (New in Optimized Version)
These kernels appear only in the optimized profile:
- _fused_conv_output_transpose_kernel
- _extract_transpose_prefill_kernel
- warpSpecializedInvoker (FP4AddBiasResidualPreLayerNorm, variant 1)
- warpSpecializedInvoker (FP4AddBiasResidualPreLayerNorm, variant 2)
- fusedRelu2QuantizeKernel
- FusedAddRMSNormKernel
- FillFunctor
Total time for new kernels: ~462.00 ms
Net savings from kernel fusion: ~948.76 ms
3. Kernels with Significant Time Changes
4. Core GEMM Operations (we did not touch these; any regression is likely from run-to-run variance.)
Summary by CodeRabbit
Release Notes
New Features
Performance Improvements
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.