[6241485] Add support for ONNX Q/DQ node placement for DLA by gcunhase · Pull Request #1661 · NVIDIA/Model-Optimizer

gcunhase · 2026-06-09T20:09:34Z

What does this PR do?

Type of change: New feature

On DLA, the whole DLA-eligible region is compiled as one node, which runs in INT8 or FP16, and it expects scales to be present throughout. A tensor without a usable scale typically forces that region to run in FP16 or to fall back to GPU (if enabled); otherwise the build fails.

With IQ (implicit quantization) being deprecated in TensorRT, users are migrating to ModelOpt for quantization/calibration. However, this breaks the DLA workflow since DLA still only supports IQ. The suggested workflow is then to:

- Use ModelOpt to obtain the EQ (explicitly quantized) model;
- Use NVIDIA's Q/DQ Translator Toolkit to obtain the calib.cache and layer_arg.txt files, which can be used with the non-quantized model to generate a DLA loadable.

A study on YOLOv5 has shown that EQ can achieve performance parity with IQ on DLA if Q/DQ nodes are inserted at every layer, making sure all tensors have INT8 scales. From the study: "With this option, all layers’ scales can be obtained during model fine-tuning. However, this method may potentially disrupt TensorRT fusion strategy with Q/DQ layers when running inference on GPU and lead to higher latency on the GPU. For DLA, on the other hand, the rule of thumb with PTQ scales is, “The more available scales, the lower the latency.”"
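
For context on what a Q/DQ pair contributes to a tensor, here is a minimal sketch of the arithmetic, assuming the ONNX `QuantizeLinear`/`DequantizeLinear` semantics for signed INT8 (round to nearest even, then saturate). The scale of 0.05 and the input values are made up for illustration; they are not taken from the study or from this PR.

```python
def quantize_linear(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Q node: scale, round to nearest even, add the zero point, then saturate to INT8."""
    q = round(x / scale) + zero_point  # Python's round() rounds halves to even
    return max(qmin, min(qmax, q))


def dequantize_linear(q, scale, zero_point=0):
    """DQ node: map the INT8 value back to a real value using the same scale."""
    return (q - zero_point) * scale


scale = 0.05  # illustrative per-tensor scale (assumption)
values = [0.12, -1.0, 3.3, 10.0]

quantized = [quantize_linear(v, scale) for v in values]
restored = [dequantize_linear(q, scale) for q in quantized]

for v, q, r in zip(values, quantized, restored):
    print(f"{v:>6} -> q={q:>4} -> {r:.2f}")
```

The last value saturates at 127, which is why a scale has to be known for every tensor the INT8 region touches: without one there is no way to place the tensor on the INT8 grid.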

This PR aims to enable a quantization path targeting DLA.

Usage

$ python -m modelopt.onnx.quantization --onnx=model.onnx --target_dla
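
The same invocation can be assembled from a script. The helper below is a hypothetical convenience wrapper, not part of ModelOpt; it only uses the `--onnx`, `--quantize_mode`, and `--target_dla` flags that appear in this PR thread and does not run anything by itself.

```python
import sys


def build_quantize_command(onnx_path, target_dla=False, quantize_mode=None):
    """Return an argv list for the ModelOpt ONNX quantization CLI (hypothetical helper)."""
    cmd = [sys.executable, "-m", "modelopt.onnx.quantization", f"--onnx={onnx_path}"]
    if quantize_mode is not None:
        cmd.append(f"--quantize_mode={quantize_mode}")
    if target_dla:
        # Per the PR description, this flag only has an effect for INT8 quantization.
        cmd.append("--target_dla")
    return cmd


cmd = build_quantize_command("model.onnx", target_dla=True)
print(" ".join(cmd[1:]))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

The PR also adds a `target_dla: bool = False` parameter to `quantize()`, so callers of the Python API can opt in the same way.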

Testing

Two new parametrized tests (target_dla=False/True) cover both the Conv/Mul quantization expansion and the GEMV (MatMul m=1) exclusion bypass, with dedicated model builders.
Internal test: 6241485@10

Before your PR is "Ready for review"

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: ✅
Did you update Changelog?: ✅
Did you get Claude approval on this PR?: ❌

Additional info

Related blogpost: https://developer.nvidia.com/blog/deploying-yolov5-on-nvidia-jetson-orin-with-cudla-quantization-aware-training-to-inference/#adding_qdq_nodes

Summary by CodeRabbit

New Features
- Added a DLA-targeting option to enable Q/DQ placement for INT8 quantization.
Behavior Changes
- Pre-quantization exclusion discovery is skipped when DLA targeting or autotune is enabled, yielding more consistent Q/DQ placement.
Tests
- New unit tests validate Q/DQ placement with and without DLA targeting using a small grouped-conv model.
Documentation
- Changelog updated to document the new option.

coderabbitai · 2026-06-09T20:09:49Z

📝 Walkthrough

Walkthrough

Adds a --target_dla CLI flag and a target_dla parameter to quantize(), propagates it into INT8 logic to skip certain exclusion heuristics and expand op-types when enabled, and adds tests, a test model, and a changelog entry.

Changes

DLA Targeting Feature

Layer / File(s)	Summary
CLI flag definition and wiring `modelopt/onnx/quantization/__main__.py`	The `--target_dla` command-line argument is added to the PTQ parser with INT8-only documentation, and the flag value is forwarded to the `quantize()` function call.
Main quantize() function signature and documentation `modelopt/onnx/quantization/quantize.py`	The `quantize()` function signature is extended with a new `target_dla: bool = False` parameter, and the docstring documents its effect on Q/DQ placement for DLA deployment.
Main quantize() DLA logic `modelopt/onnx/quantization/quantize.py`	MHA node exclusion is skipped when `target_dla` is true and `quantize_mode == "int8"`, and `target_dla` is injected into `kwargs` for downstream INT8/FP8 quantizers.
INT8 quantization adjustments for DLA `modelopt/onnx/quantization/int8.py`	GEMV pattern detection and conv-based exclusion are skipped when `autotune` or `target_dla` are set; when `target_dla` is enabled and `op_types_to_quantize` is empty, it is populated from all graph `op_type` values.
Test model and unit tests `tests/_test_utils/onnx/lib_test_models.py`, `tests/unit/onnx/quantization/test_qdq_rules_int8.py`	Adds `build_small_grouped_conv_model()` and parameterized tests (`test_target_dla_conv`, `test_target_dla_matmul`) that assert different Q/DQ placement behaviors depending on `target_dla`.
Changelog entry `CHANGELOG.rst`	Documents the new 0.46 feature for ONNX Q/DQ node placement support for DLA via the `--target_dla` flag.
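
The INT8 behavior summarized in the table can be sketched as a small decision function. This is a simplified stand-in written for illustration, assuming only what the walkthrough states; `plan_int8_quantization` and the heuristic labels are hypothetical names and not the code in `int8.py`.

```python
def plan_int8_quantization(
    graph_op_types,
    op_types_to_quantize=None,
    autotune=False,
    target_dla=False,
    enable_gemv_detection_for_trt=True,
):
    """Return (op types to quantize, exclusion heuristics that would still run)."""
    op_types = list(op_types_to_quantize or [])

    # With target_dla and no user-supplied list, use every op type in the graph.
    if target_dla and not op_types:
        op_types = sorted(set(graph_op_types))

    # Both autotune and target_dla bypass the pre-quantization exclusion discovery.
    skip_exclusions = autotune or target_dla

    heuristics = []
    if enable_gemv_detection_for_trt and not skip_exclusions:
        heuristics.append("gemv")
    if not skip_exclusions:
        heuristics.append("conv")
    return op_types, heuristics


graph_ops = ["Conv", "Relu", "Conv", "Mul"]
print(plan_int8_quantization(graph_ops))                   # default path
print(plan_int8_quantization(graph_ops, target_dla=True))  # DLA path
```

In the default path both heuristics run and the op-type list is left for the quantizer's defaults; with `target_dla=True` no exclusion heuristic runs and every op type in the graph becomes a candidate.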

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title clearly and specifically summarizes the main change: adding support for ONNX Q/DQ node placement targeting DLA (Deep Learning Accelerator).
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	PR adds target_dla feature for DLA quantization. Security review: numpy.load properly guarded with allow_pickle parameter controlled by --trust_calibration_data (caller-configurable, defaults to Fa...
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

coderabbitai

🧹 Nitpick comments (3)

modelopt/onnx/quantization/int8.py (3)

188-189: 💤 Low value

Update condition comment for consistency.

Similar to the GEMV exclusion above, the conv exclusion is now skipped for both autotune and target_dla. Consider adding a brief comment explaining why both conditions skip this exclusion, for maintainability.

📝 Suggested comment

 if not (autotune or kwargs.get("target_dla", False)):
+    # Skip conv exclusion for autotune (runtime-driven) and target_dla (comprehensive Q/DQ coverage)
     nodes_to_exclude.extend(find_nodes_from_convs_to_exclude(graph, quantize_mode="int8"))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/int8.py` around lines 188 - 189, The
conv-exclusion condition in int8.py currently reads `if not (autotune or
kwargs.get("target_dla", False)):` and lacks the explanatory comment present for
the GEMV exclusion; add a short comment immediately above this condition
explaining that conv nodes are only excluded when neither autotune nor
target_dla are active (because autotune/target DLA require keeping convs for
their tuning/DLA support), referencing the `find_nodes_from_convs_to_exclude`
call and `nodes_to_exclude` list so future readers understand why the exclusion
is skipped in both modes.

168-174: 💤 Low value

Update comment to mention target_dla.

The comment at line 172 states "this check will be skipped if Autotune is enabled," but after this change, the check is also skipped when target_dla=True. Please update the comment to reflect both conditions.

📝 Suggested comment update

-        # Note that this check will be skipped if Autotune is enabled as Q/DQ node placements
-        # will be decided according to TensorRT's runtime measurements.
+        # Note that this check will be skipped if Autotune or target_dla is enabled.
+        # Autotune decides Q/DQ placements via runtime measurements; target_dla requires
+        # comprehensive Q/DQ coverage for DLA scale requirements.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/int8.py` around lines 168 - 174, The comment above
the conditional that checks enable_gemv_detection_for_trt (the block starting
with if enable_gemv_detection_for_trt and not (autotune or
kwargs.get("target_dla", False))) is out of date: update the comment to state
that the GEMV/TensorCore check is skipped when either Autotune is enabled or
when target_dla=True. Locate the block that logs "Detecting GEMV patterns for
TRT optimization" and modify the explanatory comment to mention both conditions
(autotune and kwargs.get("target_dla", False)) as reasons the check will be
skipped.

155-156: ⚡ Quick win

Consider documenting the override behavior.

When target_dla=True, this code replaces any user-provided op_types_to_quantize list with all node types from the graph. While this is likely intentional for comprehensive DLA scale coverage, the override happens silently and could surprise users who provided a custom list.

Consider adding a log message or updating the docstring to make this behavior explicit.

💡 Suggested improvement

 if kwargs.get("target_dla", False):
+    logger.info("target_dla=True: quantizing all op types for DLA scale coverage")
     op_types_to_quantize = list({node.op_type for node in onnx_model.graph.node})

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/int8.py` around lines 155 - 156, The code silently
overrides a user-supplied op_types_to_quantize when kwargs.get("target_dla",
False) is true by setting op_types_to_quantize = list({node.op_type for node in
onnx_model.graph.node}); update the function handling these kwargs (the block
that reads target_dla and sets op_types_to_quantize) to either (a) emit a clear
log message (e.g., logger.info/warn) stating that target_dla=True causes
op_types_to_quantize to be replaced with all graph node types, or (b) update the
function docstring to explicitly document this override behavior; implement the
logging approach so callers see the override at runtime and keep the existing
assignment to list({node.op_type for node in onnx_model.graph.node}) when
target_dla is true.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 72715d5e-ad7b-4b45-afdc-fb96b39940a5

📥 Commits

Reviewing files that changed from the base of the PR and between 5584ce4 and 2f10789.

📒 Files selected for processing (4)

CHANGELOG.rst
modelopt/onnx/quantization/__main__.py
modelopt/onnx/quantization/int8.py
modelopt/onnx/quantization/quantize.py

codecov · 2026-06-09T20:19:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.53%. Comparing base (46eddab) to head (9460382).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1661      +/-   ##
==========================================
+ Coverage   67.73%   76.53%   +8.80%     
==========================================
  Files         511      511              
  Lines       56169    56174       +5     
==========================================
+ Hits        38044    42993    +4949     
+ Misses      18125    13181    -4944

Flag	Coverage Δ
examples	`41.83% <0.00%> (+0.52%)`	⬆️
gpu	`57.70% <75.00%> (+25.74%)`	⬆️
unit	`54.38% <100.00%> (+0.04%)`	⬆️

gcunhase · 2026-06-10T18:16:43Z

/claude review

claude

Claude review — --target_dla for ONNX Q/DQ placement.

Findings: CRITICAL: 0, IMPORTANT: 2, SUGGESTION: 3.

Most impactful:

[IMPORTANT Compatibility] in quantize.py:628 — find_nodes_from_mha_to_exclude is skipped solely on target_dla, regardless of quantize_mode. The docstring and CLI help both promise "INT8 only", so combining --target_dla with --quantize_mode=fp8 silently changes FP8 MHA Q/DQ placement (head_size/fMHA-v2 exclusions are bypassed). Either gate the skip on int8 or document the cross-mode effect.
[IMPORTANT Compatibility] in int8.py:155 — when target_dla=True, any user-supplied op_types_to_quantize is silently overwritten with every op_type in the graph, including non-quantizable types (Reshape, Identity, Constant, Cast, …). Suggest logging a warning on override and intersecting with the quantizable set.
Three suggestions: (1) kwargs["target_dla"] propagation is asymmetric vs. the int4 path, (2) test models use unseeded RNG, (3) test only covers conv-exclusion skip; GEMV/MHA skip branches are uncovered.

Risk: low-medium. Feature is opt-in and backward-compatible by default. The two IMPORTANT items are about scope creep and silent overrides under the new flag, both fixable with small guards/log messages.
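
One possible shape for the second IMPORTANT item (warn on override, then intersect with a quantizable set) is sketched below. `QUANTIZABLE_OP_TYPES` is a made-up allow-list for illustration; which op types ModelOpt or DLA actually treat as quantizable is an assumption here, and `resolve_op_types` is a hypothetical helper name.

```python
import logging

logger = logging.getLogger("target_dla_sketch")

# Hypothetical allow-list; the real set of quantizable ops is not given in this thread.
QUANTIZABLE_OP_TYPES = {"Add", "AveragePool", "Conv", "Gemm", "MatMul", "MaxPool", "Mul"}


def resolve_op_types(graph_op_types, user_op_types, target_dla):
    """Pick op types to quantize, warning when a user-supplied list is overridden."""
    if not target_dla:
        return list(user_op_types or [])
    if user_op_types:
        logger.warning(
            "target_dla=True: ignoring op_types_to_quantize=%s in favor of graph op types",
            sorted(user_op_types),
        )
    # Keep only op types that are present in the graph AND considered quantizable.
    return sorted(set(graph_op_types) & QUANTIZABLE_OP_TYPES)


graph_ops = ["Conv", "Reshape", "Mul", "Cast", "Identity"]
print(resolve_op_types(graph_ops, ["Conv"], target_dla=True))
```

Shape-only ops such as `Reshape` and `Cast` drop out of the result, and the caller sees a warning instead of a silent replacement.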

coderabbitai

🧹 Nitpick comments (1)

tests/unit/onnx/quantization/test_qdq_rules_int8.py (1)
288-315: 💤 Low value

Consider clarifying the comment on line 313.

The comment states "only the 1st Conv is quantized" but the test only checks that the first Conv is quantized, without explicitly verifying that the other Conv nodes (dw_conv1, dw_conv2, dw_conv3) are not quantized. If other Convs can also be quantized when target_dla=False, consider updating the comment to be more precise (e.g., "Check that at least the 1st Conv is quantized"). If only the first Conv should be quantized, consider adding assertions to verify the other Convs are not quantized.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/onnx/quantization/test_qdq_rules_int8.py` around lines 288 - 315,
The comment "only the 1st Conv is quantized" is ambiguous because the test only
asserts the first Conv in conv_nodes is quantized; either update that comment to
"Check that at least the 1st Conv is quantized" or add explicit negative
assertions to ensure the remaining Conv nodes are not quantized (e.g., use
assert_nodes_are_not_quantized on conv_nodes[1:] or by referencing the specific
depthwise conv nodes if available) inside test_target_dla so the test clearly
enforces the intended behavior.
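
A self-contained sketch of the stricter variant follows. It uses plain dataclasses instead of ONNX nodes, and it assumes the intended behavior is that only the first Conv is quantized when `target_dla=False`, which this thread does not confirm. `Node`, `is_quantized`, and `check_conv_placement` are hypothetical stand-ins, not the helpers in the test suite.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    op_type: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def is_quantized(node, graph):
    """Treat a node as quantized if any of its inputs is produced by a DequantizeLinear."""
    producers = {out: n for n in graph for out in n.outputs}
    return any(
        inp in producers and producers[inp].op_type == "DequantizeLinear"
        for inp in node.inputs
    )


def check_conv_placement(graph, target_dla):
    """Positive check on the first Conv, explicit check on every remaining Conv."""
    convs = [n for n in graph if n.op_type == "Conv"]
    assert is_quantized(convs[0], graph), "first Conv should be quantized"
    for conv in convs[1:]:
        # Assumed expectation: the remaining Convs are quantized only with target_dla.
        assert is_quantized(conv, graph) == target_dla, conv.name


# Toy graph for the target_dla=False case: only conv0 is fed by a DQ node.
graph = [
    Node("dq_in", "DequantizeLinear", ["x_q"], ["x_dq"]),
    Node("conv0", "Conv", ["x_dq", "w0"], ["c0"]),
    Node("dw_conv1", "Conv", ["c0", "w1"], ["c1"]),
]
check_conv_placement(graph, target_dla=False)
```

Making the negative check explicit means the test fails if the default path starts quantizing the depthwise Convs, rather than passing silently.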

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ea1160ae-9324-42a2-9710-41e3faa9442f

📥 Commits

Reviewing files that changed from the base of the PR and between 2f10789 and 3086897.

📒 Files selected for processing (6)

CHANGELOG.rst
modelopt/onnx/quantization/__main__.py
modelopt/onnx/quantization/int8.py
modelopt/onnx/quantization/quantize.py
tests/_test_utils/onnx/lib_test_models.py
tests/unit/onnx/quantization/test_qdq_rules_int8.py

✅ Files skipped from review due to trivial changes (1)

CHANGELOG.rst

🚧 Files skipped from review as they are similar to previous changes (3)

modelopt/onnx/quantization/__main__.py
modelopt/onnx/quantization/int8.py
modelopt/onnx/quantization/quantize.py

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>

gcunhase · 2026-06-11T15:18:40Z

Claude review — --target_dla for ONNX Q/DQ placement.

Findings: CRITICAL: 0, IMPORTANT: 2, SUGGESTION: 3.

Most impactful:

[IMPORTANT Compatibility] in quantize.py:628 — find_nodes_from_mha_to_exclude is skipped solely on target_dla, regardless of quantize_mode. The docstring and CLI help both promise "INT8 only", so combining --target_dla with --quantize_mode=fp8 silently changes FP8 MHA Q/DQ placement (head_size/fMHA-v2 exclusions are bypassed). Either gate the skip on int8 or document the cross-mode effect.

[IMPORTANT Compatibility] in int8.py:155 — when target_dla=True, any user-supplied op_types_to_quantize is silently overwritten with every op_type in the graph, including non-quantizable types (Reshape, Identity, Constant, Cast, …). Suggest logging a warning on override and intersecting with the quantizable set.

Three suggestions: (1) kwargs["target_dla"] propagation is asymmetric vs. the int4 path, (2) test models use unseeded RNG, (3) test only covers conv-exclusion skip; GEMV/MHA skip branches are uncovered.

Risk: low-medium. Feature is opt-in and backward-compatible by default. The two IMPORTANT items are about scope creep and silent overrides under the new flag, both fixable with small guards/log messages.

All resolved.

gcunhase · 2026-06-11T15:19:00Z

/claude review

coderabbitai

🧹 Nitpick comments (1)

modelopt/onnx/quantization/quantize.py (1)
673-673: target_dla forwarding into FP8 is redundant (FP8 ignores it).
modelopt/onnx/quantization/fp8.py::quantize() accepts **kwargs but only reads enable_gemv_detection_for_trt, op_types_needing_output_quant, and no_quantize_inputs; it does not reference target_dla. So the docstring (“only has effect in INT8 quantization”) matches behavior. Optional: gate kwargs["target_dla"] to only quantize_mode == "int8" for clarity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/quantize.py` at line 673, The assignment of
kwargs["target_dla"] is redundant for FP8 paths because fp8.py::quantize() does
not use target_dla; restrict forwarding by only setting kwargs["target_dla"]
when quantize_mode == "int8" (or the INT8-specific code path) so INT8
quantization receives it but FP8 does not; update the code around the kwargs
population in quantize.py to gate the target_dla assignment on quantize_mode ==
"int8".

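
The gating suggested for `quantize.py` can be sketched as follows. Both functions are hypothetical illustrations of the idea (forward the flag and skip the MHA exclusion only for INT8) and do not reproduce the actual code in the file.

```python
def build_quantizer_kwargs(quantize_mode, target_dla, **extra):
    """Forward target_dla to the downstream quantizer only for INT8."""
    kwargs = dict(extra)
    if target_dla and quantize_mode == "int8":
        kwargs["target_dla"] = True
    return kwargs


def should_skip_mha_exclusion(quantize_mode, target_dla):
    """The MHA exclusion is bypassed only when targeting DLA in INT8 mode."""
    return bool(target_dla) and quantize_mode == "int8"


for mode in ("int8", "fp8"):
    print(mode, build_quantizer_kwargs(mode, target_dla=True),
          should_skip_mha_exclusion(mode, target_dla=True))
```

With this shape, combining the flag with FP8 leaves both the kwargs and the MHA exclusion untouched, which matches the "INT8 only" wording in the docstring and CLI help.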

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 80acee62-eda4-432e-a9ee-7244d22aa3d3

📥 Commits

Reviewing files that changed from the base of the PR and between 3086897 and 9460382.

📒 Files selected for processing (6)

CHANGELOG.rst
modelopt/onnx/quantization/__main__.py
modelopt/onnx/quantization/int8.py
modelopt/onnx/quantization/quantize.py
tests/_test_utils/onnx/lib_test_models.py
tests/unit/onnx/quantization/test_qdq_rules_int8.py

✅ Files skipped from review due to trivial changes (1)

CHANGELOG.rst

🚧 Files skipped from review as they are similar to previous changes (3)

modelopt/onnx/quantization/main.py
modelopt/onnx/quantization/int8.py
tests/_test_utils/onnx/lib_test_models.py

cjluo-nv

Bot review — DM the bot to share feedback.

Adds a --target_dla option to the ONNX quantization path that places Q/DQ nodes on all tensors for optimal DLA INT8 deployment. The change is small (+301/-12), cohesive, and backward compatible: target_dla defaults to False, preserving existing behavior, and the three guarded branches (GEMV/MatMul exclusion in int8.py, conv-exclusion, and MHA exclusion in quantize.py) are only bypassed when target_dla=True and quantize_mode == "int8".

Correctness checks:

target_dla is plumbed through quantize.py → quantize_int8 via kwargs and read with kwargs.get("target_dla", False). It's also injected into the fp8 path's kwargs, but fp8.quantize accepts **kwargs and ignores it, and the docstring/help correctly state the flag only affects INT8 — so no functional impact there.
When target_dla=True and no op_types_to_quantize is supplied, the code sets it to the full set of model op types. This correctly interacts with configure_ort, which only removes non-quantizable ops (Relu, Sigmoid, etc.) from the QDQ registry if they're NOT in op_types_to_quantize — so including all op types is consistent with DLA's "quantize everything" intent.

Tests: Two new parametrized tests (target_dla=False/True) cover both the Conv/Mul quantization expansion and the GEMV (MatMul m=1) exclusion bypass, with dedicated model builders. Assertions are meaningful and verify the QDQ placement differences between modes.

Docstring, --target_dla CLI help, and CHANGELOG are all updated. No licensing files or headers touched. No prompt-injection attempts in the PR content.

ajrasane · 2026-06-12T14:28:01Z

Could you check what the accuracy impact is if we enable quantizers for all layers?

gcunhase requested a review from a team as a code owner June 9, 2026 20:09

gcunhase requested a review from galagam June 9, 2026 20:09

gcunhase requested review from ajrasane and cjluo-nv and removed request for galagam June 9, 2026 20:09

coderabbitai Bot reviewed Jun 9, 2026

coderabbitai Bot approved these changes Jun 9, 2026

gcunhase force-pushed the dev/gcunhasergio/6241485_dla_calib branch from b04c9e2 to 3086897 Compare June 10, 2026 18:17