Add Qwen3-VL model support + multi-image input support in Qwen VL family #2345
Merged
xiaoyu-work merged 30 commits into microsoft:main on Mar 20, 2026
Conversation
- graph_surgeries.py: add QwenVL-specific graph surgery passes for vision embedding merge and positional encoding fixup
- rtn_quantization.py: extend RTN quantization for multimodal models, handle vision encoder exclusion patterns
- cast_chain_elimination.py: new pass to eliminate redundant Cast chains in Dynamo-exported models (fp32->fp16->fp32 patterns)
- olive_config.json: register new passes
…surgery passes
- rtn_quantization.py: Parameterize bits through quantization methods to support 8-bit Gather
- common.py: Fix ByteSize() crash for >2GB models, fix FOLDED_FROM_KEY import
- graph_surgeries.py: Add ReciprocalMulToDiv, DeduplicateSubgraphInitializers, DeduplicateNodes
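The idea behind `DeduplicateSubgraphInitializers` can be sketched in plain Python: keep one copy of each distinct tensor payload and record how duplicate names map to the surviving one. This is an illustrative sketch over `(name, bytes)` pairs, not Olive's actual implementation, which operates on ONNX `TensorProto`s; the function name is an assumption.

```python
def deduplicate_initializers(initializers):
    """Keep one copy of each distinct payload; return (kept, rename_map).

    `initializers` is a list of (name, payload_bytes) pairs. Consumers of a
    dropped duplicate would then be rewired via `rename_map`.
    """
    seen = {}    # payload -> canonical name
    rename = {}  # duplicate name -> canonical name
    kept = []
    for name, payload in initializers:
        if payload in seen:
            rename[name] = seen[payload]
        else:
            seen[payload] = name
            kept.append((name, payload))
    return kept, rename
```

In a real pass the rename map would be applied to every node input referencing a dropped initializer before the duplicates are removed from the graph.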
…author (TD002), fix formatting
- Apply ruff format to 4 files (cast_chain_elimination.py, rtn_quantization.py, test_graph_surgeries.py, test_rtn_quantization.py)
- Fix _pack_int8_to_int4 reshape bug: replace global flatten+pack with axis-aware _pack_int4_along_axis that correctly packs zero_point when k_blocks is small (e.g. 1), avoiding ValueError on reshape
- Fix test_rtn_quantization_pass_gather assertion: GatherBlockQuantized always uses quantize_axis=data_rank-1, not pass_config['axis']
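The axis-aware packing fix described above can be illustrated with a small NumPy sketch: pairs of 4-bit values are packed into single bytes along a chosen axis, padding with a zero nibble when the axis length is odd (the small-k_blocks case that crashed the global flatten+pack). The function name and signature here are assumptions based on the commit message, not the actual Olive helper.

```python
import numpy as np

def pack_int4_along_axis(arr, axis=-1):
    """Pack pairs of uint4 values (stored one-per-uint8) into bytes along `axis`."""
    arr = np.asarray(arr, dtype=np.uint8)
    if arr.shape[axis] % 2 != 0:
        # pad with a zero nibble so the axis length is even (handles k_blocks == 1)
        pad = [(0, 0)] * arr.ndim
        pad[axis] = (0, 1)
        arr = np.pad(arr, pad)
    lo = np.take(arr, np.arange(0, arr.shape[axis], 2), axis=axis)
    hi = np.take(arr, np.arange(1, arr.shape[axis], 2), axis=axis)
    # low nibble from even positions, high nibble from odd positions
    return (lo & 0x0F) | (hi << 4)
```

Packing per axis rather than over a global flatten keeps the output shape consistent with the blockwise layout, so a reshape afterwards cannot fail on a short axis.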
The upstream tuning_strategies.md page no longer exists, causing the Sphinx linkcheck to fail with -W (warnings-as-errors).
devang-ml
reviewed
Mar 13, 2026
Address PR review feedback from @devang-ml and @justinchuby: use onnxscript.optimizer.optimize() instead of ORT InferenceSession with session.enable_cast_chain_elimination to eliminate redundant Cast chains.
- Remove onnxruntime dependency from cast_chain_elimination pass
- Use onnxscript.optimizer.optimize() with TypeInferenceError fallback (same pattern as OnnxPeepholeOptimizer)
- Update test comment to reflect onnxscript optimizer
- Verified: numerically identical outputs (0.00 max abs diff)
- Verified: no eval regression (69% on AI2D 100 samples)
Resolve conflict in olive/passes/onnx/common.py: take upstream fix from PR microsoft#2355 (ByteSize EncodeError handling).
justinchuby
reviewed
Mar 13, 2026
justinchuby
reviewed
Mar 13, 2026
…n elimination
Use a custom CastCastRoundTrip rewrite rule instead of the full onnxscript.optimizer.optimize() call. The rewrite rule specifically targets round-trip Cast chains (e.g. fp32->fp16->fp32) by checking that the final cast type matches the original input type, and replaces them with Identity. This is simpler, faster, and avoids the TypeInferenceError fallback that was needed with the full optimizer. The onnxscript rewrite() function also runs RemoveUnusedNodesPass and RemoveUnusedOpsetsPass automatically.
Validated: weights identical, 0.00 max abs diff, eval 69% unchanged.
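The round-trip check this commit describes can be sketched in plain Python over a toy node representation (dicts instead of ONNX NodeProtos; names and shapes here are illustrative assumptions, not Olive's or onnxscript's actual API):

```python
def eliminate_cast_round_trips(nodes, value_types):
    """Replace Cast(T->X) -> Cast(X->T) chains with a single Identity.

    `nodes`: list of dicts with keys "op", "input", "output", and "to" for Casts.
    `value_types`: maps tensor names to element type strings (e.g. "fp32").
    """
    producers = {n["output"]: n for n in nodes}
    result = []
    for node in nodes:
        if node["op"] == "Cast":
            prev = producers.get(node["input"])
            if (
                prev is not None
                and prev["op"] == "Cast"
                # the final cast restores the type the chain started from
                and node["to"] == value_types.get(prev["input"])
            ):
                result.append({"op": "Identity", "input": prev["input"], "output": node["output"]})
                continue
        result.append(node)
    return result
```

Note the now-dead first Cast is deliberately left in place; as the commit points out, a dead-node cleanup (RemoveUnusedNodesPass in onnxscript's rewrite()) removes it afterwards.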
devang-ml
reviewed
Mar 16, 2026
devang-ml
reviewed
Mar 16, 2026
Move _ensure_com_microsoft_opset and eliminate_cast_chains into ModelOptimizer class. Add fix_com_microsoft_opset and cast_chain_elimination config flags to OnnxPeepholeOptimizer. Remove standalone OnnxCastChainElimination pass, its olive_config entry, and its test file. Move tests into test_peephole_optimizer.py. Per devang-ml's review: consolidate into existing pass to avoid introducing a new one.
Add onnxscript_optimize, onnxoptimizer_optimize, and fuse_reshape_operations config flags (default True for backward compatibility). This allows recipe configs to disable the default optimizations and only run opset fixup + cast chain elimination, producing byte-identical models to the old standalone pass.
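Under those flags, a recipe that wants only the opset fixup and cast-chain elimination might look like the following sketch. The key names are taken from the flag names above; the exact pass-config schema is an assumption and should be checked against Olive's documentation.

```json
{
  "type": "OnnxPeepholeOptimizer",
  "onnxscript_optimize": false,
  "onnxoptimizer_optimize": false,
  "fuse_reshape_operations": false,
  "fix_com_microsoft_opset": true,
  "cast_chain_elimination": true
}
```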
devang-ml
previously approved these changes
Mar 18, 2026
Why this is needed: ORT's ``convert_float_to_float16`` (``float16.py``) may insert identical ``Cast`` nodes in parallel branches that each declare the same output tensor
Collaborator
Would it make sense to fix convert_float_to_float16 itself?
Contributor
Author
/azp run Olive CI
Commenter does not have sufficient privileges for PR 2345 in repo microsoft/Olive |
Contributor
Pull request overview
This PR extends Olive’s ONNX optimization/quantization pipeline to better support Qwen VL-family exports by adding new ONNX graph-surgery utilities, enhancing RTN quantization (notably Gather + shared weights + initializer cleanup), and expanding the peephole optimizer with optional cast-chain elimination and com.microsoft opset fixups.
Changes:
- Enhanced OnnxBlockWiseRtnQuantization to support Gather 8-bit quantization, handle shared-weight initializers, and remove unused initializers post-quantization.
- Added new GraphSurgeries proto-level surgeons: GemmToMatMulAdd, ReciprocalMulToDiv, DeduplicateSubgraphInitializers, and DeduplicateNodes, plus corresponding tests.
- Extended OnnxPeepholeOptimizer with configurable optimization steps, an opset fix-up helper, and a cast-chain elimination rewrite rule, with new unit tests.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| olive/passes/onnx/rtn_quantization.py | Adds Gather 8-bit support, shared initializer de-duping, and unused-initializer cleanup for RTN quantization. |
| olive/passes/onnx/peephole_optimizer.py | Adds optional com.microsoft opset fixup and cast-chain elimination; makes optimizer steps configurable. |
| olive/passes/onnx/graph_surgeries.py | Introduces several new proto-based graph surgery passes for compatibility and cleanup. |
| olive/passes/onnx/common.py | Adds compatibility fallback for FOLDED_FROM_KEY import. |
| test/passes/onnx/test_rtn_quantization.py | Expands RTN quantization tests for Gather 8-bit, axis forcing, shared weights, and initializer cleanup. |
| test/passes/onnx/test_peephole_optimizer.py | Adds unit tests for opset fixup and cast-chain elimination behavior. |
| test/passes/onnx/test_graph_surgeries.py | Adds tests validating new graph surgery passes and numerical correctness where applicable. |
CI uses an ORT version that supports max IR version 11, but newer ONNX packages default to IR version 13. Pin to 10 to match the convention used by existing tests.
… assert
- GemmToMatMulAdd: create new transposed initializer instead of mutating shared one in-place; use base_name fallback for empty node.name to avoid duplicate tensor names.
- ReciprocalMulToDiv: build consumer map upfront to avoid O(N^2) graph scans; re-check actual inputs for stale consumer references.
- test_rtn_quantization: add found assertion in test_gather_quantize_axis_forced_to_last_dim.
Validated: 0.00 max abs diff, eval 69% unchanged.
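The "consumer map upfront" optimization mentioned above is a standard trick: one pass over the graph builds a tensor-name-to-consumers index, so each pattern match is a dictionary lookup instead of a rescan of every node. A minimal sketch over dict-based toy nodes (not Olive's actual code):

```python
from collections import defaultdict

def build_consumer_map(nodes):
    """Map each tensor name to the list of nodes that consume it.

    `nodes` is a list of dicts with an "inputs" key listing tensor names.
    Built once, this turns per-match O(N) consumer scans into O(1) lookups.
    """
    consumers = defaultdict(list)
    for node in nodes:
        for name in node["inputs"]:
            consumers[name].append(node)
    return dict(consumers)
```

Since rewrites mutate the graph after the map is built, entries can go stale; hence the commit's note about re-checking a candidate node's actual inputs before applying a match.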
xiaoyu-work
approved these changes
Mar 20, 2026
This PR adds support for exporting and optimizing Qwen3-VL (and Qwen2.5-VL) vision-language models through Olive, including new ONNX graph surgery passes, 8-bit quantization enhancements, and a cast chain elimination pass.