
Add Qwen3-VL model support + multi-image input support in Qwen VL family#2345

Merged
xiaoyu-work merged 30 commits into microsoft:main from hanbitmyths:sunghcho/qwen3-vl
Mar 20, 2026

Conversation

Contributor

@hanbitmyths commented Mar 4, 2026

This PR adds support for exporting and optimizing Qwen3-VL (and Qwen2.5-VL) vision-language models through Olive, including new ONNX graph surgery passes, 8-bit quantization enhancements, and a cast chain elimination pass.

  • Add Qwen3-VL / Qwen2.5-VL model export support via Model Builder and torch export
  • New pass: CastChainElimination removes redundant Cast→Cast chains (e.g., fp32→fp16→fp32) by collapsing them into a single Cast or eliminating them entirely when source and target types match.
  • GemmToMatMulAdd graph surgery converts Gemm nodes to MatMul+Add for broader runtime compatibility.
  • ReciprocalMulToDiv graph surgery fuses Reciprocal→Mul patterns into a single Div node.
  • DeduplicateSubgraphInitializers graph surgery merges duplicate initializers that share identical tensor data.
  • DeduplicateNodes graph surgery removes duplicate nodes that have identical op_type, attributes, and inputs.
  • Add 8-bit integer Gather quantization support to the RTN quantization pass.
  • Skip quantization of unused initializers.
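
The Gemm rewrite in the list above rests on the identity Gemm(A, B, C) = alpha·(A·B) + beta·C; with the default alpha = beta = 1 a Gemm node is exactly a MatMul followed by an Add, which is why the split is numerically safe. A minimal pure-Python sketch of that equivalence (illustrative only, not Olive's implementation):

```python
def matmul(a, b):
    # naive matrix multiply over 2-D lists
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, c):
    return [[a[i][j] + c[i][j] for j in range(len(a[0]))] for i in range(len(a))]

def gemm(a, b, c, alpha=1.0, beta=1.0):
    # ONNX Gemm semantics (transA/transB omitted for brevity)
    ab = matmul(a, b)
    return [[alpha * ab[i][j] + beta * c[i][j] for j in range(len(ab[0]))]
            for i in range(len(ab))]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.5, 0.5], [0.5, 0.5]]

# with alpha == beta == 1, Gemm decomposes into MatMul + Add
assert gemm(A, B, C) == add(matmul(A, B), C)
```

Runtimes that lack a fused Gemm kernel (or restrict its attribute combinations) can always run the MatMul+Add pair, which is what makes the rewrite useful for compatibility.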

- graph_surgeries.py: add QwenVL-specific graph surgery passes for
  vision embedding merge and positional encoding fixup
- rtn_quantization.py: extend RTN quantization for multimodal models,
  handle vision encoder exclusion patterns
- cast_chain_elimination.py: new pass to eliminate redundant Cast chains
  in Dynamo-exported models (fp32->fp16->fp32 patterns)
- olive_config.json: register new passes
…surgery passes

- rtn_quantization.py: Parameterize bits through quantization methods to support 8-bit Gather
- common.py: Fix ByteSize() crash for >2GB models, fix FOLDED_FROM_KEY import
- graph_surgeries.py: Add ReciprocalMulToDiv, DeduplicateSubgraphInitializers, DeduplicateNodes
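
For context on the 8-bit Gather change: RTN (round-to-nearest) quantization maps each weight block to integers with one scale per block, and the Gather path applies the same idea to embedding tables. A deliberately simplified symmetric-int8 sketch of the core step (hypothetical helper names, not the pass itself):

```python
def rtn_quantize_int8(weights):
    # symmetric round-to-nearest: one scale per block, no zero point
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.1, -0.5, 0.25, 1.27]
q, s = rtn_quantize_int8(w)
w_hat = dequantize(q, s)

# round-to-nearest bounds the per-weight error by half a quantization step
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```

The real pass additionally handles blocking along a chosen axis, shared initializers, and the Gather-specific quantize_axis convention noted later in this thread.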
hanbitmyths and others added 4 commits March 3, 2026 22:55
- Apply ruff format to 4 files (cast_chain_elimination.py,
  rtn_quantization.py, test_graph_surgeries.py, test_rtn_quantization.py)
- Fix _pack_int8_to_int4 reshape bug: replace global flatten+pack with
  axis-aware _pack_int4_along_axis that correctly packs zero_point when
  k_blocks is small (e.g. 1), avoiding ValueError on reshape
- Fix test_rtn_quantization_pass_gather assertion: GatherBlockQuantized
  always uses quantize_axis=data_rank-1, not pass_config['axis']
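
The reshape bug described above comes from flattening before packing: a global flatten pairs nibbles across row boundaries, which breaks when the packing axis is short (k_blocks == 1). A pure-Python illustration of axis-aware int4 packing (hypothetical names, not the actual `_pack_int4_along_axis` helper):

```python
def pack_int4_along_last_axis(rows):
    # rows: 2-D list of unsigned 4-bit values (0..15)
    packed = []
    for row in rows:
        if len(row) % 2:
            row = row + [0]  # pad odd rows; never borrow from the next row
        packed.append([(row[i] & 0xF) | ((row[i + 1] & 0xF) << 4)
                       for i in range(0, len(row), 2)])
    return packed

# three rows of a single 4-bit zero point each (k_blocks == 1):
# a global flatten would pair 1 with 2 and leave 3 dangling;
# per-row packing keeps each value in its own byte
zp = [[1], [2], [3]]
assert pack_int4_along_last_axis(zp) == [[0x01], [0x02], [0x03]]
```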
The upstream tuning_strategies.md page no longer exists, causing the
Sphinx linkcheck to fail with -W (warnings-as-errors).
Address PR review feedback from @devang-ml and @justinchuby: use
onnxscript.optimizer.optimize() instead of ORT InferenceSession with
session.enable_cast_chain_elimination to eliminate redundant Cast chains.

- Remove onnxruntime dependency from cast_chain_elimination pass
- Use onnxscript.optimizer.optimize() with TypeInferenceError fallback
  (same pattern as OnnxPeepholeOptimizer)
- Update test comment to reflect onnxscript optimizer
- Verified: numerically identical outputs (0.00 max abs diff)
- Verified: no eval regression (69% on AI2D 100 samples)
Resolve conflict in olive/passes/onnx/common.py: take upstream fix
from PR microsoft#2355 (ByteSize EncodeError handling).
…n elimination

Use a custom CastCastRoundTrip rewrite rule instead of the full
onnxscript.optimizer.optimize() call. The rewrite rule specifically
targets round-trip Cast chains (e.g. fp32->fp16->fp32) by checking
that the final cast type matches the original input type, and replaces
them with Identity.

This is simpler, faster, and avoids the TypeInferenceError fallback
that was needed with the full optimizer. The onnxscript rewrite()
function also runs RemoveUnusedNodesPass and RemoveUnusedOpsetsPass
automatically.

Validated: weights identical, 0.00 max abs diff, eval 69% unchanged.
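
The rule described above can be sketched at the graph level: walk Cast nodes, and when a Cast consumes another Cast whose input already has the final target type, the pair is a round trip and collapses to an Identity. A toy model using plain dicts (an assumption for illustration; the real pass uses the onnxscript rewriter, not this representation):

```python
def eliminate_roundtrip_casts(nodes, value_types):
    # nodes: [{'op': 'Cast', 'input': name, 'output': name, 'to': dtype}, ...]
    # value_types: dtype of each graph input before any casting
    producers = {n['output']: n for n in nodes}
    out = []
    for n in nodes:
        up = producers.get(n.get('input'))
        if (n['op'] == 'Cast' and up is not None and up['op'] == 'Cast'
                and n['to'] == value_types.get(up['input'])):
            # e.g. fp32 -> fp16 -> fp32: replace the round trip with Identity
            out.append({'op': 'Identity', 'input': up['input'],
                        'output': n['output']})
        else:
            out.append(n)
    return out

nodes = [
    {'op': 'Cast', 'input': 'x', 'output': 'x16', 'to': 'fp16'},
    {'op': 'Cast', 'input': 'x16', 'output': 'x32', 'to': 'fp32'},
]
new = eliminate_roundtrip_casts(nodes, {'x': 'fp32'})
assert new[1] == {'op': 'Identity', 'input': 'x', 'output': 'x32'}
```

Note the first Cast is left in place here; if nothing else consumes it, a dead-node cleanup removes it, matching the commit's point that onnxscript's rewrite() runs RemoveUnusedNodesPass automatically.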
Move _ensure_com_microsoft_opset and eliminate_cast_chains into
ModelOptimizer class. Add fix_com_microsoft_opset and
cast_chain_elimination config flags to OnnxPeepholeOptimizer.

Remove standalone OnnxCastChainElimination pass, its olive_config
entry, and its test file. Move tests into test_peephole_optimizer.py.

Per devang-ml's review: consolidate into existing pass to avoid
introducing a new one.
Add onnxscript_optimize, onnxoptimizer_optimize, and
fuse_reshape_operations config flags (default True for backward
compatibility). This allows recipe configs to disable the default
optimizations and only run opset fixup + cast chain elimination,
producing byte-identical models to the old standalone pass.
devang-ml previously approved these changes Mar 18, 2026

Why this is needed:
ORT's ``convert_float_to_float16`` (``float16.py``) may insert identical
``Cast`` nodes in parallel branches that each declare the same output tensor
Collaborator


Would it make sense to fix convert_float_to_float16 itself?

@hanbitmyths
Contributor Author

/azp run Olive CI

@azure-pipelines

Commenter does not have sufficient privileges for PR 2345 in repo microsoft/Olive

Copilot AI review requested due to automatic review settings March 19, 2026 00:53
Contributor

Copilot AI left a comment


Pull request overview

This PR extends Olive’s ONNX optimization/quantization pipeline to better support Qwen VL-family exports by adding new ONNX graph-surgery utilities, enhancing RTN quantization (notably Gather + shared weights + initializer cleanup), and expanding the peephole optimizer with optional cast-chain elimination and com.microsoft opset fixups.

Changes:

  • Enhanced OnnxBlockWiseRtnQuantization to support Gather 8-bit quantization, handle shared-weight initializers, and remove unused initializers post-quantization.
  • Added new GraphSurgeries proto-level surgeons: GemmToMatMulAdd, ReciprocalMulToDiv, DeduplicateSubgraphInitializers, and DeduplicateNodes, plus corresponding tests.
  • Extended OnnxPeepholeOptimizer with configurable optimization steps, an opset fix-up helper, and a cast-chain elimination rewrite rule, with new unit tests.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Summary per file:

  • olive/passes/onnx/rtn_quantization.py: Adds Gather 8-bit support, shared initializer de-duping, and unused-initializer cleanup for RTN quantization.
  • olive/passes/onnx/peephole_optimizer.py: Adds optional com.microsoft opset fixup and cast-chain elimination; makes optimizer steps configurable.
  • olive/passes/onnx/graph_surgeries.py: Introduces several new proto-based graph surgery passes for compatibility and cleanup.
  • olive/passes/onnx/common.py: Adds compatibility fallback for FOLDED_FROM_KEY import.
  • test/passes/onnx/test_rtn_quantization.py: Expands RTN quantization tests for Gather 8-bit, axis forcing, shared weights, and initializer cleanup.
  • test/passes/onnx/test_peephole_optimizer.py: Adds unit tests for opset fixup and cast-chain elimination behavior.
  • test/passes/onnx/test_graph_surgeries.py: Adds tests validating new graph surgery passes and numerical correctness where applicable.

CI uses an ORT version that supports max IR version 11, but newer
ONNX packages default to IR version 13. Pin to 10 to match the
convention used by existing tests.
… assert

- GemmToMatMulAdd: create new transposed initializer instead of
  mutating shared one in-place; use base_name fallback for empty
  node.name to avoid duplicate tensor names.
- ReciprocalMulToDiv: build consumer map upfront to avoid O(N^2)
  graph scans; re-check actual inputs for stale consumer references.
- test_rtn_quantization: add found assertion in
  test_gather_quantize_axis_forced_to_last_dim.

Validated: 0.00 max abs diff, eval 69% unchanged.
@xiaoyu-work xiaoyu-work enabled auto-merge (squash) March 20, 2026 22:51
@xiaoyu-work xiaoyu-work merged commit f56d223 into microsoft:main Mar 20, 2026
15 checks passed