Skip to content

fix(vl): reduce multimodal feature memory use#4603

Merged
lvhan028 merged 9 commits into
InternLM:mainfrom
CUHKSZzxy:fix/vlm-mm-feature-dtype-release
May 26, 2026
Merged

fix(vl): reduce multimodal feature memory use#4603
lvhan028 merged 9 commits into
InternLM:mainfrom
CUHKSZzxy:fix/vlm-mm-feature-dtype-release

Conversation

@CUHKSZzxy
Copy link
Copy Markdown
Collaborator

@CUHKSZzxy CUHKSZzxy commented May 20, 2026

Summary

  • Resolve multimodal feature dtype from the original Transformers config inside ImageEncoder, including nested text/LLM configs used by recent VLM families.
  • Cast floating multimodal processor feature tensors through VisionModel._postprocess_mm_output() before expansion to reduce feature memory overhead for bf16/fp16 VLM configs.
  • Keep dtype selection independent of backend engine config access, so the logic is shared by PyTorch and TurboMind VL paths.
  • Leave temporary reference-clearing cleanup and VL preprocess timing logs out of this PR; those can be handled separately.

Validation

  • Syntax checks for the touched VL serving and MP-engine modules passed.
  • Diff whitespace check passed.
  • qwen3.5 VLM single-image pipeline smoke passed, including bf16 multimodal feature dtype resolution.

Notes

  • No new unit tests were added.

Assistance

Assisted with Codex + GPT-5.5 xHigh

@CUHKSZzxy CUHKSZzxy marked this pull request as ready for review May 21, 2026 11:04
Copilot AI review requested due to automatic review settings May 21, 2026 11:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces memory pressure in the VL (multimodal) serving path by aligning multimodal feature tensor dtypes with the resolved PyTorch model dtype, and by dropping large multimodal references earlier after handoff through the scheduler/RPC layers.

Changes:

  • Cast floating multimodal processor outputs (e.g., pixel_values) to the resolved model dtype during VL preprocessing.
  • Drop large multimodal/RPC payload references earlier in async serving and MP-engine RPC to lower peak memory.
  • Expose MP-engine model_config to enable VL dtype selection, and add timing logs + focused tests for dtype handling.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/test_lmdeploy/test_vl/test_mm_feature_dtype.py Adds tests for casting only floating MM tensors + MP engine model_config exposure.
lmdeploy/vl/model/base.py Introduces MM feature dtype normalization/casting during preprocessing.
lmdeploy/vl/engine.py Adds mm_feature_dtype plumbing into ImageEncoder and logs preprocess duration.
lmdeploy/serve/processors/multimodal.py Threads request_id into VL preprocessing calls (but has a positional-arg bug).
lmdeploy/serve/core/vl_async_engine.py Picks resolved model dtype from engine model_config and passes to ImageEncoder.
lmdeploy/serve/core/async_engine.py Drops multimodal from kwargs after generator creation; passes request_id into prompt processing.
lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py Clears large RPC payload references (e.g., multimodal, pickled blobs) after handoff.
lmdeploy/pytorch/engine/mp_engine/base.py Exposes model_config and drops multimodal from streaming kwargs.
lmdeploy/pytorch/engine/mp_engine/base_worker.py Adds worker RPC method to return resolved model_config.
lmdeploy/pytorch/engine/engine_instance.py Clears local references to msg/multimodal after enqueueing request.
Comments suppressed due to low confidence (1)

lmdeploy/serve/processors/multimodal.py:406

  • Same positional-argument issue as above: vl_encoder.preprocess(messages, mm_processor_kwargs, ...) binds mm_processor_kwargs to input_prompt. This will fail for models that use the new preprocess API. Use keyword arguments (mm_processor_kwargs=...) or explicitly pass input_prompt=None and keep mm_processor_kwargs as the third arg.
            else:
                results = await self.vl_encoder.preprocess(messages, mm_processor_kwargs, request_id=request_id)
                results = await self.vl_encoder.wrap_for_pytorch(messages=results,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lmdeploy/serve/processors/multimodal.py
Comment thread lmdeploy/vl/engine.py Outdated
Comment thread lmdeploy/serve/core/async_engine.py Outdated
Comment thread lmdeploy/serve/processors/multimodal.py Outdated
Comment thread lmdeploy/serve/processors/multimodal.py Outdated
Comment thread lmdeploy/serve/processors/multimodal.py Outdated
Comment thread lmdeploy/serve/processors/multimodal.py Outdated
Comment thread lmdeploy/serve/core/vl_async_engine.py Outdated
Comment on lines +37 to +39
if backend == 'pytorch':
model_config = getattr(self.engine, 'model_config', None)
mm_feature_dtype = getattr(model_config, 'dtype', None)
Copy link
Copy Markdown
Collaborator

@lvhan028 lvhan028 May 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get dtype from the original Transformers config in the class ImageEncoder to benefit both engines?

Comment thread tests/test_lmdeploy/test_vl/test_mm_feature_dtype.py Outdated
@lvhan028 lvhan028 requested a review from grimoire May 25, 2026 12:15
Comment thread lmdeploy/vl/model/base.py Outdated
"""Cast floating processor-output tensors to the target model dtype."""
if not isinstance(target_dtype, torch.dtype):
return output
if not torch.empty((), dtype=target_dtype).is_floating_point():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtype has field is_float_point

@lvhan028 lvhan028 merged commit 92a62a9 into InternLM:main May 26, 2026
5 checks passed
@CUHKSZzxy CUHKSZzxy deleted the fix/vlm-mm-feature-dtype-release branch May 26, 2026 07:32
lvhan028 pushed a commit to lvhan028/lmdeploy that referenced this pull request May 27, 2026
* fix(vl): reduce multimodal feature memory use

* debug: log vl preprocess duration

* fix: address multimodal preprocess review comments

* test: remove multimodal preprocess regression test

* fix comments

* fix: resolve vl mm feature dtype from hf config

* chore: remove redundant multimodal cleanup

* chore: defer vl preprocess timing logs

* chore: simplify vl dtype checks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants