fix(vl): reduce multimodal feature memory use by CUHKSZzxy · Pull Request #4603 · InternLM/lmdeploy

CUHKSZzxy · 2026-05-20T08:53:41Z

Summary

Resolve multimodal feature dtype from the original Transformers config inside ImageEncoder, including nested text/LLM configs used by recent VLM families.
Cast floating multimodal processor feature tensors through VisionModel._postprocess_mm_output() before expansion to reduce feature memory overhead for bf16/fp16 VLM configs.
Keep dtype selection independent of backend engine config access, so the logic is shared by PyTorch and TurboMind VL paths.
Leave temporary reference-clearing cleanup and VL preprocess timing logs out of this PR; those can be handled separately.
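A minimal sketch of the dtype lookup described in the summary, assuming the config exposes the dtype under `dtype` or `torch_dtype` and nests the language-model config under `text_config` or `llm_config`. The helper name `resolve_mm_feature_dtype`, the field lists, and the lookup order are illustrative assumptions, not the PR's actual code. Dtypes are plain strings here so the example runs without PyTorch.

```python
from types import SimpleNamespace

# Assumed attribute names; the PR's real lookup order may differ.
DTYPE_FIELDS = ('dtype', 'torch_dtype')
NESTED_CONFIGS = ('text_config', 'llm_config')


def resolve_mm_feature_dtype(hf_config):
    """Return the first dtype found on the config or on a nested LLM config."""
    # Check the top-level config first, then each nested config that exists.
    candidates = [hf_config] + [getattr(hf_config, name, None) for name in NESTED_CONFIGS]
    for cfg in candidates:
        if cfg is None:
            continue
        for field in DTYPE_FIELDS:
            value = getattr(cfg, field, None)
            if value is not None:
                return value
    # No dtype found anywhere: the caller should skip casting.
    return None


# Stand-ins for Transformers config objects.
flat = SimpleNamespace(torch_dtype='float16')
nested = SimpleNamespace(torch_dtype=None, text_config=SimpleNamespace(dtype='bfloat16'))
missing = SimpleNamespace()

flat_dtype = resolve_mm_feature_dtype(flat)
nested_dtype = resolve_mm_feature_dtype(nested)
missing_dtype = resolve_mm_feature_dtype(missing)
```

Because the lookup reads only the Transformers config, it does not depend on which backend engine is in use.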

Validation

Syntax checks for the touched VL serving and MP-engine modules passed.
Diff whitespace check passed.
Qwen3.5 VLM single-image pipeline smoke test passed, including bf16 multimodal feature dtype resolution.

Notes

No new unit tests were added.

Assistance

Assisted with Codex + GPT-5.5 xHigh

Copilot

Pull request overview

This PR reduces memory pressure in the VL (multimodal) serving path by aligning multimodal feature tensor dtypes with the resolved PyTorch model dtype, and by dropping large multimodal references earlier after handoff through the scheduler/RPC layers.

Changes:

Cast floating multimodal processor outputs (e.g., pixel_values) to the resolved model dtype during VL preprocessing.
Drop large multimodal/RPC payload references earlier in async serving and MP-engine RPC to lower peak memory.
Expose MP-engine model_config to enable VL dtype selection, and add timing logs + focused tests for dtype handling.
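To make the memory effect of the dtype cast concrete, here is a small arithmetic sketch. The feature shape is a made-up example, not a measurement from this PR; the per-element sizes are the standard widths of these dtypes.

```python
import math

# Bytes per element for the dtypes involved.
BYTES_PER_ELEMENT = {'float32': 4, 'bfloat16': 2, 'float16': 2}


def feature_bytes(shape, dtype):
    """Return the raw storage size of a dense tensor with this shape and dtype."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]


# Hypothetical processor output: 4096 patches with 1176 values each.
shape = (4096, 1176)
fp32_bytes = feature_bytes(shape, 'float32')
bf16_bytes = feature_bytes(shape, 'bfloat16')
saving_ratio = bf16_bytes / fp32_bytes
```

Casting a float32 feature tensor to bf16 or fp16 halves its storage, which matters most when many image features are held at once before expansion.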

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| tests/test_lmdeploy/test_vl/test_mm_feature_dtype.py | Adds tests for casting only floating MM tensors + MP engine model_config exposure. |
| lmdeploy/vl/model/base.py | Introduces MM feature dtype normalization/casting during preprocessing. |
| lmdeploy/vl/engine.py | Adds `mm_feature_dtype` plumbing into `ImageEncoder` and logs preprocess duration. |
| lmdeploy/serve/processors/multimodal.py | Threads `request_id` into VL preprocessing calls (but has a positional-arg bug). |
| lmdeploy/serve/core/vl_async_engine.py | Picks resolved model dtype from engine `model_config` and passes to `ImageEncoder`. |
| lmdeploy/serve/core/async_engine.py | Drops `multimodal` from kwargs after generator creation; passes `request_id` into prompt processing. |
| lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py | Clears large RPC payload references (e.g., `multimodal`, pickled blobs) after handoff. |
| lmdeploy/pytorch/engine/mp_engine/base.py | Exposes `model_config` and drops `multimodal` from streaming kwargs. |
| lmdeploy/pytorch/engine/mp_engine/base_worker.py | Adds worker RPC method to return resolved `model_config`. |
| lmdeploy/pytorch/engine/engine_instance.py | Clears local references to `msg`/`multimodal` after enqueueing request. |

Comments suppressed due to low confidence (1)

lmdeploy/serve/processors/multimodal.py:406

Same positional-argument issue as above: vl_encoder.preprocess(messages, mm_processor_kwargs, ...) binds mm_processor_kwargs to input_prompt. This will fail for models that use the new preprocess API. Use keyword arguments (mm_processor_kwargs=...) or explicitly pass input_prompt=None and keep mm_processor_kwargs as the third arg.

            else:
                results = await self.vl_encoder.preprocess(messages, mm_processor_kwargs, request_id=request_id)
                results = await self.vl_encoder.wrap_for_pytorch(messages=results,
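
The positional-binding problem described in this comment can be reproduced in isolation. The sketch below uses a stand-in class, `FakeEncoder`, whose `preprocess` signature is an assumption inferred from the comment (`input_prompt` second, `mm_processor_kwargs` third); it is not the real lmdeploy encoder.

```python
import asyncio


class FakeEncoder:
    """Stand-in encoder; the parameter order is assumed from the review comment."""

    async def preprocess(self, messages, input_prompt=None, mm_processor_kwargs=None, request_id=None):
        # Echo back what each parameter received so the binding is visible.
        return {'input_prompt': input_prompt, 'mm_processor_kwargs': mm_processor_kwargs}


encoder = FakeEncoder()
processor_kwargs = {'max_pixels': 1024}  # hypothetical processor option

# Buggy call: the second positional argument binds to input_prompt.
buggy = asyncio.run(encoder.preprocess([], processor_kwargs, request_id=1))

# Fixed call: the keyword argument reaches the intended parameter.
fixed = asyncio.run(encoder.preprocess([], mm_processor_kwargs=processor_kwargs, request_id=1))
```

Passing the argument by keyword keeps the call correct even if the parameter order changes again.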


lvhan028 · 2026-05-24T03:40:31Z

+        if backend == 'pytorch':
+            model_config = getattr(self.engine, 'model_config', None)
+            mm_feature_dtype = getattr(model_config, 'dtype', None)


Could we get the dtype from the original Transformers config inside the ImageEncoder class, so that both engines benefit?

grimoire · 2026-05-25T13:17:43Z

+        """Cast floating processor-output tensors to the target model dtype."""
+        if not isinstance(target_dtype, torch.dtype):
+            return output
+        if not torch.empty((), dtype=target_dtype).is_floating_point():


`torch.dtype` has an `is_floating_point` attribute, so there is no need to create an empty tensor for this check.
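
The check discussed in this thread can be sketched as follows, reading `is_floating_point` directly from the `torch.dtype` object. The function name `cast_floating_tensors` and the assumption that the processor output is a flat dict are illustrative; this is not the PR's `_postprocess_mm_output` implementation.

```python
import torch


def cast_floating_tensors(output, target_dtype):
    """Return `output` with floating tensors cast to `target_dtype`."""
    # Skip casting when no usable floating dtype was resolved.
    if not isinstance(target_dtype, torch.dtype) or not target_dtype.is_floating_point:
        return output
    casted = {}
    for key, value in output.items():
        # Only floating tensors are cast; integer tensors and non-tensors pass through.
        if isinstance(value, torch.Tensor) and value.is_floating_point():
            casted[key] = value.to(target_dtype)
        else:
            casted[key] = value
    return casted


# Assumed processor output: one floating feature tensor and one integer grid tensor.
features = {
    'pixel_values': torch.zeros(2, 3, 4, 4, dtype=torch.float32),
    'image_grid_thw': torch.tensor([[1, 2, 2]], dtype=torch.int64),
}
casted = cast_floating_tensors(features, torch.bfloat16)
```

Integer tensors such as grid shapes keep their dtype, since casting them to a floating type would change their meaning.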

* fix(vl): reduce multimodal feature memory use
* debug: log vl preprocess duration
* fix: address multimodal preprocess review comments
* test: remove multimodal preprocess regression test
* fix comments
* fix: resolve vl mm feature dtype from hf config
* chore: remove redundant multimodal cleanup
* chore: defer vl preprocess timing logs
* chore: simplify vl dtype checks

CUHKSZzxy added 2 commits May 18, 2026 12:01

fix(vl): reduce multimodal feature memory use

b83addf

debug: log vl preprocess duration

23ee3b7

CUHKSZzxy marked this pull request as ready for review May 21, 2026 11:04

Copilot AI review requested due to automatic review settings May 21, 2026 11:04

Copilot started reviewing on behalf of CUHKSZzxy May 21, 2026 11:04

Copilot AI reviewed May 21, 2026


Comment thread lmdeploy/serve/processors/multimodal.py

Comment thread lmdeploy/vl/engine.py Outdated

CUHKSZzxy added 2 commits May 21, 2026 20:15

fix: address multimodal preprocess review comments

2cde2e5

test: remove multimodal preprocess regression test

5258e2f