Conversation
```python
skip_patterns = [
    'lm_head',
    'embed_tokens',
    'mlp.gate',               # sparse MoE router gate
    'vision_model',           # non-HF InternVL, vision part
    'mlp1',                   # non-HF InternVL, projector
    'mlp2',                   # non-HF InternVL-Flash, projector
    'vision_tower',           # HF InternVL, vision part
    'multi_modal_projector',  # HF InternVL, projector
]
modules_to_not_convert = []
```
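For context, a minimal sketch of how such patterns are typically consumed: module names are matched by substring, and matching modules are kept in their original precision. The helper name and matching logic below are assumptions for illustration, not the exact lmdeploy implementation.

```python
import torch.nn as nn

def collect_modules_to_not_convert(model: nn.Module, skip_patterns: list[str]) -> list[str]:
    """Return names of modules that should be excluded from FP8 conversion (sketch)."""
    return [
        name for name, _ in model.named_modules()
        if any(pattern in name for pattern in skip_patterns)
    ]
```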
These configurations are model-specific. We should adopt a more maintainable approach.
I checked the vLLM FP8 compressor example and noticed that the ignored patterns are indeed model-specific. Currently, these patterns are passed as an input argument named `ignore` in the quantization recipe.
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8
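For reference, the recipe in that example passes the ignore list roughly like this (a sketch based on the linked example; exact imports and arguments may differ across llm-compressor versions):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Modules whose names match the `ignore` entries are excluded from FP8 quantization.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
```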
How about we also expose this as a configurable input argument, allowing users to define their own ignore patterns as needed?
@RunningLeon As discussed with @CUHKSZzxy, we propose adding a new `--skip-pattern` option to `config.py` for custom skip patterns, alongside lmdeploy's internal defaults.
What's your opinion?
Personally, if we are only passing skip patterns, a config file is not necessary.
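To make the proposal concrete, here is a minimal sketch of a CLI-only interface; the flag name, defaults, and merge behavior are assumptions, since the option is only proposed in this thread:

```python
import argparse

# lmdeploy's internal defaults (abbreviated here for illustration)
DEFAULT_SKIP_PATTERNS = ['lm_head', 'embed_tokens', 'mlp.gate']

parser = argparse.ArgumentParser()
parser.add_argument('--skip-pattern', nargs='*', default=[],
                    help='extra module-name patterns to exclude from FP8 quantization')
args = parser.parse_args()

# user-supplied patterns extend, rather than replace, the internal defaults
skip_patterns = DEFAULT_SKIP_PATTERNS + args.skip_pattern
```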
| """ | ||
| tensor: torch.Tensor | ||
| scale: torch.Tensor | ||
| weight_scale_inv: torch.Tensor |
Changing `scale` to `weight_scale_inv` might affect w8a8 quantized model inference.
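One way to limit the blast radius would be a read-only compatibility alias, sketched below; the class name and fields are hypothetical and only illustrate the idea, they are not what the PR currently does:

```python
from dataclasses import dataclass

import torch


@dataclass
class QuantizedWeight:  # hypothetical name, for illustration only
    tensor: torch.Tensor
    weight_scale_inv: torch.Tensor

    @property
    def scale(self) -> torch.Tensor:
        # keep the old attribute readable so existing w8a8 code paths
        # that access `.scale` continue to work after the rename
        return self.weight_scale_inv
```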
Usage
NOTE: We can use either the `pytorch` or `turbomind` backend for FP8 inference. Here we take the `pytorch` backend as an example.
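A minimal inference sketch with the `pytorch` backend; the checkpoint path is a placeholder for a locally quantized FP8 model, and the image URL is just an example:

```python
from lmdeploy import PytorchEngineConfig, pipeline
from lmdeploy.vl import load_image

# path to the FP8-quantized checkpoint (placeholder path)
pipe = pipeline('./InternVL3_5-8B-FP8', backend_config=PytorchEngineConfig())

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response.text)
```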
Accuracy
Dataset: OCRBench
Models: InternVL3.5-8B (FP8), InternVL3_5-30B-A3B (FP8)
Tested with VLMEvalKit.
Checklist
`weight_scale_inv` modification affects other quant methods / modules