Fix GGUF to Work Better with modules_to_not_convert / keep_in_fp32_modules
#13697
base: main
```diff
@@ -80,7 +80,7 @@ def _fused_mul_mat_gguf(x: torch.Tensor, qweight: torch.Tensor, qweight_type: int
     # there is no need to call any kernel for fp16/bf16
     if qweight_type in UNQUANTIZED_TYPES:
         weight = dequantize_gguf_tensor(qweight)
-        return x @ weight.T
+        return x @ weight.to(x.dtype).T
```
Member

Would it break
Author (Collaborator)

I'm not sure how it will interact with:

diffusers/src/diffusers/quantizers/gguf/utils.py
Lines 98 to 99 in d773308

So I think it should be fine? (I think this change isn't specific to
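For context, a minimal standalone sketch (not part of the PR) of the dtype mismatch the `.to(x.dtype)` cast in the first hunk avoids: an unquantized GGUF weight kept in `float32` (e.g. via `keep_in_fp32_modules`) cannot be multiplied directly against `bfloat16` activations.

```python
import torch

# Stand-in tensors: activations in the compute dtype, and a "dequantized"
# weight that was deliberately kept in float32.
x = torch.randn(2, 8, dtype=torch.bfloat16)
weight = torch.randn(4, 8, dtype=torch.float32)

try:
    _ = x @ weight.T  # old path: mixed-dtype matmul raises a RuntimeError
except RuntimeError as err:
    print("mixed-dtype matmul failed:", err)

out = x @ weight.to(x.dtype).T  # new path: cast the weight to x's dtype first
print(out.dtype)  # torch.bfloat16
```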
```python
    # TODO(Isotr0py): GGUF's MMQ and MMVQ implementation are designed for
    # contiguous batching and inefficient with diffusers' batching,
```
```diff
@@ -134,6 +134,8 @@ def _should_convert_to_gguf(state_dict, prefix):
         return
 
     for name, module in model.named_children():
+        if name in modules_to_not_convert:
+            continue
         module_prefix = prefix + name + "."
         _replace_with_gguf_linear(module, compute_dtype, state_dict, module_prefix, modules_to_not_convert)
```
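The practical effect of the skip above is that modules named in `modules_to_not_convert` keep ordinary `nn.Linear` layers instead of being swapped for GGUF linear layers. A hedged usage sketch of the code path that reaches this loop (the checkpoint URL is only an example):

```python
import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# Example single-file GGUF checkpoint; any GGUF transformer checkpoint works here.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

# Modules listed in the model's _keep_in_fp32_modules are skipped by the loop
# above and keep regular parameters; everything else stays GGUF-quantized and
# is dequantized on the fly at runtime.
```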
I am a bit confused. If a param is already `GGUFParameter` type, then I'd assume that it's already quantized. In that case, how come `dequantize -> type upcasting` is the right sequence of ops? What am I missing?
The idea is that the GGUF checkpoint might specify a quantization for a parameter that we do not want to be quantized, as expressed through either `_keep_in_fp32_modules` on the `model: ModelMixin` or `modules_to_not_convert` on `GGUFQuantizationConfig`.

When we load the GGUF state dict, these parameters will be placed into a `GGUFParameter`, and this happens before we load the weights into the model (e.g. in `FromOriginalModelMixin.from_single_file`). To respect `modules_to_not_convert`, we need to convert these back into normal (unquantized) parameters, which we do here at load time via `dequantize_gguf_tensor`. We then need to cast the parameter to the right compute dtype, which is `torch.float32` for `keep_in_fp32_modules` and `compute_dtype` otherwise.

Currently, `GGUFQuantizationConfig` doesn't expose a `modules_to_not_convert` argument, but `keep_in_fp32_modules` are included in `modules_to_not_convert`:

diffusers/src/diffusers/quantizers/gguf/gguf_quantizer.py
Line 133 in d773308

So this change would affect only any specified `_keep_in_fp32_modules` right now.
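A minimal sketch of the load-time sequence described above. The helper name and its exact placement are assumptions for illustration, not the PR's actual code; `dequantize_gguf_tensor` is the function referenced in the first hunk, and its import path here is assumed from the file cited in this thread.

```python
import torch

# Assumed import path, based on the file referenced in this thread.
from diffusers.quantizers.gguf.utils import dequantize_gguf_tensor


def _dequantize_excluded_param(param, module_name, modules_to_not_convert,
                               keep_in_fp32_modules, compute_dtype):
    """Hypothetical helper: undo GGUF quantization for excluded modules at load time."""
    if not any(m in module_name for m in modules_to_not_convert):
        # Not excluded: leave it as a GGUFParameter; the GGUF linear layer
        # dequantizes it on the fly at runtime.
        return param

    # Excluded: dequantize back to a plain tensor, then cast to the right
    # compute dtype (float32 for keep_in_fp32_modules, compute_dtype otherwise).
    weight = dequantize_gguf_tensor(param)
    target_dtype = (
        torch.float32
        if any(m in module_name for m in keep_in_fp32_modules)
        else compute_dtype
    )
    return torch.nn.Parameter(weight.to(target_dtype), requires_grad=False)
```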