Fix GGUF to Work Better with modules_to_not_convert / keep_in_fp32_modules (#13697)
```python
# If the GGUFParameter should not be quantized (for example, it is a submodule of any excluded module),
# dequantize it and set the (dequantized) parameter to the proper dtype.
if isinstance(param_value, GGUFParameter) and any(
    m in param_name.split(".") for m in self.modules_to_not_convert
):
    keep_in_fp32 = getattr(self, "keep_in_fp32_modules", [])
    target_dtype = (
        torch.float32 if any(m in param_name.split(".") for m in keep_in_fp32) else self.compute_dtype
    )
    param_value = dequantize_gguf_tensor(param_value).to(target_dtype)
```
I am a bit confused. If a param is already GGUFParameter type, then I'd assume that it's already quantized. In that case, how come dequantize -> type upcasting is the right sequence of ops?
What am I missing?
The idea is that the GGUF checkpoint might specify a quantization for a parameter that we do not want to be quantized, as expressed through either `_keep_in_fp32_modules` on the model (a `ModelMixin`) or `modules_to_not_convert` on `GGUFQuantizationConfig`.
When we load the GGUF state dict, these parameters will be placed into a GGUFParameter, and this happens before we load the weights into the model (e.g. in FromOriginalModelMixin.from_single_file). To respect modules_to_not_convert, we need to convert these back into normal (unquantized) parameters, which we do here at load time via dequantize_gguf_tensor. We then need to cast the parameter to the right compute dtype, which is torch.float32 for keep_in_fp32_modules and compute_dtype otherwise.
Currently, `GGUFQuantizationConfig` doesn't expose a `modules_to_not_convert` argument, but `keep_in_fp32_modules` are included in `modules_to_not_convert`.
So right now, this change would only affect any specified `_keep_in_fp32_modules`.
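For context, a rough sketch of what that inclusion amounts to (the helper name and signature here are hypothetical, not the actual diffusers API):

```python
# Hypothetical helper, for illustration only: how keep_in_fp32_modules typically get
# folded into the set of modules excluded from quantization before weights are loaded.
def merge_excluded_modules(modules_to_not_convert, keep_in_fp32_modules):
    excluded = list(modules_to_not_convert or [])
    for name in keep_in_fp32_modules or []:
        if name not in excluded:
            excluded.append(name)
    return excluded

# GGUFQuantizationConfig exposes no modules_to_not_convert today, so only the model's
# _keep_in_fp32_modules (e.g. a time_embedder) flow through to the exclusion list.
print(merge_excluded_modules(None, ["time_embedder"]))  # ['time_embedder']
```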
```diff
 if qweight_type in UNQUANTIZED_TYPES:
     weight = dequantize_gguf_tensor(qweight)
-    return x @ weight.T
+    return x @ weight.to(x.dtype).T
```
Would it break torch.compile compatibility for models that don't define modules_to_not_convert / keep_in_fp32_modules?
I'm not sure how it will interact with torch.compile, but this change mirrors the implementation used for quantized weight types (`qweight_type` in `DEQUANT_TYPES`); see `src/diffusers/quantizers/gguf/utils.py`, lines 98 to 99 at d773308.
So I think it should be fine? (I think this change isn't specific to modules_to_not_convert, as the GGUF checkpoint could store weights in e.g. BF16 even if modules_to_not_convert is empty, which would then go through this code path.)
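To make the dtype concern concrete, here is a minimal standalone repro of the mismatch the cast avoids (illustrative only; assumes BF16 weights stored in the GGUF file and FP16 activations):

```python
import torch

x = torch.randn(2, 4, dtype=torch.float16)        # activations
weight = torch.randn(8, 4, dtype=torch.bfloat16)  # unquantized weight as stored in the GGUF file

# x @ weight.T would typically raise a dtype-mismatch error; casting the weight to the
# activation dtype first makes the matmul valid.
out = x @ weight.to(x.dtype).T
print(out.shape, out.dtype)  # torch.Size([2, 8]) torch.float16
```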
sayakpaul left a comment:
Thanks! I left some comments. I think there should be a test for this in
What does this PR do?
This PR contains several fixes so that GGUF loading and inference work better with `modules_to_not_convert` and `_keep_in_fp32_modules`.

Changelist

- `src/diffusers/quantizers/gguf/utils.py`
  - `_replace_with_gguf_linear`: adds a check to see if any of the current module's `named_children` are in `modules_to_not_convert`, and if so, skips it. This allows us to skip containers, rather than just leaf-level `nn.Linear` submodules as in the current code (see the sketch below). For example, `TimestepEmbedding` modules are commonly added to `_keep_in_fp32_modules` (e.g. `time_embedder` in `WanTransformer3DModel`'s `WanTimeTextImageEmbedding` condition embedder), but since they themselves contain leaf `nn.Linear` submodules such as `linear_1`, the current code will only check against leaf modules such as `linear_1`, and conclude incorrectly that they should be converted.
  - `_fused_mul_mat_gguf`: in the `UNQUANTIZED_TYPES` case, also casts the dequantized `weight` to the activation `x`'s `dtype` before performing the matrix multiplication, which should prevent dtype errors for BF16 weights.
- `src/diffusers/quantizers/gguf/gguf_quantizer.py`
  - `GGUFQuantizer.create_quantized_param`: handles `modules_to_not_convert` by dequantizing them, so that they end up in their original unquantized form. This is intended to handle the case where a module in `self.modules_to_not_convert` (or one of its children) is in the GGUF file. Since it is in the file, it will be converted to a `GGUFParameter`, but we don't want it to be quantized, so we convert it back here.

Inspired by GGUF debugging in #13551, in particular #13551 (comment).
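Rough illustration of the container-level skip described above (a simplified sketch, not the actual `_replace_with_gguf_linear` implementation):

```python
import torch.nn as nn

def replace_linears_sketch(module: nn.Module, modules_to_not_convert: list) -> None:
    # Simplified recursion, for illustration only: skip any child whose name is listed in
    # modules_to_not_convert before descending into it, so an excluded container such as a
    # TimestepEmbedding keeps its inner nn.Linear layers (linear_1, linear_2, ...) untouched,
    # instead of only matching the leaf layer names themselves.
    for name, child in module.named_children():
        if name in modules_to_not_convert:
            continue  # leave the entire excluded container unconverted
        if isinstance(child, nn.Linear):
            pass  # here the real code would swap in a GGUF-aware linear layer
        else:
            replace_linears_sketch(child, modules_to_not_convert)
```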
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@DN6
@sayakpaul