feat: add per-model FP8 layerwise casting for VRAM reduction #8945
Pfannkuchensack wants to merge 10 commits into invoke-ai:main
Conversation
Add an `fp8_storage` option to model default settings that enables diffusers' `enable_layerwise_casting()` to store weights in FP8 (float8_e4m3fn) while casting to fp16/bf16 during inference. This reduces VRAM usage by ~50% per model with minimal quality loss. Supported: SD1/SD2/SDXL/SD3, Flux, Flux2, CogView4, Z-Image, VAE (diffusers-based), ControlNet, T2IAdapter. Not applicable: Text Encoders, LoRA, GGUF, BnB, custom classes.
Add a per-model FP8 storage toggle in Model Manager default settings for both main models and control adapter models. When enabled, model weights are stored in FP8 format in VRAM (~50% savings) and cast layer-by-layer to compute precision during inference via diffusers' `enable_layerwise_casting()`. Backend: add an `fp8_storage` field to `MainModelDefaultSettings` and `ControlAdapterDefaultSettings`, and apply FP8 layerwise casting in all relevant model loaders (SD, SDXL, FLUX, CogView4, Z-Image, ControlNet, T2IAdapter, VAE). Gracefully skips non-`ModelMixin` models (custom checkpoint loaders, GGUF, BnB). Frontend: add an FP8 Storage switch to the model default settings panels, with InformationalPopover, translation keys, and proper form handling.
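The loader-side hook can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: `apply_fp8_layerwise_casting` is a hypothetical standalone name mirroring the PR's `_apply_fp8_layerwise_casting`, and `enable_layerwise_casting()` is the diffusers `ModelMixin` API the PR builds on.

```python
import torch


def apply_fp8_layerwise_casting(model, compute_dtype: torch.dtype = torch.bfloat16):
    """Store weights in FP8, casting each layer back to compute_dtype at call time.

    Hypothetical helper; mirrors the shape of the PR's _apply_fp8_layerwise_casting.
    """
    # Only diffusers ModelMixin subclasses expose enable_layerwise_casting();
    # GGUF-/BnB-quantized and plain nn.Module models fall through unchanged.
    if hasattr(model, "enable_layerwise_casting"):
        model.enable_layerwise_casting(
            storage_dtype=torch.float8_e4m3fn,  # weights held in FP8 (~50% VRAM vs fp16/bf16)
            compute_dtype=compute_dtype,        # precision used for the actual matmuls
        )
    return model
```

Non-ModelMixin models pass through untouched, which is what lets the same call site sit in every loader without special-casing quantized formats.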
JPPhoto
left a comment
In my quantized Krea dev setup, your code was never called - is this by design or an overlooked class?
I'd also like the UI to be tweaked so the fp8 setting appears as a single slider under Settings like CPU-only for text encoders rather than as a dual-slider in the model defaults section.
```python
def _should_use_fp8(self, config: AnyModelConfig, submodel_type: Optional[SubModelType] = None) -> bool:
    """Check if FP8 layerwise casting should be applied to a model."""
    # FP8 storage only works on CUDA
    if self._torch_device.type != "cuda":
```
Shouldn't this check self._get_execution_device() to make sure the model is to be executed on cuda?
It checks "does the system even have CUDA?", not "does this model run on CUDA?". Both lead to the same result, but _torch_device is semantically a better fit for a hardware capability check. The only difference is that _get_execution_device() requires config and submodel_type as parameters, while _torch_device does not.
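For illustration, the CUDA gate plus the quantized-model exclusion can be condensed into a standalone sketch. The function name and flat parameters are hypothetical; the real method takes `config` and `submodel_type` and reads the model's `default_settings`.

```python
import torch


def should_use_fp8(device: torch.device, fp8_requested: bool, is_quantized: bool) -> bool:
    """Hypothetical standalone version of the PR's _should_use_fp8 gate."""
    if device.type != "cuda":
        return False  # FP8 storage only works on CUDA
    if is_quantized:
        return False  # GGUF/BnB models already carry their own quantization
    return fp8_requested  # honor the per-model default_settings toggle
```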
```python
    local_files_only=True,
)

model = self._apply_fp8_layerwise_casting(model, config, submodel_type)
```
Does this only apply to the v2 VAE and diffusers models? What about GGUF?
GGUF and BnB models are intentionally excluded — they already use their own quantization (typically Q4/Q8), so applying FP8 layerwise casting on top would be redundant and likely conflict with their dequantization logic during inference.
FluxCheckpointModel and Flux2CheckpointModel were missing the _apply_fp8_layerwise_casting call. Additionally, FP8 casting only worked for diffusers ModelMixin models, so add manual layerwise casting via forward hooks for plain nn.Module models (the custom Flux class). Also simplify the FP8 UI toggle from a dual-slider to a single switch, matching the CPU-only toggle pattern, per review feedback on invoke-ai#8945.
Z-Image's transformer has dtype mismatches with diffusers' enable_layerwise_casting: skipped modules (t_embedder, cap_embedder) stay in bf16 while hooked modules cast to fp16, causing crashes in attention layers. Also hide the FP8 toggle in the UI for Z-Image models.
Models like Flux are loaded in bf16 but the global torch dtype is fp16, causing dtype mismatches during FP8 layerwise casting. Detect the model's actual parameter dtype and use it as compute_dtype for both diffusers ModelMixin and plain nn.Module models.
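The dtype detection described in that commit can be sketched as follows (hypothetical helper name; the real code wires the result into `enable_layerwise_casting` and the manual hooks):

```python
import torch
import torch.nn as nn


def detect_compute_dtype(model: nn.Module, fallback: torch.dtype = torch.float16) -> torch.dtype:
    """Use the model's actual parameter dtype as compute_dtype, so a bf16-loaded
    model is not cast against a global fp16 setting."""
    for p in model.parameters():
        if p.dtype.is_floating_point:
            return p.dtype
    return fallback  # no floating-point parameters: fall back to the configured dtype
```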
FP8 Layerwise Casting - Implementation
Summary
Add a per-model `fp8_storage` option to model default settings that enables diffusers' `enable_layerwise_casting()` to store weights in FP8 (float8_e4m3fn) while casting to fp16/bf16 during inference. This reduces VRAM usage by ~50% per model with minimal quality loss.

Supported: SD1/SD2/SDXL/SD3, Flux, Flux2, CogView4, Z-Image, VAE (diffusers-based), ControlNet, T2IAdapter.
Not applicable: Text Encoders, LoRA, GGUF, BnB, custom classes.
Related Issues / Discussions
- `enable_layerwise_casting()` (available in diffusers 0.36.0)

QA Instructions
- Set `fp8_storage: true` in a model's `default_settings` (via API or Model Manager UI)

Test Matrix
- `fp8_storage=true` - load and generate (one row per supported model type)
- `fp8_storage=true` - check quality
- `fp8_storage` is silently ignored
- `fp8_storage=true` - load and generate # it does not work

Checklist
- What's New copy (if doing a release after this PR)