feat: add per-model FP8 layerwise casting for VRAM reduction #8945
Pfannkuchensack wants to merge 10 commits into invoke-ai:main
Conversation
Add an `fp8_storage` option to model default settings that enables diffusers' `enable_layerwise_casting()` to store weights in FP8 (float8_e4m3fn) while casting to fp16/bf16 during inference. This reduces VRAM usage by ~50% per model with minimal quality loss. Supported: SD1/SD2/SDXL/SD3, Flux, Flux2, CogView4, Z-Image, VAE (diffusers-based), ControlNet, T2IAdapter. Not applicable: Text Encoders, LoRA, GGUF, BnB, custom classes.
Add a per-model FP8 storage toggle in Model Manager default settings for both main models and control adapter models. When enabled, model weights are stored in FP8 format in VRAM (~50% savings) and cast layer-by-layer to compute precision during inference via diffusers' `enable_layerwise_casting()`. Backend: add an `fp8_storage` field to `MainModelDefaultSettings` and `ControlAdapterDefaultSettings`, and apply FP8 layerwise casting in all relevant model loaders (SD, SDXL, FLUX, CogView4, Z-Image, ControlNet, T2IAdapter, VAE). Gracefully skips non-`ModelMixin` models (custom checkpoint loaders, GGUF, BnB). Frontend: add an FP8 Storage switch to the model default settings panels, with InformationalPopover, translation keys, and proper form handling.
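The loader-side hook can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: `apply_fp8_layerwise_casting` is a hypothetical standalone name mirroring the PR's `_apply_fp8_layerwise_casting`, and `enable_layerwise_casting()` is the diffusers `ModelMixin` API the PR builds on.

```python
import torch


def apply_fp8_layerwise_casting(model, compute_dtype: torch.dtype = torch.bfloat16):
    """Store weights in FP8, casting each layer back to compute_dtype at call time.

    Hypothetical helper; mirrors the shape of the PR's _apply_fp8_layerwise_casting.
    """
    # Only diffusers ModelMixin subclasses expose enable_layerwise_casting();
    # GGUF-/BnB-quantized and plain nn.Module models fall through unchanged.
    if hasattr(model, "enable_layerwise_casting"):
        model.enable_layerwise_casting(
            storage_dtype=torch.float8_e4m3fn,  # weights held in FP8 (~50% VRAM vs fp16/bf16)
            compute_dtype=compute_dtype,        # precision used for the actual matmuls
        )
    return model
```

Non-ModelMixin models pass through untouched, which is what lets the same call site sit in every loader without special-casing quantized formats.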
JPPhoto
left a comment
In my quantized Krea dev setup, your code was never called - is this by design or an overlooked class?
I'd also like the UI to be tweaked so the fp8 setting appears as a single slider under Settings like CPU-only for text encoders rather than as a dual-slider in the model defaults section.
```python
def _should_use_fp8(self, config: AnyModelConfig, submodel_type: Optional[SubModelType] = None) -> bool:
    """Check if FP8 layerwise casting should be applied to a model."""
    # FP8 storage only works on CUDA
    if self._torch_device.type != "cuda":
```
Shouldn't this check self._get_execution_device() to make sure the model is to be executed on cuda?
It checks "does the system even have CUDA?", not "does this model run on CUDA?". Both lead to the same result, but _torch_device is semantically a better fit for a hardware capability check. The only difference is that _get_execution_device() requires config and submodel_type as parameters, while _torch_device does not.
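For illustration, the CUDA gate plus the quantized-model exclusion can be condensed into a standalone sketch. The function name and flat parameters are hypothetical; the real method takes `config` and `submodel_type` and reads the model's `default_settings`.

```python
import torch


def should_use_fp8(device: torch.device, fp8_requested: bool, is_quantized: bool) -> bool:
    """Hypothetical standalone version of the PR's _should_use_fp8 gate."""
    if device.type != "cuda":
        return False  # FP8 storage only works on CUDA
    if is_quantized:
        return False  # GGUF/BnB models already carry their own quantization
    return fp8_requested  # honor the per-model default_settings toggle
```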
```python
    local_files_only=True,
)

model = self._apply_fp8_layerwise_casting(model, config, submodel_type)
```
Does this only apply to the v2 VAE and diffusers models? What about GGUF?
GGUF and BnB models are intentionally excluded — they already use their own quantization (typically Q4/Q8), so applying FP8 layerwise casting on top would be redundant and likely conflict with their dequantization logic during inference.
FluxCheckpointModel and Flux2CheckpointModel were missing the _apply_fp8_layerwise_casting call. Additionally, FP8 casting only worked for diffusers ModelMixin models, so add manual layerwise casting via forward hooks for plain nn.Module models (the custom Flux class). Also simplify the FP8 UI toggle from a dual-slider to a single switch, matching the CPU-only toggle pattern, per review feedback on invoke-ai#8945.
Z-Image's transformer has dtype mismatches with diffusers' enable_layerwise_casting: skipped modules (t_embedder, cap_embedder) stay in bf16 while hooked modules cast to fp16, causing crashes in attention layers. Also hide the FP8 toggle in the UI for Z-Image models.
Models like Flux are loaded in bf16 but the global torch dtype is fp16, causing dtype mismatches during FP8 layerwise casting. Detect the model's actual parameter dtype and use it as compute_dtype for both diffusers ModelMixin and plain nn.Module models.
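The dtype detection described in that commit can be sketched as follows (hypothetical helper name; the real code wires the result into `enable_layerwise_casting` and the manual hooks):

```python
import torch
import torch.nn as nn


def detect_compute_dtype(model: nn.Module, fallback: torch.dtype = torch.float16) -> torch.dtype:
    """Use the model's actual parameter dtype as compute_dtype, so a bf16-loaded
    model is not cast against a global fp16 setting."""
    for p in model.parameters():
        if p.dtype.is_floating_point:
            return p.dtype
    return fallback  # no floating-point parameters: fall back to the configured dtype
```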
FP8 Layerwise Casting - Implementation
Summary
Add a per-model `fp8_storage` option to model default settings that enables diffusers' `enable_layerwise_casting()` to store weights in FP8 (float8_e4m3fn) while casting to fp16/bf16 during inference. This reduces VRAM usage by ~50% per model with minimal quality loss.

Supported: SD1/SD2/SDXL/SD3, Flux, Flux2, CogView4, Z-Image, VAE (diffusers-based), ControlNet, T2IAdapter.
Not applicable: Text Encoders, LoRA, GGUF, BnB, custom classes.
Related Issues / Discussions
- `enable_layerwise_casting()` (available in diffusers 0.36.0)

QA Instructions
- Set `fp8_storage: true` in a model's `default_settings` (via API or Model Manager UI)

Test Matrix
- `fp8_storage=true` - load and generate (one row per supported model type)
- `fp8_storage=true` - check quality
- `fp8_storage` is silently ignored
- `fp8_storage=true` - load and generate # it does not work

Checklist
- What's New copy (if doing a release after this PR)