feat: add per-model FP8 layerwise casting for VRAM reduction #8945
Merged: lstein merged 21 commits into invoke-ai:main from Pfannkuchensack:feature/fp8-layerwise-casting, May 12, 2026
Commits
6f52606 feat: add per-model FP8 layerwise casting for VRAM reduction (Pfannkuchensack)
bf3bd2e feat: add FP8 storage option to Model Manager UI (Pfannkuchensack)
afe246e ruff format (Pfannkuchensack)
2262d8d Merge branch 'main' into feature/fp8-layerwise-casting (JPPhoto)
5327df8 Merge branch 'main' into feature/fp8-layerwise-casting (JPPhoto)
6c13fca Merge branch 'main' into feature/fp8-layerwise-casting (JPPhoto)
0d7b39f fix: enable FP8 layerwise casting for checkpoint Flux models (Pfannkuchensack)
a0df643 fix: exclude Z-Image from FP8 due to diffusers layerwise casting bug (Pfannkuchensack)
06ad3c7 fix: detect model dtype for FP8 compute instead of using global dtype (Pfannkuchensack)
025759f Remove call for _should_use_fp8 in z-image (Pfannkuchensack)
8ddb200 Merge branch 'main' into feature/fp8-layerwise-casting (Pfannkuchensack)
9798012 Merge branch 'main' into feature/fp8-layerwise-casting (Pfannkuchensack)
2b0af7c Merge remote-tracking branch 'upstream/main' into feature/fp8-layerwi… (Pfannkuchensack)
55d41a6 Merge branch 'main' + exclude VAEs from FP8 layerwise casting (Pfannkuchensack)
f0a53a5 fix(fp8): invalidate cache on settings change, exception-safe nn.Modu… (Pfannkuchensack)
f841598 fix(fp8): honor class swap for LoRA patches, evict stale locked entri… (Pfannkuchensack)
ae2068a Merge branch 'main' into feature/fp8-layerwise-casting (JPPhoto)
f94b705 fix(fp8): switch nn.Module FP8 wrapper to hooks so CustomLinear dispa… (Pfannkuchensack)
458a425 Merge branch 'feature/fp8-layerwise-casting' of https://github.com/Pf… (Pfannkuchensack)
2598698 Add docs for fp8 (Pfannkuchensack)
9a4a2f8 Merge branch 'main' into feature/fp8-layerwise-casting (lstein)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| --- | ||
| title: FP8 Storage | ||
| sidebar: | ||
| order: 3 | ||
| --- | ||
|
|
||
| import { Steps } from '@astrojs/starlight/components'; | ||
|
|
||
| FP8 Storage cuts a model's VRAM footprint roughly in half by keeping weights on the GPU in 8-bit floating-point format (`float8_e4m3fn`). During inference, each layer's weights are upcast on the fly to the compute precision (FP16/BF16) and returned to FP8 after the forward pass, so quality is largely preserved. | ||
|
|
||
| It pairs well with [Low-VRAM mode](/configuration/low-vram-mode/): low-VRAM mode streams layers between RAM and VRAM, while FP8 Storage shrinks the layers themselves. | ||
|
|
||
| ## Requirements | ||
|
|
||
| - **NVIDIA GPU on Windows or Linux.** FP8 Storage uses CUDA tensor types and is silently disabled on CPU and MPS. | ||
| - **CUDA 12.x and recent PyTorch.** The `float8_e4m3fn` dtype was added in PyTorch 2.1 — InvokeAI's bundled versions satisfy this. | ||
|
|
||
| There is no hardware requirement for FP8 *compute* — InvokeAI casts back to FP16/BF16 for math. This means FP8 Storage works on GPUs that do not natively support FP8 matmul (e.g. RTX 30-series), at a small per-step throughput cost. | ||
|
|
||
| ## Enabling FP8 Storage | ||
|
|
||
| FP8 Storage is a **per-model setting**, configured from the Model Manager: | ||
|
|
||
| <Steps> | ||
| 1. Open the **Model Manager**. | ||
| 2. Select a model (Main, ControlNet, or T2I-Adapter). | ||
| 3. Under **Default Settings**, toggle **FP8 Storage (Save VRAM)**. | ||
| 4. Click **Save**. | ||
| </Steps> | ||
|
|
||
| The setting takes effect on the next load. If the model is already in the cache, InvokeAI evicts the cached copy automatically so the new setting applies — even if a generation is currently using the model (the eviction is deferred until the generation finishes). | ||
|
|
||
| :::tip[When to enable] | ||
| Enable FP8 Storage on large models that don't fit comfortably in VRAM — FLUX dev/Klein, large SDXL checkpoints, ControlNet-XL adapters. For smaller SD1 / SD2 models, the savings are negligible and not worth the small precision trade-off. | ||
| ::: | ||
|
|
||
| ## What FP8 Storage applies to | ||
|
|
||
| FP8 Storage is **only** applied to layers where the precision trade-off is acceptable: | ||
|
|
||
| | Model type | FP8 applied? | | ||
| | ----------------------------- | -------------------------------------- | | ||
| | Main models (SD1, SD2, SDXL) | Yes | | ||
| | FLUX.1 / FLUX.2 Klein | Yes | | ||
| | ControlNet, T2I-Adapter | Yes | | ||
| | VAE | No — visible decode-quality regression | | ||
| | Text encoders, tokenizers | No — small models, no benefit | | ||
| Z-Image (any variant) | No — dtype mismatch with skipped layers | | ||
| | LoRA, ControlLoRA | No — patched into base, not run alone | | ||
|
|
||
| Within a supported model, **norm layers, position/patch embeddings, and `proj_in`/`proj_out` are skipped** so precision-sensitive tiny learned scalars (e.g. FLUX `RMSNorm.scale`) aren't crushed to FP8. This mirrors the diffusers default skip list. | ||
|
|
||
| ## Quality trade-offs | ||
|
|
||
| FP8 Storage is **near-lossless** for most workloads because: | ||
|
|
||
| - Norms and embeddings (the precision-sensitive layers) are skipped. | ||
| - The actual matmul still happens in FP16/BF16 — FP8 is only the on-GPU storage format. | ||
|
|
||
| That said, two caveats apply: | ||
|
|
||
| - **VAEs** are never cast, because FP8 causes a visible decode-quality regression; the toggle has no effect on VAE submodels. | ||
| - **Heavy LoRA stacks**: patching is unaffected, but very precision-sensitive LoRAs may show slight drift. Compare outputs side by side if your workflow depends on subtle LoRA behavior. | ||
|
|
||
| If you see unexpected quality regressions, disable FP8 Storage on the affected model and re-run. | ||
|
|
||
| ## Combining with Low-VRAM mode and quantized models | ||
|
|
||
| - **FP8 + partial loading**: fully supported. FP8 Storage shrinks the layers; partial loading streams them between RAM and VRAM as needed. Use both on tight VRAM budgets. | ||
| - **FP8 + GGUF / NF4 / int8 quantized checkpoints**: these formats already have their own storage precision. FP8 Storage is not applied on top — the toggle is silently a no-op for quantized formats, since the loader returns a different module type. | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### "I toggled FP8 Storage but VRAM usage didn't change" | ||
|
|
||
| The cache eviction is immediate for idle models, but **deferred until the next unlock** if the model is mid-generation. Wait for the current generation to finish, then start a new one — the next load will use the new setting. | ||
|
|
||
| If VRAM still hasn't dropped: | ||
|
|
||
| - Check the InvokeAI log for `FP8 layerwise casting enabled for <model name>`. If the line is missing, the model is most likely excluded (VAE, text encoder, Z-Image, LoRA; see the table above) or uses a quantized format, for which the toggle is a no-op. | ||
| - Confirm you are on CUDA. FP8 Storage is silently disabled on CPU and MPS. | ||
|
|
||
| ### Quality regression on a specific model | ||
|
|
||
| Disable FP8 Storage for that model in Model Manager and reload. If quality is restored, the model has FP8-sensitive layers that fall outside the default skip list. Please open an issue with the model name and a side-by-side comparison. | ||
|
|
||
| ### "RuntimeError: ... float8_e4m3fn ..." | ||
|
|
||
| You're on a PyTorch version that predates FP8 support. Reinstall InvokeAI using the official launcher — the bundled torch version supports FP8. |
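
The "roughly in half" figure and the skip list described in the docs above can be illustrated with a small back-of-the-envelope calculation. This is a minimal sketch, not InvokeAI's or diffusers' actual code: the layer names, parameter counts, and `SKIP_PATTERNS` are hypothetical, chosen only to resemble the layer kinds the docs say are skipped (norms, position/patch embeddings, `proj_in`/`proj_out`). It counts weight storage only and ignores activations and other runtime buffers.

```python
import re

# Hypothetical skip patterns (an assumption for illustration, not the real list).
SKIP_PATTERNS = ("norm", "pos_embed", "patch_embed", r"^proj_in$", r"^proj_out$")

BYTES_FP16 = 2  # FP16 and BF16 both use 2 bytes per parameter
BYTES_FP8 = 1   # float8_e4m3fn uses 1 byte per parameter


def is_skipped(name: str) -> bool:
    """Return True if a layer name matches any skip pattern (stays in FP16/BF16)."""
    return any(re.search(pattern, name) for pattern in SKIP_PATTERNS)


def weight_bytes(layers: dict[str, int], fp8_storage: bool) -> int:
    """Sum weight storage in bytes; skipped layers always keep 2 bytes per parameter."""
    total = 0
    for name, n_params in layers.items():
        if fp8_storage and not is_skipped(name):
            total += n_params * BYTES_FP8
        else:
            total += n_params * BYTES_FP16
    return total


# Hypothetical model: layer name -> parameter count.
layers = {
    "proj_in": 1_000_000,
    "blocks.0.attn.to_q": 50_000_000,
    "blocks.0.attn.norm_q": 10_000,
    "blocks.0.ff.net": 200_000_000,
    "pos_embed": 500_000,
    "proj_out": 1_000_000,
}

baseline = weight_bytes(layers, fp8_storage=False)
with_fp8 = weight_bytes(layers, fp8_storage=True)
print(f"FP16/BF16 weights: {baseline / 2**20:.1f} MiB")
print(f"FP8 storage:       {with_fp8 / 2**20:.1f} MiB ({with_fp8 / baseline:.1%} of baseline)")
```

Because the skipped layers hold only a small share of the parameters in this made-up model, keeping them at full precision costs little of the saving, which is consistent with the docs' claim that the footprint drops by about half.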