Add imatrix support #633
Conversation
@stduhpf Thank you for working on this :) Do you think transformer-based models work better with importance matrices, like ggml quants generally do? (e.g. flux)
@Green-Sky I have no idea. I'm not sure it would work right now, but I've only tested sd1.5 so far, because it's so much faster.
Maybe imatrix.hpp should just not be a header-only lib ^^
@Green-Sky I'm doing some tests with sd3, and it seems to be doing something, but cooking an imatrix for larger un-distilled models takes ages compared to something like sd1.5 LCM. Now that I think about it, applying an imatrix to flux (or any model with a standalone diffusion model) will be tricky: the imatrix uses the names that the weights have at runtime, but when quantizing, the names are not prefixed like they are at runtime.
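To illustrate the naming mismatch (a hypothetical sketch only; the prefix shown is the usual runtime convention, and this is not necessarily how the PR resolves it): since the imatrix is collected under runtime tensor names, a lookup during quantization of a standalone diffusion-model file could fall back to the prefixed form.

#include <map>
#include <string>
#include <vector>

// Hypothetical lookup helper: try the tensor's on-disk name first, then the
// runtime-prefixed name that the imatrix entries were collected under.
// The prefix and map layout are illustrative assumptions, not the PR's code.
static const std::vector<float>* find_imatrix_entry(
        const std::map<std::string, std::vector<float>>& imatrix_data,
        const std::string& tensor_name) {
    auto it = imatrix_data.find(tensor_name);
    if (it == imatrix_data.end()) {
        it = imatrix_data.find("model.diffusion_model." + tensor_name);
    }
    return it == imatrix_data.end() ? nullptr : &it->second;
}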
Nice job stduhpf.
Flux also seems to struggle with the lower-bit i-quants: https://huggingface.co/Eviation/flux-imatrix
Ok, I found a satisfactory way to apply the imatrix to flux. (Also, it seems like training the imatrix with quantized models works just fine.)
(imatrix trained on 10 generations using the static q4_k (schnell) or iq4_nl (dev) model)
Looks great. Did you tune it on the same amount of sampling steps? Optimising for your own use case is probably the best for lower quants.
For the schnell one, I trained it with 4 steps only, with different resolutions. My PC is currently cooking a Flux dev imatrix using varying step counts (from 16 to 40). Maybe I'll try to make one with a fixed step count to compare against afterwards.
(force-pushed from 4ec74a9 to 24d8fd7)
I feel like this is pretty much ready now.
I am trying this right now. I am no expert on how the importance data flows into the quantization, but does it make sense to sample using a quant, just to recreate the same quant with the importance data? You showed that using a higher quant to generate the imat works, but using the same quant would be interesting...
I think it would work. As long as the original quant is "good enough" to generate coherent images, the activations should already be representative of the ideal activations, and therefore the imatrix shouldn't be too different from the one trained on the full-precision model, with the same kind of improvements.
Thanks, good to know. This all reminds me very much of PGO, where you usually stack them to get the last 1-2% of performance. 😄 I am doing q5_k right now, and the image is very coherent indeed.
examples/cli/main.cpp
Outdated
@@ -204,6 +210,8 @@ void print_usage(int argc, const char* argv[]) {
printf(" --upscale-repeats Run the ESRGAN upscaler this many times (default 1)\n");
printf(" --type [TYPE] weight type (examples: f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K, q4_K)\n");
printf(" If not specified, the default is the type of the weight file\n");
printf(" --imat-out [PATH] If set, compute the imatrix for this run and save it to the provided path");
printf(" --imat-in [PATH] Use imatrix for quantization.");
Both new options are missing a newline.
imatrix.cpp
Outdated
return false;
}

// Recreate the state as expected by save_imatrix(), and corerct for weighted sum.
corerct -> correct
I don't know, I just copy-pasted that part of the code, maybe the typo is important
😁
Not sure if I did anything wrong, but using imats produced by the same quant seems to produce the same model file. So either it does not work, or I did something wrong.
edit:
edit2: and the imats are different
edit3: tried to do an optimized q4_k, same issue, so something is fundamentally broken with the flux prune/distill/dedistill I am using.
Try with And then run it with
Another issue. When I use flash attention, it breaks the imat collection after a varying number of images. (using sd_turbo here)
update: happened after much longer without flash attention too.
This seems to have worked. Not a fan of the tensor renaming though.
(they were obviously identical before)
It looks like the imat from q5_k made q4_k stray more. I don't have the full-size model image for this example (it's too expensive ngl), but to me this looks worse. However, the importance-guided quant seems to have less dither noise, so it got better somewhere... I was trying to measure the visual quality difference of the quants, and I remembered that flux specifically shows dither-like patterns when you go lower with the quants. So I tried to measure that with GIMP: I first applied a high-pass filter (at 0.5 std and 4 contrast) and then used the histogram plot.
Base is spread out a little more, so this should mean there is indeed more high-frequency noise, but this is just a single sample AND a highly experimental and somewhat subjective analysis 😅
Yes, this is a bit annoying for flux models. I thought of adding a way to extract the diffusion model (or other components like the vae or text encoders) from the model file, but I feel like this is getting a bit out of scope for this PR. (Something like
@Green-Sky I've had a similar problem with sd3 q4_k (with an fp16 imatrix). For some reason the outputs of the q4_k model seem to stray further from the full-precision model when using an imatrix, but it still seems to minimise artifacts. (I think q6_k shows a similar behavior to a lesser extent)
Interesting. Here is q2_k, which seems to have the same, but much more noticeable behavior.
I actually can't, it crashes.
The reason seems to be a null buffer here:
update: It's with any model type (q8_0, f16, q5_k tested)
stacktrace:
update2: same for sd_turbo
@Green-Sky It should be fixed now.
@stduhpf works, thanks. Running some at an incredible 541.82s/it. But it looks like CPU inference is broken for that model..., so the imat might be of questionable quality. update: I did a big oopsy and forgot to add
I ran the freshly generated f16 imatrix file through ggml-org/llama.cpp#12718:
somewhat unreadable.
Haven't had much of an opportunity to play with T2I models yet, but if someone can point me to a sample model and imatrix file, I'm happy to make the necessary changes.
Adds support for llama.cpp-style importance matrices (see https://github.com/ggml-org/llama.cpp/blob/master/examples/imatrix/README.md and ggml-org/llama.cpp#4861) to improve the quality of quantized models.
Models quantized with an imatrix are backwards compatible with previous releases.
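For readers new to the technique, here is a rough sketch (illustrative only, not the code in this PR) of how an llama.cpp-style importance matrix is accumulated: for each weight tensor involved in a matrix multiplication, the squared activations feeding each input channel are summed over the calibration run, and those per-channel statistics later weight the rounding error during quantization.

#include <map>
#include <string>
#include <vector>

// Illustrative accumulator: one running sum of squared activations per input
// channel of each weight tensor, plus a call counter (in the spirit of llama.cpp's imatrix).
struct imatrix_entry {
    std::vector<float> values;  // per-channel sums of squared activations
    int ncall = 0;              // number of batches that contributed
};

static std::map<std::string, imatrix_entry> g_imatrix;

// Called for every mat-mul during calibration: `acts` holds the activations
// (n_rows rows of n_channels values) that multiply the weight named `name`.
void collect_imatrix(const std::string& name, const float* acts,
                     int n_channels, int n_rows) {
    imatrix_entry& e = g_imatrix[name];
    e.values.resize(n_channels, 0.0f);
    for (int r = 0; r < n_rows; ++r) {
        for (int c = 0; c < n_channels; ++c) {
            const float x = acts[r * n_channels + c];
            e.values[c] += x * x;  // channels carrying large activations matter more
        }
    }
    e.ncall++;
}

At quantization time these per-channel statistics are passed to the quantization routines as weights, so channels that consistently carry large activations receive smaller rounding error.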
Usage:
To train an imatrix:
sd.exe [same exact parameters as normal generation] --imat-out imatrix.dat
This will generate an image and train the imatrix while doing so (you can use -b to generate multiple images at once).
To keep training an existing imatrix:
sd.exe [same exact parameters as normal generation] --imat-out imatrix.dat --imat-in imatrix.dat
You can load multiple imatrices at once; this will merge them into the output:
sd.exe [same exact parameters as normal generation] --imat-out imatrix.dat --imat-in imatrix.dat --imat-in imatrix2.dat
Quantize with imatrix:
sd.exe -M convert [same exact parameters as normal quantization] --imat-in imatrix.dat
(again, you can use multiple imatrices)
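For instance, a complete quantization call might look like the following (the model and output file names are placeholders, assuming the usual convert-mode flags -m, -o and --type):
sd.exe -M convert -m dreamshaper_8.safetensors -o dreamshaper_8-q4_K.gguf --type q4_K --imat-in imatrix.dat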
Examples
"simple" imatrix trained on a batch of 32 image generations (512x512) with the dreamshaper_8LCM (f16, 8 steps) model and empty prompts. (because of the model's bias, it was mostly calibrated on portraits of asian women):
"better" imatrix trained on 504 generations using diverse prompst and aspect ratios, using the same model.
* static means that the importance matrix is not active (all ones), as it is set up to do when quantizing with the master branch.
iq2_xs seems completely broken even with imatrix for this model, but the effect is still noticable. With iq4, the static quant is already pretty good so the difference in quality isn't obvious. (both using the "better" imatrix here)
Interesting observation: for the "girl wearing a funny hat" prompt, static quants put her in a city like the original fp16 model does, while the quants calibrated with the "better" imatrix put her in a forest. This is most likely due to a bias in the calibraton dataset, which contained some samples of girls with forest background and none with city backgrounds.
You can find these models and the imatrices used here: https://huggingface.co/stduhpf/dreamshaper-8LCM-im-GGUF-sdcpp
You can find examples with other models in the discussion.