Add imatrix support #633

Open · wants to merge 15 commits into master

Changes from all commits
2 changes: 2 additions & 0 deletions clip.hpp

@@ -661,6 +661,7 @@ class CLIPTextModel : public GGMLBlock {
if (version == OPEN_CLIP_VIT_BIGG_14) {
enum ggml_type wtype = GGML_TYPE_F32; // tensor_types.find(prefix + "text_projection") != tensor_types.end() ? tensor_types[prefix + "text_projection"] : GGML_TYPE_F32;
params["text_projection"] = ggml_new_tensor_2d(ctx, wtype, projection_dim, hidden_size);
ggml_set_name(params["text_projection"], (prefix + "text_projection").c_str());
}
}

@@ -812,6 +813,7 @@ class CLIPProjection : public UnaryBlock {
} else {
params["weight"] = ggml_new_tensor_2d(ctx, wtype, in_features, out_features);
}
ggml_set_name(params["weight"], (prefix + "weight").c_str());
}

public:
59 changes: 59 additions & 0 deletions docs/imatrix.md

@@ -0,0 +1,59 @@
# Importance Matrix (imatrix) Quantization

## What is an Importance Matrix?

Quantization reduces the precision of a model's weights, decreasing its size and computational requirements. However, this can lead to a loss of quality. An importance matrix helps mitigate this by identifying which weights are *most* important for the model's performance. During quantization, these important weights are preserved with higher precision, while less important weights are quantized more aggressively. This allows for better overall quality at a given quantization level.

This originates from work done with language models in [llama.cpp](https://github.com/ggml-org/llama.cpp/blob/master/examples/imatrix/README.md).
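
To make this concrete, below is a minimal sketch of the statistic being collected, following the approach used in llama.cpp: the importance of each input channel of a weight matrix is estimated from the squared activations flowing into it during generation. The `ImatrixEntry` type and function names here are illustrative, not this PR's actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: running importance statistics for one weight tensor.
// values[c] accumulates the squared activations seen by input channel c, and
// count tracks how many activation rows contributed, so values[c] / count is
// the mean squared activation ("importance") of that channel.
struct ImatrixEntry {
    std::vector<float> values;
    int count = 0;
};

// Accumulate one batch of activations (n_rows rows of n_channels floats).
void imatrix_accumulate(ImatrixEntry& e, const float* activations,
                        size_t n_channels, size_t n_rows) {
    if (e.values.empty()) {
        e.values.assign(n_channels, 0.0f);
    }
    for (size_t r = 0; r < n_rows; r++) {
        for (size_t c = 0; c < n_channels; c++) {
            const float a = activations[r * n_channels + c];
            e.values[c] += a * a;  // channels with larger activations matter more
        }
    }
    e.count += (int)n_rows;
}
```

During quantization, channels with a larger mean squared activation get their rounding error weighted more heavily, which is where the quality gain comes from.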

## Usage

The imatrix feature involves two main steps: *training* the matrix and *using* it during quantization.

### Training the Importance Matrix

To generate an imatrix, run stable-diffusion.cpp with the `--imat-out` flag, specifying the output filename. This process runs alongside normal image generation.

```bash
sd.exe [same parameters as normal generation] --imat-out imatrix.dat
```

* **`[same parameters as normal generation]`**: Use the same command-line arguments you would normally use for image generation (prompt, dimensions, sampling method, and so on).
* **`--imat-out imatrix.dat`**: Specifies the output file for the generated imatrix.

You can generate multiple images at once with the `-b` (batch count) flag to speed up imatrix training.

### Continuing Training an Existing Matrix

If you want to refine an existing imatrix, use the `--imat-in` flag *in addition* to `--imat-out`. This will load the existing matrix and continue training it.

```bash
sd.exe [same parameters as normal generation] --imat-out imatrix.dat --imat-in imatrix.dat
```
This way, you can keep training and refining the imatrix as part of your normal image-generation workflow.

### Using Multiple Matrices

You can load and merge multiple imatrices together:

```bash
sd.exe [same parameters as normal generation] --imat-out imatrix.dat --imat-in imatrix.dat --imat-in imatrix2.dat
```
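
Merging is cheap because of how the statistics are stored: assuming the running-sum representation from the sketch above, combining two matrices for the same tensor is just element-wise addition. Again purely illustrative, reusing the hypothetical `ImatrixEntry`:

```cpp
// Illustrative only: merge src into dst for the same tensor. Sums of squared
// activations and sample counts are both additive, so nothing needs to be
// renormalized until the statistics are actually used.
void imatrix_merge(ImatrixEntry& dst, const ImatrixEntry& src) {
    if (dst.values.empty()) {
        dst.values.assign(src.values.size(), 0.0f);
    }
    for (size_t c = 0; c < src.values.size(); c++) {
        dst.values[c] += src.values[c];
    }
    dst.count += src.count;
}
```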

### Quantizing with an Importance Matrix

To quantize a model using a trained imatrix, use the `-M convert` option (or equivalent quantization command) and the `--imat-in` flag, specifying the imatrix file.

```bash
sd.exe -M convert [same parameters as normal quantization] --imat-in imatrix.dat
```

* **`[same parameters as normal quantization]`**: Use the same command-line arguments you would normally use for quantization (e.g., target quantization type, input/output filenames).
* **`--imat-in imatrix.dat`**: Specifies the imatrix file to use during quantization. You can specify multiple `--imat-in` flags to combine multiple matrices.
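
Under the hood, this is where the collected statistics reach ggml. The sketch below is hedged: it assumes ggml's `ggml_quantize_chunk` API and reuses the hypothetical `ImatrixEntry` from the earlier sketches. The per-channel mean squared activations are passed as the trailing `imatrix` argument, which biases the rounding toward accuracy on important channels; passing `nullptr` instead is what the "dummy imatrix" warning refers to.

```cpp
#include "ggml.h"

#include <cstdint>
#include <vector>

// Illustrative only: quantize a 2D weight tensor (n_rows x n_per_row floats,
// row-major) using statistics gathered during training. Assumes e.count > 0.
size_t quantize_with_imatrix(enum ggml_type type, const float* weights,
                             void* out, int64_t n_rows, int64_t n_per_row,
                             const ImatrixEntry& e) {
    // Convert running sums into mean squared activations per input channel.
    std::vector<float> importance(n_per_row);
    for (int64_t c = 0; c < n_per_row; c++) {
        importance[c] = e.values[c] / e.count;
    }
    // Types flagged by ggml_quantize_requires_imatrix() need this argument
    // to produce usable output; for the rest it is an optional refinement.
    return ggml_quantize_chunk(type, weights, out, /*start=*/0,
                               n_rows, n_per_row, importance.data());
}
```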

## Important Considerations

* The quality of the imatrix depends on the prompts and settings used during training. Use prompts and settings representative of the types of images you intend to generate for the best results.
* Experiment with different training parameters (e.g., number of images, prompt variations) to optimize the imatrix for your specific use case.
* The performance impact of training an imatrix during image generation or using an imatrix for quantization is negligible.
* Training the imatrix on an already-quantized model appears to work fine.
111 changes: 79 additions & 32 deletions examples/cli/main.cpp

@@ -129,6 +129,12 @@ struct SDParams {
float slg_scale = 0.f;
float skip_layer_start = 0.01f;
float skip_layer_end = 0.2f;

/* Imatrix params */

std::string imatrix_out = "";

std::vector<std::string> imatrix_in = {};
};

void print_params(SDParams params) {
@@ -204,6 +210,8 @@ void print_usage(int argc, const char* argv[]) {
printf(" --upscale-repeats Run the ESRGAN upscaler this many times (default 1)\n");
printf(" --type [TYPE] weight type (examples: f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K, q4_K)\n");
printf(" If not specified, the default is the type of the weight file\n");
printf(" --imat-out [PATH] If set, compute the imatrix for this run and save it to the provided path\n");
printf(" --imat-in [PATH] Use imatrix for quantization.\n");
printf(" --lora-model-dir [DIR] lora model directory\n");
printf(" -i, --init-img [IMAGE] path to the input image, required by img2img\n");
printf(" --mask [MASK] path to the mask image, required by img2img with mask\n");
@@ -250,6 +258,7 @@
void parse_args(int argc, const char** argv, SDParams& params) {
bool invalid_arg = false;
std::string arg;
std::string type = "";
for (int i = 1; i < argc; i++) {
arg = argv[i];

@@ -355,32 +364,7 @@ void parse_args(int argc, const char** argv, SDParams& params) {
invalid_arg = true;
break;
}
std::string type = argv[i];
bool found = false;
std::string valid_types = "";
for (size_t i = 0; i < SD_TYPE_COUNT; i++) {
auto trait = ggml_get_type_traits((ggml_type)i);
std::string name(trait->type_name);
if (name == "f32" || trait->to_float && trait->type_size) {
if (i)
valid_types += ", ";
valid_types += name;
if (type == name) {
if (ggml_quantize_requires_imatrix((ggml_type)i)) {
printf("\033[35;1m[WARNING]\033[0m: type %s requires imatrix to work properly. A dummy imatrix will be used, expect poor quality.\n", trait->type_name);
}
params.wtype = (enum sd_type_t)i;
found = true;
break;
}
}
}
if (!found) {
fprintf(stderr, "error: invalid weight format %s, must be one of [%s]\n",
type.c_str(),
valid_types.c_str());
exit(1);
}
type = argv[i];
} else if (arg == "--lora-model-dir") {
if (++i >= argc) {
invalid_arg = true;
@@ -629,12 +613,60 @@ void parse_args(int argc, const char** argv, SDParams& params) {
break;
}
params.skip_layer_end = std::stof(argv[i]);
} else if (arg == "--imat-out") {
if (++i >= argc) {
invalid_arg = true;
break;
}
params.imatrix_out = argv[i];
} else if (arg == "--imat-in") {
if (++i >= argc) {
invalid_arg = true;
break;
}
params.imatrix_in.push_back(std::string(argv[i]));
} else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
print_usage(argc, argv);
exit(1);
}
}
if (type != "") {
bool found = false;
std::string valid_types = "";
for (size_t i = 0; i < SD_TYPE_COUNT; i++) {
auto trait = ggml_get_type_traits((ggml_type)i);
std::string name(trait->type_name);
if (name == "f32" || trait->to_float && trait->type_size) {
if (i)
valid_types += ", ";
valid_types += name;
if (type == name) {
if (ggml_quantize_requires_imatrix((ggml_type)i) && params.imatrix_in.size() == 0) {
printf("\033[35;1m[WARNING]\033[0m: type %s requires imatrix to work properly. A dummy imatrix will be used, expect poor quality.\n", trait->type_name);
}
params.wtype = (enum sd_type_t)i;
found = true;
break;
}
}
}
if (!found) {
fprintf(stderr, "error: invalid weight format %s, must be one of [%s]\n",
type.c_str(),
valid_types.c_str());
exit(1);
}
}

if (params.imatrix_out.size() > 0 && std::ifstream(params.imatrix_out).good()) {
// imatrix file already exists
if (std::find(params.imatrix_in.begin(), params.imatrix_in.end(), params.imatrix_out) == params.imatrix_in.end()) {
printf("\n IMPORTANT: imatrix file %s already exists, but wasn't found in the imatrix inputs.\n", params.imatrix_out.c_str());
printf("%s will get overwritten!\n", params.imatrix_out.c_str());
}
}

if (invalid_arg) {
fprintf(stderr, "error: invalid parameter for argument: %s\n", arg.c_str());
print_usage(argc, argv);
@@ -799,8 +831,20 @@ int main(int argc, const char* argv[]) {
printf("%s", sd_get_system_info());
}

if (params.imatrix_out != "") {
enableImatrixCollection();
}
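// Load any provided imatrix files whenever they can be used: to seed further training, or when quantizing at load or convert time.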
if (params.imatrix_out != "" || params.mode == CONVERT || params.wtype != SD_TYPE_COUNT) {
for (const auto& in_file : params.imatrix_in) {
printf("loading imatrix from '%s'\n", in_file.c_str());
if (!loadImatrix(in_file.c_str())) {
printf("Failed to load %s\n", in_file.c_str());
}
}
}

if (params.mode == CONVERT) {
bool success = convert(params.model_path.c_str(), params.vae_path.c_str(), params.output_path.c_str(), params.wtype);
bool success = convert(params.model_path.c_str(), params.clip_l_path.c_str(), params.clip_g_path.c_str(), params.t5xxl_path.c_str(), params.diffusion_model_path.c_str(), params.vae_path.c_str(), params.output_path.c_str(), params.wtype);
if (!success) {
fprintf(stderr,
"convert '%s'/'%s' to '%s' failed\n",
@@ -1075,19 +1119,19 @@ int main(int argc, const char* argv[]) {

std::string dummy_name, ext, lc_ext;
bool is_jpg;
size_t last = params.output_path.find_last_of(".");
size_t last_path = std::min(params.output_path.find_last_of("/"),
params.output_path.find_last_of("\\"));
if (last != std::string::npos // filename has extension
&& (last_path == std::string::npos || last > last_path)) {
dummy_name = params.output_path.substr(0, last);
ext = lc_ext = params.output_path.substr(last);
std::transform(ext.begin(), ext.end(), lc_ext.begin(), ::tolower);
is_jpg = lc_ext == ".jpg" || lc_ext == ".jpeg" || lc_ext == ".jpe";
} else {
dummy_name = params.output_path;
ext = lc_ext = "";
is_jpg = false;
}
// appending ".png" to absent or unknown extension
if (!is_jpg && lc_ext != ".png") {
@@ -1099,7 +1143,7 @@
continue;
}
std::string final_image_path = i > 0 ? dummy_name + "_" + std::to_string(i + 1) + ext : dummy_name + ext;
if (is_jpg) {
stbi_write_jpg(final_image_path.c_str(), results[i].width, results[i].height, results[i].channel,
results[i].data, 90, get_image_params(params, params.seed + i).c_str());
printf("save result JPEG image to '%s'\n", final_image_path.c_str());
Expand All @@ -1111,6 +1155,9 @@ int main(int argc, const char* argv[]) {
free(results[i].data);
results[i].data = NULL;
}
if (params.imatrix_out != "") {
saveImatrix(params.imatrix_out.c_str());
}
free(results);
free_sd_ctx(sd_ctx);
free(control_image_buffer);
46 changes: 38 additions & 8 deletions ggml_extend.hpp

@@ -23,9 +23,11 @@
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"
#include "ggml/src/ggml-impl.h"
#include "ggml.h"

#include "model.h"
#include "util.h"

#ifdef SD_USE_CUDA
#include "ggml-cuda.h"
@@ -117,13 +119,6 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_kronecker(ggml_context* ctx, struct g
b);
}

__STATIC_INLINE__ void ggml_log_callback_default(ggml_log_level level, const char* text, void* user_data) {
(void)level;
(void)user_data;
fputs(text, stderr);
fflush(stderr);
}

__STATIC_INLINE__ void ggml_tensor_set_f32_randn(struct ggml_tensor* tensor, std::shared_ptr<RNG> rng) {
uint32_t n = (uint32_t)ggml_nelements(tensor);
std::vector<float> random_numbers = rng->randn(n);
@@ -1241,7 +1236,39 @@ struct GGMLRunner {
ggml_backend_cpu_set_n_threads(backend, n_threads);
}

ggml_backend_graph_compute(backend, gf);
auto callback_eval = get_callback_eval();

if (!callback_eval) {
ggml_backend_graph_compute(backend, gf);
} else {
void* callback_eval_user_data = get_callback_eval_user_data();
for (int j0 = 0; j0 < gf->n_nodes; j0++) {
struct ggml_tensor* t = gf->nodes[j0];

// check if the user needs data from this node
bool need = callback_eval(t, true, callback_eval_user_data);

int j1 = j0;

// determine the range [j0, j1] of nodes that can be computed together
while (!need && j1 < gf->n_nodes - 1) {
t = gf->nodes[++j1];
need = callback_eval(t, true, callback_eval_user_data);
}

struct ggml_cgraph gv = ggml_graph_view(gf, j0, j1 + 1);

ggml_backend_graph_compute_async(backend, &gv);

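// the node has now been computed: pass its data to the user, who can stop the traversal early by returning false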
if (need && !callback_eval(t, false, callback_eval_user_data)) {
break;
}

j0 = j1;
}
ggml_backend_synchronize(backend);
}

#ifdef GGML_PERF
ggml_graph_print(gf);
#endif
@@ -1345,6 +1372,7 @@ class Linear : public UnaryBlock {
wtype = GGML_TYPE_F32;
}
params["weight"] = ggml_new_tensor_2d(ctx, wtype, in_features, out_features);
ggml_set_name(params["weight"], (prefix + "weight").c_str());
if (bias) {
enum ggml_type wtype = GGML_TYPE_F32; //(tensor_types.find(prefix + "bias") != tensor_types.end()) ? tensor_types[prefix + "bias"] : GGML_TYPE_F32;
params["bias"] = ggml_new_tensor_1d(ctx, wtype, out_features);
@@ -1508,6 +1536,8 @@ class LayerNorm : public UnaryBlock {
if (elementwise_affine) {
enum ggml_type wtype = GGML_TYPE_F32; //(tensor_types.find(prefix + "weight") != tensor_types.end()) ? tensor_types[prefix + "weight"] : GGML_TYPE_F32;
params["weight"] = ggml_new_tensor_1d(ctx, wtype, normalized_shape);
ggml_set_name(params["weight"], (prefix + "weight").c_str());

if (bias) {
enum ggml_type wtype = GGML_TYPE_F32; //(tensor_types.find(prefix + "bias") != tensor_types.end()) ? tensor_types[prefix + "bias"] : GGML_TYPE_F32;
params["bias"] = ggml_new_tensor_1d(ctx, wtype, normalized_shape);