diff --git a/mixtral.md b/mixtral.md
index 6073aa35d8..6387ca7320 100644
--- a/mixtral.md
+++ b/mixtral.md
@@ -285,8 +285,32 @@
 output = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-Note that for both QLoRA and GPTQ you need at least 30 GB of GPU VRAM to fit the model. You can make it work with 24 GB if you use `device_map="auto"`, like in the example above, so some layers are offloaded to CPU.
+If you have the [exllama kernels installed](https://github.com/turboderp/exllama), you can leverage them to run the GPTQ model. To do so, load the model with a custom `GPTQConfig` where you set the desired parameters:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
+
+model_id = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+gptq_config = GPTQConfig(bits=4, use_exllama=True)  # 4-bit GPTQ with the exllama backend
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    quantization_config=gptq_config,
+    device_map="auto"  # offload layers to CPU if they don't fit on the GPU
+)
+prompt = "[INST] Explain what a Mixture of Experts is in less than 100 words. [/INST]"
+inputs = tokenizer(prompt, return_tensors="pt").to(0)
+
+output = model.generate(**inputs, max_new_tokens=50)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+If left unset, the `use_exllama` parameter defaults to `True`, enabling the exllama backend, which only works with `bits=4`.
+
+Note that for both QLoRA and GPTQ you need at least 30 GB of GPU VRAM to fit the model. You can make it work with 24 GB if you use `device_map="auto"`, like in the example above, so some layers are offloaded to CPU.
 
 ## Disclaimers and ongoing work
 