Conversation
This makes it possible to load transformers quantized weights
```python
config["quantization"] = quantization
config["quantization_config"] = quantization
```
```python
if (quantization := config.get("quantization", None)) is not None:
```
I believe the new code above should be here, right?
It's a bit confusing 😅. quantization is first read in L140 (which was already there). If it exists, it's an MLX-style quantization definition. If it doesn't, we check whether quantization_config exists, convert it to the quantization format, and store it in the config object. Perhaps we could remove L140 and just check whether config has the quantization attribute in line 217.
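The precedence described above can be sketched as follows. This is a hedged illustration, not the repo's real code: the function name `resolve_quantization` and the exact `bits`/`group_size` keys are assumptions.

```python
def resolve_quantization(config: dict):
    """Return an MLX-style quantization dict, converting from a
    transformers-style quantization_config when necessary."""
    # An explicit MLX-style "quantization" entry wins when present.
    quantization = config.get("quantization", None)
    if quantization is not None:
        return quantization

    # Otherwise translate transformers' quantization_config and store it
    # back on the config so downstream code finds "quantization" there.
    hf_cfg = config.get("quantization_config", None)
    if hf_cfg is not None:
        quantization = {
            "bits": hf_cfg.get("bits", 4),               # assumed key
            "group_size": hf_cfg.get("group_size", 64),  # assumed key
        }
        config["quantization"] = quantization
    return quantization
```

With this shape, all the conversion logic sits in one function, matching the refactor suggested below.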
Let's refactor then 😎
I think this could start at line 241, so all the quantization logic is clear in one place.
Deleted the distant quantization line from L140 and simplified slightly.
This is another mismatch I found while working on #689. I was puzzled that I had to use QuantizedSwitchLinear explicitly in order to load the weights, whereas this was not necessary in mlx_lm. The quantization step performed immediately after sanitization is driven by the existence of quantization_config, and it adapts the weights accordingly so they have .scales and .biases. Extracting this as a separate PR for discussion; maybe I missed some side effects.
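To illustrate the .scales/.biases point: quantized linear layers in MLX-style checkpoints store extra per-group arrays alongside the packed weight, so their presence is a cheap signal that a checkpoint is quantized. The helper below is hypothetical (not code from this PR):

```python
def looks_quantized(weights: dict) -> bool:
    """Detect a quantized checkpoint: quantized linear layers carry
    per-group .scales and .biases arrays next to .weight, while
    unquantized layers only have .weight (and possibly .bias)."""
    return any(k.endswith((".scales", ".biases")) for k in weights)
```

For example, `looks_quantized({"w1.weight": 0, "w1.scales": 0, "w1.biases": 0})` returns True, while `looks_quantized({"w1.weight": 0})` returns False, which is why the loader must swap in quantized layer classes before attempting to load such weights.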