Conversation
The materialization step was critical to avoid sync locks!
Generation works now 🥳 I had to disable chunked prefill because it causes GPU hangs that make the distributed process deadlock; I have a potential explanation here. I also decided to compute the vision embeddings on all ranks rather than computing them on rank 0 and explicitly broadcasting the results. The resulting script is simpler and, I think, more stable (I found deadlocks when broadcasting the vision embeddings, although the process worked perfectly fine with text-only generation). This is now ready for review. Note that this PR also incorporates the changes I extracted to this separate PR, which we still have to refactor.
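The redundant-compute design described above can be sketched in plain Python. This is only an illustration of the idea, not the mlx-vlm API: `encode_images` is a deterministic stand-in for the real vision encoder, and the loop over ranks simulates a process group, so the script runs without a GPU or any distributed backend.

```python
import hashlib

def encode_images(pixels):
    # Hypothetical stand-in for the real vision encoder. The key property
    # is determinism: every rank running it on the same input gets the
    # same output, so no broadcast is needed to keep ranks in agreement.
    return hashlib.sha256(bytes(pixels)).hexdigest()

def run_all_ranks(world_size, pixels):
    # Each rank encodes the images locally instead of rank 0 encoding
    # them and broadcasting. Slightly more compute, but no collective
    # call at this point means no window for a broadcast deadlock.
    return [encode_images(pixels) for _rank in range(world_size)]

embeddings = run_all_ranks(4, [1, 2, 3])
assert len(set(embeddings)) == 1  # all ranks agree without communicating
```

The trade-off is redundant encoder work per rank in exchange for removing a synchronization point from the generation path.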
Example command:
```python
if quantization is not None:
    config["quantization"] = quantization
    config["quantization_config"] = quantization
```
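Since the snippet writes the same value under both keys (presumably so old and new readers keep working), a consumer might read it with a fallback. This is a hypothetical sketch, not an actual mlx function, and the `group_size`/`bits` values are just example quantization parameters:

```python
def get_quantization(config):
    # Prefer the newer "quantization_config" key and fall back to the
    # older "quantization" key; the block above writes both, so either
    # style of reader finds the value.
    return config.get("quantization_config", config.get("quantization"))

quantization = {"group_size": 64, "bits": 4}  # illustrative values
config = {}
if quantization is not None:
    config["quantization"] = quantization
    config["quantization_config"] = quantization
```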
This previous block will come from #692 when merged.
Updated with the latest
Let's revert the generate changes.
Could you try using a vision fix similar to qwen2 and 2_5 vl in #745?
I would do it but have no way of testing it 😅
Could we rebase this PR onto #730? It has some nice utils that are more general for distributed models.
Blaizzy left a comment
LGTM, just a few nits before we merge!
"causal" does not work; we need the full array here: https://github.com/ml-explore/mlx-lm/blob/9c113f7019683d944422f96122bed5c636aa0b90/mlx_lm/models/deepseek_v3.py#L149
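For reference, an explicit additive causal mask can be built as a full array instead of passing the string "causal". This NumPy sketch is only illustrative: mlx-lm constructs its masks with `mx.*` ops, and the helper name here is made up.

```python
import numpy as np

def full_causal_mask(seq_len, dtype=np.float32):
    # Additive attention mask as an explicit array: 0.0 on and below the
    # diagonal (positions a token may attend to), -inf strictly above it
    # (future positions, masked out before the softmax).
    return np.triu(np.full((seq_len, seq_len), -np.inf, dtype=dtype), k=1)

mask = full_causal_mask(4)
```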