Distributed inference for Kimi K2.5 #689

Open
pcuenca wants to merge 17 commits into Blaizzy:main from pcuenca:kimi-k25

Conversation

@pcuenca
Contributor

@pcuenca pcuenca commented Jan 27, 2026

No description provided.

@pcuenca
Contributor Author

pcuenca commented Feb 2, 2026

Generation works now 🥳

I had to disable chunked prefill because it triggers GPU hangs that deadlock the distributed process. I have a potential explanation here.

I also decided to compute the vision embeddings on all ranks rather than computing them on rank 0 and explicitly broadcasting the results. The resulting script is simpler and, I think, more stable (I ran into deadlocks when broadcasting the vision embeddings, even though the same process worked fine for text-only generation).
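The trade-off between the two strategies can be sketched as follows. This is a minimal illustration only: `vision_model` and `broadcast` are hypothetical stand-ins, not the actual mlx-vlm or MLX distributed API.

```python
# Hypothetical sketch of the two vision-embedding strategies discussed
# above. `vision_model` and `broadcast` are illustrative callables, not
# real mlx-vlm APIs.

def embed_on_all_ranks(pixels, vision_model):
    """Every rank runs the vision tower on the same inputs.

    Redundant compute, but all ranks end up with identical embeddings
    and no collective communication (a potential deadlock point) is
    needed.
    """
    return vision_model(pixels)

def embed_on_rank0_and_broadcast(pixels, vision_model, rank, broadcast):
    """Only rank 0 computes; the result is broadcast to the others.

    Saves compute on ranks > 0, but introduces a synchronization point
    that can deadlock if the ranks' execution diverges.
    """
    embeddings = vision_model(pixels) if rank == 0 else None
    return broadcast(embeddings, root=0)
```

The PR opts for the first variant: the redundant forward pass is cheap relative to the stability gained by removing the collective.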

This is now ready for review. Note that this PR also incorporates the changes I extracted into this separate PR, which we still have to refactor.

@pcuenca
Contributor Author

pcuenca commented Feb 2, 2026

Example command:

mlx.launch --hostfile ring-ethernet.json --backend ring --env MLX_METAL_FAST_SYNCH=1 --env HF_HUB_CACHE=/Users/pedro/code/mlx/cache -- /Users/pedro/code/mlx/mlx-vlm/examples/sharded_kimi_generate.py
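For reference, a ring-backend hostfile such as ring-ethernet.json is, per the MLX distributed docs, a JSON list of hosts; the hostnames and IPs below are placeholders, not values from this PR:

```json
[
  {"ssh": "mac-node-1", "ips": ["192.168.1.101"]},
  {"ssh": "mac-node-2", "ips": ["192.168.1.102"]}
]
```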

@pcuenca pcuenca changed the title from "[wip] Add Kimi K2.5" to "Distributed inference for Kimi K2.5" on Feb 2, 2026
if quantization is not None:
config["quantization"] = quantization
config["quantization_config"] = quantization

Contributor Author


This previous block will come from #692 when merged.

@pcuenca
Contributor Author

pcuenca commented Feb 17, 2026

Updated with the latest main; prefill and generation both work.

@pcuenca pcuenca requested a review from Blaizzy February 17, 2026 11:59
Owner


Let's revert the generate changes.

Owner


Could you try a vision fix similar to the one for qwen2 and qwen2_5 vl in #745?

I would do it myself, but I have no way of testing it 😅

Owner


Could we rebase this PR onto #730? It has some nice utils that are more general for distributed models.

Owner

@Blaizzy Blaizzy left a comment


LGTM, just a few nits before we merge!
