Conversation
The materialization step was critical to avoid sync locks!
Generation works now 🥳 I had to disable chunked prefill because it causes GPU hangs that make the distributed process deadlock; I have a potential explanation here. I also decided to compute the vision embeddings on all ranks rather than computing them on rank 0 and explicitly broadcasting the results. The resulting script is simpler and, I think, more stable (I found deadlocks when broadcasting the vision embeddings, although the process worked perfectly fine with text-only generation). This is now ready for review. Note that this PR also incorporates the changes I extracted to this separate PR, which we still have to refactor.
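The redundant-compute design described above can be sketched in plain Python. This is only an illustration of the idea, not the mlx-vlm API: `encode_images` is a deterministic stand-in for the real vision encoder, and the loop over ranks simulates a process group, so the script runs without a GPU or any distributed backend.

```python
import hashlib

def encode_images(pixels):
    # Hypothetical stand-in for the real vision encoder. The key property
    # is determinism: every rank running it on the same input gets the
    # same output, so no broadcast is needed to keep ranks in agreement.
    return hashlib.sha256(bytes(pixels)).hexdigest()

def run_all_ranks(world_size, pixels):
    # Each rank encodes the images locally instead of rank 0 encoding
    # them and broadcasting. Slightly more compute, but no collective
    # call at this point means no window for a broadcast deadlock.
    return [encode_images(pixels) for _rank in range(world_size)]

embeddings = run_all_ranks(4, [1, 2, 3])
assert len(set(embeddings)) == 1  # all ranks agree without communicating
```

The trade-off is redundant encoder work per rank in exchange for removing a synchronization point from the generation path.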
Example command:
```python
if quantization is not None:
    config["quantization"] = quantization
    config["quantization_config"] = quantization
```
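Since the snippet writes the same value under both keys (presumably so old and new readers keep working), a consumer might read it with a fallback. This is a hypothetical sketch, not an actual mlx function, and the `group_size`/`bits` values are just example quantization parameters:

```python
def get_quantization(config):
    # Prefer the newer "quantization_config" key and fall back to the
    # older "quantization" key; the block above writes both, so either
    # style of reader finds the value.
    return config.get("quantization_config", config.get("quantization"))

quantization = {"group_size": 64, "bits": 4}  # illustrative values
config = {}
if quantization is not None:
    config["quantization"] = quantization
    config["quantization_config"] = quantization
```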
This previous block will come from #692 when merged.
Updated with the latest
Let's revert the generate changes.
Could you try using a vision fix similar to qwen2 and 2_5 vl in #745?
I would do it but have no way of testing it 😅
Could we rebase this PR onto #730? It has some nice utils that are more general for distributed models.
Blaizzy left a comment
LGTM, just a few nits before we merge!
"causal" does not work; we need the full array here: https://github.com/ml-explore/mlx-lm/blob/9c113f7019683d944422f96122bed5c636aa0b90/mlx_lm/models/deepseek_v3.py#L149
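For reference, an explicit additive causal mask can be built as a full array instead of passing the string "causal". This NumPy sketch is only illustrative: mlx-lm constructs its masks with `mx.*` ops, and the helper name here is made up.

```python
import numpy as np

def full_causal_mask(seq_len, dtype=np.float32):
    # Additive attention mask as an explicit array: 0.0 on and below the
    # diagonal (positions a token may attend to), -inf strictly above it
    # (future positions, masked out before the softmax).
    return np.triu(np.full((seq_len, seq_len), -np.inf, dtype=dtype), k=1)

mask = full_causal_mask(4)
```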