
feat: Add ZImage/LongCat/Sana diffusion support + LLM VL improvements… #4165

Open
er6y wants to merge 2 commits into alibaba:master from er6y:master

Conversation


@er6y er6y commented Feb 12, 2026

… + OpenCL fixes

Diffusion Engine:

  • Add ZImageDiffusion subclass: FlowMatch Euler scheduler, PhiloxRNG noise, CLIP text encoder
  • Add LongCatDiffusion subclass: LLM text encoder (lazy load), Flux-like latent packing, VAE enc/dec, T2I + Image Edit modes
  • Integrate SanaDiffusion with SanaLlm into unified API, fix Euler sampling (dt/1000->dt) and CFG order
  • Add DiffusionConfig/LLMEncoderConfig, GPU config (BUFFER mode, FP32, Memory_Low, OpenCL cache)
  • Add DiffusionGpuMemoryMode/PrecisionMode/CFGMode enums
  • Extend createDiffusion factory with full parameters for all model types
  • Unify diffusion_demo to support SD1.5/Taiyi/Sana/ZImage/LongCat with cfg_scale and input_image args
  • Add image processing utilities (resize, crop, colorspace, pack/unpack latents)
  • Implement FlowMatchEuler scheduler

LLM / Vision:

  • omni.cpp: Qwen3-VL vision fixes (floor rounding, nullptr check)
  • omni.hpp: mrope position ids fix max(T,H,W)+1
  • tokenizer.hpp: public wrapper header, MNN_PUBLIC export
  • llm_demo: LLM_DEMO_ONELINE mode
  • pymnn llm.h: forward_all() binding
  • Fix Qwen2_5Vision transformer_fuse for window attention compatibility
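The mrope position-id fix above can be illustrated with a small sketch (a hypothetical helper, not the actual omni.hpp code): after a vision span, the next text token should resume one past the largest position id used across the T/H/W axes, rather than after the full token count:

```cpp
#include <algorithm>

// Illustrative sketch of the mrope rule. A vision span starting at
// `start` emits 3D position ids spanning T, H and W steps on separate
// axes; the largest id used is start + max(T, H, W) - 1, so the next
// text token resumes at start + max(T, H, W) — i.e. "max(T,H,W)+1"
// relative to the last position before the span.
int nextTextPosition(int start, int T, int H, int W) {
    return start + std::max({T, H, W});
}
```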

OpenCL fixes:

  • BinaryBufExecution: localWorkSize divisibility check
  • binary_buf.cl: float4/int4 init, per-element ReLU (NVIDIA compiler bug workaround)
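The localWorkSize divisibility check can be sketched like this (an assumed helper, not the exact BinaryBufExecution code). OpenCL 1.x requires each globalWorkSize dimension to be a multiple of the corresponding localWorkSize, so the global size is rounded up before the kernel is enqueued:

```cpp
#include <array>
#include <cstddef>

// Sketch of a divisibility guard for a 2D NDRange: round each global
// work-size dimension up to the next multiple of the local work-size,
// as required by clEnqueueNDRangeKernel on OpenCL 1.x devices.
std::array<size_t, 2> alignGlobalWorkSize(std::array<size_t, 2> global,
                                          const std::array<size_t, 2>& local) {
    for (int i = 0; i < 2; ++i) {
        size_t l = local[i];
        if (l > 0 && global[i] % l != 0) {
            global[i] = (global[i] / l + 1) * l;  // round up to multiple of l
        }
    }
    return global;
}
```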

Other fixes:

  • ShapeSliceTf/ShapeWhere: shape calculation rewrite
  • OnnxEinsum: outer product broadcast _Unsqueeze fix
  • Pipeline: NaN/Inf debug check macro (disabled by default)
  • Fix LongCat unpackLatentsGPU to match master implementation

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@wangzhaode wangzhaode self-assigned this Feb 12, 2026
@er6y er6y force-pushed the master branch 5 times, most recently from cc3032c to d1527db Compare February 26, 2026 11:09
@@ -51,13 +60,20 @@ void BufferPool::clear() {
}

void BufferPool::releaseFreeList() {
std::multimap<size_t, std::shared_ptr<OpenCLBufferNode>> keepList;
Collaborator

Using this logic in an OpenCL dynamic memory pool might result in the release of memory reused by preceding layers, leading to a crash.
Do you mean to release the memory previously allocated in the DYNAMIC_IN_EXECUTION memory pool used by the Attention operator?

er6y added 2 commits March 15, 2026 09:35
…cation

- Add Flux2Klein (FLUX.2-Klein-4B) diffusion model support
- Fix VAE-on-CPU flag (was ignored; now creates separate CPU runtime)
- Fix OpenCL OOM with large attention tensors in 1024px edit mode
- Quality fixes: ZImage VAE scaling, text encoder padding, dynamic time_shift
- Refactor: extract shared code to base class (~180 lines removed)
- LLM/VL: Qwen3-VL vision fixes, mrope position ids fix
- Optimize memory usage and lazy loading (Memory mode = Low)
@er6y
Author

er6y commented Mar 16, 2026

Thanks for the review! To clarify:

My modification only affects BufferPool::releaseFreeList() (BufferPool.cpp, lines 61-76, the static pool), NOT BufferExecutionPool::releaseFreeList() (the DYNAMIC_IN_EXECUTION pool).

What I changed:
- BufferPool (lines 62-77): keep buffers larger than 1GB in the free list after releaseFreeList() is called
- BufferExecutionPool (lines 135-143): no changes

The >1GB threshold targets large attention score tensors (6.78GB in our case). These are temporary intermediate tensors that are allocated and freed within a single inference pass, not persistent buffers shared across layers.
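A minimal sketch of this change, assuming a simplified pool structure (not MNN's real BufferPool): releaseFreeList() keeps free-list entries above the 1GB threshold cached instead of releasing them, so the huge attention buffers are not repeatedly freed and reallocated across diffusion blocks.

```cpp
#include <map>
#include <memory>
#include <utility>

// Placeholder for the node type that wraps a cl::Buffer in the real pool.
struct Node {};

// Sketch: drop small free-list entries, but keep entries > 1GB cached.
void releaseFreeList(std::multimap<size_t, std::shared_ptr<Node>>& freeList) {
    const size_t kKeepThreshold = 1024ull * 1024 * 1024;  // 1 GB
    std::multimap<size_t, std::shared_ptr<Node>> keepList;
    for (auto& kv : freeList) {
        if (kv.first > kKeepThreshold) {
            keepList.insert(kv);  // keep large buffers for reuse
        }
        // smaller buffers are released along with the old free list
    }
    freeList = std::move(keepList);
}
```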

The actual fix for layer reuse is in Pipeline.cpp (lines 1012-1049), where we force-release large tensors with useCount <= 1 before allocating new large outputs. This ensures the previous block's attention score buffer is recycled before the next one is allocated, preventing two 6.78GB tensors from coexisting.
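The Pipeline change described here can be sketched conceptually (all names hypothetical, not the actual Pipeline.cpp code): before a new large allocation, any large tensor whose use count has dropped to one or below is released back to the pool.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor's allocation record.
struct TensorSlot {
    size_t bytes;
    int useCount;
    bool released = false;
};

// Sketch: before allocating `newAllocBytes`, force-release large tensors
// that are no longer referenced (useCount <= 1), so two multi-GB
// attention buffers never coexist in the pool.
void forceReleaseLargeTensors(std::vector<TensorSlot>& slots,
                              size_t newAllocBytes, size_t largeThreshold) {
    if (newAllocBytes < largeThreshold) return;
    for (auto& s : slots) {
        if (!s.released && s.bytes >= largeThreshold && s.useCount <= 1) {
            s.released = true;  // return the buffer to the pool first
        }
    }
}
```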

The BufferPool.cpp change is just an optimization to avoid repeated alloc/free of 6.78GB buffers across multiple diffusion blocks.

Does this make sense?
