
feat: Add ZImage/LongCat/Sana diffusion support + LLM VL improvements… #4165

Open
er6y wants to merge 2 commits into alibaba:master from er6y:master

Conversation


@er6y er6y commented Feb 12, 2026

… + OpenCL fixes

Diffusion Engine:

  • Add ZImageDiffusion subclass: FlowMatch Euler scheduler, PhiloxRNG noise, CLIP text encoder
  • Add LongCatDiffusion subclass: LLM text encoder (lazy load), Flux-like latent packing, VAE enc/dec, T2I + Image Edit modes
  • Integrate SanaDiffusion with SanaLlm into unified API, fix Euler sampling (dt/1000->dt) and CFG order
  • Add DiffusionConfig/LLMEncoderConfig, GPU config (BUFFER mode, FP32, Memory_Low, OpenCL cache)
  • Add DiffusionGpuMemoryMode/PrecisionMode/CFGMode enums
  • Extend createDiffusion factory with full parameters for all model types
  • Unify diffusion_demo to support SD1.5/Taiyi/Sana/ZImage/LongCat with cfg_scale and input_image args
  • Add image processing utilities (resize, crop, colorspace, pack/unpack latents)
  • Implement FlowMatchEuler scheduler

LLM / Vision:

  • omni.cpp: Qwen3-VL vision fixes (floor rounding, nullptr check)
  • omni.hpp: mrope position ids fix max(T,H,W)+1
  • tokenizer.hpp: public wrapper header, MNN_PUBLIC export
  • llm_demo: LLM_DEMO_ONELINE mode
  • pymnn llm.h: forward_all() binding
  • Fix Qwen2_5Vision transformer_fuse for window attention compatibility
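The mrope position-id fix above can be illustrated with a small sketch (a hypothetical helper, not the actual omni.hpp code): after a vision span, the next text token should resume one past the largest position id used across the T/H/W axes, rather than after the full token count:

```cpp
#include <algorithm>

// Illustrative sketch of the mrope rule. A vision span starting at
// `start` emits 3D position ids spanning T, H and W steps on separate
// axes; the largest id used is start + max(T, H, W) - 1, so the next
// text token resumes at start + max(T, H, W) — i.e. "max(T,H,W)+1"
// relative to the last position before the span.
int nextTextPosition(int start, int T, int H, int W) {
    return start + std::max({T, H, W});
}
```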

OpenCL fixes:

  • BinaryBufExecution: localWorkSize divisibility check
  • binary_buf.cl: float4/int4 init, per-element ReLU (NVIDIA compiler bug workaround)
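The localWorkSize divisibility check can be sketched like this (an assumed helper, not the exact BinaryBufExecution code). OpenCL 1.x requires each globalWorkSize dimension to be a multiple of the corresponding localWorkSize, so the global size is rounded up before the kernel is enqueued:

```cpp
#include <array>
#include <cstddef>

// Sketch of a divisibility guard for a 2D NDRange: round each global
// work-size dimension up to the next multiple of the local work-size,
// as required by clEnqueueNDRangeKernel on OpenCL 1.x devices.
std::array<size_t, 2> alignGlobalWorkSize(std::array<size_t, 2> global,
                                          const std::array<size_t, 2>& local) {
    for (int i = 0; i < 2; ++i) {
        size_t l = local[i];
        if (l > 0 && global[i] % l != 0) {
            global[i] = (global[i] / l + 1) * l;  // round up to multiple of l
        }
    }
    return global;
}
```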

Other fixes:

  • ShapeSliceTf/ShapeWhere: shape calculation rewrite
  • OnnxEinsum: outer product broadcast _Unsqueeze fix
  • Pipeline: NaN/Inf debug check macro (disabled by default)
  • Fix LongCat unpackLatentsGPU to match master implementation

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@wangzhaode wangzhaode self-assigned this Feb 12, 2026
@er6y er6y force-pushed the master branch 5 times, most recently from cc3032c to d1527db Compare February 26, 2026 11:09
@@ -51,13 +60,20 @@ void BufferPool::clear() {
}

void BufferPool::releaseFreeList() {
std::multimap<size_t, std::shared_ptr<OpenCLBufferNode>> keepList;
Collaborator

Using this logic in an OpenCL dynamic memory pool might result in the release of memory reused by preceding layers, leading to a crash.
Do you mean to release the memory previously allocated in the DYNAMIC_IN_EXECUTION memory pool used by the Attention operator?

er6y added 2 commits March 15, 2026 09:35
…cation

- Add Flux2Klein (FLUX.2-Klein-4B) diffusion model support
- Fix VAE-on-CPU flag (was ignored; now creates separate CPU runtime)
- Fix OpenCL OOM with large attention tensors in 1024px edit mode
- Quality fixes: ZImage VAE scaling, text encoder padding, dynamic time_shift
- Refactor: extract shared code to base class (~180 lines removed)
- LLM/VL: Qwen3-VL vision fixes, mrope position ids fix
- Optimize memory usage and lazy loading (Memory mode = Low)
@er6y
Author

er6y commented Mar 16, 2026

Thanks for the review! To clarify:

My modification only affects BufferPool::releaseFreeList() (BufferPool.cpp, lines 61-76, the static pool), NOT BufferExecutionPool::releaseFreeList() (the DYNAMIC_IN_EXECUTION pool).

What I changed:
- BufferPool (lines 62-77): keep buffers larger than 1GB in the free list after releaseFreeList() is called
- BufferExecutionPool (lines 135-143): no changes

The >1GB threshold targets large attention score tensors (6.78GB in our case). These are temporary intermediate tensors that are allocated and freed within a single inference pass, not persistent buffers shared across layers.
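A minimal sketch of this change, assuming a simplified pool structure (not MNN's real BufferPool): releaseFreeList() keeps free-list entries above the 1GB threshold cached instead of releasing them, so the huge attention buffers are not repeatedly freed and reallocated across diffusion blocks.

```cpp
#include <map>
#include <memory>
#include <utility>

// Placeholder for the node type that wraps a cl::Buffer in the real pool.
struct Node {};

// Sketch: drop small free-list entries, but keep entries > 1GB cached.
void releaseFreeList(std::multimap<size_t, std::shared_ptr<Node>>& freeList) {
    const size_t kKeepThreshold = 1024ull * 1024 * 1024;  // 1 GB
    std::multimap<size_t, std::shared_ptr<Node>> keepList;
    for (auto& kv : freeList) {
        if (kv.first > kKeepThreshold) {
            keepList.insert(kv);  // keep large buffers for reuse
        }
        // smaller buffers are released along with the old free list
    }
    freeList = std::move(keepList);
}
```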

The actual fix for layer reuse is in Pipeline.cpp (lines 1012-1049), where we force-release large tensors with useCount <= 1 before allocating new large outputs. This ensures the previous block's attention score buffer is recycled before the next one is allocated, preventing two 6.78GB tensors from coexisting.
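The Pipeline change described here can be sketched conceptually (all names hypothetical, not the actual Pipeline.cpp code): before a new large allocation, any large tensor whose use count has dropped to one or below is released back to the pool.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor's allocation record.
struct TensorSlot {
    size_t bytes;
    int useCount;
    bool released = false;
};

// Sketch: before allocating `newAllocBytes`, force-release large tensors
// that are no longer referenced (useCount <= 1), so two multi-GB
// attention buffers never coexist in the pool.
void forceReleaseLargeTensors(std::vector<TensorSlot>& slots,
                              size_t newAllocBytes, size_t largeThreshold) {
    if (newAllocBytes < largeThreshold) return;
    for (auto& s : slots) {
        if (!s.released && s.bytes >= largeThreshold && s.useCount <= 1) {
            s.released = true;  // return the buffer to the pool first
        }
    }
}
```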

The BufferPool.cpp change is just an optimization to avoid repeated alloc/free of 6.78GB buffers across multiple diffusion blocks.

Does this make sense?
