feat: Add ZImage/LongCat/Sana diffusion support + LLM VL improvements #4165

er6y wants to merge 2 commits into alibaba:master from
Conversation
cc3032c to d1527db
```diff
@@ -51,13 +60,20 @@ void BufferPool::clear() {
 }

+void BufferPool::releaseFreeList() {
+    std::multimap<size_t, std::shared_ptr<OpenCLBufferNode>> keepList;
```
Using this logic in an OpenCL dynamic memory pool might result in the release of memory reused by preceding layers, leading to a crash.
Do you mean to release the memory previously allocated in the DYNAMIC_IN_EXECUTION memory pool used by the Attention operator?
… + OpenCL fixes

Diffusion Engine:
- Add ZImageDiffusion subclass: FlowMatch Euler scheduler, PhiloxRNG noise, CLIP text encoder
- Add LongCatDiffusion subclass: LLM text encoder (lazy load), Flux-like latent packing, VAE enc/dec, T2I + Image Edit modes
- Integrate SanaDiffusion with SanaLlm into unified API; fix Euler sampling (dt/1000 -> dt) and CFG order
- Add DiffusionConfig/LLMEncoderConfig, GPU config (BUFFER mode, FP32, Memory_Low, OpenCL cache)
- Add DiffusionGpuMemoryMode/PrecisionMode/CFGMode enums
- Extend createDiffusion factory with full parameters for all model types
- Unify diffusion_demo to support SD1.5/Taiyi/Sana/ZImage/LongCat with cfg_scale and input_image args
- Add image processing utilities (resize, crop, colorspace, pack/unpack latents)
- Implement FlowMatchEuler scheduler

LLM / Vision:
- omni.cpp: Qwen3-VL vision fixes (floor rounding, nullptr check)
- omni.hpp: mrope position ids fix max(T,H,W)+1
- tokenizer.hpp: public wrapper header, MNN_PUBLIC export
- llm_demo: LLM_DEMO_ONELINE mode
- pymnn llm.h: forward_all() binding
- Fix Qwen2_5Vision transformer_fuse for window attention compatibility

OpenCL fixes:
- BinaryBufExecution: localWorkSize divisibility check
- binary_buf.cl: float4/int4 init, per-element ReLU (NVIDIA compiler bug workaround)

Other fixes:
- ShapeSliceTf/ShapeWhere: shape calculation rewrite
- OnnxEinsum: outer product broadcast _Unsqueeze fix
- Pipeline: NaN/Inf debug check macro (disabled by default)
- Fix LongCat unpackLatentsGPU to match master implementation
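The Euler sampling fix noted above (dt/1000 -> dt) amounts to using the raw sigma difference as the step size. A minimal sketch of a FlowMatch Euler update, assuming sigmas are already normalized to [0, 1] (names are illustrative, not MNN's actual scheduler API):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of one FlowMatch Euler step: x' = x + (sigma_next - sigma) * v,
// where v is the model's predicted velocity. The fix described in the
// commit replaces a step of dt/1000 with dt: since the sigmas are already
// normalized, no extra division by the 0..1000 timestep scale is needed.
std::vector<float> eulerStep(const std::vector<float>& sample,
                             const std::vector<float>& velocity,
                             float sigma, float sigmaNext) {
    const float dt = sigmaNext - sigma;  // correct: dt, not dt / 1000
    std::vector<float> out(sample.size());
    for (size_t i = 0; i < sample.size(); ++i) {
        out[i] = sample[i] + dt * velocity[i];
    }
    return out;
}
```

With the erroneous dt/1000, each denoising step moves the latent only 0.1% of the intended distance, which is consistent with the sampler producing near-pure noise before the fix.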
…cation

- Add Flux2Klein (FLUX.2-Klein-4B) diffusion model support
- Fix VAE-on-CPU flag (was ignored; now creates a separate CPU runtime)
- Fix OpenCL OOM with large attention tensors in 1024px edit mode
- Quality fixes: ZImage VAE scaling, text encoder padding, dynamic time_shift
- Refactor: extract shared code to base class (~180 lines removed)
- LLM/VL: Qwen3-VL vision fixes, mrope position ids fix
- Optimize memory usage and lazy loading (Memory mode = Low)
Thanks for the review! To clarify: my modification only affects BufferPool::releaseFreeList().

What I changed: the >1GB threshold targets large attention score tensors (6.78GB in our case). These are temporary intermediate tensors that are allocated and freed within a single inference pass, not persistent buffers shared across layers. The actual fix for layer reuse is in Pipeline.cpp; the BufferPool.cpp change is just an optimization to avoid repeatedly allocating and freeing 6.78GB buffers across multiple diffusion blocks. Does this make sense?
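The optimization described in this reply can be sketched as follows. This is a simplified illustration under assumed names (BufferNode, FreeListSketch, and the methods are hypothetical, not MNN's actual BufferPool interface); it only shows the threshold logic, not the real OpenCL buffer handling:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>

// Stand-in for a pooled device buffer; the real node would own a cl::Buffer.
struct BufferNode { size_t size; };

class FreeListSketch {
public:
    void insert(size_t size) {
        mFreeList.emplace(size, std::make_shared<BufferNode>(BufferNode{size}));
    }
    // Release small free buffers, but keep buffers above the threshold
    // (the ~6.78GB attention-score tensors) so they are reused across
    // diffusion blocks instead of being re-allocated each block.
    void releaseFreeList(size_t keepThreshold = 1024ull * 1024 * 1024) {
        std::multimap<size_t, std::shared_ptr<BufferNode>> keepList;
        for (const auto& kv : mFreeList) {
            if (kv.first > keepThreshold) {
                keepList.insert(kv);  // keep: large buffer worth caching
            }
            // else: dropped here, releasing the device memory
        }
        mFreeList = std::move(keepList);
    }
    size_t count() const { return mFreeList.size(); }
private:
    std::multimap<size_t, std::shared_ptr<BufferNode>> mFreeList;
};
```

Because only entries already on the free list are touched, buffers still in use by preceding layers are unaffected; the safety concern raised in the review hinges on whether anything in-flight can end up on that list.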