Hi @jingheya ,
First, fantastic work on Lotus! The results for high-quality dense prediction are very impressive, and the release of the training code is much appreciated.
I noticed that Lotus, like many other diffusion-based models for dense prediction (e.g., Marigold, GeoWizard), primarily leverages a U-Net-based architecture as its core diffusion backbone, as is common with Stable Diffusion variants.
Given the recent advancements and strong performance of DiT (Diffusion Transformer) based models (such as SD3 or Flux) across various generative tasks, I was curious whether the team considered or experimented with these transformer-based architectures for Lotus.
Is there a particular reason or set of observations that led to the decision to stick with a U-Net for this task? For instance, did experiments show that U-Nets are inherently more suitable or perform better for high-quality dense prediction, or were there other factors behind this architectural choice?
Any insights you could provide would be greatly appreciated and would help clarify the design philosophy behind Lotus.
Thanks again for sharing this excellent work!