Hi @jingheya ,
First, fantastic work on Lotus! The results for high-quality dense prediction are very impressive, and the release of the training code is much appreciated.
I noticed that Lotus, like many other diffusion-based models for dense prediction (e.g., Marigold, GeoWizard), primarily leverages a U-Net-based architecture as its core diffusion backbone, as is common with Stable Diffusion variants.
Given the recent advancements and strong performance of DiT (Diffusion Transformer) based models (such as SD3 or Flux) across various generative tasks, I was curious whether the team considered or experimented with these transformer-based architectures for Lotus.
Is there a particular reason or set of observations that led to the decision to stick with a U-Net for this task? For instance, did experiments show that U-Nets are inherently more suitable or perform better for high-quality dense prediction, or were there other factors behind this architectural choice?
Any insights you could provide would be greatly appreciated and would help clarify the design philosophy behind Lotus.
Thanks again for sharing this excellent work!