Architectural Choice - U-Net vs. DiT in Lotus #42

@wyhlovecpp

Hi @jingheya,

First, fantastic work on Lotus! The results for high-quality dense prediction are very impressive, and the release of the training code is much appreciated.

I noticed that Lotus, like other diffusion-based models for dense prediction (e.g., Marigold, GeoWizard), appears to use a U-Net as its core diffusion backbone, as is common for models built on Stable Diffusion variants.

Given the strong recent performance of DiT (Diffusion Transformer) models such as SD3 and Flux on various generative tasks, I was curious whether the team considered or experimented with transformer-based backbones for Lotus.

Is there a particular reason or set of observations behind the decision to stay with a U-Net for this task? For instance, did experiments show that U-Nets are inherently better suited to high-quality dense prediction, or did other factors drive this architectural choice?

Any insights you could provide would be very valuable for understanding the design philosophy behind Lotus.

Thanks again for sharing this excellent work!


Labels: enhancement
