I2VGenXLPipeline - missing components? #7952

dhaivat1729 · 2024-05-15T18:07:02Z

Hi everyone,

I was playing with I2VGenXLPipeline. Here is the corresponding Hugging Face implementation. I saw some discrepancies between the method described in the paper and this implementation. Can someone help me check whether my understanding is correct?

In the paper, they have the following diagram:

According to this diagram, the base stage has D.Enc. and G.Enc.; however, I only see CLIP in the implementation here.
Similarly, in the implementation, I observe that text embeddings are passed to the LDM of the base stage (this line); however, as per the diagram, text is only passed in the refinement stage.
The refinement stage has an LDM; however, in the implementation, I see that the low-dimensional video latent is passed to the VAE decoder to generate the high-dimensional video, and I do not see any reverse diffusion process.

Can anyone tell me whether my understanding of this code is correct? I want to access the intermediate low-dimensional video that comes out at the end of the base stage, but I don't know exactly how to access it. Can anyone tell me how to get that representation? I would appreciate it.

yiyixuxu · 2024-05-16T02:45:37Z

Maintainer

cc @sayakpaul

0 replies

sayakpaul · 2024-05-16T03:20:50Z

Maintainer

Thanks for bringing this to our attention. We followed the original implementation code and ensured the same outputs could be obtained from the two implementations.

Regarding the extra encoders in the base stage of the diagram: they're very likely used only during training and not during inference (similar to other models such as https://arxiv.org/abs/2306.00637). You can verify this from the original implementation's inference code as well.
Regarding text being passed only in the refinement stage: again, that could very well apply just to training and not to inference.
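One way to check which modules the pipeline actually loads at inference time is to list its registered components. The sketch below assumes the pipeline exposes the dict-like `components` property that diffusers' `DiffusionPipeline` subclasses provide; the helper name `describe_components` is hypothetical, and the checkpoint id in the usage comment is taken from the diffusers docs rather than from this thread.

```python
def describe_components(pipe):
    """Map each registered component name to the name of its class.

    Assumes `pipe.components` is a dict of name -> module, as exposed by
    diffusers' DiffusionPipeline subclasses.
    """
    return {
        name: type(module).__name__
        for name, module in sorted(pipe.components.items())
    }


# Usage sketch (not run here; downloads the model weights):
#   import torch
#   from diffusers import I2VGenXLPipeline
#   pipe = I2VGenXLPipeline.from_pretrained(
#       "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
#   )
#   for name, cls in describe_components(pipe).items():
#       print(f"{name}: {cls}")
```

Any encoder that appears in the paper's figure but not in this listing is not part of the inference path of the released pipeline.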

"In refinement stage, there is LDM, however, in implementation, I see that low dimensional video latent is passed to VAE decoder to generate high dimensional video, I do not see any reverse diffusion process."

I think there's a misunderstanding here. LDM is the entire process:

You start with a low-dimensional noise vector.
You then iteratively refine the noise vector with a denoiser (a UNet in this case) that is conditioned on text, timestep of the noise scheduler, etc.
You pass the refined latent (aka the denoised noise vector) to the decoder of a VAE to obtain the final output.

So, to answer your question, all the above steps are being done in https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/i2vgen_xl/pipeline_i2vgen_xl.py.
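The three steps above can be sketched as a toy, pure-Python sampler. Nothing here comes from the diffusers source: `toy_denoiser`, `toy_decoder`, `toy_ldm_sample`, and the `1 / (timestep + 1)` step size are made up for illustration, standing in for the UNet, the VAE decoder, the pipeline call, and the noise scheduler respectively.

```python
import random


def toy_denoiser(latent, text_embedding, timestep):
    """Stand-in for the UNet: returns a 'noise' estimate for the latent.

    The estimate is just the gap between the latent and the condition, so
    removing it pulls the latent toward the text embedding. `timestep` is
    accepted only to mirror the real signature; this toy ignores it.
    """
    return [z - c for z, c in zip(latent, text_embedding)]


def toy_decoder(latent, upscale=4):
    """Stand-in for the VAE decoder: expands the latent to a larger output."""
    return [z for z in latent for _ in range(upscale)]


def toy_ldm_sample(text_embedding, num_steps=50, seed=0, output_type="decoded"):
    rng = random.Random(seed)

    # 1. Start with a low-dimensional noise vector.
    latent = [rng.gauss(0.0, 1.0) for _ in text_embedding]

    # 2. Iteratively refine it with the conditioned denoiser.
    for timestep in reversed(range(num_steps)):
        predicted_noise = toy_denoiser(latent, text_embedding, timestep)
        step_size = 1.0 / (timestep + 1)  # hypothetical schedule
        latent = [z - step_size * n for z, n in zip(latent, predicted_noise)]

    # 3. Decode the refined latent, unless the caller asked for the raw latent.
    if output_type == "latent":
        return latent
    return toy_decoder(latent)
```

The point of the sketch is the control flow: the reverse diffusion loop runs entirely in latent space, and the decoder is applied once at the end, so seeing a single VAE decode in the pipeline does not mean the denoising loop is missing.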

Also, you can access the intermediate latents at the end of the iterative process by setting output_type="latent" in the pipeline call.
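A minimal sketch of that call follows. It assumes the pipeline compares `output_type` against the string "latent" and returns an output object whose `frames` field then holds the undecoded latents; the helper name `denoised_latents`, the image path, and the prompt are placeholders, and the checkpoint id comes from the diffusers docs rather than from this thread.

```python
def denoised_latents(pipe, image, prompt, **call_kwargs):
    """Run the pipeline but skip VAE decoding, returning the final latents.

    Assumes `pipe` accepts `prompt`, `image`, and `output_type` keyword
    arguments and returns an object with a `frames` attribute.
    """
    output = pipe(prompt=prompt, image=image, output_type="latent", **call_kwargs)
    return output.frames


# Usage sketch (not run here; needs a GPU and the model weights):
#   import torch
#   from diffusers import I2VGenXLPipeline
#   from diffusers.utils import load_image
#
#   pipe = I2VGenXLPipeline.from_pretrained(
#       "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
#   )
#   pipe.enable_model_cpu_offload()
#   image = load_image("path/to/input.png").convert("RGB")  # placeholder path
#   latents = denoised_latents(pipe, image, "a placeholder prompt",
#                              num_inference_steps=50)
#   print(latents.shape)
```

These are the latents at the end of the denoising loop, that is, the tensor that would otherwise be handed to the VAE decoder.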

Hope that makes sense.

1 reply

dhaivat1729 May 16, 2024
Author

Hi @sayakpaul

Thank you for your response.

I will check their original implementation. According to their figure, there appear to be two LDM systems, one in the base stage and one in the refinement stage.
However, as per the figure, the prompt is only passed to the refinement stage. Right now we are passing the prompt along with the noise vector, which was the source of my confusion.

In the current implementation, the denoising process happens only once. In the figure, the output of the base stage is a low-dimensional video, and the prompt is then passed along with this low-dimensional video to the refinement stage. In the current implementation, however, the output of denoising is a low-dimensional video that has already seen the prompt.

I will contact the authors of the paper to see if there is any discrepancy.

Steven-SWZhang · 2024-05-17T07:10:38Z

Hello,
Thank you for your interest in our work on I2VGen-XL. In fact, I2VGen-XL comes in several different versions. We have not released the two-stage model here for two reasons: it does not sufficiently retain the content of the input image (you can observe this carefully in this fig), and it is relatively more complex, with a larger number of parameters, making it less suitable for academic use. Therefore, we have released our single-stage model instead, which can fully preserve the content of the input image and has a simpler overall structure. If you are looking for our second-stage model, we have actually already open-sourced it; please refer here. If you are just interested in obtaining the intermediate low-resolution video, you may need to try the implementation available here on modelscope.

6 replies

dhaivat1729 May 17, 2024
Author

@sayakpaul
I don't know whether this makes sense, but the documentation could explicitly state that the HF implementation is a single-stage variant of I2VGen-XL. Since the main figure in the paper shows two stages, this could be made clearer. Just a suggestion :)

sayakpaul May 17, 2024
Maintainer

Sure, feel free to open a PR and we will get that sorted.

dhaivat1729 May 20, 2024
Author

@sayakpaul sure, I would love to make a PR. Can you tell me which files I should look at to make the changes? I am new to the repository and not fully sure. I am sure the PR itself will not take long, but I want to do it the right way.

One place I can think of is the docstring of the I2VGen class itself. Anywhere else?

sayakpaul May 20, 2024
Maintainer

Thank you!

I think you could make the edit on the page at https://huggingface.co/docs/diffusers/en/api/pipelines/i2vgenxl. Its source is here: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/i2vgenxl.md.

Does that help?

dhaivat1729 May 26, 2024
Author

@sayakpaul yes, it does. Thank you for your response. I have submitted a PR for this. It is my first PR to any HF repo, so please share your suggestions and I will make amendments. I just added one more point to the Notes section of the docs.
PR link: #8282
