
Commit 22ab919

hunyuanvideo

1 parent 7fb3f5d · commit 22ab919

2 files changed: +110 -40 lines changed

docs/source/en/api/pipelines/cogvideox.md (+3 -2)
@@ -23,12 +23,12 @@
 
 [CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.
 
-You can find all the original CogVideoX checkpoints under the [CogVideoX collection](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
+You can find all the original CogVideoX checkpoints under the CogVideoX [collection](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
 
 > [!TIP]
 > Click on the CogVideoX models in the right sidebar for more examples of how to use CogVideoX for other video generation tasks.
 
-The example below demonstrates how to generate a video with CogVideoX, optimized for memory or inference speed.
+The example below demonstrates how to generate a video optimized for memory or inference speed.
 
 <hfoptions id="usage">
 <hfoption id="memory">
@@ -164,6 +164,7 @@ export_to_video(video, "output.mp4", fps=8)
 )
 pipeline.to("cuda")
 
+# load LoRA weights
 pipeline.load_lora_weights("finetrainers/CogVideoX-1.5-crush-smol-v0", adapter_name="crush-lora")
 pipeline.set_adapters("crush-lora", 0.9)
 
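The CogVideoX page above points to text-to-video examples that sit outside these hunks. A minimal sketch of such a call (the `THUDM/CogVideoX-2b` checkpoint, prompt, and sampling settings are illustrative assumptions, not taken from this commit):

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# load the 2B text-to-video checkpoint in half precision
pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()  # offload idle components to reduce VRAM usage
pipeline.vae.enable_tiling()

prompt = "A panda playing a guitar by a campfire in a bamboo forest."
# num_frames follows the 4 * k + 1 rule (here 49 frames)
video = pipeline(prompt=prompt, num_frames=49, num_inference_steps=50, guidance_scale=6.0).frames[0]
export_to_video(video, "output.mp4", fps=8)
```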

docs/source/en/api/pipelines/hunyuan_video.md (+107 -38)
@@ -12,59 +12,66 @@
 # See the License for the specific language governing permissions and
 # limitations under the License. -->
 
-# HunyuanVideo
-
-<div class="flex flex-wrap space-x-1">
-  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+<div style="float: right;">
+  <div class="flex flex-wrap space-x-1">
+    <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+  </div>
 </div>
 
-[HunyuanVideo](https://www.arxiv.org/abs/2412.03603) by Tencent.
-
-*Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at [this https URL](https://github.com/tencent/HunyuanVideo).*
-
-<Tip>
+# HunyuanVideo
 
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.
 
-</Tip>
+You can find all the original HunyuanVideo checkpoints under the Tencent [organization](https://huggingface.co/tencent).
 
-Recommendations for inference:
-- Both text encoders should be in `torch.float16`.
-- Transformer should be in `torch.bfloat16`.
-- VAE should be in `torch.float16`.
-- `num_frames` should be of the form `4 * k + 1`, for example `49` or `129`.
-- For smaller resolution videos, try lower values of `shift` (between `2.0` to `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution images, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo.
-- For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/Tencent/HunyuanVideo/).
+> [!TIP]
+> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.
 
-## Available models
+The example below demonstrates how to generate a video optimized for memory or inference speed.
 
-The following models are available for the [`HunyuanVideoPipeline`](text-to-video) pipeline:
+<hfoptions id="usage">
+<hfoption id="memory">
 
-| Model name | Description |
-|:---|:---|
-| [`hunyuanvideo-community/HunyuanVideo`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo) | Official HunyuanVideo (guidance-distilled). Performs best at multiple resolutions and frames. Performs best with `guidance_scale=6.0`, `true_cfg_scale=1.0` and without a negative prompt. |
-| [`https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-T2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-T2V) | Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best with `97x544x960` resolution, `guidance_scale=1.0`, `true_cfg_scale=6.0` and a negative prompt. |
+```py
+import torch
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
+from diffusers.utils import export_to_video
 
-The following models are available for the image-to-video pipeline:
+# quantization
+quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
+transformer = HunyuanVideoTransformer3DModel.from_pretrained(
+    "hunyuanvideo-community/HunyuanVideo",
+    subfolder="transformer",
+    quantization_config=quant_config,
+    torch_dtype=torch.bfloat16,
+)
 
-| Model name | Description |
-|:---|:---|
-| [`Skywork/SkyReels-V1-Hunyuan-I2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-I2V) | Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best with `97x544x960` resolution. Performs best at `97x544x960` resolution, `guidance_scale=1.0`, `true_cfg_scale=6.0` and a negative prompt. |
-| [`hunyuanvideo-community/HunyuanVideo-I2V`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V) | Tecent's official HunyuanVideo I2V model. Performs best at resolutions of 480, 720, 960, 1280. A higher `shift` value when initializing the scheduler is recommended (good values are between 7 and 20) |
+pipeline = HunyuanVideoPipeline.from_pretrained(
+    "hunyuanvideo-community/HunyuanVideo",
+    transformer=transformer,
+    torch_dtype=torch.float16,
+)
 
-## Quantization
+# model-offloading
+pipeline.enable_model_cpu_offload()
+pipeline.vae.enable_tiling()
 
-Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
+prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
+video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
+export_to_video(video, "output.mp4", fps=15)
+```
 
-Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`HunyuanVideoPipeline`] for inference with bitsandbytes.
+</hfoptions>
+<hfoption id="inference speed">
 
 ```py
 import torch
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
 from diffusers.utils import export_to_video
 
-quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
-transformer_8bit = HunyuanVideoTransformer3DModel.from_pretrained(
+# quantization
+quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
+transformer = HunyuanVideoTransformer3DModel.from_pretrained(
     "hunyuanvideo-community/HunyuanVideo",
     subfolder="transformer",
     quantization_config=quant_config,
@@ -73,16 +80,78 @@ transformer_8bit = HunyuanVideoTransformer3DModel.from_pretrained(
 
 pipeline = HunyuanVideoPipeline.from_pretrained(
     "hunyuanvideo-community/HunyuanVideo",
-    transformer=transformer_8bit,
+    transformer=transformer,
     torch_dtype=torch.float16,
-    device_map="balanced",
 )
 
-prompt = "A cat walks on the grass, realistic style."
+# model-offloading
+pipeline.enable_model_cpu_offload()
+pipeline.vae.enable_tiling()
+
+# torch.compile
+pipeline.transformer.to(memory_format=torch.channels_last)
+pipeline.transformer = torch.compile(
+    pipeline.transformer, mode="max-autotune", fullgraph=True
+)
+
+prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
 video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
-export_to_video(video, "cat.mp4", fps=15)
+export_to_video(video, "output.mp4", fps=15)
 ```
 
+</hfoption>
+</hfoptions>
+
+## Notes
+
+- HunyuanVideo supports LoRAs with [`~loaders.HunyuanVideoLoraLoaderMixin.load_lora_weights`].
+
+  ```py
+  import torch
+  from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
+  from diffusers.utils import export_to_video
+
+  # quantize weights to int4 with bitsandbytes
+  quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
+  transformer = HunyuanVideoTransformer3DModel.from_pretrained(
+      "hunyuanvideo-community/HunyuanVideo",
+      subfolder="transformer",
+      quantization_config=quant_config,
+      torch_dtype=torch.bfloat16,
+  )
+
+  pipeline = HunyuanVideoPipeline.from_pretrained(
+      "hunyuanvideo-community/HunyuanVideo",
+      transformer=transformer,
+      torch_dtype=torch.float16,
+  )
+
+  # load LoRA weights
+  pipeline.load_lora_weights("https://huggingface.co/lucataco/hunyuan-steamboat-willie-10", adapter_name="steamboat-willie")
+  pipeline.set_adapters("steamboat-willie", 0.9)
+
+  # model-offloading
+  pipeline.enable_model_cpu_offload()
+  pipeline.vae.enable_tiling()
+
+  prompt = """
+  In the style of SWR. A black and white animated scene featuring a fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys.
+  """
+  video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
+  export_to_video(video, "output.mp4", fps=15)
+  ```
+
+- Refer to the table below for recommended inference values.
+
+  | parameter | recommended value |
+  |---|---|
+  | text encoder dtype | `torch.float16` |
+  | transformer dtype | `torch.bfloat16` |
+  | vae dtype | `torch.float16` |
+  | `num_frames` | 4 * k + 1 |
+
+- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos, and try higher `shift` values (`7.0` to `12.0`) for higher resolution images.
+
 ## HunyuanVideoPipeline
 
 [[autodoc]] HunyuanVideoPipeline
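The last two notes added to the HunyuanVideo page recommend per-component dtypes and a resolution-dependent `shift` for the linked `FlowMatchEulerDiscreteScheduler`. A minimal sketch of applying both (the `shift=3.0` value is an illustrative choice for a lower-resolution run, not taken from this commit):

```py
import torch
from diffusers import FlowMatchEulerDiscreteScheduler, HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

# transformer in bfloat16; the text encoders and VAE follow the pipeline's float16 dtype
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()

# lower shift (2.0-5.0) suits lower resolution videos, higher shift (7.0-12.0) suits higher resolutions
pipeline.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipeline.scheduler.config, shift=3.0
)
```

The generation call itself then matches the examples in the diff above.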

0 commit comments