docs/source/en/api/pipelines/cogvideox.md (+3, -2)

@@ -23,12 +23,12 @@
[CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.
-You can find all the original CogVideoX checkpoints under the [CogVideoX collection](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
+You can find all the original CogVideoX checkpoints under the CogVideoX [collection](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
> [!TIP]
> Click on the CogVideoX models in the right sidebar for more examples of how to use CogVideoX for other video generation tasks.
-The example below demonstrates how to generate a video with CogVideoX, optimized for memory or inference speed.
+The example below demonstrates how to generate a video optimized for memory or inference speed.
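The memory- and speed-optimized example the sentence above refers to is not captured in this hunk. For reference, here is a minimal text-to-video sketch with [`CogVideoXPipeline`]; the `THUDM/CogVideoX-2b` checkpoint, prompt, and generation settings are illustrative assumptions, not part of the diff.

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# load the 2B text-to-video checkpoint (the 5B variant loads the same way)
pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()  # offload idle components to CPU to save VRAM

prompt = "A panda playing a guitar in a bamboo forest."
video = pipeline(
    prompt=prompt,
    num_frames=49,            # 49 frames is roughly 6 seconds at 8 fps
    guidance_scale=6.0,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```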
docs/source/en/api/pipelines/hunyuan_video.md

-[HunyuanVideo](https://www.arxiv.org/abs/2412.03603) by Tencent.
-
-*Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at [this https URL](https://github.com/tencent/HunyuanVideo).*
-
-<Tip>
+# HunyuanVideo
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the text encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.
-</Tip>
+You can find all the original HunyuanVideo checkpoints under the Tencent [organization](https://huggingface.co/tencent).
-Recommendations for inference:
-- Both text encoders should be in `torch.float16`.
-- Transformer should be in `torch.bfloat16`.
-- VAE should be in `torch.float16`.
-- `num_frames` should be of the form `4 * k + 1`, for example `49` or `129`.
-- For smaller resolution videos, try lower values of `shift` (between `2.0` and `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution videos, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo.
-- For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/Tencent/HunyuanVideo/).
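A minimal sketch of how the removed recommendations above translate into code; the checkpoint, prompt, and the specific `shift=7.0` value are illustrative assumptions rather than part of this diff:

```py
import torch
from diffusers import FlowMatchEulerDiscreteScheduler, HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

# transformer in bfloat16; text encoders and VAE in float16, per the list above
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

# lower `shift` (2.0-5.0) for smaller resolutions, higher (7.0-12.0) for larger ones
pipeline.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipeline.scheduler.config, shift=7.0)

# `num_frames` should be of the form 4 * k + 1, e.g. 61
video = pipeline(prompt="A cat walks on the grass, realistic style.", num_frames=61, num_inference_steps=30).frames[0]
```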
+> [!TIP]
+> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.
-## Available models
+The example below demonstrates how to generate a video optimized for memory or inference speed.
-The following models are available for the [`HunyuanVideoPipeline`](text-to-video) pipeline:
+<hfoptions id="usage">
+<hfoption id="memory">
-| Model name | Description |
-|:---|:---|
-|[`hunyuanvideo-community/HunyuanVideo`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo)| Official HunyuanVideo (guidance-distilled). Performs best at multiple resolutions and frames. Performs best with `guidance_scale=6.0`, `true_cfg_scale=1.0` and without a negative prompt. |
-|[`Skywork/SkyReels-V1-Hunyuan-T2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-T2V)| Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best with `97x544x960` resolution, `guidance_scale=1.0`, `true_cfg_scale=6.0` and a negative prompt. |
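As a sketch of how the guidance settings in the removed table rows map onto a pipeline call; the prompt and the use of `true_cfg_scale` as a call argument of [`HunyuanVideoPipeline`] are assumptions based on the table text, not confirmed by this diff:

```py
import torch
from diffusers import HunyuanVideoPipeline

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."

# guidance-distilled checkpoint: embedded guidance only, no negative prompt
video = pipeline(prompt=prompt, guidance_scale=6.0, true_cfg_scale=1.0).frames[0]

# a de-distilled finetune such as SkyReels-V1 would instead use classic CFG:
# guidance_scale=1.0, true_cfg_scale=6.0 and a negative prompt
```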
+```py
+import torch
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
+from diffusers.utils import export_to_video
-The following models are available for the image-to-video pipeline:
-| Model name | Description |
-|:---|:---|
-|[`Skywork/SkyReels-V1-Hunyuan-I2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-I2V)| Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best at `97x544x960` resolution, `guidance_scale=1.0`, `true_cfg_scale=6.0` and a negative prompt. |
-|[`hunyuanvideo-community/HunyuanVideo-I2V`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V)| Tencent's official HunyuanVideo I2V model. Performs best at resolutions of 480, 720, 960, 1280. A higher `shift` value when initializing the scheduler is recommended (good values are between 7 and 20). |
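+# NOTE: the added lines that define `transformer` are not captured in this hunk;
+# the 4-bit bitsandbytes settings below are a hedged reconstruction, not the PR's exact content
+quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
+transformer = HunyuanVideoTransformer3DModel.from_pretrained(
+    "hunyuanvideo-community/HunyuanVideo",
+    subfolder="transformer",
+    quantization_config=quant_config,
+    torch_dtype=torch.bfloat16,
+)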
+pipeline = HunyuanVideoPipeline.from_pretrained(
+    "hunyuanvideo-community/HunyuanVideo",
+    transformer=transformer,
+    torch_dtype=torch.float16,
+)
-## Quantization
+# model-offloading
+pipeline.enable_model_cpu_offload()
+pipeline.vae.enable_tiling()
-Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
+prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
+video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
+export_to_video(video, "output.mp4", fps=15)
+```
-Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`HunyuanVideoPipeline`] for inference with bitsandbytes.
+</hfoption>
+<hfoption id="inference speed">
```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline