docs/source/en/api/pipelines/cogvideox.md
+31 -49
@@ -26,19 +26,32 @@
 You can find all the original CogVideoX checkpoints under the [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) collection.

 > [!TIP]
-> Click on the CogVideoX models in the right sidebar for more examples of how to use CogVideoX for other video generation tasks.
+> Click on the CogVideoX models in the right sidebar for more examples of other video generation tasks.

 The example below demonstrates how to generate a video optimized for memory or inference speed.

 <hfoptions id="usage">
 <hfoption id="memory">

+Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.
+
+The quantized CogVideoX 5B model below requires ~16GB of VRAM.
+
 ```py
 import torch
 from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
 from diffusers.hooks import apply_group_offloading
-prompt = ("A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. "
-          "The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. "
-          "Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, "
-          "with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.")
+prompt = """
+A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea.
+The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse.
+Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood,
+with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
+"""
+
 video = pipeline(
     prompt=prompt,
     guidance_scale=6,
@@ -72,45 +88,6 @@ video = pipeline(
 export_to_video(video, "output.mp4", fps=8)
 ```

-Reduce memory usage even more if necessary by quantizing a model to a lower precision data type.
-
-```py
-import torch
-from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel, TorchAoConfig
-prompt = ("A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. "
-          "The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. "
-          "Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, "
-          "with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.")
-video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
-export_to_video(video, "output.mp4", fps=8)
-```
-
 </hfoption>
 <hfoption id="inference speed">

@@ -119,7 +96,6 @@ Compilation is slow the first time but subsequent calls to the pipeline are fast
 ```py
 import torch
 from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
-from diffusers.hooks import apply_group_offloading
-prompt = ("A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. "
-          "The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. "
-          "Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, "
-          "with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.")
+prompt = """
+A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea.
+The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse.
+Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood,
+with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
 - The text-to-video (T2V) checkpoints work best with a resolution of 1360x768 because that is the resolution they were pretrained on.
+
 - The image-to-video (I2V) checkpoints work with multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. Both height and width must be divisible by 16.
+
 - Both T2V and I2V checkpoints work best with 81 and 161 frames. It is recommended to export the generated video at 16fps.
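
The resolution and frame-count notes above can be checked before starting a long generation run. The sketch below is a hypothetical helper: `check_i2v_settings`, `clip_seconds`, and the constants are illustrative names, not part of diffusers. It assumes the required I2V height is 768 (the value consistent with the divisible-by-16 rule) and treats 81 and 161 frames as recommendations rather than hard limits.

```python
# Hypothetical pre-flight check for the I2V notes above; not a diffusers API.
I2V_HEIGHT = 768               # assumed required height (divisible by 16)
I2V_WIDTH_RANGE = (768, 1360)  # inclusive width range from the notes
RECOMMENDED_FRAMES = (81, 161)
RECOMMENDED_FPS = 16


def check_i2v_settings(width, height, num_frames):
    """Return a list of problems; an empty list means the settings match the notes."""
    problems = []
    if height != I2V_HEIGHT:
        problems.append(f"height must be {I2V_HEIGHT}, got {height}")
    low, high = I2V_WIDTH_RANGE
    if not low <= width <= high:
        problems.append(f"width must be between {low} and {high}, got {width}")
    if width % 16 or height % 16:
        problems.append("height and width must both be divisible by 16")
    if num_frames not in RECOMMENDED_FRAMES:
        problems.append(f"recommended frame counts are {RECOMMENDED_FRAMES}, got {num_frames}")
    return problems


def clip_seconds(num_frames, fps=RECOMMENDED_FPS):
    """Duration in seconds of a clip exported at the given frame rate."""
    return num_frames / fps
```

For example, `check_i2v_settings(1360, 768, 81)` returns an empty list, and `clip_seconds(81)` gives 5.0625, so an 81-frame clip exported at 16fps runs just over five seconds.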
docs/source/en/api/pipelines/hunyuan_video.md
+12 -7
@@ -20,7 +20,7 @@

 # HunyuanVideo

-[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.
+[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B parameter diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.

 You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

@@ -32,12 +32,16 @@ The example below demonstrates how to generate a video optimized for memory or i
 <hfoptions id="usage">
 <hfoption id="memory">

+Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.
+
+The quantized HunyuanVideo model below requires ~14GB of VRAM.
+
 ```py
 import torch
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
docs/source/en/api/pipelines/ltx_video.md
+22 -57
@@ -20,18 +20,22 @@

 # LTX-Video

-[LTX-Video](https://huggingface.co/Lightricks/LTX-Video) is a diffusion transformer designed for fast and real-time generation of high-resolution videos from text and images. The main feature of LTX-Video is the Video-VAE. The Video-VAE has a higher pixel to latent compression ratio (1:192) which enables more efficient video data processing and faster generation speed. To support and prevent the finer details from being lost during generation, the Video-VAE decoder performs the latent to pixel conversion *and* the last denoising step.
+[LTX-Video](https://huggingface.co/Lightricks/LTX-Video) is a diffusion transformer designed for fast and real-time generation of high-resolution videos from text and images. The main feature of LTX-Video is the Video-VAE. The Video-VAE has a higher pixel to latent compression ratio (1:192) which enables more efficient video data processing and faster generation speed. To support and prevent finer details from being lost during generation, the Video-VAE decoder performs the latent to pixel conversion *and* the last denoising step.

 You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.

 > [!TIP]
-> Click on the LTX-Video models in the right sidebar for more examples of how to use LTX-Video for other video generation tasks.
+> Click on the LTX-Video models in the right sidebar for more examples of other video generation tasks.

 The example below demonstrates how to generate a video optimized for memory or inference speed.

 <hfoptions id="usage">
 <hfoption id="memory">

+Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.
+
+The LTX-Video model below requires ~10GB of VRAM.
+
 ```py
 import torch
 from diffusers import LTXPipeline, LTXVideoTransformer3DModel
-prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
+prompt = """
+A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage
-prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
-prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
+prompt = """
+A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage
-- LTX-Video supports loading from single files, such as [GGUF checkpoints](../../quantization/gguf), with [`FromOriginalModelMixin.from_single_file`] or [`FromSingleFileMixin.from_single_file`].
+
+- LTX-Video supports loading from single files, such as [GGUF checkpoints](../../quantization/gguf), with [`loaders.FromOriginalModelMixin.from_single_file`] or [`loaders.FromSingleFileMixin.from_single_file`].