docs/source/en/api/pipelines/cogvideox.md (+1 -1)

@@ -23,7 +23,7 @@
 [CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.

-You can find all the original CogVideoX checkpoints under the CogVideoX [collection](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
+You can find all the original CogVideoX checkpoints under the [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) collection.

 > [!TIP]
 > Click on the CogVideoX models in the right sidebar for more examples of how to use CogVideoX for other video generation tasks.
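For context alongside this doc change, a minimal text-to-video sketch with `CogVideoXPipeline`; the checkpoint id, prompt, and generation settings are illustrative assumptions rather than values taken from the diff.

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Illustrative checkpoint; the 2B variant ("THUDM/CogVideoX-2b") also works.
pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipeline.enable_model_cpu_offload()  # trade some speed for lower VRAM usage

prompt = "A panda playing an acoustic guitar in a sunlit bamboo forest"
video = pipeline(prompt=prompt, num_frames=49, num_inference_steps=50, guidance_scale=6.0).frames[0]
export_to_video(video, "cogvideox.mp4", fps=8)
```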
docs/source/en/api/pipelines/hunyuan_video.md (+3 -1)

@@ -22,7 +22,7 @@
 [HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.

-You can find all the original HunyuanVideo checkpoints under the Tencent [organization](https://huggingface.co/tencent).
+You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

 > [!TIP]
 > The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.
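For reference, a minimal sketch of running `HunyuanVideoPipeline` with the community checkpoint mentioned in the tip; the dtype split, frame count, and memory-saving calls are illustrative assumptions rather than content from the diff.

```py
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo"
# Assumed dtype split: transformer in bfloat16, the rest of the pipeline in float16.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipeline = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16)
pipeline.vae.enable_tiling()         # decode frames in tiles to reduce peak memory
pipeline.enable_model_cpu_offload()  # offload idle components to CPU

video = pipeline(
    prompt="A cat walks on the grass, realistic style",
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "hunyuan_video.mp4", fps=15)
```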
docs/source/en/api/pipelines/ltx_video.md

-[LTX Video](https://huggingface.co/Lightricks/LTX-Video) is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide a model for both text-to-video as well as image + text-to-video use cases.
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
-
-</Tip>
+# LTX-Video
+
+[LTX-Video](https://huggingface.co/Lightricks/LTX-Video) is a diffusion transformer designed for fast and real-time generation of high-resolution videos from text and images. The main feature of LTX-Video is the Video-VAE. The Video-VAE has a higher pixel to latent compression ratio (1:192), which enables more efficient video data processing and faster generation speed. To support and prevent the finer details from being lost during generation, the Video-VAE decoder performs the latent to pixel conversion *and* the last denoising step.

-Available models:
-
-| Model name | Recommended dtype |
-|:-------------:|:-----------------:|
-| [`LTX Video 0.9.0`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.safetensors) | `torch.bfloat16` |
-| [`LTX Video 0.9.1`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) | `torch.bfloat16` |
-
-Note: The recommended dtype is for the transformer component. The VAE and text encoders can be either `torch.float32`, `torch.bfloat16`, or `torch.float16`, but the recommended dtype is `torch.bfloat16` as used in the original repository.
+You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.
+
+> [!TIP]
+> Click on the LTX-Video models in the right sidebar for more examples of how to use LTX-Video for other video generation tasks.
+
+The example below demonstrates how to generate a video optimized for memory or inference speed.

-## Loading Single Files
-
-Loading the original LTX Video checkpoints is also possible with [`~ModelMixin.from_single_file`]. We recommend using `from_single_file` for the Lightricks series of models, as they plan to release multiple models in the future in the single file format.
-
-```python
-import torch
-from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel
-
-# `single_file_url` could also be https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.1.safetensors
...
+<hfoptions id="usage">
+<hfoption id="memory">
+
+```py
+import torch
+from diffusers import LTXPipeline, LTXVideoTransformer3DModel
+from diffusers.hooks import apply_group_offloading
+from diffusers.utils import export_to_video
+from transformers import T5EncoderModel, T5Tokenizer
...
+prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
...
+```
+
+Refer to [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) to learn more about optimizing memory consumption.
+
+</hfoption>
+</hfoptions>
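The memory-focused example added above is truncated in this diff. As a rough stand-in, here is a minimal sketch of a lower-memory LTX-Video generation; it uses model CPU offload and VAE tiling rather than the group offloading the full example appears to set up, and the resolution and frame settings are illustrative.

```py
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipeline = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)

# Simple memory savers (assumed here; the doc's own example imports apply_group_offloading instead).
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
video = pipeline(
    prompt=prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```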
-## Quantization
-
-Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
-
-Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LTXPipeline`] for inference with bitsandbytes.
-
-```py
-import torch
-from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LTXVideoTransformer3DModel, LTXPipeline
-from diffusers.utils import export_to_video
-from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
...
-pipeline = LTXPipeline.from_pretrained(
-    "Lightricks/LTX-Video",
-    text_encoder=text_encoder_8bit,
-    transformer=transformer_8bit,
-    torch_dtype=torch.float16,
-    device_map="balanced",
-)
-
-prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
-video = pipeline(prompt=prompt, num_frames=161, num_inference_steps=50).frames[0]
-export_to_video(video, "ship.mp4", fps=24)
-```

+## Notes
+
+- LTX-Video supports LoRAs with [`~LTXVideoLoraLoaderMixin.load_lora_weights`].
+
+```py
+import torch
+from diffusers import LTXConditionPipeline
+from diffusers.utils import export_to_video
...
+prompt = "CAKEIFY a person using a knife to cut a cake shaped like a pair of cowboy boots"
+video = pipeline(
+    prompt=prompt,
+    width=768,
+    height=512,
+    num_frames=161,
+    decode_timestep=0.03,
+    decode_noise_scale=0.025,
+    num_inference_steps=50,
+).frames[0]
+export_to_video(video, "output.mp4", fps=24)
+```
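The LoRA example added above is truncated in this diff; the pipeline construction and the actual `load_lora_weights` call are elided. As a rough sketch of the missing step, assuming a hypothetical LoRA repository id and a 0.9.5 base checkpoint:

```py
import torch
from diffusers import LTXConditionPipeline
from diffusers.utils import export_to_video

# The base checkpoint and LoRA id below are assumptions for illustration, not values from the diff.
pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.5", torch_dtype=torch.bfloat16)
pipeline.load_lora_weights("your-org/ltxv-cakeify-lora", adapter_name="cakeify")
pipeline.to("cuda")

prompt = "CAKEIFY a person using a knife to cut a cake shaped like a pair of cowboy boots"
video = pipeline(prompt=prompt, width=768, height=512, num_frames=161, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=24)
```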

+- LTX-Video supports loading from single files, such as [GGUF checkpoints](../../quantization/gguf), with [`FromOriginalModelMixin.from_single_file`] or [`FromSingleFileMixin.from_single_file`].
+
+```py
+import torch
+from diffusers.utils import export_to_video
+from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
...
+```
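The GGUF example added above is also cut off in this diff. Here is a minimal sketch of what single-file GGUF loading looks like with these imports; the checkpoint URL is an assumption and should point at an actual LTX-Video GGUF file.

```py
import torch
from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
from diffusers.utils import export_to_video

# Assumed GGUF checkpoint URL, used only for illustration.
ckpt_url = "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q8_0.gguf"

# Load the quantized transformer from the single GGUF file.
transformer = LTXVideoTransformer3DModel.from_single_file(
    ckpt_url,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipeline = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", transformer=transformer, torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

video = pipeline(
    prompt="A wooden toy ship gliding over a plush blue carpet",
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```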