Commit abcdf12

ltx
1 parent 22ab919 commit abcdf12

File tree

4 files changed: +135 -107 lines changed


docs/source/en/api/pipelines/cogvideox.md

+1 -1

@@ -23,7 +23,7 @@
 
 [CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.
 
-You can find all the original CogVideoX checkpoints under the CogVideoX [collection](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
+You can find all the original CogVideoX checkpoints under the [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) collection.
 
 > [!TIP]
 > Click on the CogVideoX models in the right sidebar for more examples of how to use CogVideoX for other video generation tasks.
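For quick reference, a minimal text-to-video sketch with `CogVideoXPipeline` from the collection linked above; it is illustrative only and not part of this commit, and the `THUDM/CogVideoX-2b` checkpoint id, prompt, and frame/fps settings are assumptions:

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# assumed checkpoint id; the 5B variant can be substituted the same way
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A panda strumming a guitar in a sunlit bamboo forest"
# 49 frames at 8 fps is a commonly used setting for the 2B model; adjust as needed
video = pipe(prompt=prompt, num_frames=49, num_inference_steps=50).frames[0]
export_to_video(video, "cogvideox_output.mp4", fps=8)
```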

docs/source/en/api/pipelines/hunyuan_video.md

+3 -1

@@ -22,7 +22,7 @@
 
 [HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.
 
-You can find all the original HunyuanVideo checkpoints under the Tencent [organization](https://huggingface.co/tencent).
+You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.
 
 > [!TIP]
 > The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.

@@ -64,6 +64,8 @@ export_to_video(video, "output.mp4", fps=15)
 </hfoptions>
 <hfoption id="inference speed">
 
+Compilation is slow the first time but subsequent calls to the pipeline are faster.
+
 ```py
 import torch
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
docs/source/en/api/pipelines/ltx_video.md

+125 -101

@@ -12,123 +12,139 @@
 # See the License for the specific language governing permissions and
 # limitations under the License. -->
 
-# LTX Video
-
-<div class="flex flex-wrap space-x-1">
-  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+<div style="float: right;">
+  <div class="flex flex-wrap space-x-1">
+    <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+  </div>
 </div>
 
-[LTX Video](https://huggingface.co/Lightricks/LTX-Video) is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide a model for both text-to-video as well as image + text-to-video usecases.
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+# LTX-Video
 
-</Tip>
+[LTX-Video](https://huggingface.co/Lightricks/LTX-Video) is a diffusion transformer designed for fast and real-time generation of high-resolution videos from text and images. The main feature of LTX-Video is the Video-VAE, which has a higher pixel to latent compression ratio (1:192) that enables more efficient video data processing and faster generation. To prevent finer details from being lost during generation, the Video-VAE decoder performs the latent to pixel conversion *and* the last denoising step.
 
-Available models:
+You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.
 
-| Model name | Recommended dtype |
-|:-------------:|:-----------------:|
-| [`LTX Video 0.9.0`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.safetensors) | `torch.bfloat16` |
-| [`LTX Video 0.9.1`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) | `torch.bfloat16` |
+> [!TIP]
+> Click on the LTX-Video models in the right sidebar for more examples of how to use LTX-Video for other video generation tasks.
 
-Note: The recommended dtype is for the transformer component. The VAE and text encoders can be either `torch.float32`, `torch.bfloat16` or `torch.float16` but the recommended dtype is `torch.bfloat16` as used in the original repository.
+The example below demonstrates how to generate a video optimized for memory or inference speed.
 
-## Loading Single Files
+<hfoptions id="usage">
+<hfoption id="memory">
 
-Loading the original LTX Video checkpoints is also possible with [`~ModelMixin.from_single_file`]. We recommend using `from_single_file` for the Lightricks series of models, as they plan to release multiple models in the future in the single file format.
-
-```python
+```py
 import torch
-from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel
+from diffusers import LTXPipeline, LTXVideoTransformer3DModel
+from diffusers.hooks import apply_group_offloading
+from diffusers.utils import export_to_video
 
-# `single_file_url` could also be https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.1.safetensors
-single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
-transformer = LTXVideoTransformer3DModel.from_single_file(
-    single_file_url, torch_dtype=torch.bfloat16
+# fp8 layerwise weight-casting
+transformer = LTXVideoTransformer3DModel.from_pretrained(
+    "Lightricks/LTX-Video",
+    subfolder="transformer",
+    torch_dtype=torch.bfloat16
 )
-vae = AutoencoderKLLTXVideo.from_single_file(single_file_url, torch_dtype=torch.bfloat16)
-pipe = LTXImageToVideoPipeline.from_pretrained(
-    "Lightricks/LTX-Video", transformer=transformer, vae=vae, torch_dtype=torch.bfloat16
+transformer.enable_layerwise_casting(
+    storage_dtype=torch.float8_e4m3fn,
+    compute_dtype=torch.bfloat16
 )
 
-# ... inference code ...
-```
+pipeline = LTXPipeline.from_pretrained("Lightricks/LTX-Video", transformer=transformer, torch_dtype=torch.bfloat16)
 
-Alternatively, the pipeline can be used to load the weights with [`~FromSingleFileMixin.from_single_file`].
+# group-offloading
+onload_device = torch.device("cuda")
+offload_device = torch.device("cpu")
+pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True)
+apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
+apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")
 
-```python
-import torch
-from diffusers import LTXImageToVideoPipeline
-from transformers import T5EncoderModel, T5Tokenizer
+prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
+negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
 
-single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
-text_encoder = T5EncoderModel.from_pretrained(
-    "Lightricks/LTX-Video", subfolder="text_encoder", torch_dtype=torch.bfloat16
-)
-tokenizer = T5Tokenizer.from_pretrained(
-    "Lightricks/LTX-Video", subfolder="tokenizer", torch_dtype=torch.bfloat16
-)
-pipe = LTXImageToVideoPipeline.from_single_file(
-    single_file_url, text_encoder=text_encoder, tokenizer=tokenizer, torch_dtype=torch.bfloat16
-)
+video = pipeline(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=768,
+    height=512,
+    num_frames=161,
+    decode_timestep=0.03,
+    decode_noise_scale=0.025,
+    num_inference_steps=50,
+).frames[0]
+export_to_video(video, "output.mp4", fps=24)
 ```
 
-Loading [LTX GGUF checkpoints](https://huggingface.co/city96/LTX-Video-gguf) are also supported:
+Reduce memory usage even more if necessary by quantizing a model to a lower precision data type.
 
 ```py
 import torch
 from diffusers.utils import export_to_video
-from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LTXVideoTransformer3DModel, LTXPipeline
+from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
 
-ckpt_path = (
-    "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf"
+# quantize weights to int8 with bitsandbytes
+quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+text_encoder = T5EncoderModel.from_pretrained(
+    "Lightricks/LTX-Video",
+    subfolder="text_encoder",
+    quantization_config=quantization_config,
+    torch_dtype=torch.bfloat16,
 )
-transformer = LTXVideoTransformer3DModel.from_single_file(
-    ckpt_path,
-    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
+
+quantization_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
+transformer = LTXVideoTransformer3DModel.from_pretrained(
+    "Lightricks/LTX-Video",
+    subfolder="transformer",
+    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
 )
-pipe = LTXPipeline.from_pretrained(
+
+pipeline = LTXPipeline.from_pretrained(
     "Lightricks/LTX-Video",
+    text_encoder=text_encoder,
     transformer=transformer,
     torch_dtype=torch.bfloat16,
 )
-pipe.enable_model_cpu_offload()
 
 prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
 negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
-
-video = pipe(
+video = pipeline(
     prompt=prompt,
     negative_prompt=negative_prompt,
-    width=704,
-    height=480,
+    width=768,
+    height=512,
     num_frames=161,
+    decode_timestep=0.03,
+    decode_noise_scale=0.025,
     num_inference_steps=50,
 ).frames[0]
-export_to_video(video, "output_gguf_ltx.mp4", fps=24)
+export_to_video(video, "output.mp4", fps=24)
 ```
 
-Make sure to read the [documentation on GGUF](../../quantization/gguf) to learn more about our GGUF support.
-
-<!-- TODO(aryan): Update this when official weights are supported -->
+</hfoption>
+<hfoption id="inference speed">
 
-Loading and running inference with [LTX Video 0.9.1](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) weights.
+Compilation is slow the first time but subsequent calls to the pipeline are faster.
 
-```python
+```py
 import torch
 from diffusers import LTXPipeline
 from diffusers.utils import export_to_video
 
-pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
-pipe.to("cuda")
+pipeline = LTXPipeline.from_pretrained(
+    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
+)
+
+# torch.compile
+pipeline.transformer.to(memory_format=torch.channels_last)
+pipeline.transformer = torch.compile(
+    pipeline.transformer, mode="max-autotune", fullgraph=True
+)
 
 prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
 negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
 
-video = pipe(
+video = pipeline(
     prompt=prompt,
     negative_prompt=negative_prompt,
     width=768,

@@ -141,48 +157,56 @@ video = pipe(
 export_to_video(video, "output.mp4", fps=24)
 ```
 
-Refer to [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) to learn more about optimizing memory consumption.
+</hfoption>
+</hfoptions>
 
-## Quantization
+## Notes
 
-Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
+- LTX-Video supports LoRAs with [`~LTXVideoLoraLoaderMixin.load_lora_weights`].
 
-Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LTXPipeline`] for inference with bitsandbytes.
+  ```py
+  import torch
+  from diffusers import LTXConditionPipeline
+  from diffusers.utils import export_to_video
 
-```py
-import torch
-from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LTXVideoTransformer3DModel, LTXPipeline
-from diffusers.utils import export_to_video
-from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
+  pipeline = LTXConditionPipeline.from_pretrained(
+      "Lightricks/LTX-Video-0.9.5", torch_dtype=torch.bfloat16
+  )
 
-quant_config = BitsAndBytesConfig(load_in_8bit=True)
-text_encoder_8bit = T5EncoderModel.from_pretrained(
-    "Lightricks/LTX-Video",
-    subfolder="text_encoder",
-    quantization_config=quant_config,
-    torch_dtype=torch.float16,
-)
+  pipeline.load_lora_weights("Lightricks/LTX-Video-Cakeify-LoRA", adapter_name="cakeify")
+  pipeline.set_adapters("cakeify", 0.9)
 
-quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
-transformer_8bit = LTXVideoTransformer3DModel.from_pretrained(
-    "Lightricks/LTX-Video",
-    subfolder="transformer",
-    quantization_config=quant_config,
-    torch_dtype=torch.float16,
-)
+  prompt = "CAKEIFY a person using a knife to cut a cake shaped like a pair of cowboy boots"
 
-pipeline = LTXPipeline.from_pretrained(
-    "Lightricks/LTX-Video",
-    text_encoder=text_encoder_8bit,
-    transformer=transformer_8bit,
-    torch_dtype=torch.float16,
-    device_map="balanced",
-)
+  video = pipeline(
+      prompt=prompt,
+      width=768,
+      height=512,
+      num_frames=161,
+      decode_timestep=0.03,
+      decode_noise_scale=0.025,
+      num_inference_steps=50,
+  ).frames[0]
+  export_to_video(video, "output.mp4", fps=24)
+  ```
+- LTX-Video supports loading from single files, such as [GGUF checkpoints](../../quantization/gguf), with [`FromOriginalModelMixin.from_single_file`] or [`FromSingleFileMixin.from_single_file`].
 
-prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
-video = pipeline(prompt=prompt, num_frames=161, num_inference_steps=50).frames[0]
-export_to_video(video, "ship.mp4", fps=24)
-```
+  ```py
+  import torch
+  from diffusers.utils import export_to_video
+  from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
+
+  transformer = LTXVideoTransformer3DModel.from_single_file(
+      "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf",
+      quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
+      torch_dtype=torch.bfloat16
+  )
+  pipeline = LTXPipeline.from_pretrained(
+      "Lightricks/LTX-Video",
+      transformer=transformer,
+      torch_dtype=torch.bfloat16
+  )
+  ```
 
 ## LTXPipeline
 
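The text removed above also mentioned an image + text-to-video use case. For reference, here is a minimal sketch (not part of this commit) using the `LTXImageToVideoPipeline` class that appears in the removed code; the conditioning image URL and prompt are placeholders:

```py
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipeline = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipeline.to("cuda")

# conditioning image; this URL is only a placeholder, substitute your own frame
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg")
prompt = "An astronaut waving at the camera, cinematic lighting"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipeline(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "image_to_video.mp4", fps=24)
```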
docs/source/en/api/pipelines/wan.md

+6 -4

@@ -12,12 +12,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License. -->
 
-# Wan
-
-<div class="flex flex-wrap space-x-1">
-  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+<div style="float: right;">
+  <div class="flex flex-wrap space-x-1">
+    <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+  </div>
 </div>
 
+# Wan
+
 [Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team.
 
 <!-- TODO(aryan): update abstract once paper is out -->

0 commit comments
