
[Flux] Add batched inference #1227


Open
wants to merge 6 commits into main

Conversation

CarlosGomes98
Contributor

@CarlosGomes98 CarlosGomes98 commented May 27, 2025

This PR adds the ability to perform batched and multi-GPU inference on the Flux model, following from #1205.

  • The main modification is to the functions in sampling.py, allowing them to take several prompts and run inference on them as a batch.
  • Added an infer.py script which performs inference. This leverages the Trainer, which is great for reusability, but not so great because it inherits all the requirements of the trainer (such as a train dataset, which doesn't really make sense here).
  • Added a run_inference.sh script and documented it in README.md.
  • Added an example prompts file, prompts.txt.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 27, 2025
@CarlosGomes98
Contributor Author

@wwwjn @tianyu-l Here is the split-up inference bit :)

@@ -177,9 +202,9 @@ def denoise(
     # create positional encodings
     POSITION_DIM = 3
     latent_pos_enc = create_position_encoding_for_latents(
-        bsz, latent_height, latent_width, POSITION_DIM
+        1, latent_height, latent_width, POSITION_DIM
Contributor

QQ: Why do we change bsz to 1 here (and later), given that we are taking bsz prompts as input?

Contributor Author

There are two parts to this. The reason we can do this is that, for the particular tensors where I'm using 1 on the first dimension, the values are identical for all samples, so we just want to repeat them across the batch. Due to torch broadcasting, whenever such a tensor is used in an operation with another tensor, this dimension is expanded to match whatever the other tensor requires (torch automatically makes it whatever the batch size is).

The reason we want to do this is twofold:

  1. It saves some memory to not carry around all these repeated tensors, and to let torch broadcast during operations instead.
  2. If we don't do it, we would have to manually double the size of these tensors whenever we are doing classifier-free guidance. This way we don't have to worry about it, as everything broadcasts correctly (see the sketch below).
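
A minimal sketch of the broadcasting behavior being described (the shapes and names here are illustrative, not the actual Flux code):

import torch

bsz, seq_len, dim = 4, 16, 8

# Position encoding built once with a leading dimension of 1.
pos_enc = torch.randn(1, seq_len, dim)

# Latents carry the real batch size (doubled under classifier-free guidance).
latents = torch.randn(2 * bsz, seq_len, dim)

# Broadcasting expands the size-1 leading dimension to 2 * bsz automatically,
# so the same pos_enc works for any batch size without an explicit repeat().
out = latents + pos_enc
assert out.shape == (2 * bsz, seq_len, dim)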

Contributor

Ah, the reasoning makes sense to me.

It feels to me that, if the result of latent_pos_enc is always identical for all samples in a batch, we should probably just remove bsz as an input arg and not worry about the batch dimension at all until it is broadcast, instead of hardcoding bsz=1 in multiple places.

position_encoding = position_encoding.repeat(bsz, 1, 1)

@@ -203,9 +231,12 @@ def denoise(
        if enable_classifer_free_guidance:
            pred_u, pred_c = pred.chunk(2)
            pred = pred_u + classifier_free_guidance_scale * (pred_c - pred_u)

            pred = pred.repeat(2, 1, 1)
Contributor

And QQ: why do we need to repeat the first dimension of pred?

Contributor Author

My logic is as follows:

Previously, since we were dealing with just one input, we didn't have to do this: pred would end up with a bsz of 1, while in the classifier_free_guidance case latents would have a bsz of 2. Since pred had a bsz of 1, torch would broadcast it and everything worked.

However, now that we support batch sizes > 1, pred ends up with a batch size equal to half the bsz of latents, which in general is not 1. So we can no longer rely on broadcasting and have to do this repeat manually ourselves.
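
A rough sketch of the shape bookkeeping being described (the shapes, names, and guidance scale are illustrative, not the actual denoise loop):

import torch

bsz, seq_len, dim = 4, 16, 8
guidance_scale = 3.5

# Under classifier-free guidance, latents hold the unconditional and
# conditional copies back to back, so their leading dimension is 2 * bsz.
latents = torch.randn(2 * bsz, seq_len, dim)
model_out = torch.randn(2 * bsz, seq_len, dim)  # stand-in for the model call

# Combine the two halves into a single guided prediction with leading dim bsz.
pred_u, pred_c = model_out.chunk(2)
pred = pred_u + guidance_scale * (pred_c - pred_u)

# With bsz == 1, pred (1, ...) would still broadcast against latents (2, ...),
# but for bsz > 1 the shapes (bsz vs. 2 * bsz) no longer broadcast, so the
# prediction has to be repeated explicitly before updating the latents.
pred = pred.repeat(2, 1, 1)
latents = latents + 0.1 * pred  # 0.1 stands in for (t_prev - t_curr)
assert latents.shape == (2 * bsz, seq_len, dim)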


pil_images = [torch_to_pil(img) for img in all_images]
if config.inference.save_path:
    path = Path(config.job.dump_folder, config.inference.save_path)
Contributor

Can we reuse the save_image() function in sampling.py here?

Contributor

@wwwjn wwwjn left a comment

Thank you so much @CarlosGomes98 for splitting the diff and making such a clear inference script. This is a really good feature for FLUX!

Contributor

@fegin fegin left a comment

If the trainer makes some assumptions about the dataset, we can rethink the requirements of the trainer. It's time to make the trainer inference/eval friendly as well. cc @wwwjn

return results


if __name__ == "__main__":
Contributor

It's better to put the following logic in a separate function (like main()). This will allow easier logic reuse (e.g., for unit tests).
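
For instance, roughly (a sketch of the suggested structure, not the actual script; the body is elided):

def main() -> None:
    # All of the logic currently at module scope moves here, so unit tests
    # can simply import the module and call main() (or smaller helpers).
    ...


if __name__ == "__main__":
    main()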

Comment on lines +98 to +100
3,
256,
256,
Contributor

Can we turn these magic numbers into named constants, for readability?
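
For example, something along these lines (the constant names are just a suggestion, not from the PR):

# Hypothetical names for the image-shape magic numbers above.
NUM_CHANNELS = 3
IMG_HEIGHT = 256
IMG_WIDTH = 256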

config = config_manager.parse_args()
trainer = FluxTrainer(config)
world_size = int(os.environ["WORLD_SIZE"])
global_id = int(os.environ["RANK"])
Contributor

Can we call this rank to match the convention?

)
clip_tokenizer = FluxTokenizer(config.encoder.clip_encoder, max_length=77)

if global_id == 0:
Contributor

This is not required; logging should be controlled by the torchrun configuration, which the TorchTitan run scripts default to rank 0 only.

]

# Gather images from all processes
torch.distributed.all_gather(gathered_images, images)
Contributor

This should be gather(), not all_gather(), since you are not using the results on the other ranks. I don't know whether there will be any measurable performance gain, but gather() produces less total network traffic.
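
A minimal sketch of the suggested change, reusing the images, world_size, and global_id variables from the script (the exact call site may differ):

# Only rank 0 needs the gathered tensors, so gather() to a destination rank
# instead of all_gather() to every rank.
gather_list = (
    [torch.empty_like(images) for _ in range(world_size)] if global_id == 0 else None
)
torch.distributed.gather(images, gather_list=gather_list, dst=0)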

torch.distributed.all_gather(gathered_images, images)

# re-order the images to match the original ordering of prompts
if global_id == 0:
Contributor

Another good motivation for putting the logic in main(): you can do an early return here to remove one level of indentation.

img.save(
path / f"img_{i}.png", exif=exif_data, quality=95, subsampling=0
)
torch.distributed.destroy_process_group()
Contributor

We should do something like:

try:
    main()
finally:
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()

Contributor

@tianyu-l tianyu-l left a comment

Nice progress! I left some comments on the organization of the code.

# e.g.
# LOG_RANK=0,1 NGPU=4 ./torchtitan/experiments/flux/run_inference.sh

if [ -z "${JOB_FOLDER}" ]; then
Contributor

Can we default to job.dump_folder's default?

Contributor Author

I would argue that making it explicitly required makes using the script clearer and less error-prone, at the cost of some user-friendliness. But I understand that point as well. I'll make the change.

exif_data[ExifTags.Base.Model] = "Schnell"
exif_data[ExifTags.Base.ImageDescription] = original_prompts[i]
img.save(
    path / f"img_{i}.png", exif=exif_data, quality=95, subsampling=0
Contributor

If eventually we are saving individual image files, why do we even perform gather / all-gather? We could save images into the same folder from different ranks, just with unique names, e.g. rank_{i} in the .png name.
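
Something like this, roughly (reusing the img / exif_data / path variables from the snippet above; the filename scheme is just an example):

# Each rank writes its own images under a rank-unique filename, no gather needed.
img.save(
    path / f"rank_{global_id}_img_{i}.png", exif=exif_data, quality=95, subsampling=0
)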

Contributor Author

That's true. The tricky part is placing all the tensors back in the same order so we can match them up with the prompts. Because of the padding involved it's not super straightforward, but I'm sure there's a way to do it while having each rank write its own images. It just might take some more thought.

Contributor

An easier way to bypass the padding is to require the prompts file to have a length divisible by the DP degree, or by the world size. Users of this script can manually add empty rows if needed.
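
A small sketch of what that check/padding could look like (a hypothetical helper, not part of the PR):

def pad_prompts(prompts: list[str], world_size: int) -> list[str]:
    """Pad the prompt list with empty prompts so it splits evenly across ranks."""
    remainder = len(prompts) % world_size
    if remainder:
        prompts = prompts + [""] * (world_size - remainder)
    return prompts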



@record
def inference(
Contributor

It's not obvious to me why this is worth a standalone function. Can we just call generate_image in the main script?

Contributor Author

The only difference here is handling the batching. This could be handled by a method in the trainer, but I previously had it that way and refactored it out after discussions in #1205.

I think both are valid; it's a matter of a design decision for torchtitan.

Contributor

I see. Sounds OK to me.

Contributor

Functionality-wise, this seems similar to torchtitan/experiments/flux/tests/test_generate_image.py, but with a parallelized model.
I think we can make test_generate_image a unit test, if not remove it, once this multi-GPU generation lands. @wwwjn

Contributor

To group files a bit more logically, can we put run_inference.sh, prompts.txt, and infer.py under the flux/inference folder? We can leave sampling.py outside, as it's also used by the evaluation in train.py.

latents = latents + (t_prev - t_curr) * pred

if enable_classifer_free_guidance:
    latents = latents.chunk(2)[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this code, together with the pred = pred.repeat(2, 1, 1) above, looks very obscure.
If they are necessary, please add adequate comments.

@tianyu-l
Contributor

tianyu-l commented May 28, 2025

If the trainer makes some assumptions about the dataset, we can rethink the requirements of the trainer. It's time to make the trainer inference/eval friendly as well.

@fegin
Interesting... I had thought the Trainer, as its name suggests, should in principle be used for training but not eval/inference? (Validation is part of training.)

BTW, for Llama we have a dedicated script for multi-GPU generation/inference, which doesn't reuse the trainer:
https://github.com/pytorch/torchtitan/blob/main/scripts/generate/test_generate.py
Actually it couldn't reuse it, because Sequence Parallel doesn't work with odd-length language sequences.

@fegin
Contributor

fegin commented May 28, 2025

@tianyu-l Naming is just a minor issue. IMO, it depends on how much code and logic are shared. If the components are the common pieces but the trainer is not, then I agree we shouldn't worry too much about this. However, if there is largely duplicated logic, especially for performance-critical parts (e.g., GC), we should either refactor the trainer to make it more generic or evaluate the common logic in the trainer to see if we can move it into the components.
