lora-fast

Minimal repository to demonstrate fast LoRA inference with Flux.1-dev using different settings that can help with speed or memory efficiency. Please check out the accompanying blog post for details.

The included benchmark script lets you experiment with the following (a sketch of how the main options combine follows the list):

  • FlashAttention3
  • torch.compile
  • Quantization
  • LoRA hot-swapping
  • CPU offloading
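
As a rough sketch of how hot-swapping, compilation, and quantization fit together, assuming a recent diffusers release with LoRA hotswap support (the LoRA repository names below are placeholders, and this is not necessarily the exact code run_benchmark.py uses):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Prepare the pipeline so LoRAs up to this rank can later be swapped in
# without changing tensor shapes (and hence without recompilation).
pipe.enable_lora_hotswap(target_rank=128)

# Load the first LoRA, then compile once.
pipe.load_lora_weights("placeholder/lora-one", adapter_name="lora")  # placeholder repo id
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)

image = pipe("a puppy in a field", num_inference_steps=28).images[0]

# Hot-swap a second LoRA into the already-compiled transformer; with a
# matching (or smaller) rank this avoids recompilation stalls.
pipe.load_lora_weights("placeholder/lora-two", hotswap=True, adapter_name="lora")
image = pipe("a puppy in a field", num_inference_steps=28).images[0]
```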

Key results

| Option | Time (s) ⬇️ | Speedup (vs baseline) ⬆️ | Notes |
| --- | --- | --- | --- |
| baseline | 7.8910 | – | Baseline |
| optimized | 3.5464 | 2.23× | Hot-swapping + compilation without recompilation hiccups (FP8 on by default) |
| no_fp8 | 4.3520 | 1.81× | Same as optimized, but with FP8 quantization disabled |
| no_fa3 | 4.3020 | 1.84× | Disable FA3 (FlashAttention v3) |
| baseline + compile | 5.0920 | 1.55× | Compilation on, but suffers from intermittent recompilation stalls |
| no_fa3_fp8 | 5.0850 | 1.55× | Disable FA3 and FP8 |
| no_compile_fp8 | 7.5190 | 1.05× | Disable FP8 quantization and compilation |
| no_compile | 10.4340 | 0.76× | Disable compilation: the slowest setting |
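
"FP8" in the table refers to float8 quantization of the transformer. A minimal sketch of one way to apply it, assuming a torchao version with float8 support is installed (illustrative only, not necessarily the exact recipe the benchmark uses):

```python
import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Dynamically quantize the transformer's linear layers to float8:
# weights are stored in FP8 and activations are quantized on the fly.
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
```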

Installation

The requirements for this repository are listed in requirements.txt. Please ensure they are installed in your Python environment, e.g. by running:

python -m pip install -r requirements.txt

FlashAttention3

Optionally, use FlashAttention3 for even better performance. This requires a Hopper GPU (e.g. H100). Follow the install instructions here.
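
A quick way to check whether the FA3 kernels are importable in your environment (a minimal sketch; the flash_attn_interface module name is what the Hopper build of flash-attention exposes, so treat it as an assumption about your install):

```python
# Sketch: probe for FlashAttention-3 availability before enabling it.
try:
    import flash_attn_interface  # exposed by the FA3 (Hopper) build; assumption
    print("FlashAttention-3 kernels available")
except ImportError:
    print("FlashAttention-3 not installed; use the no_fa3 settings instead")
```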

Running the benchmarks

Run the benchmarks using the provided run_benchmark.py script. To check the available arguments, run:

python run_benchmark.py --help

If you want to run a battery of different settings, shell scripts are provided for that (example invocations below). Use run_experiments.sh if you have a server GPU like an H100; use run_exps_rtx_4090.sh if you have a consumer GPU with 24 GB of memory, like an RTX 4090. The benchmark data and sample images are stored in the results/ directory by default.
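
For example, from the repository root:

bash run_experiments.sh        # server GPU, e.g. H100

bash run_exps_rtx_4090.sh      # 24 GB consumer GPU, e.g. RTX 4090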

Standalone script

The inference_lora.py script implements the optimizations in sequence and is geared towards an H100. It serves as a simpler reference than run_benchmark.py: use it if you just want to run optimized LoRA inference without conducting any benchmarking.
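
For example (assuming the script needs no arguments; check its source for configurable options):

python inference_lora.py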
