muiLLM (Machine Ultra Instinct LLM) is an experimental Python library for fast inference on AMD MI GPUs. We are on a journey to reach 1000+ tokens/s for inference at batch size 1 on MI300x (on Mistral 7b).
It works by replacing the implementation of HuggingFace Transformers layers to reach higher performance.
It has currently been tested on:
- AMD MI100 GPUs
- AMD MI300x GPUs
(MI250x GPUs probably work as well, but have not been tested.)
As the library is experimental, there are probably quite a few bugs lurking in there, but the speed results should be representative.
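Conceptually, the conversion walks the model's module tree and swaps supported layers for optimized equivalents. Here is a minimal sketch of that generic pattern (illustrative only, not muiLLM's actual conversion code; `FasterLinear` is a hypothetical replacement class):

```python
import torch.nn as nn

def swap_layers(model: nn.Module, target_type: type, make_replacement) -> nn.Module:
    # Generic module-swapping pattern: replace every child module of `target_type`
    # by an optimized equivalent built by `make_replacement`.
    # (Illustrative only: this is not muiLLM's actual conversion code.)
    for name, child in model.named_children():
        if isinstance(child, target_type):
            setattr(model, name, make_replacement(child))
        else:
            swap_layers(child, target_type, make_replacement)
    return model

# e.g. model = swap_layers(model, nn.Linear, lambda lin: FasterLinear.from_linear(lin))
# where FasterLinear is a hypothetical optimized implementation
```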
Currently, the supported models (i.e. with most or all of their layers optimized) are Mistral 7b and Llama 3 (see the examples below). More to come!
The following optimizations are already implemented:
- optimized linear layers with faster GEMV
- fused QKV
- fused MLP (Gate/Up + SiLU) (a sketch of the computation is shown after this list)
- fused residuals in linear layers
- fused RMSNorm in linear layers
- fused ROPE with write out in static/dynamic cache
- experimental support for int8 RTN
- flash decoding for attention computations
- reduced CPU/GPU synchronizations due to attention mask checks
- reduced CPU/GPU synchronizations during sampling
- static cache support
- reduced CPU overhead by using C++ modules instead of Python
- tensor parallelism support (still being improved):
  - sharded linear, MLP, attention layers
  - custom low-latency fused GEMV-all-reduce: ~8us latency for 2 MI300x GPUs
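For reference, here is what the fused MLP computes, written out as the equivalent unfused PyTorch operations (a conceptual sketch only; the actual optimization performs the same computation in fewer kernels, saving kernel launches and reads/writes of intermediate activations):

```python
import torch
import torch.nn.functional as F

def mlp_unfused(x, w_gate, w_up, w_down):
    # Unfused Gate/Up + SiLU MLP, as used in Mistral/Llama models.
    gate = F.silu(x @ w_gate.T)    # gate projection followed by SiLU activation
    up = x @ w_up.T                # up projection
    return (gate * up) @ w_down.T  # elementwise product, then down projection
```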
Future optimizations (by order of likely implementation):
- fp8 support
- further improvements to linear/fused MLP to reach higher memory bandwidth
- layer interleaving
The numbers change at every commit, so try it out for yourself!
But if you can't, here is the (outdated) approximate performance on a small prompt, generating 256 tokens, for a single GPU:
- fp16 on MI300x: 210 tokens/s on average
TODO: MI100 results, tensor parallelism results, comparison to Nvidia TensorRT, HuggingFace stock + compiled stock performance
The library has to be installed from source. Before doing so, PyTorch for ROCm has to be installed first; please refer to the PyTorch website for how to install it.
To make the building process faster, make sure you have ninja installed as well:

```sh
pip install ninja
```
First, clone the repository:

```sh
git clone https://github.com/Epliz/muiLLM.git
```

Go to the directory of the cloned repository:

```sh
cd muiLLM
```

And install the library (creating a virtual environment beforehand is recommended):

```sh
pip install --upgrade build
pip install wheel
python setup.py bdist_wheel && pip install ./dist/muillm-0.0.1-cp310-cp310-linux_x86_64.whl
```
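To verify the installation, you can try importing the package (a quick smoke test; it assumes the top-level module is named `muillm`, matching the wheel name):

```python
# should not raise if the wheel was built and installed correctly
import muillm
print(muillm.__name__)
```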
Then you can run one of the examples available in the `examples` folder:
- `examples/mistral7b.py`: an example of how to use muiLLM on the HuggingFace Transformers Mistral 7b model, in a batch size 1 scenario.
- `examples/mistral7b_batched.py`: an example of how to use muiLLM on the HuggingFace Transformers Mistral 7b model, in a batched inference scenario.
- `examples/tp_mistral7b.py`: an example of how to use tensor parallelism with muiLLM on the HuggingFace Transformers Mistral 7b model, in a batch size 1 scenario.
- `examples/tp_mistral7b_batched.py`: an example of how to use tensor parallelism with muiLLM on the HuggingFace Transformers Mistral 7b model, in a batched inference scenario.

There are other examples in that folder as well, including ones for Llama 3.
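For reference, here is a minimal sketch of the HuggingFace Transformers side of such a script (loading Mistral 7b and generating at batch size 1); the muiLLM-specific model conversion is shown in the examples above, as its exact API may change between commits:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumption: any Mistral 7b checkpoint works; this is just an illustrative one
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# <- this is where the examples insert the muiLLM conversion of the model

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```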
The tensor parallelism support uses either peer-to-peer memory transfers or staged CPU buffers to perform the collective operations.
For peer-to-peer to work, you will need to make sure that ACS is disabled. You can use the script from the AMD documentation.
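A quick way to check whether peer-to-peer access is reported as available between two GPUs is to ask PyTorch (a sketch; on ROCm builds of PyTorch, AMD GPUs are exposed through the "cuda" device API):

```python
import torch

# check whether GPU 0 can directly access the memory of GPU 1
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))
```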
For staged CPU buffers to work, you will need to make sure that your limit for locked memory (`ulimit -l`) is high enough.
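You can check the current locked-memory limit from Python with the standard `resource` module (a sketch; how high the limit needs to be depends on the model and buffer sizes):

```python
import resource

# RLIMIT_MEMLOCK is the per-process limit on memory that can be locked (pinned) in RAM;
# resource.RLIM_INFINITY (-1) means unlimited.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print(f"locked memory limit: soft={soft}, hard={hard}")
```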