muiLLM (Machine Ultra Instinct LLM) is an experimental Python library for fast inference on AMD MI GPUs. We are on a journey to reach 1000+ tokens/s for inference at batch size 1 on MI300x (on Mistral 7b).
It works by replacing the implementation of HuggingFace Transformers layers to reach higher performance.
It has currently been tested on:
- AMD MI100 GPUs
- AMD MI300x GPUs
(MI250x GPUs probably work as well, but have not been tested.)
As the library is experimental, there are probably quite a few bugs lurking in there, but the speed results should be representative.
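Conceptually, the conversion walks the model's module tree and swaps supported layers for optimized equivalents. Here is a minimal sketch of that generic pattern (illustrative only, not muiLLM's actual conversion code; `FasterLinear` is a hypothetical replacement class):

```python
import torch.nn as nn

def swap_layers(model: nn.Module, target_type: type, make_replacement) -> nn.Module:
    # Generic module-swapping pattern: replace every child module of `target_type`
    # by an optimized equivalent built by `make_replacement`.
    # (Illustrative only: this is not muiLLM's actual conversion code.)
    for name, child in model.named_children():
        if isinstance(child, target_type):
            setattr(model, name, make_replacement(child))
        else:
            swap_layers(child, target_type, make_replacement)
    return model

# e.g. model = swap_layers(model, nn.Linear, lambda lin: FasterLinear.from_linear(lin))
# where FasterLinear is a hypothetical optimized implementation
```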
Currently, the supported models (i.e. with most or all of their layers optimized) are Mistral 7b and Llama 3 (see the examples below). More to come!
The following optimizations are already implemented:
- optimized linear layers with faster GEMV
- fused QKV
- fused MLP (Gate/Up + SiLU) (a sketch of the computation is shown after this list)
- fused residuals in linear layers
- fused RMSNorm in linear layers
- fused ROPE with write out in static/dynamic cache
- experimental support for int8 RTN
- flash decoding for attention computations
- reduced CPU/GPU synchronizations due to attention mask checks
- reduced CPU/GPU synchronizations during sampling
- static cache support
- reduced CPU overhead by using C++ modules instead of Python
- tensor parallelism support (still being improved):
  - sharded linear, MLP, attention layers
  - custom low-latency fused GEMV-all-reduce: ~8us latency for 2 MI300x GPUs
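For reference, here is what the fused MLP computes, written out as the equivalent unfused PyTorch operations (a conceptual sketch only; the actual optimization performs the same computation in fewer kernels, saving kernel launches and reads/writes of intermediate activations):

```python
import torch
import torch.nn.functional as F

def mlp_unfused(x, w_gate, w_up, w_down):
    # Unfused Gate/Up + SiLU MLP, as used in Mistral/Llama models.
    gate = F.silu(x @ w_gate.T)    # gate projection followed by SiLU activation
    up = x @ w_up.T                # up projection
    return (gate * up) @ w_down.T  # elementwise product, then down projection
```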
Future optimizations (by order of likely implementation):
- fp8 support
- further improvements to linear/fused MLP to reach higher memory bandwidth
- layer interleaving
The numbers change at every commit, so try it out for yourself!
But if you can't, here is the (outdated) approximate performance on a small prompt, generating 256 tokens, for a single GPU:
- fp16 on MI300x: 210 tokens/s on average
TODO: MI100 results, tensor parallelism results, comparison to Nvidia TensorRT, HuggingFace stock + compiled stock performance
The library has to be installed from source. Before doing so, PyTorch for ROCm has to be installed first; please refer to the PyTorch website for how to install it.
To make the building process faster, make sure you have ninja installed as well:

```sh
pip install ninja
```
First, clone the repository:

```sh
git clone https://github.com/Epliz/muiLLM.git
```

Go to the directory of the cloned repository:

```sh
cd muiLLM
```

And install the library (creating a virtual environment beforehand is recommended):

```sh
pip install --upgrade build
pip install wheel
python setup.py bdist_wheel && pip install ./dist/muillm-0.0.1-cp310-cp310-linux_x86_64.whl
```
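To verify the installation, you can try importing the package (a quick smoke test; it assumes the top-level module is named `muillm`, matching the wheel name):

```python
# should not raise if the wheel was built and installed correctly
import muillm
print(muillm.__name__)
```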
Then you can run one of the examples available in the `examples` folder:
- `examples/mistral7b.py`: an example of how to use muiLLM on the HuggingFace Transformers Mistral 7b model, in a batch size 1 scenario.
- `examples/mistral7b_batched.py`: an example of how to use muiLLM on the HuggingFace Transformers Mistral 7b model, in a batched inference scenario.
- `examples/tp_mistral7b.py`: an example of how to use tensor parallelism with muiLLM on the HuggingFace Transformers Mistral 7b model, in a batch size 1 scenario.
- `examples/tp_mistral7b_batched.py`: an example of how to use tensor parallelism with muiLLM on the HuggingFace Transformers Mistral 7b model, in a batched inference scenario.

There are other examples in that folder as well, including ones for Llama 3.
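For reference, here is a minimal sketch of the HuggingFace Transformers side of such a script (loading Mistral 7b and generating at batch size 1); the muiLLM-specific model conversion is shown in the examples above, as its exact API may change between commits:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumption: any Mistral 7b checkpoint works; this is just an illustrative one
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# <- this is where the examples insert the muiLLM conversion of the model

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```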
The tensor parallelism support uses either peer-to-peer memory transfers or staged CPU buffers to perform the collective operations.
For peer-to-peer to work, you will need to make sure that ACS is disabled. You can use the script from the AMD documentation.
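A quick way to check whether peer-to-peer access is reported as available between two GPUs is to ask PyTorch (a sketch; on ROCm builds of PyTorch, AMD GPUs are exposed through the "cuda" device API):

```python
import torch

# check whether GPU 0 can directly access the memory of GPU 1
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))
```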
For staged CPU buffers to work, you will need to make sure that your limit for locked memory (`ulimit -l`) is high enough.
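You can check the current locked-memory limit from Python with the standard `resource` module (a sketch; how high the limit needs to be depends on the model and buffer sizes):

```python
import resource

# RLIMIT_MEMLOCK is the per-process limit on memory that can be locked (pinned) in RAM;
# resource.RLIM_INFINITY (-1) means unlimited.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print(f"locked memory limit: soft={soft}, hard={hard}")
```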