<div align="center">
  <picture>
    <img alt="LightLLM" src="assets/lightllm.drawio.png" width=90%>
  </picture>
</div>

---
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.

## Features

- Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.
- Nopad (Unpad): offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.
- Dynamic Batch: enables dynamic batch scheduling of requests.
- [FlashAttention](https://github.com/Dao-AILab/flash-attention): incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.
- Tensor Parallelism: utilizes tensor parallelism over multiple GPUs for faster inference.
- [Token Attention](./docs/TokenAttention.md): implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference (a conceptual sketch follows this list).
- High-performance Router: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
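
To make the Token Attention bullet above concrete, here is a minimal, hypothetical sketch of token-wise KV cache bookkeeping: slots are reserved per token rather than per padded request, so memory is reclaimed the moment a request finishes. Every name below is invented for illustration; the real mechanism is described in [docs/TokenAttention.md](./docs/TokenAttention.md).

~~~python
# Simplified illustration only -- not LightLLM's actual implementation.
import torch


class NaiveTokenKVPool:
    """Hands out one KV-cache slot per token instead of padded per-request blocks."""

    def __init__(self, max_total_tokens: int):
        self.free_mask = torch.ones(max_total_tokens, dtype=torch.bool)  # True = free slot

    def alloc(self, num_tokens: int) -> torch.Tensor:
        free_idx = torch.nonzero(self.free_mask).squeeze(1)
        assert free_idx.numel() >= num_tokens, "out of KV cache slots"
        chosen = free_idx[:num_tokens]
        self.free_mask[chosen] = False
        return chosen  # slot indices, one per token; they need not be contiguous

    def free(self, slots: torch.Tensor) -> None:
        self.free_mask[slots] = True


pool = NaiveTokenKVPool(max_total_tokens=16)
req_a = pool.alloc(5)   # a 5-token prompt takes exactly 5 slots, no padding
req_b = pool.alloc(3)   # another request interleaves freely in the same pool
pool.free(req_a)        # finished requests release their slots immediately
~~~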

## Supported Model List

- [BLOOM](https://huggingface.co/bigscience/bloom)
- [LLaMA](https://github.com/facebookresearch/llama)
- [LLaMA V2](https://huggingface.co/meta-llama)

## Get started

### Requirements

The code has been tested with PyTorch >= 1.3, CUDA 11.8, and Python 3.9. To install the necessary dependencies, refer to the provided **requirements.txt** and run:

~~~shell
pip install -r requirements.txt
~~~

A more straightforward approach is to use the official Docker container:

~~~shell
docker build -t image_name .
docker run -it --gpus all -p 8080:80 -v your_local_path:/data/ image_name /bin/bash
~~~

### Installation

- Install from the source code by running:

~~~shell
python setup.py install
~~~

  The code has been tested on a range of GPUs, including V100, A100, A800, 4090, and H800. On V100, A100, A800, etc., we recommend using triton==2.0.0.dev20221202. On 4090, H800, etc., you need to compile and install [triton==2.1.0](https://github.com/openai/triton/tree/main) from source. If the code does not work on other GPUs, try modifying the triton kernels used in model inference. A rough install sketch follows.
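
For reference, the two installs would typically look like the following (an unverified sketch; check the [triton](https://github.com/openai/triton) repository for the current build steps):

~~~shell
# V100 / A100 / A800, etc.: install the recommended pre-release build
pip install triton==2.0.0.dev20221202

# 4090 / H800, etc.: build triton 2.1.0 from source (check out the revision that provides 2.1.0)
git clone https://github.com/openai/triton.git
cd triton/python
pip install cmake
pip install -e .
~~~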

### Run LLaMA
With its efficient Router and Token Attention, LightLLM can be deployed as a service and achieve state-of-the-art throughput performance.

Launch the server:

~~~shell
python -m lightllm.server.api_server --model_dir /path/llama-7B --tp 1 --max_total_token_num 120000
~~~

The parameter `max_total_token_num` should be set according to the GPU memory of the deployment environment. A larger value allows more requests to be processed concurrently. For more startup parameters, please refer to [api_server.py](lightllm/server/api_server.py).
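
As a back-of-the-envelope way to pick this value (an informal estimate, not an official formula; the exact budget also depends on activation memory and fragmentation), you can divide the memory left after loading the weights by the per-token KV cache size:

~~~python
# Rough sizing sketch for max_total_token_num -- informal estimate, not an official formula.
# Assumes LLaMA-7B weights and KV cache in fp16 on a single 80 GB GPU.
num_layers, hidden_size, dtype_bytes = 32, 4096, 2

# K and V are cached for every layer and every token: 2 * 32 * 4096 * 2 B = 512 KiB per token.
kv_bytes_per_token = 2 * num_layers * hidden_size * dtype_bytes

gpu_mem_gib = 80
weight_gib = 7e9 * dtype_bytes / 1024**3   # ~13 GiB of fp16 weights
reserve_gib = 6                            # headroom for activations, CUDA context, fragmentation

budget_bytes = (gpu_mem_gib - weight_gib - reserve_gib) * 1024**3
print(int(budget_bytes // kv_bytes_per_token))  # ~125k tokens, the same ballpark as the 120000 above
~~~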

To initiate a query in the shell:

~~~shell
curl 127.0.0.1:8000/generate \
    -X POST \
    -d '{"inputs":"What is AI?","parameters":{"max_new_tokens":17, "frequency_penalty":1}}' \
    -H 'Content-Type: application/json'
~~~

To query from Python:

~~~python
import requests
import json

url = 'http://localhost:8000/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': 'What is AI?',
    'parameters': {
        'do_sample': False,
        'ignore_eos': False,
        'max_new_tokens': 1024,
    }
}
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)
~~~

## Performance

### Service Performance

We compared the service performance of LightLLM and vLLM==0.1.2 on LLaMA-7B using an A800 GPU with 80 GB of memory.

To begin, prepare the data as follows:

~~~shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
~~~

Launch the service:

~~~shell
python -m lightllm.server.api_server --model_dir /path/llama-7b --tp 1 --max_total_token_num 121060 --tokenizer_mode auto
~~~

Run the evaluation:

~~~shell
cd test
python benchmark_serving.py --tokenizer /path/llama-7b --dataset /path/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000 --request-rate 200
~~~

The performance comparison results are presented below:

| vLLM                                                 | LightLLM                                              |
| ---------------------------------------------------- | ----------------------------------------------------- |
| Total time: 361.79 s<br/>Throughput: 5.53 requests/s | Total time: 188.85 s<br/>Throughput: 10.59 requests/s |

### Static inference performance

For debugging, we offer static performance testing scripts for various models. For instance, you can evaluate the inference performance of the LLaMA model by running:

~~~shell
cd test/lightllama
python test_model_infer.py
~~~

## FAQ

- If the LLaMA tokenizer fails to load, try running `pip install protobuf==3.20.0`.

## License

This repository is released under the [Apache-2.0](LICENSE) license.

## Acknowledgement

We learned a lot from the following projects when developing LightLLM.
- [Faster Transformer](https://github.com/NVIDIA/FasterTransformer)
- [Text Generation Inference](https://github.com/huggingface/text-generation-inference)
- [vLLM](https://github.com/vllm-project/vllm)
- [Flash Attention 1&2](https://github.com/Dao-AILab/flash-attention)