2 changes: 2 additions & 0 deletions README.md
@@ -124,6 +124,7 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<li>Qwen2 (0.5B - 72B)</li>
<li>Qwen2-MoE (57BA14B)</li>
<li>Qwen2.5 (0.5B - 32B)</li>
<li>Qwen3, Qwen3-MoE</li>
<li>Baichuan (7B)</li>
<li>Baichuan2 (7B-13B)</li>
<li>Code Llama (7B - 34B)</li>
@@ -158,6 +159,7 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<li>InternVL-Chat (v1.1-v1.5)</li>
<li>InternVL2 (1B-76B)</li>
<li>InternVL2.5(MPO) (1B-78B)</li>
<li>InternVL3 (1B-78B)</li>
<li>Mono-InternVL (2B)</li>
<li>ChemVLM (8B-26B)</li>
<li>CogVLM-Chat (17B)</li>
2 changes: 2 additions & 0 deletions README_ja.md
@@ -122,6 +122,7 @@ The LMDeploy TurboMind engine has outstanding inference capability, and across a variety of
<li>Qwen2 (0.5B - 72B)</li>
<li>Qwen2-MoE (57BA14B)</li>
<li>Qwen2.5 (0.5B - 32B)</li>
<li>Qwen3, Qwen3-MoE</li>
<li>Baichuan (7B)</li>
<li>Baichuan2 (7B-13B)</li>
<li>Code Llama (7B - 34B)</li>
@@ -156,6 +157,7 @@ The LMDeploy TurboMind engine has outstanding inference capability, and across a variety of
<li>InternVL-Chat (v1.1-v1.5)</li>
<li>InternVL2 (1B-76B)</li>
<li>InternVL2.5(MPO) (1B-78B)</li>
<li>InternVL3 (1B-78B)</li>
<li>Mono-InternVL (2B)</li>
<li>ChemVLM (8B-26B)</li>
<li>CogVLM-Chat (17B)</li>
2 changes: 2 additions & 0 deletions README_zh-CN.md
@@ -126,6 +126,7 @@ The LMDeploy TurboMind engine has outstanding inference capability; across models of all scales
<li>Qwen2 (0.5B - 72B)</li>
<li>Qwen2-MoE (57BA14B)</li>
<li>Qwen2.5 (0.5B - 32B)</li>
<li>Qwen3, Qwen3-MoE</li>
<li>Baichuan (7B)</li>
<li>Baichuan2 (7B-13B)</li>
<li>Code Llama (7B - 34B)</li>
@@ -160,6 +161,7 @@ The LMDeploy TurboMind engine has outstanding inference capability; across models of all scales
<li>InternVL-Chat (v1.1-v1.5)</li>
<li>InternVL2 (1B-76B)</li>
<li>InternVL2.5(MPO) (1B-78B)</li>
<li>InternVL3 (1B-78B)</li>
<li>Mono-InternVL (2B)</li>
<li>ChemVLM (8B-26B)</li>
<li>CogVLM-Chat (17B)</li>
47 changes: 46 additions & 1 deletion docker/Dockerfile_aarch64_ascend
@@ -110,7 +110,8 @@ RUN echo "source /usr/local/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc && \
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install torch==2.3.1 torchvision==0.18.1 torch-npu==2.3.1 && \
pip3 install transformers timm && \
pip3 install dlinfer-ascend
pip3 install dlinfer-ascend && \
pip3 install partial_json_parser shortuuid

# lmdeploy
FROM build_temp as copy_temp
@@ -122,3 +123,47 @@ WORKDIR /opt/lmdeploy

RUN --mount=type=cache,target=/root/.cache/pip \
LMDEPLOY_TARGET_DEVICE=ascend pip3 install -v --no-build-isolation -e .

ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/$(arch):$LD_LIBRARY_PATH
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/tools/aml/lib64:${ASCEND_TOOLKIT_HOME}/tools/aml/lib64/plugin:$LD_LIBRARY_PATH
ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:$PYTHONPATH
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${ASCEND_TOOLKIT_HOME}/tools/ccec_compiler/bin:$PATH
ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}

ENV ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
ENV LD_LIBRARY_PATH=${ATB_HOME_PATH}/lib:${ATB_HOME_PATH}/examples:${ATB_HOME_PATH}/tests/atbopstest:$LD_LIBRARY_PATH
ENV PATH=${ATB_HOME_PATH}/bin:$PATH

ENV ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ENV ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ENV ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ENV ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ENV ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ENV ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ENV ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ENV ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ENV ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ENV ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ENV ATB_COMPARE_TILING_EVERY_KERNEL=0
ENV ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ENV ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ENV ATB_SHARE_MEMORY_NAME_SUFFIX=""
ENV ATB_LAUNCH_KERNEL_WITH_TILING=1
ENV ATB_MATMUL_SHUFFLE_K_ENABLE=1
ENV ATB_RUNNER_POOL_SIZE=64

ENV ASDOPS_HOME_PATH=${ATB_HOME_PATH}
ENV ASDOPS_MATMUL_PP_FLAG=1
ENV ASDOPS_LOG_LEVEL=ERROR
ENV ASDOPS_LOG_TO_STDOUT=0
ENV ASDOPS_LOG_TO_FILE=1
ENV ASDOPS_LOG_TO_FILE_FLUSH=0
ENV ASDOPS_LOG_TO_BOOST_TYPE=atb
ENV ASDOPS_LOG_PATH=~
ENV ASDOPS_TILING_PARSE_CACHE_DISABLE=0

ENV LCCL_DETERMINISTIC=0
10 changes: 10 additions & 0 deletions docs/en/get_started/ascend/get_started.md
@@ -158,6 +158,16 @@ lmdeploy lite auto_awq $HF_MODEL --work-dir $WORK_DIR --device npu

Please check [supported_models](../../supported_models/supported_models.md) before using this feature.

### w8a8 SMOOTH_QUANT

Run the following command to quantize weights on Atlas 800T A2.

```bash
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --device npu
```

Please check [supported_models](../../supported_models/supported_models.md) before using this feature.
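Once quantization finishes, the W8A8 weights written to `$WORK_DIR` can be loaded by the PyTorch engine on the Ascend backend. A minimal sketch of the end-to-end workflow — the model name and output directory below are hypothetical placeholders, and the serving flags assume the standard Ascend deployment options:

```shell
# Hypothetical paths -- substitute your own model and output directory.
export HF_MODEL=internlm/internlm2_5-7b-chat
export WORK_DIR=./internlm2_5-7b-chat-w8a8

# Quantize the weights, then serve the quantized model with the
# PyTorch engine on the Ascend backend in eager mode.
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --device npu
lmdeploy serve api_server $WORK_DIR --backend pytorch --device ascend --eager-mode
```

Both commands require Ascend NPU hardware and drivers to be present, so treat this as an outline rather than a copy-paste recipe.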

### int8 KV-cache Quantization

The Ascend backend supports offline int8 KV-cache quantization in eager mode.
2 changes: 1 addition & 1 deletion docs/en/get_started/installation.md
@@ -23,7 +23,7 @@ pip install lmdeploy
The default prebuilt package is compiled on **CUDA 12**. If CUDA 11+ (>=11.3) is required, you can install lmdeploy by:

```shell
export LMDEPLOY_VERSION=0.7.2.post1
export LMDEPLOY_VERSION=0.8.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
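The wheel filename is assembled from the two environment variables, so a quick sanity check before installing is to expand the URL yourself and confirm it matches an asset on the GitHub release page. A small sketch using the same version pair as the command above:

```shell
export LMDEPLOY_VERSION=0.8.0
export PYTHON_VERSION=38
# Print the fully expanded wheel URL so it can be verified against the
# published release assets before being passed to pip.
echo "https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl"
```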
15 changes: 8 additions & 7 deletions docs/en/multi_modal/internvl.md
@@ -2,13 +2,14 @@

LMDeploy supports the following InternVL series of models, which are detailed in the table below:

| Model | Size | Supported Inference Engine |
| :-----------: | :--------: | :------------------------: |
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 1B, 4B | PyTorch |
| InternVL2 | 2B, 8B-76B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |
| Model | Size | Supported Inference Engine |
| :-------------------: | :-----------: | :------------------------: |
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 4B | PyTorch |
| InternVL2 | 1B-2B, 8B-76B | TurboMind, PyTorch |
| InternVL2.5/2.5-MPO/3 | 1B-78B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |

The next chapter demonstrates how to deploy an InternVL model using LMDeploy, with [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example.

43 changes: 26 additions & 17 deletions docs/en/supported_models/supported_models.md
@@ -22,6 +22,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| Qwen2<sup>\[2\]</sup> | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
| Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
| Qwen2.5<sup>\[2\]</sup> | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
| Qwen3 | 0.6B-235B | LLM | Yes | Yes | Yes\* | Yes\* |
| Mistral<sup>\[1\]</sup> | 7B | LLM | Yes | Yes | Yes | No |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
@@ -36,6 +37,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
| InternVL2<sup>\[2\]</sup> | 1 - 2B, 8B - 76B | MLLM | Yes | Yes\* | Yes\* | Yes |
| InternVL2.5(MPO)<sup>\[2\]</sup> | 1 - 78B | MLLM | Yes | Yes\* | Yes\* | Yes |
| InternVL3<sup>\[2\]</sup> | 1 - 78B | MLLM | Yes | Yes\* | Yes\* | Yes |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
@@ -76,6 +78,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| Qwen3 | 0.6B - 235B | LLM | Yes | Yes | Yes\* | - | Yes\* |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | Yes |
| QWen2.5-VL | 3B - 72B | MLLM | Yes | No | No | No | No |
| DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
@@ -95,6 +98,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
| InternVL2 | 1B-76B | MLLM | Yes | Yes | Yes | - | - |
| InternVL2.5(MPO) | 1B-78B | MLLM | Yes | Yes | Yes | - | - |
| InternVL3 | 1B-78B | MLLM | Yes | Yes | Yes | - | - |
| Mono-InternVL<sup>\[1\]</sup> | 2B | MLLM | Yes | Yes | Yes | - | - |
| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | - | - |
@@ -114,20 +118,25 @@

## PyTorchEngine on Huawei Ascend Platform

| Model | Size | Type | FP16/BF16(eager) | FP16/BF16(graph) | W4A16(eager) |
| :------------: | :------: | :--: | :--------------: | :--------------: | :----------: |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes |
| Llama3 | 8B | LLM | Yes | Yes | Yes |
| Llama3.1 | 8B | LLM | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes |
| InternLM2.5 | 7B - 20B | LLM | Yes | Yes | Yes |
| InternLM3 | 8B | LLM | Yes | Yes | Yes |
| Mixtral | 8x7B | LLM | Yes | Yes | No |
| QWen1.5-MoE | A2.7B | LLM | Yes | - | No |
| QWen2(.5) | 7B | LLM | Yes | Yes | No |
| QWen2-MoE | A14.57B | LLM | Yes | - | No |
| DeepSeek-V2 | 16B | LLM | No | Yes | No |
| InternVL(v1.5) | 2B-26B | MLLM | Yes | - | Yes |
| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes |
| CogVLM2-chat | 19B | MLLM | Yes | No | - |
| GLM4V | 9B | MLLM | Yes | No | - |
| Model | Size | Type | FP16/BF16(eager) | FP16/BF16(graph) | W8A8(graph) | W4A16(eager) |
| :------------: | :-------: | :--: | :--------------: | :--------------: | :---------: | :----------: |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM3 | 8B | LLM | Yes | Yes | Yes | Yes |
| Mixtral | 8x7B | LLM | Yes | Yes | No | No |
| QWen1.5-MoE | A2.7B | LLM | Yes | - | No | No |
| QWen2(.5) | 7B | LLM | Yes | Yes | Yes | Yes |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | - | - |
| QWen2.5-VL | 3B - 72B | MLLM | Yes | Yes | - | - |
| QWen2-MoE | A14.57B | LLM | Yes | - | No | No |
| QWen3 | 0.6B-235B | LLM | Yes | Yes | No | No |
| DeepSeek-V2 | 16B | LLM | No | Yes | No | No |
| InternVL(v1.5) | 2B-26B | MLLM | Yes | - | Yes | Yes |
| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | Yes |
| InternVL2.5 | 1B-78B | MLLM | Yes | Yes | Yes | Yes |
| InternVL3 | 1B-78B | MLLM | Yes | Yes | Yes | Yes |
| CogVLM2-chat | 19B | MLLM | Yes | No | - | - |
| GLM4V | 9B | MLLM | Yes | No | - | - |
10 changes: 10 additions & 0 deletions docs/zh_cn/get_started/ascend/get_started.md
@@ -154,6 +154,16 @@ lmdeploy lite auto_awq $HF_MODEL --work-dir $WORK_DIR --device npu

For the list of supported models, please refer to [supported models](../../supported_models/supported_models.md).

### w8a8 SMOOTH_QUANT

Run the following command to apply W8A8 quantization to the weights on Atlas 800T A2.

```bash
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --device npu
```

For the list of supported models, please refer to [supported models](../../supported_models/supported_models.md).

### int8 KV-cache Quantization

The Ascend backend now supports offline int8 KV-cache quantization in eager mode.
2 changes: 1 addition & 1 deletion docs/zh_cn/get_started/installation.md
@@ -23,7 +23,7 @@ pip install lmdeploy
The default prebuilt package is compiled on **CUDA 12**. If CUDA 11+ (>=11.3) is required, you can install lmdeploy with the following commands:

```shell
export LMDEPLOY_VERSION=0.7.2.post1
export LMDEPLOY_VERSION=0.8.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
15 changes: 8 additions & 7 deletions docs/zh_cn/multi_modal/internvl.md
@@ -2,13 +2,14 @@

LMDeploy supports the InternVL series of models, as detailed in the table below:

| Model | Size | Supported Inference Engine |
| :-----------: | :--------: | :------------------------: |
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 1B, 4B | PyTorch |
| InternVL2 | 2B, 8B-76B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |
| Model | Size | Supported Inference Engine |
| :-------------------: | :-----------: | :------------------------: |
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 4B | PyTorch |
| InternVL2 | 1B-2B, 8B-76B | TurboMind, PyTorch |
| InternVL2.5/2.5-MPO/3 | 1B-78B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |

This article uses [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example to demonstrate how to deploy InternVL series models with LMDeploy.
