2 changes: 2 additions & 0 deletions README.md
@@ -124,6 +124,7 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<li>Qwen2 (0.5B - 72B)</li>
<li>Qwen2-MoE (57BA14B)</li>
<li>Qwen2.5 (0.5B - 32B)</li>
<li>Qwen3, Qwen3-MoE</li>
<li>Baichuan (7B)</li>
<li>Baichuan2 (7B-13B)</li>
<li>Code Llama (7B - 34B)</li>
@@ -158,6 +159,7 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<li>InternVL-Chat (v1.1-v1.5)</li>
<li>InternVL2 (1B-76B)</li>
<li>InternVL2.5(MPO) (1B-78B)</li>
<li>InternVL3 (1B-78B)</li>
<li>Mono-InternVL (2B)</li>
<li>ChemVLM (8B-26B)</li>
<li>CogVLM-Chat (17B)</li>
2 changes: 2 additions & 0 deletions README_ja.md
@@ -122,6 +122,7 @@ The LMDeploy TurboMind engine has outstanding inference capability, and across a variety of
<li>Qwen2 (0.5B - 72B)</li>
<li>Qwen2-MoE (57BA14B)</li>
<li>Qwen2.5 (0.5B - 32B)</li>
<li>Qwen3, Qwen3-MoE</li>
<li>Baichuan (7B)</li>
<li>Baichuan2 (7B-13B)</li>
<li>Code Llama (7B - 34B)</li>
@@ -156,6 +157,7 @@ The LMDeploy TurboMind engine has outstanding inference capability, and across a variety of
<li>InternVL-Chat (v1.1-v1.5)</li>
<li>InternVL2 (1B-76B)</li>
<li>InternVL2.5(MPO) (1B-78B)</li>
<li>InternVL3 (1B-78B)</li>
<li>Mono-InternVL (2B)</li>
<li>ChemVLM (8B-26B)</li>
<li>CogVLM-Chat (17B)</li>
2 changes: 2 additions & 0 deletions README_zh-CN.md
@@ -126,6 +126,7 @@ The LMDeploy TurboMind engine has outstanding inference capability; across models of all scales
<li>Qwen2 (0.5B - 72B)</li>
<li>Qwen2-MoE (57BA14B)</li>
<li>Qwen2.5 (0.5B - 32B)</li>
<li>Qwen3, Qwen3-MoE</li>
<li>Baichuan (7B)</li>
<li>Baichuan2 (7B-13B)</li>
<li>Code Llama (7B - 34B)</li>
@@ -160,6 +161,7 @@ The LMDeploy TurboMind engine has outstanding inference capability; across models of all scales
<li>InternVL-Chat (v1.1-v1.5)</li>
<li>InternVL2 (1B-76B)</li>
<li>InternVL2.5(MPO) (1B-78B)</li>
<li>InternVL3 (1B-78B)</li>
<li>Mono-InternVL (2B)</li>
<li>ChemVLM (8B-26B)</li>
<li>CogVLM-Chat (17B)</li>
47 changes: 46 additions & 1 deletion docker/Dockerfile_aarch64_ascend
@@ -110,7 +110,8 @@ RUN echo "source /usr/local/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc && \
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install torch==2.3.1 torchvision==0.18.1 torch-npu==2.3.1 && \
pip3 install transformers timm && \
pip3 install dlinfer-ascend
pip3 install dlinfer-ascend && \
pip3 install partial_json_parser shortuuid

# lmdeploy
FROM build_temp as copy_temp
@@ -122,3 +123,47 @@ WORKDIR /opt/lmdeploy

RUN --mount=type=cache,target=/root/.cache/pip \
LMDEPLOY_TARGET_DEVICE=ascend pip3 install -v --no-build-isolation -e .

ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/$(arch):$LD_LIBRARY_PATH
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/tools/aml/lib64:${ASCEND_TOOLKIT_HOME}/tools/aml/lib64/plugin:$LD_LIBRARY_PATH
ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:$PYTHONPATH
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${ASCEND_TOOLKIT_HOME}/tools/ccec_compiler/bin:$PATH
ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}

ENV ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
ENV LD_LIBRARY_PATH=${ATB_HOME_PATH}/lib:${ATB_HOME_PATH}/examples:${ATB_HOME_PATH}/tests/atbopstest:$LD_LIBRARY_PATH
ENV PATH=${ATB_HOME_PATH}/bin:$PATH

ENV ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ENV ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ENV ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ENV ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ENV ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ENV ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ENV ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ENV ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ENV ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ENV ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ENV ATB_COMPARE_TILING_EVERY_KERNEL=0
ENV ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ENV ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ENV ATB_SHARE_MEMORY_NAME_SUFFIX=""
ENV ATB_LAUNCH_KERNEL_WITH_TILING=1
ENV ATB_MATMUL_SHUFFLE_K_ENABLE=1
ENV ATB_RUNNER_POOL_SIZE=64

ENV ASDOPS_HOME_PATH=${ATB_HOME_PATH}
ENV ASDOPS_MATMUL_PP_FLAG=1
ENV ASDOPS_LOG_LEVEL=ERROR
ENV ASDOPS_LOG_TO_STDOUT=0
ENV ASDOPS_LOG_TO_FILE=1
ENV ASDOPS_LOG_TO_FILE_FLUSH=0
ENV ASDOPS_LOG_TO_BOOST_TYPE=atb
ENV ASDOPS_LOG_PATH=~
ENV ASDOPS_TILING_PARSE_CACHE_DISABLE=0

ENV LCCL_DETERMINISTIC=0
10 changes: 10 additions & 0 deletions docs/en/get_started/ascend/get_started.md
@@ -158,6 +158,16 @@ lmdeploy lite auto_awq $HF_MODEL --work-dir $WORK_DIR --device npu

Please check [supported_models](../../supported_models/supported_models.md) before using this feature.

### w8a8 SMOOTH_QUANT

Run the following command to quantize weights on Atlas 800T A2.

```bash
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --device npu
```

Please check [supported_models](../../supported_models/supported_models.md) before using this feature.
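Once quantization finishes, the W8A8 weights written to `$WORK_DIR` can be loaded by the PyTorch engine on the Ascend backend. A minimal sketch of the end-to-end workflow — the model name and output directory below are hypothetical placeholders, and the serving flags assume the standard Ascend deployment options:

```shell
# Hypothetical paths -- substitute your own model and output directory.
export HF_MODEL=internlm/internlm2_5-7b-chat
export WORK_DIR=./internlm2_5-7b-chat-w8a8

# Quantize the weights, then serve the quantized model with the
# PyTorch engine on the Ascend backend in eager mode.
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --device npu
lmdeploy serve api_server $WORK_DIR --backend pytorch --device ascend --eager-mode
```

Both commands require Ascend NPU hardware and drivers to be present, so treat this as an outline rather than a copy-paste recipe.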

### int8 KV-cache Quantization

The Ascend backend supports offline int8 KV-cache quantization in eager mode.
2 changes: 1 addition & 1 deletion docs/en/get_started/installation.md
@@ -23,7 +23,7 @@ pip install lmdeploy
The default prebuilt package is compiled on **CUDA 12**. If CUDA 11+ (>=11.3) is required, you can install lmdeploy by:

```shell
export LMDEPLOY_VERSION=0.7.2.post1
export LMDEPLOY_VERSION=0.8.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
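The wheel filename is assembled from the two environment variables, so a quick sanity check before installing is to expand the URL yourself and confirm it matches an asset on the GitHub release page. A small sketch using the same version pair as the command above:

```shell
export LMDEPLOY_VERSION=0.8.0
export PYTHON_VERSION=38
# Print the fully expanded wheel URL so it can be verified against the
# published release assets before being passed to pip.
echo "https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl"
```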
15 changes: 8 additions & 7 deletions docs/en/multi_modal/internvl.md
@@ -2,13 +2,14 @@

LMDeploy supports the following InternVL series of models, which are detailed in the table below:

| Model | Size | Supported Inference Engine |
| :-----------: | :--------: | :------------------------: |
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 1B, 4B | PyTorch |
| InternVL2 | 2B, 8B-76B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |
| Model | Size | Supported Inference Engine |
| :-------------------: | :-----------: | :------------------------: |
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 4B | PyTorch |
| InternVL2 | 1B-2B, 8B-76B | TurboMind, PyTorch |
| InternVL2.5/2.5-MPO/3 | 1B-78B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |

The next chapter demonstrates how to deploy an InternVL model using LMDeploy, with [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example.

43 changes: 26 additions & 17 deletions docs/en/supported_models/supported_models.md
@@ -22,6 +22,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| Qwen2<sup>\[2\]</sup> | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
| Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
| Qwen2.5<sup>\[2\]</sup> | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
| Qwen3 | 0.6B-235B | LLM | Yes | Yes | Yes\* | Yes\* |
| Mistral<sup>\[1\]</sup> | 7B | LLM | Yes | Yes | Yes | No |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
@@ -36,6 +37,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
| InternVL2<sup>\[2\]</sup> | 1 - 2B, 8B - 76B | MLLM | Yes | Yes\* | Yes\* | Yes |
| InternVL2.5(MPO)<sup>\[2\]</sup> | 1 - 78B | MLLM | Yes | Yes\* | Yes\* | Yes |
| InternVL3<sup>\[2\]</sup> | 1 - 78B | MLLM | Yes | Yes\* | Yes\* | Yes |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
@@ -76,6 +78,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| Qwen3 | 0.6B - 235B | LLM | Yes | Yes | Yes\* | - | Yes\* |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | Yes |
| QWen2.5-VL | 3B - 72B | MLLM | Yes | No | No | No | No |
| DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
@@ -95,6 +98,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
| InternVL2 | 1B-76B | MLLM | Yes | Yes | Yes | - | - |
| InternVL2.5(MPO) | 1B-78B | MLLM | Yes | Yes | Yes | - | - |
| InternVL3 | 1B-78B | MLLM | Yes | Yes | Yes | - | - |
| Mono-InternVL<sup>\[1\]</sup> | 2B | MLLM | Yes | Yes | Yes | - | - |
| ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
| Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | - | - |
@@ -114,20 +118,25 @@

## PyTorchEngine on Huawei Ascend Platform

| Model | Size | Type | FP16/BF16(eager) | FP16/BF16(graph) | W4A16(eager) |
| :------------: | :------: | :--: | :--------------: | :--------------: | :----------: |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes |
| Llama3 | 8B | LLM | Yes | Yes | Yes |
| Llama3.1 | 8B | LLM | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes |
| InternLM2.5 | 7B - 20B | LLM | Yes | Yes | Yes |
| InternLM3 | 8B | LLM | Yes | Yes | Yes |
| Mixtral | 8x7B | LLM | Yes | Yes | No |
| QWen1.5-MoE | A2.7B | LLM | Yes | - | No |
| QWen2(.5) | 7B | LLM | Yes | Yes | No |
| QWen2-MoE | A14.57B | LLM | Yes | - | No |
| DeepSeek-V2 | 16B | LLM | No | Yes | No |
| InternVL(v1.5) | 2B-26B | MLLM | Yes | - | Yes |
| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes |
| CogVLM2-chat | 19B | MLLM | Yes | No | - |
| GLM4V | 9B | MLLM | Yes | No | - |
| Model | Size | Type | FP16/BF16(eager) | FP16/BF16(graph) | W8A8(graph) | W4A16(eager) |
| :------------: | :-------: | :--: | :--------------: | :--------------: | :---------: | :----------: |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM3 | 8B | LLM | Yes | Yes | Yes | Yes |
| Mixtral | 8x7B | LLM | Yes | Yes | No | No |
| QWen1.5-MoE | A2.7B | LLM | Yes | - | No | No |
| QWen2(.5) | 7B | LLM | Yes | Yes | Yes | Yes |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | - | - |
| QWen2.5-VL | 3B - 72B | MLLM | Yes | Yes | - | - |
| QWen2-MoE | A14.57B | LLM | Yes | - | No | No |
| QWen3 | 0.6B-235B | LLM | Yes | Yes | No | No |
| DeepSeek-V2 | 16B | LLM | No | Yes | No | No |
| InternVL(v1.5) | 2B-26B | MLLM | Yes | - | Yes | Yes |
| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | Yes |
| InternVL2.5 | 1B-78B | MLLM | Yes | Yes | Yes | Yes |
| InternVL3 | 1B-78B | MLLM | Yes | Yes | Yes | Yes |
| CogVLM2-chat | 19B | MLLM | Yes | No | - | - |
| GLM4V | 9B | MLLM | Yes | No | - | - |
10 changes: 10 additions & 0 deletions docs/zh_cn/get_started/ascend/get_started.md
@@ -154,6 +154,16 @@ lmdeploy lite auto_awq $HF_MODEL --work-dir $WORK_DIR --device npu

For the list of supported models, please refer to [supported models](../../supported_models/supported_models.md).

### w8a8 SMOOTH_QUANT

Run the following command to apply W8A8 quantization to the weights on Atlas 800T A2.

```bash
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --device npu
```

For the list of supported models, please refer to [supported models](../../supported_models/supported_models.md).

### int8 KV-cache Quantization

The Ascend backend now supports offline int8 KV-cache quantization in eager mode.
2 changes: 1 addition & 1 deletion docs/zh_cn/get_started/installation.md
@@ -23,7 +23,7 @@ pip install lmdeploy
The default prebuilt package is compiled on **CUDA 12**. If CUDA 11+ (>=11.3) is required, you can install lmdeploy with the following commands:

```shell
export LMDEPLOY_VERSION=0.7.2.post1
export LMDEPLOY_VERSION=0.8.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
15 changes: 8 additions & 7 deletions docs/zh_cn/multi_modal/internvl.md
@@ -2,13 +2,14 @@

LMDeploy supports the InternVL series of models, as detailed in the table below:

| Model | Size | Supported Inference Engine |
| :-----------: | :--------: | :------------------------: |
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 1B, 4B | PyTorch |
| InternVL2 | 2B, 8B-76B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |
| Model | Size | Supported Inference Engine |
| :-------------------: | :-----------: | :------------------------: |
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 4B | PyTorch |
| InternVL2 | 1B-2B, 8B-76B | TurboMind, PyTorch |
| InternVL2.5/2.5-MPO/3 | 1B-78B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |

This article uses [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example to demonstrate how to deploy InternVL series models with LMDeploy.
