[Bug]: Multimodal online inference performs poorly, with very low accelerator (NPU) utilization

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
PyTorch version: 2.5.1
Is debug build: False

OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.3
Libc version: glibc-2.35

Python version: 3.10.17 (main, May  8 2025, 07:18:04) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.0-60.18.0.50.oe2203.aarch64-aarch64-with-glibc2.35

CPU:
Architecture:                    aarch64
CPU op-mode(s):                  64-bit
Byte Order:                      Little Endian
CPU(s):                          256
On-line CPU(s) list:             0-255
Vendor ID:                       HiSilicon
BIOS Vendor ID:                  HiSilicon
Model name:                      Kunpeng-920
BIOS Model name:                 HUAWEI Kunpeng 920 7265
Model:                           0
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       4
Stepping:                        0x1
Frequency boost:                 disabled
CPU max MHz:                     3000.0000
CPU min MHz:                     200.0000
BogoMIPS:                        200.00
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                       16 MiB (256 instances)
L1i cache:                       16 MiB (256 instances)
L2 cache:                        128 MiB (256 instances)
L3 cache:                        256 MiB (8 instances)
NUMA node(s):                    4
NUMA node0 CPU(s):               0-63
NUMA node1 CPU(s):               64-127
NUMA node2 CPU(s):               128-191
NUMA node3 CPU(s):               192-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.22.1
[pip3] pyzmq==27.0.0
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.52.4
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pyzmq                     26.4.0                   pypi_0    pypi
[conda] torch                     2.7.0                    pypi_0    pypi
[conda] transformers              4.51.3                   pypi_0    pypi
vLLM Version: 0.9.1
vLLM Ascend Version: 0.9.1rc1

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ASCEND_VISIBLE_DEVICES=2
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ATB_RUNNER_POOL_SIZE=64
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_LAUNCH_KERNEL_WITH_TILING=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1                 Version: 24.1.rc1                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 96.3        37                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          65257/ 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 94.3        37                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          65259/ 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 96.2        39                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          42974/ 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 99.5        39                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3320 / 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 95.8        43                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          57192/ 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 98.1        45                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          57070/ 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 99.8        43                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          57071/ 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 99.1        45                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          57072/ 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 1846420       |                          | 61917                   |
+===========================+===============+====================================================+
| 1       0                 | 1849807       |                          | 61973                   |
+===========================+===============+====================================================+
| 2       0                 | 2302578       |                          | 39689                   |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| 4       0                 | 3712967       |                          | 53785                   |
+===========================+===============+====================================================+
| 5       0                 | 3714154       |                          | 53785                   |
+===========================+===============+====================================================+
| 6       0                 | 3714155       |                          | 53785                   |
+===========================+===============+====================================================+
| 7       0                 | 3714156       |                          | 53785                   |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.1.RC1
innerversion=V100R001C21SPC001B238
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux
```

</details>


### 🐛 Describe the bug

I am running online inference with a multimodal model in vLLM. The client request is as follows:
```python
import json

import requests

# `url` (the server endpoint) and `image` (the base64-encoded JPEG) are set up
# in earlier steps that are not shown here.
data = {
    "model": "mineru",
    "messages": [
        # {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": [
             {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}},
             {"type": "text", "text": "<image>\nDocument Parsing:"}]}],
    "add_generation_prompt": True,
    "temperature": 0, "top_p": 0.8, "top_k": 20, "repetition_penalty": 1.0,
    "max_tokens": 8192, "skip_special_tokens": False}

# 3. Serialize the dict to a JSON string
json_payload = json.dumps(data)

# 4. Send the POST request
headers = {"Content-Type": "application/json"}
response = requests.post(url, data=json_payload, headers=headers)
# Index directly so a malformed reply raises a clear KeyError/IndexError
result = response.json()["choices"][0]["message"]["content"]
print(result)
```
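The model has 1B parameters. In use, the output for each request is only 20 tokens, so performance is poor, while utilization of the card only fluctuates between 0 and 10.
The request log is as follows: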
<img width="1109" height="646" alt="Image" src="https://github.com/user-attachments/assets/8121ce2f-1842-48a2-b550-ffda5a79de67" />
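
To put a number on the client side of this, here is a minimal sketch that times a single request and derives output tokens per second. It assumes the server reply carries the OpenAI-style `usage` object with `prompt_tokens` and `completion_tokens`. The helper names `summarize_reply` and `timed_request` are hypothetical, and `urllib` is used only to keep the sketch dependency-free.

```python
import json
import time
import urllib.request


def summarize_reply(reply: dict, elapsed_s: float) -> dict:
    """Derive token counts and output throughput from one reply."""
    # Assumption: the reply contains an OpenAI-style "usage" object.
    usage = reply.get("usage") or {}
    completion_tokens = usage.get("completion_tokens", 0)
    tokens_per_s = completion_tokens / elapsed_s if elapsed_s > 0 else 0.0
    return {
        "elapsed_s": elapsed_s,
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": completion_tokens,
        "tokens_per_s": tokens_per_s,
    }


def timed_request(url: str, payload: dict, timeout: float = 600.0) -> dict:
    """POST one request and measure end-to-end latency on the client."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    elapsed_s = time.perf_counter() - start
    return summarize_reply(reply, elapsed_s)
```

Calling `timed_request(url, data)` with the same `url` and `data` as in the client above returns a dict that can be compared with the throughput shown in the server log. The elapsed time includes image preprocessing and prefill, so the resulting figure is a lower bound on decode speed.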
