
[Bug]: FusedMoE kernel performance depends on input prompt length while decoding #10313

Open
taegeonum opened this issue Nov 14, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@taegeonum

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.77

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post2.dev338+gf0f2e563
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled

CUBLAS_VERSION=12.6.3.3
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.22.3
CUDA_VERSION=12.6.2.004
PYTORCH_VERSION=2.5.0a0+e000cf0
PYTORCH_BUILD_NUMBER=0
CUDNN_FRONTEND_VERSION=1.7.0
CUDNN_VERSION=9.5.0.50
PYTORCH_HOME=/opt/pytorch/pytorch
CUDA_DRIVER_VERSION=560.35.03
PYTORCH_BUILD_VERSION=2.5.0a0+e000cf0
CUDA_MODULE_LOADING=LAZY
NVIDIA_PYTORCH_VERSION=24.10
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1

Model Input Dumps

No response

🐛 Describe the bug

Environment

  • H100

Description

  • The FusedMoE kernel's performance should not depend on the input prompt length during decoding, since the kernel has no dependency on the input length. However, output tokens/sec during decoding changes significantly for the Mixtral model when we change the input prompt length: throughput degrades by around 50% when the input length is doubled (see the measurement sketch after this list).
  • To verify this, we commented out the attention code in the Mixtral model, which does depend on the input/output token lengths, and confirmed that Mixtral's decoding speed still degrades when the input prompt length increases, even with the attention code commented out.
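
For reference, a minimal measurement sketch along these lines can reproduce the comparison using vLLM's public LLM / SamplingParams API. This is a hypothetical script, not the exact one we used: the model name, tensor-parallel size, prompt lengths, and output length are assumptions, and it times end-to-end generation (prefill amortized over the generated tokens) rather than fully isolating decode.

```python
import time

from vllm import LLM, SamplingParams

# Assumed setup: Mixtral-8x7B served across 8 GPUs; adjust to your hardware.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=8)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256, ignore_eos=True)

for prompt_len in (1024, 2048):  # double the prompt length between runs
    prompt = " hi" * prompt_len  # crude way to control the prompt length in tokens
    start = time.perf_counter()
    outputs = llm.generate([prompt], sampling_params)
    elapsed = time.perf_counter() - start
    generated = len(outputs[0].outputs[0].token_ids)
    print(f"prompt_len={prompt_len}: {generated / elapsed:.1f} output tokens/sec")
```

If the FusedMoE path were insensitive to the prompt length, the two throughputs should be close (apart from the attention cost); instead we see the ~50% degradation described above.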

How to resolve

  • My guess is that there is a bug in the fused MoE kernel, so I looked at its commit history. There is a commit regarding a fused MoE performance improvement ([Performance][Kernel] Fused_moe Performance Improvement #9384), but I'm not sure it is the root cause. I simply rolled back from the latest version to the 0.6.3.post1 release (a kernel-level micro-benchmark sketch to help isolate this follows below).
  • In the 0.6.3.post1 release, the bug disappeared: the decoding speed of Mixtral without attention does not depend on the input prompt length.
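
To check whether the kernel itself (rather than attention or scheduling) regressed, a micro-benchmark along the following lines can time the fused MoE kernel in isolation with a decode-sized batch, once on the latest main and once on the 0.6.3.post1 release. This is only a sketch under assumptions: it assumes the fused_moe entry point at vllm.model_executor.layers.fused_moe with (hidden_states, w1, w2, gating_output, topk, renormalize) arguments and Mixtral-like shapes (8 experts, top-2, hidden size 4096, intermediate size 14336); the exact import path and signature may differ between vLLM versions.

```python
import torch

# Assumed entry point; the module path and signature may differ across vLLM versions.
from vllm.model_executor.layers.fused_moe import fused_moe

num_experts, topk = 8, 2                      # Mixtral-8x7B routing: top-2 of 8 experts
hidden_size, intermediate_size = 4096, 14336  # Mixtral-8x7B layer shapes
num_tokens = 1                                # one decode step for a single sequence

hidden_states = torch.randn(num_tokens, hidden_size, dtype=torch.float16, device="cuda")
w1 = torch.randn(num_experts, 2 * intermediate_size, hidden_size, dtype=torch.float16, device="cuda")
w2 = torch.randn(num_experts, hidden_size, intermediate_size, dtype=torch.float16, device="cuda")
gating_output = torch.randn(num_tokens, num_experts, dtype=torch.float16, device="cuda")

# Warm up (loads the Triton kernel and its tuned config), then time many iterations.
for _ in range(10):
    fused_moe(hidden_states, w1, w2, gating_output, topk, renormalize=True)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    fused_moe(hidden_states, w1, w2, gating_output, topk, renormalize=True)
end.record()
torch.cuda.synchronize()
print(f"avg fused_moe latency: {start.elapsed_time(end) / 100:.3f} ms")
```

The kernel only ever sees the current batch of tokens and the expert weights, never the prompt length, so a fixed-shape benchmark like this should report the same latency regardless of prompt length; comparing the number before and after #9384 (or between main and 0.6.3.post1) would show whether the decode-path kernel itself got slower.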

Bug found with @Byeong-Chan

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@taegeonum taegeonum added the bug Something isn't working label Nov 14, 2024
@taegeonum taegeonum changed the title [Bug]: FuseMoE kernel performance depends on input prompt length while decoding [Bug]: FusedMoE kernel performance depends on input prompt length while decoding Nov 14, 2024
@taegeonum
Author

@charlifu Do you have any guesses?
