
[Bug]: FusedMoE kernel performance depends on input prompt length while decoding #10313

Open
taegeonum opened this issue Nov 14, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@taegeonum

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.77

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post2.dev338+gf0f2e563
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled

CUBLAS_VERSION=12.6.3.3
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.22.3
CUDA_VERSION=12.6.2.004
PYTORCH_VERSION=2.5.0a0+e000cf0
PYTORCH_BUILD_NUMBER=0
CUDNN_FRONTEND_VERSION=1.7.0
CUDNN_VERSION=9.5.0.50
PYTORCH_HOME=/opt/pytorch/pytorch
CUDA_DRIVER_VERSION=560.35.03
PYTORCH_BUILD_VERSION=2.5.0a0+e000cf0
CUDA_MODULE_LOADING=LAZY
NVIDIA_PYTORCH_VERSION=24.10
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1

Model Input Dumps

No response

🐛 Describe the bug

Environment

  • H100

Description

  • The FusedMoE kernel's performance should not depend on the input prompt length during decoding, since the kernel has no dependency on the input length. However, output tokens/sec during decoding changes significantly for the Mixtral model when we change the input prompt length: throughput degrades by around 50% when the input length is doubled (see the measurement sketch after this list).
  • To verify this, we commented out the attention code in the Mixtral model, which does depend on the input/output token lengths, and confirmed that Mixtral's decoding speed still degrades when the input prompt length increases, even with the attention code commented out.
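
For reference, a minimal measurement sketch along these lines can reproduce the comparison using vLLM's public LLM / SamplingParams API. This is a hypothetical script, not the exact one we used: the model name, tensor-parallel size, prompt lengths, and output length are assumptions, and it times end-to-end generation (prefill amortized over the generated tokens) rather than fully isolating decode.

```python
import time

from vllm import LLM, SamplingParams

# Assumed setup: Mixtral-8x7B served across 8 GPUs; adjust to your hardware.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=8)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256, ignore_eos=True)

for prompt_len in (1024, 2048):  # double the prompt length between runs
    prompt = " hi" * prompt_len  # crude way to control the prompt length in tokens
    start = time.perf_counter()
    outputs = llm.generate([prompt], sampling_params)
    elapsed = time.perf_counter() - start
    generated = len(outputs[0].outputs[0].token_ids)
    print(f"prompt_len={prompt_len}: {generated / elapsed:.1f} output tokens/sec")
```

If the FusedMoE path were insensitive to the prompt length, the two throughputs should be close (apart from the attention cost); instead we see the ~50% degradation described above.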

How to resolve

  • My guess is that there is a bug in the fused MoE kernel, so I looked at its commit history. There is a commit regarding a fused MoE performance improvement ([Performance][Kernel] Fused_moe Performance Improvement #9384), but I'm not sure it is the root cause. I simply rolled back from the latest version to the 0.6.3.post1 release (a kernel-level micro-benchmark sketch to help isolate this follows below).
  • In the 0.6.3.post1 release, the bug disappeared: the decoding speed of Mixtral without attention does not depend on the input prompt length.
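
To check whether the kernel itself (rather than attention or scheduling) regressed, a micro-benchmark along the following lines can time the fused MoE kernel in isolation with a decode-sized batch, once on the latest main and once on the 0.6.3.post1 release. This is only a sketch under assumptions: it assumes the fused_moe entry point at vllm.model_executor.layers.fused_moe with (hidden_states, w1, w2, gating_output, topk, renormalize) arguments and Mixtral-like shapes (8 experts, top-2, hidden size 4096, intermediate size 14336); the exact import path and signature may differ between vLLM versions.

```python
import torch

# Assumed entry point; the module path and signature may differ across vLLM versions.
from vllm.model_executor.layers.fused_moe import fused_moe

num_experts, topk = 8, 2                      # Mixtral-8x7B routing: top-2 of 8 experts
hidden_size, intermediate_size = 4096, 14336  # Mixtral-8x7B layer shapes
num_tokens = 1                                # one decode step for a single sequence

hidden_states = torch.randn(num_tokens, hidden_size, dtype=torch.float16, device="cuda")
w1 = torch.randn(num_experts, 2 * intermediate_size, hidden_size, dtype=torch.float16, device="cuda")
w2 = torch.randn(num_experts, hidden_size, intermediate_size, dtype=torch.float16, device="cuda")
gating_output = torch.randn(num_tokens, num_experts, dtype=torch.float16, device="cuda")

# Warm up (loads the Triton kernel and its tuned config), then time many iterations.
for _ in range(10):
    fused_moe(hidden_states, w1, w2, gating_output, topk, renormalize=True)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    fused_moe(hidden_states, w1, w2, gating_output, topk, renormalize=True)
end.record()
torch.cuda.synchronize()
print(f"avg fused_moe latency: {start.elapsed_time(end) / 100:.3f} ms")
```

The kernel only ever sees the current batch of tokens and the expert weights, never the prompt length, so a fixed-shape benchmark like this should report the same latency regardless of prompt length; comparing the number before and after #9384 (or between main and 0.6.3.post1) would show whether the decode-path kernel itself got slower.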

Bug found with @Byeong-Chan

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@taegeonum taegeonum added the bug Something isn't working label Nov 14, 2024
@taegeonum taegeonum changed the title [Bug]: FuseMoE kernel performance depends on input prompt length while decoding [Bug]: FusedMoE kernel performance depends on input prompt length while decoding Nov 14, 2024
@taegeonum
Author

@charlifu Do you have any guesses?
