
Not enough SMs to use max_autotune_gemm #4398

Open
geraldstanje opened this issue Mar 24, 2025 · 3 comments
geraldstanje commented Mar 24, 2025

Description

Hi,

I get the warning "Not enough SMs to use max_autotune_gemm" when running with Triton Inference Server (nvcr.io/nvidia/tritonserver:25.02-py3).

What does this warning mean and how can I fix it - do I need to switch to a different GPU? Also, how can I use flash-attn 2 with Triton Inference Server / TensorRT?

more logs:

./test_run_triton.sh  

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 25.02 (build 143749457)
Triton Server Version 2.55.0

Copyright (c) 2018-2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.8 driver version 570.86.10 with kernel driver version 535.230.02.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Collecting transformers==4.49.0
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.0/44.0 kB 2.2 MB/s eta 0:00:00
Collecting torch==2.6.0
  Downloading torch-2.6.0-cp312-cp312-manylinux1_x86_64.whl.metadata (28 kB)
Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from transformers==4.49.0) (3.17.0)
Collecting huggingface-hub<1.0,>=0.26.0 (from transformers==4.49.0)
  Downloading huggingface_hub-0.29.3-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.12/dist-packages (from transformers==4.49.0) (1.26.4)
Requirement already satisfied: packaging>=20.0 in ./backends/dali/wheel/dali (from transformers==4.49.0) (24.2)
Collecting pyyaml>=5.1 (from transformers==4.49.0)
  Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting regex!=2019.12.17 (from transformers==4.49.0)
  Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.5/40.5 kB 5.9 MB/s eta 0:00:00
Collecting requests (from transformers==4.49.0)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers==4.49.0)
  Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.1 (from transformers==4.49.0)
  Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.12/dist-packages (from transformers==4.49.0) (4.67.1)
Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch==2.6.0) (4.12.2)
Collecting networkx (from torch==2.6.0)
  Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch==2.6.0)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec (from torch==2.6.0)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.6.0)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.6.0)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch==2.6.0)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch==2.6.0)
  Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch==2.6.0)
  Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparse-cu12==12.3.1.170 (from torch==2.6.0)
  Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparselt-cu12==0.6.2 (from torch==2.6.0)
  Downloading nvidia_cusparselt_cu12-0.6.2-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting nvidia-nccl-cu12==2.21.5 (from torch==2.6.0)
  Downloading nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvtx-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-nvjitlink-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting triton==3.2.0 (from torch==2.6.0)
  Downloading triton-3.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from torch==2.6.0) (68.1.2)
Collecting sympy==1.13.1 (from torch==2.6.0)
  Downloading sympy-1.13.1-py3-none-any.whl.metadata (12 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy==1.13.1->torch==2.6.0)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch==2.6.0)
  Downloading MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting charset-normalizer<4,>=2 (from requests->transformers==4.49.0)
  Downloading charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (35 kB)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.49.0) (3.10)
Collecting urllib3<3,>=1.21.1 (from requests->transformers==4.49.0)
  Downloading urllib3-2.3.0-py3-none-any.whl.metadata (6.5 kB)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.49.0) (2025.1.31)
Downloading transformers-4.49.0-py3-none-any.whl (10.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.0/10.0 MB 94.3 MB/s eta 0:00:00
Downloading torch-2.6.0-cp312-cp312-manylinux1_x86_64.whl (766.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 766.6/766.6 MB 3.7 MB/s eta 0:00:00
Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl (363.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 363.4/363.4 MB 12.5 MB/s eta 0:00:00
Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (13.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.8/13.8 MB 134.8 MB/s eta 0:00:00
Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (24.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.6/24.6 MB 110.2 MB/s eta 0:00:00
Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (883 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 883.7/883.7 kB 68.6 MB/s eta 0:00:00
Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 3.0 MB/s eta 0:00:00
Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl (211.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 211.5/211.5 MB 26.5 MB/s eta 0:00:00
Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl (56.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 70.6 MB/s eta 0:00:00
Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl (127.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 127.9/127.9 MB 40.9 MB/s eta 0:00:00
Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl (207.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 207.5/207.5 MB 25.8 MB/s eta 0:00:00
Downloading nvidia_cusparselt_cu12-0.6.2-py3-none-manylinux2014_x86_64.whl (150.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.1/150.1 MB 9.7 MB/s eta 0:00:00
Downloading nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl (188.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 188.7/188.7 MB 8.3 MB/s eta 0:00:00
Downloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.1/21.1 MB 37.0 MB/s eta 0:00:00
Downloading nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (99 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 kB 845.2 kB/s eta 0:00:00
Downloading sympy-1.13.1-py3-none-any.whl (6.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 41.2 MB/s eta 0:00:00
Downloading triton-3.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (253.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 253.2/253.2 MB 5.9 MB/s eta 0:00:00
Downloading huggingface_hub-0.29.3-py3-none-any.whl (468 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 469.0/469.0 kB 9.4 MB/s eta 0:00:00
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.6/193.6 kB 3.2 MB/s eta 0:00:00
Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 9.2 MB/s eta 0:00:00
Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 796.9/796.9 kB 45.7 MB/s eta 0:00:00
Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (471 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 471.6/471.6 kB 59.6 MB/s eta 0:00:00
Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 119.5 MB/s eta 0:00:00
Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 kB 19.2 MB/s eta 0:00:00
Downloading networkx-3.4.2-py3-none-any.whl (1.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 100.9 MB/s eta 0:00:00
Downloading requests-2.32.3-py3-none-any.whl (64 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.9/64.9 kB 7.6 MB/s eta 0:00:00
Downloading charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (145 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 145.3/145.3 kB 19.1 MB/s eta 0:00:00
Downloading MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB)
Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 52.7 MB/s eta 0:00:00
Downloading urllib3-2.3.0-py3-none-any.whl (128 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 128.4/128.4 kB 26.0 MB/s eta 0:00:00
Installing collected packages: triton, nvidia-cusparselt-cu12, mpmath, urllib3, sympy, safetensors, regex, pyyaml, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, charset-normalizer, requests, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, huggingface-hub, torch, tokenizers, transformers
Successfully installed MarkupSafe-3.0.2 charset-normalizer-3.4.1 fsspec-2025.3.0 huggingface-hub-0.29.3 jinja2-3.1.6 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-cusparselt-cu12-0.6.2 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 pyyaml-6.0.2 regex-2024.11.6 requests-2.32.3 safetensors-0.5.3 sympy-1.13.1 tokenizers-0.21.1 torch-2.6.0 transformers-4.49.0 triton-3.2.0 urllib3-2.3.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
I0324 06:05:36.011379 1 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7f8462000000' with size 268435456"
I0324 06:05:36.011698 1 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0324 06:05:36.013881 1 model_lifecycle.cc:473] "loading: python-modernbert:1"
I0324 06:05:41.872182 1 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: python-modernbert_0_0 (GPU device 0)"
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████| 20.8k/20.8k [00:00<00:00, 66.2MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 2.13M/2.13M [00:00<00:00, 7.50MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████| 694/694 [00:00<00:00, 4.71MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1.19k/1.19k [00:00<00:00, 8.52MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████| 599M/599M [00:02<00:00, 237MB/s]
I0324 06:06:00.256478 1 model_lifecycle.cc:849] "successfully loaded 'python-modernbert'"
I0324 06:06:00.256691 1 model_lifecycle.cc:473] "loading: ensemble-model:1"
I0324 06:06:00.257042 1 model_lifecycle.cc:849] "successfully loaded 'ensemble-model'"
I0324 06:06:00.257123 1 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0324 06:06:00.257168 1 server.cc:631] 
+---------+-------------------------------------------------------+--------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                   |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/trit |
|         |                                                       | onserver/backends","min-compute-capability":"6.000000","default-max-batc |
|         |                                                       | h-size":"4"}}                                                            |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------+

I0324 06:06:00.257265 1 server.cc:674] 
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble-model    | 1       | READY  |
| python-modernbert | 1       | READY  |
+-------------------+---------+--------+

I0324 06:06:00.281178 1 metrics.cc:890] "Collecting metrics for GPU 0: Tesla T4"
I0324 06:06:00.285853 1 metrics.cc:783] "Collecting CPU metrics"
I0324 06:06:00.285964 1 tritonserver.cc:2598] 
+----------------------------------+----------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                    |
+----------------------------------+----------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                   |
| server_version                   | 2.55.0                                                                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_confi |
|                                  | guration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging  |
| model_repository_path[0]         | /models                                                                                                  |
| model_control_mode               | MODE_NONE                                                                                                |
| strict_model_config              | 0                                                                                                        |
| model_config_name                |                                                                                                          |
| rate_limit                       | OFF                                                                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                 |
| min_supported_compute_capability | 6.0                                                                                                      |
| strict_readiness                 | 1                                                                                                        |
| exit_timeout                     | 30                                                                                                       |
| cache_enabled                    | 0                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------+

I0324 06:06:00.287759 1 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I0324 06:06:00.288014 1 http_server.cc:4725] "Started HTTPService at 0.0.0.0:8000"
I0324 06:06:00.330209 1 http_server.cc:358] "Started Metrics Service at 0.0.0.0:8002"
W0324 06:09:28.530000 208 torch/_inductor/utils.py:1137] [1/0] Not enough SMs to use max_autotune_gemm mode

Environment

Triton Inference Server Version: 25.02

TensorRT Version: 10.8.0.43 (I think that is the version that ships with Triton Inference Server 25.02 - see: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-02.html)

NVIDIA GPU: Tesla T4

Used model: https://huggingface.co/answerdotai/ModernBERT-base

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

@lix19937

It means torch.compile assumes your GPU has at least 68 SMs before it enables this mode, which appears to be a hard-coded limit in Inductor. The Tesla T4 has only 40 SMs (https://www.techpowerup.com/gpu-specs/tesla-t4.c3316), so the mode is skipped.

See https://github.com/pytorch/pytorch/blob/v2.6.0/torch/_inductor/utils.py#L1134-L1138
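
A quick way to check how many SMs the current GPU exposes and whether it clears the threshold (a minimal sketch; the 68-SM cutoff below mirrors the linked Inductor code for torch 2.6.0 and may change between releases):

```python
import torch

# Query the streaming multiprocessor (SM) count of GPU 0.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.multi_processor_count} SMs")

# Inductor's max_autotune_gemm path requires roughly 68 SMs (see the linked
# utils.py); on a Tesla T4 (40 SMs) the mode is silently skipped.
MIN_SMS = 68  # assumption: matches the hard-coded value in torch 2.6.0
if props.multi_processor_count < MIN_SMS:
    print("max_autotune_gemm will be disabled on this GPU")
```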


geraldstanje commented Mar 24, 2025

@lix19937 Does that mean model inference will run slower? Would switching to an NVIDIA A10 GPU help?
Also, how can I see whether flash-attn 2 is used here?

@lix19937

Yes, without autotuning it may not find the best GEMM tactic, so inference can be slower. You can use a 3080 or another GPU whose SM count (multi_processor_count) is greater than 68.
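
The flash-attn 2 question is not answered above. If the Python backend loads the model through Hugging Face transformers, one way to request and verify FlashAttention-2 is roughly the following (a sketch, assuming the flash-attn package is installed in the container; note also that FlashAttention-2 targets Ampere or newer GPUs, so it is unlikely to run on a Turing T4):

```python
import torch
from transformers import AutoModel

# Request FlashAttention-2 explicitly; transformers raises an error if the
# flash-attn package is missing or the architecture does not support it.
model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",
    torch_dtype=torch.float16,          # flash-attn requires fp16/bf16
    attn_implementation="flash_attention_2",
).to("cuda")

# Confirm which attention backend the loaded model actually uses.
print(model.config._attn_implementation)  # expected: "flash_attention_2"
```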
