
Not enough SMs to use max_autotune_gemm #4398

Open
geraldstanje opened this issue Mar 24, 2025 · 3 comments
geraldstanje commented Mar 24, 2025

Description

Hi,

I get the warning "Not enough SMs to use max_autotune_gemm" when running with Triton Inference Server (nvcr.io/nvidia/tritonserver:25.02-py3).

What does this warning mean and how can I fix it - do I need to switch to a different GPU? Also, how can I use flash-attn 2 with Triton Inference Server / TensorRT?

more logs:

./test_run_triton.sh  

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 25.02 (build 143749457)
Triton Server Version 2.55.0

Copyright (c) 2018-2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.8 driver version 570.86.10 with kernel driver version 535.230.02.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Collecting transformers==4.49.0
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.0/44.0 kB 2.2 MB/s eta 0:00:00
Collecting torch==2.6.0
  Downloading torch-2.6.0-cp312-cp312-manylinux1_x86_64.whl.metadata (28 kB)
Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from transformers==4.49.0) (3.17.0)
Collecting huggingface-hub<1.0,>=0.26.0 (from transformers==4.49.0)
  Downloading huggingface_hub-0.29.3-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.12/dist-packages (from transformers==4.49.0) (1.26.4)
Requirement already satisfied: packaging>=20.0 in ./backends/dali/wheel/dali (from transformers==4.49.0) (24.2)
Collecting pyyaml>=5.1 (from transformers==4.49.0)
  Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting regex!=2019.12.17 (from transformers==4.49.0)
  Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.5/40.5 kB 5.9 MB/s eta 0:00:00
Collecting requests (from transformers==4.49.0)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers==4.49.0)
  Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.1 (from transformers==4.49.0)
  Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.12/dist-packages (from transformers==4.49.0) (4.67.1)
Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch==2.6.0) (4.12.2)
Collecting networkx (from torch==2.6.0)
  Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch==2.6.0)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec (from torch==2.6.0)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.6.0)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.6.0)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch==2.6.0)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch==2.6.0)
  Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch==2.6.0)
  Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparse-cu12==12.3.1.170 (from torch==2.6.0)
  Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparselt-cu12==0.6.2 (from torch==2.6.0)
  Downloading nvidia_cusparselt_cu12-0.6.2-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting nvidia-nccl-cu12==2.21.5 (from torch==2.6.0)
  Downloading nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvtx-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-nvjitlink-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting triton==3.2.0 (from torch==2.6.0)
  Downloading triton-3.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from torch==2.6.0) (68.1.2)
Collecting sympy==1.13.1 (from torch==2.6.0)
  Downloading sympy-1.13.1-py3-none-any.whl.metadata (12 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy==1.13.1->torch==2.6.0)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch==2.6.0)
  Downloading MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting charset-normalizer<4,>=2 (from requests->transformers==4.49.0)
  Downloading charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (35 kB)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.49.0) (3.10)
Collecting urllib3<3,>=1.21.1 (from requests->transformers==4.49.0)
  Downloading urllib3-2.3.0-py3-none-any.whl.metadata (6.5 kB)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.49.0) (2025.1.31)
Downloading transformers-4.49.0-py3-none-any.whl (10.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.0/10.0 MB 94.3 MB/s eta 0:00:00
Downloading torch-2.6.0-cp312-cp312-manylinux1_x86_64.whl (766.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 766.6/766.6 MB 3.7 MB/s eta 0:00:00
Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl (363.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 363.4/363.4 MB 12.5 MB/s eta 0:00:00
Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (13.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.8/13.8 MB 134.8 MB/s eta 0:00:00
Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (24.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.6/24.6 MB 110.2 MB/s eta 0:00:00
Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (883 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 883.7/883.7 kB 68.6 MB/s eta 0:00:00
Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 3.0 MB/s eta 0:00:00
Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl (211.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 211.5/211.5 MB 26.5 MB/s eta 0:00:00
Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl (56.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 70.6 MB/s eta 0:00:00
Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl (127.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 127.9/127.9 MB 40.9 MB/s eta 0:00:00
Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl (207.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 207.5/207.5 MB 25.8 MB/s eta 0:00:00
Downloading nvidia_cusparselt_cu12-0.6.2-py3-none-manylinux2014_x86_64.whl (150.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.1/150.1 MB 9.7 MB/s eta 0:00:00
Downloading nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl (188.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 188.7/188.7 MB 8.3 MB/s eta 0:00:00
Downloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.1/21.1 MB 37.0 MB/s eta 0:00:00
Downloading nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (99 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 kB 845.2 kB/s eta 0:00:00
Downloading sympy-1.13.1-py3-none-any.whl (6.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 41.2 MB/s eta 0:00:00
Downloading triton-3.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (253.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 253.2/253.2 MB 5.9 MB/s eta 0:00:00
Downloading huggingface_hub-0.29.3-py3-none-any.whl (468 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 469.0/469.0 kB 9.4 MB/s eta 0:00:00
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.6/193.6 kB 3.2 MB/s eta 0:00:00
Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 9.2 MB/s eta 0:00:00
Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 796.9/796.9 kB 45.7 MB/s eta 0:00:00
Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (471 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 471.6/471.6 kB 59.6 MB/s eta 0:00:00
Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 119.5 MB/s eta 0:00:00
Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 kB 19.2 MB/s eta 0:00:00
Downloading networkx-3.4.2-py3-none-any.whl (1.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 100.9 MB/s eta 0:00:00
Downloading requests-2.32.3-py3-none-any.whl (64 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.9/64.9 kB 7.6 MB/s eta 0:00:00
Downloading charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (145 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 145.3/145.3 kB 19.1 MB/s eta 0:00:00
Downloading MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB)
Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 52.7 MB/s eta 0:00:00
Downloading urllib3-2.3.0-py3-none-any.whl (128 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 128.4/128.4 kB 26.0 MB/s eta 0:00:00
Installing collected packages: triton, nvidia-cusparselt-cu12, mpmath, urllib3, sympy, safetensors, regex, pyyaml, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, charset-normalizer, requests, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, huggingface-hub, torch, tokenizers, transformers
Successfully installed MarkupSafe-3.0.2 charset-normalizer-3.4.1 fsspec-2025.3.0 huggingface-hub-0.29.3 jinja2-3.1.6 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-cusparselt-cu12-0.6.2 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 pyyaml-6.0.2 regex-2024.11.6 requests-2.32.3 safetensors-0.5.3 sympy-1.13.1 tokenizers-0.21.1 torch-2.6.0 transformers-4.49.0 triton-3.2.0 urllib3-2.3.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
I0324 06:05:36.011379 1 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7f8462000000' with size 268435456"
I0324 06:05:36.011698 1 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0324 06:05:36.013881 1 model_lifecycle.cc:473] "loading: python-modernbert:1"
I0324 06:05:41.872182 1 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: python-modernbert_0_0 (GPU device 0)"
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████| 20.8k/20.8k [00:00<00:00, 66.2MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 2.13M/2.13M [00:00<00:00, 7.50MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████| 694/694 [00:00<00:00, 4.71MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1.19k/1.19k [00:00<00:00, 8.52MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████| 599M/599M [00:02<00:00, 237MB/s]
I0324 06:06:00.256478 1 model_lifecycle.cc:849] "successfully loaded 'python-modernbert'"
I0324 06:06:00.256691 1 model_lifecycle.cc:473] "loading: ensemble-model:1"
I0324 06:06:00.257042 1 model_lifecycle.cc:849] "successfully loaded 'ensemble-model'"
I0324 06:06:00.257123 1 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0324 06:06:00.257168 1 server.cc:631] 
+---------+-------------------------------------------------------+--------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                   |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/trit |
|         |                                                       | onserver/backends","min-compute-capability":"6.000000","default-max-batc |
|         |                                                       | h-size":"4"}}                                                            |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------+

I0324 06:06:00.257265 1 server.cc:674] 
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble-model    | 1       | READY  |
| python-modernbert | 1       | READY  |
+-------------------+---------+--------+

I0324 06:06:00.281178 1 metrics.cc:890] "Collecting metrics for GPU 0: Tesla T4"
I0324 06:06:00.285853 1 metrics.cc:783] "Collecting CPU metrics"
I0324 06:06:00.285964 1 tritonserver.cc:2598] 
+----------------------------------+----------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                    |
+----------------------------------+----------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                   |
| server_version                   | 2.55.0                                                                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_confi |
|                                  | guration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging  |
| model_repository_path[0]         | /models                                                                                                  |
| model_control_mode               | MODE_NONE                                                                                                |
| strict_model_config              | 0                                                                                                        |
| model_config_name                |                                                                                                          |
| rate_limit                       | OFF                                                                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                 |
| min_supported_compute_capability | 6.0                                                                                                      |
| strict_readiness                 | 1                                                                                                        |
| exit_timeout                     | 30                                                                                                       |
| cache_enabled                    | 0                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------+

I0324 06:06:00.287759 1 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I0324 06:06:00.288014 1 http_server.cc:4725] "Started HTTPService at 0.0.0.0:8000"
I0324 06:06:00.330209 1 http_server.cc:358] "Started Metrics Service at 0.0.0.0:8002"
W0324 06:09:28.530000 208 torch/_inductor/utils.py:1137] [1/0] Not enough SMs to use max_autotune_gemm mode

Environment

Triton Inference Server Version: 25.02

TensorRT Version: 10.8.0.43 (I think that is the version that ships with Triton Inference Server 25.02 - see: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-02.html)

NVIDIA GPU: Tesla T4

Used model: https://huggingface.co/answerdotai/ModernBERT-base

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

@lix19937

It means torch.compile assumes your GPU has at least 68 SMs before it enables this mode, which appears to be a hard-coded limit in Inductor. The Tesla T4 has only 40 SMs (https://www.techpowerup.com/gpu-specs/tesla-t4.c3316), so the mode is skipped.

See https://github.com/pytorch/pytorch/blob/v2.6.0/torch/_inductor/utils.py#L1134-L1138
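
A quick way to check how many SMs the current GPU exposes and whether it clears the threshold (a minimal sketch; the 68-SM cutoff below mirrors the linked Inductor code for torch 2.6.0 and may change between releases):

```python
import torch

# Query the streaming multiprocessor (SM) count of GPU 0.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.multi_processor_count} SMs")

# Inductor's max_autotune_gemm path requires roughly 68 SMs (see the linked
# utils.py); on a Tesla T4 (40 SMs) the mode is silently skipped.
MIN_SMS = 68  # assumption: matches the hard-coded value in torch 2.6.0
if props.multi_processor_count < MIN_SMS:
    print("max_autotune_gemm will be disabled on this GPU")
```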


geraldstanje commented Mar 24, 2025

@lix19937 Does that mean model inference will run slower? Would switching to an NVIDIA A10 GPU help?
Also, how can I see whether flash-attn 2 is used here?

@lix19937

Yes, without autotuning it may not find the best GEMM tactic, so inference can be slower. You can use a 3080 or another GPU whose SM count (multi_processor_count) is greater than 68.
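
The flash-attn 2 question is not answered above. If the Python backend loads the model through Hugging Face transformers, one way to request and verify FlashAttention-2 is roughly the following (a sketch, assuming the flash-attn package is installed in the container; note also that FlashAttention-2 targets Ampere or newer GPUs, so it is unlikely to run on a Turing T4):

```python
import torch
from transformers import AutoModel

# Request FlashAttention-2 explicitly; transformers raises an error if the
# flash-attn package is missing or the architecture does not support it.
model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",
    torch_dtype=torch.float16,          # flash-attn requires fp16/bf16
    attn_implementation="flash_attention_2",
).to("cuda")

# Confirm which attention backend the loaded model actually uses.
print(model.config._attn_implementation)  # expected: "flash_attention_2"
```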
