System Info
System Specifications
2024-11-10T21:20:44.880890Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: 97f7a22
Docker label: N/A
nvidia-smi:
Sun Nov 10 21:20:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S On | 00000000:9E:00.0 Off | 0 |
| N/A 26C P8 32W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L40S On | 00000000:A0:00.0 Off | 0 |
| N/A 25C P8 32W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L40S On | 00000000:A2:00.0 Off | 0 |
| N/A 27C P8 32W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L40S On | 00000000:A4:00.0 Off | 0 |
| N/A 27C P8 31W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA L40S On | 00000000:C6:00.0 Off | 0 |
| N/A 26C P8 32W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA L40S On | 00000000:C8:00.0 Off | 0 |
| N/A 26C P8 30W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA L40S On | 00000000:CA:00.0 Off | 0 |
| N/A 29C P8 33W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA L40S On | 00000000:CC:00.0 Off | 0 |
| N/A 26C P8 30W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Reproducing Steps and Traceback
~/Desktop/Code/text-generation-inference/server$ SAFETENSORS_FAST_GPU=1 python text_generation_server/cli.py serve state-spaces/mamba-130m
2024-11-10 21:18:24.957 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/home/ubuntu/Desktop/Code/text-generation-inference/server/text_generation_server/utils/sgmv.py:18: UserWarning: Could not import SGMV kernel from Punica, falling back to loop.
warnings.warn("Could not import SGMV kernel from Punica, falling back to loop.")
Using prefix caching = True
Using Attention = flashinfer
Could not import Flash Attention enabled models: /opt/conda/envs/tgi/lib/python3.11/site-packages/moe_kernels/_moe_kernels_ops.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZNK3c105Error4whatEv
/opt/conda/envs/tgi/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
warnings.warn(
Error when initializing model
Traceback (most recent call last):
  File "/home/ubuntu/Desktop/Code/text-generation-inference/server/text_generation_server/models/custom_modeling/mamba_modeling.py", line 213, in __init__
    self.lm_head = SpeculativeHead.load(config, f"{prefix}.embeddings", weights)
  File "/home/ubuntu/Desktop/Code/text-generation-inference/server/text_generation_server/layers/speculative.py", line 40, in load
    lm_head = TensorParallelHead.load(config, prefix, weights)
  File "/home/ubuntu/Desktop/Code/text-generation-inference/server/text_generation_server/layers/tensor_parallel.py", line 66, in load
    weight = weights.get_tensor(f"{prefix}.weight")
  File "/home/ubuntu/Desktop/Code/text-generation-inference/server/text_generation_server/utils/weights.py", line 213, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/home/ubuntu/Desktop/Code/text-generation-inference/server/text_generation_server/utils/weights.py", line 192, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight backbone.embeddings.weight does not exist
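The traceback shows the lm_head being loaded from `f"{prefix}.embeddings"` (the tied embedding), and the lookup in `weights.py` failing because no shard contains a tensor named `backbone.embeddings.weight`. A minimal stdlib-only sketch of that lookup, assuming the checkpoint stores the embedding under a different key (the singular `backbone.embedding.weight` here is an illustrative placeholder, not read from the real state-spaces/mamba-130m files):

```python
# Minimal sketch of the failing lookup in weights.py. Weights keeps a
# routing table {tensor_name: shard_file}; get_filename raises when the
# requested name is absent. Key names below are illustrative only.

routing = {
    # Suppose the checkpoint stores the tied embedding under another name:
    "backbone.embedding.weight": "model-00001.safetensors",
}

def get_filename(tensor_name: str) -> str:
    filename = routing.get(tensor_name)
    if filename is None:
        raise RuntimeError(f"weight {tensor_name} does not exist")
    return filename

# The lm_head load path asks for "{prefix}.embeddings.weight" with
# prefix="backbone", reproducing the error in the traceback:
try:
    get_filename("backbone.embeddings.weight")
except RuntimeError as e:
    print(e)  # weight backbone.embeddings.weight does not exist
```

If the real checkpoint does use a differently spelled key, a name-mapping (or alias fallback) in the Mamba loader would resolve the error; listing the keys of the downloaded shards would confirm which spelling is actually present.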
Information
Tasks
Reproduction
SAFETENSORS_FAST_GPU=1 python text_generation_server/cli.py serve state-spaces/mamba-130m
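Incidentally, the `undefined symbol: _ZNK3c105Error4whatEv` warning earlier in the log can be decoded with `c++filt` (ships with binutils); the symbol demangles to a `c10::Error` method, which typically means the `moe_kernels` wheel was built against a different libtorch ABI than the installed PyTorch build:

```shell
# Demangle the unresolved symbol from the Flash Attention import warning.
c++filt _ZNK3c105Error4whatEv
# -> c10::Error::what() const
```

This warning is likely unrelated to the missing-weight error, but reinstalling a `moe_kernels` build matching the installed torch version usually clears it.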
Expected behavior
Web server starting