
[REQUEST] Support for Nvidia 50 Series GPUs: Pytorch >=2.6 and CUDA 12.8 required #7144

Closed
elkay opened this issue Mar 17, 2025 · 12 comments
Labels: enhancement (New feature or request)

elkay commented Mar 17, 2025

Is your feature request related to a problem? Please describe.
DeepSpeed cannot be used on a system with an NVIDIA 50 Series GPU (e.g., a 5080 or 5090) because these GPUs require PyTorch 2.6 or newer and CUDA 12.8.

Describe the solution you'd like
Add support for PyTorch >= 2.6 (2.7 is what most people are already using) and CUDA 12.8.

Describe alternatives you've considered
There is no alternative solution.

Additional context
This is affecting projects that use TTS, among others.

elkay added the enhancement label on Mar 17, 2025
loadams (Collaborator) commented Mar 17, 2025

Hi @elkay - can you share what error you are hitting, along with your DeepSpeed version?

We've added CUDA 12.8 as a supported version, and we list the 12.0 compute capability (which should be the one these GPUs use) as supported under CUDA 12.8 as well.
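As a quick sanity check on the user side, the version floors this issue describes (PyTorch >= 2.6 and CUDA >= 12.8 for these GPUs) can be sketched as a small helper. This is illustrative only; the function names are mine, not DeepSpeed or PyTorch API:

```python
def parse_version(version):
    """Turn a version string like '2.7.0.dev20250311+cu128' into (major, minor)."""
    base = version.split("+")[0]          # drop local tags like '+cu128'
    parts = base.split(".")
    return int(parts[0]), int(parts[1])

def supports_blackwell(torch_version, cuda_version):
    """True if the torch/CUDA pair meets the 50-series minimums from this issue."""
    return (parse_version(torch_version) >= (2, 6)
            and parse_version(cuda_version) >= (12, 8))

print(supports_blackwell("2.7.0.dev20250311+cu128", "12.8"))  # True
print(supports_blackwell("2.3.0", "12.1"))                    # False
```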

loadams self-assigned this on Mar 17, 2025
elkay (Author) commented Mar 17, 2025

This is all on Windows 11, fully up to date. Hardware is a 9800X3D, 64 GB RAM, and an RTX 5090.

CUDA is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:38:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

The environment is Python 3.11.7 via Conda. Exact Torch versions are:

torch 2.7.0.dev20250311+cu128
torchaudio 2.6.0.dev20250312+cu128

Pytorch CUDA is working:

python -c "import torch; print(f'PyTorch is installed: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
PyTorch is installed: 2.7.0.dev20250311+cu128
CUDA available: True

The DeepSpeed code is the latest master branch from GitHub.

I have tried various combinations (turning off AIO, SPARSE_ATTN, etc.), and nothing I've done has made a build succeed. I'm not new to building from source, and I've read that building DeepSpeed is supposed to be pretty straightforward, but I've had no luck. With a fresh command prompt (no additional environment variables set yet), the build fails in the following manner. I'm at a loss for how to continue.

(speech_env) C:\AI-Content\xtts-api-server\DeepSpeed>pip install .
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing c:\ai-content\xtts-api-server\deepspeed
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [29 lines of output]
[2025-03-17 19:04:26,850] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-17 19:04:27,283] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0317 19:04:31.559000 2832 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<string>", line 34, in <module>
  File "C:\AI-Content\xtts-api-server\DeepSpeed\setup.py", line 200, in <module>
    ext_modules.append(builder.builder())
  File "C:\AI-Content\xtts-api-server\DeepSpeed\op_builder\builder.py", line 729, in builder
    extra_link_args=self.strip_empty_entries(self.extra_ldflags()))
  File "C:\AI-Content\xtts-api-server\DeepSpeed\op_builder\inference_cutlass_builder.py", line 74, in extra_ldflags
    import dskernels
ModuleNotFoundError: No module named 'dskernels'
DS_BUILD_OPS=1
[WARNING] Skip pre-compile of incompatible async_io; One can disable async_io with DS_BUILD_AIO=0
[WARNING] Skip pre-compile of incompatible evoformer_attn; One can disable evoformer_attn with DS_BUILD_EVOFORMER_ATTN=0
[WARNING] Skip pre-compile of incompatible fp_quantizer; One can disable fp_quantizer with DS_BUILD_FP_QUANTIZER=0
[WARNING] Skip pre-compile of incompatible gds; One can disable gds with DS_BUILD_GDS=0
[WARNING] Filtered compute capabilities [['6', '0'], ['6', '1'], ['7', '0']]
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
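For context, the warnings in the log above name DS_BUILD_* switches (e.g. DS_BUILD_AIO=0) that opt individual ops out of pre-compilation, with DS_BUILD_OPS=1 turning everything on by default. A rough sketch of that gating logic, assuming names from the log (this is not DeepSpeed's actual code):

```python
def should_build(op_name, env):
    """Sketch of DS_BUILD_* gating: DS_BUILD_OPS=1 enables all ops by
    default, and DS_BUILD_<OP>=0 opts an individual op back out."""
    build_all = env.get("DS_BUILD_OPS", "0") == "1"
    default = "1" if build_all else "0"
    return env.get(f"DS_BUILD_{op_name.upper()}", default) == "1"

env = {"DS_BUILD_OPS": "1", "DS_BUILD_AIO": "0", "DS_BUILD_GDS": "0"}
print(should_build("aio", env))           # False: explicitly disabled
print(should_build("fp_quantizer", env))  # True: enabled via DS_BUILD_OPS=1
```

Separately, the ModuleNotFoundError for `dskernels` suggests an optional kernels package is missing from the build environment, independent of the op switches.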

elkay (Author) commented Mar 17, 2025

Also, if I install DeepSpeed via pip, deepspeed-0.16.4 gets installed, but the app throws the following error on launch. The last line is what led me to believe Torch 2.6 and/or CUDA 12.8 were not currently supported.

[2025-03-17 19:43:43,963] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0317 19:43:46.499000 19984 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[2025-03-17 19:43:46,619] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.4, git-hash=unknown, git-branch=unknown
[2025-03-17 19:43:46,629] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2025-03-17 19:43:46,629] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2025-03-17 19:43:46,629] [INFO] [logging.py:128:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2025-03-17 19:43:46,758] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1, 'rope_theta': 10000, 'invert_mask': True}
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\xtts_api_server\__main__.py", line 40, in <module>
    from xtts_api_server.server import app
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\xtts_api_server\server.py", line 78, in <module>
    XTTS.load_model()
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\xtts_api_server\tts_funcs.py", line 142, in load_model
    self.load_local_model(load = is_official_model)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\xtts_api_server\tts_funcs.py", line 163, in load_local_model
    self.model.load_checkpoint(config,use_deepspeed=self.deepspeed, checkpoint_dir=str(checkpoint_dir))
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\TTS\tts\models\xtts.py", line 772, in load_checkpoint
    self.gpt.init_gpt_for_inference(kv_cache=self.args.kv_cache, use_deepspeed=use_deepspeed)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\TTS\tts\layers\xtts\gpt.py", line 224, in init_gpt_for_inference
    self.ds_engine = deepspeed.init_inference(
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\__init__.py", line 364, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\inference\engine.py", line 154, in __init__
    self._apply_injection_policy(config)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\inference\engine.py", line 388, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\module_inject\replace_module.py", line 400, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\module_inject\replace_module.py", line 653, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\module_inject\replace_module.py", line 713, in _replace_module
    _, layer_id = _replace_module(child,
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\module_inject\replace_module.py", line 713, in _replace_module
    _, layer_id = _replace_module(child,
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\module_inject\replace_module.py", line 689, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\module_inject\replace_module.py", line 327, in replace_fn
    new_module = replace_with_policy(child,
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\module_inject\replace_module.py", line 248, in replace_with_policy
    _container.create_module()
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\module_inject\containers\gpt2.py", line 20, in create_module
    self.module = DeepSpeedGPTInference(_config, mp_group=self.mp_group)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\model_implementations\transformers\ds_gpt.py", line 20, in __init__
    super().__init__(config, mp_group, quantize_scales, quantize_groups, merge_count, mlp_extra_grouping)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\model_implementations\transformers\ds_transformer.py", line 68, in __init__
    self.attention = DeepSpeedSelfAttention(self.config, mp_group, quantize_scales, quantize_groups,
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\ops\transformer\inference\ds_attention.py", line 79, in __init__
    self.qkv_func = QKVGemmOp(config)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\ops\transformer\inference\op_binding\qkv_gemm.py", line 18, in __init__
    super(QKVGemmOp, self).__init__(config)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\ops\transformer\inference\op_binding\base.py", line 20, in __init__
    BaseOp.inference_module = builder.load()
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\ops\op_builder\builder.py", line 532, in load
    self.validate_torch_version(torch_info)
  File "C:\Users\xxxxx\MiniConda3\envs\speech_env\Lib\site-packages\deepspeed\ops\op_builder\builder.py", line 158, in validate_torch_version
    raise RuntimeError("PyTorch version mismatch! DeepSpeed ops were compiled and installed "
RuntimeError: PyTorch version mismatch! DeepSpeed ops were compiled and installed with a different version than what is being used at runtime. Please re-install DeepSpeed or switch torch versions. Install torch version=2.3, Runtime torch version=2.7
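The check that raises here compares the torch version DeepSpeed's ops were compiled against with the torch version found at runtime. A rough sketch of that kind of major.minor comparison (not DeepSpeed's actual implementation; names and message format are illustrative):

```python
def major_minor(version):
    """'2.7.0.dev20250311+cu128' -> (2, 7)"""
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])

def validate_torch_version(compiled, runtime):
    """Raise if the compile-time and runtime torch versions disagree
    at the major.minor level, mirroring the error in the log above."""
    if major_minor(compiled) != major_minor(runtime):
        raise RuntimeError(
            f"PyTorch version mismatch! Install torch version={compiled}, "
            f"Runtime torch version={runtime}")

validate_torch_version("2.7.0", "2.7.0.dev20250311+cu128")  # OK, no error
# validate_torch_version("2.3.1", "2.7.0")  # raises RuntimeError
```

This is why a wheel whose ops were pre-built against torch 2.3 fails under a torch 2.7 runtime even though both are installed correctly.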

elkay (Author) commented Mar 19, 2025

FYI, I've manually grabbed the 0.16.4 wheel for Python 3.11 from the link below, and after installing it I still have the same errors as in my second post above.

https://pypi.org/project/deepspeed/#files

The error seems to indicate that DeepSpeed was compiled with PyTorch 2.3, and that's not going to work on an RTX 5090, which needs a minimum of 2.6; I'm running 2.7.

loadams (Collaborator) commented Mar 19, 2025

@elkay - this is because the Windows whls come with all DeepSpeed ops pre-built. The unfortunate side effect of this is that we need to pick a torch version at build time, and we had previously chosen 2.3 (due to the changes with libuv and its lack of easy Windows support after that). So that's why you are seeing the mismatch here:

Install torch version=2.3, Runtime torch version=2.7

If you set your local environment to torch 2.3, you should see no issues with the pre-built wheels. I assume you need a newer torch in order to support these 50 series devices? In that case you would either need to build the wheel locally or wait for a future release, where we can try publishing wheels built against a newer torch version.

elkay (Author) commented Mar 19, 2025

Yes, as mentioned, the 50 series cards are only supported on PyTorch >= 2.6 and CUDA >= 12.8.

This is going to prevent anyone with the 50 series consumer cards (5070-5090), as well as the new Blackwell workstation and server cards coming out next month, from using DeepSpeed in their solutions unless they can compile it themselves successfully.

I did try to build locally, but that is not working either. I'm attaching a stack trace.

DeepSpeed-compile-failure.txt

loadams (Collaborator) commented Mar 19, 2025

@elkay - this will only prevent building on Windows for the 5070-5090 cards; we have already verified support for these generations on Linux.

We can look into building future DeepSpeed Windows whls with PyTorch 2.7; this will just be a change for existing users of those whls.

For your stack trace, the error is here:

error: command 'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.43.34808\bin\HostX86\x86\cl.exe' failed with exit code 2

This means that the VC vars aren't set correctly. Please run this command to set them before trying to build:

"<install location>\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"

elkay (Author) commented Mar 20, 2025

Well, that definitely got further! It built for a while but eventually choked again. I've attached the latest build log.

deepspeed-build-log.txt

loadams (Collaborator) commented Mar 20, 2025

Hi @elkay - I don't hit this on our main Windows whl build systems, but I have been able to repro it on a Windows laptop with a GPU. I've not yet debugged this all the way, but as discussed in #6709, you can modify the source code and it builds and runs fine.

elkay (Author) commented Mar 20, 2025

Removing those two lines also worked for me.

deepspeed build time = 387.7761631011963 secs
Successfully built deepspeed-0.16.5+d095b181-cp311-cp311-win_amd64.whl

Installing collected packages: deepspeed
Successfully installed deepspeed-0.16.5+d095b181

Also happy to report that it not only built and installed successfully, it appears to be working as it did before, without any problems, on my 5090!

loadams (Collaborator) commented Mar 21, 2025

Great, thanks @elkay - I'll note that we will publish our next release's Windows wheels with PyTorch 2.7 to support this. And glad it's tested, thanks.

loadams (Collaborator) commented Mar 25, 2025

@elkay - any concerns with closing this issue for now? Updating this for future Windows whls is tracked in #7172.
