CPU processing is extremely slow for models loaded with torch_dtype = torch.float16
#34692
Comments
Hi @blincoln-bf, could you provide the model of your CPU? Generally, CPUs like the 13th Gen Intel(R) Core(TM) i5-13600K do not have specialized instruction sets for the BF16 and FP16 data formats. This means the CPU has to convert data to FP32 for computation and then back to BF16 or FP16, and that back-and-forth process consumes a significant amount of time.
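A rough, minimal sketch for checking this on a given CPU; the matrix size and iteration count below are arbitrary:

```python
import time
import torch

# Time a plain matmul on CPU in each dtype. On CPUs without native
# FP16/BF16 arithmetic, the float16 case tends to be much slower because
# values are converted to FP32 for the actual computation.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    a = torch.randn(2048, 2048).to(dtype)
    b = torch.randn(2048, 2048).to(dtype)
    start = time.perf_counter()
    for _ in range(10):
        _ = a @ b
    print(f"{dtype}: {time.perf_counter() - start:.3f} s")
```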
I think your approach to benchmarking performance using different `torch_dtype` values is sound. You might consider sharing your findings on relevant forums or with the maintainers of the Transformers library, as this could help other users avoid similar pitfalls. Adding an informative warning message when `float16` is used on a CPU device would also be helpful. Overall, the methodology and detailed performance metrics make a strong case for more awareness around dtype performance implications.
Hi @Kevin0624. The system where I ran that benchmark script has an AMD Ryzen 9 7950X3D (16 cores) and 128 GiB of RAM, in addition to an RTX 4090. Regardless of the reason, it seems like warning the user, whether in the documentation or at runtime if they've specified a very inefficient `torch_dtype` for their device, would be worthwhile.

However, you are correct in that it's very device-dependent. I ran the same benchmark on an M1 MacBook (using the CPU) as well. I'll run that script again and post the output here.
Script output:
In case you'd like to see some statistics that document the effect on a larger codebase, I ran benchmarks on four different systems (one of which dual-boots Linux and Windows) and documented them here:

TL;DR:
System Info
Transformers versions: 4.44.2, 4.46.2
PyTorch versions: 2.4.0, 2.5.1
Python version: 3.11.2
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
While troubleshooting very weird behaviour with GPT-NeoX when processed on a CPU device, I discovered that Transformers will load a model with `torch_dtype = torch.float16` and process it on the CPU without any apparent warning or other message, but its performance is very slow compared to `float32` (5+ times slower) or `bfloat16` (10+ times slower). Given the amount of documentation online that suggests using `.half()` or `torch_dtype = torch.float16` to conserve memory, can I suggest adding a warning message when a model loaded in this format is processed on a CPU device? I know `float16` support for CPU at all is relatively new, but given the lack of information anywhere about the massive performance hit it currently incurs, I assumed CPU processing as a whole in Transformers was essentially unusable for real-world work (especially training / gradient operations). In reality, CPU processing is surprisingly fast when set to `float32` or `bfloat16` format.

Here's a quick benchmark script based on the example usage for GPT-NeoX that loads the model, then generates text using three prompts. It performs this test for `torch_dtype = None`, `torch.float32`, `torch.float16`, and `torch.bfloat16`:
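A minimal sketch of such a benchmark is below; the checkpoint (EleutherAI/pythia-410m), prompts, and generation length are stand-ins rather than the values from the original script:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-410m"  # stand-in for the GPT-NeoX checkpoint
PROMPTS = [
    "GPTNeoX20B is a 20B-parameter autoregressive Transformer model developed at EleutherAI.",
    "The quick brown fox",
    "In a hole in the ground there lived",
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

for dtype in (None, torch.float32, torch.float16, torch.bfloat16):
    # torch_dtype=None uses the library default (float32 weights).
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=dtype).to("cpu")
    start = time.perf_counter()
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=64, do_sample=False)
    elapsed = time.perf_counter() - start
    print(f"torch_dtype={dtype}: {elapsed:.1f} s for {len(PROMPTS)} prompts")
```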
Excerpt of the output with just the relevant statistics:
As you can see, `float16` performance scored about 7 times worse than `float32` for this run, and about 14 times worse than `bfloat16`, with simple text generation taking almost ten minutes in `float16` format. For training/gradient operations, the effect is even more of a problem. Operations that take a few minutes in the other formats can take hours in `float16` format (in the case of the GPT-NeoX issue, 10+ hours for a call to `forward`). I don't have a good minimal test case for that, though.

This is not limited to GPT-NeoX. For example, here's the same script, but modified to use Phi-3-mini-128k instead:
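Presumably the only change needed is the checkpoint name, along the lines of the following (the exact model id is an assumption):

```python
MODEL_NAME = "microsoft/Phi-3-mini-128k-instruct"  # assumed checkpoint id for Phi-3-mini-128k
```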
Relevant output for Phi-3-mini-128k:
In this case, the difference is 5 times worse than `float32`, and 10 times worse than `bfloat16` overall. There seems to be some kind of fixed overhead causing the issue, because the processing times for both Phi-3-mini-128k and GPT-NeoX in `float16` form are virtually identical, even when they vary by several times for the data in other formats.

I assume the discrepancy is at least sort of a known issue to the Transformers developers, but I only discovered it myself when trying to debug a different problem. Adding a runtime warning and maybe an explicit warning in the documentation seems like it would be a good idea.
Expected behavior
If CPU processing is performed using a very inefficient format that is also commonly suggested as a way to reduce the memory footprint, I would expect Transformers to issue a warning.
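As a sketch of the kind of check being proposed (written here as a user-side helper, not an existing Transformers API):

```python
import warnings
import torch

def warn_if_slow_cpu_dtype(model):
    # Emit a warning when a model sits on CPU with float16 weights, since many
    # CPUs lack native FP16 arithmetic and fall back to much slower code paths.
    if model.device.type == "cpu" and model.dtype == torch.float16:
        warnings.warn(
            "Model is on CPU with torch.float16 weights; this is often far "
            "slower than float32 or bfloat16 on CPU. Consider a different dtype."
        )
```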