Use double-wide types for additions in scalar code #786
@gmaxwell You may want to test performance/branchlessness on various platforms.
Good news: tests pass, including valgrind ct tests. Bad news: First is before the patch.
x86_64 64-bit GCC 10.2.1 (asm disabled, obviously)
x86_64 32-bit GCC 10.2.1
ARMv8 64-bit GCC 9.3
ARMv8 32-bit GCC 9.3
@sipa Do you remember #453? It goes in the same direction but it's a more radical approach, and maybe then the right way to do it. These benchmarks also remind me of my comment in #436. Optimizers seem to like these macros particularly well.
At least it's faster here. Hm, now I (somewhat seriously) wonder whether we should do the opposite and get rid of
Life is compromise. Looking back at the originating question, the main idea behind the suggestion was to leave the compiler no incentive to throw in a branch. And double-width additions do that, unlike conditionals. So you have to ask yourself what's more important. Besides, formally speaking, a data point from just one compiler isn't that conclusive.
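To make the trade-off concrete, here is a minimal sketch of the two styles of carry handling being compared, using 32-bit limbs with a 64-bit accumulator. The function names and the `LIMBS` constant are hypothetical and chosen for illustration; this is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define LIMBS 8 /* hypothetical: 8 x 32-bit limbs = 256 bits */

/* Carry derived from comparisons. Each (x < y) is a conditional that the
 * compiler is free to lower to either a flag-setting instruction or a branch. */
static uint32_t add_limbs_cmp(uint32_t r[LIMBS], const uint32_t a[LIMBS], const uint32_t b[LIMBS]) {
    uint32_t carry = 0;
    for (size_t i = 0; i < LIMBS; i++) {
        uint32_t s = a[i] + carry;
        uint32_t c1 = (s < carry);   /* overflow from adding the incoming carry */
        r[i] = s + b[i];
        carry = c1 + (r[i] < s);     /* overflow from adding b[i]; never both at once */
    }
    return carry;
}

/* Carry taken from the upper half of a double-wide accumulator.
 * There is no comparison at all, only additions and shifts. */
static uint32_t add_limbs_wide(uint32_t r[LIMBS], const uint32_t a[LIMBS], const uint32_t b[LIMBS]) {
    uint64_t t = 0;
    for (size_t i = 0; i < LIMBS; i++) {
        t += (uint64_t)a[i] + b[i];  /* at most 2*(2^32 - 1) + 1, fits easily in 64 bits */
        r[i] = (uint32_t)t;
        t >>= 32;                    /* remaining bit is the carry into the next limb */
    }
    return (uint32_t)t;
}
```

Both functions return the carry out of the top limb. Whether either form actually compiles without branches still depends on the compiler and target, which is why testing on several platforms was suggested.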
By the way, if anyone is interested:
EPYC 7742
I forgot to mention that I don't think these benchmarks are a showstopper. If we think that the code is better (for readability and no branching), I'm fine with this.
@real-or-random Agree, though it may inform us about trying variants that perhaps have less performance impact. Another point is that with the divstep inversion code, scalar mul/sqr performance will be almost entirely irrelevant for inversion (which is, I expect, the only way scalar operations matter to high-level functions like signing and verification).
I think inversion is just interesting to look at because it gives a single number. Scalar mul performance is more important in the context of batch validation, some ZKP etc. I think it's worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn't that important.
AFAIK you can't do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for "just use asm" doesn't apply because it's a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.
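For readers unfamiliar with the term, the sketch below shows what a 64,64->high64 widening multiply computes, and roughly what it costs to emulate without a 128-bit type. The function names are hypothetical, and the first variant assumes a compiler that provides `unsigned __int128` (a GCC/Clang extension on 64-bit targets, not standard C).

```c
#include <stdint.h>

/* Assumption: the compiler supports unsigned __int128.
 * On most 64-bit targets this maps to a single widening multiply. */
static uint64_t mulhi_wide(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Portable emulation from four 32x32->64 products plus carry handling. */
static uint64_t mulhi_portable(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;
    /* Sum of the middle column; three 32-bit values cannot overflow 64 bits. */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```

The emulation needs four multiplications and several additions where the wide type needs one multiplication, which gives a rough sense of why losing the widening multiply would be a big difference.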
Indeed, and then #453 is the way to go.
Ok, true.
Yes, there is no way we can have true 64-bit operation without the ability to do a 64x64->128 bit multiplication in one way or another. But I don't think that's a problem - even MSVC has I also think it's orthogonal to the changes here. If we wanted to support non-
This would need to be redone on top of #1000, but it's not clear to me it's worth it. Closing.
Suggested here by Andy Polyakov.
This is shorter, easier to reason about, more likely to not contain branches, and likely (slightly) faster too.