
Use double-wide types for additions in scalar code #786

Closed · wants to merge 1 commit

Conversation

@sipa (Contributor) commented Aug 7, 2020

Suggested here by Andy Polyakov.

This is shorter, easier to reason about, less likely to contain branches, and likely (slightly) faster too.
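To make the change concrete, here is a minimal sketch of the two styles (illustrative only: the helper names are made up, and the actual patch touches the scalar carry-chain code rather than standalone functions like these):

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t;

/* Comparison-based carry: the (sum < a) test invites the compiler to
 * materialize a flag, sometimes via a branch. */
static void add_with_carry_cmp(uint64_t a, uint64_t b, uint64_t *sum, uint64_t *carry) {
    *sum = a + b;
    *carry = (*sum < a);
}

/* Double-wide carry: the carry is just the high half of a wider
 * addition, so there is no comparison for the compiler to branch on. */
static void add_with_carry_wide(uint64_t a, uint64_t b, uint64_t *sum, uint64_t *carry) {
    uint128_t t = (uint128_t)a + b;
    *sum = (uint64_t)t;
    *carry = (uint64_t)(t >> 64);
}
```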

@sipa (Contributor, Author) commented Aug 7, 2020

@gmaxwell You may want to test performance/branchlessness on various platforms.

@sipa force-pushed the 202008_wideadd branch 2 times, most recently from 9f3a182 to 2858f4b on August 7, 2020 01:54
@gmaxwell (Contributor) commented Aug 7, 2020

Good news: tests pass, including the valgrind constant-time tests.

Bad news:

In each block, "Before" is without the patch and "After" is with it.

x86_64 64-bit GCC 10.2.1 (asm disabled, obviously)

Before:
scalar_sqr: min 0.0421us / avg 0.0422us / max 0.0423us
scalar_mul: min 0.0410us / avg 0.0410us / max 0.0411us
scalar_inverse: min 12.2us / avg 12.2us / max 12.2us

After:
scalar_sqr: min 0.0456us / avg 0.0457us / max 0.0457us
scalar_mul: min 0.0433us / avg 0.0434us / max 0.0434us
scalar_inverse: min 13.2us / avg 13.2us / max 13.2us

x86_64 32-bit GCC 10.2.1

Before:
scalar_sqr: min 0.122us / avg 0.122us / max 0.123us
scalar_mul: min 0.109us / avg 0.109us / max 0.109us
scalar_inverse: min 35.0us / avg 35.0us / max 35.0us

After:
scalar_sqr: min 0.0961us / avg 0.0962us / max 0.0965us
scalar_mul: min 0.110us / avg 0.110us / max 0.110us
scalar_inverse: min 28.6us / avg 28.6us / max 28.6us

ARMv8 64-bit GCC 9.3

Before:
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.222us / avg 0.222us / max 0.222us
scalar_inverse: min 61.1us / avg 61.1us / max 61.2us

After:
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.237us / avg 0.237us / max 0.237us
scalar_inverse: min 61.6us / avg 61.6us / max 61.7us

ARMv8 32-bit GCC 9.3

Before:
scalar_sqr: min 0.399us / avg 0.399us / max 0.400us
scalar_mul: min 0.388us / avg 0.388us / max 0.388us
scalar_inverse: min 115us / avg 115us / max 115us

After:
scalar_sqr: min 0.392us / avg 0.393us / max 0.394us
scalar_mul: min 0.471us / avg 0.473us / max 0.483us
scalar_inverse: min 117us / avg 117us / max 118us

@real-or-random (Contributor) commented Aug 7, 2020

@sipa Do you remember #453? It goes in the same direction but takes a more radical approach, and maybe that's the right way to do it.

These benchmarks also remind me of my comment in #436. Optimizers seem to like these macros particularly well...

> x86_64 32-bit GCC 10.2.1

At least it's faster here. Hm, now I (somewhat seriously) wonder whether we should do the opposite and get rid of uint128_t in the field code, which is the only reason why MSVC does not support our 64-bit code. (Even though there are other ways to fix this.)

@dot-asm commented Aug 7, 2020

> Bad news:

Life is a compromise. Looking back at the originating question, the main idea behind the suggestion was to leave the compiler no incentive to throw in a branch, and double-width additions achieve that, unlike conditionals. So you have to ask yourself what's more important. Besides, formally speaking, a data point from just one compiler isn't that conclusive.

@real-or-random (Contributor) commented:

By the way, if anyone is interested:
https://godbolt.org/z/8GThbv (no. 1 before, no. 2 after the patch). It's interesting to see how far llvm-mca's prediction is from reality. @gmaxwell What's the exact x86 CPU that you used to run the benchmark?

@gmaxwell (Contributor) commented Aug 7, 2020

> What's the exact x86 CPU that you used to run the benchmark?

EPYC 7742

@real-or-random (Contributor) commented:

I forgot to mention that I don't think these benchmarks are a showstopper. If we think that the code is better (for readability and no branching), I'm fine with this.

@sipa (Contributor, Author) commented Aug 7, 2020

@real-or-random Agreed, though it may inform us when trying variants that perhaps have less performance impact.

Another point is that with the divstep inversion code, scalar mul/sqr performance will be almost entirely irrelevant for inversion (which I expect is the only way scalar operations matter to high-level functions like signing and verification).

@gmaxwell (Contributor) commented Aug 7, 2020

I think inversion is just interesting to look at because it gives a single number. Scalar mul performance is more important in the context of batch validation, some ZKP applications, etc.

I think it's worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn't that important.

> get rid of uint128_t in the field code,

AFAIK you can't do that without losing access to the 64x64->high-64 widening multiply, which is important for performance. The above argument I gave for "just use asm" doesn't apply here, because the difference is big and there is not likely to be ASM written for ARMv8 (or RISC-V 64, POWER9, etc.) any time soon.
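To illustrate the cost being discussed (a sketch with made-up naming, not code from the repository): without __int128 or an _umul128-style intrinsic, the high half of a 64x64 product has to be reassembled from 32-bit pieces, roughly like this:

```c
#include <stdint.h>

/* High 64 bits of a 64x64 multiply, built from 32-bit halves: four
 * multiplies plus carry bookkeeping, instead of one widening multiply. */
static uint64_t umulhi64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;            /* contributes at bit 0 */
    uint64_t p1 = a_lo * b_hi;            /* contributes at bit 32 */
    uint64_t p2 = a_hi * b_lo;            /* contributes at bit 32 */
    uint64_t p3 = a_hi * b_hi;            /* contributes at bit 64 */
    uint64_t mid  = p1 + (p0 >> 32);      /* cannot overflow 64 bits */
    uint64_t mid2 = p2 + (uint32_t)mid;   /* may carry into the high part */
    return p3 + (mid >> 32) + (mid2 >> 32);
}
```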

@real-or-random (Contributor) commented:

> I think it's worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn't that important.

Indeed, and then #453 is the way to go.

> get rid of uint128_t in the field code,

> AFAIK you can't do that without losing access to the 64x64->high-64 widening multiply, which is important for performance. The above argument I gave for "just use asm" doesn't apply here, because the difference is big and there is not likely to be ASM written for ARMv8 (or RISC-V 64, POWER9, etc.) any time soon.

Ok, true.

@sipa (Contributor, Author) commented Aug 7, 2020

> AFAIK you can't do that without losing access to the 64x64->high-64 widening multiply, which is important for performance.

Yes, there is no way we can have true 64-bit operation without the ability to do a 64x64->128-bit multiplication in one way or another.

But I don't think that's a problem: even MSVC has _umul128, which effectively implements that, just not as a neat 128-bit data type.

I also think it's orthogonal to the changes here. If we wanted to support non-__int128 platforms, the code would be replaced by some set of configuration-dependent macros that translate to __int128 operations on supporting platforms, and _umul128 on pairs of uint64_t on MSVC - but the changes here would still matter.
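A hedged sketch of what such a configuration-dependent helper could look like (the mul_64x64_128 name is hypothetical; _umul128 and unsigned __int128 are the real primitives):

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
/* MSVC x64: _umul128 returns the low 64 bits of the product and
 * writes the high 64 bits through its pointer argument. */
static uint64_t mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi) {
    return _umul128(a, b, hi);
}
#else
/* GCC/Clang: use the native 128-bit type directly. */
static uint64_t mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi) {
    unsigned __int128 t = (unsigned __int128)a * b;
    *hi = (uint64_t)(t >> 64);
    return (uint64_t)t;
}
#endif
```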

@sipa (Contributor, Author) commented May 8, 2023

This would need a redo on top of #1000, but it's not clear to me that it's worth it. Closing.

@sipa closed this May 8, 2023