
Use double-wide types for additions in scalar code #786

Closed · wants to merge 1 commit

Conversation

@sipa (Contributor) commented Aug 7, 2020

Suggested here by Andy Polyakov.

This is shorter, easier to reason about, less likely to contain branches, and likely (slightly) faster too.
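To make the change concrete, here is a minimal sketch of the two styles (illustrative only: the helper names are made up, and the actual patch touches the scalar carry-chain code rather than standalone functions like these):

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t;

/* Comparison-based carry: the (sum < a) test invites the compiler to
 * materialize a flag, sometimes via a branch. */
static void add_with_carry_cmp(uint64_t a, uint64_t b, uint64_t *sum, uint64_t *carry) {
    *sum = a + b;
    *carry = (*sum < a);
}

/* Double-wide carry: the carry is just the high half of a wider
 * addition, so there is no comparison for the compiler to branch on. */
static void add_with_carry_wide(uint64_t a, uint64_t b, uint64_t *sum, uint64_t *carry) {
    uint128_t t = (uint128_t)a + b;
    *sum = (uint64_t)t;
    *carry = (uint64_t)(t >> 64);
}
```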

@sipa (Contributor, Author) commented Aug 7, 2020

@gmaxwell You may want to test performance/branchlessness on various platforms.

@sipa force-pushed the 202008_wideadd branch 2 times, most recently from 9f3a182 to 2858f4b on August 7, 2020 01:54
@gmaxwell (Contributor) commented Aug 7, 2020

Good news: tests pass, including the valgrind constant-time tests.

Bad news:

In each block, "Before" is without the patch and "After" is with it.

x86_64 64-bit GCC 10.2.1 (asm disabled, obviously)

Before:
scalar_sqr: min 0.0421us / avg 0.0422us / max 0.0423us
scalar_mul: min 0.0410us / avg 0.0410us / max 0.0411us
scalar_inverse: min 12.2us / avg 12.2us / max 12.2us

After:
scalar_sqr: min 0.0456us / avg 0.0457us / max 0.0457us
scalar_mul: min 0.0433us / avg 0.0434us / max 0.0434us
scalar_inverse: min 13.2us / avg 13.2us / max 13.2us

x86_64 32-bit GCC 10.2.1

Before:
scalar_sqr: min 0.122us / avg 0.122us / max 0.123us
scalar_mul: min 0.109us / avg 0.109us / max 0.109us
scalar_inverse: min 35.0us / avg 35.0us / max 35.0us

After:
scalar_sqr: min 0.0961us / avg 0.0962us / max 0.0965us
scalar_mul: min 0.110us / avg 0.110us / max 0.110us
scalar_inverse: min 28.6us / avg 28.6us / max 28.6us

ARMv8 64-bit GCC 9.3

Before:
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.222us / avg 0.222us / max 0.222us
scalar_inverse: min 61.1us / avg 61.1us / max 61.2us

After:
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.237us / avg 0.237us / max 0.237us
scalar_inverse: min 61.6us / avg 61.6us / max 61.7us

ARMv8 32-bit GCC 9.3

Before:
scalar_sqr: min 0.399us / avg 0.399us / max 0.400us
scalar_mul: min 0.388us / avg 0.388us / max 0.388us
scalar_inverse: min 115us / avg 115us / max 115us

After:
scalar_sqr: min 0.392us / avg 0.393us / max 0.394us
scalar_mul: min 0.471us / avg 0.473us / max 0.483us
scalar_inverse: min 117us / avg 117us / max 118us

@real-or-random (Contributor) commented Aug 7, 2020

@sipa Do you remember #453? It goes in the same direction but takes a more radical approach, and maybe that's the right way to do it.

These benchmarks also remind me of my comment in #436. Optimizers seem to like these macros particularly well...

> x86_64 32-bit GCC 10.2.1

At least it's faster here. Hm, now I (somewhat seriously) wonder whether we should do the opposite and get rid of uint128_t in the field code, which is the only reason why MSVC does not support our 64-bit code. (Even though there are other ways to fix this.)

@dot-asm commented Aug 7, 2020

> Bad news:

Life is a compromise. Looking back at the originating question, the main idea behind the suggestion was to leave the compiler no incentive to throw in a branch, and double-width additions achieve that, unlike conditionals. So you have to ask yourself what's more important. Besides, formally speaking, a data point from just one compiler isn't that conclusive.

@real-or-random (Contributor) commented:

By the way, if anyone is interested:
https://godbolt.org/z/8GThbv (no. 1 before, no. 2 after the patch). It's interesting to see how far llvm-mca's prediction is from reality. @gmaxwell What's the exact x86 CPU that you used to run the benchmark?

@gmaxwell (Contributor) commented Aug 7, 2020

> What's the exact x86 CPU that you used to run the benchmark?

EPYC 7742

@real-or-random (Contributor) commented:

I forgot to mention that I don't think these benchmarks are a showstopper. If we think that the code is better (for readability and no branching), I'm fine with this.

@sipa (Contributor, Author) commented Aug 7, 2020

@real-or-random Agreed, though it may inform us when trying variants that perhaps have less performance impact.

Another point is that with the divstep inversion code, scalar mul/sqr performance will be almost entirely irrelevant for inversion (which I expect is the only way scalar operations matter to high-level functions like signing and verification).

@gmaxwell (Contributor) commented Aug 7, 2020

I think inversion is just interesting to look at because it gives a single number. Scalar mul performance is more important in the context of batch validation, some ZKP applications, etc.

I think it's worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn't that important.

> get rid of uint128_t in the field code,

AFAIK you can't do that without losing access to the 64x64->high-64 widening multiply, which is important for performance. The above argument I gave for "just use asm" doesn't apply here, because the difference is big and there is not likely to be ASM written for ARMv8 (or RISC-V 64, POWER9, etc.) any time soon.
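To illustrate the cost being discussed (a sketch with made-up naming, not code from the repository): without __int128 or an _umul128-style intrinsic, the high half of a 64x64 product has to be reassembled from 32-bit pieces, roughly like this:

```c
#include <stdint.h>

/* High 64 bits of a 64x64 multiply, built from 32-bit halves: four
 * multiplies plus carry bookkeeping, instead of one widening multiply. */
static uint64_t umulhi64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;            /* contributes at bit 0 */
    uint64_t p1 = a_lo * b_hi;            /* contributes at bit 32 */
    uint64_t p2 = a_hi * b_lo;            /* contributes at bit 32 */
    uint64_t p3 = a_hi * b_hi;            /* contributes at bit 64 */
    uint64_t mid  = p1 + (p0 >> 32);      /* cannot overflow 64 bits */
    uint64_t mid2 = p2 + (uint32_t)mid;   /* may carry into the high part */
    return p3 + (mid >> 32) + (mid2 >> 32);
}
```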

@real-or-random (Contributor) commented:

> I think it's worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn't that important.

Indeed, and then #453 is the way to go.

> get rid of uint128_t in the field code,

> AFAIK you can't do that without losing access to the 64x64->high-64 widening multiply, which is important for performance. The above argument I gave for "just use asm" doesn't apply here, because the difference is big and there is not likely to be ASM written for ARMv8 (or RISC-V 64, POWER9, etc.) any time soon.

Ok, true.

@sipa (Contributor, Author) commented Aug 7, 2020

> AFAIK you can't do that without losing access to the 64x64->high-64 widening multiply, which is important for performance.

Yes, there is no way we can have true 64-bit operation without the ability to do a 64x64->128-bit multiplication in one way or another.

But I don't think that's a problem: even MSVC has _umul128, which effectively implements that, just not as a neat 128-bit data type.

I also think it's orthogonal to the changes here. If we wanted to support non-__int128 platforms, the code would be replaced by some set of configuration-dependent macros that translate to __int128 operations on supporting platforms, and _umul128 on pairs of uint64_t on MSVC - but the changes here would still matter.
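A hedged sketch of what such a configuration-dependent helper could look like (the mul_64x64_128 name is hypothetical; _umul128 and unsigned __int128 are the real primitives):

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
/* MSVC x64: _umul128 returns the low 64 bits of the product and
 * writes the high 64 bits through its pointer argument. */
static uint64_t mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi) {
    return _umul128(a, b, hi);
}
#else
/* GCC/Clang: use the native 128-bit type directly. */
static uint64_t mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi) {
    unsigned __int128 t = (unsigned __int128)a * b;
    *hi = (uint64_t)(t >> 64);
    return (uint64_t)t;
}
#endif
```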

@sipa (Contributor, Author) commented May 8, 2023

This would need a redo on top of #1000, but it's not clear to me that it's worth it. Closing.

@sipa closed this May 8, 2023