Implement ABGLSV-Pornin multiplication #323

str4d wants to merge 5 commits into dalek-cryptography:main
Conversation
This is really cool! A few comments / questions based on a quick read of the source code:
I did consider this API, but wasn't sure whether there were any cases where we would want to not use cofactor multiplication on the result for
Yep, this is also what I'd like. I kept the prior API initially so I had something to benchmark against 😄
It looks like in Curve9767 @pornin uses width-4 for runtime-calculated tables, and width-5 for pre-computed tables. IDK if he has relevant benchmarks, but it's another datapoint towards dropping width-8. I think it would make sense to examine this in a subsequent PR, separately from this change.
The input preparation is performance-critical, in that the pre-Pornin algorithms were slow enough that the reduction in doublings could not offset it (which led to the Ed25519 paper dismissing ABGLSV and using double-base scalar mult instead). That said, I had originally started writing
Yes, this is what I mean. We want to leverage the fact that bit lengths are strictly-decreasing, to avoid operating on higher limbs that are guaranteed to not contain the MSB.
Force-pushed from 9b8b93d to b401033.
I've reworked the PR following @hdevalence's comments, and added an AVX2 backend.
About window sizes: there are several parameters in play, not all of which apply to the present case; notably, my default implementations strive to work on very small systems, and that means using very little RAM. For Curve9767, each point in a window uses 80 bytes (affine coordinates, each coordinate is a polynomial of degree less than 19, coefficients on 16 bits each, two dummy slots for alignment); if the windows collectively contain 16 points (for instance), then that's 1280 bytes of stack space, and for the very low end of microcontrollers, that's too much (I must leave a few hundred bytes for the temporaries used in field element operations, and the calling application may also have needs). ROM/Flash size is also a constraint (though usually less severe), again encouraging relatively small windows.

With a window of n bits, 2^(n-1) points must be stored (e.g. for a 4-bit window, this stores points P, 2P, ..., 8P, from which we can also dynamically obtain -P, -2P, ..., -8P). If using wNAF, we only need the odd multiples of these points (i.e. P, 3P, 5P and 7P for a 4-bit window), lowering the storage cost to 2^(n-2) points. In the signature verification, I have two dynamic windows to store: computing uA+vB+wC, with B being the generator but A and C dynamically obtained, I need one window for A and another for C. Therefore, if I want to use only 8 points (640 stack bytes), then I must stick to 4-bit windows. Static windows are in ROM, and there's more space there, but there's exponential growth; each 5-bit window is 1280 bytes, and there are two of them, so 2560 bytes of ROM for these. In the x86/AVX2 implementation, for signature verification, I use 5-bit dynamic windows and 7-bit static windows; for generic point multiplication (non-NAF, thus with also the even multiples), I have both static and dynamic 5-bit windows (four static windows for the base point).
The static windows add up to 10240 bytes, which I think is a reasonable figure for big x86, since there will typically be about 32 kB of L1 cache: again, we must think that the caller also has data in cache, and if we use up all the L1 cache for the signature verification, this may look good on benchmarks, but in practical situations it will induce cache misses down the line. We should therefore strive to use only a minority of the total L1 cache. Note that Ed25519 points are somewhat larger than Curve9767 points.

About CPU cost: this is a matter of trade-offs. In wNAF, with n-bit windows, building the window will require 2^(n-2)-1 point additions, and will induce on average one point addition every n+1 bits. With 127-bit multipliers, this means that 4-bit windows need 28.4 point additions on average (for each window, not counting the 126 doublings), while 5-bit windows need about 28.2. With Curve9767, the latter is better (if you have the RAM) for another reason which is not applicable to Ed25519: long sequences of point doublings are slightly more efficient, and longer windows increase the average length of runs of doublings. This benefit does not apply to Ed25519. Thus, for dynamic windows and Ed25519, I'd say that 4-bit and 5-bit wNAF windows should be about equivalent (5-bit windows would be better if using 252-bit multipliers). With static windows, there is no CPU cost in building windows, and larger windows are better, but there are diminishing returns. Going from 7-bit to 8-bit windows would save less than two point additions, possibly not worth the effort unless you are aiming at breaking the record in a microbenchmark context which will be meaningless in real situations.
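Those per-window figures are easy to sanity-check. Below is a small throwaway sketch of the cost model described above (my own illustration, not code from the PR or from Curve9767): 2^(w-2) - 1 additions to build the table of odd multiples, plus roughly one addition every w+1 bits of the multiplier.

```rust
/// Average point additions to process one `bits`-bit scalar with `w`-bit
/// wNAF windows: 2^(w-2) - 1 additions to build the table of odd multiples
/// (P, 3P, ..., (2^(w-1)-1)P), plus about one addition every w+1 bits in
/// the main loop. Doublings are not counted, matching the figures above.
fn wnaf_additions(bits: u32, w: u32) -> f64 {
    let table_build = (1u32 << (w - 2)) - 1;
    table_build as f64 + bits as f64 / (w as f64 + 1.0)
}

fn main() {
    // 127-bit multipliers, as produced by the half-size scalar split.
    println!("4-bit windows: {:.1} additions", wnaf_additions(127, 4)); // 3 + 25.4  = 28.4
    println!("5-bit windows: {:.1} additions", wnaf_additions(127, 5)); // 7 + 21.2 ≈ 28.2
    // With full 252-bit multipliers, 5-bit windows pull clearly ahead.
    println!("252-bit, 4-bit windows: {:.1}", wnaf_additions(252, 4));
    println!("252-bit, 5-bit windows: {:.1}", wnaf_additions(252, 5));
}
```

The near-tie at 127 bits (28.4 vs 28.2) and the clearer gap at 252 bits fall straight out of this model.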
Force-pushed to fix the serial and vector Straus impls, which were not correctly checking for the first non-zero
Force-pushed from 2194715 to 93fd6ea.
This PR was previously based on release 2.0.0. I've rebased it onto Assuming this PR is merged, I plan to make a separate PR to
Force-pushed from 8b6c6c1 to e267f48.
Fixed the simple bugs, but CI is still failing because between 2.0.0 and
Force-pushed from 81534c1 to f6ec510.
```rust
    ])
    .unwrap()),
);
println!("b_shl_128_odd_lookup_table = {:?}", b_shl_128_odd_table);
```
I added this test to match the one for BASEPOINT_ODD_LOOKUP_TABLE, and used it to regenerate the AVX2 B_SHL_128_ODD_LOOKUP_TABLE table (the contents of which apparently get generated differently after over two years of crate development, but both the old and new lookup tables pass tests).
```rust
let basepoint_odd_table =
    NafLookupTable8::<CachedPoint>::from(&constants::ED25519_BASEPOINT_POINT);
println!("basepoint_odd_lookup_table = {:?}", basepoint_odd_table);
```
I copied this from the AVX2 tests to have an equivalent check of the AVX512IFMA lookup table, but I don't have a suitable device to test this.
```rust
    ])
    .unwrap()),
);
println!("b_shl_128_odd_lookup_table = {:?}", b_shl_128_odd_table);
```
I copied this from the AVX2 tests to have an equivalent check of the AVX512IFMA lookup table. Someone with a suitable device needs to run this test and extract the output of this println so we can update ifma::constants with the correct table.
```rust
]);

/// Odd multiples of `[2^128]B`.
// TODO: generate real constants using test in `super::edwards`.
```
This is currently just a duplicate of BASEPOINT_ODD_LOOKUP_TABLE to get the build CI checks to pass.
Moving the todo list out of the top post:
The last two items are not blockers for this PR.
Force-pushed from d785f10 to d448edd.
Uses Algorithm 4 from Pornin 2020 to find a suitable short vector.

References:
- Pornin 2020: https://eprint.iacr.org/2020/454
Force-pushed from d448edd to b13b3a6.
Force-pushed to fix post-rebase bugs and get CI passing.
Force-pushed from a3524dc to 20e355e.
Force-pushed to add changelog entries and fix documentation.
```rust
/// Checks whether \\([8a]A + [8b]B = [8]C\\) in variable time.
///
/// This can be used to implement [RFC 8032]-compatible Ed25519 signature validation.
/// Note that it includes a multiplication by the cofactor.
///
/// [RFC 8032]: https://tools.ietf.org/html/rfc8032
pub fn vartime_check_double_scalar_mul_basepoint(
```
ed25519-dalek is now in the same workspace as curve25519-dalek, so I can make changes to it in this PR, but I think the next question is how we use this method.
I opened this PR in May 2020. Originally I just returned the scalar mul output directly, but @hdevalence suggested this "check" API instead, where the EdwardsPoint version would multiply by the cofactor. I migrated to that, noting that we might want to make the cofactor multiplication configurable.
In October 2020 @hdevalence published his survey of Ed25519 validation criteria. Some time in the intervening 3.5 years, ed25519-dalek has gained several separate signature verification methods that all use this helper function internally:
curve25519-dalek/ed25519-dalek/src/verifying.rs
Lines 216 to 224 in cc3421a
These helpers are therefore either checking "ad-hoc" or "strict" equality of R, neither of which multiply by the cofactor. Meanwhile the ed25519-zebra crate implements the ZIP 215 signature validation rules, which are the "expansive" rules (R is not required to be a canonical encoding, and multiplication by cofactor is required).
So I think we do want some kind of configurability here over the cofactor multiplication. What should this look like? A boolean argument, or two separate APIs?
Note also that the scalar mul optimization implemented in this PR actually checks [δa]A + [δb]B = [δ]C, where δ is a value invertible mod ℓ:

> If there is on the curve a non-trivial point T of order h, then replacing R with R+T will make the standard verification equation fail, but the second one will still accept the signature if it so happens that the value δ (obtained from the lattice basis reduction algorithm) turns out to be a multiple of h.
Is there a way we can avoid this by adjusting the lattice basis reduction algorithm to filter out these δ values? If not, then we cannot use this optimisation for the "strict" verification methods, and it is debatable whether we should even use it for the "ad-hoc" methods (as doing so would change the ill-defined set of valid signatures; not that there aren't already wide inconsistencies between implementations, but this would be a difference between two versions of curve25519-dalek, and IDK what the maintainers' policy here is).
Regardless, we definitely should offer a "mul-by-cofactor" version of this in the API, as ed25519-zebra (and anyone else using the cofactor check equation) will benefit from it (as will anyone using RistrettoPoint in a signature scheme, which fortunately does not suffer from this problem).
I have looked a bit at the problem; I think leveraging the optimization while being strictly equivalent to the cofactorless equation is doable, but it is a bit unpleasant.
We have a public key A, generator is B, signature is (R, s), and during the verification, the challenge k is computed as a SHA-512 output, which is then interpreted as an integer. The curve has order 8*L. Points A and R are on the curve, but not necessarily in the subgroup of order L. The cofactorless verification equation is:
s*B - k*A = R
First, we should note that while k is nominally a 512-bit integer, the implementation in curve25519-dalek represents k as a Scalar, which implies reduction modulo L. This already deviates from the cofactorless equation in RFC 8032, where there is no such reduction. This matters if A is not in the subgroup of order L; for instance, it may happen that k is, as an integer, a multiple of 8, while k mod L is an odd integer, in which case the cofactorless equation would report a success, while the dalek implementation would reject it. The reverse is also possible (signature accepted by dalek but rejected by the RFC). All these variants are still within the scope of the signature algorithm, i.e. the discrepancies between verifier behaviours do not allow actual signature forgeries by attackers not knowing the private key. There is some extra discussion in the Taming the many EdDSAs paper (page 11). Here I am discussing reproducing the exact behaviour of the current dalek implementation, and therefore I call k the reduction of the SHA-512 output modulo L.
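To make the parity discrepancy concrete, here is a toy sketch (my own illustration; it uses the additive group ℤ/8L with a small stand-in L = 13, not the real curve) in which the challenge is a multiple of 8 as an integer but odd after reduction mod L:

```rust
// Toy model: additive group Z/(8*L) with a small stand-in L = 13 (these
// parameters are illustrative only). `a` plays the role of a point of
// order 8, and k = 16 is a multiple of 8 while k mod 13 = 3 is odd.
const L: i64 = 13;
const N: i64 = 8 * L; // stand-in for the full curve order

fn md(x: i64) -> i64 { x.rem_euclid(N) }

fn main() {
    let a = L; // order-8 element: 8*a ≡ 0 (mod N)
    let k: i64 = 16;
    let k_red = k % L; // 3
    // With the integer k (RFC 8032 reading), the k*A term vanishes; with
    // the reduced k (the dalek reading), it does not, so the two verifier
    // behaviours can diverge whenever A lies outside the order-L subgroup.
    assert_eq!(md(k * a), 0);
    assert_ne!(md(k_red * a), 0);
    println!("k*A = {}, (k mod L)*A = {}", md(k * a), md(k_red * a));
}
```

The same divergence cannot occur when A is in the order-L subgroup, which is what the rewrite below exploits.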
Given k, one can compute k8 = k mod 8 (the three low bits of k). The cofactorless equation is then equivalent to:
s*B - ((k - k8)/8)*(8*A) - (R + k8*A) = 0
Thus, by replacing k, A and R with, respectively, (k >> 3), 8*A and R + (k & 7)*A, I have a completely equivalent equation (thus with the same behaviour), but I have also guaranteed that the A point is in the proper subgroup of order L. Thus we can now assume that A is in that subgroup. This is important: when multiplying A by an integer x, we can now reduce x modulo L without any loss of information.
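The rewrite above is a pure algebraic identity, since ((k - k8)/8)*8 + k8 = k. A toy sketch (my own check, again modelling the group additively as ℤ/8L with a stand-in L = 13) confirms it exhaustively:

```rust
// Toy exhaustive check of the rewrite: for any integer k with k8 = k mod 8,
//     s*B - k*A - R == s*B - ((k - k8)/8)*(8*A) - (R + k8*A)
// in any abelian group, because ((k - k8)/8)*8 + k8 == k. We model the
// group as Z/(8*L) with a stand-in L = 13 (illustrative, not the real ℓ).
const L: i64 = 13;
const N: i64 = 8 * L;

fn md(x: i64) -> i64 { x.rem_euclid(N) }

// Left side: the cofactorless equation; right side: the rewritten form in
// which the "point" 8*A is guaranteed to lie in the order-L subgroup.
fn sides_match(s: i64, b: i64, k: i64, a: i64, r: i64) -> bool {
    let k8 = k % 8;
    let lhs = md(s * b - k * a - r);
    let rhs = md(s * b - ((k - k8) / 8) * md(8 * a) - md(r + k8 * a));
    lhs == rhs
}

fn main() {
    // Every A and R (in or out of the order-L subgroup) and every reduced k.
    for a in 0..N {
        for r in 0..N {
            for k in 0..N {
                assert!(sides_match(5, 7, k, a, r)); // arbitrary fixed s, B
            }
        }
    }
    println!("rewrite is equivalent for all A, R, k in the toy group");
}
```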
When we apply Lagrange's algorithm on the lattice basis ((k, 1), (L, 0)), we get a new basis ((u0, u1), (v0, v1)) for the same lattice. In algorithm 4 in my paper, we stop as soon as the smaller of these two vectors is "small enough", but we can also reuse the stopping condition from algorithm 3, i.e. we can change this:
```
if len(N_v) <= t:
    return (v, u)
```

into:

```
if len(N_v) <= t:
    if 2*abs(p) <= N_v:
        return (v, u)
```
This would, on average, add maybe one or two iterations to the algorithm, i.e. the extra cost on the algorithm would likely be negligible. By using this test, we ensure that not only v is truly the smallest non-zero vector in the lattice, but u is the second smallest non-zero vector among those which are not colinear to v (this kind of assertion breaks down at higher lattice dimensions, but in dimension 2 it works).
Now, Lagrange's algorithm starts here with u = (k, 1), and 1 is odd. Moreover, each step either adds a multiple of v to u, or a multiple of u to v. The consequence is that u1 and v1 can never both be even; at least one of them is odd. The important point here is that if v1 is an odd integer, and less than L (by construction), then it is invertible modulo L (since L is prime) but also modulo 8 (since it is odd). Thus, v1 is invertible modulo 8*L. If v1 is invertible modulo 8*L, which is the whole curve order, then we can multiply the verification equation by v1 in a reversible way, i.e. without changing the behaviour. We thus get:
(v1*s mod L)*B - v0*A - v1*R = 0
which is the Antipa et al optimization. Note that the equivalence relies on two properties: that A is in the right subgroup (so that we can replace k*v1 with v0), and that v1 is odd.
The unpleasantness is that v1 might be even. As explained above, if v1 is even, then u1 must be odd, hence we can use (u0, u1) instead of (v0, v1). However, the smallest non-zero vector in the lattice is v, not u. Heuristically, u is not much bigger than v, but there are some degenerate cases. For instance, if k = (L - 1)/2, then the output of Lagrange's algorithm is v = (1, -2) (very small, but -2 is even), and u = (2*(L+1)/5, (L-4)/5) (denominator u1 = (L-4)/5 is odd, but both u0 and u1 are almost as large as L).
In the verification algorithm, k is an output of SHA-512, and thus attackers would have trouble crafting signatures that leverage the most degenerate cases, and we can heuristically consider that u won't be a very large vector, but the lattice reduction algorithm must still perform updates on u and v with their full 254-bit size (including the sign bit); the nice trick of computing them only over 128 bits is no longer applicable. This may conceivably increase the cost, and thus decrease the usefulness of the optimization.
Summary: the behaviour of the current implementation (with the cofactorless equation) can be maintained while applying the Antipa et al optimization, provided that the following process is applied:

- Compute `k` as previously, with a SHA-512 output and with reduction modulo `L` (to maintain backward compatibility).
- Replace `k`, `A` and `R` with `k >> 3`, `8*A` and `R + (k & 7)*A`, respectively.
- Compute Lagrange's algorithm over `((k, 1), (L, 0))` (optionally with the extra ending test so that a truly size-reduced basis is obtained, to make both basis vectors as small as possible). Updates to coordinates of `u` and `v` must be maintained over their full size (254 bits).
- Given the output `((v0, v1), (u0, u1))` of Lagrange's algorithm (with `(v0, v1)` being the smallest non-zero vector in the lattice), use `(v0, v1)` if `v1` is odd; but if `v1` turns out to be even, use `(u0, u1)` instead (in that case, `u1` is odd). Since `u` is not the smallest vector, its coordinates can be larger than `sqrt(1.16*L)`, so the combined Straus algorithm must be able to handle large coefficients (even if these are improbable in practice).
WARNING: I wrote all this without actually implementing it. It seems to make sense on paper, but until it is implemented and tested, there's no guarantee I did not make a mistake.
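In the same spirit of checking on paper, here is a toy sketch of the last two steps (my own untested rendering of classic two-dimensional Lagrange/Gauss reduction iterated until the basis is fully size-reduced, plus the parity-based selection; a small prime stands in for the real L, and this is not the PR's code). For L = 1009 and the degenerate case k = (L-1)/2 discussed above, it yields v = (1, -2) and u = (2*(L+1)/5, (L-4)/5) = (404, 201), matching the formulas:

```rust
// Toy Lagrange (Gauss) reduction on the basis ((k, 1), (L, 0)), with the
// parity-based selection described above. Uses i128 and a small prime
// stand-in for L, so it is only suitable for illustration.
fn dot(a: (i128, i128), b: (i128, i128)) -> i128 { a.0 * b.0 + a.1 * b.1 }
fn norm(a: (i128, i128)) -> i128 { dot(a, a) }

// Nearest-integer division, assuming b > 0.
fn div_round(a: i128, b: i128) -> i128 {
    let (q, r) = (a.div_euclid(b), a.rem_euclid(b));
    if 2 * r >= b { q + 1 } else { q }
}

/// Returns (v, u): v is the shortest non-zero vector of the lattice
/// {(x, y) : x ≡ k*y (mod l)}, and u completes a size-reduced basis
/// (i.e. the loop only stops once u cannot be shortened against v).
fn lagrange_reduce(k: i128, l: i128) -> ((i128, i128), (i128, i128)) {
    let mut v = (k, 1i128);
    let mut u = (l, 0i128);
    if norm(v) > norm(u) { std::mem::swap(&mut v, &mut u); }
    loop {
        // Size-reduce u against the currently shorter vector v.
        let q = div_round(dot(v, u), norm(v));
        u = (u.0 - q * v.0, u.1 - q * v.1);
        if norm(u) >= norm(v) { return (v, u); }
        std::mem::swap(&mut v, &mut u);
    }
}

/// Use v if v1 is odd, else u (as argued above, u1 and v1 cannot both be
/// even, because the basis determinant is ±L, which is odd).
fn pick_odd_v1(v: (i128, i128), u: (i128, i128)) -> (i128, i128) {
    if v.1 % 2 != 0 { v } else { u }
}

fn main() {
    let l: i128 = 1009; // small prime stand-in for the real group order
    let k = (l - 1) / 2; // the degenerate case from the discussion above
    let (v, u) = lagrange_reduce(k, l);
    // Both output vectors lie in the lattice.
    assert_eq!((v.0 - k * v.1).rem_euclid(l), 0);
    assert_eq!((u.0 - k * u.1).rem_euclid(l), 0);
    assert_eq!(v, (1, -2)); // tiny, but its second coordinate is even...
    assert_eq!(u, (2 * (l + 1) / 5, (l - 4) / 5)); // ...so u is used instead
    assert_eq!(pick_odd_v1(v, u), (404, 201));
    println!("v = {:?}, u = {:?}, chosen = {:?}", v, u, pick_odd_v1(v, u));
}
```

This also makes the cost concern visible: the chosen u here is far larger than sqrt(1.16*L), so the downstream Straus code really would need to tolerate large coefficients.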
Thanks @pornin for looking into this! I think the proposed changes are complex enough that they should be made and tested in a separate PR.
To avoid blocking this PR further, I propose that we rename EdwardsPoint::vartime_check_double_scalar_mul_basepoint to something like EdwardsPoint::vartime_check_double_scalar_mul_basepoint_cofactor, and then in a subsequent PR we can attempt to expose an EdwardsPoint::vartime_check_double_scalar_mul_basepoint that is cofactor-less.
Force-pushed from 20e355e to fd8952c.
Force-pushed to move the new generated serial tables into separate submodules, add cfg-flagged tests to generate them, and add a CI job that verifies them. If this works, I'll attempt to replicate this for the vector tables.
Force-pushed from 0d8eea8 to a66efb2.
This corresponds to the signature verification optimisation presented in Antipa et al 2005. It uses windowed non-adjacent form Straus for the multiscalar multiplication.

References:
- Antipa et al 2005: http://cacr.uwaterloo.ca/techreports/2005/cacr2005-28.pdf
Checks whether [8a]A + [8b]B = [8]C in variable time. This can be used to implement RFC 8032-compatible Ed25519 signature validation. Note that it includes a multiplication by the cofactor.
Checks whether [a]A + [b]B = C in variable time.
Force-pushed from a66efb2 to 5e03d5c.
Force-pushed to fix the Fiat backends, and adjust the new CI check to fail if the table generators do nothing (in that case they generate output that is incorrectly formatted, and thus detectable).
Force-pushed from 5e03d5c to c96c810.
Force-pushed to implement a similar kind of generator approach for the AVX2 vector table. It doesn't currently work because the
Force-pushed from c96c810 to a77e13b.
Force-pushed to fix the AVX2 table generator. The generated constant is concretely different from before (I presume something changed about the wNAF implementation in the intervening four years), but tests pass before and after the change (and I checked that mutating either version of the constant causes a test to fail).
Force-pushed from a77e13b to 01a9e9e.
Force-pushed to implement a generator for the IFMA vector table, based on the working AVX2 generator. It should work, but I don't have the hardware to run it, and so the IFMA constants remain invalid. Someone with compatible hardware needs to run the following commands on this branch: and then provide the resulting diff to the IFMA table.
It's been almost six years since I implemented this, and there's still no path to getting this in. I've rebased it every couple of years to bring it up-to-date with all the intervening maintenance, and clearly I need to do so again. @rozbb what needs to happen in order to make progress on this?
I'm sorry it's been such a drag. I went through the PR again. It's really a lot of code, and as a part-time maintainer I need to weigh new code against my ability to fit it into my head. I think this is probably too much for me to handle, unfortunately. @tarcieri what say you?
Adds a backend for computing δ(aA + bB - C) in variable time, where:

- B is the Ed25519 basepoint;
- δ is a value invertible mod ℓ, which is selected internally to the function.

This corresponds to the signature verification optimisation presented in Antipa et al 2005. It uses Algorithm 4 from Pornin 2020 to find a suitable short vector, and then windowed non-adjacent form Straus for the resulting multiscalar multiplication.

References:
- Antipa et al 2005: http://cacr.uwaterloo.ca/techreports/2005/cacr2005-28.pdf
- Pornin 2020: https://eprint.iacr.org/2020/454