Faster INTT on AArch64 by more efficient reduction #773

Open
wants to merge 2 commits into
base: main

rod-chapman
Contributor

This PR introduces a small change in AArch64 intt_clean.S that reduces the number of intermediate
reduction steps from 4 to 3, through more thorough tracking of coefficient bounds.

TODO

  1. Application of SLOTHY to this code to also update intt_opt.S
  2. Re-proof of both Clean and Opt versions with HOL-Light
  3. Benchmarks
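
The bound-tracking idea in the description above can be illustrated with a toy model. This is a minimal sketch, not the analysis used for intt_clean.S: it assumes a Gentleman-Sande style butterfly (a sum output `a + b` and a difference output that is multiplied by a twiddle), assumes that both a reduction and a twiddle multiplication leave a value bounded by q in absolute value, and counts reductions per coefficient rather than per vector instruction. The function name `count_reductions` and the policy flag are hypothetical.

```python
Q = 3329            # ML-KEM modulus
INT16_MAX = 32767   # coefficients are held in signed 16-bit lanes
N = 256             # coefficients per polynomial


def count_reductions(per_coefficient: bool) -> tuple[int, int]:
    """Return (coefficient reductions needed, largest bound seen).

    Toy model only: bounds are upper limits on |coefficient|, and every
    reduction or twiddle multiplication is ASSUMED to bring a bound back to Q.
    """
    bounds = [Q] * N      # assumed bound on every input coefficient
    reductions = 0
    worst_seen = Q
    length = 2
    while length <= N // 2:
        if not per_coefficient:
            # Coarse policy: treat every coefficient as if it were the largest.
            bounds = [max(bounds)] * N
        new_bounds = list(bounds)
        for start in range(0, N, 2 * length):
            for j in range(start, start + length):
                a, b = bounds[j], bounds[j + length]
                if a + b > INT16_MAX:
                    # a + b (or a - b) could overflow: reduce large inputs first.
                    if a > Q:
                        a, reductions = Q, reductions + 1
                    if b > Q:
                        b, reductions = Q, reductions + 1
                new_bounds[j] = a + b          # sum output: bounds add up
                new_bounds[j + length] = Q     # assumed bound after twiddle multiply
                worst_seen = max(worst_seen, a + b)
        bounds = new_bounds
        length *= 2
    return reductions, worst_seen


coarse, coarse_max = count_reductions(per_coefficient=False)
fine, fine_max = count_reductions(per_coefficient=True)
print(f"coarse tracking: {coarse} reductions, per-coefficient tracking: {fine}")
```

In this model, tracking each coefficient's bound individually needs fewer reductions than assuming the worst case everywhere, while every intermediate value still fits in 16 bits. The figures in the PR description (4 versus 3 reduction steps) come from the real assembly, not from this sketch.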

@rod-chapman added the "benchmark" (this PR should be benchmarked in CI) and "aarch64" labels on Feb 14, 2025
@rod-chapman requested a review from a team as a code owner on February 14, 2025 18:53
@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 9680 cycles | 9594 cycles | 1.01 |
| ML-KEM-512 encaps | 11253 cycles | 11228 cycles | 1.00 |
| ML-KEM-512 decaps | 15808 cycles | 15296 cycles | 1.03 |
| ML-KEM-768 keypair | 16653 cycles | 16378 cycles | 1.02 |
| ML-KEM-768 encaps | 18007 cycles | 17908 cycles | 1.01 |
| ML-KEM-768 decaps | 23510 cycles | 23636 cycles | 0.99 |
| ML-KEM-1024 keypair | 22248 cycles | 22233 cycles | 1.00 |
| ML-KEM-1024 encaps | 24101 cycles | 24083 cycles | 1.00 |
| ML-KEM-1024 decaps | 31967 cycles | 31905 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 29500 cycles | 29453 cycles | 1.00 |
| ML-KEM-512 encaps | 34717 cycles | 35071 cycles | 0.99 |
| ML-KEM-512 decaps | 45263 cycles | 45674 cycles | 0.99 |
| ML-KEM-768 keypair | 50213 cycles | 50349 cycles | 1.00 |
| ML-KEM-768 encaps | 55526 cycles | 55719 cycles | 1.00 |
| ML-KEM-768 decaps | 70330 cycles | 70756 cycles | 0.99 |
| ML-KEM-1024 keypair | 73206 cycles | 73199 cycles | 1.00 |
| ML-KEM-1024 encaps | 81647 cycles | 82250 cycles | 0.99 |
| ML-KEM-1024 decaps | 101738 cycles | 102458 cycles | 0.99 |

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 29398 cycles | 29311 cycles | 1.00 |
| ML-KEM-512 encaps | 34683 cycles | 34724 cycles | 1.00 |
| ML-KEM-512 decaps | 44967 cycles | 45050 cycles | 1.00 |
| ML-KEM-768 keypair | 48655 cycles | 48655 cycles | 1.00 |
| ML-KEM-768 encaps | 55658 cycles | 55747 cycles | 1.00 |
| ML-KEM-768 decaps | 67588 cycles | 67507 cycles | 1.00 |
| ML-KEM-1024 keypair | 72636 cycles | 72600 cycles | 1.00 |
| ML-KEM-1024 encaps | 84691 cycles | 84719 cycles | 1.00 |
| ML-KEM-1024 decaps | 101753 cycles | 101613 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 16150 cycles | 16161 cycles | 1.00 |
| ML-KEM-512 encaps | 18798 cycles | 18387 cycles | 1.02 |
| ML-KEM-512 decaps | 24983 cycles | 24935 cycles | 1.00 |
| ML-KEM-768 keypair | 27836 cycles | 27810 cycles | 1.00 |
| ML-KEM-768 encaps | 29540 cycles | 29529 cycles | 1.00 |
| ML-KEM-768 decaps | 38906 cycles | 38947 cycles | 1.00 |
| ML-KEM-1024 keypair | 38262 cycles | 37715 cycles | 1.01 |
| ML-KEM-1024 encaps | 40720 cycles | 40713 cycles | 1.00 |
| ML-KEM-1024 decaps | 53236 cycles | 53246 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 17260 cycles | 17271 cycles | 1.00 |
| ML-KEM-512 encaps | 19051 cycles | 19067 cycles | 1.00 |
| ML-KEM-512 decaps | 24630 cycles | 24503 cycles | 1.01 |
| ML-KEM-768 keypair | 29411 cycles | 29400 cycles | 1.00 |
| ML-KEM-768 encaps | 30812 cycles | 30641 cycles | 1.01 |
| ML-KEM-768 decaps | 38582 cycles | 38593 cycles | 1.00 |
| ML-KEM-1024 keypair | 43743 cycles | 43347 cycles | 1.01 |
| ML-KEM-1024 encaps | 45090 cycles | 44937 cycles | 1.00 |
| ML-KEM-1024 decaps | 55593 cycles | 55416 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 11648 cycles | 11633 cycles | 1.00 |
| ML-KEM-512 encaps | 13328 cycles | 13296 cycles | 1.00 |
| ML-KEM-512 decaps | 18237 cycles | 18216 cycles | 1.00 |
| ML-KEM-768 keypair | 20184 cycles | 20515 cycles | 0.98 |
| ML-KEM-768 encaps | 21209 cycles | 21177 cycles | 1.00 |
| ML-KEM-768 decaps | 28439 cycles | 28411 cycles | 1.00 |
| ML-KEM-1024 keypair | 26971 cycles | 27039 cycles | 1.00 |
| ML-KEM-1024 encaps | 29009 cycles | 29022 cycles | 1.00 |
| ML-KEM-1024 decaps | 38524 cycles | 38526 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Graviton4

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 17879 cycles | 17886 cycles | 1.00 |
| ML-KEM-512 encaps | 21105 cycles | 21242 cycles | 0.99 |
| ML-KEM-512 decaps | 27750 cycles | 27950 cycles | 0.99 |
| ML-KEM-768 keypair | 30812 cycles | 30850 cycles | 1.00 |
| ML-KEM-768 encaps | 33710 cycles | 33938 cycles | 0.99 |
| ML-KEM-768 decaps | 43292 cycles | 43591 cycles | 0.99 |
| ML-KEM-1024 keypair | 44386 cycles | 44447 cycles | 1.00 |
| ML-KEM-1024 encaps | 49670 cycles | 49886 cycles | 1.00 |
| ML-KEM-1024 decaps | 62606 cycles | 62820 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 47709 cycles | 47710 cycles | 1.00 |
| ML-KEM-512 encaps | 55617 cycles | 55544 cycles | 1.00 |
| ML-KEM-512 decaps | 71793 cycles | 71776 cycles | 1.00 |
| ML-KEM-768 keypair | 76884 cycles | 76906 cycles | 1.00 |
| ML-KEM-768 encaps | 87284 cycles | 87347 cycles | 1.00 |
| ML-KEM-768 decaps | 108142 cycles | 108373 cycles | 1.00 |
| ML-KEM-1024 keypair | 112878 cycles | 112725 cycles | 1.00 |
| ML-KEM-1024 encaps | 126444 cycles | 126316 cycles | 1.00 |
| ML-KEM-1024 decaps | 153142 cycles | 152984 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 39050 cycles | 39065 cycles | 1.00 |
| ML-KEM-512 encaps | 47207 cycles | 47268 cycles | 1.00 |
| ML-KEM-512 decaps | 60795 cycles | 60849 cycles | 1.00 |
| ML-KEM-768 keypair | 63600 cycles | 63516 cycles | 1.00 |
| ML-KEM-768 encaps | 74265 cycles | 74236 cycles | 1.00 |
| ML-KEM-768 decaps | 92145 cycles | 92578 cycles | 1.00 |
| ML-KEM-1024 keypair | 94298 cycles | 94265 cycles | 1.00 |
| ML-KEM-1024 encaps | 107212 cycles | 107467 cycles | 1.00 |
| ML-KEM-1024 decaps | 129676 cycles | 129805 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 36725 cycles | 36492 cycles | 1.01 |
| ML-KEM-512 encaps | 43090 cycles | 43065 cycles | 1.00 |
| ML-KEM-512 decaps | 56132 cycles | 56111 cycles | 1.00 |
| ML-KEM-768 keypair | 59542 cycles | 59492 cycles | 1.00 |
| ML-KEM-768 encaps | 67987 cycles | 67953 cycles | 1.00 |
| ML-KEM-768 decaps | 85141 cycles | 85220 cycles | 1.00 |
| ML-KEM-1024 keypair | 88081 cycles | 88107 cycles | 1.00 |
| ML-KEM-1024 encaps | 98826 cycles | 98850 cycles | 1.00 |
| ML-KEM-1024 decaps | 120598 cycles | 120601 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Graviton3

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 18959 cycles | 18916 cycles | 1.00 |
| ML-KEM-512 encaps | 22421 cycles | 22691 cycles | 0.99 |
| ML-KEM-512 decaps | 29652 cycles | 29998 cycles | 0.99 |
| ML-KEM-768 keypair | 32341 cycles | 32374 cycles | 1.00 |
| ML-KEM-768 encaps | 35738 cycles | 36074 cycles | 0.99 |
| ML-KEM-768 decaps | 46039 cycles | 46507 cycles | 0.99 |
| ML-KEM-1024 keypair | 46651 cycles | 46587 cycles | 1.00 |
| ML-KEM-1024 encaps | 52263 cycles | 52649 cycles | 0.99 |
| ML-KEM-1024 decaps | 66119 cycles | 66591 cycles | 0.99 |

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 35973 cycles | 35951 cycles | 1.00 |
| ML-KEM-512 encaps | 41664 cycles | 40871 cycles | 1.02 |
| ML-KEM-512 decaps | 52289 cycles | 52315 cycles | 1.00 |
| ML-KEM-768 keypair | 59972 cycles | 63503 cycles | 0.94 |
| ML-KEM-768 encaps | 67073 cycles | 67650 cycles | 0.99 |
| ML-KEM-768 decaps | 81836 cycles | 81530 cycles | 1.00 |
| ML-KEM-1024 keypair | 89258 cycles | 89213 cycles | 1.00 |
| ML-KEM-1024 encaps | 98967 cycles | 99057 cycles | 1.00 |
| ML-KEM-1024 decaps | 117887 cycles | 117883 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Graviton2

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 29506 cycles | 29483 cycles | 1.00 |
| ML-KEM-512 encaps | 34722 cycles | 35063 cycles | 0.99 |
| ML-KEM-512 decaps | 45269 cycles | 45743 cycles | 0.99 |
| ML-KEM-768 keypair | 50227 cycles | 50361 cycles | 1.00 |
| ML-KEM-768 encaps | 55528 cycles | 55821 cycles | 0.99 |
| ML-KEM-768 decaps | 70343 cycles | 70723 cycles | 0.99 |
| ML-KEM-1024 keypair | 73340 cycles | 73216 cycles | 1.00 |
| ML-KEM-1024 encaps | 81677 cycles | 82269 cycles | 0.99 |
| ML-KEM-1024 decaps | 101671 cycles | 102486 cycles | 0.99 |

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 39099 cycles | 39101 cycles | 1.00 |
| ML-KEM-512 encaps | 45001 cycles | 44970 cycles | 1.00 |
| ML-KEM-512 decaps | 56876 cycles | 56873 cycles | 1.00 |
| ML-KEM-768 keypair | 64515 cycles | 64576 cycles | 1.00 |
| ML-KEM-768 encaps | 72112 cycles | 72801 cycles | 0.99 |
| ML-KEM-768 decaps | 88232 cycles | 88137 cycles | 1.00 |
| ML-KEM-1024 keypair | 96335 cycles | 96248 cycles | 1.00 |
| ML-KEM-1024 encaps | 106445 cycles | 106349 cycles | 1.00 |
| ML-KEM-1024 decaps | 127105 cycles | 127373 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 59518 cycles | 59546 cycles | 1.00 |
| ML-KEM-512 encaps | 68123 cycles | 68085 cycles | 1.00 |
| ML-KEM-512 decaps | 86773 cycles | 86727 cycles | 1.00 |
| ML-KEM-768 keypair | 99655 cycles | 99248 cycles | 1.00 |
| ML-KEM-768 encaps | 111282 cycles | 110660 cycles | 1.01 |
| ML-KEM-768 decaps | 134682 cycles | 134859 cycles | 1.00 |
| ML-KEM-1024 keypair | 148868 cycles | 149034 cycles | 1.00 |
| ML-KEM-1024 encaps | 164066 cycles | 164405 cycles | 1.00 |
| ML-KEM-1024 decaps | 195925 cycles | 195634 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 225295 cycles | 225269 cycles | 1.00 |
| ML-KEM-512 encaps | 271065 cycles | 271025 cycles | 1.00 |
| ML-KEM-512 decaps | 345546 cycles | 345491 cycles | 1.00 |
| ML-KEM-768 keypair | 372892 cycles | 372949 cycles | 1.00 |
| ML-KEM-768 encaps | 433575 cycles | 433538 cycles | 1.00 |
| ML-KEM-768 decaps | 531199 cycles | 531315 cycles | 1.00 |
| ML-KEM-1024 keypair | 554548 cycles | 554640 cycles | 1.00 |
| ML-KEM-1024 encaps | 633794 cycles | 633914 cycles | 1.00 |
| ML-KEM-1024 decaps | 755414 cycles | 756120 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 52882 cycles | 53629 cycles | 0.99 |
| ML-KEM-512 encaps | 61052 cycles | 61642 cycles | 0.99 |
| ML-KEM-512 decaps | 77703 cycles | 79021 cycles | 0.98 |
| ML-KEM-768 keypair | 89960 cycles | 91270 cycles | 0.99 |
| ML-KEM-768 encaps | 98109 cycles | 99862 cycles | 0.98 |
| ML-KEM-768 decaps | 122537 cycles | 124565 cycles | 0.98 |
| ML-KEM-1024 keypair | 135820 cycles | 135453 cycles | 1.00 |
| ML-KEM-1024 encaps | 148661 cycles | 148361 cycles | 1.00 |
| ML-KEM-1024 decaps | 181617 cycles | 181489 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 54741 cycles | 52947 cycles | 1.03 |
@oqs-bot oqs-bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 59569 cycles | 59512 cycles | 1.00 |
| ML-KEM-512 encaps | 67172 cycles | 67193 cycles | 1.00 |
| ML-KEM-512 decaps | 86262 cycles | 86365 cycles | 1.00 |
| ML-KEM-768 keypair | 101414 cycles | 101280 cycles | 1.00 |
| ML-KEM-768 encaps | 112349 cycles | 112488 cycles | 1.00 |
| ML-KEM-768 decaps | 139650 cycles | 139807 cycles | 1.00 |
| ML-KEM-1024 keypair | 153548 cycles | 153796 cycles | 1.00 |
| ML-KEM-1024 encaps | 171375 cycles | 171232 cycles | 1.00 |
| ML-KEM-1024 decaps | 207531 cycles | 210029 cycles | 0.99 |

@hanno-becker removed the review request for a team and hanno-becker on February 15, 2025 12:55
Contributor

@hanno-becker hanno-becker left a comment

@rod-chapman Good catch, this may indeed be a small improvement, but is there more work to do in this PR?

@rod-chapman
Contributor Author

Yes... is "slothy-cli" installed by the Nix environment? If not, which platform should I be running SLOTHY on?

@rod-chapman force-pushed the faster_aarch64_intt branch 3 times, most recently from cee0db3 to 918d5b2, on February 19, 2025 08:37
@rod-chapman force-pushed the faster_aarch64_intt branch 5 times, most recently from 64bd87e to 2f4c339, on February 24, 2025 13:09
@rod-chapman force-pushed the faster_aarch64_intt branch 4 times, most recently from 2523d7a to e59b18b, on March 7, 2025 14:42
@rod-chapman force-pushed the faster_aarch64_intt branch 2 times, most recently from 6553ca4 to 7baea43, on March 10, 2025 17:02
@rod-chapman added and removed the "benchmark" label (this PR should be benchmarked in CI) on Mar 10, 2025
@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Intel Xeon 4th gen (c7i)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 2b96f0c | Previous: c1cd91f | Ratio |
|---|---|---|---|
| ML-KEM-512 decaps | 15808 cycles | 15296 cycles | 1.03 |

@rod-chapman force-pushed the faster_aarch64_intt branch from 2b96f0c to 95a15a3 on March 12, 2025 09:25
Labels: aarch64, benchmark (this PR should be benchmarked in CI)
3 participants