Commit a349fe3

Update avx_512 readme
1 parent 9060497 commit a349fe3

1 file changed: +63 -4 lines changed
README_AVX512.md

@@ -2,20 +2,25 @@

The synthetic benchmark utility avx512_test runs various AVX-512 operations in a loop.

```
cd src
cp Makefile.vdf-client Makefile
make clean
make avx512_test
```

The benchmark is performed for 512-bit and 1024-bit integers. Each result is the number of clock cycles for 1000 iterations.

A single benchmark might report multiple values if there was high variance in the time measurement.

For example:

```
benchmark_512_to_avx512, :
99%: 11719
0%: 22699
0%: 40319
```

99% of the runs had an average of 11719 cycles for 1000 iterations, or about 12 cycles per operation. The other results should be ignored.
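
The reading rule above (take the dominant bucket, then divide by the 1000 iterations) can be sketched as a small parser. This is only an illustration: the helper name `dominant_cycles` is hypothetical, and it assumes the output always follows the `NN%: cycles` layout shown in the example.

```python
import re

ITERATIONS = 1000  # each reported value covers 1000 iterations (stated in the text)

def dominant_cycles(block: str) -> float:
    """Return cycles per iteration for the most frequent bucket of one benchmark block.

    Hypothetical helper, not part of avx512_test.
    """
    # Each bucket line looks like "99%: 11719": share of runs, then cycles per 1000 iterations.
    buckets = [(int(pct), int(cycles))
               for pct, cycles in re.findall(r"(\d+)%:\s*(\d+)", block)]
    if not buckets:
        raise ValueError("no 'NN%: cycles' lines found")
    # The bucket with the highest share wins; the others are treated as noise.
    _, cycles = max(buckets, key=lambda bucket: bucket[0])
    return cycles / ITERATIONS

sample = """benchmark_512_to_avx512, :
99%: 11719
0%: 22699
0%: 40319
"""
print(round(dominant_cycles(sample)))  # about 12 cycles per conversion
```

The same helper can be applied to any of the result blocks in this file to recover the rounded per-operation figures quoted in the prose.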

@@ -24,82 +29,130 @@ The avx512_test benchmark is run on a HP Notebook 14-dq1045cl laptop with turbo
The GMP integers need to be converted into AVX-512 format (52 bits per limb) before they can be operated on. This takes 12 cycles for a 512-bit integer:

benchmark_512_to_avx512, :

```
99%: 11719
0%: 22699
0%: 40319
```
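
To make the "52 bits per limb" layout concrete, here is a minimal Python sketch of the repacking. It is not the project's conversion routine: the function names are hypothetical, Python integers stand in for GMP's limb arrays, and the real code may pad the limb count to fit the vector registers. The 52-bit width matches the operand size of the AVX-512 IFMA multiply-add instructions.

```python
from typing import List

LIMB_BITS = 52                      # bits per limb in the AVX-512 format (from the text)
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs52(value: int, total_bits: int) -> List[int]:
    """Split a non-negative integer into little-endian 52-bit limbs (hypothetical helper)."""
    if value < 0 or value >> total_bits:
        raise ValueError("value does not fit in total_bits")
    n_limbs = -(-total_bits // LIMB_BITS)  # ceiling division
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n_limbs)]

def from_limbs52(limbs: List[int]) -> int:
    """Recombine 52-bit limbs into a single integer (the reverse conversion)."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

x = (1 << 512) - 1
limbs = to_limbs52(x, 512)
print(len(limbs))                   # 10, i.e. ceil(512 / 52)
assert from_limbs52(limbs) == x     # the round trip is lossless
```

Because every limb leaves 12 spare bits in a 64-bit lane, the conversion is a bit-repacking step in both directions, which is why it shows up as a separate cost in the benchmarks.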

For an addition, this conversion is done twice (once per input), and the inputs can then be added with the AVX-512 code, which takes 29 cycles:

benchmark_512_add_avx512, :

```
99%: 28838
0%: 36706
```

Finally, the result is converted back to GMP format, which takes 19 cycles:

benchmark_512_add_to_gmp, :

```
100%: 19118
```

Using GMP for the addition instead of AVX-512 takes 28 cycles:

benchmark_512_add_gmp, :

```
99%: 28084
0%: 37884
```

GMP's addition routine takes about the same time as the AVX-512 addition alone (28 versus 29 cycles), so once the conversions are included the AVX-512 code is slower if the integers are in GMP format. The AVX-512 addition code might still be useful if the integers are already stored in AVX-512 format.
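
The comparison can be made explicit with a little arithmetic. The sketch below adds up the rounded 512-bit figures quoted above; the function name is hypothetical, and the sum is only an estimate, since separately measured steps need not add up exactly when run back to back.

```python
# Rounded cycle counts per operation for 512-bit integers, taken from the results above.
TO_AVX512 = 12    # benchmark_512_to_avx512
ADD_AVX512 = 29   # benchmark_512_add_avx512
TO_GMP = 19       # benchmark_512_add_to_gmp
ADD_GMP = 28      # benchmark_512_add_gmp

def avx512_add_round_trip(to_avx512: int, add: int, to_gmp: int) -> int:
    """Estimated cost when both inputs start in GMP format and the sum returns to GMP format."""
    return 2 * to_avx512 + add + to_gmp  # two input conversions, one add, one output conversion

total = avx512_add_round_trip(TO_AVX512, ADD_AVX512, TO_GMP)
print(total, ADD_GMP)  # 72 28
```

Under these figures the round trip costs roughly 2.5 times as much as GMP's addition, while the addition step on its own is about even.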

For 1024 bits, conversion to AVX-512 format takes 17 cycles, AVX-512 addition takes 42 cycles, and conversion back to GMP format takes 24 cycles. GMP's addition routine takes 37 cycles.

benchmark_1024_to_avx512, :

```
99%: 16981
0%: 46808
```

benchmark_1024_add_avx512, :

```
99%: 41645
0%: 72434
```

benchmark_1024_add_to_gmp, :

```
99%: 24406
0%: 55040
```

benchmark_1024_add_gmp, :

```
99%: 37362
0%: 66657
```

For multiplication of two 512-bit integers with a 1024-bit result, the AVX-512 code takes 67 cycles and 24 cycles are required to convert the 1024-bit result back to GMP format:

benchmark_512_mul_avx512, :

```
100%: 67262
```

benchmark_512_mul_to_gmp, :

```
99%: 24376
0%: 35036
```

The cycle counts for converting two 512-bit integers from GMP to AVX-512 format, multiplying them, and converting the result back to GMP format add up to 12+12+67+24 = 115 cycles. However, performing these operations in a single loop takes 180 cycles:

benchmark_512_mul_avx512_gmp, :

```
100%: 180744
```

The AVX-512 multiplication with both the inputs and output in GMP format is still faster than the GMP multiplication, which takes 207 cycles:

benchmark_512_mul_gmp, :

```
100%: 207191
```

For 1024 bits, the speedup of the AVX-512 code over the GMP code is higher. The AVX-512 code takes 300 cycles with the inputs and outputs in GMP format, but the GMP code takes 676 cycles:

benchmark_1024_mul_avx512, :

```
100%: 175146
```

benchmark_1024_mul_avx512_gmp, :

```
100%: 300046
```

benchmark_1024_mul_gmp, :

```
100%: 676242
```

benchmark_1024_mul_to_gmp, :

```
99%: 36454
0%: 69371
```
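
The size of the multiplication speedup follows directly from the measured totals. The short sketch below computes it from the raw values reported above; the dictionary layout and the `speedup` helper are illustrative only.

```python
# Raw results (cycles per 1000 iterations) with inputs and outputs in GMP format.
measured = {
    512: {"avx512_gmp": 180744, "gmp": 207191},
    1024: {"avx512_gmp": 300046, "gmp": 676242},
}

def speedup(gmp_cycles: int, avx512_cycles: int) -> float:
    """How many times faster the AVX-512 path is than GMP (hypothetical helper)."""
    return gmp_cycles / avx512_cycles

for bits, result in measured.items():
    ratio = speedup(result["gmp"], result["avx512_gmp"])
    print(f"{bits}-bit multiply: {ratio:.2f}x")
# 512-bit multiply: 1.15x
# 1024-bit multiply: 2.25x
```

These ratios describe the synthetic benchmark only.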

## Performance of the VDF with AVX-512 operations

GMP multiplications are replaced with AVX-512 multiplications if "enable_avx512_ifma" is set to true in "parameters.h". The default is false. All integers are stored in GMP format.

@@ -108,22 +161,28 @@ Additions were not replaced because the AVX-512 implementation is slower than th
The benchmark was performed with this command: "taskset -c 0,1 ./vdf_bench square_asm 1000000"

AVX-512 disabled:

```
Time: 15208 ms; n_slow: 213; speed: 65.7K ips
Time: 14981 ms; n_slow: 213; speed: 66.7K ips
Time: 15160 ms; n_slow: 213; speed: 65.9K ips
Time: 14919 ms; n_slow: 213; speed: 67.0K ips
Time: 15095 ms; n_slow: 213; speed: 66.2K ips
```

AVX-512 enabled:

```
Time: 15966 ms; n_slow: 213; speed: 62.6K ips
Time: 15986 ms; n_slow: 213; speed: 62.5K ips
Time: 15910 ms; n_slow: 213; speed: 62.8K ips
Time: 16001 ms; n_slow: 213; speed: 62.4K ips
Time: 15803 ms; n_slow: 213; speed: 63.2K ips
```

Enabling AVX-512 causes a slight reduction in overall performance (roughly 5%, from about 66K to about 63K ips), even though the synthetic benchmark showed an increase in performance. This could be due to instruction cache misses, since the AVX-512 code contains no loops while the GMP code does.
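
To quantify the difference between the two sets of runs, the timings can be averaged and compared. The sketch below parses the `Time: ... ms` fields from the output shown above; the helper name `mean_time_ms` is hypothetical, and only the two log excerpts from this file are used as input.

```python
import re
from statistics import mean

def mean_time_ms(log: str) -> float:
    """Average the 'Time: N ms' values found in vdf_bench output (hypothetical helper)."""
    times = [int(t) for t in re.findall(r"Time:\s*(\d+)\s*ms", log)]
    if not times:
        raise ValueError("no 'Time: N ms' entries found")
    return mean(times)

disabled = """
Time: 15208 ms; n_slow: 213; speed: 65.7K ips
Time: 14981 ms; n_slow: 213; speed: 66.7K ips
Time: 15160 ms; n_slow: 213; speed: 65.9K ips
Time: 14919 ms; n_slow: 213; speed: 67.0K ips
Time: 15095 ms; n_slow: 213; speed: 66.2K ips
"""
enabled = """
Time: 15966 ms; n_slow: 213; speed: 62.6K ips
Time: 15986 ms; n_slow: 213; speed: 62.5K ips
Time: 15910 ms; n_slow: 213; speed: 62.8K ips
Time: 16001 ms; n_slow: 213; speed: 62.4K ips
Time: 15803 ms; n_slow: 213; speed: 63.2K ips
"""

# Relative increase in run time when AVX-512 is enabled.
slowdown = mean_time_ms(enabled) / mean_time_ms(disabled) - 1
print(f"{slowdown:.1%}")  # 5.7%
```

With five runs per configuration and non-overlapping ranges, the gap looks larger than the run-to-run variation, although this small sample does not establish its cause.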

## Documentation for newly added code

The new AVX-512 integer implementation is in "asm_avx512_ifma.h" and "avx512_integer.h".

@@ -133,4 +192,4 @@ The "add" function is used to add two AVX-512 integers. The original entry used

The "multiply" and "apply_carry" functions use similar algorithms to the implementation from the original entry.
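
The algorithm itself is not spelled out here, so the following is only a generic Python illustration of what a carry step has to do when 52-bit limbs are held in wider lanes. It is not the code in "asm_avx512_ifma.h": the names `add_limbs` and `apply_carry_sketch` are hypothetical, and the sequential loop is an assumption made for clarity.

```python
from typing import List

LIMB_BITS = 52
LIMB_MASK = (1 << LIMB_BITS) - 1

def add_limbs(a: List[int], b: List[int]) -> List[int]:
    """Lane-wise addition with no carries between limbs (models a single vector add)."""
    return [x + y for x, y in zip(a, b)]

def apply_carry_sketch(limbs: List[int]) -> List[int]:
    """Move any bits above bit 52 of each limb into the next limb (sequential sketch)."""
    out, carry = [], 0
    for limb in limbs:
        total = limb + carry
        out.append(total & LIMB_MASK)   # keep the low 52 bits in this limb
        carry = total >> LIMB_BITS      # pass the overflow on to the next limb
    if carry:
        out.append(carry)               # the result grew by one limb
    return out

# Adding 1 to a full low limb overflows it; the carry step moves the overflow up.
raw = add_limbs([LIMB_MASK, 0], [1, 0])
print(apply_carry_sketch(raw))  # [0, 1]
```

Keeping the lane-wise arithmetic separate from carry normalization is a natural split for this limb layout, because the 12 spare bits per lane can absorb overflow until the carries are resolved.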

"avx512_integer.h" contains a class which will call the assembly implementations. Only the operand sizes that are required are compiled.
