README_AVX512.md
+63-4
Original file line number
Diff line number
Diff line change
@@ -2,20 +2,25 @@
2
2
3
3
The synthetic benchmark utility avx512_test will run various AVX-512 operations in a loop.
4
4
5
+
```
5
6
cd src
6
7
cp Makefile.vdf-client Makefile
7
8
make clean
8
9
make avx512_test
10
+
```
9
11
10
12
The benchmark is run for 512-bit and 1024-bit integers. Each result is the number of clock cycles for 1000 iterations.
11
13
12
14
A single benchmark may report multiple values if the time measurement had high variance.
13
15
14
16
For example:
17
+
18
+
```
15
19
benchmark_512_to_avx512, :
16
20
99%: 11719
17
21
0%: 22699
18
22
0%: 40319
23
+
```
19
24
20
25
99% of the runs averaged 11719 cycles for 1000 iterations, or about 12 cycles per iteration. The other results are outliers and should be ignored.
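The reading rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `dominant_cycles_per_iteration` is hypothetical and not part of the repository, and it assumes the output always uses the `NN%: cycles` layout shown in the example.

```python
import re


def dominant_cycles_per_iteration(report: str, iterations: int = 1000) -> float:
    """Return cycles per iteration for the most frequent timing bucket."""
    # Each result line looks like "99%: 11719" (share of runs, cycles per batch).
    buckets = [(int(pct), int(cycles))
               for pct, cycles in re.findall(r"(\d+)%:\s*(\d+)", report)]
    if not buckets:
        raise ValueError("no 'NN%: cycles' lines found")
    # Keep the bucket covering the largest share of runs; ignore the outliers.
    _, cycles = max(buckets, key=lambda bucket: bucket[0])
    return cycles / iterations


sample = """benchmark_512_to_avx512, :
99%: 11719
0%: 22699
0%: 40319
"""
print(dominant_cycles_per_iteration(sample))  # 11.719
```

Rounding the result gives the 12 cycles quoted for this benchmark.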
21
26
@@ -24,82 +29,130 @@ The avx512_test benchmark is run on a HP Notebook 14-dq1045cl laptop with turbo
24
29
The GMP integers need to be converted into AVX-512 format (52 bits per limb) before they can be operated on. This takes 12 cycles for a 512-bit integer:
25
30
26
31
benchmark_512_to_avx512, :
32
+
33
+
```
27
34
99%: 11719
28
35
0%: 22699
29
36
0%: 40319
37
+
```
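To make the format conversion concrete, here is a minimal Python model of splitting an integer into 52-bit limbs and joining them again. It is a sketch of the data layout only, not the code in the repository: the function names are hypothetical, and Python integers stand in for both the GMP value and the vector lanes.

```python
LIMB_BITS = 52
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value: int, total_bits: int) -> list[int]:
    """Split a non-negative integer into 52-bit limbs, least significant first."""
    if value < 0 or value >> total_bits:
        raise ValueError("value does not fit in total_bits")
    num_limbs = -(-total_bits // LIMB_BITS)  # ceiling division
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(num_limbs)]


def from_limbs(limbs: list[int]) -> int:
    """Recombine 52-bit limbs into a single integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


x = (1 << 512) - 1
limbs = to_limbs(x, 512)
print(len(limbs))  # 10
assert from_limbs(limbs) == x
```

A 512-bit integer needs 10 limbs of 52 bits, and a 1024-bit integer needs 20, because the limb size does not divide the operand size evenly.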
30
38
31
39
For an addition, this is done two times and the inputs can then be added with the AVX-512 code, which takes 29 cycles:
32
40
33
41
benchmark_512_add_avx512, :
42
+
43
+
```
34
44
99%: 28838
35
45
0%: 36706
46
+
```
36
47
37
48
Finally, the result is converted back to GMP format, which takes 19 cycles:
38
49
39
50
benchmark_512_add_to_gmp, :
51
+
52
+
```
40
53
100%: 19118
54
+
```
41
55
42
56
Using GMP for the addition instead of AVX-512 takes 28 cycles:
43
57
44
58
benchmark_512_add_gmp, :
59
+
60
+
```
45
61
99%: 28084
46
62
0%: 37884
63
+
```
47
64
48
65
GMP's addition routine takes about the same time as the AVX-512 addition alone, so once the conversions to and from AVX-512 format are included, the AVX-512 path is slower when the integers are in GMP format. The AVX-512 addition code might still be useful if the integers are already stored in AVX-512 format.
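The comparison can be worked through with the rounded 512-bit cycle counts quoted above. The sketch assumes the phases simply add up with no overlap, which is an assumption and was not measured as a single loop for addition.

```python
# Rounded cycle counts per operation, taken from the 512-bit results above.
TO_AVX512 = 12   # GMP -> AVX-512 conversion, needed once per input
ADD_AVX512 = 29  # AVX-512 addition
TO_GMP = 19      # AVX-512 -> GMP conversion of the result
ADD_GMP = 28     # GMP addition

# Two input conversions, one addition, one conversion of the result.
round_trip = 2 * TO_AVX512 + ADD_AVX512 + TO_GMP
print(round_trip)  # 72
print(f"{round_trip / ADD_GMP:.1f}x the GMP cost")
```

Under that assumption the full round trip costs about 72 cycles against 28 for GMP, so the conversions dominate the cost of an addition.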
49
66
50
67
For 1024 bits, conversion to AVX-512 format takes 17 cycles, AVX-512 addition takes 42 cycles, and conversion back to GMP format takes 24 cycles. GMP's addition routine takes 37 cycles.
51
68
52
69
benchmark_1024_to_avx512, :
70
+
71
+
```
53
72
99%: 16981
54
73
0%: 46808
74
+
```
55
75
56
76
benchmark_1024_add_avx512, :
77
+
78
+
```
57
79
99%: 41645
58
80
0%: 72434
81
+
```
59
82
60
83
benchmark_1024_add_to_gmp, :
84
+
85
+
```
61
86
99%: 24406
62
87
0%: 55040
88
+
```
63
89
64
90
benchmark_1024_add_gmp, :
91
+
92
+
```
65
93
99%: 37362
66
94
0%: 66657
95
+
```
67
96
68
-
For multiplication of two 512-bit integers with a 1024-bit result, the AVX-512 code takes 67 cycles and 24 cycles are required to conver the 1024-bit result back to GMP format:
97
+
For multiplication of two 512-bit integers with a 1024-bit result, the AVX-512 code takes 67 cycles and 24 cycles are required to convert the 1024-bit result back to GMP format:
69
98
70
99
benchmark_512_mul_avx512, :
100
+
101
+
```
71
102
100%: 67262
103
+
```
72
104
73
105
benchmark_512_mul_to_gmp, :
106
+
107
+
```
74
108
99%: 24376
75
109
0%: 35036
110
+
```
76
111
77
112
The cycle counts for converting two 512-bit integers from GMP to AVX-512 format, multiplying them, and converting the 1024-bit result back to GMP format add up to 12+12+67+24 = 115 cycles. However, performing these operations in a single loop takes 180 cycles:
78
113
79
114
benchmark_512_mul_avx512_gmp, :
115
+
116
+
```
80
117
100%: 180744
118
+
```
81
119
82
120
The AVX-512 multiplication with both the inputs and output in GMP format is still faster than the GMP multiplication, which takes 207 cycles:
83
121
84
122
benchmark_512_mul_gmp, :
123
+
124
+
```
85
125
100%: 207191
126
+
```
86
127
87
128
For 1024 bits, the speedup of the AVX-512 code over the GMP code is higher. The AVX-512 code takes 300 cycles with the inputs and outputs in GMP format, but the GMP code takes 676 cycles:
88
129
89
130
benchmark_1024_mul_avx512, :
131
+
132
+
```
90
133
100%: 175146
134
+
```
91
135
92
136
benchmark_1024_mul_avx512_gmp, :
137
+
138
+
```
93
139
100%: 300046
140
+
```
94
141
95
142
benchmark_1024_mul_gmp, :
143
+
144
+
```
96
145
100%: 676242
146
+
```
97
147
98
148
benchmark_1024_mul_to_gmp, :
149
+
150
+
```
99
151
99%: 36454
100
152
0%: 69371
153
+
```
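The speedups implied by the multiplication results can be computed directly from the measured values. The sketch below only restates the numbers printed above; the dictionary layout is an illustration.

```python
# Cycles for 1000 iterations with inputs and outputs in GMP format.
cycles_per_1000 = {
    512: {"avx512_gmp": 180744, "gmp": 207191},
    1024: {"avx512_gmp": 300046, "gmp": 676242},
}


def speedup(bits: int) -> float:
    """Ratio of GMP time to AVX-512 time; above 1.0 means AVX-512 is faster."""
    row = cycles_per_1000[bits]
    return row["gmp"] / row["avx512_gmp"]


for bits in (512, 1024):
    print(f"{bits}-bit: {speedup(bits):.2f}x")
# 512-bit: 1.15x
# 1024-bit: 2.25x
```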
101
154
102
-
# Performance of the VDF with AVX-512 operations
155
+
## Performance of the VDF with AVX-512 operations
103
156
104
157
GMP multiplications are replaced with AVX-512 multiplications if "enable_avx512_ifma" is set to true in "parameters.h". The default is false. All integers are stored in GMP format.
105
158
@@ -108,22 +161,28 @@ Additions were not replaced because the AVX-512 implementation is slower than th
108
161
The benchmark was performed with this command: "taskset -c 0,1 ./vdf_bench square_asm 1000000"
109
162
110
163
AVX-512 disabled:
164
+
165
+
```
111
166
Time: 15208 ms; n_slow: 213; speed: 65.7K ips
112
167
Time: 14981 ms; n_slow: 213; speed: 66.7K ips
113
168
Time: 15160 ms; n_slow: 213; speed: 65.9K ips
114
169
Time: 14919 ms; n_slow: 213; speed: 67.0K ips
115
170
Time: 15095 ms; n_slow: 213; speed: 66.2K ips
171
+
```
116
172
117
173
AVX-512 enabled:
174
+
175
+
```
118
176
Time: 15966 ms; n_slow: 213; speed: 62.6K ips
119
177
Time: 15986 ms; n_slow: 213; speed: 62.5K ips
120
178
Time: 15910 ms; n_slow: 213; speed: 62.8K ips
121
179
Time: 16001 ms; n_slow: 213; speed: 62.4K ips
122
180
Time: 15803 ms; n_slow: 213; speed: 63.2K ips
181
+
```
123
182
124
183
Enabling AVX-512 causes a slight reduction in the overall performance, even though the synthetic benchmark showed an increase in performance. This could be due to instruction cache misses, since the AVX-512 code doesn't have loops and the GMP code does.
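The size of the reduction can be estimated by averaging the five runs of each configuration. This is a quick calculation over the times listed above, not an additional measurement.

```python
from statistics import mean

# Wall-clock times in milliseconds from the two vdf_bench listings above.
disabled_ms = [15208, 14981, 15160, 14919, 15095]
enabled_ms = [15966, 15986, 15910, 16001, 15803]

# Relative increase in mean run time when AVX-512 is enabled.
slowdown_pct = (mean(enabled_ms) / mean(disabled_ms) - 1) * 100
print(f"{slowdown_pct:.1f}% slower with AVX-512 enabled")  # 5.7% slower with AVX-512 enabled
```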
125
184
126
-
# Documentation for newly added code
185
+
## Documentation for newly added code
127
186
128
187
The new AVX-512 integer implementation is in "asm_avx512_ifma.h" and "avx512_integer.h".
129
188
@@ -133,4 +192,4 @@ The "add" function is used to add two AVX-512 integers. The original entry used
133
192
134
193
The "multiply" and "apply_carry" functions use similar algorithms to the implementation from the original entry.
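As a rough illustration of the general technique, the following Python model multiplies two integers held as 52-bit limbs and then normalizes the result. It mirrors how the AVX-512 IFMA instructions `vpmadd52luq` and `vpmadd52huq` add the low and high 52 bits of each 104-bit product into 64-bit accumulators. The function names are hypothetical, and the sketch is not taken from "asm_avx512_ifma.h", whose exact scheduling and carry handling may differ.

```python
LIMB_BITS = 52
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value: int, num_limbs: int) -> list[int]:
    """Split an integer into 52-bit limbs, least significant first."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(num_limbs)]


def from_limbs(limbs: list[int]) -> int:
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def multiply_model(a: list[int], b: list[int]) -> list[int]:
    """Schoolbook multiply; accumulators are left unnormalized."""
    acc = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            product = x * y                         # up to 104 bits
            acc[i + j] += product & LIMB_MASK       # low half (vpmadd52luq)
            acc[i + j + 1] += product >> LIMB_BITS  # high half (vpmadd52huq)
    return acc


def propagate_carries(acc: list[int]) -> list[int]:
    """Reduce every accumulator back to 52 bits, carrying upward."""
    out, carry = [], 0
    for value in acc:
        value += carry
        out.append(value & LIMB_MASK)
        carry = value >> LIMB_BITS
    return out


a = (1 << 512) - 12345
b = (1 << 511) + 6789
acc = multiply_model(to_limbs(a, 10), to_limbs(b, 10))
assert max(acc) < (1 << 64)  # sums still fit in a 64-bit lane
assert from_limbs(propagate_carries(acc)) == a * b
```

Deferring the carries is what leaves 12 spare bits in each 64-bit lane: many partial products can be accumulated before a single normalization pass is needed.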
135
194
136
-
"avx512_integer.h" contains a class which will call the assembly implementations. Only the operand sizes that are required are compiled.
195
+
"avx512_integer.h" contains a class which will call the assembly implementations. Only the operand sizes that are required are compiled.