Background
@yingsu00 laid out the initial design for fast C++ bit unpacking (#2353). Based on that, we would like to add an AVX-512 path to further accelerate bit unpacking for Parquet. In this doc we introduce the design of the AVX-512 path.
Based on the Intel Intrinsics Guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html), AVX-512 introduces many intrinsics that can be used in bit unpacking. While AVX-512 is available on more and more platforms, some of the intrinsics need AVX512_VBMI support, which is only available on Intel Ice Lake (ICX) and newer platforms. Compile-time checks ensure the compilation will not break on platforms without those flags.
Similar to the AVX2 design, the AVX-512 path also depends on the input bit width and the output type:
- uint8_t
- uint16_t
- uint32_t
Implementation Details
Input [1 - 7] to output uint8_t:
Generally, use _mm256_maskz_expandloadu_epi8 and _mm256_multishift_epi64_epi8 together to load the input buffer and shift with masks, processing 32 values per loop iteration. For example, for input bit width 3, we expand-load the input buffer with a mask of 0x07070707 and use a shift mask of { 0, 3, 6, 9, 12, 15, 18, 21, 0, 3, 6, 9, 12, 15, 18, 21, 0, 3, 6, 9, 12, 15, 18, 21, 0, 3, 6, 9, 12, 15, 18, 21 } to shift each value to the correct position, then store to the result buffer. A sketch is shown below.
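The following is a minimal sketch of this kernel for bit width 3, assuming AVX512VL, AVX512_VBMI, and AVX512_VBMI2 are available. The loop shape, the final AND with 0x07, and the tail handling are illustrative and may differ from the actual Velox implementation.

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch: unpack 3-bit values to uint8_t, 32 values per iteration.
void unpack3to8(const uint8_t* input, uint64_t numValues, uint8_t* output) {
  // 12 packed bytes (32 values * 3 bits) are expand-loaded so that each
  // 64-bit lane holds 8 values in its low 24 bits.
  constexpr __mmask32 kLoadMask = 0x07070707;
  // Control bytes {0, 3, 6, 9, 12, 15, 18, 21} per lane: the bit offset of
  // each value inside its 64-bit lane.
  const __m256i shiftCtrl = _mm256_set1_epi64x(0x15120F0C09060300ll);
  const __m256i valueMask = _mm256_set1_epi8(0x07);
  for (uint64_t i = 0; i + 32 <= numValues; i += 32) {
    __m256i data = _mm256_maskz_expandloadu_epi8(kLoadMask, input);
    // Each output byte receives 8 bits of its lane starting at the control
    // offset; the low 3 bits are the unpacked value, the rest is cleared.
    __m256i shifted = _mm256_multishift_epi64_epi8(shiftCtrl, data);
    __m256i values = _mm256_and_si256(shifted, valueMask);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(output), values);
    input += 12; // 32 * 3 bits consumed
    output += 32;
  }
  // Remaining values fall back to the AVX2 / naive path (not shown).
}
```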
For bit width 1, we can process 64 values each time by using _mm512_movm_epi8 to convert the 64 input bits into a 512-bit register (0 --> 0x00, 1 --> 0xFF), then use _mm512_abs_epi8 to map 0x00 --> 0x00 and 0xFF --> 0x01, and store back to the result. A sketch follows below.
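A minimal sketch of the bit width 1 case, assuming AVX512BW; reading the 64 input bits via memcpy is an illustrative choice:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Sketch: unpack 1-bit values to uint8_t, 64 values per iteration.
void unpack1to8(const uint8_t* input, uint64_t numValues, uint8_t* output) {
  for (uint64_t i = 0; i + 64 <= numValues; i += 64) {
    uint64_t bits;
    std::memcpy(&bits, input, sizeof(bits));
    __m512i bytes = _mm512_movm_epi8(bits); // 0 -> 0x00, 1 -> 0xFF
    __m512i ones = _mm512_abs_epi8(bytes);  // 0xFF (-1) -> 0x01
    _mm512_storeu_si512(output, ones);
    input += 8;
    output += 64;
  }
}
```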
Input [8] to output uint8_t is a direct memcpy.
Input [1 - 7] to output uint16_t:
Generally, use _mm256_maskz_expandloadu_epi8 and _mm256_multishift_epi64_epi8 as in the uint8_t output path, but add one more step, _mm512_cvtepu8_epi16, to widen the bytes, as sketched below.
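A sketch of the extra step for bit width 3, assuming the same setup as the uint8_t sketch above (input, output, and shiftCtrl as defined there):

```cpp
// Same expand-load and multishift as the uint8_t path...
__m256i data = _mm256_maskz_expandloadu_epi8(0x07070707, input);
__m256i bytes = _mm256_and_si256(
    _mm256_multishift_epi64_epi8(shiftCtrl, data), _mm256_set1_epi8(0x07));
// ...plus one widening step: 32 uint8_t -> 32 uint16_t.
__m512i words = _mm512_cvtepu8_epi16(bytes);
_mm512_storeu_si512(output, words);
```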
Input [8] to output uint16_t: Call _mm512_cvtepu8_epi16 directly to extend the bit width.
Input [9 - 15] to output uint16_t:
The algorithm is more complex than for the smaller bit widths. It is based on the Intel open-source QPL library (https://github.com/intel/qpl), which supports similar unpacking functionality. First use _mm512_permutexvar_epi16 to permute the elements with even indexes and the ones with odd indexes separately. Use _mm512_srlv_epi32 and _mm512_sllv_epi32 to shift the elements so they start at the beginning of the word. Then use _mm512_mask_mov_epi16 to merge the even and odd elements together. A structural sketch is shown below.
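The permute indexes and shift counts depend on the bit width and come from precomputed tables in the real code; the sketch below only shows the structure of one iteration, with those tables passed in as hypothetical parameters:

```cpp
#include <immintrin.h>
#include <cstdint>

// Structural sketch of one iteration of the QPL-style unpack for 9-15 bit
// inputs; the permute/shift tables are bit-width specific and hypothetical.
__m512i unpackOddEven16(
    const uint8_t* input,
    __m512i evenPermuteIdx,
    __m512i oddPermuteIdx,
    __m512i evenShift,
    __m512i oddShift,
    uint16_t bitMask) {
  __m512i src = _mm512_loadu_si512(input);
  // Bring the 16-bit words holding even-indexed and odd-indexed packed
  // elements into their target dwords.
  __m512i even = _mm512_permutexvar_epi16(evenPermuteIdx, src);
  __m512i odd = _mm512_permutexvar_epi16(oddPermuteIdx, src);
  // Align each element to the start of its 16-bit slot within the dword.
  even = _mm512_srlv_epi32(even, evenShift);
  odd = _mm512_sllv_epi32(odd, oddShift);
  // Interleave: odd-indexed elements fill the odd 16-bit lanes.
  __m512i merged = _mm512_mask_mov_epi16(even, 0xAAAAAAAA, odd);
  // Clear the bits above each value.
  return _mm512_and_si512(merged, _mm512_set1_epi16(bitMask));
}
```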
Input [16] to output uint16_t is a direct memcpy.
Input [1 - 4] to output uint32_t:
Similar to uint8_t, but here we do not need _mm256_maskz_expandloadu_epi8; a broadcast with _mm512_set1_epi64 is enough, since we now only process 16 values each time and never need to load more than 64 bits. Then use _mm256_multishift_epi64_epi8 to shift the loaded bits. A sketch is shown below.
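A minimal sketch for bit width 4, assuming AVX512VL + AVX512_VBMI; the multishift is shown on a 128-bit register here for brevity, while the actual code may keep the data in wider registers:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Sketch: unpack 4-bit values to uint32_t, 16 values per iteration.
void unpack4to32(const uint8_t* input, uint64_t numValues, uint32_t* output) {
  // Bit offsets of values 0..7 (low lane) and 8..15 (high lane).
  const __m128i shiftCtrl = _mm_set_epi8(
      60, 56, 52, 48, 44, 40, 36, 32, // lane 1: values 8..15
      28, 24, 20, 16, 12, 8, 4, 0);   // lane 0: values 0..7
  const __m128i valueMask = _mm_set1_epi8(0x0F);
  for (uint64_t i = 0; i + 16 <= numValues; i += 16) {
    uint64_t word;
    std::memcpy(&word, input, sizeof(word)); // 16 * 4 bits = 64 bits
    // Broadcast the packed qword, then pull each value into its own byte.
    __m128i data = _mm_set1_epi64x(static_cast<long long>(word));
    __m128i bytes = _mm_and_si128(
        _mm_multishift_epi64_epi8(shiftCtrl, data), valueMask);
    // Widen the 16 bytes to 16 uint32_t and store.
    _mm512_storeu_si512(output, _mm512_cvtepu8_epi32(bytes));
    input += 8;
    output += 16;
  }
}
```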
Input [5 - 7] to output uint32_t:
Copy 2 * bitWidth bytes into 2 integers, 8 values each. Then call _pdep_u64 on each of the integers to deposit the values from bitWidth bits to 8 bits wide, and store the 8 * 2 output bytes to a piece of memory aligned to 16 bytes. Then use _mm512_cvtepu8_epi32 to widen these values to 32 bits and store them back to memory. A sketch follows below.
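A sketch for bit width 5, assuming BMI2 (_pdep_u64) plus AVX-512F; the scratch buffer mirrors the 16-byte-aligned store described above:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Sketch: unpack 5-bit values to uint32_t, 16 values per iteration.
void unpack5to32(const uint8_t* input, uint64_t numValues, uint32_t* output) {
  constexpr int kBitWidth = 5;
  // Deposit pattern for _pdep_u64: the low 5 bits of every byte.
  constexpr uint64_t kDepositMask = 0x1F1F1F1F1F1F1F1Full;
  alignas(16) uint8_t widened[16];
  for (uint64_t i = 0; i + 16 <= numValues; i += 16) {
    uint64_t lo = 0, hi = 0;
    std::memcpy(&lo, input, kBitWidth);             // values 0..7
    std::memcpy(&hi, input + kBitWidth, kBitWidth); // values 8..15
    // Spread each 5-bit value into its own byte.
    uint64_t d0 = _pdep_u64(lo, kDepositMask);
    uint64_t d1 = _pdep_u64(hi, kDepositMask);
    std::memcpy(widened, &d0, 8);
    std::memcpy(widened + 8, &d1, 8);
    // Widen the 16 bytes to 16 uint32_t and store.
    __m128i bytes = _mm_load_si128(reinterpret_cast<const __m128i*>(widened));
    _mm512_storeu_si512(output, _mm512_cvtepu8_epi32(bytes));
    input += 2 * kBitWidth; // 16 * 5 bits = 10 bytes
    output += 16;
  }
}
```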
Input [8] to output uint32_t: Call _mm512_cvtepu8_epi32 directly to extend the bit width.
Input [9 - 15] to output uint32_t:
Use _mm256_maskz_expandloadu_epi8, _mm256_shuffle_epi8, and _mm256_multishift_epi64_epi8. This is similar to input [1 - 4], but one more shuffle is needed since the input bit widths are larger than 8.
Input [16] to output uint32_t: Call _mm512_cvtepu16_epi32 directly to extend the bit width.
Input [17 - 31] to output uint32_t:
Similar to input [9 - 15] to output uint16_t, the algorithm is based on the Intel open-source QPL library. Use _mm512_permutexvar_epi16 to permute the elements with even and odd indexes, _mm512_srlv_epi32 and _mm512_sllv_epi32 to shift the elements, and _mm512_mask_mov_epi16 to merge the even and odd elements together.
Input [32] to output uint32_t is a direct memcpy.
Boundary handling
Reuse the AVX2 and unpackNaive code to handle the remaining elements.
Other Considerations
The choices of function overloading vs. templates and class vs. global functions in utility headers basically follow the AVX2 solution. For function pointers vs. switches, we compared the performance of the two approaches: they are almost the same, but switch cases are slightly better in general.
Prefetch
Using prefetch for the output buffer gives an extra perf gain: top-down microarchitecture analysis showed that bit unpacking with AVX-512 is memory bound, store bound specifically, which might be due to AVX-512's store latency. A rough illustration follows below.
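A rough sketch of issuing a prefetch for the output buffer ahead of the 512-bit stores; the prefetch distance of 128 bytes is an assumption, not the value tuned in Velox:

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch: prefetch the output a fixed distance ahead of each wide store.
void storeWithPrefetch(const __m512i* values, uint64_t numVectors, uint32_t* output) {
  for (uint64_t i = 0; i < numVectors; ++i) {
    _mm_prefetch(reinterpret_cast<const char*>(output) + 128, _MM_HINT_T0);
    _mm512_storeu_si512(output, values[i]);
    output += 16; // 16 uint32_t per 512-bit store
  }
}
```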
Benchmark Results
The following benchmark results are from BitUnpackBenchmark.cpp for unpacking 8M values. The code was compiled by Ubuntu clang version 12.0.0-3ubuntu1~20.04.5 and run on an Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (Ice Lake). The results are in microseconds (us) or milliseconds (ms).
| Output Bit Width | Input Bit Width | Velox AVX2 | Velox AVX-512 | Speedup ratio |
| --- | --- | --- | --- | --- |
| 8 | 1 | 403.69us | 334.41us | 1.21x |
| 8 | 2 | 425.02us | 370.87us | 1.15x |
| 8 | 3 | 459.34us | 411.88us | 1.12x |
| 8 | 4 | 490.61us | 442.35us | 1.11x |
| 8 | 5 | 511.18us | 474.38us | 1.08x |
| 8 | 6 | 536.59us | 507.04us | 1.06x |
| 8 | 7 | 565.43us | 542.74us | 1.04x |
| 8 | 8 | 585.75us | 555.51us | 1.05x |
| 16 | 1 | 675.32us | 638.09us | 1.06x |
| 16 | 2 | 709.71us | 671.05us | 1.06x |
| 16 | 3 | 744.21us | 707.89us | 1.05x |
| 16 | 4 | 769.49us | 740.88us | 1.04x |
| 16 | 5 | 841.84us | 775.09us | 1.09x |
| 16 | 6 | 871.24us | 809.47us | 1.08x |
| 16 | 7 | 1.05ms | 849.84us | 1.23x |
| 16 | 8 | 905.23us | 901.96us | 1.00x |
| 16 | 9 | 1.36ms | 930.92us | 1.46x |
| 16 | 10 | 1.20ms | 973.46us | 1.24x |
| 16 | 11 | 1.60ms | 1.01ms | 1.59x |
| 16 | 12 | 1.29ms | 1.05ms | 1.22x |
| 16 | 13 | 1.87ms | 1.09ms | 1.71x |
| 16 | 14 | 1.70ms | 1.14ms | 1.49x |
| 16 | 15 | 1.47ms | 1.17ms | 1.26x |
| 16 | 16 | 1.03ms | 1.03ms | 1.00x |
| 32 | 1 | 1.77ms | 1.34ms | 1.32x |
| 32 | 2 | 1.81ms | 1.37ms | 1.32x |
| 32 | 3 | 1.87ms | 1.43ms | 1.31x |
| 32 | 4 | 1.92ms | 1.47ms | 1.30x |
| 32 | 5 | 1.97ms | 1.57ms | 1.26x |
| 32 | 6 | 2.04ms | 1.63ms | 1.25x |
| 32 | 7 | 2.11ms | 1.72ms | 1.22x |
| 32 | 8 | 2.19ms | 1.68ms | 1.30x |
| 32 | 9 | 2.06ms | 1.76ms | 1.17x |
| 32 | 10 | 2.12ms | 1.82ms | 1.16x |
| 32 | 11 | 2.24ms | 1.88ms | 1.20x |
| 32 | 13 | 2.36ms | 2.06ms | 1.15x |
| 32 | 15 | 2.59ms | 2.18ms | 1.19x |
| 32 | 17 | 2.94ms | 2.27ms | 1.29x |
| 32 | 19 | 3.04ms | 2.43ms | 1.25x |
| 32 | 21 | 3.14ms | 2.57ms | 1.22x |
| 32 | 24 | 3.26ms | 2.79ms | 1.17x |
| 32 | 28 | 3.64ms | 3.09ms | 1.18x |
| 32 | 30 | 3.75ms | 3.25ms | 1.16x |
| 32 | 32 | 2.13ms | 2.13ms | 1.00x |
Unpack Selectively
We would like to support unpacking non-contiguous elements as well, since in real scenarios scan and filter can happen together with filter pushdown. However, an initial benchmark comparing the AVX2 and AVX-512 solutions showed that the perf gain is very limited. We conducted some VTune and top-down microarchitecture analysis for odd-row bit unpacking (semi-dense): the AVX-512 gather instruction latency appears to be relatively high, with a CPI 2x higher than the AVX2 gather. Additionally, TMA showed that the bottleneck mainly lies in being memory bound, specifically L1 cache bound. Prefetching may not help in such cases since the non-contiguous rows are accessed unpredictably.