Background
@yingsu00 laid out the initial design for fast C++ bit unpacking (#2353). Based on that, we would like to add an AVX-512 path to further accelerate bit unpacking for Parquet. In this doc we introduce the design of the AVX-512 path.
Based on the Intel Intrinsics Guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html), AVX-512 introduces many intrinsics that can be used in bit unpacking. While AVX-512 is available on more and more platforms, some of the intrinsics need AVX512_VBMI support, which is only available on Intel Ice Lake (ICX) and newer platforms. Compile-time checks ensure the compilation will not break on platforms without those flags.
Similar to the AVX2 design, the AVX-512 path also depends on the input bit width and the output type:
- uint8_t
- uint16_t
- uint32_t
Implementation Details
Input [1 - 7] to output uint8_t:
Generally, use _mm256_maskz_expandloadu_epi8 and _mm256_multishift_epi64_epi8 together to load the input buffer and shift with masks, processing 32 values per loop iteration. For example, for input bit width 3, we expand-load the input buffer with a mask of 0x07070707 and use a shift mask of { 0, 3, 6, 9, 12, 15, 18, 21, 0, 3, 6, 9, 12, 15, 18, 21, 0, 3, 6, 9, 12, 15, 18, 21, 0, 3, 6, 9, 12, 15, 18, 21 } to shift each value to the correct position, then store to the result buffer. A sketch is shown below.
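The following is a minimal sketch of this kernel for bit width 3, assuming AVX512VL, AVX512_VBMI, and AVX512_VBMI2 are available. The loop shape, the final AND with 0x07, and the tail handling are illustrative and may differ from the actual Velox implementation.

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch: unpack 3-bit values to uint8_t, 32 values per iteration.
void unpack3to8(const uint8_t* input, uint64_t numValues, uint8_t* output) {
  // 12 packed bytes (32 values * 3 bits) are expand-loaded so that each
  // 64-bit lane holds 8 values in its low 24 bits.
  constexpr __mmask32 kLoadMask = 0x07070707;
  // Control bytes {0, 3, 6, 9, 12, 15, 18, 21} per lane: the bit offset of
  // each value inside its 64-bit lane.
  const __m256i shiftCtrl = _mm256_set1_epi64x(0x15120F0C09060300ll);
  const __m256i valueMask = _mm256_set1_epi8(0x07);
  for (uint64_t i = 0; i + 32 <= numValues; i += 32) {
    __m256i data = _mm256_maskz_expandloadu_epi8(kLoadMask, input);
    // Each output byte receives 8 bits of its lane starting at the control
    // offset; the low 3 bits are the unpacked value, the rest is cleared.
    __m256i shifted = _mm256_multishift_epi64_epi8(shiftCtrl, data);
    __m256i values = _mm256_and_si256(shifted, valueMask);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(output), values);
    input += 12; // 32 * 3 bits consumed
    output += 32;
  }
  // Remaining values fall back to the AVX2 / naive path (not shown).
}
```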
For bit width 1, we can process 64 values each time by using _mm512_movm_epi8 to convert the 64 input bits into a 512-bit register (0 --> 0x00, 1 --> 0xFF), then use _mm512_abs_epi8 to map 0x00 --> 0x00 and 0xFF --> 0x01, and store back to the result. A sketch follows below.
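A minimal sketch of the bit width 1 case, assuming AVX512BW; reading the 64 input bits via memcpy is an illustrative choice:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Sketch: unpack 1-bit values to uint8_t, 64 values per iteration.
void unpack1to8(const uint8_t* input, uint64_t numValues, uint8_t* output) {
  for (uint64_t i = 0; i + 64 <= numValues; i += 64) {
    uint64_t bits;
    std::memcpy(&bits, input, sizeof(bits));
    __m512i bytes = _mm512_movm_epi8(bits); // 0 -> 0x00, 1 -> 0xFF
    __m512i ones = _mm512_abs_epi8(bytes);  // 0xFF (-1) -> 0x01
    _mm512_storeu_si512(output, ones);
    input += 8;
    output += 64;
  }
}
```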
Input [8] to output uint8_t is a direct memcpy.
Input [1 - 7] to output uint16_t:
Generally, use _mm256_maskz_expandloadu_epi8 and _mm256_multishift_epi64_epi8 as in the uint8_t output path, but add one more step, _mm512_cvtepu8_epi16, to widen the bytes, as sketched below.
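A sketch of the extra step for bit width 3, assuming the same setup as the uint8_t sketch above (input, output, and shiftCtrl as defined there):

```cpp
// Same expand-load and multishift as the uint8_t path...
__m256i data = _mm256_maskz_expandloadu_epi8(0x07070707, input);
__m256i bytes = _mm256_and_si256(
    _mm256_multishift_epi64_epi8(shiftCtrl, data), _mm256_set1_epi8(0x07));
// ...plus one widening step: 32 uint8_t -> 32 uint16_t.
__m512i words = _mm512_cvtepu8_epi16(bytes);
_mm512_storeu_si512(output, words);
```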
Input [8] to output uint16_t: Call _mm512_cvtepu8_epi16 directly to extend the bit width.
Input [9 - 15] to output uint16_t:
The algorithm is more complex than for the smaller bit widths. It is based on the Intel open-source QPL library (https://github.com/intel/qpl), which supports similar unpacking functionality. First use _mm512_permutexvar_epi16 to permute the elements with even indexes and the ones with odd indexes separately. Use _mm512_srlv_epi32 and _mm512_sllv_epi32 to shift the elements so they start at the beginning of the word. Then use _mm512_mask_mov_epi16 to merge the even and odd elements together. A structural sketch is shown below.
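The permute indexes and shift counts depend on the bit width and come from precomputed tables in the real code; the sketch below only shows the structure of one iteration, with those tables passed in as hypothetical parameters:

```cpp
#include <immintrin.h>
#include <cstdint>

// Structural sketch of one iteration of the QPL-style unpack for 9-15 bit
// inputs; the permute/shift tables are bit-width specific and hypothetical.
__m512i unpackOddEven16(
    const uint8_t* input,
    __m512i evenPermuteIdx,
    __m512i oddPermuteIdx,
    __m512i evenShift,
    __m512i oddShift,
    uint16_t bitMask) {
  __m512i src = _mm512_loadu_si512(input);
  // Bring the 16-bit words holding even-indexed and odd-indexed packed
  // elements into their target dwords.
  __m512i even = _mm512_permutexvar_epi16(evenPermuteIdx, src);
  __m512i odd = _mm512_permutexvar_epi16(oddPermuteIdx, src);
  // Align each element to the start of its 16-bit slot within the dword.
  even = _mm512_srlv_epi32(even, evenShift);
  odd = _mm512_sllv_epi32(odd, oddShift);
  // Interleave: odd-indexed elements fill the odd 16-bit lanes.
  __m512i merged = _mm512_mask_mov_epi16(even, 0xAAAAAAAA, odd);
  // Clear the bits above each value.
  return _mm512_and_si512(merged, _mm512_set1_epi16(bitMask));
}
```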
Input [16] to output uint16_t is a direct memcpy.
Input [1 - 4] to output uint32_t:
Similar to uint8_t, but here we do not need _mm256_maskz_expandloadu_epi8; a broadcast with _mm512_set1_epi64 is enough, since we now only process 16 values each time and never need to load more than 64 bits. Then use _mm256_multishift_epi64_epi8 to shift the loaded bits. A sketch is shown below.
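A minimal sketch for bit width 4, assuming AVX512VL + AVX512_VBMI; the multishift is shown on a 128-bit register here for brevity, while the actual code may keep the data in wider registers:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Sketch: unpack 4-bit values to uint32_t, 16 values per iteration.
void unpack4to32(const uint8_t* input, uint64_t numValues, uint32_t* output) {
  // Bit offsets of values 0..7 (low lane) and 8..15 (high lane).
  const __m128i shiftCtrl = _mm_set_epi8(
      60, 56, 52, 48, 44, 40, 36, 32, // lane 1: values 8..15
      28, 24, 20, 16, 12, 8, 4, 0);   // lane 0: values 0..7
  const __m128i valueMask = _mm_set1_epi8(0x0F);
  for (uint64_t i = 0; i + 16 <= numValues; i += 16) {
    uint64_t word;
    std::memcpy(&word, input, sizeof(word)); // 16 * 4 bits = 64 bits
    // Broadcast the packed qword, then pull each value into its own byte.
    __m128i data = _mm_set1_epi64x(static_cast<long long>(word));
    __m128i bytes = _mm_and_si128(
        _mm_multishift_epi64_epi8(shiftCtrl, data), valueMask);
    // Widen the 16 bytes to 16 uint32_t and store.
    _mm512_storeu_si512(output, _mm512_cvtepu8_epi32(bytes));
    input += 8;
    output += 16;
  }
}
```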
Input [5 - 7] to output uint32_t:
Copy 2 * bitWidth bytes into 2 integers, 8 values each. Then call _pdep_u64 on each of the integers to deposit the values from bitWidth bits to 8 bits wide, and store the 8 * 2 output bytes to a piece of memory aligned to 16 bytes. Then use _mm512_cvtepu8_epi32 to widen these values to 32 bits and store them back to memory. A sketch follows below.
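A sketch for bit width 5, assuming BMI2 (_pdep_u64) plus AVX-512F; the scratch buffer mirrors the 16-byte-aligned store described above:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Sketch: unpack 5-bit values to uint32_t, 16 values per iteration.
void unpack5to32(const uint8_t* input, uint64_t numValues, uint32_t* output) {
  constexpr int kBitWidth = 5;
  // Deposit pattern for _pdep_u64: the low 5 bits of every byte.
  constexpr uint64_t kDepositMask = 0x1F1F1F1F1F1F1F1Full;
  alignas(16) uint8_t widened[16];
  for (uint64_t i = 0; i + 16 <= numValues; i += 16) {
    uint64_t lo = 0, hi = 0;
    std::memcpy(&lo, input, kBitWidth);             // values 0..7
    std::memcpy(&hi, input + kBitWidth, kBitWidth); // values 8..15
    // Spread each 5-bit value into its own byte.
    uint64_t d0 = _pdep_u64(lo, kDepositMask);
    uint64_t d1 = _pdep_u64(hi, kDepositMask);
    std::memcpy(widened, &d0, 8);
    std::memcpy(widened + 8, &d1, 8);
    // Widen the 16 bytes to 16 uint32_t and store.
    __m128i bytes = _mm_load_si128(reinterpret_cast<const __m128i*>(widened));
    _mm512_storeu_si512(output, _mm512_cvtepu8_epi32(bytes));
    input += 2 * kBitWidth; // 16 * 5 bits = 10 bytes
    output += 16;
  }
}
```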
Input [8] to output uint32_t: Call _mm512_cvtepu8_epi32 directly to extend the bit width.
Input [9 - 15] to output uint32_t:
Use _mm256_maskz_expandloadu_epi8, _mm256_shuffle_epi8, and _mm256_multishift_epi64_epi8. This is similar to input [1 - 4], but one more shuffle is needed since the input bit widths are larger than 8.
Input [16] to output uint32_t: Call _mm512_cvtepu16_epi32 directly to extend the bit width.
Input [17 - 31] to output uint32_t:
Similar to input [9 - 15] to output uint16_t, the algorithm is based on the Intel open-source QPL library. Use _mm512_permutexvar_epi16 to permute the elements with even and odd indexes, _mm512_srlv_epi32 and _mm512_sllv_epi32 to shift the elements, and _mm512_mask_mov_epi16 to merge the even and odd elements together.
Input [32] to output uint32_t is a direct memcpy.
Boundary handling
Reuse the AVX2 and unpackNaive code to handle the remaining elements.
Other Considerations
The choices of function overloading vs. templates and class vs. global functions in utility headers basically follow the AVX2 solution. For function pointers vs. switches, we compared the performance of the two approaches: they are almost the same, but switch cases are slightly better in general.
Prefetch
Using prefetch for the output buffer gives an extra perf gain: top-down microarchitecture analysis showed that bit unpacking with AVX-512 is memory bound, store bound specifically, which might be due to AVX-512's store latency. A rough illustration follows below.
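A rough sketch of issuing a prefetch for the output buffer ahead of the 512-bit stores; the prefetch distance of 128 bytes is an assumption, not the value tuned in Velox:

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch: prefetch the output a fixed distance ahead of each wide store.
void storeWithPrefetch(const __m512i* values, uint64_t numVectors, uint32_t* output) {
  for (uint64_t i = 0; i < numVectors; ++i) {
    _mm_prefetch(reinterpret_cast<const char*>(output) + 128, _MM_HINT_T0);
    _mm512_storeu_si512(output, values[i]);
    output += 16; // 16 uint32_t per 512-bit store
  }
}
```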
Benchmark Results
The following benchmark results are from BitUnpackBenchmark.cpp for unpacking 8M values. The code was compiled by Ubuntu clang version 12.0.0-3ubuntu1~20.04.5 and run on an Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (Ice Lake). The results are in microseconds (us) or milliseconds (ms).
| Output Bit Width | Input Bit Width | Velox AVX2 | Velox AVX-512 | Speedup ratio |
| --- | --- | --- | --- | --- |
| 8 | 1 | 403.69us | 334.41us | 1.21x |
| 8 | 2 | 425.02us | 370.87us | 1.15x |
| 8 | 3 | 459.34us | 411.88us | 1.12x |
| 8 | 4 | 490.61us | 442.35us | 1.11x |
| 8 | 5 | 511.18us | 474.38us | 1.08x |
| 8 | 6 | 536.59us | 507.04us | 1.06x |
| 8 | 7 | 565.43us | 542.74us | 1.04x |
| 8 | 8 | 585.75us | 555.51us | 1.05x |
| 16 | 1 | 675.32us | 638.09us | 1.06x |
| 16 | 2 | 709.71us | 671.05us | 1.06x |
| 16 | 3 | 744.21us | 707.89us | 1.05x |
| 16 | 4 | 769.49us | 740.88us | 1.04x |
| 16 | 5 | 841.84us | 775.09us | 1.09x |
| 16 | 6 | 871.24us | 809.47us | 1.08x |
| 16 | 7 | 1.05ms | 849.84us | 1.23x |
| 16 | 8 | 905.23us | 901.96us | 1.00x |
| 16 | 9 | 1.36ms | 930.92us | 1.46x |
| 16 | 10 | 1.20ms | 973.46us | 1.24x |
| 16 | 11 | 1.60ms | 1.01ms | 1.59x |
| 16 | 12 | 1.29ms | 1.05ms | 1.22x |
| 16 | 13 | 1.87ms | 1.09ms | 1.71x |
| 16 | 14 | 1.70ms | 1.14ms | 1.49x |
| 16 | 15 | 1.47ms | 1.17ms | 1.26x |
| 16 | 16 | 1.03ms | 1.03ms | 1.00x |
| 32 | 1 | 1.77ms | 1.34ms | 1.32x |
| 32 | 2 | 1.81ms | 1.37ms | 1.32x |
| 32 | 3 | 1.87ms | 1.43ms | 1.31x |
| 32 | 4 | 1.92ms | 1.47ms | 1.30x |
| 32 | 5 | 1.97ms | 1.57ms | 1.26x |
| 32 | 6 | 2.04ms | 1.63ms | 1.25x |
| 32 | 7 | 2.11ms | 1.72ms | 1.22x |
| 32 | 8 | 2.19ms | 1.68ms | 1.30x |
| 32 | 9 | 2.06ms | 1.76ms | 1.17x |
| 32 | 10 | 2.12ms | 1.82ms | 1.16x |
| 32 | 11 | 2.24ms | 1.88ms | 1.20x |
| 32 | 13 | 2.36ms | 2.06ms | 1.15x |
| 32 | 15 | 2.59ms | 2.18ms | 1.19x |
| 32 | 17 | 2.94ms | 2.27ms | 1.29x |
| 32 | 19 | 3.04ms | 2.43ms | 1.25x |
| 32 | 21 | 3.14ms | 2.57ms | 1.22x |
| 32 | 24 | 3.26ms | 2.79ms | 1.17x |
| 32 | 28 | 3.64ms | 3.09ms | 1.18x |
| 32 | 30 | 3.75ms | 3.25ms | 1.16x |
| 32 | 32 | 2.13ms | 2.13ms | 1.00x |
Unpack Selectively
We would like to support unpacking non-contiguous elements as well, since in real scenarios scan and filter can happen together with filter pushdown. However, an initial benchmark comparing the AVX2 and AVX-512 solutions showed that the perf gain is very limited. We conducted some VTune and top-down microarchitecture analysis for odd-row bit unpacking (semi-dense): the AVX-512 gather instruction latency appears to be relatively high, with a CPI 2x higher than the AVX2 gather. Additionally, TMA showed that the bottleneck mainly lies in being memory bound, specifically L1 cache bound. Prefetching may not help in such cases since the non-contiguous rows are accessed unpredictably.