crc32: optimize CRC32C computation on arm64 using 6-way parallel processing #79047
zwtao40 wants to merge 1 commit into golang:master
This change optimizes CRC32C calculation on arm64 architecture by extending the original single-lane CRC32 computation to six parallel lanes. Each lane operates independently without data dependencies. After all six lanes complete their computations, the intermediate results are merged using carry-less multiplication instructions for subsequent iterations until the termination condition is reached. This approach fully utilizes computational resources and improves instruction-level parallelism.

Performance benchmark on Huawei Kunpeng 920:
4K: 15GB/s -> 34GB/s (126% improvement)
8K: 14GB/s -> 35GB/s (150% improvement)

Signed-off-by: zhuwentao <1357420890@qq.com>
This PR (HEAD: de67d11) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/772322.
Message from Gopher Robot:
Patch Set 1:
(1 comment)
Please don’t reply on this GitHub thread. Visit golang.org/cl/772322.
Message from Gopher Robot:
Patch Set 1:
Congratulations on opening your first change. Thank you for your contribution!
Next steps: Most changes in the Go project go through a few rounds of revision. This can be …
During May-July and Nov-Jan the Go project is in a code freeze, during which …
Please don’t reply on this GitHub thread. Visit golang.org/cl/772322.
Message from 祝文涛:
Patch Set 1:
(2 comments)
Please don’t reply on this GitHub thread. Visit golang.org/cl/772322.
This PR optimizes CRC32C calculation on arm64 by processing six lanes in parallel. On a Huawei Kunpeng 920, throughput rises from 14-15 GB/s to 34-35 GB/s for 4K and 8K buffers.
Background
The current CRC32C implementation on arm64 computes the checksum in a single lane, so every CRC32C instruction takes the result of the previous one as its input. Because the instruction has a latency of several cycles, this dependency chain prevents modern ARM64 processors from overlapping the instructions in the pipeline and becomes the bottleneck when data is processed sequentially.
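The dependency chain described above can be illustrated in plain Go. This is only a sketch built on the public `hash/crc32` API, not the arm64 assembly touched by this PR; the helper names `serialCRC32C` and `oneShotCRC32C` and the 8-byte step size are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is the table for the CRC32C (Castagnoli) polynomial.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// serialCRC32C (hypothetical helper) folds the buffer into the checksum
// 8 bytes at a time. Each Update call needs the crc produced by the
// previous call, which is the single-lane data dependency.
func serialCRC32C(data []byte) uint32 {
	var crc uint32
	for len(data) > 0 {
		n := 8
		if len(data) < n {
			n = len(data)
		}
		crc = crc32.Update(crc, castagnoli, data[:n])
		data = data[n:]
	}
	return crc
}

// oneShotCRC32C (hypothetical helper) is the reference result.
func oneShotCRC32C(data []byte) uint32 {
	return crc32.Checksum(data, castagnoli)
}

func main() {
	data := []byte("123456789")
	// 0xe3069283 is the standard CRC-32C check value for "123456789".
	fmt.Printf("%#x %#x\n", serialCRC32C(data), oneShotCRC32C(data))
}
```

Both calls return the same value; the point is only that the loop body cannot start its next step until `crc` from the previous step is available.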
Implementation
The optimization extends the original single-lane CRC32 computation to six parallel lanes:
6-way parallel lanes: Each lane operates independently without data dependencies, allowing the processor to schedule multiple CRC32C instructions concurrently.
Carry-less multiplication for merging: After all six lanes finish their computations, the intermediate results are merged using VPMULL (carry-less multiplication) instructions. The merged value feeds the next iteration, and this repeats until the loop's termination condition is reached.
Threshold-based dispatch: The parallel path is used only for inputs of at least 1024 bytes; smaller buffers stay on the existing single-lane path.
Loop unrolling: The inner loop is unrolled 4 times and consumes 64 bytes per lane per iteration, which gives 6 × 64 = 384 bytes per loop cycle across all 6 lanes.
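The lane-and-merge idea in the list above can be sketched in portable Go. This is not the assembly from the PR: it splits the buffer into six contiguous chunks rather than interleaving 384-byte blocks, runs the lanes one after another, and replaces VPMULL with a bit-by-bit GF(2) multiplication. The names `mulmod`, `xpow8n`, `combine`, `laneCRC32C`, and `referenceCRC32C` are hypothetical; only the lane count and the 1024-byte threshold come from the description.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const (
	lanes     = 6    // lane count from the PR description
	threshold = 1024 // dispatch threshold from the PR description
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// mulmod multiplies two polynomials over GF(2) modulo the CRC32C
// polynomial in bit-reflected form (bit 31 holds the x^0 coefficient).
// It is a software stand-in for carry-less multiplication plus reduction.
func mulmod(a, b uint32) uint32 {
	var p uint32
	for m := uint32(1) << 31; m != 0; m >>= 1 {
		if a&m != 0 {
			p ^= b
		}
		if b&1 != 0 { // multiply b by x and reduce
			b = (b >> 1) ^ crc32.Castagnoli
		} else {
			b >>= 1
		}
	}
	return p
}

// xpow8n returns x^(8n) mod P, the operator that advances a CRC past
// n bytes of following data.
func xpow8n(n int) uint32 {
	result := uint32(1) << 31 // x^0
	base := uint32(1) << 30   // x^1
	for i := 0; i < 3; i++ {
		base = mulmod(base, base) // x^2, x^4, x^8
	}
	for ; n > 0; n >>= 1 {
		if n&1 != 0 {
			result = mulmod(result, base)
		}
		base = mulmod(base, base)
	}
	return result
}

// combine returns CRC(A||B) from CRC(A), CRC(B) and len(B).
func combine(crcA, crcB uint32, lenB int) uint32 {
	return mulmod(xpow8n(lenB), crcA) ^ crcB
}

// laneCRC32C checksums each lane independently and merges the results.
func laneCRC32C(data []byte) uint32 {
	if len(data) < threshold {
		return crc32.Checksum(data, castagnoli) // small input: single lane
	}
	chunk := len(data) / lanes
	var crc uint32
	for i := 0; i < lanes; i++ {
		part := data[i*chunk : (i+1)*chunk]
		if i == lanes-1 {
			part = data[i*chunk:] // last lane takes the remainder
		}
		crc = combine(crc, crc32.Checksum(part, castagnoli), len(part))
	}
	return crc
}

func referenceCRC32C(data []byte) uint32 {
	return crc32.Checksum(data, castagnoli)
}

func main() {
	buf := make([]byte, 4096)
	for i := range buf {
		buf[i] = byte(i)
	}
	fmt.Println(laneCRC32C(buf) == referenceCRC32C(buf))
}
```

The sketch only shows why independent lanes can be merged exactly. The speedup in the PR comes from issuing the six CRC32C chains concurrently in assembly, which sequential Go calls do not reproduce.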
Technical Details
Performance Benchmark
Tested on Huawei Kunpeng 920 (ARMv8.2-A):
4K: 15 GB/s -> 34 GB/s (+126%)
8K: 14 GB/s -> 35 GB/s (+150%)
The optimization raises throughput by roughly 2.3x at 4K (34/15) and 2.5x at 8K (35/14).
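For readers who want comparable numbers on their own hardware, the following is a minimal sketch of a throughput measurement for the two buffer sizes above. It is not the benchmark used for the figures in this PR; it measures whatever `hash/crc32` implementation the local Go toolchain provides, and the helpers `benchCRC32C`, `gbPerSec`, and `measure` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"testing"
)

// sink keeps the checksum result observable so the call is not discarded.
var sink uint32

// benchCRC32C returns a benchmark body that checksums a buffer of size bytes.
func benchCRC32C(size int) func(b *testing.B) {
	table := crc32.MakeTable(crc32.Castagnoli)
	buf := make([]byte, size)
	for i := range buf {
		buf[i] = byte(i)
	}
	return func(b *testing.B) {
		b.SetBytes(int64(size))
		for i := 0; i < b.N; i++ {
			sink = crc32.Checksum(buf, table)
		}
	}
}

// gbPerSec converts a benchmark result to GB/s (10^9 bytes per second).
func gbPerSec(r testing.BenchmarkResult) float64 {
	if r.T <= 0 {
		return 0
	}
	return float64(r.Bytes) * float64(r.N) / r.T.Seconds() / 1e9
}

// measure runs the benchmark once for the given buffer size.
func measure(size int) float64 {
	testing.Init() // registers testing flags; no effect if already called
	return gbPerSec(testing.Benchmark(benchCRC32C(size)))
}

func main() {
	for _, size := range []int{4 << 10, 8 << 10} {
		fmt.Printf("%dK: %.2f GB/s\n", size>>10, measure(size))
	}
}
```

Running it with a toolchain built before and after the change gives a before/after pair per size, from which the ratio can be computed the same way as in the lines above.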
Compatibility
Testing
Updates #79052