crc32: optimize CRC32C computation on arm64 using 6-way parallel processing #79047

Open
zwtao40 wants to merge 1 commit into golang:master from zwtao40:dev_aarch64_crc32c_optimize

zwtao40 commented Apr 30, 2026

This PR optimizes CRC32C calculation on arm64 by implementing 6-way parallel processing, which more than doubles throughput in the benchmark reported below.

Background

The current CRC32C implementation on arm64 computes the checksum in a single lane, so each CRC32C instruction depends on the result of the previous one. Because the instruction has a latency of several cycles, this dependency chain cannot fully use the pipeline parallelism of modern ARM64 processors and becomes the bottleneck when data is processed sequentially.

Implementation

The optimization extends the original single-lane CRC32 computation to six parallel lanes:

  1. 6-way parallel lanes: Each lane operates independently without data dependencies, allowing the processor to schedule multiple CRC32C instructions concurrently.

  2. Carry-less multiplication for merging: After all six lanes complete their computations, the intermediate results are merged using VPMULL (carry-less multiplication) instructions. The merged result feeds the subsequent iteration, and the loop repeats until the termination condition is reached.

  3. Threshold-based dispatch: The parallel path is activated for data sizes >= 1024 bytes. Smaller buffers keep using the original sequential path, so they do not pay the setup and merge cost of the parallel lanes.

  4. Loop unrolling: The inner loop processes 4 iterations (64 bytes per lane per iteration), totaling 384 bytes per loop cycle across all 6 lanes.

Technical Details

  • Uses registers R9, R1-R5 for 6 parallel CRC32C accumulators
  • Pre-computed constants (R1-R5) for carry-less multiplication merging
  • Leverages ARM64 NEON VPMULL instruction for efficient result combination
  • Processes 1024 bytes per large_loop iteration before merging

Performance Benchmark

Tested on Huawei Kunpeng 920 (ARMv8.2-A):

4K: 15 GB/s -> 34 GB/s (+126%)
8K: 14 GB/s -> 35 GB/s (+150%)

In these measurements the optimization improves throughput by approximately 2.3x at 4K and 2.5x at 8K.

Compatibility

  • Requires ARM64 architecture with CRC32 and NEON extensions
  • Falls back to the original sequential path for buffers < 1024 bytes
  • No changes to API or behavior, fully backward compatible

Testing

  • All existing crc32 tests pass
  • Benchmark results are reproducible across multiple runs
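A cross-check of this kind can be sketched as follows. It is not one of the existing crc32 tests: the names `crc32cBitwise` and `crossCheck` are hypothetical, and the chosen lengths simply straddle the 1024-byte threshold at several start offsets so that both the sequential and the parallel path would be exercised on arm64.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// crc32cBitwise is a deliberately slow bit-at-a-time CRC32C reference
// (reflected polynomial 0x82F63B78, initial value and final XOR 0xFFFFFFFF).
func crc32cBitwise(p []byte) uint32 {
	crc := ^uint32(0)
	for _, b := range p {
		crc ^= uint32(b)
		for i := 0; i < 8; i++ {
			if crc&1 != 0 {
				crc = (crc >> 1) ^ 0x82F63B78
			} else {
				crc >>= 1
			}
		}
	}
	return ^crc
}

// crossCheck compares hash/crc32 against the reference for lengths around
// the dispatch threshold and for several start offsets.
func crossCheck() error {
	table := crc32.MakeTable(crc32.Castagnoli)
	buf := make([]byte, 8192+64)
	for i := range buf {
		buf[i] = byte(i*131 + 17)
	}
	for _, n := range []int{0, 1, 63, 1023, 1024, 1025, 4096, 8192} {
		for off := 0; off < 8; off++ {
			p := buf[off : off+n]
			got, want := crc32.Checksum(p, table), crc32cBitwise(p)
			if got != want {
				return fmt.Errorf("len=%d off=%d: got %08x, want %08x", n, off, got, want)
			}
		}
	}
	return nil
}

func main() {
	if err := crossCheck(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("all lengths and offsets match the bitwise reference")
}
```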

Updates #79052

…essing

This change optimizes CRC32C calculation on arm64 by extending the original single-lane CRC32 computation to six parallel lanes. Each lane operates independently without data dependencies. After all six lanes complete their computations, the intermediate results are merged using carry-less multiplication instructions. The merged result feeds the subsequent iteration, and the loop repeats until the termination condition is reached.

This approach makes better use of the processor's execution resources by increasing instruction-level parallelism.

Performance benchmark on Huawei Kunpeng 920:

  4K: 15GB/s -> 34GB/s (126% improvement)

  8K: 14GB/s -> 35GB/s (150% improvement)

Signed-off-by: zhuwentao <1357420890@qq.com>
@gopherbot
Contributor

This PR (HEAD: de67d11) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/go/+/772322.

Important tips:

  • Don't comment on this PR. All discussion takes place in Gerrit.
  • You need a Gmail or other Google account to log in to Gerrit.
  • To change your code in response to feedback:
    • Push a new commit to the branch used by your GitHub PR.
    • A new "patch set" will then appear in Gerrit.
    • Respond to each comment by marking as Done in Gerrit if implemented as suggested. You can alternatively write a reply.
    • Critical: you must click the blue Reply button near the top to publish your Gerrit responses.
    • Multiple commits in the PR will be squashed by GerritBot.
  • The title and description of the GitHub PR are used to construct the final commit message.
    • Edit these as needed via the GitHub web interface (not via Gerrit or git).
    • You should word wrap the PR description at ~76 characters unless you need longer lines (e.g., for tables or URLs).
  • See the Sending a change via GitHub and Reviews sections of the Contribution Guide as well as the FAQ for details.

@gopherbot
Contributor

Message from Gopher Robot:

Patch Set 1:

(1 comment)


Please don’t reply on this GitHub thread. Visit golang.org/cl/772322.
After addressing review feedback, remember to publish your drafts!

@gopherbot
Contributor

Message from Gopher Robot:

Patch Set 1:

Congratulations on opening your first change. Thank you for your contribution!

Next steps:
A maintainer will review your change and provide feedback. See
https://go.dev/doc/contribute#review for more info and tips to get your
patch through code review.

Most changes in the Go project go through a few rounds of revision. This can be
surprising to people new to the project. The careful, iterative review process
is our way of helping mentor contributors and ensuring that their contributions
have a lasting impact.

During May-July and Nov-Jan the Go project is in a code freeze, during which
little code gets reviewed or merged. If a reviewer responds with a comment like
R=go1.11 or adds a tag like "wait-release", it means that this CL will be
reviewed as part of the next development cycle. See https://go.dev/s/release
for more details.



@gopherbot
Contributor

Message from 祝文涛:

Patch Set 1:

(2 comments)


