cpu: rv64: add rvv inner product implementation using rvv intrinsics #4115
Description
This PR introduces an optimized inner product (ip) primitive for RV64 architectures using RVV (RISC-V Vector) intrinsics. The current `rvv_inner_product` implementation focuses on 2-D plain input layouts. This initial version provides:
`f32`, `s32`, `s8`, `u8`
`f32:f32`, `s8:s8` and `u8:s8`
Implementation Details
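As a rough illustration of the two stages named in this section, the sketch below shows one way a vector dot product followed by a scalar finalize step could be structured for the `f32:f32` case with an `s8` destination. It is not the code from this PR: the function names `compute_ip_f32`, `finalize_to_s8` and `ip_forward_s8` are hypothetical, the RVV path assumes the v1.0 intrinsics naming (`__riscv_` prefix) and is only compiled when `__riscv_vector` is defined, and rounding is simplified to round-half-away-from-zero.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Stage 1 (hypothetical stand-in for compute_ip_rvv, f32:f32 only):
 * dot product of one src row with one weight row of length ic. */
static float compute_ip_f32(const float *src, const float *wei, size_t ic) {
#if defined(__riscv_vector)
    size_t vlmax = __riscv_vsetvlmax_e32m1();
    vfloat32m1_t acc = __riscv_vfmv_v_f_f32m1(0.0f, vlmax);
    for (size_t i = 0; i < ic;) {
        size_t vl = __riscv_vsetvl_e32m1(ic - i);
        vfloat32m1_t s = __riscv_vle32_v_f32m1(src + i, vl);
        vfloat32m1_t w = __riscv_vle32_v_f32m1(wei + i, vl);
        /* Tail-undisturbed so lanes beyond vl keep their partial sums. */
        acc = __riscv_vfmacc_vv_f32m1_tu(acc, s, w, vl);
        i += vl;
    }
    /* Reduce the per-lane partial sums to a single scalar. */
    vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, vlmax);
    vfloat32m1_t sum = __riscv_vfredusum_vs_f32m1_f32m1(acc, zero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(sum);
#else
    /* Portable fallback so the sketch also builds on non-RVV targets. */
    float acc = 0.0f;
    for (size_t i = 0; i < ic; ++i) acc += src[i] * wei[i];
    return acc;
#endif
}

/* Stage 2 (hypothetical stand-in for finalize_ip_acc, s8 dst only):
 * add bias, clamp to the s8 range, then convert.
 * Assumption: round-half-away-from-zero, chosen for simplicity. */
static int8_t finalize_to_s8(float acc, float bias) {
    float v = acc + bias;
    if (v > 127.0f) v = 127.0f;
    if (v < -128.0f) v = -128.0f;
    return (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Driver over a 2-D plain layout: src is mb x ic, wei is oc x ic,
 * dst is mb x oc, all row-major and contiguous. */
static void ip_forward_s8(const float *src, const float *wei,
                          const float *bias, int8_t *dst,
                          size_t mb, size_t ic, size_t oc) {
    for (size_t m = 0; m < mb; ++m)
        for (size_t o = 0; o < oc; ++o) {
            float acc = compute_ip_f32(src + m * ic, wei + o * ic, ic);
            dst[m * oc + o] = finalize_to_s8(acc, bias ? bias[o] : 0.0f);
        }
}
```

Keeping the accumulation in vector registers and reducing once per output element is the design point this sketch tries to show; the integer paths (`s8:s8`, `u8:s8`) would need widening multiply-accumulate instead and are not covered here.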
`src` and `weight` if they are non-contiguous in memory using `scratchpad`
method `compute_ip_rvv` and `finalize_ip_acc`:
`compute_ip_rvv`: Do the vector computation using different RVV intrinsics for the various src/weight data type combinations
`finalize_ip_acc`: After the vector computation, get the scalar result and add bias (post-ops will be handled here in the future), then clamp `dst` and convert to the required data type
Checklist
General
`make test` and `make test_benchdnn_*`) pass locally for each commit?
Performance improvements
All experiments are performed on an SG2044 platform including:
`benchdnn` inputs of alexnet / resnet_50 / rnn_t / transformer_lt
`--mode=p --dir=FWD_B`
We compare three methods: a first baseline using the scalar implementation, a second baseline using compiler auto-vectorization, and our method using the RVV intrinsics implementation.
`ref_gemm_inner_product` implementation compiled by gcc 14.2 with `-march=rv64gc -O3`
`ref_gemm_inner_product` implementation compiled by gcc 14.2 with `-march=rv64gcv -O3 -ftree-vectorize`
`rvv_inner_product` implementation compiled by gcc 14.2 with `-march=rv64gc -O3`
Results
On average, the RVV intrinsics implementation achieves a 1.49x speedup over the scalar method and a 1.28x speedup over compiler auto-vectorization.
In the best test case, `--dt=u8:s8:s32 mb4ic1024oc4096n"Transformer_lt:Decoder_MM_7*240"`, the RVV intrinsics implementation achieves a 9.55x speedup over the scalar method and a 17.01x speedup over compiler auto-vectorization.
Detailed results are as follows: