@zhangjian29 zhangjian29 commented Oct 11, 2025

Description

This PR introduces an optimized inner product (ip) primitive for RV64 architectures using RVV (RISC-V Vector) intrinsics. The current rvv_inner_product implementation focuses on 2D inputs in plain layouts.

This initial version provides:

  1. Supported memory layouts: plain, dense, no inner blocks
  2. Supported dst data types: f32, s32, s8, u8
  3. Supported source/weights: 2D only, with data type combinations of f32:f32, s8:s8 and u8:s8
  4. Supported bias computation
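The support matrix above could be expressed as a simple applicability check. The following is a minimal sketch only: `data_type`, `ip_desc_sketch` and `rvv_ip_supported_sketch` are hypothetical names made up for illustration and are not the oneDNN API. It also assumes that every listed dst type is accepted with every supported source/weights pair, which the description does not state explicitly.

```cpp
// Hypothetical, simplified stand-ins for illustration; not oneDNN types.
enum class data_type { f32, s32, s8, u8 };

struct ip_desc_sketch {
    int ndims; // number of dimensions of source/weights
    data_type src, wei, dst;
    bool plain_dense; // plain layout, dense, no inner blocks
};

// Returns true when the descriptor matches the configurations listed above.
inline bool rvv_ip_supported_sketch(const ip_desc_sketch &d) {
    if (d.ndims != 2 || !d.plain_dense) return false;

    // Source/weights combinations: f32:f32, s8:s8, u8:s8.
    const bool f32_f32 = d.src == data_type::f32 && d.wei == data_type::f32;
    const bool s8_s8 = d.src == data_type::s8 && d.wei == data_type::s8;
    const bool u8_s8 = d.src == data_type::u8 && d.wei == data_type::s8;
    if (!(f32_f32 || s8_s8 || u8_s8)) return false;

    // Assumption: any of the four listed dst types is valid for any pair.
    return d.dst == data_type::f32 || d.dst == data_type::s32
            || d.dst == data_type::s8 || d.dst == data_type::u8;
}
```

A dispatcher would typically fall back to a reference implementation whenever such a check returns false.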

Implementation Details

  • Memory Reorder: Reorder src and weights into a scratchpad if they are non-contiguous in memory
  • Separate Computations: Split the inner product computation into compute_ip_rvv and finalize_ip_acc:
    • compute_ip_rvv: Perform the vector computation, using different RVV intrinsics for each src/weights data type combination
    • finalize_ip_acc: After the vector computation, reduce to the scalar result and add bias (post-ops will be applied here in the future), then clamp dst and convert it to the required data type
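The steps above can be illustrated with a portable scalar sketch. This is not the PR's code: the function names `reorder_to_dense`, `compute_ip_sketch` and `finalize_ip_sketch` are hypothetical, the accumulation loop is a plain scalar stand-in for the RVV intrinsics, and rounding with `std::nearbyint` before integer conversion is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical helper: copy a strided 2D tensor into a dense row-major
// scratchpad so the inner loop can read contiguous memory.
template <typename T>
std::vector<T> reorder_to_dense(const T *data, std::size_t rows,
        std::size_t cols, std::size_t row_stride, std::size_t col_stride) {
    std::vector<T> dense(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dense[r * cols + c] = data[r * row_stride + c * col_stride];
    return dense;
}

// Stage 1 (scalar stand-in for compute_ip_rvv): dot product of one src row
// and one weights row, accumulated in a wider type acc_t.
template <typename acc_t, typename src_t, typename wei_t>
acc_t compute_ip_sketch(const src_t *src, const wei_t *wei, std::size_t ic) {
    acc_t acc = 0;
    for (std::size_t i = 0; i < ic; ++i)
        acc += static_cast<acc_t>(src[i]) * static_cast<acc_t>(wei[i]);
    return acc;
}

// Stage 2 (stand-in for finalize_ip_acc): add bias, then for integer dst
// types clamp to the representable range and round before converting.
template <typename dst_t, typename acc_t>
dst_t finalize_ip_sketch(acc_t acc, float bias) {
    double v = static_cast<double>(acc) + static_cast<double>(bias);
    if (std::is_integral<dst_t>::value) {
        const double lo
                = static_cast<double>(std::numeric_limits<dst_t>::lowest());
        const double hi
                = static_cast<double>(std::numeric_limits<dst_t>::max());
        // Assumption: round using the current rounding mode.
        v = std::nearbyint(std::min(std::max(v, lo), hi));
    }
    return static_cast<dst_t>(v);
}
```

For a u8:s8 case, one would call `compute_ip_sketch<int32_t>(src_row, wei_row, ic)` and pass the result to `finalize_ip_sketch<uint8_t>(acc, bias)`. Keeping the two stages separate means only the first stage needs a per-data-type vector kernel, while bias, clamping and conversion stay in one shared place.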

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

All experiments were run on an SG2044 platform with the following setup:

  • Test Cases: benchdnn inputs of alexnet / resnet_50 / rnn_t / transformer_lt
  • Test Dtypes: f32:f32:f32 / s8:s8:f32 / u8:s8:s32
  • Test Args: --mode=p --dir=FWD_B

We compare three methods: a scalar baseline, a compiler auto-vectorization baseline, and our RVV intrinsic implementation.

  1. Scalar: ref_gemm_inner_product implementation compiled by gcc 14.2 with -march=rv64gc -O3
  2. Auto Vectorization: ref_gemm_inner_product implementation compiled by gcc 14.2 with -march=rv64gcv -O3 -ftree-vectorize
  3. RVV Intrinsic: Our rvv_inner_product implementation compiled by gcc 14.2 with -march=rv64gc -O3

Results

Overall, measured by total runtime across all test cases, the RVV intrinsic implementation achieves a 1.49x speedup over the scalar method and a 1.28x speedup over compiler auto-vectorization.

In the best test case, --dt=u8:s8:s32 mb4ic1024oc4096n"Transformer_lt:Decoder_MM_7*240", RVV intrinsics achieve a 9.55x speedup over the scalar method and a 17.01x speedup over compiler auto-vectorization.

Detailed results are as follows:

| Method | Data types | Case | Runtime (ms) | Speedup of RVV intrinsic over this method |
|---|---|---|---|---|
| scalar | f32:f32:f32 | alexnet | 1785.14 | 1.00 |
| scalar | f32:f32:f32 | resnet_50 | 26.2 | 0.70 |
| scalar | f32:f32:f32 | rnn_t | 314.22 | 1.54 |
| scalar | f32:f32:f32 | transformer_lt | 159.75 | 0.51 |
| scalar | s8:s8:f32 | alexnet | 1860.67 | 1.32 |
| scalar | s8:s8:f32 | resnet_50 | 46.22 | 3.63 |
| scalar | s8:s8:f32 | rnn_t | 493.47 | 2.43 |
| scalar | s8:s8:f32 | transformer_lt | 746.73 | 4.28 |
| scalar | u8:s8:s32 | alexnet | 2329.33 | 1.46 |
| scalar | u8:s8:s32 | resnet_50 | 68.54 | 4.80 |
| scalar | u8:s8:s32 | rnn_t | 439.22 | 1.91 |
| scalar | u8:s8:s32 | transformer_lt | 900.87 | 5.08 |
| auto vectorization | f32:f32:f32 | alexnet | 476.65 | 0.27 |
| auto vectorization | f32:f32:f32 | resnet_50 | 33.74 | 0.91 |
| auto vectorization | f32:f32:f32 | rnn_t | 246.08 | 1.21 |
| auto vectorization | f32:f32:f32 | transformer_lt | 178.69 | 0.57 |
| auto vectorization | s8:s8:f32 | alexnet | 2551.34 | 1.81 |
| auto vectorization | s8:s8:f32 | resnet_50 | 50.26 | 3.95 |
| auto vectorization | s8:s8:f32 | rnn_t | 812.71 | 4.01 |
| auto vectorization | s8:s8:f32 | transformer_lt | 783.0 | 4.48 |
| auto vectorization | u8:s8:s32 | alexnet | 1412.33 | 0.88 |
| auto vectorization | u8:s8:s32 | resnet_50 | 46.65 | 3.27 |
| auto vectorization | u8:s8:s32 | rnn_t | 488.94 | 2.12 |
| auto vectorization | u8:s8:s32 | transformer_lt | 799.07 | 4.51 |
| rvv intrinsic | f32:f32:f32 | alexnet | 1794.01 | - |
| rvv intrinsic | f32:f32:f32 | resnet_50 | 37.19 | - |
| rvv intrinsic | f32:f32:f32 | rnn_t | 203.82 | - |
| rvv intrinsic | f32:f32:f32 | transformer_lt | 314.39 | - |
| rvv intrinsic | s8:s8:f32 | alexnet | 1409.76 | - |
| rvv intrinsic | s8:s8:f32 | resnet_50 | 12.72 | - |
| rvv intrinsic | s8:s8:f32 | rnn_t | 202.84 | - |
| rvv intrinsic | s8:s8:f32 | transformer_lt | 174.67 | - |
| rvv intrinsic | u8:s8:s32 | alexnet | 1600.82 | - |
| rvv intrinsic | u8:s8:s32 | resnet_50 | 14.28 | - |
| rvv intrinsic | u8:s8:s32 | rnn_t | 230.54 | - |
| rvv intrinsic | u8:s8:s32 | transformer_lt | 177.26 | - |

@zhangjian29 zhangjian29 force-pushed the add-rvv-inner-product-integration branch from 5ad843d to 31ad649 Compare October 14, 2025 03:03
@zhangjian29 zhangjian29 marked this pull request as ready for review October 14, 2025 08:39
@zhangjian29 zhangjian29 requested a review from a team as a code owner October 14, 2025 08:39