cpu: rv64: add rvv inner product implementation using rvv intrinsics #4115

zhangjian29 · 2025-10-11T03:27:29Z

Description

This PR introduces an optimized inner product (ip) primitive for RV64 architectures using RVV (RISC-V Vector) intrinsics. The current rvv_inner_product implementation focuses on 2D inputs with plain layouts.

This initial version provides:

Supported memory layouts: plain, dense, no inner blocks
Supported dst data types: f32, s32, s8, u8
Supported src/weights: 2D only, with data type combinations f32:f32, s8:s8 and u8:s8
Bias computation is supported

Implementation Details

Memory Reorder: Reorder src and weights into a scratchpad if they are not contiguous in memory
Separate Computations: Split the inner product computation into compute_ip_rvv and finalize_ip_acc:
- compute_ip_rvv: Performs the vector computation, using different RVV intrinsics for each src/weights data type combination
- finalize_ip_acc: After the vector computation, obtains the scalar result and adds bias (post-ops will be applied here in the future), then clamps dst and converts it to the required data type
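
The steps above can be sketched in portable C++ as follows. This is only an illustration under stated assumptions, not the code in this PR: the names pack_dense, compute_ip_ref and finalize_ip_ref are hypothetical, plain scalar loops stand in for the RVV intrinsics, and the rounding and saturation behavior shown (round to nearest, clamp to the dst range) is assumed rather than taken from the implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical helper: copy a strided 2D tensor (rows x cols) into a dense,
// row-major scratchpad so the inner loop reads contiguous memory.
template <typename T>
std::vector<T> pack_dense(const T *src, std::size_t rows, std::size_t cols,
        std::size_t row_stride, std::size_t col_stride) {
    std::vector<T> dense(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dense[r * cols + c] = src[r * row_stride + c * col_stride];
    return dense;
}

// Stage 1, a scalar stand-in for compute_ip_rvv: accumulate the dot product
// of one src row and one weights row in the accumulator type AccT
// (for example float for f32:f32, int32_t for s8:s8 and u8:s8).
template <typename AccT, typename SrcT, typename WeiT>
AccT compute_ip_ref(const SrcT *src, const WeiT *wei, std::size_t ic) {
    AccT acc = 0;
    for (std::size_t i = 0; i < ic; ++i)
        acc += static_cast<AccT>(src[i]) * static_cast<AccT>(wei[i]);
    return acc;
}

// Stage 2, a stand-in for finalize_ip_acc: add bias, then saturate and round
// for integer dst types (assumed behavior), and convert to DstT.
template <typename DstT, typename AccT>
DstT finalize_ip_ref(AccT acc, float bias) {
    double v = static_cast<double>(acc) + static_cast<double>(bias);
    if (std::is_integral<DstT>::value) {
        const double lo
                = static_cast<double>(std::numeric_limits<DstT>::lowest());
        const double hi
                = static_cast<double>(std::numeric_limits<DstT>::max());
        // Clamp first so the final cast is always in range.
        v = std::nearbyint(std::min(std::max(v, lo), hi));
    }
    return static_cast<DstT>(v);
}
```

For example, with u8 src {1, 2, 3} and s8 weights {4, -5, 6}, compute_ip_ref<std::int32_t> yields 12, and finalize_ip_ref<std::int8_t>(300, 0.0f) saturates to 127. Keeping the second stage separate means bias, clamping and conversion are written once, independent of which data type combination produced the accumulator.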

Checklist

General

Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
Have you formatted the code using clang-format?

Performance improvements

All experiments were run on an SG2044 platform with the following setup:

Test Cases: benchdnn inputs of alexnet / resnet_50 / rnn_t / transformer_lt
Test Dtypes: f32:f32:f32 / s8:s8:f32 / u8:s8:s32
Test Args: --mode=p --dir=FWD_B

We compare three methods: a scalar implementation (first baseline), compiler auto-vectorization (second baseline), and our RVV intrinsic implementation.

Scalar: ref_gemm_inner_product implementation compiled by gcc 14.2 with -march=rv64gc -O3
Auto Vectorization: ref_gemm_inner_product implementation compiled by gcc 14.2 with -march=rv64gcv -O3 -ftree-vectorize
RVV Intrinsic: Our rvv_inner_product implementation compiled by gcc 14.2 with -march=rv64gc -O3

Results

Overall, taking the ratio of total runtime across the cases listed below, the RVV intrinsic implementation achieves a 1.49x speedup over the scalar method and a 1.28x speedup over compiler auto-vectorization.

In the best case, --dt=u8:s8:s32 mb4ic1024oc4096n"Transformer_lt:Decoder_MM_7*240", the RVV intrinsic implementation achieves a 9.55x speedup over the scalar method and a 17.01x speedup over compiler auto-vectorization.

Detailed results are as follows:

Method	Data types (src:wei:dst)	Case	Runtime (ms)	Speedup of RVV intrinsic over this method
scalar	f32:f32:f32	alexnet	1785.14	1.00
scalar	f32:f32:f32	resnet_50	26.2	0.70
scalar	f32:f32:f32	rnn_t	314.22	1.54
scalar	f32:f32:f32	transformer_lt	159.75	0.51
scalar	s8:s8:f32	alexnet	1860.67	1.32
scalar	s8:s8:f32	resnet_50	46.22	3.63
scalar	s8:s8:f32	rnn_t	493.47	2.43
scalar	s8:s8:f32	transformer_lt	746.73	4.28
scalar	u8:s8:s32	alexnet	2329.33	1.46
scalar	u8:s8:s32	resnet_50	68.54	4.80
scalar	u8:s8:s32	rnn_t	439.22	1.91
scalar	u8:s8:s32	transformer_lt	900.87	5.08
auto vectorization	f32:f32:f32	alexnet	476.65	0.27
auto vectorization	f32:f32:f32	resnet_50	33.74	0.91
auto vectorization	f32:f32:f32	rnn_t	246.08	1.21
auto vectorization	f32:f32:f32	transformer_lt	178.69	0.57
auto vectorization	s8:s8:f32	alexnet	2551.34	1.81
auto vectorization	s8:s8:f32	resnet_50	50.26	3.95
auto vectorization	s8:s8:f32	rnn_t	812.71	4.01
auto vectorization	s8:s8:f32	transformer_lt	783.0	4.48
auto vectorization	u8:s8:s32	alexnet	1412.33	0.88
auto vectorization	u8:s8:s32	resnet_50	46.65	3.27
auto vectorization	u8:s8:s32	rnn_t	488.94	2.12
auto vectorization	u8:s8:s32	transformer_lt	799.07	4.51
rvv intrinsic	f32:f32:f32	alexnet	1794.01	-
rvv intrinsic	f32:f32:f32	resnet_50	37.19	-
rvv intrinsic	f32:f32:f32	rnn_t	203.82	-
rvv intrinsic	f32:f32:f32	transformer_lt	314.39	-
rvv intrinsic	s8:s8:f32	alexnet	1409.76	-
rvv intrinsic	s8:s8:f32	resnet_50	12.72	-
rvv intrinsic	s8:s8:f32	rnn_t	202.84	-
rvv intrinsic	s8:s8:f32	transformer_lt	174.67	-
rvv intrinsic	u8:s8:s32	alexnet	1600.82	-
rvv intrinsic	u8:s8:s32	resnet_50	14.28	-
rvv intrinsic	u8:s8:s32	rnn_t	230.54	-
rvv intrinsic	u8:s8:s32	transformer_lt	177.26	-
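
The Speedups column appears consistent with dividing each baseline runtime by the RVV intrinsic runtime of the same case, and the 1.49x and 1.28x summary figures match the ratio of summed runtimes. The sketch below reproduces that arithmetic; the function and constant names are hypothetical, and the runtimes are copied from the table above.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Per-case speedup of the RVV intrinsic run over a baseline run.
double speedup(double baseline_ms, double rvv_ms) {
    return baseline_ms / rvv_ms;
}

// Overall speedup, taken as the ratio of summed runtimes.
double total_speedup(const std::vector<double> &baseline_ms,
        const std::vector<double> &rvv_ms) {
    assert(baseline_ms.size() == rvv_ms.size());
    const double b
            = std::accumulate(baseline_ms.begin(), baseline_ms.end(), 0.0);
    const double r = std::accumulate(rvv_ms.begin(), rvv_ms.end(), 0.0);
    return b / r;
}

// Runtimes in ms, in table order: f32:f32:f32, s8:s8:f32, u8:s8:s32, each
// with alexnet, resnet_50, rnn_t, transformer_lt.
const std::vector<double> kScalarMs = {1785.14, 26.2, 314.22, 159.75,
        1860.67, 46.22, 493.47, 746.73, 2329.33, 68.54, 439.22, 900.87};
const std::vector<double> kAutoVecMs = {476.65, 33.74, 246.08, 178.69,
        2551.34, 50.26, 812.71, 783.0, 1412.33, 46.65, 488.94, 799.07};
const std::vector<double> kRvvMs = {1794.01, 37.19, 203.82, 314.39,
        1409.76, 12.72, 202.84, 174.67, 1600.82, 14.28, 230.54, 177.26};
```

With these inputs, total_speedup(kScalarMs, kRvvMs) evaluates to roughly 1.49 and total_speedup(kAutoVecMs, kRvvMs) to roughly 1.28. A value below 1.00 in a row, such as the f32:f32:f32 resnet_50 case, means the RVV intrinsic run was slower than that baseline.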

github-actions bot added platform:cpu-rv64 RISC-V component:common labels Oct 11, 2025

cpu: rv64: add rvv inner product integration

31ad649

zhangjian29 force-pushed the add-rvv-inner-product-integration branch from 5ad843d to 31ad649 October 14, 2025 03:03

zhangjian29 marked this pull request as ready for review October 14, 2025 08:39

zhangjian29 requested a review from a team as a code owner October 14, 2025 08:39

vpirogov approved these changes Oct 15, 2025

zhangjian29 mentioned this pull request Oct 17, 2025

cpu: rv64: add support for rv64 eltwise feature with RISC-V Vector Extension #3898

Merged
