General matrix multiplication (GEMM) is the performance bottleneck of many matrix computation algorithms. Modern CPU and GPU architectures can achieve near-optimal GEMM performance through specialized implementations. Leading BLAS libraries such as Intel MKL, OpenBLAS, and BLIS regularly attain over 90% of theoretical peak performance on x86 processors. For GPU acceleration, NVIDIA's cuBLAS library delivers similarly efficient matrix computations.
This repository contains the code for an optimization of DGEMM (Double-Precision General Matrix Multiplication). The matrices are generated and managed by the Eigen library and stored in column-major order: element (i, j) of C is addressed as `C[i + j * LDC]`, where the leading dimension `LDC` is the number of rows, i.e. `LDC = C.rows()`.
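For reference, here is a minimal sketch of this column-major addressing in plain C (the sizes and values are illustrative, not taken from the repository):

```c
#include <stdio.h>

int main(void) {
    /* A 3x2 column-major matrix: with C[i + j * LDC], the leading
     * dimension LDC equals the number of rows, so walking down a
     * column is a stride-1 access. */
    enum { M = 3, N = 2, LDC = M };
    double C[M * N] = {1, 2, 3,   /* column 0 */
                       4, 5, 6};  /* column 1 */
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i)
            printf("C(%d,%d) = %g\n", i, j, C[i + j * LDC]);
    return 0;
}
```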
This project is based on this project and the awesome lecture notes.
In this part, some basic techniques are used to optimize the DGEMM:

- Loop reordering: the loops are reordered to have the stride-1 access in the innermost loop. This improves spatial locality and allows better vectorization.
- Loop unrolling: the innermost loop is unrolled to perform multiple operations per iteration. This reduces loop overhead and allows better instruction-level parallelism.
- Blocking: the matrices are divided into smaller submatrices, or blocks. Specifically, the algorithm is split into `micro_kernel`s, the smallest units of computation, and a `macro_kernel` that schedules the `micro_kernel`s. A sketch of the first two techniques follows this list; the blocked structure is sketched further below.
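As an illustration, here is a minimal sketch of loop reordering and unrolling for column-major matrices (the function name, signature, and unroll factor are illustrative assumptions, not the repository's actual code):

```c
/* Reordered (j-k-i) DGEMM: with column-major storage, the innermost
 * i loop walks down columns of A and C with stride 1.  The i loop is
 * unrolled by 4 (illustrative factor; m is assumed divisible by 4). */
void dgemm_jki_unrolled(int m, int n, int k,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int p = 0; p < k; ++p) {
            double b = B[p + j * ldb];       /* scalar reused across the i loop */
            for (int i = 0; i < m; i += 4) { /* 4-way unrolled */
                C[i + 0 + j * ldc] += A[i + 0 + p * lda] * b;
                C[i + 1 + j * ldc] += A[i + 1 + p * lda] * b;
                C[i + 2 + j * ldc] += A[i + 2 + p * lda] * b;
                C[i + 3 + j * ldc] += A[i + 3 + p * lda] * b;
            }
        }
    }
}
```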
These optimizations, combined with compiler flags for auto-vectorization and aggressive optimization (e.g. `-O3 -march=native`), can significantly improve the performance of DGEMM compared to a naive triple-loop implementation.
There are a few variants of the pure C version:

- `naive`: the naive triple-loop implementation, with M-N-K loop order.
- `naive-knm`: the naive triple-loop implementation, with K-N-M loop order.
- `naive-nkm`: the naive triple-loop implementation, with N-K-M loop order (the M-N-K and N-K-M orders are sketched below).
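For concreteness, here are hedged sketches of the two extreme orderings (the function names and signatures are illustrative; the repository's code may differ):

```c
/* M-N-K order ("naive"): the innermost k loop strides through a row
 * of A (stride lda), which is cache-unfriendly in column-major storage. */
void dgemm_mnk(int m, int n, int k, const double *A, int lda,
               const double *B, int ldb, double *C, int ldc) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            for (int p = 0; p < k; ++p)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}

/* N-K-M order ("naive-nkm"): the innermost i loop is stride-1 over
 * both A and C in column-major storage. */
void dgemm_nkm(int m, int n, int k, const double *A, int lda,
               const double *B, int ldb, double *C, int ldc) {
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p)
            for (int i = 0; i < m; ++i)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}
```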
Out of these, `naive-nkm` is the most efficient: with column-major storage, its innermost M loop accesses A and C with stride 1. The pure C code is then further optimized with the blocking technique, and the measurements show that the performance improves again; a sketch of the blocked structure follows.
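Here is a minimal sketch of that blocked structure (the block sizes, names, and signatures are illustrative assumptions, not the repository's tuned kernels):

```c
#define MB 64 /* illustrative block sizes; real values are tuned to the cache */
#define NB 64
#define KB 64

static int min_int(int a, int b) { return a < b ? a : b; }

/* micro_kernel: the smallest unit of computation, an N-K-M product
 * on one block, C_blk += A_blk * B_blk. */
static void micro_kernel(int mb, int nb, int kb,
                         const double *A, int lda,
                         const double *B, int ldb,
                         double *C, int ldc) {
    for (int j = 0; j < nb; ++j)
        for (int p = 0; p < kb; ++p)
            for (int i = 0; i < mb; ++i)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}

/* macro_kernel: tiles the iteration space into MB x NB x KB blocks and
 * schedules the micro_kernel over them so each block fits in cache. */
void macro_kernel(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc) {
    for (int j = 0; j < n; j += NB)
        for (int p = 0; p < k; p += KB)
            for (int i = 0; i < m; i += MB)
                micro_kernel(min_int(MB, m - i), min_int(NB, n - j),
                             min_int(KB, k - p),
                             &A[i + p * lda], lda,
                             &B[p + j * ldb], ldb,
                             &C[i + j * ldc], ldc);
}
```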
Profiling the time cost of the code with `valgrind` (e.g. its callgrind tool) shows that the `micro_kernel` is the most time-consuming part.
The following part of the notes will show how to use SSE and AVX2 instructions to improve the performance further.
To run:

```bash
./run.sh
```