
Add conditional compilation for zen4 calls to AOCL-BLAS #4

Open
bartoldeman wants to merge 177 commits into amd:dev from bartoldeman:zen4-conditional-compilation

Conversation

@bartoldeman

If AOCL-BLAS is compiled with "auto" on Zen3, it does not include the zen4 functions, so compilation errors occur for the zen4 function calls. Using the BLIS_KERNELS_ZEN4 define in blis.h, we can make these calls conditional on zen4 support actually being available in AOCL-BLAS.
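
A minimal sketch of the guard pattern described above, not the actual diff. BLIS_KERNELS_ZEN4 is the define from blis.h named in this PR; the `example_*` functions are hypothetical stand-ins rather than real AOCL-BLAS kernels, and blis.h is deliberately not included so the sketch compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* In a real build, BLIS_KERNELS_ZEN4 would come from blis.h when AOCL-BLAS
 * was configured with zen4 support. Here it stays undefined unless the
 * compiler is invoked with -DBLIS_KERNELS_ZEN4. */

/* Hypothetical portable fallback: plain dot product. */
static double example_ddot_generic(size_t n, const double *x, const double *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

#ifdef BLIS_KERNELS_ZEN4
/* Hypothetical stand-in for a zen4-only kernel. It only exists when the
 * define is present, mirroring a symbol that is missing from a Zen3 build. */
static double example_ddot_zen4(size_t n, const double *x, const double *y)
{
    return example_ddot_generic(n, x, y);
}
#endif

/* Caller-side guard: the zen4 call is compiled only when it can link. */
double example_ddot(size_t n, const double *x, const double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
#ifdef BLIS_KERNELS_ZEN4
    return example_ddot_zen4(n, x, y);
#else
    return example_ddot_generic(n, x, y);
#endif
}
```

Without the guard, the reference to the zen4-only function would fail to compile or link whenever the define is absent, which is the failure this PR describes.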

pradeeptrgit and others added 30 commits October 25, 2024 16:20
NOTICES file with Third Party Licence information

Change-Id: I8b91edd3481bef3e5378a2c91e67c1b4eb81d1b1
… multi-thread is enabled

CPUPL-5677
Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: Iaa1b79ca8e7e7c15df31b02e7fef450eb530a4e7
Added LAPACKE interface testing support in the main test suite
for GBTRF and GBTRS APIs.

Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: I97b7aaf8b697a1d1779bbebb69dc7cec9348d1f8
Correction in condition checks for invalid input params in sgesdd_fla_check,
dgesdd_fla_check APIs. Modified condition to check ldvt based on jobvt
instead of jobu.

Modified jobz comparison logic to check regardless of case.

AMD-Internal: CPUPL-5889
Signed-off-by: dnikku <Deepika.Nikku@amd.com>
Change-Id: Ib3fed47fbd05163aa2b496cec6cd59cd79156880
Update License text with latest Third Party Notices

Change-Id: I79589b0821cab802157406415d49d898a7a83d2f
This reverts commit 3d7b017.

Change-Id: I1002f00295f77caf2097ff67bf1b51889067bf96
APIs from AOCL-Utils to get ISA information have changed in
4.2 release and the ones used in libflame have been deprecated.
Use the latest AOCL-Utils API.

AMD Internal : [CPUPL-5906]

Change-Id: I84d576f9749a399aea23d96b5d2d636497bed540
Earlier these APIs were just a wrapper around sgelss/dgelss.
gelsd APIs provide significant performance gains over gelss
APIs.

Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-5766
Change-Id: I867c107816fcd50fb16a890ca5947c6b9ff80e3d
Corrected an error in the calculation of the length parameter when
handling values less than safe-min in the norm calculation.

Also fixed an issue in the usage of lda in a macro nested inside another macro.

Signed-off-by: Vasanthakumar R <varajago@amd.com>
AMD-Internal: SWLCSG-3226, SWLCSG-3217
Change-Id: Ic75a10aa7020f043e71f1501bb4219f4498be901
   1. Optimized the LAPACK_GETRI_SMALL_D_3x3 kernel by reducing the internal repetitive memory load/stores

AMD Internal : [CPUPL-5865]

Change-Id: Ib4e33cccdc4a78820014a676bcb9808f4c685797
… major

Fixed init_matrix_from_file() API to read matrix from input file in
column major format

Added fix for gtsv AOCC issue.

AMD-Internal: CPUPL-5922
Signed-off-by: dnikku <Deepika.Nikku@amd.com>
Change-Id: I159583837028dd11ee1dbc844b0d495d0c1be1fc
Added initialization of the first column/row of U.
A previous optimization that removed this initialization
caused test failures in a few performance cases.
Corrected the modification of signs of Vt in 2x2 cases.

Signed-off-by: Vasanthakumar R <varajago@amd.com>
AMD-Internal: CPUPL-6074
Change-Id: I19b7044029b4580013ea7decb58c3500097d0f88
- Overflow/underflow tests for sygvd/hegvd
- Memory leak fixes for sygvd test cases
- Enabling lapacke interfaces for sygvd/hegvd

Change-Id: I79ec9c009e6ba52df17bc6247bb726e60193d5ed
Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-6037
… cases

Fixed the copy matrix sizes in validate_gesdd/gesvd to avoid out of bound
memory access while testing corner cases.
for gesdd, jobz = O, m >= n, ldu = 1, m < n, ldvt = 1
for gesvd, jobu/vt = O, ldu/ldvt = 1 cases

Fixed gtsv test2 under validate_gtsv(). Scaling down the residual
by 10 times to fall in the expected threshold range as input matrix Xact
is randomly generated.

AMD-Internal: CPUPL-5926
Signed-off-by: dnikku <Deepika.Nikku@amd.com>
Change-Id: Ia99bb6d81b76de394265ffded0069fb440de979f
details: Datatype alignment changes for the structures used in test suite
Signed-off-by: ksaithar <katteboina.saitharun@amd.com>
Change-Id: I08ce86f5d642189b6f9142c74af41a633415b1f3
Components added:
1. Test run/validation
2. Negative test cases
3. Extreme test cases
4. Overflow/Underflow tests
5. Lapacke test

Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com>
AMD-Internal: CPUPL-5903

Change-Id: I38fa28ac0216740e0669e41509ca7870fd3adab8
1. Move block size computation to a separate function for each of the
   4 types.
2. Optimal block sizes for various input sizes vary as OMP_NUM_THREADS
   is varied. Set optimal block sizes based on input size ranges only
   when OMP_NUM_THREADS=1
3. For small sizes, take the un-optimized path because with the
   optimized path there are regressions due to overhead of openmp
   calls.

Gains obtained for single-threaded runs: up to 15% on Genoa and 28%
on Turin

Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com>
AMD-Internal: CPUPL-5876

Change-Id: I8fdeccdf0debdacec3913f8192711d86e9d62314
Port Netlib Lapack 3.12 FORTRAN code to C files for double precision APIs

Signed-off-by: Venkatesha <vprasada@amd.com>
AMD-Internal: [CPUPL-5708]
Change-Id: If2ddc85a9ad0818c96155945340b9cea23b40c8e
- Added avx2, avx512 and parallel version for sgetrf

Change-Id: I724cc5c9bf98f42014bcaf680a2fa7373195f10d
Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-6060
Following features are implemented in this commit:
1. Library path and include path for aocl-utils and blis can be automatically inferred while building libflame if pkg-config files
for these libraries are available. Only works on Linux for now.
2. Various cmake configure/build/install/test/workflow presets.
3. Cmake presets for Windows (msvc and ninja). As of now test presets do not work!
4. Minimum cmake version upgraded to 3.26.0

Preset names follow the convention: <os>-<make/ninja>-<compiler>-<st/mt>-<lp/ilp>-<static/shared>-<isa-mode>-<other optional commands>

Usage:
$ cmake --build --list-presets

-- Without aocl-utils pkgconfig file
$ cmake --preset {chosen-preset} -DLIBAOCLUTILS_INCLUDE_PATH={aoclutils header path} -DLIBAOCLUTILS_LIBRARY_PATH={aoclutils library path}

-- With aocl-utils pkgconfig file
$ cmake --preset {chosen-preset}

$ cmake --build --preset {chosen-preset}

Build and test workflow

-- If aocl-utils and blis pkg-config files are available
$ cmake --workflow --preset {chosen-preset}

More info in BUILD.md

Change-Id: I8bc54b3eabed9a18c305e911df9aa76d8ff746d0
Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-5862
Port Netlib Lapack-3.12 newly added double precision fortran files to c
Files added : dgedmd.c, dgedmdq.c, dgeqp3rk.c, dlaqp2rk.c, dlaqp3rk.c.
Netlib test for lapack-3.12 included.

Signed-off-by: Venkatesha <vprasada@amd.com>
AMD-Internal: [CPUPL-5708]
Change-Id: I60b5c47505162882a19f2086e4842c858e0586e8
a. Added separate invoke functions in CPP for each API
b. Added support for cmake and make
c. Added support for --interface in cmd line
d. Resolved warnings in existing interface header file
e. Resolved errors/warnings in Windows
f. Added ENABLE_CPP_TEST flag in cmake, make files to enable/disable CPP test interface.
g. Updated Readme as per latest changes.
h. Added CPP changes for 25+ test APIs.

Change-Id: I6b77d24e204833134401c69813a3a1672de02c18
Port Netlib Lapack 3.12 FORTRAN code to C files for double precision complex APIs
Note: Retained lapack-3.11 zlaqr5.c, to overcome netlib test failures.

Signed-off-by: Venkatesha <vprasada@amd.com>
AMD-Internal: [CPUPL-6150]
Change-Id: I40df2b270d82159cd0bc16a0158951139054b90a
Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com>
AMD-Internal: CPUPL-6155

Change-Id: Icbd84fdfa434875bf3bfd2072b09d8e77c326702
…ex New files

Port Netlib Lapack-3.12 newly added double precision complex fortran files to c
Files added : zgedmd.c, zgedmdq.c, zgeqp3rk.c, zlaqp2rk.c, zlaqp3rk.c, zrscl.c.
Updating DTL logging in dgedmd.c, dgedmdq.c, dgeqp3rk.c, dlaqp2rk.c, dlaqp3rk.c

Signed-off-by: Venkatesha <vprasada@amd.com>
AMD-Internal: [CPUPL-6150]
Change-Id: Ia043bdeace2efc61fb25c64e47c2a45bdd8bda9c
Updated test code to display output status as INVALID_PARAM when
1) illegal inputs are passed
2) LAPACK API returns illegal input warning

NOTE: Existing behaviour is to display status = FAIL for these cases.
      Modified FLA_TEST_CHECK_EINFO macro.

Signed-off-by: dnikku <Deepika.Nikku@amd.com>
AMD-Internal: CPUPL-6250
Change-Id: I913fbef39f1f2e58142c21765033b2688de4585a
Port Netlib Lapack 3.12 FORTRAN code to C files for single precision APIs
Note: Retained lapack-3.11 slaqr5.c, to overcome netlib test failures.

Signed-off-by: Venkatesha <vprasada@amd.com>
AMD-Internal: [CPUPL-6149]
Change-Id: Id4b96fd45e78246640ca681b5b9236bde09cab52
   1. Optimized the DNRM2 blas api with avx2 and avx512 intrinsics.

AMD Internal : [CPUPL-6122]

Change-Id: I8d8822f5a300997bda3cee15b730489892d938f9
-> Removed unused variables.
-> Initialized variables where they were not

Signed-off-by: Venkatesha <vprasada@amd.com>
Change-Id: If6e4b04a1d23b008812afa920efd226ac923e2c1
…iles

Port Netlib Lapack-3.12 newly added single precision fortran files to c
Files added : sgedmd.c, sgedmdq.c, sgeqp3rk.c, slaqp2rk.c, slaqp3rk.c.

Signed-off-by: Venkatesha <vprasada@amd.com>
AMD-Internal: [CPUPL-6149]
Change-Id: Idea38525527da5893fa61760f58e62953274708e
Govindaswamy, Sridhar and others added 24 commits November 13, 2025 04:48
Regression is observed for small sizes due to multi-threading overhead.
Updated the code to take single-threaded path for sizes below 30.

AMD-Internal: [CPUPL-6977]
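
As an illustration of the small-size fix above, here is a minimal sketch of a size-based thread cutoff. The threshold of 30 comes from the commit message; the function name, its parameters, and the idea that a single dimension `n` decides the path are assumptions made for this example only.

```c
#include <assert.h>

/* Threshold from the commit message: sizes below 30 stay single-threaded. */
#define EXAMPLE_ST_THRESHOLD 30

/* Hypothetical helper: number of threads to use for problem size n,
 * given how many threads the runtime makes available. */
int example_thread_count(int n, int available_threads)
{
    assert(n >= 0 && available_threads >= 1);
    if (n < EXAMPLE_ST_THRESHOLD)
        return 1;               /* avoid multi-threading overhead */
    return available_threads;   /* large enough to benefit from threads */
}
```

With 8 threads available, a size of 29 would run on one thread and a size of 30 on all eight.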
Fixed a regression for small sizes caused by the thread calculation function being invoked even when it was not required.

AMD-Internal: CPUPL-6961
Change-Id: I362556afb63548b5bc43aab301255e6e1489f267
* Add AI Code Review Self-enablement file

This will enable triggering AI Code Review tool when PR is raised
* DGEMV direct call to AOCL-BLAS Kernel

Added direct call to AOCL-BLAS kernel for DGEMV to make a single threaded call.
For FLAME path, this was done inside 'bl1_dgemv_blas' which is a common wrapper in FLAME path.
For DLARFT, direct calls to DGEMV kernels were added by replacing DGEMV calls.
DGESVD thresholding for one of the small paths changed.

AMD Internal: CPUPL-6998, 6997, 6981
* AOCL LAPACK: dgesvd fix to make singular values positive

Added code changes for making singular values non negative and
ctest fixes in potri, hetrf

AMD-Internal: CPUPL-7326, CPUPL-7316

Signed-off-by: dnikku <Deepika.Nikku@amd.com>
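
The commit above does not say how the singular values are made non-negative. One standard approach, sketched here under that caveat, is to flip the sign of each negative value together with the matching row of Vt, which leaves the product U * diag(s) * Vt unchanged. The function name and argument layout are hypothetical; Vt is assumed column-major with leading dimension `ldvt`.

```c
#include <assert.h>
#include <stddef.h>

/* For each negative s[i], negate s[i] and row i of the k-by-n matrix Vt
 * (column-major, leading dimension ldvt >= k). */
void example_make_sigma_nonnegative(size_t k, size_t n,
                                    double *s, double *vt, size_t ldvt)
{
    assert(ldvt >= k);
    for (size_t i = 0; i < k; ++i) {
        if (s[i] < 0.0) {
            s[i] = -s[i];
            for (size_t j = 0; j < n; ++j)
                vt[i + j * ldvt] = -vt[i + j * ldvt];
        }
    }
}
```

Negating the matching column of U instead of the row of Vt would work equally well; the choice here is arbitrary.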
Add unistd.h to test_libflame.c to resolve compilation errors caused by its removal from FLAME.h in the legacy test suite.

AMD_Internal: CPUPL-7346

Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
Overflow test specific vector allocation brought under
corresponding check to solve memory leak.

AMD Internal: CPUPL-7352
- Re-implemented memory leak fixes overwritten by LAPACKE update.
- Implemented fixes for use of uninitialized variables in extreme cases.

AMD-Internal: CPUPL-6488
The following commit adds BRT support for the following APIs
GEBRD, GETRF, GETRF2, GETRFNP, GETRFNPI, GETRI, GETRS, GGEV, GGEVX, GGHRD

An additional logger mode has also been implemented for jenkins integration.
The store and check reproducibility functions have also been refactored into variadic functions

AMD-Internal : CPUPL-7287
* Implement random-init mode in main test
* Modified buffer indexing to 64-bit to prevent overflow with large sizes

Added random initialization mode for generating random input matrices
and skip validation code for faster performance runs

AMD-Internal: CPUPL-7435
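
A minimal sketch of why 64-bit buffer indexing matters for large sizes, as mentioned in the commit above. The helper name is hypothetical; the point is only that the operands are widened before the multiplication, so the product cannot overflow a 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Linear offset of element (i, j) in a column-major matrix with leading
 * dimension ld. Each operand is widened to 64 bits before multiplying. */
int64_t example_offset(int32_t i, int32_t j, int32_t ld)
{
    assert(i >= 0 && j >= 0 && ld >= 1);
    return (int64_t)i + (int64_t)j * (int64_t)ld;
}
```

For a 50000 x 50000 matrix, the offset of the first element in column 50000 is 2,500,000,000, which exceeds INT32_MAX (2,147,483,647) and would overflow if computed in 32-bit arithmetic.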
AOCL-LAPACK - Test suite: Ctest failures in windows

-> Fixed Overflow/underflow test failures in windows for potri, getri, getrfnp and getrfnpi
-> Fixed Extreme case test failures in trtrs.

AMD-Internal: [CPUPL-7411], [CPUPL-7412]
The LF_ISA_CONFIG cmake flag now accepts two new values, avx2-strict and
avx512-strict, that let users pin the ISA used by the optimized code paths.
A strict value overrides both the dynamic dispatch behaviour and the
environment variable AOCL_ENABLE_INSTRUCTIONS.

Example: if LF_ISA_CONFIG=avx2-strict is set during the build, all code paths,
including optimized kernels, use the AVX2 ISA.

AMD-Internal: [CPUPL-7470]
Changed implementation of APIs to use 64-bit integer for all dimensional variables.
User facing APIs retain standard signature for compatibility. Only internal implementation changed
by creating compatibility layer.
Only LAPACK interfaces and FLAME interfaces are changed whereas LAPACKE interface remains the same.
All test suites accordingly changed (Netlib, legacy and main test).
This PR makes minor modifications to improve compatibility across the codebase.
The changes focus on correcting data type usage and updating error messages for consistency.

Names of elements of complex types made uniform across the code base.
The whole code base now uses {real,imag}.
Fixed data type mismatch in BSVD sorting function.
Updated error messages to use consistent terminology.
Removed unnecessary macro undefining.
GEMM calls deleted during the 64-bit integer changes added back.
Guards for memory allocation also added.
Both changes were made in the spffrt2 optimized implementation.
Added ctest cases.

AMD Internal: CPUPL-7427
…ACK API test drivers (#132)

BRT support has been added to the test drivers for the following APIs:
gbsvx, gbtf2, getc2, getf2, gtsv, hetrf, hetrf_rook, hetri_rook, hgeqz, hseqr, lange, larf, larfg, lartg, org2r, orgqr, ormlq, ormqr, potf2, potrf, potrf2, potri, potrs, rot, spffrt2, spffrtx, stedc, steqr, stevd, syev, syevd, syevx, sygvd, sytrd, sytrf, sytrf_rook, trtri, trtrs

This PR uses AI generated commits as a starting point and 20 API test drivers were written purely using AI.

AMD-Internal : CPUPL-7424
AOCL-LAPACK: LAPACKE_dgesv performance regression

Updated code path in dgetrf to overcome the regression.
For 128 < m,n < 160 (multi-thread) -> aocl_lapack_dgetrf2 API
For 80 < m,n < 127 (multi-thread) -> getrf_avx512 path
For 80 < m,n < 160 (single-thread) -> aocl_lapack_dgetrf2 API

AMD-Internal: [CPUPL-7469]
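
The size ranges listed in the dgetrf commit above can be sketched as a small selection function. This is an illustration only: the enum, the function names, and the reading of "m,n" as "both m and n fall strictly inside the range" are assumptions. Sizes the commit message does not cover (including 127 and 128 in the multi-threaded case) map to a default path here.

```c
#include <assert.h>

typedef enum {
    EXAMPLE_PATH_DEFAULT,   /* sizes not covered by the commit message */
    EXAMPLE_PATH_GETRF2,    /* stands for the aocl_lapack_dgetrf2 path */
    EXAMPLE_PATH_AVX512     /* stands for the getrf_avx512 path */
} example_path;

/* True when both m and n lie strictly between lo and hi. */
static int example_both_between(int m, int n, int lo, int hi)
{
    return m > lo && m < hi && n > lo && n < hi;
}

example_path example_select_dgetrf_path(int m, int n, int num_threads)
{
    assert(m >= 0 && n >= 0 && num_threads >= 1);
    if (num_threads == 1) {
        return example_both_between(m, n, 80, 160) ? EXAMPLE_PATH_GETRF2
                                                   : EXAMPLE_PATH_DEFAULT;
    }
    if (example_both_between(m, n, 128, 160))
        return EXAMPLE_PATH_GETRF2;
    if (example_both_between(m, n, 80, 127))
        return EXAMPLE_PATH_AVX512;
    return EXAMPLE_PATH_DEFAULT;
}
```

Under these assumptions, a 100 x 100 problem takes the dgetrf2 path with one thread and the AVX512 path with several threads.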
Tuned block sizes and thread count for large size optimisations

AMD-Internal: CPUPL-7457
Set the number of BLIS threads to 1 for calls to DGEMV.
This was done because isolated threading in DGEMV was causing cache misses
in subsequent single-threaded code.

AMD Internal: CPUPL-6963, 6981, 6983
* DLASWP optimization to address regression for all sizes of DGETRS, DGBTRF.
* Created dlaswp_st for single threaded dlaswp with 32 tile size and used in DGBTRF.

AMD Internal: CPUPL-7253
Changed thresholds to include some medium sizes in the small path.
Avoided the direct call to dlasv2 for NN cases, as it was causing a regression.
Added a guard for malloc in the DGELQF small path.

AMD Internal: CPUPL-7430
- Fixed the medium/large size DORGQR regression by rolling
  back the optimization for dlarft.

Change-Id: Ieb8ea82dbe547833b4d235edb70cb739e00ebf37
AMD-Internal: CPUPL-6962
@pradeeptrgit
Collaborator

Thank you for this PR. Is it possible to raise this PR on "dev" branch of the repo? We push latest updates into dev branch in bi-weekly cadence and we can include this as part of the next update.

If AOCL-BLAS is compiled with "auto" on Zen3 it does not include
the zen4 functions, so compilation errors occur for the *zen4*
function calls. Using the BLIS_KERNELS_ZEN4 define in blis.h
we can make these calls conditional on zen4 support actually
being available in AOCL-BLAS.
@bartoldeman force-pushed the zen4-conditional-compilation branch from 796e010 to d5f93bc Compare February 6, 2026 12:37
@bartoldeman changed the base branch from master to dev February 6, 2026 12:46
@bartoldeman
Author

> Thank you for this PR. Is it possible to raise this PR on "dev" branch of the repo? We push latest updates into dev branch in bi-weekly cadence and we can include this as part of the next update.

No problem, I've rebased this PR onto dev.


8 participants