Add conditional compilation for zen4 calls to AOCL-BLAS #4
Open
bartoldeman wants to merge 177 commits into amd:dev from
NOTICES file with Third Party Licence information Change-Id: I8b91edd3481bef3e5378a2c91e67c1b4eb81d1b1
… multi-thread is enabled CPUPL-5677 Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com> Change-Id: Iaa1b79ca8e7e7c15df31b02e7fef450eb530a4e7
Added LAPACKE interface testing support in the main test suite for GBTRF and GBTRS APIs. Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com> Change-Id: I97b7aaf8b697a1d1779bbebb69dc7cec9348d1f8
Correction in condition checks for invalid input params in sgesdd_fla_check, dgesdd_fla_check APIs. Modified condition to check ldvt based on jobvt instead of jobu. Modified jobz comparison logic to check regardless of case. AMD-Internal: CPUPL-5889 Signed-off-by: dnikku <Deepika.Nikku@amd.com> Change-Id: Ib3fed47fbd05163aa2b496cec6cd59cd79156880
Update License text with latest Third Party Notices Change-Id: I79589b0821cab802157406415d49d898a7a83d2f
This reverts commit 3d7b017. Change-Id: I1002f00295f77caf2097ff67bf1b51889067bf96
APIs from AOCL-Utils to get ISA information have changed in 4.2 release and the ones used in libflame have been deprecated. Use the latest AOCL-Utils API. AMD Internal : [CPUPL-5906] Change-Id: I84d576f9749a399aea23d96b5d2d636497bed540
Earlier these APIs were just a wrapper around sgelss/dgelss. gelsd APIs provide significant performance gains over gelss APIs. Signed-off-by: samahmad <Sameer.Ahmad@amd.com> AMD-Internal: CPUPL-5766 Change-Id: I867c107816fcd50fb16a890ca5947c6b9ff80e3d
Corrected error in calculation of length parameter while handling values less than safe-min in norm calculation. Also fixed an issue in the usage of lda in a macro inside a macro. Signed-off-by: Vasanthakumar R <varajago@amd.com> AMD-Internal: SWLCSG-3226, SWLCSG-3217 Change-Id: Ic75a10aa7020f043e71f1501bb4219f4498be901
1. Optimized the LAPACK_GETRI_SMALL_D_3x3 kernel by reducing the internal repetitive memory load/stores AMD Internal : [CPUPL-5865] Change-Id: Ib4e33cccdc4a78820014a676bcb9808f4c685797
… major Fixed init_matrix_from_file() API to read matrix from input file in column major format Added fix for gtsv AOCC issue. AMD-Internal: CPUPL-5922 Signed-off-by: dnikku <Deepika.Nikku@amd.com> Change-Id: I159583837028dd11ee1dbc844b0d495d0c1be1fc
Initialization of first column/row of U added. The previous optimization that removed the initialization caused test failures in a few performance cases. Corrected modification of signs of Vt in 2x2 cases. Signed-off-by: Vasanthakumar R <varajago@amd.com> AMD-Internal: CPUPL-6074 Change-Id: I19b7044029b4580013ea7decb58c3500097d0f88
- Overflow/underflow tests for sygvd/hegvd
- Memory leak fixes for sygvd test cases
- Enabling lapacke interfaces for sygvd/hegvd
Change-Id: I79ec9c009e6ba52df17bc6247bb726e60193d5ed
Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-6037
… cases
Fixed the copy matrix sizes in validate_gesdd/gesvd to avoid out-of-bounds memory access while testing corner cases:
for gesdd, jobz = O, m >= n, ldu = 1, m < n, ldvt = 1
for gesvd, jobu/vt = O, ldu/ldvt = 1 cases
Fixed gtsv test2 under validate_gtsv(). Scaled down the residual by 10 times to fall within the expected threshold range, as the input matrix Xact is randomly generated.
AMD-Internal: CPUPL-5926
Signed-off-by: dnikku <Deepika.Nikku@amd.com>
Change-Id: Ia99bb6d81b76de394265ffded0069fb440de979f
details: Datatype alignment changes for the structures used in test suite Signed-off-by: ksaithar <katteboina.saitharun@amd.com> Change-Id: I08ce86f5d642189b6f9142c74af41a633415b1f3
Components added:
1. Test run/validation
2. Negative test cases
3. Extreme test cases
4. Overflow/Underflow tests
5. Lapacke test
Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com>
AMD-Internal: CPUPL-5903
Change-Id: I38fa28ac0216740e0669e41509ca7870fd3adab8
1. Move block size computation to a separate function for each of the 4 types.
2. Optimal block sizes for various input sizes vary as OMP_NUM_THREADS is varied. Set optimal block sizes based on input size ranges only when OMP_NUM_THREADS=1.
3. For small sizes, take the un-optimized path because with the optimized path there are regressions due to overhead of openmp calls.
Gains obtained for single-threaded runs: up to 15% on genoa and 28% on turin
Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com>
AMD-Internal: CPUPL-5876
Change-Id: I8fdeccdf0debdacec3913f8192711d86e9d62314
Port Netlib Lapack 3.12 FORTRAN code to C files for double precision APIs Signed-off-by: Venkatesha <vprasada@amd.com> AMD-Internal: [CPUPL-5708] Change-Id: If2ddc85a9ad0818c96155945340b9cea23b40c8e
- Added avx2, avx512 and parallel version for sgetrf Change-Id: I724cc5c9bf98f42014bcaf680a2fa7373195f10d Signed-off-by: samahmad <Sameer.Ahmad@amd.com> AMD-Internal: CPUPL-6060
Following features are implemented in this commit:
1. Library path and include path for aocl-utils and blis can be automatically inferred while building libflame if pkg-config files
for these libraries are available. Only works on Linux for now.
2. Various cmake configure/build/install/test/workflow presets.
3. Cmake presets for Windows (msvc and ninja). As of now test presets do not work!
4. Minimum cmake version upgraded to 3.26.0
Preset names follow the convention: <os>-<make/ninja>-<compiler>-<st/mt>-<lp/ilp>-<static/shared>-<isa-mode>-<other optional commands>
Usage:
$ cmake --build --list-presets
-- Without aocl-utils pkgconfig file
$ cmake --preset {chosen-preset} -DLIBAOCLUTILS_INCLUDE_PATH={aoclutils header path} -DLIBAOCLUTILS_LIBRARY_PATH={aoclutils library path}
-- With aocl-utils pkgconfig file
$ cmake --preset {chosen-preset}
$ cmake --build --preset {chosen-preset}
Build and test workflow
-- If aocl-utils and blis pkg-config files are available
$ cmake --workflow --preset {chosen-preset}
More info in BUILD.md
Change-Id: I8bc54b3eabed9a18c305e911df9aa76d8ff746d0
Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-5862
Port Netlib Lapack-3.12 newly added double precision fortran files to c Files added : dgedmd.c, dgedmdq.c, dgeqp3rk.c, dlaqp2rk.c, dlaqp3rk.c. Netlib test for lapack-3.12 included. Signed-off-by: Venkatesha <vprasada@amd.com> AMD-Internal: [CPUPL-5708] Change-Id: I60b5c47505162882a19f2086e4842c858e0586e8
a. Added separate invoke functions in CPP for each API
b. Added support for cmake and make
c. Added support for --interface in cmd line
d. Resolved warnings in existing interface header file
e. Resolved errors/warnings in Windows
f. Added ENABLE_CPP_TEST flag in cmake, make files to enable/disable CPP test interface.
g. Updated Readme as per latest changes.
h. Added CPP changes for 25+ test APIs.
Change-Id: I6b77d24e204833134401c69813a3a1672de02c18
Port Netlib Lapack 3.12 FORTRAN code to C files for double precision complex APIs Note: Retained lapack-3.11 zlaqr5.c, to overcome netlib test failures. Signed-off-by: Venkatesha <vprasada@amd.com> AMD-Internal: [CPUPL-6150] Change-Id: I40df2b270d82159cd0bc16a0158951139054b90a
Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com> AMD-Internal: CPUPL-6155 Change-Id: Icbd84fdfa434875bf3bfd2072b09d8e77c326702
…ex New files Port Netlib Lapack-3.12 newly added double precision complex fortran files to c Files added : zgedmd.c, zgedmdq.c, zgeqp3rk.c, zlaqp2rk.c, zlaqp3rk.c, zrscl.c. Updating DTL logging in dgedmd.c, dgedmdq.c, dgeqp3rk.c, dlaqp2rk.c, dlaqp3rk.c Signed-off-by: Venkatesha <vprasada@amd.com> AMD-Internal: [CPUPL-6150] Change-Id: Ia043bdeace2efc61fb25c64e47c2a45bdd8bda9c
Updated test code to display output status as INVALID_PARAM when
1) illegal inputs are passed
2) LAPACK API returns illegal input warning
NOTE: Existing behaviour is to display status = FAIL for these cases.
Modified FLA_TEST_CHECK_EINFO macro.
Signed-off-by: dnikku <Deepika.Nikku@amd.com>
AMD-Internal: CPUPL-6250
Change-Id: I913fbef39f1f2e58142c21765033b2688de4585a
Port Netlib Lapack 3.12 FORTRAN code to C files for single precision APIs Note: Retained lapack-3.11 slaqr5.c, to overcome netlib test failures. Signed-off-by: Venkatesha <vprasada@amd.com> AMD-Internal: [CPUPL-6149] Change-Id: Id4b96fd45e78246640ca681b5b9236bde09cab52
1. Optimized the DNRM2 blas api with avx2 and avx512 intrinsics. AMD Internal : [CPUPL-6122] Change-Id: I8d8822f5a300997bda3cee15b730489892d938f9
-> Removed unused variables. -> Initialized variables where they were not Signed-off-by: Venkatesha <vprasada@amd.com> Change-Id: If6e4b04a1d23b008812afa920efd226ac923e2c1
…iles Port Netlib Lapack-3.12 newly added single precision fortran files to c Files added : sgedmd.c, sgedmdq.c, sgeqp3rk.c, slaqp2rk.c, slaqp3rk.c. Signed-off-by: Venkatesha <vprasada@amd.com> AMD-Internal: [CPUPL-6149] Change-Id: Idea38525527da5893fa61760f58e62953274708e
Regression is observed for small sizes due to multi-threading overhead. Updated the code to take single-threaded path for sizes below 30. AMD-Internal: [CPUPL-6977]
Fixed regression for small sizes caused by thread calculation function invoked even when not required for small sizes. AMD-Internal: CPUPL-6961
Change-Id: I362556afb63548b5bc43aab301255e6e1489f267
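The small-size fix described in this commit can be sketched roughly as below. This is only an illustration, not the libflame code: the function name `choose_num_threads` and the macro `SMALL_SIZE_THRESHOLD` are hypothetical, and applying the "below 30" limit to both dimensions is an assumption, since the commit text does not say which dimension is tested.

```c
/* Hypothetical threshold, taken from the commit text ("sizes below 30"). */
#define SMALL_SIZE_THRESHOLD 30

/* Hypothetical helper: pick a thread count for an m x n problem.
 * Assumption: a problem counts as "small" when both dimensions are below
 * the threshold. Small problems return early so that no thread-count
 * computation or threading overhead is incurred for them. */
int choose_num_threads(int m, int n, int max_threads)
{
    if (m < SMALL_SIZE_THRESHOLD && n < SMALL_SIZE_THRESHOLD)
        return 1; /* single-threaded path */

    return max_threads; /* larger problems may use all available threads */
}
```

The early return is the point of the sketch: the size check comes before any thread calculation, so small inputs never pay for it.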
* Add AI Code Review Self-enablement file This will enable triggering AI Code Review tool when PR is raised
* DGEMV direct call to AOCL-BLAS Kernel Added direct call to AOCL-BLAS kernel for DGEMV to make a single threaded call. For FLAME path, this was done inside 'bl1_dgemv_blas' which is a common wrapper in FLAME path. For DLARFT, direct calls to DGEMV kernels were added by replacing DGEMV calls. DGESVD thresholding for one of the small paths changed. AMD Internal: CPUPL-6998, 6997, 6981
* AOCL LAPACK: dgesvd fix to make singular values positive Added code changes for making singular values non negative and ctest fixes in potri, hetrf AMD-Internal: CPUPL-7326, CPUPL-7316 Signed-off-by: dnikku <Deepika.Nikku@amd.com>
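One standard way to make singular values non-negative, shown here only as a sketch and not as the code in this commit, is to flip the sign of a negative singular value together with the matching row of Vt, which leaves the product U * S * Vt unchanged. The function name and the assumption of a square, column-major Vt are hypothetical.

```c
/* Hypothetical helper: for each negative s[i], negate s[i] and row i of Vt.
 * Assumes Vt is n x n, stored column-major with leading dimension ldvt. */
void make_singular_values_nonnegative(int n, double *s, double *vt, int ldvt)
{
    for (int i = 0; i < n; i++) {
        if (s[i] < 0.0) {
            s[i] = -s[i];
            for (int j = 0; j < n; j++)
                vt[i + j * ldvt] = -vt[i + j * ldvt]; /* element (i, j) */
        }
    }
}
```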
Add unistd.h to test_libflame.c to resolve compilation errors caused by its removal from FLAME.h in the legacy test suite. AMD_Internal: CPUPL-7346 Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
Overflow test specific vector allocation brought under corresponding check to solve memory leak. AMD Internal: CPUPL-7352
- Re-implemented memory leak fixes overwritten by LAPACKE update. - Implemented fixes for use of uninitialized variables in extreme cases. AMD-Internal: CPUPL-6488
This commit adds BRT support for the following APIs: GEBRD, GETRF, GETRF2, GETRFNP, GETRFNPI, GETRI, GETRS, GGEV, GGEVX, GGHRD. An additional logger mode has also been implemented for Jenkins integration. The store and check reproducibility functions have also been refactored into variadic functions. AMD-Internal : CPUPL-7287
* Implement random-init mode in main test * Modified buffer indexing to 64-bit to prevent overflow with large sizes Added random initialization mode for generating random input matrices and skip validation code for faster performance runs AMD-Internal: CPUPL-7435
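To illustrate why 64-bit buffer indexing matters for large sizes, here is a minimal sketch; the helper name `col_major_offset` is hypothetical and not taken from the test suite. With 32-bit arithmetic, the product `j * ld` overflows once it exceeds 2147483647, which already happens for a 50000 x 50000 matrix.

```c
#include <stdint.h>

/* Hypothetical helper: offset of element (i, j) in a column-major buffer
 * with leading dimension ld. Each operand is widened to 64 bits before the
 * multiply, so the product cannot overflow a 32-bit int. */
int64_t col_major_offset(int32_t i, int32_t j, int32_t ld)
{
    return (int64_t)i + (int64_t)j * (int64_t)ld;
}
```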
AOCL-LAPACK - Test suite: Ctest failures in windows -> Fixed Overflow/underflow test failures in windows for potri, getri, getrfnp and getrfnpi -> Fixed Extreme case test failures in trtrs. AMD-Internal: [CPUPL-7411], [CPUPL-7412]
The LF_ISA_CONFIG cmake flag now has new values that let users fix the ISA used by the optimized paths. This overrides dynamic dispatch behaviour and the environment variable AOCL_ENABLE_INSTRUCTIONS. The 2 new values of LF_ISA_CONFIG are avx2-strict and avx512-strict. Example: if LF_ISA_CONFIG=avx2-strict is set during build, all code paths including optimized kernels will use avx2 ISA. AMD-Internal: [CPUPL-7470]
Changed implementation of APIs to use 64-bit integer for all dimensional variables. User facing APIs retain standard signature for compatibility. Only internal implementation changed by creating compatibility layer. Only LAPACK interfaces and FLAME interfaces are changed whereas LAPACKE interface remains the same. All test suites accordingly changed (Netlib, legacy and main test).
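A rough sketch of such a compatibility layer, assuming a build where the public interface uses 32-bit integers: the user-facing entry point keeps its Fortran-style signature and widens the dimensions before calling a 64-bit internal routine. The names `sum_abs_` and `sum_abs_impl` are hypothetical and do not correspond to libflame symbols.

```c
#include <stdint.h>

/* Hypothetical internal routine: all dimensional variables are 64-bit. */
static double sum_abs_impl(int64_t n, const double *x, int64_t incx)
{
    double acc = 0.0;
    for (int64_t k = 0; k < n; k++) {
        double v = x[k * incx];
        acc += (v < 0.0) ? -v : v;
    }
    return acc;
}

/* Hypothetical user-facing entry point: the signature is unchanged
 * (integers passed by pointer), only the internal call uses 64-bit values. */
double sum_abs_(const int *n, const double *x, const int *incx)
{
    return sum_abs_impl((int64_t)*n, x, (int64_t)*incx);
}
```

Callers compiled against the old signature keep working, while index arithmetic inside the implementation is done in 64 bits.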
This PR makes minor modifications to improve compatibility across the codebase.
The changes focus on correcting data type usage and updating error messages for consistency.
Names of elements of complex types made uniform across code base.
The whole code base now uses {real,imag}.
Fixed data type mismatch in BSVD sorting function
Updated error messages to use consistent terminology
Removed unnecessary macro undefining
GEMM calls deleted during 64-bit integer changes added back. Guards for memory allocation also added. Both changes were made in the spffrt2 optimized implementation. Added ctest cases. AMD Internal: CPUPL-7427
…ACK API test drivers (#132) BRT support has been added to the test drivers for the following APIs: gbsvx, gbtf2, getc2, getf2, gtsv, hetrf, hetrf_rook, hetri_rook, hgeqz, hseqr, lange, larf, larfg, lartg, org2r, orgqr, ormlq, ormqr, potf2, potrf, potrf2, potri, potrs, rot, spffrt2, spffrtx, stedc, steqr, stevd, syev, syevd, syevx, sygvd, sytrd, sytrf, sytrf_rook, trtri, trtrs This PR uses AI generated commits as a starting point and 20 API test drivers were written purely using AI. AMD-Internal : CPUPL-7424
AOCL-LAPACK: LAPACKE_dgesv performance regression
Updated code path in dgetrf to overcome the regression.
For 128 < m,n < 160 (multi-thread) -> aocl_lapack_dgetrf2 api
For 80 < m,n < 127 (multi-thread) -> getrf_avx512 path
For 80 < m,n < 160 (single thread) -> aocl_lapack_dgetrf2 api
AMD-Internal: [CPUPL-7469]
Tuned block sizes and thread count for large size optimisations AMD-Internal: CPUPL-7457
Setting number of BLIS threads to 1 for calls to DGEMV. This was done because isolated threading in DGEMV was causing cache misses in subsequent single-threaded code. AMD Internal: CPUPL-6963, 6981, 6983
* DLASWP optimization to address regression for all sizes of DGETRS, DGBTRF. * Created dlaswp_st for single threaded dlaswp with 32 tile size and used in DGBTRF. AMD Internal: CPUPL-7253
Changes in threshold made to include some of the medium sizes in the small path. Direct call to dlasv2 for NN cases avoided as it was causing regression. Guard for malloc in DGELQF small path. AMD Internal: CPUPL-7430
- Fixed regression for medium/large size DORGQR by rolling back optimization for dlarft. Change-Id: Ieb8ea82dbe547833b4d235edb70cb739e00ebf37 AMD-Internal: CPUPL-6962
Collaborator
Thank you for this PR. Is it possible to raise this PR on "dev" branch of the repo? We push latest updates into dev branch in bi-weekly cadence and we can include this as part of the next update.
If AOCL-BLAS is compiled with "auto" on Zen3, it does not include the zen4 functions, so compilation errors occur for the *zen4* function calls. Using the BLIS_KERNELS_ZEN4 define in blis.h, we can make these calls conditional on zen4 support actually being available in AOCL-BLAS.
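A minimal sketch of the kind of guard described here, assuming `BLIS_KERNELS_ZEN4` is defined by blis.h only when the zen4 kernels were built. The kernel and wrapper names (`ddot_generic_kernel`, `ddot_zen4_kernel`, `lf_ddot`) are hypothetical stand-ins, not AOCL-BLAS symbols, and the zen4 body is a placeholder. Passing `-DBLIS_KERNELS_ZEN4` to the compiler simulates a zen4-enabled build.

```c
#include <stddef.h>

/* In a real build this macro would come from blis.h (per the PR text);
 * here it can be supplied with -DBLIS_KERNELS_ZEN4 to simulate that. */

/* Hypothetical portable kernel, always available. */
static double ddot_generic_kernel(size_t n, const double *x, const double *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

#ifdef BLIS_KERNELS_ZEN4
/* Hypothetical zen4 kernel: exists only when the BLAS build provides it. */
static double ddot_zen4_kernel(size_t n, const double *x, const double *y)
{
    return ddot_generic_kernel(n, x, y); /* placeholder body */
}
#endif

/* Wrapper: the zen4 symbol is referenced only when it exists, so a build
 * without zen4 kernels never sees the call and compiles cleanly. */
double lf_ddot(size_t n, const double *x, const double *y)
{
#ifdef BLIS_KERNELS_ZEN4
    return ddot_zen4_kernel(n, x, y);
#else
    return ddot_generic_kernel(n, x, y);
#endif
}
```

Because the choice is made by the preprocessor, a build without the macro contains no reference to the zen4 symbol at all.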
796e010 to d5f93bc
Author
no problem, I've rebased this PR to dev.