Description: miniGMG is a compact benchmark for understanding the performance challenges associated with geometric multigrid solvers found in applications built from AMR MG frameworks like CHOMBO or BoxLib when running on modern multi- and manycore-based supercomputers. It includes both productive reference examples and highly-optimized implementations for CPUs and GPUs. It is sufficiently general that it has been used to evaluate a broad range of research topics, including PGAS programming models and algorithmic tradeoffs inherent in multigrid. miniGMG was developed under the CACHE Joint Math-CS Institute. Note: the miniGMG code has been superseded by HPGMG.
URL: http://crd.lbl.gov/departments/computer-science/PAR/research/previous-projects/miniGMG/
Team: WolfPack
Details of any changes to the Spack recipe used.
We add simde as a new dependency, because we apply our arm/simde optimizations to this application, and add the corresponding simde variant. We also add an opt variant that enables all of the other compiler optimization flags.
Git commit hash of checkout for package:
https://github.com/spack/spack/pull/24926/commits/949b7a644c6677fa6ccf824099b2ec32688000ba
https://github.com/spack/spack/commit/2f3d651b1967050523919a881b883982d2351eeb
Pull request for Spack recipe changes:
spack install -j 64 minigmg@local%[email protected]
$ spack spec -Il minigmg@local%[email protected]
- sstmxbz minigmg@local%[email protected]~debug~opt~simde arch=linux-amzn2-graviton2
[+] zvamksn ^[email protected]%[email protected]~atomics~cuda~cxx~cxx_exceptions+gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker+pmi~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=ofi patches=60ce20bc14d98c572ef7883b9fcd254c3f232c2f3a13377480f96466169ac4c8 schedulers=slurm arch=linux-amzn2-graviton2
[+] cukmqbg ^[email protected]%[email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared arch=linux-amzn2-graviton2
[+] asgtk6a ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] z2uysov ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] ebhjpix ^[email protected]%[email protected]+sigsegv patches=3877ab548f88597ab2327a2230ee048d2d07ace1062efe81fc92e91b7f39cd00,fc9b61654a3ba1a8d6cd78ce087e7c96366c290bc8d2c299f09828d793b853c8 arch=linux-amzn2-graviton2
[+] ltbv6bk ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] s4pw7zm ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 4xr3hhh ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] iyhm3wi ^[email protected]%[email protected]~python arch=linux-amzn2-graviton2
[+] y5ei3cm ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] ye3kcvv ^[email protected]%[email protected]~pic libs=shared,static arch=linux-amzn2-graviton2
[+] qepjcvj ^[email protected]%[email protected]+optimize+pic+shared arch=linux-amzn2-graviton2
[+] iwzirqc ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-amzn2-graviton2
[+] tadxrfp ^[email protected]%[email protected]+openssl arch=linux-amzn2-graviton2
[+] 5i3lgfb ^[email protected]%[email protected]~docs+systemcerts arch=linux-amzn2-graviton2
[+] 4m7exgb ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-amzn2-graviton2
[+] y42m6yr ^[email protected]%[email protected]+cxx~docs+stl patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-amzn2-graviton2
[+] rqrpmap ^[email protected]%[email protected]~debug~pic+shared arch=linux-amzn2-graviton2
[+] 2w7bert ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] wjwqncx ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 3zy7kxk ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 72f5gvk ^[email protected]%[email protected]~debug~kdreg fabrics=sockets,tcp,udp arch=linux-amzn2-graviton2
[+] mhav5gn ^[email protected]%[email protected] patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006 arch=linux-amzn2-graviton2
[+] jkuhz64 ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] xb2w5nc ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] wturp6c ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] ivotdt7 ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] wqpuvmh ^slurm@20-02-4-1%[email protected]~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc arch=linux-amzn2-graviton2
spack install -j 64 minigmg@local%[email protected]
$ spack spec -Il minigmg@local%[email protected]
- 33ilsbt minigmg@local%[email protected]~debug~opt~simde arch=linux-amzn2-graviton2
[+] huifkle ^[email protected]%[email protected]~atomics~cuda~cxx~cxx_exceptions+gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker+pmi~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=ofi patches=60ce20bc14d98c572ef7883b9fcd254c3f232c2f3a13377480f96466169ac4c8 schedulers=slurm arch=linux-amzn2-graviton2
[+] xsh5tug ^[email protected]%[email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared arch=linux-amzn2-graviton2
[+] heo5xlh ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] xcqslvj ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] guhrr3n ^[email protected]%[email protected]+sigsegv patches=3877ab548f88597ab2327a2230ee048d2d07ace1062efe81fc92e91b7f39cd00,fc9b61654a3ba1a8d6cd78ce087e7c96366c290bc8d2c299f09828d793b853c8 arch=linux-amzn2-graviton2
[+] q27ybb5 ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] s6jl232 ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 6eey55q ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 7og6524 ^[email protected]%[email protected]~python arch=linux-amzn2-graviton2
[+] 4fpawwk ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 3uhexv5 ^[email protected]%[email protected]~pic libs=shared,static arch=linux-amzn2-graviton2
[+] kfhtmo3 ^[email protected]%[email protected]+optimize+pic+shared arch=linux-amzn2-graviton2
[+] 5fshnbc ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-amzn2-graviton2
[+] hj5l7x5 ^[email protected]%[email protected]+openssl arch=linux-amzn2-graviton2
[+] b6rhpqo ^[email protected]%[email protected]~docs+systemcerts arch=linux-amzn2-graviton2
[+] aoyzxyq ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-amzn2-graviton2
[+] rd3hv7n ^[email protected]%[email protected]+cxx~docs+stl patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-amzn2-graviton2
[+] qaavobd ^[email protected]%[email protected]~debug~pic+shared arch=linux-amzn2-graviton2
[+] qchmimy ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] jbenr5m ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 7fjq32x ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] v75lszn ^[email protected]%[email protected]~debug~kdreg fabrics=sockets,tcp,udp arch=linux-amzn2-graviton2
[+] 325gh7i ^[email protected]%[email protected] patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006 arch=linux-amzn2-graviton2
[+] mbkv7qv ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] toijtok ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 7cmi2lb ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] qytqrqe ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] uxllonc ^slurm@20-02-4-1%[email protected]~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc arch=linux-amzn2-graviton2
spack install -j 64 minigmg@local%[email protected]
$ spack spec -Il minigmg@local%[email protected]
- fjnn2h7 minigmg@local%[email protected]~debug~opt~simde arch=linux-amzn2-graviton2
[+] krxyvbc ^[email protected]%[email protected]~atomics~cuda~cxx~cxx_exceptions+gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker+pmi~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=ofi patches=60ce20bc14d98c572ef7883b9fcd254c3f232c2f3a13377480f96466169ac4c8,fba0d3a784a9723338722b48024a22bb32f6a951db841a4e9f08930a93f41d7a schedulers=slurm arch=linux-amzn2-graviton2
[+] jroqews ^[email protected]%[email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared arch=linux-amzn2-graviton2
[+] e4m4ued ^[email protected]%[email protected] patches=6e08dc445ece06e9e8b1344397f2d3f169005703ddc0f2ae24f366cde78c7377 arch=linux-amzn2-graviton2
[+] kk4ax3i ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 6c4kz5g ^[email protected]%[email protected]+sigsegv patches=3877ab548f88597ab2327a2230ee048d2d07ace1062efe81fc92e91b7f39cd00,5746cf51f45b405661c3edae7a78c33d41e54d83f635d16e2bf1f956dbfbf635,fc9b61654a3ba1a8d6cd78ce087e7c96366c290bc8d2c299f09828d793b853c8 arch=linux-amzn2-graviton2
[+] pa6wm5j ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] vtiml6g ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] 4imdwuy ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] wo4l72s ^[email protected]%[email protected]~python patches=05ff238cf435825ef835c7ae39376b52dc83d8caf19e962f0766c841386a305a,10a88ad47f9797cf7cf2d7d07241f665a3b6d1f31fa026728c8c2ae93e1664e9 arch=linux-amzn2-graviton2
[+] r7mmkdp ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] br733tn ^[email protected]%[email protected]~pic libs=shared,static arch=linux-amzn2-graviton2
[+] 4js6ect ^[email protected]%[email protected]+optimize+pic+shared arch=linux-amzn2-graviton2
[+] asgm7mt ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-amzn2-graviton2
[+] uttaumr ^[email protected]%[email protected]+openssl arch=linux-amzn2-graviton2
[+] j2qhi7h ^[email protected]%[email protected]~docs+systemcerts arch=linux-amzn2-graviton2
[+] gn4fgp5 ^[email protected]%[email protected]+cpanm+shared+threads patches=21cf6a73cec16760f8de2e8895ace1299aff2d8e92dc581cd18f1d95a4503048 arch=linux-amzn2-graviton2
[+] 5uyf3k4 ^[email protected]%[email protected]+cxx~docs+stl patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-amzn2-graviton2
[+] wsi7g3j ^[email protected]%[email protected]~debug~pic+shared arch=linux-amzn2-graviton2
[+] s4mb5no ^[email protected]%[email protected] patches=6e42dc243f17aab29fd167f060f5bc1f08813e03368eb301b43c95d4b1386681 arch=linux-amzn2-graviton2
[+] m2wdbeo ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] zori3wf ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] xl6zavq ^[email protected]%[email protected]~debug~kdreg fabrics=sockets,tcp,udp arch=linux-amzn2-graviton2
[+] 5yq4tpw ^[email protected]%[email protected] patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006 arch=linux-amzn2-graviton2
[+] fo57byt ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] gmd4264 ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] cl3ohqo ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] yvqpq74 ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+] zehhooy ^slurm@20-02-4-1%[email protected]~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc arch=linux-amzn2-graviton2
reframe -c benchmark_v_test1_gcc.py -v -r --performance-report --keep-stage-files
reframe -c benchmark_v_test2_gcc.py -v -r --performance-report --keep-stage-files
reframe -c benchmark_v_test3_gcc.py -v -r --performance-report --keep-stage-files
reframe -c benchmark_v_test4_gcc.py -v -r --performance-report --keep-stage-files
miniGMG reduces the norm until it is less than 1e-15. If the norm is still greater than 1e-15 after maxVCycles v-cycles, the program ends with incorrect results. We therefore check that all the norms produced by the last v-cycles are less than 1e-15.
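The check described above can be sketched in a few lines of Python. This is only an illustration, not the attached ReFrame script: the `norm=<value>` output format, the `final_norms_converged` helper and its `last_n` parameter are assumptions made for the example.

```python
import re

NORM_TOLERANCE = 1e-15  # threshold stated in the text

# Assumed (hypothetical) output format: each v-cycle line contains "norm=<value>".
NORM_PATTERN = re.compile(r"norm\s*=\s*([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)")


def final_norms_converged(output: str, last_n: int = 1) -> bool:
    """Return True if the last `last_n` reported norms are all below the tolerance."""
    norms = [float(value) for value in NORM_PATTERN.findall(output)]
    if len(norms) < last_n:
        # Not enough norms were reported: treat the run as failed.
        return False
    return all(norm < NORM_TOLERANCE for norm in norms[-last_n:])


sample = (
    "v-cycle 1 norm=3.2e-04\n"
    "v-cycle 2 norm=7.1e-10\n"
    "v-cycle 3 norm=4.0e-16\n"
)
print(final_norms_converged(sample))            # only the final norm is checked
print(final_norms_converged(sample, last_n=2))  # the second-to-last norm is too large
```

Treating a run that reports no norms as a failure keeps a crashed or truncated job from passing the check by accident.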
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
==============================================================================
PERFORMANCE REPORT
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_1
- aws:c6gn
- builtin
* num_tasks: 1
* Total Time: 127.3 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_2
- builtin
* num_tasks: 1
* Total Time: 84.14 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_4
- builtin
* num_tasks: 1
* Total Time: 52.06 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_8
- builtin
* num_tasks: 1
* Total Time: 41.18 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_16
- builtin
* num_tasks: 1
* Total Time: 51.58 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_32
- builtin
* num_tasks: 1
* Total Time: 82.87 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_64
- builtin
* num_tasks: 1
* Total Time: 131.02 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_1
- aws:c6gn
- builtin
* num_tasks: 1
* Total Time: 124.5 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_2
- builtin
* num_tasks: 1
* Total Time: 83.3 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_4
- builtin
* num_tasks: 1
* Total Time: 52.5 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_8
- builtin
* num_tasks: 1
* Total Time: 35.18 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_16
- builtin
* num_tasks: 1
* Total Time: 25.2 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_32
- builtin
* num_tasks: 1
* Total Time: 18.27 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_64
- builtin
* num_tasks: 1
* Total Time: 11.56 s
------------------------------------------------------------------------------
arm: [email protected]
gcc: [email protected]
Performance comparison of two compilers.
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
Cores | arm | gcc |
---|---|---|
8 | 4.35 s | 6.31 s |
16 | 3.58 s | 7.81 s |
32 | 3.71 s | 11.39 s |
64 | 3.91 s | 19.33 s |
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
Cores | arm | gcc |
---|---|---|
8 | 35.18 s | 41.18 s |
16 | 25.2 s | 51.58 s |
32 | 18.27 s | 82.87 s |
64 | 11.56 s | 131.02 s |
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
Cores | arm | gcc |
---|---|---|
8 | 273.77 s | 314.23 s |
16 | 199.31 s | 418.16 s |
32 | 134.02 s | 686.54 s |
64 | 81.45 s | 1097.58 s |
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
Cores | arm | gcc |
---|---|---|
8 | 535.99 s | 789.04 s |
16 | 520.7 s | 1048.72 s |
32 | 375.39 s | 1906.81 s |
64 | 269.4 s | 2499.68 s |
List of top-10 functions / code locations from a serial profile.
Profiling script:
ReFrame Benchmark
Profiling script:
ReFrame Benchmark
Profiling script:
ReFrame Benchmark
Profiling script:
ReFrame Benchmark
List of top-10 functions / code locations from a full node profile.
Profiling script:
ReFrame Benchmark
Profiling script:
ReFrame Benchmark
Profiling script:
ReFrame Benchmark
Profiling script:
ReFrame Benchmark
Profiling script:
ReFrame Benchmark
Profiling script:
ReFrame Benchmark
Profiling script:
ReFrame Benchmark
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
Cores | arm | gcc |
---|---|---|
8 | 4.35 s | 6.31 s |
16 | 3.58 s | 7.81 s |
32 | 3.71 s | 11.39 s |
64 | 3.91 s | 19.33 s |
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
Cores | arm | gcc |
---|---|---|
8 | 35.18 s | 41.18 s |
16 | 25.2 s | 51.58 s |
32 | 18.27 s | 82.87 s |
64 | 11.56 s | 131.02 s |
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
Cores | arm | gcc |
---|---|---|
8 | 273.77 s | 314.23 s |
16 | 199.31 s | 418.16 s |
32 | 134.02 s | 686.54 s |
64 | 81.45 s | 1097.58 s |
arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark
Cores | arm | gcc |
---|---|---|
8 | 535.99 s | 789.04 s |
16 | 520.7 s | 1048.72 s |
32 | 375.39 s | 1906.81 s |
64 | 269.4 s | 2499.68 s |
arm script:
ReFrame Benchmark 1
ReFrame Benchmark 2
gcc script:
ReFrame Benchmark 1
ReFrame Benchmark 2
Nodes | Cores | arm | gcc |
---|---|---|---|
1 | 32 | 4.35 s | 11.39 s |
1 | 64 | 3.58 s | 19.33 s |
2 | 128 | 60.24 s | 11.34 s |
4 | 256 | 63.33 s | 7.67 s |
arm script:
ReFrame Benchmark 1
ReFrame Benchmark 2
gcc script:
ReFrame Benchmark 1
ReFrame Benchmark 2
Nodes | Cores | arm | gcc |
---|---|---|---|
1 | 32 | 35.18 s | 41.18 s |
1 | 64 | 25.2 s | 51.58 s |
2 | 128 | 7.67 s | 68.77 s |
4 | 256 | 66.17 s | 37.38 s |
arm script:
ReFrame Benchmark 1
ReFrame Benchmark 2
gcc script:
ReFrame Benchmark 1
ReFrame Benchmark 2
Nodes | Cores | arm | gcc |
---|---|---|---|
1 | 32 | 273.77 s | 314.23 s |
1 | 64 | 199.31 s | 418.16 s |
2 | 128 | 46.77 s | 559.38 s |
4 | 256 | 29.05 s | 283.53 s |
arm script:
ReFrame Benchmark 1
ReFrame Benchmark 2
gcc script:
ReFrame Benchmark 1
ReFrame Benchmark 2
Nodes | Cores | arm | gcc |
---|---|---|---|
1 | 32 | 535.99 s | 789.04 s |
1 | 64 | 520.7 s | 1048.72 s |
2 | 128 | 147.97 s | 1879.59 s |
4 | 256 | 82.66 s | 942.39 s |
On-node scaling study for two architectures.
Compiler: gcc
X86 script:
ReFrame Benchmark
Cores | C6gn (ARM) | C5n (X86) |
---|---|---|
8 | 6.31 s | 19.91 s |
16 | 7.81 s | 22.06 s |
32 | 11.39 s | 30.28 s |
64 | 19.33 s | 46.11 s |
X86 script:
ReFrame Benchmark
Cores | C6gn (ARM) | C5n (X86) |
---|---|---|
8 | 41.18 s | 72.85 s |
16 | 51.58 s | 84.45 s |
32 | 82.87 s | 142.46 s |
64 | 131.02 s | 252.93 s |
X86 script:
ReFrame Benchmark
Cores | C6gn (ARM) | C5n (X86) |
---|---|---|
8 | 314.23 s | 543.84 s |
16 | 418.16 s | 640.72 s |
32 | 686.54 s | 1087.96 s |
64 | 1097.58 s | 2025.36 s |
X86 script:
ReFrame Benchmark
Cores | C6gn (ARM) | C5n (X86) |
---|---|---|
8 | 789.04 s | 1305.74 s |
16 | 1048.72 s | 1915.15 s |
32 | 1906.81 s | 3563.73 s |
64 | 2499.68 s | 7220.35 s |
Off-node scaling study for two architectures. Compiler: gcc
X86 script:
ReFrame Benchmark
ReFrame Benchmark
Nodes | Cores | C6gn (ARM) | C5n (X86) |
---|---|---|---|
1 | 32 | 11.39 s | 30.28 s |
1 | 64 | 19.33 s | 46.11 s |
2 | 128 | 11.34 s | 33.52 s |
4 | 256 | 7.67 s | 26.4 s |
X86 script:
ReFrame Benchmark
ReFrame Benchmark
Nodes | Cores | C6gn (ARM) | C5n (X86) |
---|---|---|---|
1 | 32 | 82.87 s | 142.46 s |
1 | 64 | 131.02 s | 252.93 s |
2 | 128 | 68.77 s | 141.92 s |
4 | 256 | 37.38 s | 82.26 s |
X86 script:
ReFrame Benchmark
ReFrame Benchmark
Nodes | Cores | C6gn (ARM) | C5n (X86) |
---|---|---|---|
1 | 32 | 686.54 s | 1087.96 s |
1 | 64 | 1097.58 s | 2025.36 s |
2 | 128 | 559.38 s | 1027.75 s |
4 | 256 | 283.53 s | 526.23 s |
X86 script:
ReFrame Benchmark
ReFrame Benchmark
Nodes | Cores | C6gn (ARM) | C5n (X86) |
---|---|---|---|
1 | 32 | 1906.81 s | 3563.73 s |
1 | 64 | 2499.68 s | 7220.35 s |
2 | 128 | 1879.59 s | 3421.94 s |
4 | 256 | 942.39 s | 1732.57 s |
Details of steps taken to optimise performance of the application. Please document work with compiler flags, maths libraries, system libraries, code optimisations, etc.
Compiler flags before:
CFLAGS= -O3 -fopenmp -lm -Daarch64 -D__MPI -D__COLLABORATIVE_THREADING=6 -D__TEST_MG_CONVERGENCE -D__PRINT_NORM -D__USE_BICGSTAB
FFLAGS=
Compiler flags after:
CFLAGS= -O3 -Ofast -fopenmp -lm -Daarch64 -D__MPI -D__TEST_MG_CONVERGENCE -D__PREFETCH_NEXT_PLANE_FROM_DRAM -D__FUSION_RESIDUAL_RESTRICTION -D__PRINT_NORM -D__USE_BICGSTAB
FFLAGS=
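To make the difference between the two CFLAGS lines explicit, the following small Python sketch compares them token by token. The flag strings are copied from the lines above; the `flag_changes` helper is introduced here only for illustration.

```python
before = (
    "-O3 -fopenmp -lm -Daarch64 -D__MPI -D__COLLABORATIVE_THREADING=6 "
    "-D__TEST_MG_CONVERGENCE -D__PRINT_NORM -D__USE_BICGSTAB"
).split()

after = (
    "-O3 -Ofast -fopenmp -lm -Daarch64 -D__MPI -D__TEST_MG_CONVERGENCE "
    "-D__PREFETCH_NEXT_PLANE_FROM_DRAM -D__FUSION_RESIDUAL_RESTRICTION "
    "-D__PRINT_NORM -D__USE_BICGSTAB"
).split()


def flag_changes(old, new):
    """Return (added, removed) flags, preserving the order in which they appear."""
    added = [flag for flag in new if flag not in old]
    removed = [flag for flag in old if flag not in new]
    return added, removed


added, removed = flag_changes(before, after)
print("added:  ", " ".join(added))
print("removed:", " ".join(removed))
```

Besides the three added flags, the comparison shows that -D__COLLABORATIVE_THREADING=6 is no longer set in the new configuration.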
The following table is based on [email protected] with Test Case 3. The speedup is up to 14.67X when there are 64 threads.
Cores | Original Flags (s) | New Flags (s) | Speedup |
---|---|---|---|
1 | 1060.55 | 1030.97 | 1.03 |
2 | 643.84 | 619.43 | 1.04 |
4 | 401.97 | 371.58 | 1.08 |
8 | 314.23 | 270.56 | 1.16 |
16 | 418.16 | 146.35 | 2.86 |
32 | 686.54 | 105.15 | 6.53 |
64 | 1097.58 | 74.81 | 14.67 |
The following table is based on [email protected] with Test Case 3. The speedup reaches 23.84X with 64 threads, which corresponds to a 95.81% reduction in runtime.
Cores | Original (s) | Optimization (s) | Speedup |
---|---|---|---|
1 | 1060.55 | 720.42 | 1.47 |
2 | 643.84 | 349.17 | 1.84 |
4 | 401.97 | 233.63 | 1.72 |
8 | 314.23 | 144.44 | 2.18 |
16 | 418.16 | 91.6 | 4.57 |
32 | 686.54 | 64.02 | 10.72 |
64 | 1097.58 | 46.04 | 23.84 |
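The speedup and runtime-reduction figures follow directly from the two timing columns. A minimal Python sketch of the arithmetic, using the 64-core row of the table above (the function names are ours, chosen for the example):

```python
def speedup(t_original: float, t_optimized: float) -> float:
    """Speedup factor: original runtime divided by optimized runtime."""
    return t_original / t_optimized


def runtime_reduction_pct(t_original: float, t_optimized: float) -> float:
    """Percentage of the original runtime that was eliminated."""
    return (1.0 - t_optimized / t_original) * 100.0


# 64-core row of the table above (seconds).
original_64 = 1097.58
optimized_64 = 46.04

print(f"speedup:   {speedup(original_64, optimized_64):.2f}x")                # 23.84x
print(f"reduction: {runtime_reduction_pct(original_64, optimized_64):.2f}%")  # 95.81%
```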
From 04ff2d513531aaf60871d306d19d4f0a326aa8a0 Mon Sep 17 00:00:00 2001
From: dolanzhao <[email protected]>
Date: Fri, 16 Jul 2021 17:11:38 +0000
Subject: [PATCH 5/5] support simd
---
operators.avx.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/operators.avx.c b/operators.avx.c
index fee85d4..22ab6fd 100755
--- a/operators.avx.c
+++ b/operators.avx.c
@@ -8,7 +8,12 @@
#include <stdint.h>
#include <string.h>
#include <math.h>
-#include <immintrin.h>
+// #include <immintrin.h>
+#define SIMDE_ENABLE_NATIVE_ALIASES
+#define SIMDE_X86_SSE_ENABLE_NATIVE_ALIASES
+#define _MM_HINT_T0 1
+#define _MM_HINT_T1 2
+#include "simde/x86/avx2.h"
//------------------------------------------------------------------------------------------------------------------------------
#include "timer.h"
#include "defines.h"
--
2.32.0
-Ofast -D__PREFETCH_NEXT_PLANE_FROM_DRAM -D__FUSION_RESIDUAL_RESTRICTION
We have attached the spack package.py.
We use the following command to install our optimized version.
spack install -j 64 minigmg@local+simde+opt
We have attached the reframe script for the optimized version.
Porting miniGMG to ARM requires some adaptation of the code. We changed the rdtsc() function in timer.x86.c from x86 assembly to aarch64 assembly so that the code compiles. Furthermore, our build enables both OpenMP and MPI. We found issues in both Spack and the nvhpc compiler. Spack did not provide a dedicated build option for the arm compiler on graviton2, only a general aarch64 option; we added an arm compiler section for graviton2 and submitted it to archspec-json, where our PR has been merged. With nvhpc, we found that it produces incorrect code for OpenMP programs running with more than 1 thread, while it works correctly for sequential and MPI (with no threads) programs. Given the short time frame, we didn’t pinpoint the exact bug in nvhpc code generation, but the finding motivates investigating its OpenMP implementation and checking whether it exactly follows the OpenMP standard.
We have observed very interesting performance insights in miniGMG. When there are 4 cores or fewer, ARM21 and GCC have comparable performance (for example, 52.5 s versus 52.06 s on 4 cores in the performance report above). However, ARM21 gives much better performance than GCC when the execution scales to 8 or more cores (11.56 s versus 131.02 s on 64 cores in the same report). Moreover, the ARM server gives much better performance than the Intel server.
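The widening gap between the two compilers can be quantified from the on-node timings reported earlier. The sketch below uses the third arm/gcc table (its gcc column matches the Test Case 3 baseline used in the optimization tables); the variable names are ours.

```python
# On-node timings in seconds, copied from the third arm/gcc table above.
cores = [8, 16, 32, 64]
arm_times = [273.77, 199.31, 134.02, 81.45]
gcc_times = [314.23, 418.16, 686.54, 1097.58]

# How many times slower the gcc build is than the arm build at each core count.
gcc_over_arm = {c: g / a for c, a, g in zip(cores, arm_times, gcc_times)}

# Scaling of each build relative to its own 8-core run (values > 1 mean faster).
arm_scaling = {c: arm_times[0] / a for c, a in zip(cores, arm_times)}
gcc_scaling = {c: gcc_times[0] / g for c, g in zip(cores, gcc_times)}

for c in cores:
    print(
        f"{c:>2} cores: gcc/arm = {gcc_over_arm[c]:.2f}, "
        f"arm vs 8 cores = {arm_scaling[c]:.2f}, "
        f"gcc vs 8 cores = {gcc_scaling[c]:.2f}"
    )
```

In these numbers the arm build keeps getting faster as cores are added, while the gcc build gets slower beyond 8 cores, which is why the ratio between them grows with the core count.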
We have tested multiple optimization strategies, including:
- (1) using SIMD instructions on aarch64 (Neon)
- (2) using -Ofast (especially the fast math library)
- (3) enabling an algorithmic optimization (residual operator fusion)
- (4) using an active waiting policy for OpenMP threads.
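As an illustration of strategy (4), the standard OpenMP environment variable OMP_WAIT_POLICY can request active (spinning) waits. The text does not state how the policy was configured in our runs, so the following Python sketch is an assumption; `openmp_env` is a hypothetical helper that builds the environment for a launched process.

```python
import os


def openmp_env(num_threads: int, base_env=None) -> dict:
    """Return a copy of the environment with OpenMP thread settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    # Standard OpenMP variable: waiting threads spin instead of sleeping.
    env["OMP_WAIT_POLICY"] = "active"
    return env


env = openmp_env(64, base_env={})
print(env)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the benchmark binary. Active waiting trades idle CPU time for lower wake-up latency, which tends to help when threads synchronize often.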