miniGMG

Description: miniGMG is a compact benchmark for understanding the performance challenges associated with geometric multigrid solvers found in applications built from AMR MG frameworks like CHOMBO or BoxLib when running on modern multi- and manycore-based supercomputers. It includes both productive reference examples and highly optimized implementations for CPUs and GPUs. It is sufficiently general that it has been used to evaluate a broad range of research topics, including PGAS programming models and algorithmic tradeoffs inherent in multigrid. miniGMG was developed under the CACHE Joint Math-CS Institute. Note that the miniGMG code has been superseded by HPGMG.

URL: http://crd.lbl.gov/departments/computer-science/PAR/research/previous-projects/miniGMG/

Team: WolfPack

Compilation

Spack Package Modification

Details of any changes to the Spack recipe used.

We add a new dependency for the arm/simde optimizations we apply to this application, along with a corresponding simde variant. A second variant, opt, enables all the other compiler optimization flags.

Git commit hash of checkout for package:

https://github.com/spack/spack/pull/24926/commits/949b7a644c6677fa6ccf824099b2ec32688000ba

https://github.com/spack/spack/commit/2f3d651b1967050523919a881b883982d2351eeb

Pull request for Spack recipe changes:

spack/spack#24926

Building miniGMG

GCC 10.3.0

$ spack install -j 64 minigmg@local%gcc@10.3.0
$ spack spec -Il minigmg@local%gcc@10.3.0

 -   sstmxbz  minigmg@local%gcc@10.3.0~debug~opt~simde arch=linux-amzn2-graviton2
[+]  zvamksn      ^[email protected]%[email protected]~atomics~cuda~cxx~cxx_exceptions+gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker+pmi~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=ofi patches=60ce20bc14d98c572ef7883b9fcd254c3f232c2f3a13377480f96466169ac4c8 schedulers=slurm arch=linux-amzn2-graviton2
[+]  cukmqbg          ^[email protected]%[email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared arch=linux-amzn2-graviton2
[+]  asgtk6a              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  z2uysov                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  ebhjpix                      ^[email protected]%[email protected]+sigsegv patches=3877ab548f88597ab2327a2230ee048d2d07ace1062efe81fc92e91b7f39cd00,fc9b61654a3ba1a8d6cd78ce087e7c96366c290bc8d2c299f09828d793b853c8 arch=linux-amzn2-graviton2
[+]  ltbv6bk                          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  s4pw7zm                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  4xr3hhh                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  iyhm3wi              ^[email protected]%[email protected]~python arch=linux-amzn2-graviton2
[+]  y5ei3cm                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  ye3kcvv                  ^[email protected]%[email protected]~pic libs=shared,static arch=linux-amzn2-graviton2
[+]  qepjcvj                  ^[email protected]%[email protected]+optimize+pic+shared arch=linux-amzn2-graviton2
[+]  iwzirqc              ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-amzn2-graviton2
[+]  tadxrfp          ^[email protected]%[email protected]+openssl arch=linux-amzn2-graviton2
[+]  5i3lgfb              ^[email protected]%[email protected]~docs+systemcerts arch=linux-amzn2-graviton2
[+]  4m7exgb                  ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-amzn2-graviton2
[+]  y42m6yr                      ^[email protected]%[email protected]+cxx~docs+stl patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-amzn2-graviton2
[+]  rqrpmap                      ^[email protected]%[email protected]~debug~pic+shared arch=linux-amzn2-graviton2
[+]  2w7bert                          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  wjwqncx                      ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  3zy7kxk                          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  72f5gvk          ^[email protected]%[email protected]~debug~kdreg fabrics=sockets,tcp,udp arch=linux-amzn2-graviton2
[+]  mhav5gn          ^[email protected]%[email protected] patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006 arch=linux-amzn2-graviton2
[+]  jkuhz64              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  xb2w5nc              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  wturp6c          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  ivotdt7              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  wqpuvmh          ^slurm@20-02-4-1%[email protected]~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc arch=linux-amzn2-graviton2

ARM 21.0.0.879

$ spack install -j 64 minigmg@local%arm@21.0.0.879
$ spack spec -Il minigmg@local%arm@21.0.0.879

 -   33ilsbt  minigmg@local%arm@21.0.0.879~debug~opt~simde arch=linux-amzn2-graviton2
[+]  huifkle      ^[email protected]%[email protected]~atomics~cuda~cxx~cxx_exceptions+gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker+pmi~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=ofi patches=60ce20bc14d98c572ef7883b9fcd254c3f232c2f3a13377480f96466169ac4c8 schedulers=slurm arch=linux-amzn2-graviton2
[+]  xsh5tug          ^[email protected]%[email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared arch=linux-amzn2-graviton2
[+]  heo5xlh              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  xcqslvj                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  guhrr3n                      ^[email protected]%[email protected]+sigsegv patches=3877ab548f88597ab2327a2230ee048d2d07ace1062efe81fc92e91b7f39cd00,fc9b61654a3ba1a8d6cd78ce087e7c96366c290bc8d2c299f09828d793b853c8 arch=linux-amzn2-graviton2
[+]  q27ybb5                          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  s6jl232                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  6eey55q                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  7og6524              ^[email protected]%[email protected]~python arch=linux-amzn2-graviton2
[+]  4fpawwk                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  3uhexv5                  ^[email protected]%[email protected]~pic libs=shared,static arch=linux-amzn2-graviton2
[+]  kfhtmo3                  ^[email protected]%[email protected]+optimize+pic+shared arch=linux-amzn2-graviton2
[+]  5fshnbc              ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-amzn2-graviton2
[+]  hj5l7x5          ^[email protected]%[email protected]+openssl arch=linux-amzn2-graviton2
[+]  b6rhpqo              ^[email protected]%[email protected]~docs+systemcerts arch=linux-amzn2-graviton2
[+]  aoyzxyq                  ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-amzn2-graviton2
[+]  rd3hv7n                      ^[email protected]%[email protected]+cxx~docs+stl patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-amzn2-graviton2
[+]  qaavobd                      ^[email protected]%[email protected]~debug~pic+shared arch=linux-amzn2-graviton2
[+]  qchmimy                          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  jbenr5m                      ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  7fjq32x                          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  v75lszn          ^[email protected]%[email protected]~debug~kdreg fabrics=sockets,tcp,udp arch=linux-amzn2-graviton2
[+]  325gh7i          ^[email protected]%[email protected] patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006 arch=linux-amzn2-graviton2
[+]  mbkv7qv              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  toijtok              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  7cmi2lb          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  qytqrqe              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  uxllonc          ^slurm@20-02-4-1%[email protected]~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc arch=linux-amzn2-graviton2

NVHPC 21.2

$ spack install -j 64 minigmg@local%nvhpc@21.2
$ spack spec -Il minigmg@local%nvhpc@21.2

 -   fjnn2h7  minigmg@local%nvhpc@21.2~debug~opt~simde arch=linux-amzn2-graviton2
[+]  krxyvbc      ^[email protected]%[email protected]~atomics~cuda~cxx~cxx_exceptions+gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker+pmi~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=ofi patches=60ce20bc14d98c572ef7883b9fcd254c3f232c2f3a13377480f96466169ac4c8,fba0d3a784a9723338722b48024a22bb32f6a951db841a4e9f08930a93f41d7a schedulers=slurm arch=linux-amzn2-graviton2
[+]  jroqews          ^[email protected]%[email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared arch=linux-amzn2-graviton2
[+]  e4m4ued              ^[email protected]%[email protected] patches=6e08dc445ece06e9e8b1344397f2d3f169005703ddc0f2ae24f366cde78c7377 arch=linux-amzn2-graviton2
[+]  kk4ax3i                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  6c4kz5g                      ^[email protected]%[email protected]+sigsegv patches=3877ab548f88597ab2327a2230ee048d2d07ace1062efe81fc92e91b7f39cd00,5746cf51f45b405661c3edae7a78c33d41e54d83f635d16e2bf1f956dbfbf635,fc9b61654a3ba1a8d6cd78ce087e7c96366c290bc8d2c299f09828d793b853c8 arch=linux-amzn2-graviton2
[+]  pa6wm5j                          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  vtiml6g                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  4imdwuy                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  wo4l72s              ^[email protected]%[email protected]~python patches=05ff238cf435825ef835c7ae39376b52dc83d8caf19e962f0766c841386a305a,10a88ad47f9797cf7cf2d7d07241f665a3b6d1f31fa026728c8c2ae93e1664e9 arch=linux-amzn2-graviton2
[+]  r7mmkdp                  ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  br733tn                  ^[email protected]%[email protected]~pic libs=shared,static arch=linux-amzn2-graviton2
[+]  4js6ect                  ^[email protected]%[email protected]+optimize+pic+shared arch=linux-amzn2-graviton2
[+]  asgm7mt              ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-amzn2-graviton2
[+]  uttaumr          ^[email protected]%[email protected]+openssl arch=linux-amzn2-graviton2
[+]  j2qhi7h              ^[email protected]%[email protected]~docs+systemcerts arch=linux-amzn2-graviton2
[+]  gn4fgp5                  ^[email protected]%[email protected]+cpanm+shared+threads patches=21cf6a73cec16760f8de2e8895ace1299aff2d8e92dc581cd18f1d95a4503048 arch=linux-amzn2-graviton2
[+]  5uyf3k4                      ^[email protected]%[email protected]+cxx~docs+stl patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-amzn2-graviton2
[+]  wsi7g3j                      ^[email protected]%[email protected]~debug~pic+shared arch=linux-amzn2-graviton2
[+]  s4mb5no                          ^[email protected]%[email protected] patches=6e42dc243f17aab29fd167f060f5bc1f08813e03368eb301b43c95d4b1386681 arch=linux-amzn2-graviton2
[+]  m2wdbeo                      ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  zori3wf                          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  xl6zavq          ^[email protected]%[email protected]~debug~kdreg fabrics=sockets,tcp,udp arch=linux-amzn2-graviton2
[+]  5yq4tpw          ^[email protected]%[email protected] patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006 arch=linux-amzn2-graviton2
[+]  fo57byt              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  gmd4264              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  cl3ohqo          ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  yvqpq74              ^[email protected]%[email protected] arch=linux-amzn2-graviton2
[+]  zehhooy          ^slurm@20-02-4-1%[email protected]~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc arch=linux-amzn2-graviton2

Test Case 1

ReFrame Benchmark 1

reframe -c benchmark_v_test1_gcc.py -v -r --performance-report --keep-stage-files

Test Case 2

ReFrame Benchmark 2

reframe -c benchmark_v_test2_gcc.py -v -r --performance-report --keep-stage-files

Test Case 3

ReFrame Benchmark 3

reframe -c benchmark_v_test3_gcc.py -v -r --performance-report --keep-stage-files

Test Case 4

ReFrame Benchmark 4

reframe -c benchmark_v_test4_gcc.py -v -r --performance-report --keep-stage-files

Validation

miniGMG reduces the norm until it is less than 1e-15. If the norm is still greater than 1e-15 after maxVCycles v-cycles, the program ends with incorrect results. We therefore check that all the norms produced by the last v-cycles are less than 1e-15.
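The check described above could be sketched as follows. This is only an illustration: the line format matched by NORM_LINE ("v-cycle= N, norm=X") and the function names are assumptions, not taken from the attached ReFrame scripts, and the pattern would need adjusting to the actual output of the build.

```python
import re

# Assumed (hypothetical) output format: "v-cycle= 3, norm=1.2e-16"
NORM_LINE = re.compile(
    r"v-cycle=\s*(\d+),\s*norm=\s*([-+]?\d+\.?\d*(?:[eE][-+]?\d+)?)"
)

def final_norms(stdout):
    """Return the norm of the last v-cycle of every solve found in stdout."""
    finals = []
    prev_cycle = None
    prev_norm = None
    for match in NORM_LINE.finditer(stdout):
        cycle, norm = int(match.group(1)), float(match.group(2))
        # A cycle counter that does not increase marks the start of a new solve.
        if prev_cycle is not None and cycle <= prev_cycle:
            finals.append(prev_norm)
        prev_cycle, prev_norm = cycle, norm
    if prev_norm is not None:
        finals.append(prev_norm)
    return finals

def converged(stdout, tol=1e-15):
    """True when at least one solve was found and every final norm is below tol."""
    norms = final_norms(stdout)
    return bool(norms) and all(n < tol for n in norms)
```

Treating an output with no norm lines as a failure is a deliberate choice in this sketch, so that a crashed run is not reported as valid.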

Test Case 1

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Test Case 2

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Test Case 3

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Test Case 4

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

ReFrame Output

==============================================================================
PERFORMANCE REPORT
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_1
- aws:c6gn
   - builtin
      * num_tasks: 1
      * Total Time: 127.3 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_2
   - builtin
      * num_tasks: 1
      * Total Time: 84.14 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_4
   - builtin
      * num_tasks: 1
      * Total Time: 52.06 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_8
   - builtin
      * num_tasks: 1
      * Total Time: 41.18 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_16
   - builtin
      * num_tasks: 1
      * Total Time: 51.58 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_32
   - builtin
      * num_tasks: 1
      * Total Time: 82.87 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_gcc_minigmg_local_gcc_10_3_0_N_1_MPI_1_OMP_64
   - builtin
      * num_tasks: 1
      * Total Time: 131.02 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_1
- aws:c6gn
   - builtin
      * num_tasks: 1
      * Total Time: 124.5 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_2
   - builtin
      * num_tasks: 1
      * Total Time: 83.3 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_4
   - builtin
      * num_tasks: 1
      * Total Time: 52.5 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_8
   - builtin
      * num_tasks: 1
      * Total Time: 35.18 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_16
   - builtin
      * num_tasks: 1
      * Total Time: 25.2 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_32
   - builtin
      * num_tasks: 1
      * Total Time: 18.27 s
------------------------------------------------------------------------------
miniGMG_minigmg_on_node_test2_arm_minigmg_local_arm_21_0_0_879_N_1_MPI_1_OMP_64
   - builtin
      * num_tasks: 1
      * Total Time: 11.56 s
------------------------------------------------------------------------------

On-node Compiler Comparison

arm: arm@21.0.0.879
gcc: gcc@10.3.0

Performance comparison of two compilers.

Test Case 1

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Cores arm gcc
8 4.35 s 6.31 s
16 3.58 s 7.81 s
32 3.71 s 11.39 s
64 3.91 s 19.33 s

Test Case 2

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Cores arm gcc
8 35.18 s 41.18 s
16 25.2 s 51.58 s
32 18.27 s 82.87 s
64 11.56 s 131.02 s

Test Case 3

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Cores arm gcc
8 273.77 s 314.23 s
16 199.31 s 418.16 s
32 134.02 s 686.54 s
64 81.45 s 1097.58 s

Test Case 4

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Cores arm gcc
8 535.99 s 789.04 s
16 520.7 s 1048.72 s
32 375.39 s 1906.81 s
64 269.4 s 2499.68 s

Serial Hot-spot Profile

List of top-10 functions / code locations from a serial profile.

Test Case 1

Profiling script:
ReFrame Benchmark
st1

Test Case 2

Profiling script:
ReFrame Benchmark
st2

Test Case 3

Profiling script:
ReFrame Benchmark
st3

Test Case 4

Profiling script:
ReFrame Benchmark
st4

Full Node Hot-spot Profile

List of top-10 functions / code locations from a full node profile.

gcc

Test Case 1

Profiling script:
ReFrame Benchmark
w1g

Test Case 2

Profiling script:
ReFrame Benchmark
w2g

Test Case 3

Profiling script:
ReFrame Benchmark
w3g

Test Case 4

HPCToolkit
w4g

arm

Test Case 1

Profiling script:
ReFrame Benchmark
w1a

Test Case 2

Profiling script:
ReFrame Benchmark
w2a

Test Case 3

Profiling script:
ReFrame Benchmark
w3a

Test Case 4

Profiling script:
ReFrame Benchmark
w4a

Strong Scaling Study

On-node scaling study for two compilers:

Test Case 1

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Cores arm gcc
8 4.35 s 6.31 s
16 3.58 s 7.81 s
32 3.71 s 11.39 s
64 3.91 s 19.33 s

Test Case 2

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Cores arm gcc
8 35.18 s 41.18 s
16 25.2 s 51.58 s
32 18.27 s 82.87 s
64 11.56 s 131.02 s

Test Case 3

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Cores arm gcc
8 273.77 s 314.23 s
16 199.31 s 418.16 s
32 134.02 s 686.54 s
64 81.45 s 1097.58 s

Test Case 4

arm script:
ReFrame Benchmark
gcc script:
ReFrame Benchmark

Cores arm gcc
8 535.99 s 789.04 s
16 520.7 s 1048.72 s
32 375.39 s 1906.81 s
64 269.4 s 2499.68 s
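To make the scaling trend easier to read, speedup and parallel efficiency can be derived from the timings. The sketch below uses the Test Case 3 numbers from the table above; taking the 8-core run as the baseline is an assumption of this sketch, as are the function and variable names.

```python
# Test Case 3 wall-clock times in seconds, copied from the on-node table.
CORES = [8, 16, 32, 64]
TIMES = {
    "arm": [273.77, 199.31, 134.02, 81.45],
    "gcc": [314.23, 418.16, 686.54, 1097.58],
}

def strong_scaling(cores, times):
    """Return (cores, speedup, efficiency) tuples relative to the first entry."""
    base_cores, base_time = cores[0], times[0]
    rows = []
    for c, t in zip(cores, times):
        speedup = base_time / t
        # Ideal speedup equals the growth in core count over the baseline.
        efficiency = speedup / (c / base_cores)
        rows.append((c, speedup, efficiency))
    return rows

for compiler, times in TIMES.items():
    for c, s, e in strong_scaling(CORES, times):
        print(f"{compiler} {c:2d} cores: speedup {s:.2f}, efficiency {e:.2f}")
```

A speedup below 1 means the run became slower than the baseline as cores were added, which is what the gcc column shows.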

Off-node scaling study for two compilers:

Test Case 1

arm script:
ReFrame Benchmark 1
ReFrame Benchmark 2
gcc script:
ReFrame Benchmark 1
ReFrame Benchmark 2

Nodes Cores arm gcc
1 32 4.35 s 11.39 s
1 64 3.58 s 19.33 s
2 128 60.24 s 11.34 s
4 256 63.33 s 7.67 s

Test Case 2

arm script:
ReFrame Benchmark 1
ReFrame Benchmark 2
gcc script:
ReFrame Benchmark 1
ReFrame Benchmark 2

Nodes Cores arm gcc
1 32 35.18 s 41.18 s
1 64 25.2 s 51.58 s
2 128 7.67 s 68.77 s
4 256 66.17 s 37.38 s

Test Case 3

arm script:
ReFrame Benchmark 1
ReFrame Benchmark 2
gcc script:
ReFrame Benchmark 1
ReFrame Benchmark 2

Nodes Cores arm gcc
1 32 273.77 s 314.23 s
1 64 199.31 s 418.16 s
2 128 46.77 s 559.38 s
4 256 29.05 s 283.53 s

Test Case 4

arm script:
ReFrame Benchmark 1
ReFrame Benchmark 2
gcc script:
ReFrame Benchmark 1
ReFrame Benchmark 2

Nodes Cores arm gcc
1 32 535.99 s 789.04 s
1 64 520.7 s 1048.72 s
2 128 147.97 s 1879.59 s
4 256 82.66 s 942.39 s

On-Node Architecture Comparison

On-node scaling study for two architectures.
Compiler: gcc

Test Case 1

X86 script:
ReFrame Benchmark

Cores C6gn (ARM) C5n (X86)
8 6.31 s 19.91 s
16 7.81 s 22.06 s
32 11.39 s 30.28 s
64 19.33 s 46.11 s

Test Case 2

X86 script:
ReFrame Benchmark

Cores C6gn (ARM) C5n (X86)
8 41.18 s 72.85 s
16 51.58 s 84.45 s
32 82.87 s 142.46 s
64 131.02 s 252.93 s

Test Case 3

X86 script:
ReFrame Benchmark

Cores C6gn (ARM) C5n (X86)
8 314.23 s 543.84 s
16 418.16 s 640.72 s
32 686.54 s 1087.96 s
64 1097.58 s 2025.36 s

Test Case 4

X86 script:
ReFrame Benchmark

Cores C6gn (ARM) C5n (X86)
8 789.04 s 1305.74 s
16 1048.72 s 1915.15 s
32 1906.81 s 3563.73 s
64 2499.68 s 7220.35 s

Off-Node Architecture Comparison

Off-node scaling study for two architectures. Compiler: gcc

Test Case 1

X86 script:
ReFrame Benchmark
ReFrame Benchmark

Nodes Cores C6gn (ARM) C5n (X86)
1 32 11.39 s 30.28 s
1 64 19.33 s 46.11 s
2 128 11.34 s 33.52 s
4 256 7.67 s 26.4 s

Test Case 2

X86 script:
ReFrame Benchmark
ReFrame Benchmark

Nodes Cores C6gn (ARM) C5n (X86)
1 32 82.87 s 142.46 s
1 64 131.02 s 252.93 s
2 128 68.77 s 141.92 s
4 256 37.38 s 82.26 s

Test Case 3

X86 script:
ReFrame Benchmark
ReFrame Benchmark

Nodes Cores C6gn (ARM) C5n (X86)
1 32 686.54 s 1087.96 s
1 64 1097.58 s 2025.36 s
2 128 559.38 s 1027.75 s
4 256 283.53 s 526.23 s

Test Case 4

X86 script:
ReFrame Benchmark
ReFrame Benchmark

Nodes Cores C6gn (ARM) C5n (X86)
1 32 1906.81 s 3563.73 s
1 64 2499.68 s 7220.35 s
2 128 1879.59 s 3421.94 s
4 256 942.39 s 1732.57 s

Optimisation

Details of steps taken to optimise performance of the application. Please document work with compiler flags, maths libraries, system libraries, code optimisations, etc.

Compiler Flag Tuning

Compiler flags before:

CFLAGS= -O3 -fopenmp  -lm -Daarch64 -D__MPI -D__COLLABORATIVE_THREADING=6 -D__TEST_MG_CONVERGENCE -D__PRINT_NORM -D__USE_BICGSTAB 
FFLAGS=

Compiler flags after:

CFLAGS= -O3 -Ofast -fopenmp -lm -Daarch64 -D__MPI -D__TEST_MG_CONVERGENCE -D__PREFETCH_NEXT_PLANE_FROM_DRAM -D__FUSION_RESIDUAL_RESTRICTION -D__PRINT_NORM -D__USE_BICGSTAB 
FFLAGS=

Compiler Flag Performance

The following table is based on gcc@10.3.0 with Test Case 3. The speedup reaches 14.67x at 64 threads.

Cores Original Flags (s) New Flags (s) Speedup
1 1060.55 1030.97 1.03
2 643.84 619.43 1.04
4 401.97 371.58 1.08
8 314.23 270.56 1.16
16 418.16 146.35 2.86
32 686.54 105.15 6.53
64 1097.58 74.81 14.67

Performance Regression

The following table is based on gcc@10.3.0 with Test Case 3. The speedup reaches 23.84x at 64 threads, which corresponds to a 95.81% reduction in runtime.

Cores Original (s) Optimized (s) Speedup
1 1060.55 720.42 1.47
2 643.84 349.17 1.84
4 401.97 233.63 1.72
8 314.23 144.44 2.18
16 418.16 91.6 4.57
32 686.54 64.02 10.72
64 1097.58 46.04 23.84
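The speedup and runtime-reduction figures follow directly from the two timing columns. A minimal sketch of the arithmetic, using the 64-thread row of the table above (the helper names are illustrative only):

```python
def speedup(original_s, optimized_s):
    """Ratio of the original runtime to the optimized runtime."""
    return original_s / optimized_s

def runtime_reduction_pct(original_s, optimized_s):
    """Share of the original runtime that was removed, in percent."""
    return (1.0 - optimized_s / original_s) * 100.0

# 64-thread Test Case 3 times in seconds from the table above.
original, optimized = 1097.58, 46.04
print(f"speedup: {speedup(original, optimized):.2f}x")
print(f"runtime reduction: {runtime_reduction_pct(original, optimized):.2f}%")
```

The two quantities are related by reduction = 1 - 1/speedup, so either one can be recovered from the other.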

Code modifications

From 04ff2d513531aaf60871d306d19d4f0a326aa8a0 Mon Sep 17 00:00:00 2001
From: dolanzhao <[email protected]>
Date: Fri, 16 Jul 2021 17:11:38 +0000
Subject: [PATCH 5/5] support simd

---
 operators.avx.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/operators.avx.c b/operators.avx.c
index fee85d4..22ab6fd 100755
--- a/operators.avx.c
+++ b/operators.avx.c
@@ -8,7 +8,12 @@
 #include <stdint.h>
 #include <string.h>
 #include <math.h>
-#include <immintrin.h>
+// #include <immintrin.h>
+#define SIMDE_ENABLE_NATIVE_ALIASES
+#define SIMDE_X86_SSE_ENABLE_NATIVE_ALIASES
+#define _MM_HINT_T0 1
+#define _MM_HINT_T1 2
+#include "simde/x86/avx2.h"
 //------------------------------------------------------------------------------------------------------------------------------
 #include "timer.h"
 #include "defines.h"
-- 
2.32.0

Compiler flags

-Ofast -D__PREFETCH_NEXT_PLANE_FROM_DRAM -D__FUSION_RESIDUAL_RESTRICTION

Spack recipe and build line

We have attached the spack package.py.

We use the following command to install our optimized version.

spack install -j 64 minigmg@local+simde+opt

ReFrame script

We have attached the ReFrame script for the optimized version.

Report

Compilation Summary

Porting miniGMG to ARM requires some adaptation of the code. We changed the rdtsc() function in timer.x86.c from x86 assembly to AArch64 assembly so that the code compiles. Furthermore, our build enables both OpenMP and MPI. We found issues in both Spack and the nvhpc compiler. Spack provides no dedicated build option for the Arm compiler on graviton2, only a general aarch64 option; we added an Arm compiler section for graviton2 and submitted it to archspec-json, where our PR has been merged. With nvhpc, we found that it produces wrong code for OpenMP programs running with more than 1 thread, while it works correctly for sequential and MPI (with no threads) programs. Given the short time frame, we didn’t pinpoint the exact bug in nvhpc code generation, but this motivates investigating its OpenMP implementation and checking whether it exactly follows the OpenMP standard.

Performance Summary

We observed some interesting performance behaviour in miniGMG. With 4 cores or fewer, ARM21 and GCC have comparable performance. However, ARM21 performs much better than GCC when the execution scales to 8 or more cores. Moreover, the ARM server (C6gn) performs much better than the Intel server (C5n).

Optimisation Summary

We tested multiple optimization strategies, including:

  • (1) using SIMD instructions on aarch64 (Neon)

  • (2) using -Ofast (especially using fast math library)

  • (3) enabling algorithm optimization (residual operator fusion)

  • (4) using active waiting policy for OpenMP threads.
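For item (4), one common way to request active waiting is the standard OpenMP environment variable OMP_WAIT_POLICY. The sketch below shows how a launcher might set it; whether the attached scripts use this variable or a compiler-specific setting is an assumption, and the command executed here is only a stand-in, not the benchmark launch line.

```python
import os
import subprocess
import sys

def openmp_env(num_threads, wait_policy="active"):
    """Copy the current environment and add OpenMP runtime settings."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)
    # Standard OpenMP variable; accepted values are "active" and "passive".
    env["OMP_WAIT_POLICY"] = wait_policy
    return env

def run(cmd, num_threads):
    """Run cmd with the OpenMP settings applied and return its stdout."""
    result = subprocess.run(cmd, env=openmp_env(num_threads),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Stand-in command that only echoes the variable; replace it with the
# actual benchmark launch line, which is not shown here.
print(run([sys.executable, "-c",
           "import os; print(os.environ['OMP_WAIT_POLICY'])"], 64))
```

With an active policy, idle threads spin instead of sleeping, which tends to help codes with many short parallel regions at the cost of keeping cores busy.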