Don't observe performance improvement for built-in tests with propeller

Hi,
I'm not able to observe any performance benefit from the Propeller toolchain for the included test program (main.cc, callee.cc). I followed the steps given in Propeller_RFC.pdf.

High-level observations:
1) Elapsed time doesn't show any improvement.
2) Cycles, instructions, and branch mispredicts are almost the same.
3) Overall cache-misses are lower, but L1-icache-load-misses are similar.

$ time ./a.out.orig.labels 1000000000 2 >& /dev/null
real    0m21.094s
user    0m20.489s
sys     0m0.604s

$ time ./a.out.labels 1000000000 2 >& /dev/null
real    0m20.357s
user    0m19.908s
sys     0m0.448s

Elapsed time varies by 1 to 5% between runs.

# Perf data
$ perf stat -e cycles,instructions,cache-misses,L1-icache-load-misses,br_misp_retired.all_branches,br_inst_retired.all_branches,icache_64b.iftag_stall  ./a.out.orig.labels 1000000000 1> /dev/null

 Performance counter stats for './a.out.orig.labels 1000000000':

    80,231,347,233      cycles                                                        (66.67%)
   243,314,361,618      instructions              #    3.03  insn per cycle           (83.33%)
            22,522      cache-misses                                                  (83.33%)
         2,644,077      L1-icache-load-misses                                         (83.33%)
        20,400,061      br_misp_retired.all_branches                                     (83.33%)
    53,442,616,374      br_inst_retired.all_branches                                     (83.34%)
          68,554,744      icache_64b.iftag_stall                                        (57.14%)

      21.191516400 seconds time elapsed

# Optimized binary
$ perf stat -e cycles,instructions,cache-misses,L1-icache-load-misses,br_misp_retired.all_branches,br_inst_retired.all_branches,icache_64b.iftag_stall  ./a.out.labels 1000000000 1> /dev/null

 Performance counter stats for './a.out.labels 1000000000':

    81,446,698,907      cycles                                                        (66.66%)
   243,218,220,681      instructions              #    2.99  insn per cycle           (83.33%)
            14,907      cache-misses                                                  (83.34%)
         2,533,002      L1-icache-load-misses                                         (83.34%)
        20,571,010      br_misp_retired.all_branches                                     (83.34%)
    53,455,580,211      br_inst_retired.all_branches                                     (83.33%)
          68,847,492      icache_64b.iftag_stall                                        (57.14%)

      21.512644234 seconds time elapsed
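
To make the two runs easier to compare, the relative change per counter can be computed with a small script like the one below. This is only a sketch: the values are copied by hand from the two perf stat outputs above, and the names `baseline`, `optimized`, and `pct_change` are made up for illustration.

```python
# Counter values copied from the two `perf stat` outputs above.
# baseline  = ./a.out.orig.labels, optimized = ./a.out.labels
baseline = {
    "cycles": 80_231_347_233,
    "instructions": 243_314_361_618,
    "cache-misses": 22_522,
    "L1-icache-load-misses": 2_644_077,
    "br_misp_retired.all_branches": 20_400_061,
    "br_inst_retired.all_branches": 53_442_616_374,
    "icache_64b.iftag_stall": 68_554_744,
}
optimized = {
    "cycles": 81_446_698_907,
    "instructions": 243_218_220_681,
    "cache-misses": 14_907,
    "L1-icache-load-misses": 2_533_002,
    "br_misp_retired.all_branches": 20_571_010,
    "br_inst_retired.all_branches": 53_455_580_211,
    "icache_64b.iftag_stall": 68_847_492,
}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Print one line per event: negative means the optimized binary is lower.
for event, before in baseline.items():
    print(f"{event:32s} {pct_change(before, optimized[event]):+8.2f}%")
```

With these inputs, cycles differ by roughly 1.5% and L1-icache-load-misses by roughly 4%. cache-misses drop by about a third, but on a very small absolute count.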

The referenced paper doesn't mention the benefit for the included test program. What is the expected improvement for the included test?

Please see more details (build, runtime steps, etc.) in the following gist.
https://gist.github.com/uttampawar/5407f998bc3f02f58c4b83b0b4dc20fe

Any hints are appreciated.


