Slightly improve performance of zend_copy_extra_args #18146

nielsdos · 2025-03-25T23:21:44Z

This patch aims to improve the performance when a callback needs zend_copy_extra_args. This turns out to be common with some array functions like array_walk and the array_find family of functions. In these cases, the callback is often a short function that takes only a single argument, so zend_copy_extra_args takes measurable time in the profile. VTune reveals that my system stalls on memory loads for op_array and the argument count. By passing op_array as an argument, we eliminate the load for op_array, and thanks to GCC's inter-procedural analysis the callee can also reuse the already-loaded argument counts.
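To make the idea concrete, here is a minimal sketch in C. It is not the engine code: `sketch_op_array`, `sketch_frame` and both functions are hypothetical, heavily simplified stand-ins, and the helpers only count the extra arguments rather than move them. The point is only the shape of the change: the first variant re-derives `op_array` from the frame, the second receives the pointer the caller already holds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an op_array: only the declared
 * parameter count is modelled. */
typedef struct {
    unsigned int num_args;
} sketch_op_array;

/* Hypothetical, simplified stand-in for a call frame. */
typedef struct {
    const sketch_op_array *func; /* the callee's op_array */
    unsigned int passed_args;    /* stand-in for EX_NUM_ARGS() */
} sketch_frame;

/* "Before" shape: the helper loads op_array from the frame, then loads the
 * declared argument count through it (two dependent memory loads). */
unsigned int extra_args_before(const sketch_frame *frame)
{
    assert(frame != NULL && frame->func != NULL);
    const sketch_op_array *op_array = frame->func;
    unsigned int declared = op_array->num_args;
    return frame->passed_args > declared ? frame->passed_args - declared : 0;
}

/* "After" shape: the caller passes the op_array it already has, so the
 * helper no longer needs to fetch it from the frame. */
unsigned int extra_args_after(const sketch_frame *frame,
                              const sketch_op_array *op_array)
{
    assert(frame != NULL && op_array != NULL);
    unsigned int declared = op_array->num_args;
    return frame->passed_args > declared ? frame->passed_args - declared : 0;
}
```

For a callback declared with one parameter that is invoked with two arguments (as array_find does by passing the value and the key), both variants report one extra argument; only the number of loads needed to get there differs.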

The following synthetic benchmark (courtesy of Tim) runs about 11% faster with this patch:

```php
$array = range(1, 10000);
$result = 0;
for ($i = 0; $i < 5000; $i++) {
    $result += array_find($array, static function ($item) {
        return $item === 5000;
    });
}
var_dump($result);
```

Hyperfine stats (on an i7-1185G7) for this benchmark:

```
Benchmark 1: ./sapi/cli/php x.php
  Time (mean ± σ):     528.5 ms ±   4.8 ms    [User: 524.8 ms, System: 3.4 ms]
  Range (min … max):   521.0 ms … 534.4 ms    10 runs

Benchmark 2: ./sapi/cli/php_old x.php
  Time (mean ± σ):     586.2 ms ±   5.3 ms    [User: 581.8 ms, System: 4.0 ms]
  Range (min … max):   578.9 ms … 592.6 ms    10 runs

Summary
  ./sapi/cli/php x.php ran
    1.11 ± 0.01 times faster than ./sapi/cli/php_old x.php
```
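As a quick check on how the "1.11 times faster" figure relates to the mean times, the arithmetic can be sketched in C. The function names are hypothetical helpers for illustration; the inputs are the two means reported above.

```c
#include <assert.h>

/* Ratio of the old mean run time to the new one; this is the
 * "N times faster" style of figure. */
double speedup_ratio(double old_ms, double new_ms)
{
    assert(new_ms > 0.0);
    return old_ms / new_ms;
}

/* Share of the old run time that is saved, in percent. */
double time_saved_percent(double old_ms, double new_ms)
{
    assert(old_ms > 0.0);
    return (old_ms - new_ms) / old_ms * 100.0;
}
```

With the means above, `speedup_ratio(586.2, 528.5)` is about 1.109, which rounds to the reported 1.11, while `time_saved_percent(586.2, 528.5)` is about 9.8. So "11% faster" and "roughly 10% less run time" describe the same measurement.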

On an Intel i7-4790 I get a performance improvement of about 5% ± 1%. Ilija measured an improvement of around 7% ± 2% on his Intel i7-12800H. On neither of my systems did I measure a noticeable difference in bench.php, micro_bench.php or Symfony demo, so we do not see a regression for these other benchmarks.

For reference, this is the resulting hyperfine benchmark on the i7-1185G7 for Symfony demo:

```
Benchmark 1: ../php-src/sapi/cli/php_old --repeat 50 public/index.php
  Time (mean ± σ):     742.3 ms ±   4.5 ms    [User: 600.8 ms, System: 139.5 ms]
  Range (min … max):   736.7 ms … 749.8 ms    10 runs

Benchmark 2: ../php-src/sapi/cli/php --repeat 50 public/index.php
  Time (mean ± σ):     738.5 ms ±   3.8 ms    [User: 601.2 ms, System: 135.3 ms]
  Range (min … max):   735.3 ms … 747.3 ms    10 runs
```

To further confirm that no regressions take place, here is the valgrind instruction count for 500 runs of Symfony demo on php-cgi:
Before patch: 393,938,765
After patch: 393,950,974
So there is a measurable increase (12,209 instructions, roughly 0.003%), but it has no effect on the run time, at least on my system.
It is possible to get rid of this increase by applying the NOIPA attribute to zend_copy_extra_args. That makes the attached benchmark only a few percent faster, but keeps the instruction count for Symfony the same.
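For readers unfamiliar with it, the following is a minimal sketch of applying GCC's `noipa` function attribute, which is presumably what "NOIPA attribute" refers to here. The macro `SKETCH_NOIPA` and both functions are hypothetical names for illustration, not the engine's. The attribute is guarded with `__has_attribute` so that the sketch still compiles, without the effect, on compilers that lack it.

```c
#include <assert.h>

/* Hypothetical macro: expands to GCC's noipa attribute when the compiler
 * reports support for it, and to nothing otherwise. */
#if defined(__has_attribute)
# if __has_attribute(noipa)
#  define SKETCH_NOIPA __attribute__((noipa))
# endif
#endif
#ifndef SKETCH_NOIPA
# define SKETCH_NOIPA
#endif

/* With noipa, the compiler treats this function as opaque when optimizing
 * its callers (no inlining, cloning or cross-function value propagation),
 * so a caller cannot be reshaped to keep values live for the callee. */
SKETCH_NOIPA unsigned int sketch_count_extra_args(unsigned int passed,
                                                  unsigned int declared)
{
    return passed > declared ? passed - declared : 0;
}

/* Hypothetical caller: the common path (no extra args) never reaches the
 * helper at all. */
unsigned int sketch_init_call(unsigned int passed, unsigned int declared)
{
    assert(declared <= 1024u); /* arbitrary sanity bound for the sketch */
    if (passed > declared) {
        return sketch_count_extra_args(passed, declared);
    }
    return 0;
}
```

The trade-off matches the one described above: keeping the helper opaque leaves the caller's common path untouched, at the cost of giving up most of the gain that came from inter-procedural analysis.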

Looking at the effect on the assembly of zend_init_func_execute_data, we see one small change on the regular path of execution (besides instruction reordering), resulting in an extra instruction.
This is around the code that compares the argument count with EX_NUM_ARGS().

Before patch:

```
je     zend_init_func_execute_data+297
mov    0x68(%rbx),%rax
mov    0x2c(%r14),%r9d
movq   $0x0,0x8(%r14)
mov    %r13,0x10(%r14)
mov    0x4(%rbx),%edx
mov    %rax,%r15
cmp    %r9d,0x20(%rbx)
jb     zend_init_func_execute_data+320
```

After patch:

```
je     zend_init_func_execute_data+297
mov    0x68(%rbx),%rax
mov    0x20(%rbx),%esi
mov    %r13,0x10(%r14)
mov    0x2c(%r14),%r9d
mov    0x4(%rbx),%edi
movq   $0x0,0x8(%r14)
mov    %rax,%r15
cmp    %r9d,%esi
jb     zend_init_func_execute_data+320
```

Where previously 0x20(%rbx) was compared directly with %r9d, it is now first loaded into the register %esi so that it can be reused without reloading in zend_copy_extra_args (a result of inter-procedural analysis). The number of memory loads is unchanged; the comparison now simply goes through an extra mov.

There are some changes to the code that calls zend_copy_extra_args, where some memory loads now happen prior to the call.
The memory loads at the start of zend_copy_extra_args, however, have been eliminated.


dstogov · 2025-03-26T07:37:08Z

I'm not sure if the improvement comes from a single eliminated load from L1.
It may be the effect of different code alignment. The presentation that I sent you yesterday talks about up to a 30% difference on short loops caused by bad/good alignment.

In my tests I see a ~2% improvement on your test above, and a slight degradation on bench.php (both real-time and callgrind). This is because i_init_func_execute_data() was optimized for the case without extra args (which is used more often).

I'm indifferent to this.

nielsdos · 2025-04-02T19:08:31Z

Given that, and that there are ideas to implement some functions in PHP itself (which would also avoid this overhead when JITted for example), I'll leave this for now.

github-actions bot added the Category: Engine label Mar 25, 2025

nielsdos closed this Apr 2, 2025
