Slightly improve performance of zend_copy_extra_args #18146

Draft: wants to merge 1 commit into base: master
@nielsdos (Member) commented Mar 25, 2025

This patch aims to improve performance when a callback needs `zend_copy_extra_args`. This turns out to be common with array functions such as `array_walk` and the `array_find` family. In these cases the callback is often a short function that takes only a single argument, while the array function passes more than that (for example both the value and the key), so `zend_copy_extra_args` takes measurable time in the profile. VTune shows that my system stalls on the memory loads for `op_array` and the argument count. By passing `op_array` as an argument, we eliminate the load of `op_array`, and thanks to GCC's inter-procedural analysis the callee can also reuse the already-loaded argument counts.
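To make the idea concrete, here is a minimal sketch in C. It does not reproduce the php-src code: the `toy_*` structs and functions are hypothetical stand-ins that only model the difference between a helper that reloads the callee from the frame and one that receives the pointer its caller already holds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the engine's structures. */
typedef struct {
    uint32_t num_args;            /* parameters the callee declares */
} toy_op_array;

typedef struct {
    const toy_op_array *func;     /* callee; reading it costs a memory load */
    uint32_t num_passed;          /* arguments actually passed */
} toy_frame;

/* "Before" shape: the helper re-derives op_array from the frame. */
uint32_t toy_extra_args_reload(const toy_frame *frame)
{
    const toy_op_array *op_array = frame->func;     /* first load */
    assert(frame->num_passed > op_array->num_args);
    /* reading num_args is a second load that depends on the first */
    return frame->num_passed - op_array->num_args;
}

/* "After" shape: the caller hands over the pointer it already holds. */
uint32_t toy_extra_args_passed(const toy_op_array *op_array,
                               const toy_frame *frame)
{
    assert(frame->num_passed > op_array->num_args);
    return frame->num_passed - op_array->num_args;
}

/* Caller: it already works with op_array when it compares the counts,
 * so passing it on avoids the reload inside the helper. */
uint32_t toy_init_frame(const toy_op_array *op_array, toy_frame *frame,
                        uint32_t num_passed)
{
    frame->func = op_array;
    frame->num_passed = num_passed;
    if (num_passed > op_array->num_args) {
        return toy_extra_args_passed(op_array, frame);
    }
    return 0; /* common path: no extra arguments */
}
```

Both helper variants compute the same result; only where the pointer comes from differs. Whether the compiler additionally reuses the already-loaded counts depends on its inter-procedural analysis and is not something this sketch can show.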

The following synthetic benchmark (courtesy of Tim) runs about 11% faster:

```php
$array = range(1, 10000);
$result = 0;
for ($i = 0; $i < 5000; $i++) {
    $result += array_find($array, static function ($item) {
        return $item === 5000;
    });
}
var_dump($result);
```

Hyperfine stats (on an i7-1185G7) for this benchmark:

```
Benchmark 1: ./sapi/cli/php x.php
  Time (mean ± σ):     528.5 ms ±   4.8 ms    [User: 524.8 ms, System: 3.4 ms]
  Range (min … max):   521.0 ms … 534.4 ms    10 runs

Benchmark 2: ./sapi/cli/php_old x.php
  Time (mean ± σ):     586.2 ms ±   5.3 ms    [User: 581.8 ms, System: 4.0 ms]
  Range (min … max):   578.9 ms … 592.6 ms    10 runs

Summary
  ./sapi/cli/php x.php ran
    1.11 ± 0.01 times faster than ./sapi/cli/php_old x.php
```
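
For readers comparing the percentages in this thread, the following small sketch shows two common ways to turn a pair of mean run times into a single number. The function names are hypothetical and only serve the illustration.

```c
#include <assert.h>

/* Speedup factor: how many times faster the new build runs. */
double toy_speedup(double old_ms, double new_ms)
{
    assert(old_ms > 0.0 && new_ms > 0.0);
    return old_ms / new_ms;
}

/* Fraction of the old run time that the new build saves. */
double toy_time_saved(double old_ms, double new_ms)
{
    assert(old_ms > 0.0 && new_ms > 0.0);
    return (old_ms - new_ms) / old_ms;
}
```

With the means reported above (586.2 ms old, 528.5 ms new), the first function gives roughly 1.11, matching the hyperfine summary, while the second gives roughly 0.098, so the same result can be read as "about 11% faster" or "about 10% less run time".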

On an Intel i7-4790 I get a performance improvement of about 5% ± 1%. Ilija measured an improvement of around 7% ± 2% on his Intel i7-12800H. On neither of my systems did I measure a noticeable difference in bench.php, micro_bench.php, or the Symfony demo, so these other benchmarks show no regression.

For reference, this is the resulting hyperfine benchmark on the i7-1185G7 for Symfony demo:

```
Benchmark 1: ../php-src/sapi/cli/php_old --repeat 50 public/index.php
  Time (mean ± σ):     742.3 ms ±   4.5 ms    [User: 600.8 ms, System: 139.5 ms]
  Range (min … max):   736.7 ms … 749.8 ms    10 runs

Benchmark 2: ../php-src/sapi/cli/php --repeat 50 public/index.php
  Time (mean ± σ):     738.5 ms ±   3.8 ms    [User: 601.2 ms, System: 135.3 ms]
  Range (min … max):   735.3 ms … 747.3 ms    10 runs
```

To further confirm that no regressions take place, I measured the valgrind instruction count for 500 runs of the Symfony demo on php-cgi:
Before patch: 393,938,765
After patch: 393,950,974
So there is a measurable increase (12,209 instructions, about 0.003%), but it has no effect on the run time, at least on my system.
It is possible to get rid of this increase by applying the NOIPA attribute to `zend_copy_extra_args`. That makes the benchmark above only a few percent faster, but keeps the instruction count for Symfony the same.
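
As an illustration of what applying such an attribute looks like, here is a minimal sketch using GCC's `noipa` function attribute. The `TOY_NOIPA` macro and the `toy_*` functions are hypothetical names for this example and are not taken from php-src; the version guard assumes `noipa` is available from GCC 8 onward and expands to nothing elsewhere.

```c
#include <assert.h>

/* Hypothetical macro for this sketch. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
# define TOY_NOIPA __attribute__((noipa))
#else
# define TOY_NOIPA /* attribute unavailable: expands to nothing */
#endif

/* With noipa, GCC optimizes callers as if this body were not visible,
 * so it cannot inline, clone, or specialize the function per call site. */
TOY_NOIPA unsigned toy_count_extra(unsigned declared, unsigned passed)
{
    return passed > declared ? passed - declared : 0;
}

unsigned toy_caller(unsigned declared, unsigned passed)
{
    if (passed <= declared) {
        return 0; /* common path stays free of helper-specific setup */
    }
    return toy_count_extra(declared, passed);
}
```

The trade-off matches the one described above: the attribute keeps the caller's common path unaffected by the helper, at the price of giving up the cross-function reuse of already-loaded values.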


Looking at the effect on the assembly of zend_init_func_execute_data, we see one small change on the regular path of execution (besides instruction reordering), resulting in an extra instruction.
This is around the code that compares the argument count with EX_NUM_ARGS().

Before patch:

```
je     zend_init_func_execute_data+297
mov    0x68(%rbx),%rax
mov    0x2c(%r14),%r9d
movq   $0x0,0x8(%r14)
mov    %r13,0x10(%r14)
mov    0x4(%rbx),%edx
mov    %rax,%r15
cmp    %r9d,0x20(%rbx)
jb     zend_init_func_execute_data+320
```

After patch:

```
je     zend_init_func_execute_data+297
mov    0x68(%rbx),%rax
mov    0x20(%rbx),%esi
mov    %r13,0x10(%r14)
mov    0x2c(%r14),%r9d
mov    0x4(%rbx),%edi
movq   $0x0,0x8(%r14)
mov    %rax,%r15
cmp    %r9d,%esi
jb     zend_init_func_execute_data+320
```

Where previously 0x20(%rbx) was compared directly with %r9d, it is now loaded into the register %esi so that it can be reused in zend_copy_extra_args without reloading (a result of inter-procedural analysis). The number of memory loads is still the same; the load just happens via an extra move.

There are some changes to the code that calls zend_copy_extra_args, where some memory loads now happen prior to the call.
The memory loads at the start of zend_copy_extra_args have been eliminated, however.

To further confirm that no regressions take place, here is the valgrind instruction count for 50 runs of the Symfony demo:
Before patch: 4,452,217,516
After patch:  4,452,205,233
The difference (12,283 instructions) is just noise.


@dstogov (Member) commented Mar 26, 2025

I'm not sure the improvement comes from a single eliminated L1 load.
It may be an effect of different code alignment. The presentation I sent you yesterday talks about differences of up to 30% on short loops caused by good or bad alignment.

In my tests I see a ~2% improvement on your test above and a slight degradation on bench.php (both in real time and in callgrind). This is because i_init_func_execute_data() was optimized for the case without extra args, which is used more often.

I'm indifferent to this.
