Closed
Conversation
51c64a4 to
9806445
Compare
12ad48c to
b3dec4a
Compare
…de bloat
The goal of this PR is to reduce some of the code bloat induced by fast-ZPP.
Reduced code bloat results in fewer cache misses (and better DSB coverage),
and fewer instructions executed.
If we take a look at a simple function:
```c
PHP_FUNCTION(twice) {
zend_long foo;
ZEND_PARSE_PARAMETERS_START(1, 1)
Z_PARAM_LONG(foo)
ZEND_PARSE_PARAMETERS_END();
RETURN_LONG(foo * 2);
}
```
We obtain the following assembly on x86-64 in the non-cold blocks:
```s
<+0>: push %r12
<+2>: push %rbp
<+3>: push %rbx
<+4>: sub $0x10,%rsp
<+8>: mov %fs:0x28,%r12
<+17>: mov %r12,0x8(%rsp)
<+22>: mov 0x2c(%rdi),%r12d
<+26>: cmp $0x1,%r12d
<+30>: jne 0x22beaf <zif_twice-3212257>
<+36>: cmpb $0x4,0x58(%rdi)
<+40>: mov %rsi,%rbp
<+43>: jne 0x53c2f0 <zif_twice+96>
<+45>: mov 0x50(%rdi),%rax
<+49>: add %rax,%rax
<+52>: movl $0x4,0x8(%rbp)
<+59>: mov %rax,0x0(%rbp)
<+63>: mov 0x8(%rsp),%rax
<+68>: sub %fs:0x28,%rax
<+77>: jne 0x53c312 <zif_twice+130>
<+79>: add $0x10,%rsp
<+83>: pop %rbx
<+84>: pop %rbp
<+85>: pop %r12
<+87>: ret
<+88>: nopl 0x0(%rax,%rax,1)
<+96>: lea 0x50(%rdi),%rbx
<+100>: mov %rsp,%rsi
<+103>: mov $0x1,%edx
<+108>: mov %rbx,%rdi
<+111>: call 0x620240 <zend_parse_arg_long_slow>
<+116>: test %al,%al
<+118>: je 0x22be96 <zif_twice.cold>
<+124>: mov (%rsp),%rax
<+128>: jmp 0x53c2c1 <zif_twice+49>
<+130>: call 0x201050 <__stack_chk_fail@plt>
```
Notice how we get the stack protector overhead in this function
and also have to reload the parsed value on the slow path.
This happens because the parsed value is returned via a pointer.
If instead we were to return struct with a value pair
(similar to optional in C++ / Option in Rust), then the values are returned
via registers. This means that we no longer have stack protector overhead
and we also don't need to reload a value, resulting in better register usage.
This is the resulting assembly for the sample function after this patch:
```s
<+0>: push %r12
<+2>: push %rbp
<+3>: push %rbx
<+4>: mov 0x2c(%rdi),%r12d
<+8>: cmp $0x1,%r12d
<+12>: jne 0x22d482 <zif_twice-3205454>
<+18>: cmpb $0x4,0x58(%rdi)
<+22>: mov %rsi,%rbp
<+25>: jne 0x53be08 <zif_twice+56>
<+27>: mov 0x50(%rdi),%rax
<+31>: add %rax,%rax
<+34>: movl $0x4,0x8(%rbp)
<+41>: mov %rax,0x0(%rbp)
<+45>: pop %rbx
<+46>: pop %rbp
<+47>: pop %r12
<+49>: ret
<+50>: nopw 0x0(%rax,%rax,1)
<+56>: lea 0x50(%rdi),%rbx
<+60>: mov $0x1,%esi
<+65>: mov %rbx,%rdi
<+68>: call 0x61e7b0 <zend_parse_arg_long_slow>
<+73>: test %dl,%dl
<+75>: je 0x22d46a <zif_twice.cold>
<+81>: jmp 0x53bdef <zif_twice+31>
```
The following uses the default benchmark programs we use in CI.
Each program is ran on php-cgi with the appropriate `-T` argument, then repeated 15 times.
It shows a small performance improvement on Symfony both with and without JIT, and a small improvement
on WordPress with JIT.
For WordPress, the difference is small as my CPU is bottlenecked on some other stuff as well.
| Test | Old Mean | Old Stddev | New Mean | New Stddev |
|---------------------------------|----------|------------|----------|------------|
| Symfony, no JIT (-T10,50) | 0.5324 | 0.0050 | 0.5272 | 0.0042 |
| Symfony, tracing JIT (-T10,50) | 0.5301 | 0.0029 | 0.5264 | 0.0036 |
| WordPress, no JIT (-T5,25) | 0.7408 | 0.0049 | 0.7404 | 0.0048 |
| WordPress, tracing JIT (-T5,25) | 0.6814 | 0.0052 | 0.6770 | 0.0055 |
I was not able to measure any meaningful difference for our micro benchmarks
`Zend/bench.php` and `Zend/micro_bench.php`.
The Valgrind instruction counts also show a decrease:
-0.19% on Symfony without JIT, and -0.14% on WordPress without JIT (see CI).
Owner
Author
|
Now open on php-src |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.